Clustering methods

Scikit-learn, SciPy and RAPIDS cuML backends

Clustergram offers three backends for the computation: scikit-learn and SciPy, which run on the CPU, and RAPIDS.AI cuML, which runs on the GPU. All three are optional dependencies, but you need at least one of them installed to generate a clustergram.
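
Because every backend is optional, it can be useful to check which ones are importable before constructing a Clustergram. The sketch below is illustrative only: available_backends is a hypothetical helper that is not part of clustergram, and the mapping from backend names to importable modules (sklearn, scipy, cuml) is an assumption.

```python
import importlib.util

# Backend name (as passed to Clustergram) -> module that must be importable.
# This mapping is an assumption for illustration, not taken from clustergram itself.
BACKEND_MODULES = {"sklearn": "sklearn", "scipy": "scipy", "cuML": "cuml"}


def available_backends():
    """Return the backend names whose underlying module can be found."""
    return [
        name
        for name, module in BACKEND_MODULES.items()
        if importlib.util.find_spec(module) is not None
    ]


print(available_backends())
```

The check only looks for the module on the import path; it does not verify that a GPU is actually usable.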

Let’s load the Palmer penguins dataset. See the Introduction for an overview.

import seaborn
from sklearn.preprocessing import scale
from clustergram import Clustergram

df = seaborn.load_dataset('penguins')
data = scale(df.drop(columns=['species', 'island', 'sex']).dropna())

seaborn.set(style='whitegrid')
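
With its default arguments, scale from sklearn.preprocessing standardizes each column to zero mean and unit variance. A minimal NumPy sketch of the same computation follows; the input values are made up for illustration and are not the penguin measurements.

```python
import numpy

# Small made-up array standing in for the numeric penguin columns.
values = numpy.array([[39.1, 18.7], [46.5, 17.9], [50.0, 15.2], [36.7, 19.3]])

# Subtract the column mean and divide by the column standard deviation
# (population standard deviation, i.e. ddof=0).
standardized = (values - values.mean(axis=0)) / values.std(axis=0)

print(standardized.mean(axis=0))  # close to 0 for every column
print(standardized.std(axis=0))   # close to 1 for every column
```

Standardizing matters here because the clustering methods below are distance-based, so columns measured on larger scales would otherwise dominate.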

Using scikit-learn (default):

cgram = Clustergram(range(1, 8), backend="sklearn")
cgram.fit(data)

K=1 skipped. Mean computed from data directly.
K=2 fitted in 0.01631617546081543 seconds.
K=3 fitted in 0.023647308349609375 seconds.
K=4 fitted in 0.026671171188354492 seconds.
K=5 fitted in 0.036658525466918945 seconds.
K=6 fitted in 0.038860321044921875 seconds.
K=7 fitted in 0.04606366157531738 seconds.

Using cuML:

cgram = Clustergram(range(1, 8), backend='cuML')
cgram.fit(data)

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
~/checkouts/readthedocs.org/user_builds/clustergram/conda/v0.5.0/lib/python3.9/site-packages/clustergram/clustergram.py in __init__(self, k_range, backend, method, verbose, **kwargs)
    161             try:
--> 162                 import cudf
    163             except (ImportError, ModuleNotFoundError):

ModuleNotFoundError: No module named 'cudf'

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
<ipython-input-3-1fbd92a31667> in <module>
----> 1 cgram = Clustergram(range(1, 8), backend='cuML')
      2 cgram.fit(data)

~/checkouts/readthedocs.org/user_builds/clustergram/conda/v0.5.0/lib/python3.9/site-packages/clustergram/clustergram.py in __init__(self, k_range, backend, method, verbose, **kwargs)
    162                 import cudf
    163             except (ImportError, ModuleNotFoundError):
--> 164                 raise ImportError(
    165                     "cuML, cuDF and cupy packages are required to use `cuML` backend."
    166                 )

ImportError: cuML, cuDF and cupy packages are required to use `cuML` backend.

The data can be of any type supported by the selected backend (including cudf.DataFrame or cupy.ndarray with the cuML backend).

Supported methods

Clustergram currently supports K-Means, Mini Batch K-Means, Gaussian Mixture Model (GMM) and SciPy’s hierarchical clustering methods. Note that GMM and Mini Batch K-Means are supported only by the scikit-learn backend, and hierarchical methods only by the scipy backend.
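
The method and backend combinations described above can be summarized as a small lookup table. This is a sketch rather than clustergram code: check_combination is a hypothetical helper, and listing kmeans under both the sklearn and cuML backends is an assumption inferred from the examples on this page.

```python
# Method name -> backends that support it, as described in the text above.
# Listing "kmeans" under both "sklearn" and "cuML" is an assumption.
SUPPORTED = {
    "kmeans": {"sklearn", "cuML"},
    "minibatchkmeans": {"sklearn"},
    "gmm": {"sklearn"},
    "hierarchical": {"scipy"},
}


def check_combination(method, backend):
    """Return True if the pair is supported, otherwise raise ValueError."""
    if method not in SUPPORTED:
        raise ValueError(f"Unknown method: {method!r}")
    if backend not in SUPPORTED[method]:
        raise ValueError(f"{method!r} is not supported by the {backend!r} backend")
    return True


check_combination("gmm", "sklearn")  # returns True
```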

Using K-Means (default):

cgram = Clustergram(range(1, 8), method='kmeans')
cgram.fit(data)
cgram.plot()

K=1 skipped. Mean computed from data directly.
K=2 fitted in 0.01931285858154297 seconds.
K=3 fitted in 0.024785518646240234 seconds.
K=4 fitted in 0.029566049575805664 seconds.
K=5 fitted in 0.03727555274963379 seconds.
K=6 fitted in 0.04319429397583008 seconds.
K=7 fitted in 0.04481148719787598 seconds.

<AxesSubplot:xlabel='Number of clusters (k)', ylabel='PCA weighted mean of the clusters'>

Using Mini Batch K-Means, which can provide a significant speedup over K-Means on larger datasets:

cgram = Clustergram(range(1, 8), method='minibatchkmeans', batch_size=100)
cgram.fit(data)
cgram.plot()

K=1 skipped. Mean computed from data directly.
K=2 fitted in 0.040726423263549805 seconds.
K=3 fitted in 0.017051219940185547 seconds.
K=4 fitted in 0.016927480697631836 seconds.
K=5 fitted in 0.019779205322265625 seconds.
K=6 fitted in 0.025287628173828125 seconds.
K=7 fitted in 0.02601790428161621 seconds.

<AxesSubplot:xlabel='Number of clusters (k)', ylabel='PCA weighted mean of the clusters'>
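
On a dataset as small as the penguins sample, the timing difference between K-Means and Mini Batch K-Means is hard to see. The sketch below times both scikit-learn estimators directly on a larger synthetic array. The data, cluster count and n_init value are arbitrary choices for illustration, and actual timings depend on the machine.

```python
import time

import numpy
from sklearn.cluster import KMeans, MiniBatchKMeans

# Synthetic data, larger than the penguins sample, generated for illustration.
rng = numpy.random.default_rng(0)
sample = rng.normal(size=(20_000, 4))

models = [
    ("kmeans", KMeans(n_clusters=5, n_init=3, random_state=0)),
    (
        "minibatchkmeans",
        MiniBatchKMeans(n_clusters=5, batch_size=100, n_init=3, random_state=0),
    ),
]

# Fit each model once and record the wall-clock time in seconds.
timings = {}
for name, model in models:
    start = time.perf_counter()
    model.fit(sample)
    timings[name] = time.perf_counter() - start

print(timings)
```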

Using Gaussian Mixture Model:

cgram = Clustergram(range(1, 8), method='gmm')
cgram.fit(data)
cgram.plot()

K=1 fitted in 0.015541791915893555 seconds.
K=2 fitted in 0.009796380996704102 seconds.
K=3 fitted in 0.03337550163269043 seconds.
K=4 fitted in 0.02602982521057129 seconds.
K=5 fitted in 0.033798933029174805 seconds.
K=6 fitted in 0.12314462661743164 seconds.
K=7 fitted in 0.07405495643615723 seconds.

<AxesSubplot:xlabel='Number of clusters (k)', ylabel='PCA weighted mean of the clusters'>

Using Ward’s hierarchical clustering:

cgram = Clustergram(range(1, 8), method='hierarchical', linkage='ward')
cgram.fit(data)
cgram.plot()

<AxesSubplot:xlabel='Number of clusters (k)', ylabel='PCA weighted mean of the clusters'>
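
With a hierarchical method, labels for every k can be derived from a single linkage tree. The sketch below shows this with SciPy directly, using linkage and fcluster. It runs on synthetic data and is not necessarily how the scipy backend of clustergram is implemented.

```python
import numpy
from scipy.cluster.hierarchy import fcluster, linkage

# Synthetic data used only for illustration.
rng = numpy.random.default_rng(0)
sample = rng.normal(size=(50, 4))

# Build the Ward linkage tree once...
tree = linkage(sample, method="ward")

# ...then cut it into at most k flat clusters for each k in the range.
labels = {k: fcluster(tree, t=k, criterion="maxclust") for k in range(1, 8)}

for k, assigned in labels.items():
    print(k, len(numpy.unique(assigned)))
```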

Manual input

Alternatively, you can create a clustergram from the results of other clustering algorithms using the from_data or from_centers methods.

Using Clustergram.from_data, which creates cluster centers as the mean or median values of the observations assigned to each cluster:

import numpy
import pandas

data = numpy.array([[-1, -1, 0, 10], [1, 1, 10, 2], [0, 0, 20, 4]])
labels = pandas.DataFrame({1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]})

cgram = Clustergram.from_data(data, labels)
cgram.plot()

<AxesSubplot:xlabel='Number of clusters (k)', ylabel='PCA weighted mean of the clusters'>
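
To make the idea of centers computed as mean or median values concrete, the sketch below derives them from the same data and labels with pandas. centers_from_labels is a hypothetical helper written for illustration and is not part of clustergram; the library's own computation may differ in detail.

```python
import numpy
import pandas

data = numpy.array([[-1, -1, 0, 10], [1, 1, 10, 2], [0, 0, 20, 4]])
labels = pandas.DataFrame({1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]})


def centers_from_labels(data, labels, method="mean"):
    """Return {k: array of per-cluster centers} using the mean or median."""
    if method not in ("mean", "median"):
        raise ValueError("method must be 'mean' or 'median'")
    frame = pandas.DataFrame(data)
    centers = {}
    for k in labels.columns:
        # Group the observations by their label for this k and aggregate.
        grouped = frame.groupby(labels[k].to_numpy())
        centers[k] = getattr(grouped, method)().to_numpy()
    return centers


centers = centers_from_labels(data, labels)
# For k=2: first row is the mean of the first two observations,
# second row is the third observation on its own.
print(centers[2])
```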

Using Clustergram.from_centers based on explicit cluster centers:

labels = pandas.DataFrame({1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]})
centers = {
    1: numpy.array([[0, 0, 0]]),
    2: numpy.array([[-1, -1, -1], [1, 1, 1]]),
    3: numpy.array([[-1, -1, -1], [1, 1, 1], [0, 0, 0]]),
}
cgram = Clustergram.from_centers(centers, labels)
cgram.plot(pca_weighted=False)

<AxesSubplot:xlabel='Number of clusters (k)', ylabel='Mean of the clusters'>

To support PCA-weighted plots with a clustergram created from centers, you also need to pass the data:

labels = pandas.DataFrame({1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]})
centers = {
    1: numpy.array([[0, 0]]),
    2: numpy.array([[-1, -1], [1, 1]]),
    3: numpy.array([[-1, -1], [1, 1], [0, 0]]),
}
data = numpy.array([[-1, -1], [1, 1], [0, 0]])
cgram = Clustergram.from_centers(centers, labels, data=data)
cgram.plot(pca_weighted=True)

<AxesSubplot:xlabel='Number of clusters (k)', ylabel='PCA weighted mean of the clusters'>
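
One way to see why the data is needed: a PCA weighting has to be estimated from the observations before it can be applied to the centers. The sketch below projects each set of centers onto the first principal component of the data using NumPy. It illustrates the general idea under that assumption and is not a copy of clustergram's implementation.

```python
import numpy

data = numpy.array([[-1, -1], [1, 1], [0, 0]], dtype=float)
centers = {
    1: numpy.array([[0, 0]]),
    2: numpy.array([[-1, -1], [1, 1]]),
    3: numpy.array([[-1, -1], [1, 1], [0, 0]]),
}

# First principal component of the centered data, obtained via SVD.
centered = data - data.mean(axis=0)
_, _, components = numpy.linalg.svd(centered, full_matrices=False)
first_component = components[0]

# Project each set of centers onto that component: one value per cluster.
weighted = {k: c @ first_component for k, c in centers.items()}

for k, values in weighted.items():
    print(k, values)
```

The sign of a principal component is arbitrary, so the projected values may come out mirrored around zero.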
