Clustering methods

Note

You can try this notebook in your browser: Binder

Scikit-learn, SciPy and RAPIDS cuML backends

Clustergram offers three computation backends: scikit-learn and SciPy, which run on the CPU, and RAPIDS cuML, which runs on the GPU. All three are optional dependencies, but you need at least one of them to generate a clustergram.

Let’s load the Palmer penguins dataset. See the Introduction for an overview.

import seaborn
from sklearn.preprocessing import scale
from clustergram import Clustergram

df = seaborn.load_dataset('penguins')
data = scale(df.drop(columns=['species', 'island', 'sex']).dropna())

seaborn.set(style='whitegrid')
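The scale call above standardizes each column to zero mean and unit variance, so that no single measurement dominates the distance computations. A minimal NumPy sketch of that transformation follows; the helper name standardize and the sample values are made up for illustration and are not the penguins data:

```python
import numpy as np


def standardize(values):
    """Center each column on zero and divide by its population standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean(axis=0)) / values.std(axis=0)


# Made-up measurements on very different scales (illustrative only).
raw = np.array([[39.1, 3750.0], [46.5, 4200.0], [50.0, 5700.0], [36.7, 3450.0]])
scaled = standardize(raw)
```

After this step every column contributes on a comparable scale, whatever its original units.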

Using scikit-learn (default):

cgram = Clustergram(range(1, 8), backend="sklearn")
cgram.fit(data)
K=1 skipped. Mean computed from data directly.
K=2 fitted in 1.356032133102417 seconds.
K=3 fitted in 1.8639934062957764 seconds.
K=4 fitted in 1.917318344116211 seconds.
K=5 fitted in 2.319411039352417 seconds.
K=6 fitted in 2.563551425933838 seconds.
K=7 fitted in 2.1877193450927734 seconds.

Using cuML:

cgram = Clustergram(range(1, 8), backend='cuML')
cgram.fit(data)
The environment used to build this page does not have the RAPIDS packages installed, so the call above fails with:

ImportError: cuML, cuDF and cupy packages are required to use `cuML` backend.

The data argument can be any type supported by the selected backend (including cudf.DataFrame or cupy.ndarray with the cuML backend).
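Because the cuML backend raises an ImportError when its packages are missing, it can be convenient to pick the backend at runtime. The sketch below is one possible approach, not part of Clustergram: the helper names is_installed and choose_kmeans_backend are hypothetical, and the check relies only on the standard library.

```python
from importlib.util import find_spec


def is_installed(module_name):
    """Return True if the top-level module can be located without importing it."""
    return find_spec(module_name) is not None


def choose_kmeans_backend(installed=is_installed):
    """Prefer the GPU backend when its packages are present, else fall back to CPU."""
    # The cuML backend needs all three RAPIDS-related packages.
    if all(installed(name) for name in ("cuml", "cudf", "cupy")):
        return "cuML"
    if installed("sklearn"):
        return "sklearn"
    raise ImportError("Neither the cuML stack nor scikit-learn is installed.")
```

The returned string can then be passed on, for example as Clustergram(range(1, 8), backend=choose_kmeans_backend()).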

Supported methods

Clustergram currently supports K-Means, Mini Batch K-Means, Gaussian Mixture Model and SciPy’s hierarchical clustering methods. Note that GMM and Mini Batch K-Means are supported only by the scikit-learn backend, and hierarchical methods only by the SciPy backend.
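The method and backend combinations described above can be written down as a small lookup table. This is only a sketch of the rules as stated on this page; the names SUPPORTED_BACKENDS and is_supported are hypothetical and do not exist in Clustergram.

```python
# Method-to-backend support as described in the paragraph above.
SUPPORTED_BACKENDS = {
    "kmeans": {"sklearn", "cuML"},
    "minibatchkmeans": {"sklearn"},
    "gmm": {"sklearn"},
    "hierarchical": {"scipy"},
}


def is_supported(method, backend):
    """Return True if the given method can run on the given backend."""
    return backend in SUPPORTED_BACKENDS.get(method, set())
```

For example, is_supported('gmm', 'cuML') is False, while is_supported('kmeans', 'cuML') is True.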

Using K-Means (default):

cgram = Clustergram(range(1, 8), method='kmeans')
cgram.fit(data)
cgram.plot()
K=1 skipped. Mean computed from data directly.
K=2 fitted in 0.8113729953765869 seconds.
K=3 fitted in 1.4555156230926514 seconds.
K=4 fitted in 1.8362269401550293 seconds.
K=5 fitted in 1.9503719806671143 seconds.
K=6 fitted in 2.2527010440826416 seconds.
K=7 fitted in 2.2524571418762207 seconds.
<AxesSubplot:xlabel='Number of clusters (k)', ylabel='PCA weighted mean of the clusters'>
[Figure: clustergram of the K-Means solutions for k = 1 to 7]

Using Mini Batch K-Means, which can provide a significant speedup over K-Means:

cgram = Clustergram(range(1, 8), method='minibatchkmeans', batch_size=100)
cgram.fit(data)
cgram.plot()
K=1 skipped. Mean computed from data directly.
K=2 fitted in 0.19777917861938477 seconds.
K=3 fitted in 0.18311476707458496 seconds.
K=4 fitted in 0.015962839126586914 seconds.
K=5 fitted in 0.16828298568725586 seconds.
K=6 fitted in 1.2683145999908447 seconds.
K=7 fitted in 0.6232771873474121 seconds.
<AxesSubplot:xlabel='Number of clusters (k)', ylabel='PCA weighted mean of the clusters'>
[Figure: clustergram of the Mini Batch K-Means solutions for k = 1 to 7]
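The speedup comes from updating the cluster centers with small random batches rather than the full dataset on every iteration. The following is a simplified NumPy sketch of that idea, using a per-center running mean. It is not the scikit-learn implementation, and the function name and defaults are illustrative assumptions.

```python
import numpy as np


def minibatch_kmeans(data, k, batch_size=100, n_iter=50, seed=0):
    """Simplified mini-batch K-Means: each center is a running mean of its assigned points."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize the centers from k distinct random observations.
    centers = data[rng.choice(len(data), size=k, replace=False)].copy()
    counts = np.zeros(k)
    size = min(batch_size, len(data))
    for _ in range(n_iter):
        # Draw a small random batch instead of using all observations.
        batch = data[rng.choice(len(data), size=size, replace=False)]
        # Assign each batch point to its nearest center (squared Euclidean distance).
        dists = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        nearest = dists.argmin(axis=1)
        # Move each center toward its points with a per-center learning rate 1 / count.
        for point, j in zip(batch, nearest):
            counts[j] += 1
            eta = 1.0 / counts[j]
            centers[j] = (1.0 - eta) * centers[j] + eta * point
    return centers
```

Each iteration touches only batch_size observations, so the cost per step does not grow with the size of the dataset.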

Using Gaussian Mixture Model:

cgram = Clustergram(range(1, 8), method='gmm')
cgram.fit(data)
cgram.plot()
K=1 fitted in 0.07065105438232422 seconds.
K=2 fitted in 0.17306876182556152 seconds.
K=3 fitted in 0.1903231143951416 seconds.
K=4 fitted in 0.3606398105621338 seconds.
K=5 fitted in 0.22977614402770996 seconds.
K=6 fitted in 0.24111223220825195 seconds.
K=7 fitted in 0.06499433517456055 seconds.
<AxesSubplot:xlabel='Number of clusters (k)', ylabel='PCA weighted mean of the clusters'>
[Figure: clustergram of the Gaussian Mixture Model solutions for k = 1 to 7]

Using Ward’s hierarchical clustering:

cgram = Clustergram(range(1, 8), method='hierarchical', linkage='ward')
cgram.fit(data)
cgram.plot()
<AxesSubplot:xlabel='Number of clusters (k)', ylabel='PCA weighted mean of the clusters'>
[Figure: clustergram of the Ward hierarchical clustering solutions for k = 1 to 7]
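Hierarchical clustering differs from the other methods in that a single linkage tree can be cut at any number of clusters. The sketch below shows that idea with SciPy directly, on random stand-in data. Whether Clustergram performs exactly these calls internally is an assumption here.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
points = rng.normal(size=(30, 4))  # stand-in for the scaled data

# Build the Ward linkage tree once.
tree = linkage(points, method="ward")

# Cut the same tree into at most k flat clusters for each k in the range.
labels_per_k = {
    k: fcluster(tree, t=k, criterion="maxclust") for k in range(1, 8)
}
```

Note that fcluster numbers clusters from 1, so labels may need shifting if zero-based labels are expected elsewhere.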

Manual input

Alternatively, you can create a clustergram from the results of other clustering algorithms using the from_data or from_centers methods.

Using Clustergram.from_data, which creates cluster centers as mean or median values:

import numpy
import pandas

data = numpy.array([[-1, -1, 0, 10], [1, 1, 10, 2], [0, 0, 20, 4]])
labels = pandas.DataFrame({1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]})

cgram = Clustergram.from_data(data, labels)
cgram.plot()
<AxesSubplot:xlabel='Number of clusters (k)', ylabel='PCA weighted mean of the clusters'>
[Figure: clustergram created with from_data]
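To make the idea of centers as mean or median values concrete, the sketch below derives them from the same data and labels with plain NumPy. The helper centers_from_labels is hypothetical and only illustrates the computation; it is not the library's code.

```python
import numpy as np

data = np.array([[-1, -1, 0, 10], [1, 1, 10, 2], [0, 0, 20, 4]])
# One label assignment per number of clusters k (same values as above).
labels = {1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]}


def centers_from_labels(data, labels, reducer=np.mean):
    """For each k, reduce the rows of every cluster to a single center."""
    centers = {}
    for k, assignment in labels.items():
        assignment = np.asarray(assignment)
        centers[k] = np.array(
            [reducer(data[assignment == c], axis=0) for c in np.unique(assignment)]
        )
    return centers


mean_centers = centers_from_labels(data, labels)
median_centers = centers_from_labels(data, labels, reducer=np.median)
```

For k=2, the first center is the mean of the first two rows, [0, 0, 5, 6], and the second center is the third row itself.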

Using Clustergram.from_centers, based on explicit cluster centers:

labels = pandas.DataFrame({1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]})
centers = {
    1: numpy.array([[0, 0, 0]]),
    2: numpy.array([[-1, -1, -1], [1, 1, 1]]),
    3: numpy.array([[-1, -1, -1], [1, 1, 1], [0, 0, 0]]),
}
cgram = Clustergram.from_centers(centers, labels)
cgram.plot(pca_weighted=False)
<AxesSubplot:xlabel='Number of clusters (k)', ylabel='Mean of the clusters'>
[Figure: clustergram created with from_centers, unweighted cluster means]

To support PCA weighted plots with a clustergram created from centers, you also need to pass the data:

labels = pandas.DataFrame({1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]})
centers = {
    1: numpy.array([[0, 0]]),
    2: numpy.array([[-1, -1], [1, 1]]),
    3: numpy.array([[-1, -1], [1, 1], [0, 0]]),
}
data = numpy.array([[-1, -1], [1, 1], [0, 0]])
cgram = Clustergram.from_centers(centers, labels, data=data)
cgram.plot(pca_weighted=True)
<AxesSubplot:xlabel='Number of clusters (k)', ylabel='PCA weighted mean of the clusters'>
[Figure: clustergram created with from_centers and data, PCA weighted cluster means]
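The data are needed because a PCA weighted value depends on the principal components of the observations, which cannot be recovered from the centers alone. The NumPy sketch below contrasts the two kinds of y-values under the assumption that the weighted value is the projection of each center onto the first principal component of the data. This is an illustration of the concept, not a copy of the library's internals.

```python
import numpy as np

data = np.array([[-1.0, -1.0], [1.0, 1.0], [0.0, 0.0]])
centers = {
    1: np.array([[0.0, 0.0]]),
    2: np.array([[-1.0, -1.0], [1.0, 1.0]]),
    3: np.array([[-1.0, -1.0], [1.0, 1.0], [0.0, 0.0]]),
}

# First principal component of the data via SVD of the centered matrix.
centered = data - data.mean(axis=0)
_, _, components = np.linalg.svd(centered, full_matrices=False)
first_component = components[0]

# Unweighted: plain mean across the features of each center.
unweighted = {k: c.mean(axis=1) for k, c in centers.items()}
# Weighted (assumed form): projection of each center onto the first component.
weighted = {k: c @ first_component for k, c in centers.items()}
```

The sign of a principal component is arbitrary, so the weighted values may come out mirrored around zero without changing the shape of the plot.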