Clustering methods

Note

You can try this notebook in your browser: Binder

Scikit-learn, SciPy and RAPIDS cuML backends

Clustergram offers three computation backends: scikit-learn and SciPy, which run on the CPU, and RAPIDS.AI cuML, which runs on the GPU. All three are optional dependencies, but you need at least one of them to generate a clustergram.
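Because every backend is optional, it can help to check which ones are importable before constructing a Clustergram. The following is a minimal sketch using only the standard library. The helper names (is_installed, available_backends, pick_backend) are hypothetical and not part of clustergram, the backend string "scipy" is an assumption, and the module list for cuML follows the ImportError message shown later on this page.

```python
from importlib.util import find_spec

# Modules each backend needs in order to be usable.
# Assumption: backend strings are "sklearn", "scipy" and "cuML".
BACKEND_MODULES = {
    "sklearn": ("sklearn",),
    "scipy": ("scipy",),
    "cuML": ("cuml", "cudf", "cupy"),
}


def is_installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return find_spec(module_name) is not None


def available_backends(check=is_installed):
    """List backends whose required modules are all present."""
    return [
        name
        for name, modules in BACKEND_MODULES.items()
        if all(check(module) for module in modules)
    ]


def pick_backend(preferred=("cuML", "sklearn", "scipy"), check=is_installed):
    """Return the first preferred backend that is usable, or None."""
    usable = available_backends(check)
    for name in preferred:
        if name in usable:
            return name
    return None


print(available_backends())
print(pick_backend())
```

The value returned by pick_backend could then be passed as the backend argument, for example Clustergram(range(1, 8), backend=pick_backend()), provided it is not None.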

Let’s load the Palmer penguins dataset. See the Introduction for an overview of it.

import seaborn
from sklearn.preprocessing import scale
from clustergram import Clustergram

df = seaborn.load_dataset('penguins')
data = scale(df.drop(columns=['species', 'island', 'sex']).dropna())

seaborn.set(style='whitegrid')

Using scikit-learn (default):

cgram = Clustergram(range(1, 8), backend="sklearn")
cgram.fit(data)
K=1 skipped. Mean computed from data directly.
K=2 fitted in 1.3132154941558838 seconds.
K=3 fitted in 1.7927002906799316 seconds.
K=4 fitted in 2.0142221450805664 seconds.
K=5 fitted in 1.7174348831176758 seconds.
K=6 fitted in 2.623941659927368 seconds.
K=7 fitted in 2.367375612258911 seconds.

Using cuML:

cgram = Clustergram(range(1, 8), backend='cuML')
cgram.fit(data)
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
~/checkouts/readthedocs.org/user_builds/clustergram/conda/latest/lib/python3.9/site-packages/clustergram/clustergram.py in __init__(self, k_range, backend, method, verbose, **kwargs)
    166             try:
--> 167                 import cudf
    168             except (ImportError, ModuleNotFoundError):

ModuleNotFoundError: No module named 'cudf'

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
/tmp/ipykernel_2112/3660680281.py in <module>
----> 1 cgram = Clustergram(range(1, 8), backend='cuML')
      2 cgram.fit(data)

~/checkouts/readthedocs.org/user_builds/clustergram/conda/latest/lib/python3.9/site-packages/clustergram/clustergram.py in __init__(self, k_range, backend, method, verbose, **kwargs)
    167                 import cudf
    168             except (ImportError, ModuleNotFoundError):
--> 169                 raise ImportError(
    170                     "cuML, cuDF and cupy packages are required to use `cuML` backend."
    171                 )

ImportError: cuML, cuDF and cupy packages are required to use `cuML` backend.

The data argument can be of any type supported by the selected backend (including cudf.DataFrame or cupy.ndarray with the cuML backend).

Supported methods

Clustergram currently supports K-Means, Mini Batch K-Means, Gaussian Mixture Model and SciPy’s hierarchical clustering methods. Note that GMM and Mini Batch K-Means are supported only by the scikit-learn backend, and hierarchical methods are supported only by the SciPy backend.
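The compatibility rules above can be written down as a small lookup table. This is an illustrative sketch only: the SUPPORTED table and check_combination helper are hypothetical and not part of clustergram, the backend string "scipy" is an assumption, and listing K-Means under both scikit-learn and cuML is a reading of the examples on this page rather than a statement of the library's full behaviour.

```python
# Method -> backends it can run on, as described in the paragraph above.
SUPPORTED = {
    "kmeans": {"sklearn", "cuML"},
    "minibatchkmeans": {"sklearn"},
    "gmm": {"sklearn"},
    "hierarchical": {"scipy"},
}


def check_combination(method, backend):
    """Raise ValueError for a method/backend pair the table does not allow."""
    if method not in SUPPORTED:
        raise ValueError(f"Unknown method: {method!r}")
    if backend not in SUPPORTED[method]:
        allowed = ", ".join(sorted(SUPPORTED[method]))
        raise ValueError(
            f"Method {method!r} is not available with backend {backend!r}; "
            f"expected one of: {allowed}"
        )
    return True


check_combination("gmm", "sklearn")  # passes silently
```

Checking the pair up front gives a clearer message than discovering the mismatch partway through fitting.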

Using K-Means (default):

cgram = Clustergram(range(1, 8), method='kmeans')
cgram.fit(data)
cgram.plot()
K=1 skipped. Mean computed from data directly.
K=2 fitted in 1.2612929344177246 seconds.
K=3 fitted in 1.678342342376709 seconds.
K=4 fitted in 1.8191895484924316 seconds.
K=5 fitted in 2.0595266819000244 seconds.
K=6 fitted in 2.7781901359558105 seconds.
K=7 fitted in 2.7572553157806396 seconds.
<AxesSubplot:xlabel='Number of clusters (k)', ylabel='PCA weighted mean of the clusters'>
../_images/methods_8_8.png

Using Mini Batch K-Means, which can provide a significant speedup over K-Means:

cgram = Clustergram(range(1, 8), method='minibatchkmeans', batch_size=100)
cgram.fit(data)
cgram.plot()
K=1 skipped. Mean computed from data directly.
K=2 fitted in 0.08027267456054688 seconds.
K=3 fitted in 0.05731511116027832 seconds.
K=4 fitted in 0.04260754585266113 seconds.
K=5 fitted in 0.11631941795349121 seconds.
K=6 fitted in 0.17046427726745605 seconds.
K=7 fitted in 0.19886183738708496 seconds.
<AxesSubplot:xlabel='Number of clusters (k)', ylabel='PCA weighted mean of the clusters'>
../_images/methods_10_4.png

Using Gaussian Mixture Model:

cgram = Clustergram(range(1, 8), method='gmm')
cgram.fit(data)
cgram.plot()
K=1 fitted in 0.08117246627807617 seconds.
K=2 fitted in 0.16454148292541504 seconds.
K=3 fitted in 0.20287466049194336 seconds.
K=4 fitted in 0.17831110954284668 seconds.
K=5 fitted in 0.2770540714263916 seconds.
K=6 fitted in 0.48028087615966797 seconds.
K=7 fitted in 0.2442922592163086 seconds.
<AxesSubplot:xlabel='Number of clusters (k)', ylabel='PCA weighted mean of the clusters'>
../_images/methods_12_6.png

Using Ward’s hierarchical clustering:

cgram = Clustergram(range(1, 8), method='hierarchical', linkage='ward')
cgram.fit(data)
cgram.plot()
<AxesSubplot:xlabel='Number of clusters (k)', ylabel='PCA weighted mean of the clusters'>
../_images/methods_14_1.png

Manual input

Alternatively, you can create a clustergram from the results of other clustering algorithms using the from_data or from_centers methods.

Using Clustergram.from_data, which creates cluster centers as mean or median values:

import numpy
import pandas

data = numpy.array([[-1, -1, 0, 10], [1, 1, 10, 2], [0, 0, 20, 4]])
labels = pandas.DataFrame({1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]})

cgram = Clustergram.from_data(data, labels)
cgram.plot()
<AxesSubplot:xlabel='Number of clusters (k)', ylabel='PCA weighted mean of the clusters'>
../_images/methods_16_1.png
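To make "cluster centers as mean or median values" concrete, the sketch below computes such centers by hand with NumPy for the same toy data and labels. It illustrates the idea only and is not clustergram's implementation. The centers_from_labels helper and the labels_by_k name are hypothetical.

```python
import numpy

points = numpy.array([[-1, -1, 0, 10], [1, 1, 10, 2], [0, 0, 20, 4]])
# One label vector per number of clusters k (same values as the DataFrame above).
labels_by_k = {1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]}


def centers_from_labels(points, labels_for_k, reduce=numpy.mean):
    """Reduce the rows assigned to each label into one center per cluster."""
    labels_for_k = numpy.asarray(labels_for_k)
    return numpy.array(
        [
            reduce(points[labels_for_k == cluster], axis=0)
            for cluster in numpy.unique(labels_for_k)
        ]
    )


centers_by_k = {
    k: centers_from_labels(points, labels_for_k)
    for k, labels_for_k in labels_by_k.items()
}
print(centers_by_k[2])
```

For k=2 the first center is the mean of the first two rows and the second center is the third row itself. Passing reduce=numpy.median gives the median-based variant.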

Using Clustergram.from_centers, based on explicit cluster centers:

labels = pandas.DataFrame({1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]})
centers = {
    1: numpy.array([[0, 0, 0]]),
    2: numpy.array([[-1, -1, -1], [1, 1, 1]]),
    3: numpy.array([[-1, -1, -1], [1, 1, 1], [0, 0, 0]]),
}
cgram = Clustergram.from_centers(centers, labels)
cgram.plot(pca_weighted=False)
<AxesSubplot:xlabel='Number of clusters (k)', ylabel='Mean of the clusters'>
../_images/methods_18_1.png

To support PCA-weighted plots with a clustergram created from centers, you also need to pass the data:

labels = pandas.DataFrame({1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]})
centers = {
    1: numpy.array([[0, 0]]),
    2: numpy.array([[-1, -1], [1, 1]]),
    3: numpy.array([[-1, -1], [1, 1], [0, 0]]),
}
data = numpy.array([[-1, -1], [1, 1], [0, 0]])
cgram = Clustergram.from_centers(centers, labels, data=data)
cgram.plot(pca_weighted=True)
<AxesSubplot:xlabel='Number of clusters (k)', ylabel='PCA weighted mean of the clusters'>
../_images/methods_20_1.png
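When centers and labels come from an external algorithm, a quick consistency check before calling from_centers can catch shape mismatches early. The sketch below encodes the pattern visible in the examples above: k rows of centers for each k, labels in the range 0 to k-1, and the same number of columns in centers and data. The check_centers helper is hypothetical, and these rules are assumptions drawn from the examples, not a documented contract of clustergram.

```python
import numpy


def check_centers(centers, labels, data=None):
    """Validate centers/labels (and optionally data) for every k in centers."""
    for k, centers_k in centers.items():
        centers_k = numpy.asarray(centers_k)
        if centers_k.ndim != 2 or centers_k.shape[0] != k:
            raise ValueError(f"k={k}: expected {k} rows of centers")
        labels_k = numpy.asarray(labels[k])
        if labels_k.min() < 0 or labels_k.max() >= k:
            raise ValueError(f"k={k}: labels must lie between 0 and {k - 1}")
        if data is not None:
            if centers_k.shape[1] != numpy.asarray(data).shape[1]:
                raise ValueError(f"k={k}: centers and data differ in width")
    return True


example_centers = {
    1: numpy.array([[0, 0]]),
    2: numpy.array([[-1, -1], [1, 1]]),
    3: numpy.array([[-1, -1], [1, 1], [0, 0]]),
}
# A plain dict is used here; a pandas.DataFrame indexed by k works the same way.
example_labels = {1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]}
example_data = numpy.array([[-1, -1], [1, 1], [0, 0]])

print(check_centers(example_centers, example_labels, example_data))
```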