Clustering methods

Note

You can try this notebook in you browser: Binder

Scikit-learn, SciPy and RAPIDS cuML backends

Clustergram offers three backends for the computation - scikit-learn and scipy which use CPU and RAPIDS.AI cuML, which uses GPU. Note that all are optional dependencies but you will need at least one of them to generate clustergram.

Let’s load the data on Palmer penguins dataset. See the Introduction for its overview.

import seaborn
from sklearn.preprocessing import scale
from clustergram import Clustergram

df = seaborn.load_dataset('penguins')
data = scale(df.drop(columns=['species', 'island', 'sex']).dropna())

seaborn.set(style='whitegrid')

Using scikit-learn (default):

cgram = Clustergram(range(1, 8), backend="sklearn")
cgram.fit(data)
K=1 skipped. Mean computed from data directly.
K=2 fitted in 0.4000694751739502 seconds.
K=3 fitted in 1.2731716632843018 seconds.
K=4 fitted in 0.7346155643463135 seconds.
K=5 fitted in 1.2527844905853271 seconds.
K=6 fitted in 0.8201541900634766 seconds.
K=7 fitted in 3.0236527919769287 seconds.

Using cuML:

cgram = Clustergram(range(1, 8), backend='cuML')
cgram.fit(data)
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
~/checkouts/readthedocs.org/user_builds/clustergram/conda/v0.5.1/lib/python3.9/site-packages/clustergram/clustergram.py in __init__(self, k_range, backend, method, verbose, **kwargs)
    161             try:
--> 162                 import cudf
    163             except (ImportError, ModuleNotFoundError):

ModuleNotFoundError: No module named 'cudf'

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
<ipython-input-3-1fbd92a31667> in <module>
----> 1 cgram = Clustergram(range(1, 8), backend='cuML')
      2 cgram.fit(data)

~/checkouts/readthedocs.org/user_builds/clustergram/conda/v0.5.1/lib/python3.9/site-packages/clustergram/clustergram.py in __init__(self, k_range, backend, method, verbose, **kwargs)
    162                 import cudf
    163             except (ImportError, ModuleNotFoundError):
--> 164                 raise ImportError(
    165                     "cuML, cuDF and cupy packages are required to use `cuML` backend."
    166                 )

ImportError: cuML, cuDF and cupy packages are required to use `cuML` backend.

data can be all data types supported by the selected backend (including cudf.DataFrame or cupy.ndarray with cuML backend).

Supported methods

Clustergram currently supports K-Means, Mini Batch K-Means, Gaussian Mixture Model and SciPy’s hierarchical clustering methods. Note tha GMM and Mini Batch K-Means are supported only for scikit-learn backend and hierarchical methods are supported only for scipy backend.

Using K-Means (default):

cgram = Clustergram(range(1, 8), method='kmeans')
cgram.fit(data)
cgram.plot()
K=1 skipped. Mean computed from data directly.
K=2 fitted in 0.5471713542938232 seconds.
K=3 fitted in 1.989260196685791 seconds.
K=4 fitted in 2.003063917160034 seconds.
K=5 fitted in 0.11206221580505371 seconds.
K=6 fitted in 0.791018009185791 seconds.
K=7 fitted in 2.239922285079956 seconds.
<AxesSubplot:xlabel='Number of clusters (k)', ylabel='PCA weighted mean of the clusters'>
../_images/methods_8_7.png

Using Mini Batch K-Means, which can provide significant speedup over K-Means:

cgram = Clustergram(range(1, 8), method='minibatchkmeans', batch_size=100)
cgram.fit(data)
cgram.plot()
K=1 skipped. Mean computed from data directly.
K=2 fitted in 0.20737481117248535 seconds.
K=3 fitted in 0.04286766052246094 seconds.
K=4 fitted in 0.19988632202148438 seconds.
K=5 fitted in 0.19666814804077148 seconds.
K=6 fitted in 0.2511608600616455 seconds.
K=7 fitted in 0.2183680534362793 seconds.
<AxesSubplot:xlabel='Number of clusters (k)', ylabel='PCA weighted mean of the clusters'>
../_images/methods_10_6.png

Using Gaussian Mixture Model:

cgram = Clustergram(range(1, 8), method='gmm')
cgram.fit(data)
cgram.plot()
K=1 fitted in 0.06161618232727051 seconds.
K=2 fitted in 0.10887432098388672 seconds.
K=3 fitted in 0.20110058784484863 seconds.
K=4 fitted in 0.22536683082580566 seconds.
K=5 fitted in 0.3320627212524414 seconds.
K=6 fitted in 0.06666207313537598 seconds.
K=7 fitted in 0.09185957908630371 seconds.
<AxesSubplot:xlabel='Number of clusters (k)', ylabel='PCA weighted mean of the clusters'>
../_images/methods_12_5.png

Using Ward’s hierarchical clustering:

cgram = Clustergram(range(1, 8), method='hierarchical', linkage='ward')
cgram.fit(data)
cgram.plot()
<AxesSubplot:xlabel='Number of clusters (k)', ylabel='PCA weighted mean of the clusters'>
../_images/methods_14_1.png

Manual input

Alternatively, you can create clustergram using from_data or from_centers methods based on alternative clustering algorithms.

Using Clustergram.from_data which creates cluster centers as mean or median values:

import numpy
import pandas

data = numpy.array([[-1, -1, 0, 10], [1, 1, 10, 2], [0, 0, 20, 4]])
labels = pandas.DataFrame({1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]})

cgram = Clustergram.from_data(data, labels)
cgram.plot()
<AxesSubplot:xlabel='Number of clusters (k)', ylabel='PCA weighted mean of the clusters'>
../_images/methods_16_1.png

Using Clustergram.from_centers based on explicit cluster centers.:

labels = pandas.DataFrame({1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]})
centers = {
            1: numpy.array([[0, 0, 0]]),
            2: numpy.array([[-1, -1, -1], [1, 1, 1]]),
            3: numpy.array([[-1, -1, -1], [1, 1, 1], [0, 0, 0]]),
        }
cgram = Clustergram.from_centers(centers, labels)
cgram.plot(pca_weighted=False)
<AxesSubplot:xlabel='Number of clusters (k)', ylabel='Mean of the clusters'>
../_images/methods_18_1.png

To support PCA weighted plots with clustergram created from centers you also need to pass data:

labels = pandas.DataFrame({1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]})
centers = {
    1: numpy.array([[0, 0]]),
    2: numpy.array([[-1, -1], [1, 1]]),
    3: numpy.array([[-1, -1], [1, 1], [0, 0]]),
}
data = numpy.array([[-1, -1], [1, 1], [0, 0]])
cgram = Clustergram.from_centers(centers, labels, data=data)
cgram.plot(pca_weighted=True)
<AxesSubplot:xlabel='Number of clusters (k)', ylabel='PCA weighted mean of the clusters'>
../_images/methods_20_1.png