Clustering methods

Note

You can try this notebook in your browser: Binder

Scikit-learn, SciPy and RAPIDS cuML backends

Clustergram offers three backends for the computation: scikit-learn and SciPy, which run on the CPU, and RAPIDS.AI cuML, which runs on the GPU. All three are optional dependencies, but you need at least one of them to generate a clustergram.
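Because every backend is optional, it can help to check which ones are usable before constructing a Clustergram. The following is a minimal sketch using only the standard library. The BACKEND_REQUIREMENTS mapping and the available_backends helper are hypothetical names introduced here, not part of clustergram; the cuML entry follows the ImportError message shown later on this page, while the other two entries are assumptions.

```python
from importlib.util import find_spec

# Import names each backend is assumed to need. The cuML entry mirrors the
# ImportError message raised by clustergram; the rest are assumptions.
BACKEND_REQUIREMENTS = {
    "sklearn": ("sklearn",),
    "scipy": ("scipy",),
    "cuML": ("cuml", "cudf", "cupy"),
}


def available_backends():
    """Return the backend names whose required packages can be located."""
    return [
        backend
        for backend, modules in BACKEND_REQUIREMENTS.items()
        # find_spec locates a top-level package without importing it
        if all(find_spec(name) is not None for name in modules)
    ]


print(available_backends())
```

The printed list depends on the environment, so an empty list means none of the three backends can be used there.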

Let’s load the Palmer penguins dataset. See the Introduction for an overview.

import seaborn
from sklearn.preprocessing import scale
from clustergram import Clustergram

df = seaborn.load_dataset('penguins')
data = scale(df.drop(columns=['species', 'island', 'sex']).dropna())

seaborn.set(style='whitegrid')
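The scale call above standardizes each column to zero mean and unit variance, so that no single measurement dominates the distance computations. As a rough illustration, the same transformation can be written in plain NumPy. The small raw array below is made up for the example and is not taken from the penguins dataset.

```python
import numpy as np

# Made-up table: two features on very different scales.
raw = np.array([[39.1, 3750.0], [46.5, 4200.0], [50.0, 5700.0]])

# Column-wise standardization: subtract the mean and divide by the
# population standard deviation (ddof=0), as sklearn's scale does by default.
standardized = (raw - raw.mean(axis=0)) / raw.std(axis=0)

print(standardized.mean(axis=0))  # close to 0 for every column
print(standardized.std(axis=0))   # close to 1 for every column
```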

Using scikit-learn (default):

cgram = Clustergram(range(1, 8), backend="sklearn")
cgram.fit(data)
K=1 skipped. Mean computed from data directly.
K=2 fitted in 1.010213851928711 seconds.
K=3 fitted in 1.8067669868469238 seconds.
K=4 fitted in 1.95149827003479 seconds.
K=5 fitted in 2.2991881370544434 seconds.
K=6 fitted in 2.6459999084472656 seconds.
K=7 fitted in 2.8548080921173096 seconds.

Using cuML (the traceback below shows what happens when the cuML, cuDF and cupy packages are not installed):

cgram = Clustergram(range(1, 8), backend='cuML')
cgram.fit(data)
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
~/checkouts/readthedocs.org/user_builds/clustergram/conda/stable/lib/python3.9/site-packages/clustergram/clustergram.py in __init__(self, k_range, backend, method, verbose, **kwargs)
    161             try:
--> 162                 import cudf
    163             except (ImportError, ModuleNotFoundError):

ModuleNotFoundError: No module named 'cudf'

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
/tmp/ipykernel_2098/3660680281.py in <module>
----> 1 cgram = Clustergram(range(1, 8), backend='cuML')
      2 cgram.fit(data)

~/checkouts/readthedocs.org/user_builds/clustergram/conda/stable/lib/python3.9/site-packages/clustergram/clustergram.py in __init__(self, k_range, backend, method, verbose, **kwargs)
    162                 import cudf
    163             except (ImportError, ModuleNotFoundError):
--> 164                 raise ImportError(
    165                     "cuML, cuDF and cupy packages are required to use `cuML` backend."
    166                 )

ImportError: cuML, cuDF and cupy packages are required to use `cuML` backend.

The data argument can be of any type supported by the selected backend (including cudf.DataFrame or cupy.ndarray with the cuML backend).
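As the traceback above shows, requesting the cuML backend raises ImportError at construction time when the GPU packages are missing. One way to handle that is to try backends in order of preference. The sketch below is an illustration only: build_with_fallback and fake_factory are hypothetical helpers, and fake_factory stands in for Clustergram so that the example runs without clustergram installed.

```python
def build_with_fallback(factory, backends=("cuML", "sklearn")):
    """Try each backend in order and return (backend, object) for the first
    one whose construction does not raise ImportError."""
    last_error = None
    for backend in backends:
        try:
            return backend, factory(backend)
        except ImportError as err:  # also covers ModuleNotFoundError
            last_error = err
    raise ImportError(f"none of the backends {backends} could be used") from last_error


def fake_factory(name):
    """Stand-in for Clustergram that mimics a machine without GPU packages."""
    if name == "cuML":
        raise ImportError("cuML, cuDF and cupy packages are required")
    return f"clustergram using {name}"


print(build_with_fallback(fake_factory))
# prints ('sklearn', 'clustergram using sklearn')
```

With clustergram installed, the factory would be something like lambda name: Clustergram(range(1, 8), backend=name).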

Supported methods

Clustergram currently supports K-Means, Mini Batch K-Means, Gaussian Mixture Model and SciPy’s hierarchical clustering methods. Note that GMM and Mini Batch K-Means are supported only with the scikit-learn backend, and hierarchical methods are supported only with the scipy backend.
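The method and backend restrictions above can be summarized as a small lookup table. This is a sketch for illustration, not clustergram's own validation: SUPPORTED_BACKENDS and check_combination are hypothetical names, and the kmeans row (scikit-learn and cuML, but not scipy) is an assumption, since the text only states the restrictions for the other three methods.

```python
# Which backends each method can run on, following the paragraph above.
# The "kmeans" row is an assumption; the other rows restate the text.
SUPPORTED_BACKENDS = {
    "kmeans": {"sklearn", "cuML"},
    "minibatchkmeans": {"sklearn"},
    "gmm": {"sklearn"},
    "hierarchical": {"scipy"},
}


def check_combination(method, backend):
    """Return True if the combination is listed, otherwise raise ValueError."""
    if method not in SUPPORTED_BACKENDS:
        raise ValueError(f"unknown method {method!r}")
    allowed = SUPPORTED_BACKENDS[method]
    if backend not in allowed:
        raise ValueError(
            f"method {method!r} is not listed for backend {backend!r}; "
            f"listed backends: {sorted(allowed)}"
        )
    return True


print(check_combination("gmm", "sklearn"))  # prints True
```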

Using K-Means (default):

cgram = Clustergram(range(1, 8), method='kmeans')
cgram.fit(data)
cgram.plot()
K=1 skipped. Mean computed from data directly.
K=2 fitted in 0.7205009460449219 seconds.
K=3 fitted in 1.442162275314331 seconds.
K=4 fitted in 1.7617602348327637 seconds.
K=5 fitted in 2.101316213607788 seconds.
K=6 fitted in 2.345597267150879 seconds.
K=7 fitted in 3.096383571624756 seconds.
<AxesSubplot:xlabel='Number of clusters (k)', ylabel='PCA weighted mean of the clusters'>
../_images/methods_8_8.png

Using Mini Batch K-Means, which can provide a significant speedup over K-Means:

cgram = Clustergram(range(1, 8), method='minibatchkmeans', batch_size=100)
cgram.fit(data)
cgram.plot()
K=1 skipped. Mean computed from data directly.
K=2 fitted in 0.17685866355895996 seconds.
K=3 fitted in 1.017914056777954 seconds.
K=4 fitted in 0.9397783279418945 seconds.
K=5 fitted in 1.0208313465118408 seconds.
K=6 fitted in 0.981508731842041 seconds.
K=7 fitted in 0.9176547527313232 seconds.
<AxesSubplot:xlabel='Number of clusters (k)', ylabel='PCA weighted mean of the clusters'>
../_images/methods_10_7.png

Using Gaussian Mixture Model:

cgram = Clustergram(range(1, 8), method='gmm')
cgram.fit(data)
cgram.plot()
K=1 fitted in 0.08226275444030762 seconds.
K=2 fitted in 0.21777701377868652 seconds.
K=3 fitted in 0.2549471855163574 seconds.
K=4 fitted in 0.032038211822509766 seconds.
K=5 fitted in 0.348191499710083 seconds.
K=6 fitted in 0.30786681175231934 seconds.
K=7 fitted in 0.3150014877319336 seconds.
<AxesSubplot:xlabel='Number of clusters (k)', ylabel='PCA weighted mean of the clusters'>
../_images/methods_12_7.png

Using Ward’s hierarchical clustering:

cgram = Clustergram(range(1, 8), method='hierarchical', linkage='ward')
cgram.fit(data)
cgram.plot()
<AxesSubplot:xlabel='Number of clusters (k)', ylabel='PCA weighted mean of the clusters'>
../_images/methods_14_1.png
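Unlike K-Means, hierarchical clustering builds a single tree that can be cut at any number of clusters, which is why no per-K fitting times are printed above. The sketch below shows that idea directly with SciPy. It assumes labels for each k are obtained by cutting one Ward tree, which may differ from what clustergram does internally; the points array is made up for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Three well-separated pairs of points (made-up data).
points = np.array(
    [[0, 0], [0, 1], [10, 10], [10, 11], [30, 0], [30, 1]], dtype=float
)

# Build the Ward linkage tree once.
tree = linkage(points, method="ward")

# Cut the same tree into at most k flat clusters for each k in the range.
labels = {k: fcluster(tree, t=k, criterion="maxclust") for k in range(1, 4)}

for k, assignment in labels.items():
    print(k, assignment)
```

Each entry of labels is an array with one cluster id per point, the same kind of input that the manual methods described below expect for every k.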

Manual input

Alternatively, you can create a clustergram from the results of other clustering algorithms using the from_data or from_centers methods.

Using Clustergram.from_data, which creates cluster centers as mean or median values:

import numpy
import pandas

data = numpy.array([[-1, -1, 0, 10], [1, 1, 10, 2], [0, 0, 20, 4]])
labels = pandas.DataFrame({1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]})

cgram = Clustergram.from_data(data, labels)
cgram.plot()
<AxesSubplot:xlabel='Number of clusters (k)', ylabel='PCA weighted mean of the clusters'>
../_images/methods_16_1.png
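To make the "mean or median values" step concrete, the sketch below derives cluster centers from the same data and labels using NumPy alone. centers_from_labels is a hypothetical helper written for this illustration, not clustergram's implementation, and the labels are held in a plain dict instead of a DataFrame.

```python
import numpy as np

data = np.array([[-1, -1, 0, 10], [1, 1, 10, 2], [0, 0, 20, 4]], dtype=float)
# One label per observation for each number of clusters k.
labels = {1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]}


def centers_from_labels(data, labels, reducer=np.mean):
    """For each k, reduce the rows of every cluster to a single center."""
    centers = {}
    for k, assignment in labels.items():
        assignment = np.asarray(assignment)
        centers[k] = np.vstack(
            [
                reducer(data[assignment == label], axis=0)
                for label in np.unique(assignment)
            ]
        )
    return centers


centers = centers_from_labels(data, labels)
print(centers[2])
# cluster 0 averages the first two rows: [0, 0, 5, 6]
# cluster 1 is the third row on its own: [0, 0, 20, 4]
```

Passing reducer=np.median gives median-based centers instead.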

Using Clustergram.from_centers with explicit cluster centers:

labels = pandas.DataFrame({1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]})
centers = {
    1: numpy.array([[0, 0, 0]]),
    2: numpy.array([[-1, -1, -1], [1, 1, 1]]),
    3: numpy.array([[-1, -1, -1], [1, 1, 1], [0, 0, 0]]),
}
cgram = Clustergram.from_centers(centers, labels)
cgram.plot(pca_weighted=False)
<AxesSubplot:xlabel='Number of clusters (k)', ylabel='Mean of the clusters'>
../_images/methods_18_1.png

To support PCA weighted plots with a clustergram created from centers, you also need to pass the data:

labels = pandas.DataFrame({1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]})
centers = {
    1: numpy.array([[0, 0]]),
    2: numpy.array([[-1, -1], [1, 1]]),
    3: numpy.array([[-1, -1], [1, 1], [0, 0]]),
}
data = numpy.array([[-1, -1], [1, 1], [0, 0]])
cgram = Clustergram.from_centers(centers, labels, data=data)
cgram.plot(pca_weighted=True)
<AxesSubplot:xlabel='Number of clusters (k)', ylabel='PCA weighted mean of the clusters'>
../_images/methods_20_1.png
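The need for data in the PCA weighted case can be illustrated with a small NumPy sketch. It assumes that the PCA weighted value of a cluster is its center projected onto the first principal component of the data, and that the unweighted value is the plain mean of the center's coordinates. This is an assumption made for illustration, not a statement of clustergram's internals.

```python
import numpy as np

data = np.array([[-1.0, -1.0], [1.0, 1.0], [0.0, 0.0]])
centers = np.array([[-1.0, -1.0], [1.0, 1.0], [0.0, 0.0]])  # centers for k=3

# The first principal component comes from the data, not from the centers:
# take the first right singular vector of the mean-centred data.
centred = data - data.mean(axis=0)
_, _, components = np.linalg.svd(centred, full_matrices=False)
first_component = components[0]

# Assumed PCA weighted value: projection of each center onto that component.
weighted = centers @ first_component
# Assumed unweighted value: mean of each center's coordinates.
unweighted = centers.mean(axis=1)

print(weighted)    # the sign of a principal component is arbitrary
print(unweighted)
```

Without data there is nothing to compute the principal component from, which is consistent with from_centers supporting only pca_weighted=False unless data is passed.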