Clustergram
Clustergram offers three computation backends: scikit-learn and scipy, which run on the CPU, and RAPIDS.AI cuML, which runs on the GPU. All three are optional dependencies, but you need at least one of them to generate a clustergram.
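Because every backend is an optional dependency, it can be useful to check which ones are importable before constructing a Clustergram. The following is a minimal sketch using only the standard library. `available_backends` is a hypothetical helper, not part of clustergram, and the mapping from backend name to module name (`sklearn`, `scipy`, `cuml`) is an assumption based on the usual import names of those packages.

```python
from importlib.util import find_spec

# Assumed mapping from Clustergram backend name to importable module name.
BACKEND_MODULES = {"sklearn": "sklearn", "scipy": "scipy", "cuML": "cuml"}


def available_backends(modules=BACKEND_MODULES):
    """Return the backend names whose module can be found in this environment."""
    return [
        name for name, module in modules.items() if find_spec(module) is not None
    ]


print(available_backends())
```

An empty list would mean that none of the three backends is installed, in which case fitting a clustergram is not possible.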
Let’s load the Palmer penguins dataset. See the Introduction for an overview.
import seaborn
from sklearn.preprocessing import scale
from clustergram import Clustergram

df = seaborn.load_dataset('penguins')
data = scale(df.drop(columns=['species', 'island', 'sex']).dropna())
seaborn.set(style='whitegrid')
Using scikit-learn (default):
cgram = Clustergram(range(1, 8), backend="sklearn")
cgram.fit(data)

K=1 skipped. Mean computed from data directly.
K=2 fitted in 0.01631617546081543 seconds.
K=3 fitted in 0.023647308349609375 seconds.
K=4 fitted in 0.026671171188354492 seconds.
K=5 fitted in 0.036658525466918945 seconds.
K=6 fitted in 0.038860321044921875 seconds.
K=7 fitted in 0.04606366157531738 seconds.
Using cuML:
cgram = Clustergram(range(1, 8), backend='cuML')
cgram.fit(data)

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
~/checkouts/readthedocs.org/user_builds/clustergram/conda/v0.5.0/lib/python3.9/site-packages/clustergram/clustergram.py in __init__(self, k_range, backend, method, verbose, **kwargs)
    161             try:
--> 162                 import cudf
    163             except (ImportError, ModuleNotFoundError):

ModuleNotFoundError: No module named 'cudf'

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
<ipython-input-3-1fbd92a31667> in <module>
----> 1 cgram = Clustergram(range(1, 8), backend='cuML')
      2 cgram.fit(data)

~/checkouts/readthedocs.org/user_builds/clustergram/conda/v0.5.0/lib/python3.9/site-packages/clustergram/clustergram.py in __init__(self, k_range, backend, method, verbose, **kwargs)
    162                 import cudf
    163             except (ImportError, ModuleNotFoundError):
--> 164                 raise ImportError(
    165                     "cuML, cuDF and cupy packages are required to use `cuML` backend."
    166                 )

ImportError: cuML, cuDF and cupy packages are required to use `cuML` backend.
data can be of any type supported by the selected backend (including cudf.DataFrame or cupy.ndarray with the cuML backend).
Clustergram currently supports K-Means, Mini Batch K-Means, Gaussian Mixture Model (GMM) and SciPy’s hierarchical clustering methods. Note that GMM and Mini Batch K-Means are supported only by the scikit-learn backend, and hierarchical methods are supported only by the scipy backend.
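The restrictions above can be written down as a small lookup table. This is only a sketch of the rules as stated in this document: `SUPPORTED` and `is_supported` are hypothetical names, not part of clustergram, and the K-Means row (scikit-learn and cuML) is an assumption inferred from the backend examples above rather than something the text states explicitly.

```python
# Method -> backends, following the description in the text.
# The "kmeans" entry is an assumption inferred from the examples.
SUPPORTED = {
    "kmeans": {"sklearn", "cuML"},
    "minibatchkmeans": {"sklearn"},
    "gmm": {"sklearn"},
    "hierarchical": {"scipy"},
}


def is_supported(method, backend):
    """Return True if the method/backend pair matches the table above."""
    return backend in SUPPORTED.get(method, set())


print(is_supported("gmm", "sklearn"))
print(is_supported("hierarchical", "sklearn"))
```

A check like this, done before calling `Clustergram(...)`, would report an unsupported combination early and independently of the library's own error handling.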
Using K-Means (default):
cgram = Clustergram(range(1, 8), method='kmeans')
cgram.fit(data)
cgram.plot()

K=1 skipped. Mean computed from data directly.
K=2 fitted in 0.01931285858154297 seconds.
K=3 fitted in 0.024785518646240234 seconds.
K=4 fitted in 0.029566049575805664 seconds.
K=5 fitted in 0.03727555274963379 seconds.
K=6 fitted in 0.04319429397583008 seconds.
K=7 fitted in 0.04481148719787598 seconds.
<AxesSubplot:xlabel='Number of clusters (k)', ylabel='PCA weighted mean of the clusters'>
Using Mini Batch K-Means, which can be significantly faster than K-Means on large datasets:
cgram = Clustergram(range(1, 8), method='minibatchkmeans', batch_size=100)
cgram.fit(data)
cgram.plot()

K=1 skipped. Mean computed from data directly.
K=2 fitted in 0.040726423263549805 seconds.
K=3 fitted in 0.017051219940185547 seconds.
K=4 fitted in 0.016927480697631836 seconds.
K=5 fitted in 0.019779205322265625 seconds.
K=6 fitted in 0.025287628173828125 seconds.
K=7 fitted in 0.02601790428161621 seconds.
Using Gaussian Mixture Model:
cgram = Clustergram(range(1, 8), method='gmm')
cgram.fit(data)
cgram.plot()

K=1 fitted in 0.015541791915893555 seconds.
K=2 fitted in 0.009796380996704102 seconds.
K=3 fitted in 0.03337550163269043 seconds.
K=4 fitted in 0.02602982521057129 seconds.
K=5 fitted in 0.033798933029174805 seconds.
K=6 fitted in 0.12314462661743164 seconds.
K=7 fitted in 0.07405495643615723 seconds.
Using Ward’s hierarchical clustering:
cgram = Clustergram(range(1, 8), method='hierarchical', linkage='ward')
cgram.fit(data)
cgram.plot()
Alternatively, you can create a clustergram from the results of other clustering algorithms using the from_data or from_centers methods.
Using Clustergram.from_data, which computes cluster centers as mean or median values:
import numpy
import pandas

data = numpy.array([[-1, -1, 0, 10], [1, 1, 10, 2], [0, 0, 20, 4]])
labels = pandas.DataFrame({1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]})

cgram = Clustergram.from_data(data, labels)
cgram.plot()
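To make "cluster centers as mean or median values" concrete, here is a rough numpy sketch of how centers could be derived from data and labels. It is an illustration under stated assumptions, not clustergram's actual implementation: `centers_from_labels` and its `reducer` argument are hypothetical, and the labels are held in a plain dict keyed by the number of clusters instead of a DataFrame.

```python
import numpy as np

data = np.array([[-1, -1, 0, 10], [1, 1, 10, 2], [0, 0, 20, 4]], dtype=float)
# One label vector per number of clusters k (same values as the example above).
labels = {1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]}


def centers_from_labels(data, labels, reducer=np.mean):
    """Reduce the rows of each cluster to one center (mean by default)."""
    centers = {}
    for k, assignment in labels.items():
        assignment = np.asarray(assignment)
        # One row per cluster label, in sorted label order.
        centers[k] = np.vstack(
            [reducer(data[assignment == c], axis=0) for c in np.unique(assignment)]
        )
    return centers


mean_centers = centers_from_labels(data, labels)
median_centers = centers_from_labels(data, labels, reducer=np.median)
print(mean_centers[2])
```

For k=2 the first two observations share a cluster, so their center is the column-wise average of those two rows, while the third observation is its own center.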
Using Clustergram.from_centers based on explicit cluster centers:
labels = pandas.DataFrame({1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]})
centers = {
    1: numpy.array([[0, 0, 0]]),
    2: numpy.array([[-1, -1, -1], [1, 1, 1]]),
    3: numpy.array([[-1, -1, -1], [1, 1, 1], [0, 0, 0]]),
}
cgram = Clustergram.from_centers(centers, labels)
cgram.plot(pca_weighted=False)
<AxesSubplot:xlabel='Number of clusters (k)', ylabel='Mean of the clusters'>
To support PCA-weighted plots with a clustergram created from centers, you also need to pass data:
labels = pandas.DataFrame({1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]})
centers = {
    1: numpy.array([[0, 0]]),
    2: numpy.array([[-1, -1], [1, 1]]),
    3: numpy.array([[-1, -1], [1, 1], [0, 0]]),
}
data = numpy.array([[-1, -1], [1, 1], [0, 0]])
cgram = Clustergram.from_centers(centers, labels, data=data)
cgram.plot(pca_weighted=True)
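One way to see why data is needed: the y-axis label "PCA weighted mean of the clusters" suggests that each center is reduced to a single value using the first principal component, and that component can only be estimated from the observations. The numpy sketch below illustrates the idea under the assumption that the weighting is a dot product of each center with the first component. `first_component` and `pca_weighted` are hypothetical helpers, and clustergram's internal computation may differ.

```python
import numpy as np

data = np.array([[-1, -1], [1, 1], [0, 0]], dtype=float)
centers = {
    1: np.array([[0, 0]], dtype=float),
    2: np.array([[-1, -1], [1, 1]], dtype=float),
    3: np.array([[-1, -1], [1, 1], [0, 0]], dtype=float),
}


def first_component(data):
    """First principal axis of the data, computed via SVD of the centered data."""
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    component = vt[0]
    # The sign of a principal axis is arbitrary; fix it for reproducibility.
    if component[np.argmax(np.abs(component))] < 0:
        component = -component
    return component


def pca_weighted(centers, data):
    """Project every cluster center onto the first principal axis."""
    component = first_component(data)
    return {k: c @ component for k, c in centers.items()}


weighted = pca_weighted(centers, data)
print(weighted[3])
```

Each center collapses to one number per cluster, which is what allows a multi-dimensional clustering to be drawn on a single vertical axis.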