Clustergram
Clustergram offers three backends for the computation: scikit-learn and scipy, which use the CPU, and RAPIDS.AI cuML, which uses the GPU. All of them are optional dependencies, but you need at least one of them to generate a clustergram.
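Because every backend is optional, it can help to check which ones are importable before constructing a Clustergram. The following is a minimal sketch, not part of the clustergram API: the helper names `is_importable` and `available_backends` are hypothetical, the cuML requirements follow the ImportError message that Clustergram raises when that backend is missing, and the module names for the two CPU backends are assumptions based on the package names.

```python
from importlib.util import find_spec

# Backend name -> top-level modules that must be importable for it.
# The cuML entry mirrors the packages named in Clustergram's ImportError;
# the sklearn and scipy entries are assumptions based on the package names.
BACKEND_REQUIREMENTS = {
    "sklearn": ("sklearn",),
    "scipy": ("scipy",),
    "cuML": ("cuml", "cudf", "cupy"),
}


def is_importable(module_name):
    """Return True if the top-level module can be located without importing it."""
    try:
        return find_spec(module_name) is not None
    except (ImportError, ValueError):
        return False


def available_backends(requirements=BACKEND_REQUIREMENTS):
    """Return the names of backends whose required modules are all installed."""
    return [
        name
        for name, modules in requirements.items()
        if all(is_importable(module) for module in modules)
    ]


print(available_backends())
```

The printed list depends on the environment. An empty list would mean that none of the three backends can be used yet.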
Let’s load the Palmer penguins dataset. See the Introduction for its overview.
import seaborn
from sklearn.preprocessing import scale
from clustergram import Clustergram

# Drop the categorical columns and incomplete rows, then standardize the rest.
df = seaborn.load_dataset('penguins')
data = scale(df.drop(columns=['species', 'island', 'sex']).dropna())

seaborn.set(style='whitegrid')
Using scikit-learn (default):
cgram = Clustergram(range(1, 8), backend="sklearn")
cgram.fit(data)
K=1 skipped. Mean computed from data directly.
K=2 fitted in 1.356032133102417 seconds.
K=3 fitted in 1.8639934062957764 seconds.
K=4 fitted in 1.917318344116211 seconds.
K=5 fitted in 2.319411039352417 seconds.
K=6 fitted in 2.563551425933838 seconds.
K=7 fitted in 2.1877193450927734 seconds.
Using cuML:
cgram = Clustergram(range(1, 8), backend='cuML')
cgram.fit(data)
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
~/checkouts/readthedocs.org/user_builds/clustergram/conda/v0.6.0/lib/python3.9/site-packages/clustergram/clustergram.py in __init__(self, k_range, backend, method, verbose, **kwargs)
    166             try:
--> 167                 import cudf
    168             except (ImportError, ModuleNotFoundError):

ModuleNotFoundError: No module named 'cudf'

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
/tmp/ipykernel_2114/3660680281.py in <module>
----> 1 cgram = Clustergram(range(1, 8), backend='cuML')
      2 cgram.fit(data)

~/checkouts/readthedocs.org/user_builds/clustergram/conda/v0.6.0/lib/python3.9/site-packages/clustergram/clustergram.py in __init__(self, k_range, backend, method, verbose, **kwargs)
    167                 import cudf
    168             except (ImportError, ModuleNotFoundError):
--> 169                 raise ImportError(
    170                     "cuML, cuDF and cupy packages are required to use `cuML` backend."
    171                 )

ImportError: cuML, cuDF and cupy packages are required to use `cuML` backend.
data can be of any type supported by the selected backend (including cudf.DataFrame or cupy.ndarray with the cuML backend).
Clustergram currently supports K-Means, Mini Batch K-Means, Gaussian Mixture Model and SciPy’s hierarchical clustering methods. Note that GMM and Mini Batch K-Means are supported only by the scikit-learn backend, and hierarchical methods are supported only by the scipy backend.
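The restrictions above can be written down as a small lookup table, which makes it easy to validate a method and backend pair before fitting. This is an illustrative sketch only: `SUPPORTED_BACKENDS` and `check_combination` are hypothetical names, not part of clustergram, and listing K-Means under both sklearn and cuML is an assumption inferred from the text rather than something stated explicitly.

```python
# Method -> backends able to run it, following the description above.
# K-Means on both sklearn and cuML is an assumption inferred from that text.
SUPPORTED_BACKENDS = {
    "kmeans": ("sklearn", "cuML"),
    "minibatchkmeans": ("sklearn",),
    "gmm": ("sklearn",),
    "hierarchical": ("scipy",),
}


def check_combination(method, backend):
    """Return True for a supported pair, raise ValueError otherwise."""
    if method not in SUPPORTED_BACKENDS:
        raise ValueError(f"Unknown method: {method!r}")
    allowed = SUPPORTED_BACKENDS[method]
    if backend not in allowed:
        raise ValueError(
            f"Method {method!r} requires one of {allowed}, got {backend!r}"
        )
    return True


print(check_combination("gmm", "sklearn"))
```

Calling `check_combination("hierarchical", "sklearn")` raises a ValueError in this sketch, since hierarchical methods are only available with the scipy backend.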
Using K-Means (default):
cgram = Clustergram(range(1, 8), method='kmeans')
cgram.fit(data)
cgram.plot()
K=2 fitted in 0.8113729953765869 seconds.
K=3 fitted in 1.4555156230926514 seconds.
K=4 fitted in 1.8362269401550293 seconds.
K=5 fitted in 1.9503719806671143 seconds.
K=6 fitted in 2.2527010440826416 seconds.
K=7 fitted in 2.2524571418762207 seconds.
<AxesSubplot:xlabel='Number of clusters (k)', ylabel='PCA weighted mean of the clusters'>
Using Mini Batch K-Means, which can provide significant speedup over K-Means:
cgram = Clustergram(range(1, 8), method='minibatchkmeans', batch_size=100)
cgram.fit(data)
cgram.plot()
K=1 skipped. Mean computed from data directly.
K=2 fitted in 0.19777917861938477 seconds.
K=3 fitted in 0.18311476707458496 seconds.
K=4 fitted in 0.015962839126586914 seconds.
K=5 fitted in 0.16828298568725586 seconds.
K=6 fitted in 1.2683145999908447 seconds.
K=7 fitted in 0.6232771873474121 seconds.
Using Gaussian Mixture Model:
cgram = Clustergram(range(1, 8), method='gmm')
cgram.fit(data)
cgram.plot()
K=1 fitted in 0.07065105438232422 seconds.
K=2 fitted in 0.17306876182556152 seconds.
K=3 fitted in 0.1903231143951416 seconds.
K=4 fitted in 0.3606398105621338 seconds.
K=5 fitted in 0.22977614402770996 seconds.
K=6 fitted in 0.24111223220825195 seconds.
K=7 fitted in 0.06499433517456055 seconds.
Using Ward’s hierarchical clustering:
cgram = Clustergram(range(1, 8), method='hierarchical', linkage='ward')
cgram.fit(data)
cgram.plot()
Alternatively, you can create a clustergram with the from_data or from_centers methods, using the results of other clustering algorithms.
Using Clustergram.from_data, which creates cluster centers as mean or median values:
import numpy
import pandas

data = numpy.array([[-1, -1, 0, 10], [1, 1, 10, 2], [0, 0, 20, 4]])
labels = pandas.DataFrame({1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]})

cgram = Clustergram.from_data(data, labels)
cgram.plot()
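To make "cluster centers as mean or median values" concrete, the sketch below derives centers from the same data and labels by grouping rows per label and aggregating each feature. It uses only the standard library and illustrates the idea; it is not Clustergram's implementation, and `centers_from_labels` and its `method` argument are hypothetical names.

```python
from statistics import mean, median

# Same observations and label assignments as in the example above.
data = [[-1, -1, 0, 10], [1, 1, 10, 2], [0, 0, 20, 4]]
labels = {1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]}


def centers_from_labels(data, labels, method=mean):
    """Return {k: {cluster_label: center}} using a per-feature aggregate."""
    centers = {}
    for k, assignment in labels.items():
        # Collect the rows that belong to each cluster label.
        groups = {}
        for row, label in zip(data, assignment):
            groups.setdefault(label, []).append(row)
        # Aggregate column by column within each cluster.
        centers[k] = {
            label: [method(column) for column in zip(*rows)]
            for label, rows in sorted(groups.items())
        }
    return centers


mean_centers = centers_from_labels(data, labels)
median_centers = centers_from_labels(data, labels, method=median)
print(mean_centers[2])
```

For k=2, the first two observations share label 0, so their center is the feature-wise average of those two rows, while the third observation forms its own cluster and is its own center.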
Using Clustergram.from_centers, based on explicit cluster centers:
labels = pandas.DataFrame({1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]})
centers = {
    1: numpy.array([[0, 0, 0]]),
    2: numpy.array([[-1, -1, -1], [1, 1, 1]]),
    3: numpy.array([[-1, -1, -1], [1, 1, 1], [0, 0, 0]]),
}

cgram = Clustergram.from_centers(centers, labels)
cgram.plot(pca_weighted=False)
<AxesSubplot:xlabel='Number of clusters (k)', ylabel='Mean of the clusters'>
To support PCA-weighted plots with a clustergram created from centers, you also need to pass the data:
labels = pandas.DataFrame({1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]})
centers = {
    1: numpy.array([[0, 0]]),
    2: numpy.array([[-1, -1], [1, 1]]),
    3: numpy.array([[-1, -1], [1, 1], [0, 0]]),
}
data = numpy.array([[-1, -1], [1, 1], [0, 0]])

cgram = Clustergram.from_centers(centers, labels, data=data)
cgram.plot(pca_weighted=True)
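One way to see why the data is needed: a PCA-weighted value is obtained by weighting each cluster center with the loadings of the first principal component, and that component can only be estimated from the observations themselves. The sketch below shows this idea with plain NumPy on the same toy arrays. It is an assumption about what "PCA weighted" means here, not a description of Clustergram's internals, and the variable names are illustrative.

```python
import numpy as np

# Same toy observations and centers as in the example above.
data = np.array([[-1.0, -1.0], [1.0, 1.0], [0.0, 0.0]])
centers = {
    1: np.array([[0.0, 0.0]]),
    2: np.array([[-1.0, -1.0], [1.0, 1.0]]),
    3: np.array([[-1.0, -1.0], [1.0, 1.0], [0.0, 0.0]]),
}

# The first principal component is the leading right singular vector
# of the mean-centered data. Its sign is arbitrary.
centered = data - data.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
first_component = vt[0]

# Weight every center by the component loadings (a dot product),
# giving one scalar per cluster for each k.
weighted = {k: c @ first_component for k, c in centers.items()}

for k, values in weighted.items():
    print(k, values)
```

Without the data there is no principal component to project onto, so only the unweighted mean of each center can be plotted.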