Clustergram
Clustergram offers three backends for the computation: scikit-learn and scipy, which use the CPU, and RAPIDS.AI cuML, which uses the GPU. All of them are optional dependencies, but you need at least one to generate a clustergram.
Let’s load the Palmer penguins dataset. See the Introduction for an overview.
import seaborn
from sklearn.preprocessing import scale
from clustergram import Clustergram

df = seaborn.load_dataset('penguins')
data = scale(df.drop(columns=['species', 'island', 'sex']).dropna())
seaborn.set(style='whitegrid')
Using scikit-learn (default):
cgram = Clustergram(range(1, 8), backend="sklearn")
cgram.fit(data)
K=1 skipped. Mean computed from data directly.
K=2 fitted in 0.4000694751739502 seconds.
K=3 fitted in 1.2731716632843018 seconds.
K=4 fitted in 0.7346155643463135 seconds.
K=5 fitted in 1.2527844905853271 seconds.
K=6 fitted in 0.8201541900634766 seconds.
K=7 fitted in 3.0236527919769287 seconds.
Using cuML:
cgram = Clustergram(range(1, 8), backend='cuML')
cgram.fit(data)
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
~/checkouts/readthedocs.org/user_builds/clustergram/conda/v0.5.1/lib/python3.9/site-packages/clustergram/clustergram.py in __init__(self, k_range, backend, method, verbose, **kwargs)
    161             try:
--> 162                 import cudf
    163             except (ImportError, ModuleNotFoundError):

ModuleNotFoundError: No module named 'cudf'

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
<ipython-input-3-1fbd92a31667> in <module>
----> 1 cgram = Clustergram(range(1, 8), backend='cuML')
      2 cgram.fit(data)

~/checkouts/readthedocs.org/user_builds/clustergram/conda/v0.5.1/lib/python3.9/site-packages/clustergram/clustergram.py in __init__(self, k_range, backend, method, verbose, **kwargs)
    162                 import cudf
    163             except (ImportError, ModuleNotFoundError):
--> 164                 raise ImportError(
    165                     "cuML, cuDF and cupy packages are required to use `cuML` backend."
    166                 )

ImportError: cuML, cuDF and cupy packages are required to use `cuML` backend.
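The ImportError above is raised when the GPU packages are missing. One way to avoid it is to check for them before choosing a backend. The following is a minimal sketch: pick_backend is a hypothetical helper, not part of clustergram, and it assumes the three packages named in the error message are importable as cuml, cudf and cupy.

```python
import importlib.util

# Packages the ImportError message lists as required for the cuML backend.
# The lowercase import names are an assumption of this sketch.
GPU_PACKAGES = ("cuml", "cudf", "cupy")


def pick_backend(find_spec=importlib.util.find_spec):
    """Return 'cuML' if every GPU package can be found, else 'sklearn'.

    Hypothetical helper for illustration; clustergram does not ship it.
    """
    if all(find_spec(name) is not None for name in GPU_PACKAGES):
        return "cuML"
    return "sklearn"


backend = pick_backend()
print(backend)
```

The result can then be passed as Clustergram(range(1, 8), backend=backend), so the same notebook runs on machines with and without a GPU stack.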
The data passed to fit can be of any type supported by the selected backend (including cudf.DataFrame or cupy.ndarray with the cuML backend).
Clustergram currently supports K-Means, Mini Batch K-Means, Gaussian Mixture Model and SciPy’s hierarchical clustering methods. Note that GMM and Mini Batch K-Means are supported only by the scikit-learn backend, and hierarchical methods are supported only by the scipy backend.
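The method and backend restrictions described above can be written down as a small lookup table. This is only a sketch for checking a combination before building a Clustergram: SUPPORTED and check_combination are hypothetical names, the "scipy" backend string is an assumption, and the K-Means row (scikit-learn and cuML) is inferred from the text rather than stated in it.

```python
# Method names follow the `method=` values used on this page.
# The "scipy" backend string and the K-Means row are assumptions.
SUPPORTED = {
    "kmeans": {"sklearn", "cuML"},
    "minibatchkmeans": {"sklearn"},
    "gmm": {"sklearn"},
    "hierarchical": {"scipy"},
}


def check_combination(method, backend):
    """Raise ValueError if the method is not listed for the backend."""
    if method not in SUPPORTED:
        raise ValueError(f"unknown method: {method!r}")
    if backend not in SUPPORTED[method]:
        allowed = ", ".join(sorted(SUPPORTED[method]))
        raise ValueError(f"{method!r} is only listed for: {allowed}")
    return True


print(check_combination("gmm", "sklearn"))
```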
Using K-Means (default):
cgram = Clustergram(range(1, 8), method='kmeans')
cgram.fit(data)
cgram.plot()
K=2 fitted in 0.5471713542938232 seconds.
K=3 fitted in 1.989260196685791 seconds.
K=4 fitted in 2.003063917160034 seconds.
K=5 fitted in 0.11206221580505371 seconds.
K=6 fitted in 0.791018009185791 seconds.
K=7 fitted in 2.239922285079956 seconds.
<AxesSubplot:xlabel='Number of clusters (k)', ylabel='PCA weighted mean of the clusters'>
Using Mini Batch K-Means, which can provide significant speedup over K-Means:
cgram = Clustergram(range(1, 8), method='minibatchkmeans', batch_size=100)
cgram.fit(data)
cgram.plot()
K=2 fitted in 0.20737481117248535 seconds.
K=3 fitted in 0.04286766052246094 seconds.
K=4 fitted in 0.19988632202148438 seconds.
K=5 fitted in 0.19666814804077148 seconds.
K=6 fitted in 0.2511608600616455 seconds.
K=7 fitted in 0.2183680534362793 seconds.
Using Gaussian Mixture Model:
cgram = Clustergram(range(1, 8), method='gmm')
cgram.fit(data)
cgram.plot()
K=1 fitted in 0.06161618232727051 seconds.
K=2 fitted in 0.10887432098388672 seconds.
K=3 fitted in 0.20110058784484863 seconds.
K=4 fitted in 0.22536683082580566 seconds.
K=5 fitted in 0.3320627212524414 seconds.
K=6 fitted in 0.06666207313537598 seconds.
K=7 fitted in 0.09185957908630371 seconds.
Using Ward’s hierarchical clustering:
cgram = Clustergram(range(1, 8), method='hierarchical', linkage='ward')
cgram.fit(data)
cgram.plot()
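To see what a hierarchical method offers for a range of k, the sketch below builds one Ward linkage tree with SciPy and cuts it at each k. It uses synthetic blobs rather than the penguins data, and whether clustergram reuses a single tree in this way internally is an assumption, not something this page states.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# Three well-separated synthetic blobs of 20 points each (illustrative data).
points = np.vstack(
    [rng.normal(loc=center, scale=0.1, size=(20, 2)) for center in (0.0, 5.0, 10.0)]
)

# Build the Ward linkage tree once.
tree = linkage(points, method="ward")

# Cut the same tree into at most k flat clusters for every k in the range.
labels = {k: fcluster(tree, t=k, criterion="maxclust") for k in range(1, 8)}

for k, assigned in labels.items():
    print(k, len(np.unique(assigned)))
```

Because the tree is computed only once, adding more values of k costs little beyond the cut itself.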
Alternatively, you can create a clustergram from the results of other clustering algorithms using the from_data or from_centers methods.
Using Clustergram.from_data, which creates cluster centers as mean or median values:
import numpy
import pandas

data = numpy.array([[-1, -1, 0, 10], [1, 1, 10, 2], [0, 0, 20, 4]])
labels = pandas.DataFrame({1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]})

cgram = Clustergram.from_data(data, labels)
cgram.plot()
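To make "cluster centers as mean or median values" concrete, the sketch below derives the centers by hand for the same data and labels as in the from_data example. centers_from_labels is a hypothetical helper written for illustration. It is not clustergram's implementation, and a plain dict stands in for the labels DataFrame.

```python
import numpy as np

data = np.array([[-1, -1, 0, 10], [1, 1, 10, 2], [0, 0, 20, 4]], dtype=float)
# Column k holds the cluster label of each observation for that k.
labels = {1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]}


def centers_from_labels(data, labels, reduce=np.mean):
    """Return {k: array of centers}, one row per cluster label.

    Hypothetical helper; pass reduce=np.median for median centers.
    """
    centers = {}
    for k, assigned in labels.items():
        assigned = np.asarray(assigned)
        centers[k] = np.array(
            [reduce(data[assigned == c], axis=0) for c in np.unique(assigned)]
        )
    return centers


centers = centers_from_labels(data, labels)
print(centers[2])
```

For k=2 the first two observations share label 0, so their center is the row-wise mean of those two rows, while the third observation forms its own cluster.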
Using Clustergram.from_centers, based on explicit cluster centers:
labels = pandas.DataFrame({1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]})
centers = {
    1: numpy.array([[0, 0, 0]]),
    2: numpy.array([[-1, -1, -1], [1, 1, 1]]),
    3: numpy.array([[-1, -1, -1], [1, 1, 1], [0, 0, 0]]),
}

cgram = Clustergram.from_centers(centers, labels)
cgram.plot(pca_weighted=False)
<AxesSubplot:xlabel='Number of clusters (k)', ylabel='Mean of the clusters'>
To support PCA-weighted plots with a clustergram created from centers, you also need to pass data:
labels = pandas.DataFrame({1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]})
centers = {
    1: numpy.array([[0, 0]]),
    2: numpy.array([[-1, -1], [1, 1]]),
    3: numpy.array([[-1, -1], [1, 1], [0, 0]]),
}
data = numpy.array([[-1, -1], [1, 1], [0, 0]])

cgram = Clustergram.from_centers(centers, labels, data=data)
cgram.plot(pca_weighted=True)
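The reason data is needed can be illustrated with a small NumPy sketch: a principal component has to be estimated from the observations before the centers can be projected onto it. The code assumes that "PCA weighted" means projecting each center onto the first principal component of the data. That is a plausible reading of the axis label, not a statement of clustergram's exact implementation.

```python
import numpy as np

data = np.array([[-1, -1], [1, 1], [0, 0]], dtype=float)
centers = {
    1: np.array([[0, 0]], dtype=float),
    2: np.array([[-1, -1], [1, 1]], dtype=float),
    3: np.array([[-1, -1], [1, 1], [0, 0]], dtype=float),
}

# First principal component: the leading right singular vector
# of the mean-centered data. Its sign is arbitrary.
centered = data - data.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
first_component = vt[0]

# Project each set of centers onto that direction (one value per cluster).
weighted = {k: c @ first_component for k, c in centers.items()}

for k, values in weighted.items():
    print(k, np.round(values, 3))
```

Without data there is nothing to estimate the component from, which is why the earlier from_centers example plots with pca_weighted=False.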