Clustergram
Clustergram is a diagram proposed by Matthias Schonlau in his paper "The clustergram: A graph for visualizing hierarchical and nonhierarchical cluster analyses". In the author's words:
"In hierarchical cluster analysis, dendrograms are used to visualize how clusters are formed. I propose an alternative graph called a 'clustergram' to examine how cluster members are assigned to clusters as the number of clusters increases. This graph is useful in exploratory analysis for nonhierarchical clustering algorithms such as k-means and for hierarchical cluster algorithms when the number of observations is large enough to make dendrograms impractical."
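To make the idea of tracking cluster membership across k concrete, here is a minimal sketch that counts how many observations move from each cluster at one k to each cluster at the next. The label lists and the `transitions` helper are hypothetical and only illustrate the bookkeeping behind the lines of the diagram; this is not the code used by the package.

```python
from collections import Counter


def transitions(labels_k, labels_next):
    """Count observations moving from each cluster at k to each cluster at k + 1."""
    return Counter(zip(labels_k, labels_next))


# Hypothetical cluster labels for six observations at k = 2 and k = 3.
labels_2 = [0, 0, 0, 1, 1, 1]
labels_3 = [0, 0, 2, 1, 1, 2]

flows = transitions(labels_2, labels_3)
for (src, dst), count in sorted(flows.items()):
    print(f"cluster {src} (k=2) -> cluster {dst} (k=3): {count} observations")
```

Each counted pair corresponds to one line segment between two consecutive values of k, and the count is what the line width can encode.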
The clustergram was later implemented in R by Tal Galili, who also gives a thorough explanation of the concept.
This is a Python translation of Tal’s script written for scikit-learn and RAPIDS cuML implementations of K-Means and Gaussian Mixture Model (scikit-learn only) clustering.
You can install clustergram from conda or pip:
conda install clustergram -c conda-forge
pip install clustergram
Either way, you also need to install the backend you plan to use (scikit-learn or cuML).
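If you are unsure which backend is present in your environment, a quick check like the following sketch can help. The `available_backends` helper is hypothetical (it is not part of clustergram); it relies only on the standard library and on the import names `sklearn` and `cuml`.

```python
from importlib.util import find_spec


def available_backends():
    """Return the names of the clustering backends that can be imported."""
    # Backend name as used in this README -> top-level import name.
    candidates = {"sklearn": "sklearn", "cuML": "cuml"}
    return [
        name
        for name, module in candidates.items()
        if find_spec(module) is not None
    ]


print(available_backends())
```

An empty list means that neither backend is installed yet.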
An example of a clustergram on the Palmer penguins dataset:
    import seaborn

    df = seaborn.load_dataset('penguins')
First, select the numerical columns and scale them.
    from sklearn.preprocessing import scale

    data = scale(df.drop(columns=['species', 'island', 'sex']).dropna())
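For readers who want to see what the scaling step does, the following sketch standardises each column to zero mean and unit standard deviation using only the standard library. The `standardise` helper and the sample values are hypothetical; the sketch assumes the population standard deviation and is an illustration rather than a replacement for `sklearn.preprocessing.scale`.

```python
from statistics import fmean, pstdev


def standardise(rows):
    """Scale each column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # A constant column has zero deviation; fall back to 1.0 to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Hypothetical measurements on very different scales (for example mm and g).
rows = [[39.1, 3750.0], [46.5, 4500.0], [50.0, 5250.0]]
scaled = standardise(rows)
```

Without this step, the column with the largest numeric range would dominate the distance computations used by the clustering.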
And then we can simply pass the data to clustergram.
    from clustergram import Clustergram

    cgram = Clustergram(range(1, 8))
    cgram.fit(data)
    cgram.plot()
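Conceptually, fitting over `range(1, 8)` means running the clustering once for every k in the range and keeping the labels and cluster centres of each run. The sketch below illustrates that loop with a tiny pure-Python k-means. The `kmeans` function, its naive initialisation (the first k points) and the sample points are all assumptions made for illustration; the package itself delegates the clustering to scikit-learn or cuML.

```python
import math


def kmeans(points, k, iterations=20):
    """Tiny Lloyd's-algorithm k-means; the initial centres are the first k points."""
    centres = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centre.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centres[c]))
            for p in points
        ]
        # Update step: move each centre to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a cluster ends up empty
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centres


# Hypothetical two-dimensional data with two obvious groups.
points = [(0.0, 0.0), (10.0, 10.0), (0.5, 0.2), (10.2, 9.8), (0.1, 0.4), (9.9, 10.1)]

# One clustering run per k, stored under its k.
results = {k: kmeans(points, k) for k in range(1, 4)}
```

The stored labels and centres for each k are the raw material from which the diagram is drawn.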
Clustergram.plot() returns a matplotlib axis object, so the figure can be customised like any other matplotlib plot.
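    import matplotlib.pyplot as plt

    seaborn.set(style='whitegrid')

    # Create the axis that is passed to plot() below; the figure size is set here.
    fig, ax = plt.subplots(figsize=(12, 8))
    cgram.plot(
        ax=ax,
        size=0.5,
        linewidth=0.5,
        cluster_style={"color": "lightblue", "edgecolor": "black"},
        line_style={"color": "red", "linestyle": "-."},
    )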
On the y axis, a clustergram can use mean values, as in the original paper by Matthias Schonlau, or PCA-weighted mean values, as in the implementation by Tal Galili.
    cgram = Clustergram(range(1, 8), pca_weighted=True)
    cgram.fit(data)
    cgram.plot(figsize=(12, 8))
    cgram = Clustergram(range(1, 8), pca_weighted=False)
    cgram.fit(data)
    cgram.plot(figsize=(12, 8))
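The difference between the two y-axis options can be sketched as follows. This is an illustration under stated assumptions: the plain variant is taken to be the mean of a cluster centre's coordinates, and the PCA-weighted variant is taken to be the projection of the cluster centre onto the first principal component of the data. The helper names, the power-iteration approximation and the sample data are hypothetical and not taken from the package.

```python
import math


def first_component(rows, iterations=100):
    """Approximate the first principal component by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    centred = [[v - m for v, m in zip(row, means)] for row in rows]
    # Covariance matrix of the centred data (d x d).
    cov = [
        [sum(r[i] * r[j] for r in centred) / n for j in range(d)]
        for i in range(d)
    ]
    # Start vector; assumed not to be orthogonal to the first component.
    vec = [1.0] + [0.0] * (d - 1)
    for _ in range(iterations):
        vec = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return vec


def plain_mean(centre):
    """y value as the unweighted mean of the centre's coordinates."""
    return sum(centre) / len(centre)


def pca_weighted_mean(centre, component):
    """y value as the projection of the centre onto the first component."""
    return sum(c * w for c, w in zip(centre, component))


# Hypothetical scaled data and cluster centres for k = 2.
data = [(-1.0, -1.0), (-0.5, -0.5), (0.5, 0.5), (1.0, 1.0)]
centres = [(-0.75, -0.75), (0.75, 0.75)]

component = first_component(data)
plain = [plain_mean(c) for c in centres]
weighted = [pca_weighted_mean(c, component) for c in centres]
```

In this toy case both variants order the clusters the same way; on real data the weighted variant gives more influence to the variables that explain most of the variance.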
Clustergram offers two backends for the computation: scikit-learn, which uses the CPU, and RAPIDS.AI cuML, which uses the GPU. Both are optional dependencies, but you need at least one of them to generate a clustergram.
Using scikit-learn (default):
    cgram = Clustergram(range(1, 8), backend='sklearn')
    cgram.fit(data)
    cgram.plot()
Using cuML:
    cgram = Clustergram(range(1, 8), backend='cuML')
    cgram.fit(data)
    cgram.plot()
data can be any data type supported by the selected backend (including cudf.DataFrame with the cuML backend).
Clustergram currently supports the K-Means and Gaussian Mixture Model (GMM) clustering methods. Note that GMM is supported only by the scikit-learn backend.
Using K-Means (default):
    cgram = Clustergram(range(1, 8), method='kmeans')
    cgram.fit(data)
    cgram.plot()
Using Gaussian Mixture Model:
    cgram = Clustergram(range(1, 8), method='gmm')
    cgram.fit(data)
    cgram.plot()
Clustergram.plot() can also plot only a part of the diagram, if you want to focus on a limited range of k.
    cgram = Clustergram(range(1, 20))
    cgram.fit(data)
    cgram.plot(figsize=(12, 8))
    cgram.plot(k_range=range(3, 10), figsize=(12, 8))
You can save both the plot and the clustergram.Clustergram object to disk.
Clustergram.plot() returns a matplotlib axis object, so the figure can be saved like any other plot:
    import matplotlib.pyplot as plt

    cgram.plot()
    plt.savefig('clustergram.svg')
If you want to save the computed clustergram.Clustergram object to disk, you can use the pickle library:
    import pickle

    with open('clustergram.pickle', 'wb') as f:
        pickle.dump(cgram, f)
Then loading is equally simple:
    with open('clustergram.pickle', 'rb') as f:
        loaded = pickle.load(f)
Schonlau M. The clustergram: a graph for visualizing hierarchical and non-hierarchical cluster analyses. The Stata Journal, 2002; 2(4):391-402.
Schonlau M. Visualizing Hierarchical and Non-Hierarchical Cluster Analyses with Clustergrams. Computational Statistics, 2004; 19(1):95-111.
https://www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/