Clustergram
This notebook provides an overview of the built-in clustering performance evaluation, ways of accessing the individual labels resulting from clustering, and saving the object to disk.
Clustergram includes handy wrappers around a selection of clustering performance metrics offered by scikit-learn. Data that were originally computed on a GPU are converted to NumPy arrays on the fly.
Let’s load the data and fit a clustergram on the Palmer penguins dataset. See the Introduction for an overview of the dataset.
import seaborn
from sklearn.preprocessing import scale
from clustergram import Clustergram

seaborn.set(style='whitegrid')

df = seaborn.load_dataset('penguins')
data = scale(df.drop(columns=['species', 'island', 'sex']).dropna())

cgram = Clustergram(range(1, 12), verbose=False)
cgram.fit(data)
Compute the mean Silhouette Coefficient of all samples. See scikit-learn documentation for details.
cgram.silhouette_score()
2     0.531540
3     0.447219
4     0.399584
5     0.377720
6     0.368665
7     0.335069
8     0.286170
9     0.285263
10    0.279539
11    0.273914
Name: silhouette_score, dtype: float64
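To make the shape of this result concrete, the following is a minimal sketch of a per-k silhouette computation done directly with scikit-learn. It is not Clustergram's internal implementation: it assumes k-means clustering and uses synthetic data from `make_blobs` instead of the penguins dataset. The silhouette coefficient is undefined for a single cluster, which is why the index of the Series above starts at 2 even though the fitted range starts at 1.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (an assumption, not the penguins dataset).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

scores = {}
# The silhouette coefficient needs at least two clusters, so start at k=2.
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# One mean silhouette value per tested number of clusters.
silhouette = pd.Series(scores, name="silhouette_score")

# Higher is better, so the candidate k is the index of the maximum.
best_k = silhouette.idxmax()
print(silhouette)
print("best k:", best_k)
```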
Once computed, the resulting Series is available as cgram.silhouette. Calling the original method again recomputes the score.
cgram.silhouette
cgram.silhouette.plot()
<AxesSubplot:>
Compute the Calinski and Harabasz score, also known as the Variance Ratio Criterion. See scikit-learn documentation for details.
cgram.calinski_harabasz_score()
2     482.191469
3     441.677075
4     400.410025
5     411.175066
6     382.297175
7     352.713169
8     331.796377
9     315.551827
10    298.178981
11    286.976897
Name: calinski_harabasz_score, dtype: float64
Once computed, the resulting Series is available as cgram.calinski_harabasz. Calling the original method again recomputes the score.
cgram.calinski_harabasz
cgram.calinski_harabasz.plot()
Compute the Davies-Bouldin score. See scikit-learn documentation for details.
cgram.davies_bouldin_score()
2     0.714064
3     0.943553
4     0.944215
5     0.973248
6     0.976604
7     1.036676
8     1.175931
9     1.240331
10    1.201865
11    1.239891
Name: davies_bouldin_score, dtype: float64
Once computed, the resulting Series is available as cgram.davies_bouldin. Calling the original method again recomputes the score.
cgram.davies_bouldin
cgram.davies_bouldin.plot()
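The three metrics point in different directions: higher is better for the silhouette and Calinski-Harabasz scores, while lower is better for Davies-Bouldin. The sketch below shows one way to compare them side by side. It rebuilds the three Series from the values printed above rather than from a fitted object, and the names `scores` and `best` are illustrative only.

```python
import pandas as pd

# Values copied from the outputs shown above, indexed by number of clusters.
scores = pd.DataFrame(
    {
        "silhouette": [0.531540, 0.447219, 0.399584, 0.377720, 0.368665,
                       0.335069, 0.286170, 0.285263, 0.279539, 0.273914],
        "calinski_harabasz": [482.191469, 441.677075, 400.410025, 411.175066,
                              382.297175, 352.713169, 331.796377, 315.551827,
                              298.178981, 286.976897],
        "davies_bouldin": [0.714064, 0.943553, 0.944215, 0.973248, 0.976604,
                           1.036676, 1.175931, 1.240331, 1.201865, 1.239891],
    },
    index=pd.Index(range(2, 12), name="k"),
)

# Pick the preferred k per metric, respecting each metric's direction.
best = {
    "silhouette": scores["silhouette"].idxmax(),                # higher is better
    "calinski_harabasz": scores["calinski_harabasz"].idxmax(),  # higher is better
    "davies_bouldin": scores["davies_bouldin"].idxmin(),        # lower is better
}
print(best)
```

For this dataset all three metrics prefer two clusters. With a fitted object, the same table could presumably be assembled by concatenating cgram.silhouette, cgram.calinski_harabasz and cgram.davies_bouldin with `pd.concat(..., axis=1)`.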
Clustergram stores the resulting labels for each of the tested options in a DataFrame, which can be accessed as:
cgram.labels
342 rows × 11 columns
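A table with one row per observation and one column per tested option lends itself to ordinary pandas operations. The following sketch uses a stand-in DataFrame built with scikit-learn's k-means on synthetic data. The assumption that columns are keyed by the integer number of clusters is made for illustration and should be checked against the actual cgram.labels.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (an assumption, not the penguins dataset).
X, _ = make_blobs(n_samples=120, centers=3, random_state=0)

# Stand-in for cgram.labels: one row per observation, one column per tested k.
# Integer column keys are an assumption made for this sketch.
labels = pd.DataFrame(
    {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
     for k in range(1, 5)}
)

# Cluster sizes for the 3-cluster solution.
sizes = labels[3].value_counts().sort_index()

# How observations are redistributed between the 2- and 3-cluster solutions.
transitions = pd.crosstab(labels[2], labels[3])

print(sizes)
print(transitions)
```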
If you want to save your computed clustergram.Clustergram object to disk, you can use the pickle library:
import pickle

with open('clustergram.pickle', 'wb') as f:
    pickle.dump(cgram, f)

with open('clustergram.pickle', 'rb') as f:
    loaded = pickle.load(f)
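After loading, it can be worth checking that the object survived the round trip. The sketch below shows the pattern with a plain dictionary as a stand-in for the fitted object and a temporary directory so that no file is left behind. The names `scores` and `round_trip_ok` are illustrative only.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a fitted Clustergram: any picklable object follows the same pattern.
scores = {2: 0.531540, 3: 0.447219, 4: 0.399584}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "clustergram.pickle"

    # Serialize to disk.
    with open(path, "wb") as f:
        pickle.dump(scores, f)

    # Read it back.
    with open(path, "rb") as f:
        loaded = pickle.load(f)

# The restored object should compare equal to the original.
round_trip_ok = loaded == scores
print(round_trip_ok)
```

Note that unpickling can execute arbitrary code, so pickle files should only be loaded from sources you trust.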