Clustergram
This notebook provides an overview of the built-in clustering performance evaluation, ways of accessing the individual labels resulting from clustering, and saving the object to disk.
Clustergram includes handy wrappers around a selection of clustering performance metrics offered by scikit-learn. Data that were originally computed on a GPU are converted to NumPy arrays on the fly.
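The wrapped metrics can also be called directly from scikit-learn, which helps when reading the results below. The following is a minimal sketch, not Clustergram's internal code: it assumes K-Means clustering, uses synthetic `make_blobs` data as a stand-in for a real dataset, and the names `X`, `rows` and `scores` are illustrative only.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic stand-in data (an assumption, not the penguins dataset).
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

rows = {}
# All three metrics need at least two clusters, so start at k=2.
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    rows[k] = {
        "silhouette_score": silhouette_score(X, labels),
        "calinski_harabasz_score": calinski_harabasz_score(X, labels),
        "davies_bouldin_score": davies_bouldin_score(X, labels),
    }

# One row per k, one column per metric.
scores = pd.DataFrame.from_dict(rows, orient="index")
print(scores)
```

Because these metrics are undefined for a single cluster, a score indexed by the number of clusters starts at 2 even when the tested range starts at 1.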
Let’s load the data and fit a clustergram on the Palmer penguins dataset. See the Introduction for its overview.
import seaborn
from sklearn.preprocessing import scale
from clustergram import Clustergram

seaborn.set(style='whitegrid')

df = seaborn.load_dataset('penguins')
data = scale(df.drop(columns=['species', 'island', 'sex']).dropna())

cgram = Clustergram(range(1, 12), verbose=False)
cgram.fit(data)
Compute the mean Silhouette Coefficient of all samples. See the scikit-learn documentation for details.
cgram.silhouette_score()
2     0.531540
3     0.447219
4     0.399584
5     0.378892
6     0.368646
7     0.331506
8     0.290318
9     0.280132
10    0.282766
11    0.276113
Name: silhouette_score, dtype: float64
Once computed, the resulting Series is available as cgram.silhouette. Calling the original method again recomputes the score.
cgram.silhouette
cgram.silhouette.plot()
Compute the Calinski and Harabasz score, also known as the Variance Ratio Criterion. See the scikit-learn documentation for details.
cgram.calinski_harabasz_score()
2     482.191469
3     441.677075
4     400.410025
5     411.242581
6     382.213137
7     352.308247
8     332.865976
9     311.233838
10    300.138241
11    287.334763
Name: calinski_harabasz_score, dtype: float64
Once computed, the resulting Series is available as cgram.calinski_harabasz. Calling the original method again recomputes the score.
cgram.calinski_harabasz
cgram.calinski_harabasz.plot()
Compute the Davies-Bouldin score. See the scikit-learn documentation for details.
cgram.davies_bouldin_score()
2     0.714064
3     0.943553
4     0.944215
5     0.971103
6     0.996196
7     1.068749
8     1.151966
9     1.177526
10    1.223752
11    1.230724
Name: davies_bouldin_score, dtype: float64
Once computed, the resulting Series is available as cgram.davies_bouldin. Calling the original method again recomputes the score.
cgram.davies_bouldin
cgram.davies_bouldin.plot()
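The three scores point in different directions: higher is better for the Silhouette Coefficient and the Calinski-Harabasz score, while lower is better for the Davies-Bouldin score. As a small sketch of how the Series printed above could be read, the snippet below rebuilds them by hand from the printed values (the variable names are illustrative) and picks the preferred number of clusters for each metric.

```python
import pandas as pd

k = range(2, 12)

# Values copied from the outputs printed above.
silhouette = pd.Series(
    [0.531540, 0.447219, 0.399584, 0.378892, 0.368646,
     0.331506, 0.290318, 0.280132, 0.282766, 0.276113],
    index=k, name="silhouette_score",
)
calinski_harabasz = pd.Series(
    [482.191469, 441.677075, 400.410025, 411.242581, 382.213137,
     352.308247, 332.865976, 311.233838, 300.138241, 287.334763],
    index=k, name="calinski_harabasz_score",
)
davies_bouldin = pd.Series(
    [0.714064, 0.943553, 0.944215, 0.971103, 0.996196,
     1.068749, 1.151966, 1.177526, 1.223752, 1.230724],
    index=k, name="davies_bouldin_score",
)

best_k = {
    silhouette.name: int(silhouette.idxmax()),                # higher is better
    calinski_harabasz.name: int(calinski_harabasz.idxmax()),  # higher is better
    davies_bouldin.name: int(davies_bouldin.idxmin()),        # lower is better
}
print(best_k)
```

With these values, all three criteria select two clusters. The metrics are heuristics, so the result is best read alongside the clustergram plot rather than as a final answer.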
Clustergram stores the resulting labels for each of the tested options, which can be accessed as:
cgram.labels
342 rows × 11 columns
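The shape reported above suggests one row per observation and one column per tested number of clusters. The sketch below works on a hypothetical stand-in DataFrame filled with random labels, not on the real cgram.labels. It assumes that columns are keyed by the number of clusters, which should be checked against the actual object.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
k_values = range(1, 12)

# Hypothetical stand-in for cgram.labels: 342 rows, one column per k,
# with random cluster labels in [0, k). Real labels come from the fit.
labels = pd.DataFrame({k: rng.integers(0, k, size=342) for k in k_values})

# Select the labels of the three-cluster solution (assumed column key: 3).
labels_k3 = labels[3]

# Count how many observations fall into each cluster.
sizes = labels_k3.value_counts().sort_index()
print(sizes)
```

A column selected this way can be attached to the original observations for further analysis, as long as the rows dropped before fitting (here, those with missing values) are dropped from the original table too.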
If you want to save your computed clustergram.Clustergram object to disk, you can use the pickle library:
import pickle

with open('clustergram.pickle','wb') as f:
    pickle.dump(cgram, f)

with open('clustergram.pickle','rb') as f:
    loaded = pickle.load(f)
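It can be worth confirming that a round trip through pickle returns an equivalent object. The sketch below shows the pattern with a plain dictionary as a hypothetical stand-in for the fitted object and a temporary directory instead of the working directory. The same steps should apply to any picklable object.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for a fitted object; not a real Clustergram.
state = {
    "k_range": list(range(1, 12)),
    "labels": {2: [0, 1, 1, 0], 3: [0, 2, 1, 0]},
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "clustergram.pickle"

    # Write the object to disk in binary mode.
    with open(path, "wb") as f:
        pickle.dump(state, f)

    # Read it back into a new object.
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored object is an equal but distinct copy of the original.
print(restored == state, restored is state)
```

Loading a pickle file can execute arbitrary code, so only unpickle files that come from a trusted source.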