Additional methods

Note

You can try this notebook in your browser: Binder

This notebook provides an overview of the built-in clustering performance evaluation, ways of accessing the individual labels resulting from clustering, and saving the object to disk.

Clustering performance evaluation

Clustergram includes handy wrappers around a selection of clustering performance metrics offered by scikit-learn. Data originally computed on the GPU are converted to NumPy arrays on the fly.

Let’s load the data and fit clustergram on the Palmer penguins dataset. See the Introduction for an overview of the dataset.

import seaborn
from sklearn.preprocessing import scale
from clustergram import Clustergram

seaborn.set(style='whitegrid')

df = seaborn.load_dataset('penguins')
data = scale(df.drop(columns=['species', 'island', 'sex']).dropna())

cgram = Clustergram(range(1, 12), verbose=False)
cgram.fit(data)

Silhouette score

Compute the mean Silhouette Coefficient of all samples. See scikit-learn documentation for details.

cgram.silhouette_score()
2     0.531540
3     0.447219
4     0.400154
5     0.377720
6     0.372722
7     0.334723
8     0.300173
9     0.289008
10    0.283982
11    0.275596
Name: silhouette_score, dtype: float64

Once computed, the resulting Series is available as cgram.silhouette. Calling the original method again will recompute the score.

cgram.silhouette.plot()
<AxesSubplot:>
../_images/evaluation_5_1.png
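
For intuition about what such a score looks like across cluster counts, the following minimal sketch calls scikit-learn's silhouette_score directly in a loop. It is not clustergram's internal code: the synthetic blobs, the use of KMeans, and the range of k are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (assumption): three well-separated blobs,
# not the penguins dataset used in this notebook.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

scores = {}
for k in range(2, 7):  # the silhouette is undefined for a single cluster
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Collect the scores in a Series indexed by the number of clusters.
silhouette = pd.Series(scores, name="silhouette_score")
best_k = silhouette.idxmax()  # higher silhouette is better
print(silhouette)
```

Because a higher silhouette indicates better-defined clusters, idxmax() picks the preferred number of clusters; the same call should work on the Series stored in cgram.silhouette.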

Calinski and Harabasz score

Compute the Calinski and Harabasz score, also known as the Variance Ratio Criterion. See scikit-learn documentation for details.

cgram.calinski_harabasz_score()
2     482.191469
3     441.677075
4     400.410025
5     411.258719
6     382.291616
7     352.464103
8     334.070064
9     315.539143
10    300.957431
11    287.590520
Name: calinski_harabasz_score, dtype: float64

Once computed, the resulting Series is available as cgram.calinski_harabasz. Calling the original method again will recompute the score.

cgram.calinski_harabasz.plot()
<AxesSubplot:>
../_images/evaluation_9_1.png
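
The Variance Ratio Criterion is the ratio of between-cluster dispersion to within-cluster dispersion, each divided by its degrees of freedom. The sketch below computes it by hand with NumPy and compares the result with scikit-learn's calinski_harabasz_score. The toy data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(0)
# Two synthetic groups of points (assumption: toy data, not the penguins).
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels = np.repeat([0, 1], 50)

n, k = len(X), len(np.unique(labels))
overall_mean = X.mean(axis=0)

between = 0.0  # between-cluster dispersion
within = 0.0   # within-cluster dispersion
for c in np.unique(labels):
    members = X[labels == c]
    centroid = members.mean(axis=0)
    between += len(members) * np.sum((centroid - overall_mean) ** 2)
    within += np.sum((members - centroid) ** 2)

# Ratio of the two dispersions, each scaled by its degrees of freedom.
manual = (between / (k - 1)) / (within / (n - k))
reference = calinski_harabasz_score(X, labels)
print(manual, reference)
```

Higher values indicate denser, better-separated clusters.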

Davies-Bouldin score

Compute the Davies-Bouldin score. See scikit-learn documentation for details.

cgram.davies_bouldin_score()
2     0.714064
3     0.943553
4     0.944215
5     0.972408
6     0.948556
7     1.075790
8     1.138705
9     1.233587
10    1.202486
11    1.221233
Name: davies_bouldin_score, dtype: float64

Once computed, the resulting Series is available as cgram.davies_bouldin. Calling the original method again will recompute the score.

cgram.davies_bouldin.plot()
<AxesSubplot:>
../_images/evaluation_13_1.png
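
Unlike the two previous scores, a lower Davies-Bouldin value indicates better separation, so the preferred number of clusters is the index of the minimum. As a small sketch, the Series below is rebuilt by hand from the output printed above (a stand-in for cgram.davies_bouldin) and queried with pandas.

```python
import pandas as pd

# Values copied from the Davies-Bouldin output shown above;
# this Series is a stand-in for cgram.davies_bouldin.
davies_bouldin = pd.Series(
    [0.714064, 0.943553, 0.944215, 0.972408, 0.948556,
     1.075790, 1.138705, 1.233587, 1.202486, 1.221233],
    index=range(2, 12),
    name="davies_bouldin_score",
)

best_k = davies_bouldin.idxmin()  # lower is better for Davies-Bouldin
print(best_k)
```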

Accessing labels

Clustergram stores the resulting labels for each of the tested options; they can be accessed as:

cgram.labels
     1  2  3  4  5  6  7  8  9  10  11
0    0  1  1  1  3  4  4  6  7   5   9
1    0  1  1  1  3  4  4  0  7   1   3
2    0  1  1  1  3  4  4  0  7   1   3
3    0  1  1  1  3  4  4  6  0   5   9
4    0  1  1  1  2  5  0  6  0   3   6
... .. .. .. .. .. .. .. .. ..  ..  ..
337  0  0  0  3  4  3  6  3  1   4   5
338  0  0  0  3  4  3  6  3  1   4   5
339  0  0  0  0  1  1  3  7  5   7   7
340  0  0  0  3  4  3  1  1  8   0   1
341  0  0  0  0  1  1  1  1  8   0   1

342 rows × 11 columns
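
Each column holds the labels for one number of clusters, so a single solution can be pulled out by selecting its column. The sketch below uses a small stand-in DataFrame built from the first five rows and five columns shown above. It assumes the columns are keyed by the integer number of clusters, as the printed output suggests.

```python
import pandas as pd

# Stand-in for cgram.labels: the first five rows and first five
# columns of the output above (assumption: integer column keys).
labels = pd.DataFrame(
    {
        1: [0, 0, 0, 0, 0],
        2: [1, 1, 1, 1, 1],
        3: [1, 1, 1, 1, 1],
        4: [1, 1, 1, 1, 1],
        5: [3, 3, 3, 3, 2],
    }
)

labels_k5 = labels[5]              # labels for the 5-cluster solution
counts = labels_k5.value_counts()  # number of observations per cluster
print(counts)
```

The selected column is an ordinary pandas Series, so it can be attached back to the original DataFrame or summarized like any other column.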

Saving clustergram

If you want to save your computed clustergram.Clustergram object to disk, you can use the pickle library:

import pickle

# Serialize the fitted Clustergram object to disk.
with open('clustergram.pickle', 'wb') as f:
    pickle.dump(cgram, f)

# Load it back later.
with open('clustergram.pickle', 'rb') as f:
    loaded = pickle.load(f)