Clustering methods#

Clustergram offers three backends for the computation - scikit-learn and scipy which use CPU and RAPIDS.AI cuML, which uses GPU. Note that all are optional dependencies but you will need at least one of them to generate clustergram.

Scikit-learn, SciPy and RAPIDS cuML backends#

Let’s load the data on Palmer penguins dataset. See the Introduction for its overview.

import seaborn
from sklearn.preprocessing import scale
from clustergram import Clustergram

df = seaborn.load_dataset('penguins')
data = scale(df.drop(columns=['species', 'island', 'sex']).dropna())

seaborn.set(style='whitegrid')

Using scikit-learn (default):

cgram = Clustergram(range(1, 8), n_init=10, backend="sklearn")
cgram.fit(data)
K=1 skipped. Mean computed from data directly.
K=2 fitted in 0.142 seconds.
K=3 fitted in 0.052 seconds.
K=4 fitted in 0.056 seconds.
K=5 fitted in 0.049 seconds.
K=6 fitted in 0.053 seconds.
K=7 fitted in 0.068 seconds.
Clustergram(k_range=range(1, 8), backend='sklearn', method='kmeans', kwargs={'n_init': 10})

Using cuML:

cgram = Clustergram(range(1, 8), backend='cuML')
cgram.fit(data)
Hide code cell output
K=1 skipped. Mean computed from data directly.
K=2 fitted in 0.072 seconds.
K=3 fitted in 0.004 seconds.
K=4 fitted in 0.004 seconds.
K=5 fitted in 0.004 seconds.
K=6 fitted in 0.004 seconds.
K=7 fitted in 0.004 seconds.
Clustergram(k_range=range(1, 8), backend='cuML', method='kmeans', kwargs={})
cgram.plot();
../_images/e5dffac4df7074730cd3214a3d0e74f485151a87701716d60c1d6b246db43e1c.png

data can be all data types supported by the selected backend (including cudf.DataFrame or cupy.ndarray with cuML backend).

Supported methods#

Clustergram currently supports K-Means, Mini Batch K-Means, Gaussian Mixture Model and SciPy’s hierarchical clustering methods. Note tha GMM and Mini Batch K-Means are supported only for scikit-learn backend and hierarchical methods are supported only for scipy backend.

Using K-Means (default):

cgram = Clustergram(range(1, 8), n_init=10, method='kmeans')
cgram.fit(data)
cgram.plot();
K=1 skipped. Mean computed from data directly.
K=2 fitted in 0.055 seconds.
K=3 fitted in 0.066 seconds.
K=4 fitted in 0.046 seconds.
K=5 fitted in 0.052 seconds.
K=6 fitted in 0.058 seconds.
K=7 fitted in 0.065 seconds.
../_images/f8835637c7d2c8eee9a27d90d8ca1a1a0a517c4fe6941d0acbc5f7ba130aabdc.png

Using Mini Batch K-Means, which can provide significant speedup over K-Means:

cgram = Clustergram(range(1, 8), n_init=10, method='minibatchkmeans', batch_size=100)
cgram.fit(data)
cgram.plot();
K=1 skipped. Mean computed from data directly.
K=2 fitted in 0.071 seconds.
K=3 fitted in 0.070 seconds.
K=4 fitted in 0.071 seconds.
K=5 fitted in 0.054 seconds.
K=6 fitted in 0.060 seconds.
K=7 fitted in 0.066 seconds.
../_images/b65ef8534b04b6fa883592397c50e2da65367230906f84d0a69e089b0639dda7.png

Using Gaussian Mixture Model:

cgram = Clustergram(range(1, 8), method='gmm')
cgram.fit(data)
cgram.plot();
K=1 fitted in 0.020 seconds.
K=2 fitted in 0.024 seconds.
K=3 fitted in 0.026 seconds.
K=4 fitted in 0.030 seconds.
K=5 fitted in 0.026 seconds.
K=6 fitted in 0.031 seconds.
K=7 fitted in 0.029 seconds.
../_images/29afcb06bb354118717a3a7a5a5ad2b0c1a40a4471d208819d8385975ad42394.png

Using Ward’s hierarchical clustering:

cgram = Clustergram(range(1, 8), method='hierarchical', linkage='ward')
cgram.fit(data)
cgram.plot();
../_images/d076a6966d71b879cadd9d7a2ad9ddcc80ac581909cda4fa4c39460cbac90f23.png

Manual input#

Alternatively, you can create clustergram using from_data or from_centers methods based on alternative clustering algorithms.

Using Clustergram.from_data which creates cluster centers as mean or median values:

import numpy
import pandas

data = numpy.array([[-1, -1, 0, 10], [1, 1, 10, 2], [0, 0, 20, 4]])
labels = pandas.DataFrame({1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]})

cgram = Clustergram.from_data(data, labels)
cgram.plot();
../_images/e54edaa4396262576e7cecefeec93e401ea5fbc7e1ff9aa16fbe45c1aa671d08.png

Using Clustergram.from_centers based on explicit cluster centers.:

labels = pandas.DataFrame({1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]})
centers = {
            1: numpy.array([[0, 0, 0]]),
            2: numpy.array([[-1, -1, -1], [1, 1, 1]]),
            3: numpy.array([[-1, -1, -1], [1, 1, 1], [0, 0, 0]]),
        }
cgram = Clustergram.from_centers(centers, labels)
cgram.plot(pca_weighted=False);
../_images/414438f9dadf7ad127ec08c9516e626329d49ae25f9acb2633e0673861e00424.png

To support PCA weighted plots with clustergram created from centers you also need to pass data:

labels = pandas.DataFrame({1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]})
centers = {
    1: numpy.array([[0, 0]]),
    2: numpy.array([[-1, -1], [1, 1]]),
    3: numpy.array([[-1, -1], [1, 1], [0, 0]]),
}
data = numpy.array([[-1, -1], [1, 1], [0, 0]])
cgram = Clustergram.from_centers(centers, labels, data=data)
cgram.plot(pca_weighted=True);
../_images/2bfddc809576c05b58eb4a1a39b1aa5e0114b501c791d4facdb6dcdb8f63c7b1.png