clustergram API reference

clustergram

class clustergram.Clustergram(k_range, backend='sklearn', method='kmeans', verbose=True, **kwargs)

Clustergram class mimicking the interface of a clustering class (e.g. KMeans).

A clustergram is a graph used to examine how cluster members are assigned to clusters as the number of clusters increases. It is useful in exploratory analysis for nonhierarchical clustering algorithms such as k-means, and for hierarchical clustering algorithms when the number of observations is large enough to make dendrograms impractical.

Clustergram offers two backends for the computation: scikit-learn, which uses the CPU, and RAPIDS.AI cuML, which uses the GPU. Both are optional dependencies, but you need at least one of them to generate a clustergram.

Parameters
k_range : iterable

Iterable of integer values to be tested as k (number of clusters or components).

backend : {‘sklearn’, ‘cuML’} (default ‘sklearn’)

Whether to use sklearn’s implementation of KMeans and PCA or the cuML version. sklearn computes on the CPU, cuML on the GPU.

method : {‘kmeans’, ‘gmm’, ‘minibatchkmeans’} (default ‘kmeans’)

Clustering method.

  • kmeans uses K-Means clustering, either as sklearn.cluster.KMeans or cuml.KMeans.

  • gmm uses a Gaussian Mixture Model as sklearn.mixture.GaussianMixture.

  • minibatchkmeans uses Mini Batch K-Means as sklearn.cluster.MiniBatchKMeans.

Note that gmm and minibatchkmeans are currently supported only with the sklearn backend.

verbose : bool (default True)

Print progress and time of individual steps.

**kwargs

Additional arguments passed to the model (e.g. KMeans), such as random_state.
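As a rough illustration of how the method names above relate to scikit-learn estimators, the sketch below fits each one for a single k. The `MODELS` mapping and the synthetic data are assumptions made for this example and are not taken from Clustergram's internals. Note that GaussianMixture takes n_components where the K-Means variants take n_clusters, which is why k is described as the number of clusters or components.

```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic data standing in for `data` (an assumption for this sketch).
X, _ = make_blobs(n_samples=300, centers=4, n_features=3, random_state=0)

k = 4

# Hypothetical mapping from method name to an estimator factory.
MODELS = {
    "kmeans": lambda k: KMeans(n_clusters=k, n_init=10, random_state=0),
    "minibatchkmeans": lambda k: MiniBatchKMeans(
        n_clusters=k, n_init=3, random_state=0
    ),
    "gmm": lambda k: GaussianMixture(n_components=k, random_state=0),
}

# Fit every estimator on the same data and collect one label per observation.
labels_by_method = {
    name: make_model(k).fit(X).predict(X) for name, make_model in MODELS.items()
}

for name, labels in labels_by_method.items():
    # Number of observations assigned to each of the k clusters/components.
    print(name, np.bincount(labels, minlength=k))
```

Here random_state plays the role of an extra keyword argument forwarded to the model, as described for **kwargs.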

References

The clustergram: A graph for visualizing hierarchical and nonhierarchical cluster analyses: https://journals.sagepub.com/doi/10.1177/1536867X0200200405

Tal Galili’s R implementation: https://www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/

Examples

>>> c_gram = clustergram.Clustergram(range(1, 9))
>>> c_gram.fit(data)
>>> c_gram.plot()

Specifying parameters:

>>> c_gram2 = clustergram.Clustergram(
...     range(1, 9), backend="cuML", random_state=0
... )
>>> c_gram2.fit(cudf_data)
>>> c_gram2.plot(figsize=(12, 12))

Attributes
labels : DataFrame

DataFrame with cluster labels for each iteration.

cluster_centers : dict

Dictionary with cluster centers for each iteration.
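A minimal sketch of what these two attributes might look like, assuming one column of labels and one array of centres per tested k. It is built directly with scikit-learn's KMeans on synthetic data rather than through Clustergram, so the exact layout is an assumption.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data standing in for `data` (an assumption for this sketch).
X, _ = make_blobs(n_samples=200, centers=3, n_features=2, random_state=0)

label_columns = {}    # k -> cluster label of every observation
cluster_centers = {}  # k -> array of shape (k, n_features)

for k in range(1, 5):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    label_columns[k] = model.labels_
    cluster_centers[k] = model.cluster_centers_

# One row per observation, one column per tested k.
labels = pd.DataFrame(label_columns)

print(labels.head())
print({k: centers.shape for k, centers in cluster_centers.items()})
```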

Methods

calinski_harabasz_score()

Compute the Calinski and Harabasz score.

davies_bouldin_score()

Compute the Davies-Bouldin score.

fit(data, **kwargs)

Compute clustering for each k within set range.

plot([ax, size, linewidth, cluster_style, …])

Generate clustergram plot based on cluster centre mean values.

silhouette_score(**kwargs)

Compute the mean Silhouette Coefficient of all samples.

calinski_harabasz_score()

Compute the Calinski and Harabasz score.

See the documentation of sklearn.metrics.calinski_harabasz_score for details.

Once computed, the resulting Series is available as Clustergram.calinski_harabasz. Calling the method again recomputes the score from scratch.

Returns
self.calinski_harabasz : pd.Series

Notes

The algorithm uses sklearn. With the cuML backend, data are converted on the fly.

Examples

>>> c_gram = clustergram.Clustergram(range(1, 9))
>>> c_gram.fit(data)
>>> c_gram.calinski_harabasz_score()
2      23.176629
3      30.643018
4      55.223336
5    3116.435184
6    3899.068689
7    4439.306049
Name: calinski_harabasz_score, dtype: float64

Once computed:

>>> c_gram.calinski_harabasz
2      23.176629
3      30.643018
4      55.223336
5    3116.435184
6    3899.068689
7    4439.306049
Name: calinski_harabasz_score, dtype: float64
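For orientation, a comparable Series can be assembled by calling sklearn.metrics.calinski_harabasz_score once per k. This is a sketch on synthetic data with plain KMeans, not the library's own code path. The loop starts at k=2 because scikit-learn raises an error when only one label is present, which is consistent with the index of the output above starting at 2.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

# Synthetic data standing in for `data` (an assumption for this sketch).
X, _ = make_blobs(n_samples=200, centers=3, n_features=2, random_state=0)

scores = {}
for k in range(2, 6):  # the score needs at least two clusters
    cluster_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = calinski_harabasz_score(X, cluster_labels)

# Index is k, values are the scores (higher means better-separated clusters).
ch = pd.Series(scores, name="calinski_harabasz_score")
print(ch)
```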

davies_bouldin_score()

Compute the Davies-Bouldin score.

See the documentation of sklearn.metrics.davies_bouldin_score for details.

Once computed, the resulting Series is available as Clustergram.davies_bouldin. Calling the method again recomputes the score.

Returns
self.davies_bouldin : pd.Series

Notes

The algorithm uses sklearn. With the cuML backend, data are converted on the fly.

Examples

>>> c_gram = clustergram.Clustergram(range(1, 9))
>>> c_gram.fit(data)
>>> c_gram.davies_bouldin_score()
2    0.249366
3    0.351812
4    0.347580
5    0.055679
6    0.030516
7    0.025207
Name: davies_bouldin_score, dtype: float64

Once computed:

>>> c_gram.davies_bouldin
2    0.249366
3    0.351812
4    0.347580
5    0.055679
6    0.030516
7    0.025207
Name: davies_bouldin_score, dtype: float64

fit(data, **kwargs)

Compute clustering for each k within set range.

Parameters
data : array-like

Input data to be clustered. The data are expected to be scaled. Can be a numpy.array, a pandas.DataFrame or their RAPIDS counterparts.

**kwargs

Additional arguments passed to the .fit() method of the model, e.g. sample_weight.

Returns
self

Fitted clustergram.

Examples

>>> c_gram = clustergram.Clustergram(range(1, 9))
>>> c_gram.fit(data)
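The expectation that data are scaled can be met in several ways. The sketch below uses sklearn.preprocessing.scale to standardise each column to zero mean and unit variance. This is one common choice and an assumption here, not a requirement stated by the reference. The two synthetic columns are deliberately on very different scales.

```python
import numpy as np
from sklearn.preprocessing import scale

rng = np.random.default_rng(0)

# Two features on very different scales (synthetic, for illustration only).
raw = np.column_stack(
    [rng.normal(0, 1, 100), rng.normal(50, 1000, 100)]
)

# Standardise column-wise so no single feature dominates the distances.
data = scale(raw)

print(raw.std(axis=0))   # widely different spreads
print(data.std(axis=0))  # both columns now have unit spread
```

The resulting `data` array is what would then be passed to fit.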

plot(ax=None, size=1, linewidth=1, cluster_style=None, line_style=None, figsize=None, k_range=None, pca_weighted=True, pca_kwargs={})

Generate clustergram plot based on cluster centre mean values.

Parameters
ax : matplotlib.pyplot.Artist (default None)

Matplotlib axis on which to draw the plot.

size : float (default 1)

Multiplier of the size of a cluster centre indication. Size is determined as 500 / count of observations in a cluster multiplied by size.

linewidth : float (default 1)

Multiplier of the linewidth of a branch. Line width is determined as 50 / count of observations in a branch multiplied by linewidth.

cluster_style : dict (default None)

Style options to be passed on to the cluster centre plot, such as color, linewidth, edgecolor or alpha.

line_style : dict (default None)

Style options to be passed on to branches, such as color, linewidth, edgecolor or alpha.

figsize : tuple of integers (default None)

Size of the resulting matplotlib.figure.Figure. If the argument ax is given explicitly, figsize is ignored.

k_range : iterable (default None)

Iterable of integer values to be plotted. If None, Clustergram.k_range will be used. Has to be a subset of Clustergram.k_range.

pca_weighted : bool (default True)

Whether to use the PCA-weighted mean of clusters or the standard mean of clusters on the y-axis.

pca_kwargs : dict (default {})

Additional arguments passed to the PCA object, e.g. svd_solver. Applies only if pca_weighted=True.
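The subset rule for k_range can be pictured with a small check like the one below. `resolve_k_range` is a hypothetical helper written for this illustration and is not part of the Clustergram API. How the library itself reacts to an invalid value is not documented here, so the ValueError is an assumption.

```python
def resolve_k_range(requested, fitted):
    """Hypothetical helper: pick the k values to plot.

    Falls back to every fitted k when `requested` is None and rejects
    any k that was never fitted.
    """
    fitted = list(fitted)
    if requested is None:
        return fitted
    requested = list(requested)
    missing = sorted(set(requested) - set(fitted))
    if missing:
        raise ValueError(f"k values were not fitted: {missing}")
    return requested


fitted_k_range = range(1, 9)

print(resolve_k_range(None, fitted_k_range))         # every fitted k
print(resolve_k_range(range(3, 6), fitted_k_range))  # a valid subset
```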

Returns
ax : matplotlib axis instance

Notes

Before plotting, Clustergram needs to compute the summary values. Those are computed on the first call of each option (pca_weighted=True/False).

Examples

>>> c_gram = clustergram.Clustergram(range(1, 9))
>>> c_gram.fit(data)
>>> c_gram.plot()
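To make the pca_weighted option more concrete, the sketch below contrasts two ways of reducing each cluster centre to a single y-axis value. The interpretation of "PCA weighted mean" as a projection of each centre onto the first principal component of the data is an assumption made for this example, and the reference above does not spell out the formula. svd_solver is shown as the kind of argument pca_kwargs could carry.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic data standing in for `data` (an assumption for this sketch).
X, _ = make_blobs(n_samples=200, centers=3, n_features=4, random_state=0)
centers = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X).cluster_centers_

# Standard mean: average each centre's coordinates across features.
plain_means = centers.mean(axis=1)

# Assumed PCA weighting: weight each feature by its loading on the
# first principal component of the data, then sum.
pca = PCA(n_components=1, svd_solver="full").fit(X)
pca_means = centers @ pca.components_[0]

print(plain_means)
print(pca_means)
```

Either way, each cluster centre ends up as one number per k, which is what can be placed on the y-axis.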

silhouette_score(**kwargs)

Compute the mean Silhouette Coefficient of all samples.

See the documentation of sklearn.metrics.silhouette_score for details.

Once computed, the resulting Series is available as Clustergram.silhouette. Calling the method again recomputes the score from scratch.

Parameters
**kwargs

Additional arguments passed to the silhouette_score function, e.g. sample_size.

Returns
self.silhouette : pd.Series

Notes

The algorithm uses sklearn. With the cuML backend, data are converted on the fly.

Examples

>>> c_gram = clustergram.Clustergram(range(1, 9))
>>> c_gram.fit(data)
>>> c_gram.silhouette_score()
2    0.702450
3    0.644272
4    0.767728
5    0.948991
6    0.769985
7    0.575644
Name: silhouette_score, dtype: float64

Once computed:

>>> c_gram.silhouette
2    0.702450
3    0.644272
4    0.767728
5    0.948991
6    0.769985
7    0.575644
Name: silhouette_score, dtype: float64
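The sample_size keyword mentioned under Parameters belongs to sklearn.metrics.silhouette_score, which then evaluates the coefficient on a random subset of the observations. A minimal sketch of its effect, calling scikit-learn directly on synthetic data (the data and the choice of KMeans are assumptions for this example):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data standing in for `data` (an assumption for this sketch).
X, _ = make_blobs(n_samples=500, centers=4, n_features=2, random_state=0)
cluster_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Full computation over all pairwise distances.
full = silhouette_score(X, cluster_labels)

# Approximation on 100 randomly drawn observations; cheaper on large data.
sampled = silhouette_score(X, cluster_labels, sample_size=100, random_state=0)

print(full, sampled)
```

The sampled value is an estimate, so it will usually differ slightly from the full score.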