clustergram API reference

clustergram

class clustergram.Clustergram(k_range, backend='sklearn', method='kmeans', verbose=True, **kwargs)

Clustergram class mimicking the interface of a clustering class (e.g. KMeans).

A clustergram is a graph used to examine how cluster members are assigned to clusters as the number of clusters increases. It is useful in exploratory analysis for nonhierarchical clustering algorithms such as k-means, and for hierarchical clustering algorithms when the number of observations is large enough to make dendrograms impractical.

Clustergram offers two computation backends: scikit-learn, which runs on the CPU, and RAPIDS.AI cuML, which runs on the GPU. Both are optional dependencies, but you need at least one of them to generate a clustergram.
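A minimal sketch of choosing a backend at run time, assuming the GPU library is importable as `cuml` and scikit-learn as `sklearn`. The helper `pick_backend` is hypothetical and not part of clustergram; it only returns a string suitable for the `backend` argument.

```python
from importlib.util import find_spec


def pick_backend(is_installed=None):
    """Return "cuML" if cuml is importable, otherwise "sklearn".

    pick_backend is a hypothetical helper, not part of clustergram.
    is_installed may be replaced by a stub, e.g. for testing.
    """
    if is_installed is None:
        # find_spec returns None when a top-level module cannot be found.
        def is_installed(name):
            return find_spec(name) is not None

    if is_installed("cuml"):
        return "cuML"
    if is_installed("sklearn"):
        return "sklearn"
    raise ImportError("Install scikit-learn or cuML to use Clustergram.")
```

The result could then be passed on as `clustergram.Clustergram(range(1, 9), backend=pick_backend())`.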

Parameters

k_range : iterable

Iterable of integer values to be tested as k (number of clusters or components).

backend : {‘sklearn’, ‘cuML’} (default ‘sklearn’)

Whether to use the sklearn implementation of KMeans and PCA or the cuML version. sklearn computes on the CPU, cuML on the GPU.

method : {‘kmeans’, ‘gmm’, ‘minibatchkmeans’} (default ‘kmeans’)

Clustering method.

kmeans uses K-Means clustering, either as sklearn.cluster.KMeans or cuml.KMeans.
gmm uses Gaussian Mixture Model as sklearn.mixture.GaussianMixture
minibatchkmeans uses Mini Batch K-Means as sklearn.cluster.MiniBatchKMeans

Note that gmm and minibatchkmeans are currently supported only with the sklearn backend.
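The backend and method combinations described above can be checked before a model is built. This is a sketch only: `SUPPORTED` and `check_combination` are hypothetical names, not part of clustergram, and the table assumes that no combinations exist beyond those listed in this reference.

```python
# Combinations listed in this reference (assumption: there are no others).
SUPPORTED = {
    "sklearn": {"kmeans", "gmm", "minibatchkmeans"},
    "cuML": {"kmeans"},
}


def check_combination(backend, method):
    """Raise ValueError for a combination this reference does not list.

    check_combination is a hypothetical helper, not part of clustergram.
    """
    if backend not in SUPPORTED:
        raise ValueError(f"Unknown backend: {backend!r}")
    if method not in SUPPORTED[backend]:
        raise ValueError(
            f"Method {method!r} is not available with backend {backend!r}"
        )
    return backend, method
```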

verbose : bool (default True)

Print progress and time of individual steps.

**kwargs

Additional arguments passed to the model (e.g. KMeans), e.g. random_state.

References

The clustergram: A graph for visualizing hierarchical and nonhierarchical cluster analyses: https://journals.sagepub.com/doi/10.1177/1536867X0200200405

Tal Galili’s R implementation: https://www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/

Examples

>>> c_gram = clustergram.Clustergram(range(1, 9))
>>> c_gram.fit(data)
>>> c_gram.plot()

Specifying parameters:

>>> c_gram2 = clustergram.Clustergram(
...     range(1, 9), backend="cuML", random_state=0
... )
>>> c_gram2.fit(cudf_data)
>>> c_gram2.plot(figsize=(12, 12))

Attributes

labels : DataFrame: DataFrame with cluster labels for each iteration.
cluster_centers : dict: Dictionary with cluster centers for each iteration.

Methods

`calinski_harabasz_score`()	Compute the Calinski and Harabasz score.
`davies_bouldin_score`()	Compute the Davies-Bouldin score.
`fit`(data, **kwargs)	Compute clustering for each k within set range.
`plot`([ax, size, linewidth, cluster_style, …])	Generate clustergram plot based on cluster centre mean values.
`silhouette_score`(**kwargs)	Compute the mean Silhouette Coefficient of all samples.

calinski_harabasz_score()

Compute the Calinski and Harabasz score.

See the documentation of sklearn.metrics.calinski_harabasz_score for details.

Once computed, the resulting Series is available as Clustergram.calinski_harabasz. Calling the method again recomputes the score from scratch.
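The compute-then-store behaviour can be pictured with a small stand-in class. `ScoreCache` is hypothetical and is not clustergram's code; it only illustrates that each method call recomputes and overwrites the stored result, while reading the attribute does not trigger any computation.

```python
class ScoreCache:
    """Hypothetical stand-in showing the compute-then-store pattern."""

    def __init__(self, compute):
        self._compute = compute  # callable that returns the score
        self.calls = 0  # number of times the score was computed

    def calinski_harabasz_score(self):
        # Every call recomputes the score and overwrites the stored result.
        self.calls += 1
        self.calinski_harabasz = self._compute()
        return self.calinski_harabasz


cache = ScoreCache(lambda: {2: 23.176629, 3: 30.643018})
cache.calinski_harabasz_score()  # computes and stores
stored = cache.calinski_harabasz  # plain attribute read, no recomputation
```

For expensive scores, calling the method once and reading the attribute afterwards avoids repeating the work.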

Returns

self.calinski_harabasz : pd.Series

Notes

The algorithm uses sklearn. With the cuML backend, data are converted on the fly.

Examples

>>> c_gram = clustergram.Clustergram(range(1, 9))
>>> c_gram.fit(data)
>>> c_gram.calinski_harabasz_score()
2      23.176629
3      30.643018
4      55.223336
5    3116.435184
6    3899.068689
7    4439.306049
Name: calinski_harabasz_score, dtype: float64

Once computed:

>>> c_gram.calinski_harabasz
2      23.176629
3      30.643018
4      55.223336
5    3116.435184
6    3899.068689
7    4439.306049
Name: calinski_harabasz_score, dtype: float64

davies_bouldin_score()

Compute the Davies-Bouldin score.

See the documentation of sklearn.metrics.davies_bouldin_score for details.

Once computed, the resulting Series is available as Clustergram.davies_bouldin. Calling the method again recomputes the score from scratch.

Returns

self.davies_bouldin : pd.Series

Notes

The algorithm uses sklearn. With the cuML backend, data are converted on the fly.

Examples

>>> c_gram = clustergram.Clustergram(range(1, 9))
>>> c_gram.fit(data)
>>> c_gram.davies_bouldin_score()
2    0.249366
3    0.351812
4    0.347580
5    0.055679
6    0.030516
7    0.025207
Name: davies_bouldin_score, dtype: float64

Once computed:

>>> c_gram.davies_bouldin
2    0.249366
3    0.351812
4    0.347580
5    0.055679
6    0.030516
7    0.025207
Name: davies_bouldin_score, dtype: float64

fit(data, **kwargs)

Compute clustering for each k within set range.

Parameters

data : array-like: Input data to be clustered. The data are expected to be scaled. Can be numpy.array, pandas.DataFrame or their RAPIDS counterparts.
**kwargs: Additional arguments passed to the .fit() method of the model, e.g. sample_weight.
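The reference does not say which scaling is expected. A common choice, assumed here, is standardization of each column to zero mean and unit variance. The sketch below uses only the standard library; `standardize` is a hypothetical helper and not part of clustergram.

```python
from statistics import fmean, pstdev


def standardize(rows):
    """Scale each column to zero mean and unit (population) standard deviation.

    Hypothetical helper. A constant column has zero deviation and would
    raise ZeroDivisionError; this sketch does not handle that case.
    """
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


raw = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
scaled = standardize(raw)
```

With scikit-learn installed, `sklearn.preprocessing.scale` performs the same standardization on array-like input.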

Returns

self: Fitted clustergram.

Examples

>>> c_gram = clustergram.Clustergram(range(1, 9))
>>> c_gram.fit(data)

plot(ax=None, size=1, linewidth=1, cluster_style=None, line_style=None, figsize=None, k_range=None, pca_weighted=True, pca_kwargs={})

Generate clustergram plot based on cluster centre mean values.

Parameters

ax : matplotlib.pyplot.Artist (default None): matplotlib axis on which to draw the plot
size : float (default 1): multiplier of the size of a cluster centre indication. Size is determined as 500 / count of observations in a cluster multiplied by size.
linewidth : float (default 1): multiplier of the linewidth of a branch. Line width is determined as 50 / count of observations in a branch multiplied by linewidth.
cluster_style : dict (default None): Style options to be passed on to the cluster centre plot, such as color, linewidth, edgecolor or alpha.
line_style : dict (default None): Style options to be passed on to branches, such as color, linewidth, edgecolor or alpha.
figsize : tuple of integers (default None): Size of the resulting matplotlib.figure.Figure. If the argument ax is given explicitly, figsize is ignored.
k_range : iterable (default None): iterable of integer values to be plotted. If None, Clustergram.k_range is used. Has to be a subset of Clustergram.k_range.
pca_weighted : bool (default True): Whether to use the PCA weighted mean of clusters or the standard mean of clusters on the y-axis.
pca_kwargs : dict (default {}): Additional arguments passed to the PCA object, e.g. svd_solver. Applies only if pca_weighted=True.
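The subset requirement on the plotting k_range can be checked up front. The sketch below is illustrative only: `validate_plot_k_range` is a hypothetical helper, and the error type is an assumption rather than what clustergram raises.

```python
def validate_plot_k_range(plot_k_range, fitted_k_range):
    """Return the k values to plot, or raise if any of them was not fitted.

    Hypothetical helper, not part of clustergram.
    """
    fitted = list(fitted_k_range)
    if plot_k_range is None:
        # Fall back to every fitted k, mirroring the documented default.
        return fitted
    requested = list(plot_k_range)
    missing = sorted(set(requested) - set(fitted))
    if missing:
        raise ValueError(f"k values not in the fitted k_range: {missing}")
    return requested
```

For a clustergram fitted on `range(1, 9)`, `validate_plot_k_range(range(3, 6), range(1, 9))` passes, while `range(5, 12)` is rejected because 9, 10 and 11 were never fitted.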

Returns

ax : matplotlib axis instance

Notes

Before plotting, Clustergram needs to compute the summary values. Those are computed on the first call of each option (pca_weighted=True/False).
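The difference between the two summary options can be sketched for a single cluster centre. This assumes that "PCA weighted" means weighting each feature of the centre by its loading on the first principal component; the reference does not spell this out. The loadings below are made up by hand rather than fitted, and both function names are hypothetical.

```python
def standard_mean(centre):
    """Plain mean of the cluster centre's coordinates."""
    return sum(centre) / len(centre)


def pca_weighted_value(centre, loadings):
    """Coordinates weighted by first-component loadings (assumed definition)."""
    return sum(c * w for c, w in zip(centre, loadings))


centre = [1.0, 3.0]  # one cluster centre with two features
loadings = [0.6, 0.8]  # hypothetical first-component loadings (unit length)

plain = standard_mean(centre)
weighted = pca_weighted_value(centre, loadings)
```

Under this assumption, features that dominate the first principal component contribute more to the y-axis position, which tends to spread clusters further apart than the plain mean does.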

Examples

>>> c_gram = clustergram.Clustergram(range(1, 9))
>>> c_gram.fit(data)
>>> c_gram.plot()

silhouette_score(**kwargs)

Compute the mean Silhouette Coefficient of all samples.

See the documentation of sklearn.metrics.silhouette_score for details.

Once computed, the resulting Series is available as Clustergram.silhouette. Calling the method again recomputes the score from scratch.

Parameters

**kwargs: Additional arguments passed to the silhouette_score function, e.g. sample_size.

Returns

self.silhouette : pd.Series

Notes

The algorithm uses sklearn. With the cuML backend, data are converted on the fly.

Examples

>>> c_gram = clustergram.Clustergram(range(1, 9))
>>> c_gram.fit(data)
>>> c_gram.silhouette_score()
2    0.702450
3    0.644272
4    0.767728
5    0.948991
6    0.769985
7    0.575644
Name: silhouette_score, dtype: float64

Once computed:

>>> c_gram.silhouette
2    0.702450
3    0.644272
4    0.767728
5    0.948991
6    0.769985
7    0.575644
Name: silhouette_score, dtype: float64
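The three scores point in different directions: higher is better for Calinski-Harabasz and silhouette, lower is better for Davies-Bouldin. A minimal sketch of reading a candidate k off each score follows, with the example outputs from this reference copied into plain dictionaries so that it runs without pandas.

```python
# Example outputs from this reference, keyed by k.
calinski_harabasz = {
    2: 23.176629, 3: 30.643018, 4: 55.223336,
    5: 3116.435184, 6: 3899.068689, 7: 4439.306049,
}
davies_bouldin = {
    2: 0.249366, 3: 0.351812, 4: 0.347580,
    5: 0.055679, 6: 0.030516, 7: 0.025207,
}
silhouette = {
    2: 0.702450, 3: 0.644272, 4: 0.767728,
    5: 0.948991, 6: 0.769985, 7: 0.575644,
}

# Higher is better for Calinski-Harabasz and silhouette, lower for Davies-Bouldin.
best_ch = max(calinski_harabasz, key=calinski_harabasz.get)
best_db = min(davies_bouldin, key=davies_bouldin.get)
best_sil = max(silhouette, key=silhouette.get)

print(best_ch, best_db, best_sil)  # prints: 7 7 5
```

On the returned pandas Series, `Series.idxmax()` and `Series.idxmin()` give the same answers. The scores need not agree, as here, so they are best read alongside the clustergram plot rather than in place of it.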
