API reference

class Clustergram(k_range=None, backend=None, method='kmeans', verbose=True, **kwargs)

Clustergram class mimicking the interface of a clustering class (e.g. KMeans).

Clustergram is a graph used to examine how cluster members are assigned to clusters as the number of clusters increases. This graph is useful in exploratory analysis for nonhierarchical clustering algorithms such as k-means and for hierarchical cluster algorithms when the number of observations is large enough to make dendrograms impractical.

Clustergram offers three backends for the computation: scikit-learn and scipy, which use the CPU, and RAPIDS.AI cuML, which uses the GPU. All are optional dependencies, but you need at least one of them to generate a clustergram.

Alternatively, you can create a clustergram from the results of other clustering algorithms using the from_data or from_centers methods.

Parameters:

k_range : iterable (default None)

Iterable of integer values to be tested as k (number of clusters or components). Not required for hierarchical clustering but will be applied if given. It is recommended to always use a limited range for hierarchical methods, as an unlimited clustergram can take a while to compute and is not legible for a large number of observations.

backend : {‘sklearn’, ‘cuML’, ‘scipy’} (default None)

Specify the computational backend. Defaults to 'sklearn' for the 'kmeans', 'gmm', and 'minibatchkmeans' methods and to 'scipy' for any of the hierarchical clustering methods. 'scipy' uses sklearn for the PCA computation if that is required. 'sklearn' runs the computation on the CPU, 'cuML' on the GPU.

method : {‘kmeans’, ‘gmm’, ‘minibatchkmeans’, ‘hierarchical’} (default ‘kmeans’)

Clustering method.

kmeans uses K-Means clustering, either as sklearn.cluster.KMeans or cuml.KMeans.
gmm uses Gaussian Mixture Model as sklearn.mixture.GaussianMixture
minibatchkmeans uses Mini Batch K-Means as sklearn.cluster.MiniBatchKMeans
hierarchical uses hierarchical/agglomerative clustering as scipy.cluster.hierarchy.linkage. See

Note that gmm and minibatchkmeans are currently supported only with sklearn backend.

verbose : bool (default True)

Print progress and time of individual steps.

**kwargs

Additional arguments passed to the model (e.g. KMeans), such as random_state. Pass linkage to specify the linkage method in case of hierarchical clustering (e.g. linkage='ward'). See the documentation of scipy for details. If method='gmm', you can pass bic=True to store the BIC values in Clustergram.bic_.

References

The clustergram: A graph for visualizing hierarchical and nonhierarchical cluster analyses: https://journals.sagepub.com/doi/10.1177/1536867X0200200405

Tal Galili’s R implementation: https://www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/

Examples

>>> c_gram = clustergram.Clustergram(range(1, 9))
>>> c_gram.fit(data)
>>> c_gram.plot()

Specifying parameters:

>>> c_gram2 = clustergram.Clustergram(
...     range(1, 9), backend="cuML", random_state=0
... )
>>> c_gram2.fit(cudf_data)
>>> c_gram2.plot(figsize=(12, 12))
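
The backend defaults and restrictions listed under Parameters can be summarised in a short sketch. resolve_backend is a hypothetical helper written for illustration only; it is not part of clustergram and simply encodes the rules stated in the parameter descriptions.

```python
def resolve_backend(method="kmeans", backend=None):
    """Hypothetical helper mirroring the documented backend rules."""
    sklearn_only = {"gmm", "minibatchkmeans"}
    if backend is None:
        # Documented defaults: 'scipy' for hierarchical, 'sklearn' otherwise.
        return "scipy" if method == "hierarchical" else "sklearn"
    if method in sklearn_only and backend != "sklearn":
        # gmm and minibatchkmeans are documented as sklearn-only.
        raise ValueError(
            f"method='{method}' is supported only with the 'sklearn' backend"
        )
    return backend


print(resolve_backend("kmeans"))          # sklearn
print(resolve_backend("hierarchical"))    # scipy
print(resolve_backend("kmeans", "cuML"))  # cuML
```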

Attributes:

labels_ : DataFrame: DataFrame with cluster labels for each option.
cluster_centers_ : dict: Dictionary with cluster centers for each option.
linkage_ : numpy.ndarray: Linkage for hierarchical methods.
bic_ : Series: Bayesian Information Criterion for each option for Gaussian Mixture Model.

Methods

`bokeh`([fig, size, line_width, ...])	Generate interactive clustergram plot based on cluster centre mean values using Bokeh.
`calinski_harabasz_score`()	Compute the Calinski and Harabasz score.
`davies_bouldin_score`()	Compute the Davies-Bouldin score.
`fit`(X[, y])	Compute clustering for each k within set range.
`from_centers`(cluster_centers, labels[, data])	Create clustergram based on cluster centers dictionary and labels DataFrame.
`from_data`(data, labels[, method])	Create clustergram based on data and labels DataFrame.
`plot`([ax, size, linewidth, cluster_style, ...])	Generate clustergram plot based on cluster centre mean values.
`silhouette_score`(**kwargs)	Compute the mean Silhouette Coefficient of all samples.

bokeh(fig=None, size=1, line_width=1, cluster_style=None, line_style=None, figsize=None, pca_weighted=True, pca_kwargs={}, pca_component=1)

Generate interactive clustergram plot based on cluster centre mean values using Bokeh.

Requires bokeh.

Parameters:

fig : bokeh.plotting.figure.Figure (default None): Bokeh figure on which to draw the plot.
size : float (default 1): Multiplier of the size of a cluster centre indication. Size is determined as 50 / count of observations in a cluster multiplied by size.
line_width : float (default 1): Multiplier of the line width of a branch. Line width is determined as 50 / count of observations in a branch multiplied by line_width.
cluster_style : dict (default None): Style options to be passed on to the cluster centre plot, such as color, line_width, line_color or alpha.
line_style : dict (default None): Style options to be passed on to branches, such as color, line_width, line_color or alpha.
figsize : tuple of integers (default None): Size of the resulting bokeh.plotting.figure.Figure. If the argument fig is given explicitly, figsize is ignored.
pca_weighted : bool (default True): Whether to use the PCA-weighted mean of clusters or the standard mean of clusters on the y-axis.
pca_kwargs : dict (default {}): Additional arguments passed to the PCA object, e.g. svd_solver. Applies only if pca_weighted=True.
pca_component : int (default 1): The principal component used to weight the mean of clusters if pca_weighted=True. The PCA computation is cached, so it is cheap to compare multiple options. However, the PCA is computed only up to the highest component requested so far: if you use pca_component=1 first and then try pca_component=2, the PCA is run again. If you first plot with pca_component=2, a later call with pca_component=1 does not trigger a new computation.

Returns:

figure : bokeh figure instance

Notes

Before plotting, Clustergram needs to compute the summary values. Those are computed on the first call of each option (pca_weighted=True/False).

Examples

>>> from bokeh.plotting import show
>>> c_gram = clustergram.Clustergram(range(1, 9))
>>> c_gram.fit(data)
>>> f = c_gram.bokeh()
>>> show(f)

For the best experience in Jupyter notebooks, specify bokeh output first:

>>> from bokeh.io import output_notebook
>>> from bokeh.plotting import show
>>> output_notebook()

>>> c_gram = clustergram.Clustergram(range(1, 9))
>>> c_gram.fit(data)
>>> f = c_gram.bokeh()
>>> show(f)
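
The difference between the two pca_weighted options can be sketched with numpy alone. This is an illustration under an assumption: it takes "PCA weighted" to mean projecting each cluster centre onto the chosen principal component, and "standard mean" to mean the plain average of a centre's coordinates. The reference above does not spell out the formula, and first_components is a hypothetical helper, not a clustergram function.

```python
import numpy as np


def first_components(data, n_components=1):
    """Principal axes of `data` (rows of the result), computed via SVD."""
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:n_components]


data = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
centers = np.array([[0.5, 0.5], [2.5, 2.5]])  # two cluster centres

components = first_components(data, n_components=1)

# Assumed meaning of pca_weighted=True: centre projected on the component.
pca_weighted = centers @ components[0]
# Assumed meaning of pca_weighted=False: plain mean of the centre's values.
standard = centers.mean(axis=1)

print(np.abs(pca_weighted))  # sign of a principal axis is arbitrary
print(standard)
```

Because the sign of a principal axis is arbitrary, only the relative positions of the projected centres are meaningful in this sketch.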

calinski_harabasz_score()

Compute the Calinski and Harabasz score.

See the documentation of sklearn.metrics.calinski_harabasz_score for details.

Once computed, the resulting Series is available as Clustergram.calinski_harabasz_. Calling the method again will recompute the score.

Returns:

calinski_harabasz : pd.Series

Notes

The algorithm uses sklearn. With the cuML backend, data are converted on the fly.

Examples

>>> c_gram = clustergram.Clustergram(range(1, 9))
>>> c_gram.fit(data)
>>> c_gram.calinski_harabasz_score()
2      23.176629
3      30.643018
4      55.223336
5    3116.435184
6    3899.068689
7    4439.306049
Name: calinski_harabasz_score, dtype: float64

Once computed:

>>> c_gram.calinski_harabasz_
2      23.176629
3      30.643018
4      55.223336
5    3116.435184
6    3899.068689
7    4439.306049
Name: calinski_harabasz_score, dtype: float64
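
The method above delegates to sklearn.metrics.calinski_harabasz_score. To show what the index measures, here is a numpy-only sketch of the standard definition: between-cluster dispersion divided by within-cluster dispersion, each scaled by its degrees of freedom. The function name calinski_harabasz and the toy data are illustrative, not part of clustergram.

```python
import numpy as np


def calinski_harabasz(X, labels):
    """Calinski-Harabasz index for one labelling (needs at least 2 clusters)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = len(X)
    clusters = np.unique(labels)
    k = len(clusters)
    overall_mean = X.mean(axis=0)
    between = 0.0  # dispersion of cluster centres around the overall mean
    within = 0.0   # dispersion of members around their own centre
    for c in clusters:
        members = X[labels == c]
        center = members.mean(axis=0)
        between += len(members) * np.sum((center - overall_mean) ** 2)
        within += np.sum((members - center) ** 2)
    return (between / (k - 1)) / (within / (n - k))


X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
score = calinski_harabasz(X, [0, 0, 1, 1])
print(score)  # 200.0
```

Higher values indicate denser, better separated clusters. The index is undefined for a single cluster because of the division by k - 1.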

davies_bouldin_score()

Compute the Davies-Bouldin score.

See the documentation of sklearn.metrics.davies_bouldin_score for details.

Once computed, the resulting Series is available as Clustergram.davies_bouldin_. Calling the method again will recompute the score.

Returns:

davies_bouldin : pd.Series

Notes

The algorithm uses sklearn. With the cuML backend, data are converted on the fly.

Examples

>>> c_gram = clustergram.Clustergram(range(1, 9))
>>> c_gram.fit(data)
>>> c_gram.davies_bouldin_score()
2    0.249366
3    0.351812
4    0.347580
5    0.055679
6    0.030516
7    0.025207
Name: davies_bouldin_score, dtype: float64

Once computed:

>>> c_gram.davies_bouldin_
2    0.249366
3    0.351812
4    0.347580
5    0.055679
6    0.030516
7    0.025207
Name: davies_bouldin_score, dtype: float64
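
The method above delegates to sklearn.metrics.davies_bouldin_score. As an illustration of what the score measures, the following numpy-only sketch follows the standard definition: for each cluster, take the worst ratio of combined within-cluster scatter to centre separation, then average over clusters. The function name davies_bouldin and the toy data are illustrative, not part of clustergram.

```python
import numpy as np


def davies_bouldin(X, labels):
    """Davies-Bouldin score for one labelling (needs at least 2 clusters)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    centers = np.array([X[labels == c].mean(axis=0) for c in clusters])
    # Mean distance of each cluster's members to their own centre.
    scatter = np.array([
        np.linalg.norm(X[labels == c] - centers[i], axis=1).mean()
        for i, c in enumerate(clusters)
    ])
    k = len(clusters)
    worst = np.zeros(k)
    for i in range(k):
        # Similarity of cluster i to every other cluster; keep the largest.
        worst[i] = max(
            (scatter[i] + scatter[j]) / np.linalg.norm(centers[i] - centers[j])
            for j in range(k)
            if j != i
        )
    return worst.mean()


X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
score = davies_bouldin(X, [0, 0, 1, 1])
print(round(score, 6))  # 0.1
```

Lower values indicate better separation, with 0 as the minimum.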

fit(X, y=None, **kwargs)

Compute clustering for each k within set range.

Parameters:

X : array-like: Input data to be clustered. It is expected that data are scaled. Can be numpy.array, pandas.DataFrame or their RAPIDS counterparts.
y : ignored: Not used, present here for API consistency by convention.
**kwargs: Additional arguments passed to the .fit() method of the model, e.g. sample_weight.

Returns:

self: Fitted clustergram.

Examples

>>> c_gram = clustergram.Clustergram(range(1, 9))
>>> c_gram.fit(data)
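
Since fit expects scaled input, one common preparation step is z-score standardisation. The sketch below uses numpy only; the reference does not prescribe a particular scaler, so this choice and the helper name standardize are assumptions made for illustration.

```python
import numpy as np


def standardize(X):
    """Scale each column to zero mean and unit variance."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # constant columns become zeros instead of NaN
    return (X - X.mean(axis=0)) / std


raw = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
scaled = standardize(raw)
print(scaled.mean(axis=0))  # approximately [0. 0.]
print(scaled.std(axis=0))   # approximately [1. 1.]
```

The scaled array could then be passed to fit in place of the raw data.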

classmethod from_centers(cluster_centers, labels, data=None)

Create clustergram based on cluster centers dictionary and labels DataFrame.

Parameters:

cluster_centers : dict: Dictionary of cluster centers with keys encoding the number of clusters and values being M x N arrays, where M == key and N == number of variables in the original dataset. Entries should be ordered based on keys.
labels : pandas.DataFrame: DataFrame with columns representing cluster labels and rows representing observations. Columns must be equal to cluster_centers keys.
data : array-like (optional): Array used as an input of the clustering algorithm with N columns. Required for the plot(pca_weighted=True) plotting option. Otherwise only plot(pca_weighted=False) is available.

Returns:

clustergram.Clustergram

Notes

The algorithm uses sklearn and pandas to generate the clustergram. The GPU option is not implemented.

Examples

>>> import pandas as pd
>>> import numpy as np
>>> from clustergram import Clustergram
>>> labels = pd.DataFrame({1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]})
>>> labels
   1  2  3
0  0  0  0
1  0  0  2
2  0  1  1
>>> centers = {
...     1: np.array([[0, 0]]),
...     2: np.array([[-1, -1], [1, 1]]),
...     3: np.array([[-1, -1], [1, 1], [0, 0]]),
... }
>>> cgram = Clustergram.from_centers(centers, labels)
>>> cgram.plot(pca_weighted=False)

>>> data = np.array([[-1, -1], [1, 1], [0, 0]])
>>> cgram = Clustergram.from_centers(centers, labels, data=data)
>>> cgram.plot()
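
The shape requirements stated under Parameters can be checked before calling from_centers. The following is a minimal sketch of such a check; check_centers_and_labels is a hypothetical helper and not part of clustergram, which may or may not perform similar validation itself.

```python
import numpy as np
import pandas as pd


def check_centers_and_labels(cluster_centers, labels):
    """Hypothetical pre-flight check of the documented input requirements."""
    keys = list(cluster_centers)
    if keys != sorted(keys):
        raise ValueError("cluster_centers entries should be ordered by key")
    if list(labels.columns) != keys:
        raise ValueError("labels columns must equal cluster_centers keys")
    arrays = {k: np.atleast_2d(c) for k, c in cluster_centers.items()}
    if len({a.shape[1] for a in arrays.values()}) != 1:
        raise ValueError("all centre arrays need the same number of columns")
    for k, a in arrays.items():
        if a.shape[0] != k:  # M x N array with M == key
            raise ValueError(f"centres for k={k} must have {k} rows")
    return True


labels = pd.DataFrame({1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]})
centers = {
    1: np.array([[0, 0]]),
    2: np.array([[-1, -1], [1, 1]]),
    3: np.array([[-1, -1], [1, 1], [0, 0]]),
}
print(check_centers_and_labels(centers, labels))  # True
```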

classmethod from_data(data, labels, method='mean')

Create clustergram based on data and labels DataFrame.

Cluster centers are created as mean values or median values as a groupby function over data using individual labels.

Parameters:

data : array-like: Array used as an input of the clustering algorithm in the (M, N) shape, where M == number of observations and N == number of variables.
labels : pandas.DataFrame: DataFrame with columns representing cluster labels and rows representing observations. Columns must be equal to cluster_centers keys.
method : {‘mean’, ‘median’}, default ‘mean’: Method of computation of cluster centres.

Returns:

clustergram.Clustergram

Notes

The algorithm uses sklearn and pandas to generate the clustergram. The GPU option is not implemented.

Examples

>>> import pandas as pd
>>> import numpy as np
>>> from clustergram import Clustergram
>>> data = np.array([[-1, -1, 0, 10], [1, 1, 10, 2], [0, 0, 20, 4]])
>>> data
array([[-1, -1,  0, 10],
       [ 1,  1, 10,  2],
       [ 0,  0, 20,  4]])
>>> labels = pd.DataFrame({1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]})
>>> labels
   1  2  3
0  0  0  0
1  0  0  2
2  0  1  1
>>> cgram = Clustergram.from_data(data, labels)
>>> cgram.plot()
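
The description above says that cluster centers are derived as a groupby mean or median over the data. A rough sketch of that idea with pandas follows, reusing the data and labels from the example. centers_from_labels is a hypothetical helper; the actual implementation inside clustergram may differ in detail.

```python
import numpy as np
import pandas as pd


def centers_from_labels(data, labels, method="mean"):
    """Group rows of `data` by each label column and aggregate per cluster."""
    if method not in ("mean", "median"):
        raise ValueError("method must be 'mean' or 'median'")
    df = pd.DataFrame(np.asarray(data))
    centers = {}
    for k in labels.columns:
        grouped = df.groupby(labels[k].to_numpy())
        agg = grouped.mean() if method == "mean" else grouped.median()
        centers[k] = agg.to_numpy()  # one row per cluster, sorted by label
    return centers


data = np.array([[-1, -1, 0, 10], [1, 1, 10, 2], [0, 0, 20, 4]])
labels = pd.DataFrame({1: [0, 0, 0], 2: [0, 0, 1], 3: [0, 2, 1]})
centers = centers_from_labels(data, labels)
print(centers[2])
# [[ 0.  0.  5.  6.]
#  [ 0.  0. 20.  4.]]
```

The resulting dictionary has the same layout that from_centers expects: one entry per number of clusters, each holding one row per cluster.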

plot(ax=None, size=1, linewidth=1, cluster_style=None, line_style=None, figsize=None, k_range=None, pca_weighted=True, pca_kwargs={}, pca_component=1)

Generate clustergram plot based on cluster centre mean values.

Parameters:

ax : matplotlib.pyplot.Artist (default None): Matplotlib axis on which to draw the plot.
size : float (default 1): Multiplier of the size of a cluster centre indication. Size is determined as 500 / count of observations in a cluster multiplied by size.
linewidth : float (default 1): Multiplier of the line width of a branch. Line width is determined as 50 / count of observations in a branch multiplied by linewidth.
cluster_style : dict (default None): Style options to be passed on to the cluster centre plot, such as color, linewidth, edgecolor or alpha.
line_style : dict (default None): Style options to be passed on to branches, such as color, linewidth, edgecolor or alpha.
figsize : tuple of integers (default None): Size of the resulting matplotlib.figure.Figure. If the argument ax is given explicitly, figsize is ignored.
k_range : iterable (default None): Iterable of integer values to be plotted. If None, Clustergram.k_range will be used. Has to be a subset of Clustergram.k_range.
pca_weighted : bool (default True): Whether to use the PCA-weighted mean of clusters or the standard mean of clusters on the y-axis.
pca_kwargs : dict (default {}): Additional arguments passed to the PCA object, e.g. svd_solver. Applies only if pca_weighted=True.
pca_component : int (default 1): The principal component used to weight the mean of clusters if pca_weighted=True. The PCA computation is cached, so it is cheap to compare multiple options. However, the PCA is computed only up to the highest component requested so far: if you use pca_component=1 first and then try pca_component=2, the PCA is run again. If you first plot with pca_component=2, a later call with pca_component=1 does not trigger the PCA computation.

Returns:

ax : matplotlib axis instance

Notes

Before plotting, Clustergram needs to compute the summary values. Those are computed on the first call of each option (pca_weighted=True/False).

Examples

>>> c_gram = clustergram.Clustergram(range(1, 9))
>>> c_gram.fit(data)
>>> c_gram.plot()
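
The caching behaviour described for pca_component can be illustrated with a small stand-alone sketch. PCACache is a hypothetical class written for this example, using a numpy SVD in place of the PCA object; it only demonstrates the rule that a refit happens when a higher component is requested than the cached fit covers.

```python
import numpy as np


class PCACache:
    """Hypothetical cache: refit only if more components are requested."""

    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)
        self.components = None  # cached principal axes, one per row
        self.fit_count = 0      # how many times the decomposition ran

    def component(self, pca_component=1):
        if self.components is None or pca_component > len(self.components):
            centered = self.data - self.data.mean(axis=0)
            _, _, vt = np.linalg.svd(centered, full_matrices=False)
            self.components = vt[:pca_component]
            self.fit_count += 1
        return self.components[pca_component - 1]


rng = np.random.default_rng(0)
cache = PCACache(rng.normal(size=(20, 3)))
cache.component(1)  # first call: fit with one component
cache.component(2)  # higher component requested: fit again
cache.component(1)  # covered by the cached fit: no new computation
print(cache.fit_count)  # 2
```

Requesting the highest component of interest first therefore avoids the second fit.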

silhouette_score(**kwargs)

Compute the mean Silhouette Coefficient of all samples.

See the documentation of sklearn.metrics.silhouette_score for details.

Once computed, the resulting Series is available as Clustergram.silhouette_. Calling the method again will recompute the score.

Parameters:

**kwargs: Additional arguments passed to the silhouette_score function, e.g. sample_size.

Returns:

silhouette : pd.Series

Notes

The algorithm uses sklearn. With the cuML backend, data are converted on the fly.

Examples

>>> c_gram = clustergram.Clustergram(range(1, 9))
>>> c_gram.fit(data)
>>> c_gram.silhouette_score()
2    0.702450
3    0.644272
4    0.767728
5    0.948991
6    0.769985
7    0.575644
Name: silhouette_score, dtype: float64

Once computed:

>>> c_gram.silhouette_
2    0.702450
3    0.644272
4    0.767728
5    0.948991
6    0.769985
7    0.575644
Name: silhouette_score, dtype: float64
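
The method above delegates to sklearn.metrics.silhouette_score. As an illustration of the quantity being averaged, here is a numpy-only sketch of the standard definition with Euclidean distances: for each sample, a is the mean distance to the other members of its cluster, b is the lowest mean distance to the members of any other cluster, and the coefficient is (b - a) / max(a, b). The function name mean_silhouette and the toy data are illustrative, not part of clustergram.

```python
import numpy as np


def mean_silhouette(X, labels):
    """Mean silhouette coefficient (needs at least 2 clusters)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all samples.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # a singleton cluster scores 0 by convention
        a = dist[i, own].sum() / (n_own - 1)  # self-distance is 0
        b = min(
            dist[i, labels == c].mean() for c in clusters if c != labels[i]
        )
        scores[i] = (b - a) / max(a, b)
    return scores.mean()


X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
score = mean_silhouette(X, [0, 0, 1, 1])
print(round(score, 4))  # 0.9002
```

This sketch builds the full pairwise distance matrix, so memory grows quadratically with the number of samples; the sample_size argument mentioned above is sklearn's way of scoring a random subset instead.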
