Clustering Methods

Clustering groups assets with similar return patterns, enabling diversified portfolio construction, sector analysis, and outlier detection. This page covers four complementary approaches available in the puffin.unsupervised module.

K-Means Clustering

K-means partitions assets into K groups by minimizing within-cluster variance. It is fast, simple, and a good starting point for asset grouping.

from puffin.unsupervised import cluster_assets, cluster_summary

# Cluster into 3 groups
labels = cluster_assets(returns, n_clusters=3, method='kmeans')
print(labels)
# [0, 0, 1, 1, 2, 2]  # Cluster ID for each asset

# Get cluster statistics
summary = cluster_summary(returns, labels)
print(summary)
#    cluster  n_assets  mean_return  volatility  sharpe_ratio  assets
# 0        0         2         0.15        0.25          0.60  AAPL, GOOGL
# 1        1         2         0.12        0.20          0.60  MSFT, AMZN
# 2        2         2         0.08        0.30          0.27  TSLA, ...

K-means uses Euclidean distance on return vectors. For correlation-based grouping, consider transforming the correlation matrix into a distance matrix first: d = sqrt(2 * (1 - corr)).
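
A minimal sketch of that transformation with NumPy and SciPy, assuming returns is a DataFrame with one column per asset; the resulting correlation distance can then be fed to a distance-based clusterer such as hierarchical clustering:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Correlation distance: 0 for perfectly correlated assets, 2 for perfectly anti-correlated
corr = returns.corr()
dist = np.sqrt(2.0 * (1.0 - corr))

# Condense the symmetric matrix and cluster on correlation distance
condensed = squareform(dist.values, checks=False)  # checks=False tolerates float round-off
Z = linkage(condensed, method='average')
labels = fcluster(Z, t=3, criterion='maxclust')    # three correlation-based clusters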

Finding Optimal Number of Clusters

Choosing K is critical. Too few clusters miss structure; too many overfit noise.

from puffin.unsupervised import optimal_clusters

# Use silhouette method
optimal_k = optimal_clusters(returns, max_k=10, method='silhouette')
print(f"Optimal clusters: {optimal_k}")

# Or use elbow method
optimal_k_elbow = optimal_clusters(returns, max_k=10, method='elbow')

Silhouette method: Measures how similar an asset is to its own cluster versus other clusters. Higher is better; scores above roughly 0.5 generally indicate well-separated clusters.

Elbow method: Finds the point where adding more clusters doesn’t reduce within-cluster variance much. Look for the “elbow” in the inertia curve.
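
For reference, both criteria can be reproduced directly with scikit-learn. This is only a sketch of what a helper like optimal_clusters might compute, assuming assets are clustered on their transposed return histories (one row per asset) and max_k stays below the number of assets:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = returns.T.values  # one row per asset

ks = range(2, 11)     # K must be at least 2 and below the number of assets
inertias, silhouettes = [], []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)                         # elbow: look for the bend
    silhouettes.append(silhouette_score(X, km.labels_))  # silhouette: higher is better

best_k = list(ks)[int(np.argmax(silhouettes))]
print(f"Best K by silhouette: {best_k}")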

Always validate clusters qualitatively. If tech stocks end up in the same cluster as utilities, the features or distance metric may need adjustment.

Hierarchical Clustering

Hierarchical clustering builds a tree (dendrogram) of nested clusters. Unlike K-means, it does not require specifying K upfront – you can cut the tree at any level.

from puffin.unsupervised import hierarchical_cluster, plot_dendrogram
import matplotlib.pyplot as plt

# Cluster with Ward linkage
labels = hierarchical_cluster(returns, method='ward', n_clusters=4)

# Visualize dendrogram
fig = plot_dendrogram(returns, method='ward')
plt.show()
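
Because the tree is built once, it can be cut at several depths without re-fitting. Below is a rough sketch of the underlying SciPy workflow, assuming assets are clustered on their return histories; the puffin helpers presumably wrap something similar:

from scipy.cluster.hierarchy import linkage, fcluster

# One row per asset; Ward linkage on the raw return histories
Z = linkage(returns.T.values, method='ward')

# Cut the same tree at different depths without re-fitting
labels_coarse = fcluster(Z, t=3, criterion='maxclust')  # three clusters
labels_fine = fcluster(Z, t=6, criterion='maxclust')    # finer split into six clusters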

Linkage Methods

The linkage method determines how distances between clusters are computed when merging.

Method    | Strategy                            | Best for
ward      | Minimizes within-cluster variance   | General purpose (default)
complete  | Maximum distance between clusters   | Compact, equal-sized clusters
average   | Average distance between clusters   | Balanced approach
single    | Minimum distance between clusters   | Detecting elongated clusters

Hierarchical clustering is used internally by the Hierarchical Risk Parity (HRP) portfolio optimizer covered in Part 5. The dendrogram determines the quasi-diagonal ordering of the covariance matrix.
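
As an illustration of that ordering step (a sketch only, not the optimizer's actual code), the dendrogram's leaf order can be used to reorder the correlation matrix so that similar assets sit next to each other:

import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import squareform

corr = returns.corr()
dist = np.sqrt(2.0 * (1.0 - corr))                     # correlation distance, as above
Z = linkage(squareform(dist.values, checks=False), method='single')

order = leaves_list(Z)                                 # dendrogram leaf order
quasi_diag = corr.iloc[order, order]                   # quasi-diagonal correlation matrix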

DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds clusters of arbitrary shape and automatically identifies outliers. It does not require specifying the number of clusters.

from puffin.unsupervised import dbscan_cluster

# eps: Maximum distance between neighbors
# min_samples: Minimum cluster size
labels = dbscan_cluster(returns, eps=0.5, min_samples=3)

# Label -1 = noise/outliers
outliers = returns.columns[labels == -1]
print(f"Outliers: {outliers.tolist()}")

When to use DBSCAN:

  • Cluster shapes are irregular (not all clusters are spherical)
  • You want to identify outliers automatically
  • You don’t know the number of clusters in advance

DBSCAN is sensitive to eps and min_samples. Use the k-distance graph to choose eps: compute the distance from each point to its k-th nearest neighbor (with k equal to min_samples), sort the distances, and look for an elbow.
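
A sketch of that k-distance procedure with scikit-learn, clustering assets on their return histories and setting k equal to min_samples:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

X = returns.T.values
k = 3  # match min_samples

# +1 because each point is returned as its own nearest neighbour
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])   # distance from each asset to its k-th neighbour

# The elbow of this curve is a reasonable starting value for eps
plt.plot(k_dist)
plt.ylabel("Distance to k-th nearest neighbour")
plt.show()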

Tuning DBSCAN Parameters

Parameter    | Effect of increasing                      | Guidance
eps          | Larger clusters, fewer outliers           | Start with the elbow of the k-distance plot
min_samples  | Denser clusters required, more outliers   | Typically set to 2 * n_features

Gaussian Mixture Models (GMM)

GMM assigns soft cluster membership (probabilities) rather than hard labels. Each cluster is modeled as a multivariate Gaussian distribution. This is useful when assets may belong to multiple groups.

from puffin.unsupervised import gmm_cluster

labels, probabilities = gmm_cluster(returns, n_components=3, covariance_type='full')

print("Hard labels:", labels)
# [0, 0, 1, 2, 2]

print("Soft probabilities:")
print(probabilities)
#        Cluster 0  Cluster 1  Cluster 2
# AAPL        0.85       0.10       0.05
# GOOGL       0.80       0.15       0.05
# MSFT        0.10       0.85       0.05

Covariance Types

The covariance type controls the flexibility of each cluster’s shape.

Type       | Description                                   | Parameters | Best for
full       | Each cluster has its own covariance matrix    | Most       | Small to medium datasets
tied       | All clusters share one covariance matrix      | Moderate   | When clusters have similar shape
diag       | Diagonal covariances (features independent)   | Fewer      | High-dimensional data
spherical  | Single variance per cluster                   | Fewest     | Isotropic clusters

Use BIC (Bayesian Information Criterion) to select both the number of components and the covariance type. Lower BIC indicates a better model.
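
A sketch of that selection loop written against scikit-learn's GaussianMixture (the puffin wrapper is assumed to expose equivalent options); with many more time periods than assets, you would typically fit on a reduced representation such as PCA factors first:

from sklearn.mixture import GaussianMixture

X = returns.T.values  # one row per asset
best = None
for cov_type in ('full', 'tied', 'diag', 'spherical'):
    for k in range(2, 6):
        gmm = GaussianMixture(n_components=k, covariance_type=cov_type,
                              random_state=0).fit(X)
        bic = gmm.bic(X)
        if best is None or bic < best[0]:
            best = (bic, k, cov_type)

bic, k, cov_type = best
print(f"Lowest BIC: {cov_type} covariance with {k} components (BIC={bic:.1f})")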

Cluster Correlation

After assigning cluster labels, analyze the correlation structure between clusters to verify they capture distinct behaviors.

from puffin.unsupervised import cluster_assets, cluster_correlation

labels = cluster_assets(returns, n_clusters=3, method='kmeans')
corr_matrix = cluster_correlation(returns, labels)

print(corr_matrix)
#             Cluster 0  Cluster 1  Cluster 2
# Cluster 0       0.75       0.35       0.20
# Cluster 1       0.35       0.80       0.25
# Cluster 2       0.20       0.25       0.70

# High diagonal = within-cluster correlation
# Low off-diagonal = clusters are distinct

Interpreting the matrix:

  • High diagonal values (e.g., 0.75-0.80): Assets within each cluster move together – clusters are internally cohesive.
  • Low off-diagonal values (e.g., 0.20-0.35): Clusters move independently – good for diversification.
  • If off-diagonal values are high, consider increasing K or switching methods.
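
For intuition, a comparable matrix can be computed by hand from the asset-level correlations. This is only an illustrative sketch; puffin's cluster_correlation may be implemented differently:

import numpy as np
import pandas as pd

def manual_cluster_correlation(returns, labels):
    """Average pairwise correlation within and between clusters (illustrative only)."""
    corr = returns.corr().values
    labels = np.asarray(labels)
    ids = np.unique(labels)
    names = [f"Cluster {i}" for i in ids]
    out = pd.DataFrame(index=names, columns=names, dtype=float)
    for a, i in zip(names, ids):
        for b, j in zip(names, ids):
            block = corr[np.ix_(labels == i, labels == j)]
            if i == j:
                # drop the diagonal of ones; single-asset clusters yield NaN
                mask = ~np.eye(block.shape[0], dtype=bool)
                out.loc[a, b] = block[mask].mean() if mask.any() else np.nan
            else:
                out.loc[a, b] = block.mean()
    return out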

Method Comparison

Method        | Requires K? | Soft labels? | Finds outliers? | Scalability
K-Means       | Yes         | No           | No              | Excellent
Hierarchical  | Optional    | No           | No              | Moderate
DBSCAN        | No          | No           | Yes             | Good
GMM           | Yes         | Yes          | No              | Moderate

Source Code

Browse the implementation: puffin/unsupervised/