Clustering Methods
Clustering groups assets with similar return patterns, enabling diversified portfolio construction, sector analysis, and outlier detection. This page covers four complementary approaches available in the `puffin.unsupervised` module.
K-Means Clustering
K-means partitions assets into K groups by minimizing within-cluster variance. It is fast, simple, and a good starting point for asset grouping.
```python
from puffin.unsupervised import cluster_assets, cluster_summary

# Cluster into 3 groups
labels = cluster_assets(returns, n_clusters=3, method='kmeans')
print(labels)
# [0, 0, 1, 1, 2, 2]  # Cluster ID for each asset

# Get cluster statistics
summary = cluster_summary(returns, labels)
print(summary)
#    cluster  n_assets  mean_return  volatility  sharpe_ratio       assets
# 0        0         2         0.15        0.25          0.60  AAPL, GOOGL
# 1        1         2         0.12        0.20          0.60   MSFT, AMZN
# 2        2         2         0.08        0.30          0.27    TSLA, ...
```
K-means uses Euclidean distance on raw return vectors. For correlation-based grouping, first transform the correlation matrix into a distance matrix with `d = sqrt(2 * (1 - corr))`, which maps perfectly correlated assets to distance 0 and perfectly anti-correlated assets to distance 2.
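A minimal sketch of that transform, assuming `returns` is a pandas DataFrame of per-asset returns and using SciPy for the clustering step (the puffin API may handle this internally):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Correlation-based distance: 0 for perfectly correlated assets,
# 2 for perfectly anti-correlated ones
corr = returns.corr()
dist = np.sqrt(2 * (1 - corr))

# The square matrix must be condensed before passing it to SciPy's linkage
Z = linkage(squareform(dist.values, checks=False), method='ward')
```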
Finding Optimal Number of Clusters
Choosing K is critical. Too few clusters miss structure; too many overfit noise.
```python
from puffin.unsupervised import optimal_clusters

# Use silhouette method
optimal_k = optimal_clusters(returns, max_k=10, method='silhouette')
print(f"Optimal clusters: {optimal_k}")

# Or use elbow method
optimal_k_elbow = optimal_clusters(returns, max_k=10, method='elbow')
```
- **Silhouette method**: Measures how similar an asset is to its own cluster versus other clusters. Higher is better; a score above 0.5 indicates well-separated clusters.
- **Elbow method**: Finds the point where adding more clusters no longer meaningfully reduces within-cluster variance. Look for the "elbow" in the inertia curve.
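For intuition, here is a sketch of both criteria using scikit-learn (shown for illustration only; `optimal_clusters` may compute them differently):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = returns.T.values  # one row per asset

for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    score = silhouette_score(X, km.labels_)  # higher is better
    print(f"k={k}: silhouette={score:.3f}, inertia={km.inertia_:.3f}")
```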
Always validate clusters qualitatively. If tech stocks end up in the same cluster as utilities, the features or distance metric may need adjustment.
Hierarchical Clustering
Hierarchical clustering builds a tree (dendrogram) of nested clusters. Unlike K-means, it does not require specifying K upfront – you can cut the tree at any level.
```python
import matplotlib.pyplot as plt

from puffin.unsupervised import hierarchical_cluster, plot_dendrogram

# Cluster with Ward linkage
labels = hierarchical_cluster(returns, method='ward', n_clusters=4)

# Visualize dendrogram
fig = plot_dendrogram(returns, method='ward')
plt.show()
```
Linkage Methods
The linkage method determines how distances between clusters are computed when merging.
| Method | Strategy | Best for |
|---|---|---|
| `ward` | Minimizes within-cluster variance | General purpose (default) |
| `complete` | Maximum distance between clusters | Compact, equal-sized clusters |
| `average` | Average distance between clusters | Balanced approach |
| `single` | Minimum distance between clusters | Detecting elongated clusters |
Hierarchical clustering is used internally by the Hierarchical Risk Parity (HRP) portfolio optimizer covered in Part 5. The dendrogram determines the quasi-diagonal ordering of the covariance matrix.
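Because the full tree is retained, you can cut it at several depths without refitting. A sketch with SciPy's `fcluster`, reusing the linkage matrix `Z` from the correlation-distance example above (an illustration, not puffin's internal code path):

```python
from scipy.cluster.hierarchy import fcluster

# Cut the same tree at two different levels
labels_3 = fcluster(Z, t=3, criterion='maxclust')  # three clusters
labels_5 = fcluster(Z, t=5, criterion='maxclust')  # five clusters, same tree
```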
DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds clusters of arbitrary shape and automatically identifies outliers. It does not require specifying the number of clusters.
```python
from puffin.unsupervised import dbscan_cluster

# eps: maximum distance between neighbors
# min_samples: minimum cluster size
labels = dbscan_cluster(returns, eps=0.5, min_samples=3)

# Label -1 = noise/outliers
outliers = returns.columns[labels == -1]
print(f"Outliers: {outliers.tolist()}")
```
When to use DBSCAN:
- Clusters have irregular shapes (not all clusters are spherical)
- You want outliers identified automatically
- You don't know the number of clusters in advance
DBSCAN is sensitive to `eps` and `min_samples`. Use the k-distance graph to choose `eps`: compute the distance to each point's k-th nearest neighbor, sort the distances, and look for an elbow.
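A quick sketch of that heuristic, assuming scikit-learn and setting k to the intended `min_samples`:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

X = returns.T.values  # one row per asset
k = 3                 # match the intended min_samples

# n_neighbors=k+1 because each point counts as its own nearest neighbor
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dists, _ = nn.kneighbors(X)
kth = np.sort(dists[:, -1])  # distance to each point's k-th true neighbor

plt.plot(kth)
plt.ylabel(f"Distance to {k}-th nearest neighbor")
plt.show()  # read eps off the elbow of this curve
```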
Tuning DBSCAN Parameters
| Parameter | Effect of increasing | Guidance |
|---|---|---|
| `eps` | Larger clusters, fewer outliers | Start with the elbow of the k-distance plot |
| `min_samples` | Denser clusters required, more outliers | Typically set to `2 * n_features` |
Gaussian Mixture Models (GMM)
GMM assigns soft cluster membership (probabilities) rather than hard labels. Each cluster is modeled as a multivariate Gaussian distribution. This is useful when assets may belong to multiple groups.
```python
from puffin.unsupervised import gmm_cluster

labels, probabilities = gmm_cluster(returns, n_components=3, covariance_type='full')

print("Hard labels:", labels)
# [0, 0, 1, 2, 2]

print("Soft probabilities:")
print(probabilities)
#        Cluster 0  Cluster 1  Cluster 2
# AAPL        0.85       0.10       0.05
# GOOGL       0.80       0.15       0.05
# MSFT        0.10       0.85       0.05
```
Covariance Types
The covariance type controls the flexibility of each cluster’s shape.
| Type | Description | Parameters | Best for |
|---|---|---|---|
| `full` | Each cluster has its own covariance matrix | Most flexible | Small to medium datasets |
| `tied` | All clusters share one covariance matrix | Moderate | When clusters have similar shape |
| `diag` | Diagonal covariances (features independent) | Fewer parameters | High-dimensional data |
| `spherical` | Single variance per cluster | Fewest parameters | Isotropic clusters |
Use BIC (Bayesian Information Criterion) to select both the number of components and the covariance type. Lower BIC indicates a better model.
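A sketch of that selection loop using scikit-learn's `GaussianMixture` (an assumption for illustration; `gmm_cluster` may expose its own selection helper):

```python
from sklearn.mixture import GaussianMixture

X = returns.T.values  # one row per asset
best_bic, best_cfg = float('inf'), None

for k in range(2, 6):
    for ct in ('full', 'tied', 'diag', 'spherical'):
        gmm = GaussianMixture(n_components=k, covariance_type=ct,
                              random_state=0).fit(X)
        bic = gmm.bic(X)  # lower BIC = better fit-vs-complexity trade-off
        if bic < best_bic:
            best_bic, best_cfg = bic, (k, ct)

print(f"Lowest BIC: n_components={best_cfg[0]}, covariance_type='{best_cfg[1]}'")
```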
Cluster Correlation
After assigning cluster labels, analyze the correlation structure between clusters to verify they capture distinct behaviors.
```python
from puffin.unsupervised import cluster_assets, cluster_correlation

labels = cluster_assets(returns, n_clusters=3, method='kmeans')
corr_matrix = cluster_correlation(returns, labels)
print(corr_matrix)
#            Cluster 0  Cluster 1  Cluster 2
# Cluster 0       0.75       0.35       0.20
# Cluster 1       0.35       0.80       0.25
# Cluster 2       0.20       0.25       0.70

# High diagonal = strong within-cluster correlation
# Low off-diagonal = clusters are distinct
```
Interpreting the matrix:
- **High diagonal values** (e.g., 0.75-0.80): Assets within each cluster move together, so clusters are internally cohesive.
- **Low off-diagonal values** (e.g., 0.20-0.35): Clusters move independently, which is good for diversification.
- If off-diagonal values are high, consider increasing K or switching methods; see the check sketched below.
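A small illustrative check (the 0.5 threshold is an arbitrary assumption, not a puffin default):

```python
import numpy as np

# Mask the diagonal, then look at the largest between-cluster correlation
off_diag = corr_matrix.where(~np.eye(len(corr_matrix), dtype=bool))
if off_diag.max().max() > 0.5:
    print("Clusters overlap substantially; try a larger K or another method.")
```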
Method Comparison
| Method | Requires K? | Soft labels? | Finds outliers? | Scalability |
|---|---|---|---|---|
| K-Means | Yes | No | No | Excellent |
| Hierarchical | Optional | No | No | Moderate |
| DBSCAN | No | No | Yes | Good |
| GMM | Yes | Yes | No | Moderate |
Source Code
Browse the implementation: `puffin/unsupervised/`