Unsupervised Learning for Trading
Overview
Unsupervised learning discovers hidden patterns in data without labeled targets. In trading, it helps with:
- Dimensionality reduction: PCA identifies dominant market factors
- Asset grouping: Clustering finds assets with similar behavior
- Risk factor extraction: Data-driven alternatives to traditional factor models
- Regime detection: Identifying market states automatically
Unlike supervised learning, unsupervised methods don’t predict future values directly. Instead, they reveal structure that informs portfolio construction and risk management.
Chapter 13 of Machine Learning for Algorithmic Trading covers PCA, clustering, eigenportfolios, and hierarchical risk parity. This part implements those techniques in the
puffin.unsupervisedmodule.
Methods Taxonomy
graph TD
A[Unsupervised Learning] --> B[Dimensionality Reduction]
A --> C[Clustering]
A --> D[Factor Models]
B --> B1[PCA]
B --> B2[Eigenportfolios]
B --> B3[Variance Explained]
C --> C1[K-Means]
C --> C2[Hierarchical]
C --> C3[DBSCAN]
C --> C4[GMM]
D --> D1[Factor Extraction]
D --> D2[Factor Attribution]
D --> D3[Factor Timing]
B1 --> E[Portfolio Construction]
C1 --> E
D1 --> E
classDef root fill:#1a3a5c,stroke:#0d2137,color:#e8e0d4
classDef reduction fill:#2d5016,stroke:#1a3a1a,color:#e8e0d4
classDef cluster fill:#6b2d5b,stroke:#3d1a35,color:#e8e0d4
classDef factor fill:#8b4513,stroke:#5c2e0d,color:#e8e0d4
classDef output fill:#2d4a6b,stroke:#1a2d42,color:#e8e0d4
class A root
class B,B1,B2,B3 reduction
class C,C1,C2,C3,C4 cluster
class D,D1,D2,D3 factor
class E output
linkStyle default stroke:#4a5568,stroke-width:2px
Chapter Contents
| Sub-page | Topics |
|---|---|
| PCA & Eigenportfolios | Principal component analysis, variance explained, eigenportfolios, return reconstruction |
| Clustering Methods | K-means, hierarchical clustering, DBSCAN, Gaussian mixture models, cluster correlation |
| Data-Driven Risk Factors | Factor extraction, exposures, attribution, variance decomposition, mimicking portfolios, factor timing |
Common Pitfalls
Keep these in mind when applying unsupervised methods to financial data.
-
Overfitting with too many components: Use explained variance threshold (e.g., 95%) rather than arbitrary numbers.
-
Ignoring time structure: PCA treats all observations equally. For time series, consider rolling windows or exponential weighting.
-
Interpreting PCA factors: Principal components are linear combinations of assets. They’re mathematically optimal but not always economically meaningful.
-
Cluster instability: Small changes in data can flip cluster labels. Use hierarchical clustering or GMM for more stable results.
-
Correlation vs causation: Clustering finds correlation, not causation. Assets may cluster due to omitted variables.
Exercises
-
Load S&P 500 returns and apply PCA. How many components explain 90% of variance?
-
Cluster tech stocks (AAPL, GOOGL, MSFT, etc.) using k-means. Do the clusters make sense?
-
Extract 3 data-driven risk factors from a portfolio. What percentage of variance is factor vs specific?
-
Use DBSCAN to find outlier assets in a sector. Which stocks are flagged?
-
Construct a factor-mimicking portfolio for the first principal component. How does it compare to equal-weighted?
Summary
- PCA reduces dimensionality and identifies dominant market modes
- Eigenportfolios provide interpretable portfolios from principal components
- K-means groups assets with similar behavior
- Hierarchical clustering reveals asset relationships via dendrograms
- DBSCAN finds irregular clusters and outliers
- GMM provides soft cluster membership probabilities
- Data-driven factors extract risk factors directly from returns
- Factor attribution decomposes returns into common and specific components
Unsupervised learning complements supervised models by revealing structure in unlabeled data. Use it for portfolio construction, risk management, and regime detection.
Notebook: Run the examples interactively in
ml_models.ipynb
Related Chapters
- Part 11: Tree Ensembles – Tree models consume PCA-reduced features and cluster labels as inputs
- Part 4: Alpha Factors – PCA and clustering reduce the high-dimensional factor space to its dominant drivers
- Part 5: Portfolio Optimization – Eigenportfolios and clustering feed directly into portfolio construction methods
- Part 19: Autoencoders – Autoencoders extend unsupervised dimensionality reduction with neural networks
Source Code
Browse the implementation: puffin/unsupervised/
Next Steps
Part 13 explores NLP for trading: sentiment analysis, news processing, and text-based signals.