Unsupervised Learning for Trading

Overview

Unsupervised learning discovers hidden patterns in data without labeled targets. In trading, it helps with:

Dimensionality reduction: PCA identifies dominant market factors
Asset grouping: Clustering finds assets with similar behavior
Risk factor extraction: Data-driven alternatives to traditional factor models
Regime detection: Identifying market states automatically

Unlike supervised learning, unsupervised methods don’t predict future values directly. Instead, they reveal structure that informs portfolio construction and risk management.
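As a minimal sketch of what "revealing structure" can look like, the example below runs PCA with plain NumPy on synthetic returns driven by one shared factor. The data, the factor loadings (`betas`), and all variable names are assumptions made for illustration. It does not use the puffin.unsupervised API.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days, n_assets = 500, 8

# Synthetic daily returns (assumption): one shared "market" factor
# plus independent asset-specific noise.
market = rng.normal(0.0, 0.01, size=n_days)
betas = rng.uniform(0.8, 1.2, size=n_assets)
noise = rng.normal(0.0, 0.005, size=(n_days, n_assets))
returns = np.outer(market, betas) + noise

# PCA via eigendecomposition of the sample covariance matrix.
cov = np.cov(returns, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]           # reorder to descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()         # explained variance ratios

# The sign of an eigenvector is arbitrary; flip so the loadings sum to a positive value.
pc1 = eigvecs[:, 0]
if pc1.sum() < 0:
    pc1 = -pc1

# Eigenportfolio: first-component loadings rescaled to sum to one.
weights = pc1 / pc1.sum()
pc1_returns = returns @ weights
corr_with_market = np.corrcoef(pc1_returns, market)[0, 1]

print("Variance explained by PC1:", round(float(explained[0]), 3))
print("Eigenportfolio weights:", np.round(weights, 3))
print("Correlation with hidden factor:", round(float(corr_with_market), 3))
```

No labels are used anywhere: the first component recovers the hidden common factor from the covariance of returns alone. On real data the leading component is usually less clean than in this one-factor simulation.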

Chapter 13 of Machine Learning for Algorithmic Trading covers PCA, clustering, eigenportfolios, and hierarchical risk parity. This part implements those techniques in the puffin.unsupervised module.

Methods Taxonomy

graph TD
    A[Unsupervised Learning] --> B[Dimensionality Reduction]
    A --> C[Clustering]
    A --> D[Factor Models]

    B --> B1[PCA]
    B --> B2[Eigenportfolios]
    B --> B3[Variance Explained]

    C --> C1[K-Means]
    C --> C2[Hierarchical]
    C --> C3[DBSCAN]
    C --> C4[GMM]

    D --> D1[Factor Extraction]
    D --> D2[Factor Attribution]
    D --> D3[Factor Timing]

    B1 --> E[Portfolio Construction]
    C1 --> E
    D1 --> E

    classDef root fill:#1a3a5c,stroke:#0d2137,color:#e8e0d4
    classDef reduction fill:#2d5016,stroke:#1a3a1a,color:#e8e0d4
    classDef cluster fill:#6b2d5b,stroke:#3d1a35,color:#e8e0d4
    classDef factor fill:#8b4513,stroke:#5c2e0d,color:#e8e0d4
    classDef output fill:#2d4a6b,stroke:#1a2d42,color:#e8e0d4

    class A root
    class B,B1,B2,B3 reduction
    class C,C1,C2,C3,C4 cluster
    class D,D1,D2,D3 factor
    class E output

    linkStyle default stroke:#4a5568,stroke-width:2px

Chapter Contents

Sub-page	Topics
PCA & Eigenportfolios	Principal component analysis, variance explained, eigenportfolios, return reconstruction
Clustering Methods	K-means, hierarchical clustering, DBSCAN, Gaussian mixture models, cluster correlation
Data-Driven Risk Factors	Factor extraction, exposures, attribution, variance decomposition, mimicking portfolios, factor timing

Common Pitfalls

Keep these in mind when applying unsupervised methods to financial data.

Overfitting with too many components: Choose the number of components with an explained-variance threshold (e.g., 95%) rather than an arbitrary count.
Ignoring time structure: PCA treats all observations equally. For time series, consider rolling windows or exponential weighting.
Interpreting PCA factors: Principal components are linear combinations of assets. They’re mathematically optimal but not always economically meaningful.
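Cluster instability: Small changes in the data can change cluster assignments, and k-means results also depend on random initialization. Hierarchical clustering or GMM may give more stable results; check that clusters persist across different samples or time windows.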
Correlation vs causation: Clustering finds correlation, not causation. Assets may cluster due to omitted variables.
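The explained-variance rule from the first pitfall can be sketched as follows. This is an illustration in plain NumPy under stated assumptions: the helper name `n_components_for_variance`, the synthetic three-factor data, and the thresholds are all hypothetical and are not part of the puffin.unsupervised module.

```python
import numpy as np

def n_components_for_variance(returns, threshold=0.95):
    """Return the smallest number of principal components whose cumulative
    explained variance ratio reaches `threshold`, plus the cumulative curve."""
    centered = returns - returns.mean(axis=0)
    # Squared singular values of the centered data are proportional
    # to the variances along each principal component.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    cumulative = np.cumsum(variances) / variances.sum()
    # First index where the cumulative ratio is >= threshold, as a count.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative)), cumulative

# Synthetic returns (assumption): three latent factors plus small noise.
rng = np.random.default_rng(42)
n_days, n_assets, n_factors = 750, 10, 3
factors = rng.normal(0.0, 0.01, size=(n_days, n_factors))
loadings = rng.normal(0.0, 1.0, size=(n_factors, n_assets))
noise = rng.normal(0.0, 0.002, size=(n_days, n_assets))
returns = factors @ loadings + noise

k95, cumulative = n_components_for_variance(returns, threshold=0.95)
k90, _ = n_components_for_variance(returns, threshold=0.90)

print("Components for 90% of variance:", k90)
print("Components for 95% of variance:", k95)
print("Cumulative explained variance:", np.round(cumulative, 3))
```

To respect time structure, the same helper could be applied to a rolling window of returns, so that the component count is re-estimated as the covariance structure changes.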

Exercises

Load S&P 500 returns and apply PCA. How many components explain 90% of variance?
Cluster tech stocks (AAPL, GOOGL, MSFT, etc.) using k-means. Do the clusters make sense?
Extract 3 data-driven risk factors from a portfolio's returns. What share of the variance is factor-driven versus asset-specific?
Use DBSCAN to find outlier assets in a sector. Which stocks are flagged?
Construct a factor-mimicking portfolio for the first principal component. How does it compare to equal-weighted?

Summary

PCA reduces dimensionality and identifies dominant market modes
Eigenportfolios provide interpretable portfolios from principal components
K-means groups assets with similar behavior
Hierarchical clustering reveals asset relationships via dendrograms
DBSCAN finds irregular clusters and outliers
GMM provides soft cluster membership probabilities
Data-driven factors extract risk factors directly from returns
Factor attribution decomposes returns into common and specific components

Unsupervised learning complements supervised models by revealing structure in unlabeled data. Use it for portfolio construction, risk management, and regime detection.

Notebook: Run the examples interactively in ml_models.ipynb

Related Chapters

Part 11: Tree Ensembles – Tree models consume PCA-reduced features and cluster labels as inputs
Part 4: Alpha Factors – PCA and clustering reduce the high-dimensional factor space to its dominant drivers
Part 5: Portfolio Optimization – Eigenportfolios and clustering feed directly into portfolio construction methods
Part 19: Autoencoders – Autoencoders extend unsupervised dimensionality reduction with neural networks

Source Code

Browse the implementation: puffin/unsupervised/

Next Steps

Part 13 explores NLP for trading: sentiment analysis, news processing, and text-based signals.
