PCA & Eigenportfolios

Principal Component Analysis (PCA) finds directions of maximum variance in return data. The first principal component captures the most variation, the second captures the next most (orthogonal to the first), and so on. Eigenportfolios are the portfolio-weight interpretations of these components.

PCA is one of the most widely used dimensionality-reduction techniques in quantitative finance. It underpins the statistical risk models offered by vendors such as MSCI Barra and Axioma.
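To make "directions of maximum variance" concrete, here is a minimal NumPy sketch that computes principal components as the eigenvectors of the sample covariance matrix. It does not use MarketPCA; the synthetic returns, the factor structure, and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic daily returns (made-up data): one common "market" factor
# plus asset-specific noise.
n_days, n_assets = 500, 6
market = rng.normal(0.0, 0.01, size=(n_days, 1))
betas = rng.uniform(0.8, 1.2, size=(1, n_assets))
returns = market * betas + rng.normal(0.0, 0.005, size=(n_days, n_assets))

# Principal components are the eigenvectors of the sample covariance matrix.
cov = np.cov(returns, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues

# Sort descending so that column 0 is the first principal component.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Each eigenvalue is the variance captured along its eigenvector.
explained_ratio = eigvals / eigvals.sum()
print(explained_ratio.round(3))
```

Because the data were built around a single common factor, the first entry of explained_ratio should dominate the rest.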

MarketPCA: Basic Usage

The MarketPCA class wraps scikit-learn’s PCA with finance-specific convenience methods.

from puffin.unsupervised import MarketPCA
import pandas as pd

# Load returns data
returns = pd.read_csv("returns.csv", index_col=0, parse_dates=True)

# Fit PCA
pca = MarketPCA(n_components=5)
pca.fit(returns)

# Check explained variance
print(pca.explained_variance_ratio)
# [0.35, 0.18, 0.12, 0.09, 0.06]

# How many components for 95% variance?
print(pca.n_components_95)
# 8

The number of components needed for 95% variance depends on the asset universe. A diversified set of 500 stocks may need 15-20 components; a single-sector set may need only 3-5.
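A threshold-based component count can be derived from the cumulative explained-variance ratios. The helper below is a hypothetical sketch of that calculation, not the code behind n_components_95; the function name and the example ratios are invented for illustration. It assumes the ratios cover the full spectrum (one entry per asset), otherwise the threshold may never be reached.

```python
import numpy as np

def n_components_for(explained_ratio, threshold=0.95):
    """Smallest k whose cumulative explained variance reaches `threshold`."""
    cumulative = np.cumsum(explained_ratio)
    # searchsorted gives the first index where cumulative >= threshold.
    k = int(np.searchsorted(cumulative, threshold, side="left")) + 1
    # If the threshold is never reached, fall back to all components.
    return min(k, len(cumulative))

# Made-up full spectrum for a 10-asset universe (sums to 1).
ratios = np.array([0.35, 0.18, 0.12, 0.09, 0.06, 0.06, 0.05, 0.05, 0.02, 0.02])

print(n_components_for(ratios))        # components needed for 95%
print(n_components_for(ratios, 0.50))  # components needed for 50%
```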

Transform Returns

Project returns onto principal components to work in a lower-dimensional space.

# Project returns onto principal components
transformed = pca.transform(returns)
# Shape: (n_days, n_components)

# Or fit and transform in one step
transformed = pca.fit_transform(returns)

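Each column of transformed is a time series of factor returns for one principal component. Over the fitting sample these series are uncorrelated by construction, which simplifies downstream analysis; on new data the correlations are generally small but not exactly zero.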
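The projection and the in-sample uncorrelatedness of the resulting series can be checked directly. This is a NumPy sketch under stated assumptions (synthetic data, plain eigendecomposition) rather than the MarketPCA implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n_days, n_assets, k = 750, 8, 3

# Synthetic correlated returns (made-up data, for illustration only).
mixing = rng.normal(size=(n_assets, n_assets))
returns = rng.normal(0.0, 0.01, size=(n_days, n_assets)) @ mixing

# Centre, then take the top-k eigenvectors of the covariance matrix.
centered = returns - returns.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]   # shape (n_assets, k)

# Projection: one column of scores per principal component.
scores = centered @ top                           # shape (n_days, k)

# Off-diagonal correlations vanish up to floating-point error.
corr = np.corrcoef(scores, rowvar=False)
print(np.abs(corr - np.eye(k)).max())
```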

Variance Explained Plot

Visualize how much variance each component captures and how quickly cumulative variance grows.

import matplotlib.pyplot as plt

plot_data = pca.explained_variance_plot()
print(plot_data)
#    component  variance_explained  cumulative_variance
# 0          1                0.35                 0.35
# 1          2                0.18                 0.53
# 2          3                0.12                 0.65

# Plot cumulative variance
plt.plot(plot_data["component"], plot_data["cumulative_variance"], marker='o')
plt.axhline(y=0.95, color='r', linestyle='--', label='95% threshold')
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Variance Explained")
plt.legend()
plt.show()

A steep initial curve that flattens quickly indicates that a few dominant factors drive most of the variation, which is typical for equity markets where the “market factor” dominates.

Eigenportfolios

An eigenportfolio is a portfolio whose asset weights are the normalized loadings of one principal component. The first eigenportfolio usually represents market-wide movement, the dominant mode of variation. Higher-order eigenportfolios capture sector rotations, value/growth tilts, or other structural patterns.

# Extract top 3 eigenportfolios
portfolios = pca.eigenportfolios(returns, n=3)
print(portfolios)
#          AAPL  GOOGL  MSFT  AMZN  TSLA
# PC1      0.22   0.21  0.20  0.19  0.18
# PC2      0.35   0.15  0.25  0.15  0.10
# PC3      0.10   0.30  0.15  0.25  0.20

# Weights sum to 1 (long-only by default)
print(portfolios.sum(axis=1))
# PC1    1.0
# PC2    1.0
# PC3    1.0

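Interpreting eigenportfolios: PC1 typically has roughly equal, same-signed weights across all assets (the “market portfolio”). PC2 often separates assets into two groups (e.g., growth vs value). In the raw loadings the two groups carry opposite signs, because PC2 is orthogonal to PC1, which makes PC2 a long-short portfolio. The long-only normalization in the example output above does not show those signs.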
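The sign structure of PC1 and PC2 can be seen on synthetic data with a known two-factor design. The sketch below uses plain NumPy; the normalization (sum to 1 for PC1, unit gross exposure for PC2) is an assumption chosen for illustration and may differ from what eigenportfolios() does.

```python
import numpy as np

rng = np.random.default_rng(2)
n_days = 1000

# Made-up universe: a market factor hits every asset; a second factor
# pushes the first three assets up and the last three down.
market = rng.normal(0.0, 0.010, size=(n_days, 1))
style = rng.normal(0.0, 0.006, size=(n_days, 1))
style_betas = np.array([[1.0, 1.0, 1.0, -1.0, -1.0, -1.0]])
noise = rng.normal(0.0, 0.003, size=(n_days, 6))
returns = market + style * style_betas + noise

eigvals, eigvecs = np.linalg.eigh(np.cov(returns, rowvar=False))
order = np.argsort(eigvals)[::-1]
pc1, pc2 = eigvecs[:, order[0]], eigvecs[:, order[1]]

# Eigenvector signs are arbitrary, so fix them with a convention.
pc1 = pc1 * np.sign(pc1.sum())
pc2 = pc2 * np.sign(pc2[0])

w1 = pc1 / pc1.sum()            # same-signed weights, sum to 1
w2 = pc2 / np.abs(pc2).sum()    # long-short weights, gross exposure 1
print(w1.round(3))
print(w2.round(3))
```

Taking absolute values of the PC2 loadings to force a long-only portfolio would erase the grouping that makes the component interpretable, so it is worth checking which convention a given tool applies.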

Reconstructing Returns

You can approximate returns using only the top N components. This separates the systematic signal from idiosyncratic noise.

# Reconstruct using top 3 components
reconstructed = pca.reconstruct(returns, n_components=3)

# Compare original vs reconstructed
original_vol = returns.std().mean()
reconstructed_vol = reconstructed.std().mean()
print(f"Original: {original_vol:.4f}, Reconstructed: {reconstructed_vol:.4f}")

The reconstruction error (the difference between original and reconstructed returns) represents the idiosyncratic component: the portion of returns not explained by the retained common factors. This is useful for:

Risk decomposition: Separating systematic vs specific risk
Denoising: Removing noise from covariance matrices
Anomaly detection: Large reconstruction errors signal unusual stock-specific events
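The anomaly-detection use case can be sketched as follows: reconstruct returns from the top components, then look for the largest residual. This is a NumPy illustration on made-up data with an injected shock; it is not the reconstruct() implementation, and the two-factor design and shock size are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n_days, n_assets, k = 500, 10, 2

# Made-up returns: a market factor and an alternating-sign style factor,
# plus small asset-specific noise.
factors = rng.normal(0.0, 0.01, size=(n_days, 2))
loadings = np.vstack([np.ones(n_assets), np.tile([1.0, -1.0], n_assets // 2)])
returns = factors @ loadings + rng.normal(0.0, 0.002, size=(n_days, n_assets))

# Inject a stock-specific shock: asset 4 jumps 8% on day 250.
returns[250, 4] += 0.08

mean = returns.mean(axis=0)
centered = returns - mean
eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]

# Rank-k reconstruction and its residual (the idiosyncratic part).
reconstructed = centered @ top @ top.T + mean
residual = returns - reconstructed

# The largest absolute residual marks the most unusual (day, asset) pair.
day, asset = np.unravel_index(np.abs(residual).argmax(), residual.shape)
print(day, asset)
```

In practice a threshold on standardized residuals (per asset) is a more common design than a single argmax, since assets differ in idiosyncratic volatility.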

Practical Tips

Consideration	Recommendation
Number of components	Use the 95% cumulative variance threshold as a starting point
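Standardization	Standardize returns before PCA (zero mean, unit variance) so that high-volatility assets do not dominate the components; skip this only if you deliberately want a covariance-based PCA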
Rolling PCA	Re-estimate periodically (e.g., quarterly) to capture structural changes
Outliers	Robust PCA variants exist for data with outliers; consider winsorizing first
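The recommendations in the table can be combined into one rolling workflow: winsorize, standardize, and re-estimate on a moving window. The sketch below is one hypothetical way to do that with NumPy; the helper names, the 1%/99% quantiles, the 252-day window, and the 63-day step are all assumptions rather than library defaults.

```python
import numpy as np

def preprocess(window, lower_q=0.01, upper_q=0.99):
    """Winsorize each asset's returns at the given quantiles, then z-score."""
    lo = np.quantile(window, lower_q, axis=0)
    hi = np.quantile(window, upper_q, axis=0)
    clipped = np.clip(window, lo, hi)
    return (clipped - clipped.mean(axis=0)) / clipped.std(axis=0, ddof=1)

def pc1_share(window):
    """Share of variance explained by the first component of one window."""
    z = preprocess(window)
    eigvals = np.linalg.eigvalsh(np.cov(z, rowvar=False))
    return eigvals[-1] / eigvals.sum()   # eigvalsh sorts ascending

# Made-up returns: one market factor plus equally sized specific noise.
rng = np.random.default_rng(4)
n_days, n_assets = 756, 5
market = rng.normal(0.0, 0.01, size=(n_days, 1))
returns = market + rng.normal(0.0, 0.01, size=(n_days, n_assets))

# Re-estimate on a 252-day window every 63 days (roughly quarterly).
window_len, step = 252, 63
shares = [pc1_share(returns[end - window_len:end])
          for end in range(window_len, n_days + 1, step)]
print(np.round(shares, 3))
```

Tracking a summary such as the PC1 share across windows is a simple way to spot structural changes; a sharp move suggests the factor structure has shifted and the model should be refreshed.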

Source Code

Browse the implementation: puffin/unsupervised/