Random Forests

Random Forests build multiple decision trees using bootstrap sampling and random feature selection, then aggregate their predictions. This section reviews decision tree fundamentals, walks through the RandomForestTrader implementation, and covers hyperparameter tuning for financial data.

Decision Trees Review

Decision trees split the feature space into regions and make predictions based on the majority class (classification) or mean value (regression) in each region.

Key Concepts

  • Splitting criteria: Information gain (classification) or variance reduction (regression)
  • Tree depth: Controls model complexity
  • Leaf size: Minimum samples required in a leaf node
  • Pruning: Removing branches to prevent overfitting
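
These controls map directly onto scikit-learn's tree estimator parameters. A minimal sketch on synthetic data (the parameter values are illustrative, not recommendations):

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data so the example is self-contained
X, y = make_classification(n_samples=500, n_features=4, random_state=0)

tree = DecisionTreeClassifier(
    criterion="entropy",   # splitting criterion: information gain
    max_depth=5,           # tree depth: caps model complexity
    min_samples_leaf=20,   # leaf size: minimum samples per leaf node
    ccp_alpha=0.01,        # cost-complexity pruning to trim weak branches
).fit(X, y)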

Limitations of Single Trees

  • High variance: Small changes in data can produce very different trees
  • Overfitting: Deep trees memorize training data
  • Instability: Sensitive to data changes

Ensemble methods address these limitations by combining multiple trees.

A single decision tree almost always overfits financial data. Never use a standalone tree for production trading signals – always prefer an ensemble approach.
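
To see the variance problem directly, fit two unconstrained trees on bootstrap resamples of the same data and compare their predictions (an illustrative sketch on synthetic data, not puffin code):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data (flip_y adds label noise)
X, y = make_classification(n_samples=500, n_features=4, flip_y=0.2, random_state=0)
rng = np.random.default_rng(0)

# Fit two deep trees, each on a different bootstrap resample of the same data
preds = []
for _ in range(2):
    idx = rng.integers(0, len(X), size=len(X))
    preds.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]).predict(X))

# Deep, unpruned trees typically disagree on a noticeable fraction of points
print(f"Disagreement rate: {(preds[0] != preds[1]).mean():.1%}")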

How Random Forests Work

Random Forests grow a diverse set of trees by injecting two sources of randomness, then aggregate their predictions:

  1. Bootstrap sampling: Each tree is trained on a random sample (with replacement) of the data
  2. Random feature selection: At each split, only a random subset of features is considered
  3. Aggregation: Predictions are averaged (regression) or voted (classification)

This process reduces variance while maintaining low bias. Because each tree sees a different view of the data, the ensemble’s errors tend to cancel out rather than compound.

The name “Random Forest” comes from two sources of randomness: random row sampling (bootstrap) and random column sampling (feature subsets). Together these ensure that individual trees are decorrelated.
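
A from-scratch sketch of the three steps (illustrative only; RandomForestTrader delegates to scikit-learn rather than growing trees by hand):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=100, seed=0):
    # Steps 1-2: bootstrap the rows, subsample columns at every split
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))          # sample rows with replacement
        tree = DecisionTreeClassifier(max_features="sqrt")  # random feature subset per split
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    # Step 3: aggregate by majority vote
    votes = np.stack([t.predict(X) for t in trees])
    return (votes.mean(axis=0) >= 0.5).astype(int)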

Implementation with Puffin

The RandomForestTrader class wraps scikit-learn’s RandomForestClassifier and RandomForestRegressor with trading-specific defaults and utilities.

import pandas as pd
import numpy as np
from puffin.ensembles import RandomForestTrader
from puffin.data import YFinanceProvider

# Load data
client = YFinanceProvider()
df = client.get_stock_prices("AAPL", start="2020-01-01", end="2023-12-31")

# Minimal RSI helper, defined here so the example is self-contained
def compute_rsi(prices, window=14):
    delta = prices.diff()
    gains = delta.clip(lower=0).rolling(window).mean()
    losses = (-delta.clip(upper=0)).rolling(window).mean()
    return 100 - 100 / (1 + gains / losses)

# Create features
df["return_5d"] = df["close"].pct_change(5)
df["return_20d"] = df["close"].pct_change(20)
df["volatility_20d"] = df["close"].pct_change().rolling(20).std()
df["rsi"] = compute_rsi(df["close"], window=14)

# Create target: direction of the next 5-day return
df["forward_return"] = df["close"].pct_change(5).shift(-5)
df["signal"] = (df["forward_return"] > 0).astype(int)

# Prepare data: drop rows with missing features or an undefined forward return
# (the last 5 rows would otherwise be silently mislabeled as 0)
feature_cols = ["return_5d", "return_20d", "volatility_20d", "rsi"]
valid = df.dropna(subset=feature_cols + ["forward_return"])
features = valid[feature_cols]
target = valid["signal"]

# Train model
model = RandomForestTrader(task="classification", random_state=42)
model.fit(features, target, n_estimators=100, max_depth=10)

# Cross-validate
cv_results = model.cross_validate(features, target)
print(f"Cross-validation accuracy: {cv_results['mean_score']:.3f} ± {cv_results['std_score']:.3f}")

# Feature importance
importance = model.feature_importance()
print("\nFeature Importance:")
print(importance)

Understanding the Output

The feature_importance() method returns a pandas Series sorted by importance. For Random Forests, this is based on the mean decrease in impurity (Gini importance) across all trees. For a more robust measure, use SHAP values (covered in the SHAP Interpretation page).
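
For reference, the same quantity is available directly from scikit-learn. A sketch reusing the features and target frames from the example above:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Gini importance: mean decrease in impurity, averaged across all trees
rf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf.fit(features, target)
gini_importance = pd.Series(rf.feature_importances_, index=features.columns)
print(gini_importance.sort_values(ascending=False))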

Hyperparameter Tuning

Key hyperparameters for Random Forests:

  • n_estimators (number of trees): typical range 100–1000. Higher is better but slower; 200–500 is a good range.
  • max_depth (maximum tree depth): typical range 5–15. Keep shallow (5–10) to avoid overfitting noisy data.
  • min_samples_split (minimum samples to split a node): typical range 10–50. Higher values add regularization.
  • max_features (features considered at each split): "sqrt", "log2", or a float. "sqrt" is the default; try 0.3–0.5 for a wider search.

For financial data, err on the side of more regularization. Shallow trees (max_depth of 5–8) with a higher min_samples_split (20 or more) tend to generalize better to unseen market regimes.
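
As a sketch of what those settings look like in practice (assuming, as in the earlier example, that fit forwards keyword arguments to the underlying scikit-learn estimator):

# Regularization-heavy settings for noisy financial data (illustrative values)
model = RandomForestTrader(task="classification", random_state=42)
model.fit(
    features, target,
    n_estimators=300,       # enough trees for stable averages
    max_depth=6,            # shallow trees resist memorizing noise
    min_samples_split=20,   # require real support before splitting
    max_features="sqrt",    # decorrelate trees via column subsampling
)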

Grid Search Example

from sklearn.model_selection import TimeSeriesSplit

# Define parameter grid
param_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [5, 8, 10],
    "min_samples_split": [10, 20, 50],
    "max_features": ["sqrt", 0.5],
}

# Use time-series cross-validation (not random CV)
tscv = TimeSeriesSplit(n_splits=5)

best_params = model.tune_hyperparameters(
    features, target,
    param_grid=param_grid,
    cv=tscv,
)
print("Best parameters:", best_params)

Always use TimeSeriesSplit (or a similar walk-forward approach) for financial cross-validation. Standard k-fold CV leaks future information into training data, inflating performance estimates.
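
To check the walk-forward structure, you can print the index ranges TimeSeriesSplit produces: training windows grow forward, and every test window lies strictly later than its training data.

# Continues from the grid search example above
for fold, (train_idx, test_idx) in enumerate(tscv.split(features)):
    print(f"Fold {fold}: train [0, {train_idx[-1]}], test [{test_idx[0]}, {test_idx[-1]}]")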

When to Use Random Forests

Random Forests are an excellent starting point for any trading ML project:

  • Baseline model: Quick to train, minimal tuning, robust results
  • Feature selection: Use importance scores to identify useful predictors before training more complex models
  • Stable ensembles: Less sensitive to hyperparameter choices than gradient boosting
  • Parallel training: Trees are independent and can be trained in parallel (see the snippet below)
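
In scikit-learn, that parallelism is exposed through n_jobs; RandomForestTrader may expose something similar, but this sketch uses the sklearn estimator directly:

from sklearn.ensemble import RandomForestClassifier

# n_jobs=-1 fits the independent trees across all available CPU cores
rf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)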

For higher accuracy with more careful tuning, consider the Gradient Boosting methods covered in the next section.

Source Code

Browse the implementation: puffin/ensembles/