Stacked LSTM & GRU
Stacking multiple recurrent layers enables a model to learn hierarchical representations of sequential data. Lower layers capture short-term, fine-grained patterns (e.g., intraday momentum), while upper layers learn higher-level structure (e.g., multi-day trends). This section also introduces the Gated Recurrent Unit (GRU) – a simpler alternative to LSTM – and covers the practical aspects of regularization, hyperparameter tuning, and evaluation.
Stacked LSTM Architectures
Why Stack Layers?
A single LSTM layer maps an input sequence directly to a hidden representation. When you stack multiple layers, each layer receives the hidden-state sequence from the layer below, allowing the network to build progressively more abstract features.
- Hierarchical Features: Lower layers learn simple patterns, upper layers learn complex patterns
- Better Abstraction: Each layer transforms representations into more useful forms
- Improved Performance: Often (but not always) better results on complex tasks
Diminishing returns set in quickly. Two or three stacked layers is typical for financial time series. Going deeper usually increases overfitting risk without meaningful accuracy gains.
The StackedLSTM Class
The StackedLSTM class in Puffin allows specifying different hidden dimensions at each layer, creating a funnel-shaped architecture that progressively compresses information.
import torch
import torch.nn as nn
from puffin.deep.rnn import StackedLSTM
# Create a 3-layer stacked LSTM with decreasing hidden dims
model = StackedLSTM(
input_dim=1,
hidden_dims=[128, 64, 32],
output_dim=1,
dropout=0.2
)
# Test forward pass
x = torch.randn(16, 20, 1) # batch_size=16, seq_len=20, features=1
output = model(x)
print(f"Output shape: {output.shape}") # torch.Size([16, 1])
Building a Custom Stacked Architecture
For full control, you can compose individual LSTM layers manually with inter-layer dropout:
class DeepLSTM(nn.Module):
def __init__(self, input_dim, hidden_dims=[128, 64, 32], output_dim=1):
super(DeepLSTM, self).__init__()
# Create multiple LSTM layers with different hidden dims
self.lstm_layers = nn.ModuleList()
for i, hidden_dim in enumerate(hidden_dims):
layer_input_dim = input_dim if i == 0 else hidden_dims[i - 1]
self.lstm_layers.append(
nn.LSTM(layer_input_dim, hidden_dim, 1, batch_first=True)
)
self.dropout = nn.Dropout(0.2)
self.fc = nn.Linear(hidden_dims[-1], output_dim)
def forward(self, x):
out = x
for lstm in self.lstm_layers:
out, _ = lstm(out)
out = self.dropout(out)
out = self.fc(out[:, -1, :])
return out
# Instantiate and test
model = DeepLSTM(input_dim=5, hidden_dims=[128, 64, 32])
x = torch.randn(16, 30, 5) # 5 input features
output = model(x)
print(f"Output shape: {output.shape}") # torch.Size([16, 1])
GRU: Gated Recurrent Unit
The GRU, introduced by Cho et al. (2014), simplifies the LSTM by combining the forget and input gates into a single update gate and merging the cell state and hidden state into one vector. This reduces the parameter count and often speeds up training.
GRU Architecture
Gates:
- Update Gate (z_t): Controls how much of the past information to keep
- Reset Gate (r_t): Controls how much of the past information to forget when computing the candidate
Mathematical Formulation:
Update Gate: z_t = σ(W_z · [h_{t-1}, x_t])
Reset Gate: r_t = σ(W_r · [h_{t-1}, x_t])
Candidate: h̃_t = tanh(W · [r_t * h_{t-1}, x_t])
Hidden State: h_t = (1 - z_t) * h_{t-1} + z_t * h̃_t
The GRU update gate serves double duty. When z_t is close to 1, the cell copies forward the candidate (like an LSTM input gate). When z_t is close to 0, the cell copies forward the previous hidden state (like an LSTM forget gate keeping everything).
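To make the equations concrete, here is a single GRU step written by hand in plain PyTorch, using the same gate convention as above (bias terms omitted, weight matrices randomly initialized as stand-ins rather than trained parameters). Note that PyTorch's built-in nn.GRU uses the complementary convention, weighting the previous hidden state by z_t.
import torch
def gru_cell_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU step following the equations above (biases omitted)."""
    concat = torch.cat([h_prev, x_t], dim=-1)      # [h_{t-1}, x_t]
    z_t = torch.sigmoid(concat @ W_z.T)            # update gate
    r_t = torch.sigmoid(concat @ W_r.T)            # reset gate
    candidate_in = torch.cat([r_t * h_prev, x_t], dim=-1)
    h_tilde = torch.tanh(candidate_in @ W_h.T)     # candidate state
    return (1 - z_t) * h_prev + z_t * h_tilde      # new hidden state
input_dim, hidden_dim = 1, 8
x_t = torch.randn(16, input_dim)                   # one time step for a batch of 16
h_prev = torch.zeros(16, hidden_dim)
W_z, W_r, W_h = (torch.randn(hidden_dim, hidden_dim + input_dim) for _ in range(3))
h_t = gru_cell_step(x_t, h_prev, W_z, W_r, W_h)
print(h_t.shape)  # torch.Size([16, 8])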
LSTM vs GRU: When to Use Which?
| Feature | LSTM | GRU |
|---|---|---|
| Parameters | More (3 gates + cell state) | Fewer (2 gates) |
| Training Speed | Slower | Faster |
| Memory Usage | Higher | Lower |
| Performance | Better on complex tasks | Comparable on many tasks |
| When to Use | Long sequences, complex patterns | Shorter sequences, faster training needed |
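The parameter gap in the table is easy to verify with plain PyTorch: for the same sizes, an LSTM layer carries four weight sets (three gates plus the cell candidate) versus the GRU's three, so the GRU has roughly 25% fewer parameters.
import torch.nn as nn
def count_params(module):
    return sum(p.numel() for p in module.parameters())
lstm_layer = nn.LSTM(input_size=1, hidden_size=64, num_layers=2, batch_first=True)
gru_layer = nn.GRU(input_size=1, hidden_size=64, num_layers=2, batch_first=True)
print(f"LSTM parameters: {count_params(lstm_layer):,}")
print(f"GRU parameters:  {count_params(gru_layer):,}")  # roughly 3/4 of the LSTM's count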
Using GRUNet and TradingGRU
Puffin provides GRUNet (the raw module) and TradingGRU (the high-level wrapper), mirroring the LSTM API exactly.
from puffin.deep.rnn import GRUNet, TradingGRU
# Low-level GRU module
gru_model = GRUNet(input_dim=1, hidden_dim=64, num_layers=2, dropout=0.2)
x = torch.randn(16, 20, 1)
output = gru_model(x)
print(f"GRUNet output shape: {output.shape}") # torch.Size([16, 1])
# High-level TradingGRU wrapper
import yfinance as yf
ticker = yf.Ticker("AAPL")
df = ticker.history(period="2y")
prices = df['Close'].values
gru = TradingGRU()
history = gru.fit(
prices,
lookback=20,
epochs=50,
lr=0.001
)
# Make predictions
gru_predictions = gru.predict(prices, steps=5)
print(f"GRU predictions: {gru_predictions}")
Comparing LSTM and GRU Side by Side
from puffin.deep.rnn import TradingLSTM, TradingGRU
# Train both on the same data
lstm = TradingLSTM()
lstm_history = lstm.fit(prices, lookback=20, epochs=50, lr=0.001)
gru = TradingGRU()
gru_history = gru.fit(prices, lookback=20, epochs=50, lr=0.001)
# Compare final validation losses
print(f"LSTM val_loss: {lstm_history['val_loss'][-1]:.6f}")
print(f"GRU val_loss: {gru_history['val_loss'][-1]:.6f}")
# Compare predictions
lstm_preds = lstm.predict(prices, steps=5)
gru_preds = gru.predict(prices, steps=5)
print(f"LSTM predictions: {lstm_preds}")
print(f"GRU predictions: {gru_preds}")
Preventing Overfitting
Financial time series are noisy and non-stationary, making overfitting one of the biggest practical challenges. The following techniques help:
Dropout
Dropout randomly zeroes hidden units during training, preventing co-adaptation.
from puffin.deep.rnn import LSTMNet
# Apply dropout between stacked LSTM layers (inter-layer dropout needs num_layers >= 2)
model = LSTMNet(input_dim=1, hidden_dim=64, num_layers=2, dropout=0.3)
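This mirrors PyTorch's own behavior: the built-in recurrent dropout is applied to the outputs of every layer except the last, so it has no effect when there is only one layer. The snippet below shows this with plain nn.LSTM, independent of the Puffin wrapper.
import torch.nn as nn
# dropout acts between stacked layers only; with num_layers=1 PyTorch emits a warning
single = nn.LSTM(input_size=1, hidden_size=64, num_layers=1, dropout=0.3, batch_first=True)
stacked = nn.LSTM(input_size=1, hidden_size=64, num_layers=2, dropout=0.3, batch_first=True)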
Early Stopping
Stop training when validation loss stops improving. The EarlyStopping callback in Puffin handles this automatically.
from puffin.deep.training import EarlyStopping
early_stop = EarlyStopping(patience=5, restore_best_weights=True)
# In a manual training loop (train_one_epoch and validate are your own helpers):
for epoch in range(100):
train_loss = train_one_epoch(model, train_loader)
val_loss = validate(model, val_loader)
if early_stop(val_loss, model):
print(f"Early stopping at epoch {epoch}")
break
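For reference, the logic behind such a callback is small. The sketch below is a minimal illustrative version, not Puffin's actual implementation: it tracks the best validation loss, counts epochs without improvement, and restores the best weights once patience runs out.
import copy
class SimpleEarlyStopping:
    """Minimal early-stopping sketch; Puffin's EarlyStopping may differ in detail."""
    def __init__(self, patience=5, restore_best_weights=True):
        self.patience = patience
        self.restore_best_weights = restore_best_weights
        self.best_loss = float('inf')
        self.counter = 0
        self.best_state = None
    def __call__(self, val_loss, model):
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.counter = 0
            if self.restore_best_weights:
                self.best_state = copy.deepcopy(model.state_dict())
            return False
        self.counter += 1
        if self.counter >= self.patience:
            if self.restore_best_weights and self.best_state is not None:
                model.load_state_dict(self.best_state)
            return True  # signal the training loop to stop
        return False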
Weight Decay (L2 Regularization)
Add a penalty proportional to the squared magnitude of model weights:
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)
Data Augmentation for Time Series
Adding small Gaussian noise to training sequences creates augmented samples that improve generalization:
import numpy as np
def add_noise(series, noise_level=0.01):
"""Add small noise to create augmented samples."""
noise = np.random.randn(len(series)) * noise_level * series.std()
return series + noise
Be careful with time series augmentation. Unlike images, even small perturbations can change the direction of returns and corrupt labels. Keep noise levels very low (0.5-1% of standard deviation).
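If your pipeline exposes the training windows directly, apply the noise only to the training inputs and leave the labels and the validation split untouched. The sketch below is hypothetical: the X_train/y_train arrays and their shapes are illustrative and not part of the TradingLSTM API.
import numpy as np
def augment_windows(X_train, y_train, n_copies=2, noise_level=0.005):
    """Create noisy copies of training windows; labels stay unchanged."""
    Xs, ys = [X_train], [y_train]
    for _ in range(n_copies):
        noise = np.random.randn(*X_train.shape) * noise_level * X_train.std()
        Xs.append(X_train + noise)
        ys.append(y_train)
    return np.concatenate(Xs), np.concatenate(ys)
# Illustrative shapes: 500 windows of length 20 with 1 feature
X_train = np.random.randn(500, 20, 1)
y_train = np.random.randn(500)
X_aug, y_aug = augment_windows(X_train, y_train)
print(X_aug.shape, y_aug.shape)  # (1500, 20, 1) (1500,)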
Hyperparameter Tuning
Systematic hyperparameter search helps find the best architecture for your data.
from sklearn.model_selection import ParameterGrid
from puffin.deep.rnn import TradingLSTM
# Define hyperparameter grid
param_grid = {
'hidden_dim': [32, 64, 128],
'num_layers': [1, 2, 3],
'dropout': [0.1, 0.2, 0.3],
'lr': [0.001, 0.0001],
'lookback': [10, 20, 30]
}
best_params = None
best_val_loss = float('inf')
for params in ParameterGrid(param_grid):
    # Architecture hyperparameters go to the wrapper's constructor
    # (this assumes TradingLSTM exposes hidden_dim, num_layers and dropout as keyword arguments)
    lstm = TradingLSTM(
        hidden_dim=params['hidden_dim'],
        num_layers=params['num_layers'],
        dropout=params['dropout']
    )
history = lstm.fit(
prices,
lookback=params['lookback'],
epochs=30,
lr=params['lr']
)
val_loss = history['val_loss'][-1]
if val_loss < best_val_loss:
best_val_loss = val_loss
best_params = params
print(f"Best parameters: {best_params}")
print(f"Best validation loss: {best_val_loss:.4f}")
For large grids, consider random search or Bayesian optimization (e.g., Optuna) instead of exhaustive grid search. This reduces computation while still covering the important regions of hyperparameter space.
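As a sketch of the Optuna approach, the objective below tunes only the two hyperparameters that fit() demonstrably accepts (lookback and lr); extend it with architecture parameters if your wrapper's constructor exposes them.
import optuna
from puffin.deep.rnn import TradingLSTM
def objective(trial):
    lookback = trial.suggest_int('lookback', 10, 30, step=10)
    lr = trial.suggest_float('lr', 1e-4, 1e-2, log=True)
    lstm = TradingLSTM()
    history = lstm.fit(prices, lookback=lookback, epochs=30, lr=lr)
    return history['val_loss'][-1]
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=25)
print(f"Best parameters: {study.best_params}")
print(f"Best validation loss: {study.best_value:.6f}")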
Evaluation and Backtesting
Walk-Forward Evaluation
Standard train/test splits can be misleading for time series. Walk-forward evaluation feeds the model an expanding window of history and predicts one step ahead at each point, simulating realistic deployment. (A full walk-forward scheme also retrains the model periodically on the expanding window; the example below trains once and only expands the prediction window, for simplicity.)
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from puffin.deep.rnn import TradingLSTM
# Train on the first 80% of observations
split = int(len(prices) * 0.8)
lstm = TradingLSTM()
lstm.fit(prices[:split], lookback=20, epochs=50)
# Walk-forward prediction on the held-out test set
test_prices = prices[split:]
predictions = []
actuals = []
for i in range(20, len(test_prices)):
pred = lstm.predict(test_prices[:i], steps=1)[0]
predictions.append(pred)
actuals.append(test_prices[i])
predictions = np.array(predictions)
actuals = np.array(actuals)
# Regression metrics
mse = mean_squared_error(actuals, predictions)
mae = mean_absolute_error(actuals, predictions)
r2 = r2_score(actuals, predictions)
print(f"MSE: {mse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"R-squared: {r2:.4f}")
# Direction accuracy (more important for trading)
direction_correct = np.sign(predictions[1:] - actuals[:-1]) == np.sign(actuals[1:] - actuals[:-1])
direction_accuracy = direction_correct.mean()
print(f"Direction Accuracy: {direction_accuracy:.2%}")
Direction accuracy above 52-53% can be profitable after transaction costs, depending on the magnitude of correct and incorrect predictions. Always compute both regression metrics and directional metrics when evaluating a trading model.
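A back-of-the-envelope calculation shows why the threshold sits around 52-53%. The average move size and transaction cost below are purely illustrative assumptions.
p_correct = 0.53          # direction accuracy
avg_move = 0.01           # assumed average absolute move per period (1%)
round_trip_cost = 0.0005  # assumed 5 bps round-trip transaction cost
# With equally sized wins and losses, the expected gross edge per trade is:
gross_edge = p_correct * avg_move - (1 - p_correct) * avg_move   # 0.0006 = 6 bps
net_edge = gross_edge - round_trip_cost                          # about 1 bp
print(f"Gross edge: {gross_edge:.4%}  Net edge: {net_edge:.4%}")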
Simple Backtesting Strategy
def backtest_lstm_strategy(prices, lstm, lookback=20, initial_capital=10000):
    """Backtest a simple long-only LSTM-based trading strategy."""
    capital = initial_capital
    position = 0
    trades = []
    for i in range(lookback, len(prices) - 1):
        # Forecast the next price from all data up to and including bar i
        pred = lstm.predict(prices[:i + 1], steps=1)[0]
        current_price = prices[i]
        predicted_return = (pred - current_price) / current_price
        if predicted_return > 0.005 and position == 0:
            position = capital / current_price
            capital = 0
            trades.append(('BUY', i, current_price, position))
        elif predicted_return < -0.005 and position > 0:
            capital = position * current_price
            trades.append(('SELL', i, current_price, capital))
            position = 0
    # Close any open position at the final price
    if position > 0:
        capital = position * prices[-1]
    total_return = (capital - initial_capital) / initial_capital
    return {
        'final_capital': capital,
        'total_return': total_return,
        'trades': trades
    }
# Backtest only on the held-out test prices to avoid evaluating on training data
results = backtest_lstm_strategy(test_prices, lstm)
print(f"Final Capital: ${results['final_capital']:.2f}")
print(f"Total Return: {results['total_return']:.2%}")
print(f"Number of Trades: {len(results['trades'])}")
Source Code
- StackedLSTM: puffin/deep/rnn.py
- GRUNet and TradingGRU: puffin/deep/rnn.py
- EarlyStopping and training_loop: puffin/deep/training.py