PPO & Trading Environment
This section covers two tightly coupled topics: the custom Gymnasium trading environment that all RL agents use, and Proximal Policy Optimization (PPO) — a state-of-the-art policy gradient method that naturally handles continuous action spaces for position sizing.
PPO was introduced by OpenAI in 2017 and has become the default algorithm for many RL applications. Unlike DQN, PPO directly optimizes the policy (actor) and value function (critic) simultaneously, making it well-suited for continuous control problems.
Custom Trading Environment
The TradingEnvironment class implements the Gymnasium interface, providing a realistic simulation for training and evaluating RL agents.
Environment Features
- Observation Space: Market features concatenated with portfolio state (position, cash, unrealized P&L)
- Action Space: Discrete (buy/hold/sell) or continuous (position sizing from -1 to +1)
- Reward Types: P&L, Sharpe ratio, or risk-adjusted returns
- Commission: Configurable transaction costs
- Position Limits: Maximum position size constraint
Basic Usage
```python
from puffin.rl.trading_env import TradingEnvironment
import numpy as np

# Simple price series
prices = np.random.randn(1000).cumsum() + 100

# Create environment with default settings
env = TradingEnvironment(
    prices=prices,
    initial_cash=100000,
    commission=0.001,      # 0.1% per trade
    max_position=1.0,      # 100% portfolio max
    reward_type='pnl',
    discrete_actions=True  # buy/hold/sell
)

# Standard Gymnasium interface
obs, info = env.reset()
print(f"Initial observation shape: {obs.shape}")
print(f"Info: {info}")

# Take actions
for _ in range(10):
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        break

env.render()
```
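A quick sanity check before training anything is to roll a random policy through one full episode and record the total reward. With random-walk prices and the P&L reward this should sit near zero minus transaction costs, and it gives a baseline to compare trained agents against. A small sketch continuing the snippet above:

```python
# Random-policy baseline: total reward over one full episode
obs, info = env.reset()
total_reward = 0.0
done = False
while not done:
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    total_reward += reward
    done = terminated or truncated
print(f"Random policy total reward: {total_reward:.2f}")
```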
Advanced Configuration with Features
For more realistic agents, pass pre-computed technical indicators as additional features.
```python
from puffin.rl.trading_env import TradingEnvironment
import pandas as pd
import numpy as np

# Load real market data
data = pd.read_csv('data/market_data.csv')
prices = data['close'].values

# Create technical indicators as features
features = np.column_stack([
    data['close'].pct_change().fillna(0).values,                          # returns
    (data['close'].rolling(20).mean() / data['close']).fillna(1).values,  # SMA ratio
    data['close'].pct_change().rolling(20).std().fillna(0).values,        # volatility
    data['volume'].rolling(20).mean().fillna(0).values                    # avg volume
])

# Create environment with continuous actions for position sizing
env = TradingEnvironment(
    prices=prices,
    features=features,
    initial_cash=100000,
    commission=0.001,
    max_position=1.0,
    reward_type='risk_adjusted',
    discrete_actions=False,  # Continuous: position size in [-1, 1]
    lookback=5               # Include 5-step history in observations
)

print(f"Observation space: {env.observation_space}")
print(f"Action space: {env.action_space}")
```
Always include transaction costs (`commission > 0`). Without them, agents learn to trade on every tick, which is unrealistic and produces misleading backtest results.
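As a back-of-the-envelope illustration of why this matters (not library code):

```python
# Each round trip costs roughly 2 * commission of the traded notional
commission = 0.001          # 0.1% per trade
round_trips_per_year = 250  # e.g. one round trip per trading day
cost_drag = 2 * commission * round_trips_per_year
print(f"Annual cost drag from commissions alone: {cost_drag:.1%}")  # ~50% at these settings
```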
Reward Functions
The choice of reward function dramatically affects learned behavior. The environment supports three built-in options.
```python
from puffin.rl.trading_env import TradingEnvironment

# P&L reward: simple profit/loss per step
# Pro: Easy to interpret. Con: Encourages excessive risk.
env_pnl = TradingEnvironment(prices, reward_type='pnl')

# Sharpe reward: risk-adjusted returns over a rolling window
# Pro: Penalizes volatility. Con: Requires enough history for a stable estimate.
env_sharpe = TradingEnvironment(prices, reward_type='sharpe')

# Risk-adjusted reward: volatility penalty + drawdown penalty
# Pro: Most robust for live trading. Con: More hyperparameters to tune.
env_risk = TradingEnvironment(prices, reward_type='risk_adjusted')
```
Start with `reward_type='sharpe'` as a reasonable default. Switch to `risk_adjusted` if the agent takes excessive drawdowns during evaluation.
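For intuition, a rolling Sharpe-style reward can be computed from recent step returns roughly as follows. This is a conceptual sketch; the environment's internal windowing and scaling may differ.

```python
import numpy as np

def rolling_sharpe_reward(step_returns, window=20, eps=1e-8):
    """Sharpe-style reward over the most recent `window` step returns."""
    recent = np.asarray(step_returns[-window:])
    if len(recent) < 2:
        return 0.0
    return float(recent.mean() / (recent.std() + eps))
```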
Custom Environment Extensions
Extend TradingEnvironment to add stop-loss logic, multi-asset support, or custom rewards.
```python
from puffin.rl.trading_env import TradingEnvironment

class TradingEnvWithStopLoss(TradingEnvironment):
    """Trading environment with automatic stop-loss."""

    def __init__(self, *args, stop_loss_pct=0.05, **kwargs):
        super().__init__(*args, **kwargs)
        self.stop_loss_pct = stop_loss_pct

    def step(self, action):
        obs, reward, terminated, truncated, info = super().step(action)
        # Terminate episode if cumulative loss exceeds threshold
        if info['cumulative_return'] < -self.stop_loss_pct:
            terminated = True
            reward -= 10  # Penalty for hitting stop-loss
        return obs, reward, terminated, truncated, info

# Use the extended environment
env = TradingEnvWithStopLoss(
    prices=prices,
    stop_loss_pct=0.05,  # 5% max drawdown
    reward_type='sharpe'
)
```
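The same subclassing pattern supports custom reward shaping without touching the environment's internals. The sketch below penalizes frequent position switching using only the documented `step()`/`reset()` interface; the `switch_penalty` knob is illustrative, not a library parameter.

```python
from puffin.rl.trading_env import TradingEnvironment

class TradingEnvWithTurnoverPenalty(TradingEnvironment):
    """Discrete-action environment that penalizes switching actions too often."""

    def __init__(self, *args, switch_penalty=0.01, **kwargs):
        super().__init__(*args, **kwargs)
        self.switch_penalty = switch_penalty
        self._last_action = None

    def reset(self, **kwargs):
        self._last_action = None
        return super().reset(**kwargs)

    def step(self, action):
        obs, reward, terminated, truncated, info = super().step(action)
        # Subtract a small penalty whenever the discrete action changes
        if self._last_action is not None and action != self._last_action:
            reward -= self.switch_penalty
        self._last_action = action
        return obs, reward, terminated, truncated, info
```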
Proximal Policy Optimization (PPO)
PPO is a policy gradient method that directly optimizes the policy (mapping states to action probabilities) rather than learning Q-values. Its key innovation is a clipped surrogate objective that prevents destructively large policy updates.
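Concretely, writing r_t(θ) for the ratio between the new and old policy probabilities of the taken action, and Â_t for the advantage estimate, PPO maximizes the clipped surrogate objective from Schulman et al. (2017):

$$
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\; \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\right],
\qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}
$$

Here ε corresponds to the `clip_range` parameter used below: once the probability ratio leaves [1 − ε, 1 + ε], the objective stops rewarding further movement in that direction.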
Why PPO for Trading?
- Stability: The clipped objective bounds how much the policy can change per update
- Sample Efficiency: Reuses data through multiple optimization epochs per batch
- Continuous Actions: Naturally handles position sizing without discretization
- Proven Performance: Consistently performs well across diverse RL benchmarks
PPOTrader with stable-baselines3
The PPOTrader class wraps stable-baselines3’s PPO implementation with trading-specific defaults and evaluation methods.
```python
from puffin.rl.ppo_agent import PPOTrader, TradingCallback
from puffin.rl.trading_env import TradingEnvironment
import numpy as np

# Create trading environment
prices = np.random.randn(2000).cumsum() + 100
env = TradingEnvironment(
    prices=prices,
    discrete_actions=True,
    reward_type='sharpe'
)

# Create PPO agent
agent = PPOTrader(
    env=env,
    policy='MlpPolicy',
    learning_rate=3e-4,
    n_steps=2048,      # Steps per rollout before update
    batch_size=64,     # Minibatch size for optimization
    n_epochs=10,       # Optimization epochs per rollout
    gamma=0.99,        # Discount factor
    gae_lambda=0.95,   # GAE lambda for advantage estimation
    clip_range=0.2,    # PPO clipping parameter
    ent_coef=0.01,     # Entropy bonus for exploration
    verbose=1
)

# Train with callback for logging
callback = TradingCallback(verbose=1, log_freq=1000)
agent.train(total_timesteps=100000, callback=callback)

# Save model
agent.save('models/ppo_trading.zip')
```
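If `PPOTrader.train` forwards its `callback` argument to stable-baselines3's `learn()` (which the `TradingCallback` usage above suggests), standard stable-baselines3 callbacks such as `EvalCallback` can be used as well, for example to evaluate periodically on a separate validation environment and keep the best checkpoint. A sketch under that assumption, continuing from the training snippet:

```python
from stable_baselines3.common.callbacks import EvalCallback
import numpy as np

# Separate validation environment (here: fresh synthetic prices)
val_prices = np.random.randn(500).cumsum() + 100
val_env = TradingEnvironment(prices=val_prices, discrete_actions=True, reward_type='sharpe')

eval_callback = EvalCallback(
    val_env,
    best_model_save_path='models/ppo_best',
    eval_freq=10_000,
    n_eval_episodes=5,
    deterministic=True
)
agent.train(total_timesteps=100_000, callback=eval_callback)
```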
Evaluating the PPO Agent
```python
from puffin.rl.ppo_agent import PPOTrader
from puffin.rl.trading_env import TradingEnvironment
import numpy as np

# Load trained agent (`env` is the training environment from the previous example)
agent = PPOTrader.load_from_path('models/ppo_trading.zip', env)

# Evaluate on unseen test data (same configuration as the training environment)
test_prices = np.random.randn(500).cumsum() + 100
test_env = TradingEnvironment(
    prices=test_prices,
    discrete_actions=True,
    reward_type='sharpe'
)

results = agent.evaluate(test_env, n_episodes=10, deterministic=True)
print(f"Mean Reward: {results['mean_reward']:.2f}")
print(f"Mean Return: {results['mean_return']:.2%}")
print(f"Mean Trades: {results['mean_trades']:.1f}")
```
Custom Policy Network
For more control over the agent’s architecture, define a custom feature extractor.
```python
from puffin.rl.ppo_agent import PPOTrader
from puffin.rl.trading_env import TradingEnvironment
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor
import torch.nn as nn
import gymnasium as gym

class CustomTradingNetwork(BaseFeaturesExtractor):
    """Custom feature extractor with dropout for regularization."""

    def __init__(self, observation_space: gym.Space, features_dim: int = 128):
        super().__init__(observation_space, features_dim)
        input_dim = observation_space.shape[0]
        self.network = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, features_dim),
            nn.ReLU()
        )

    def forward(self, observations):
        return self.network(observations)

# Use custom network with PPO (`env` is the trading environment created earlier)
policy_kwargs = dict(
    features_extractor_class=CustomTradingNetwork,
    features_extractor_kwargs=dict(features_dim=128),
)

agent = PPOTrader(
    env=env,
    policy='MlpPolicy',
    policy_kwargs=policy_kwargs,
    learning_rate=3e-4
)
```
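If you only need to change the sizes of the actor and critic heads, stable-baselines3 also accepts a `net_arch` entry in `policy_kwargs`, which avoids writing a feature extractor (the exact `net_arch` format depends on the stable-baselines3 version):

```python
# Separate 2-layer MLPs for the policy (pi) and value function (vf)
policy_kwargs = dict(net_arch=dict(pi=[128, 64], vf=[128, 64]))

agent = PPOTrader(
    env=env,
    policy='MlpPolicy',
    policy_kwargs=policy_kwargs,
    learning_rate=3e-4
)
```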
Comprehensive Agent Evaluation
Evaluate and compare all agent types side by side to understand their strengths and weaknesses.
```python
from puffin.rl.evaluation import (
    evaluate_agent,
    compare_agents,
    plot_episode_rewards,
    plot_drawdown,
    plot_comparison
)
from puffin.rl.dqn import DQNAgent, DDQNAgent
from puffin.rl.ppo_agent import PPOTrader
from puffin.rl.trading_env import TradingEnvironment
import matplotlib.pyplot as plt
import numpy as np

# Create environment
env = TradingEnvironment(prices, features)

# Train multiple agents
dqn = DQNAgent(state_dim=4, action_dim=3)
dqn.train(env, episodes=300)

ddqn = DDQNAgent(state_dim=4, action_dim=3)
ddqn.train(env, episodes=300)

ppo = PPOTrader(env)
ppo.train(total_timesteps=50000)

# Compare all agents
agents = {'DQN': dqn, 'DDQN': ddqn, 'PPO': ppo}
comparison = compare_agents(agents, env, n_episodes=50)
print(comparison)

# Plot comparison
fig = plot_comparison(comparison, metric='mean_return')
plt.savefig('results/agent_comparison.png')
plt.show()
```
Evaluation Metrics
```python
from puffin.rl.evaluation import evaluate_agent

results = evaluate_agent(
    agent=agent,
    env=env,
    n_episodes=100,
    deterministic=True
)

print(f"Mean Reward: {results['mean_reward']:.2f}")
print(f"Std Reward: {results['std_reward']:.2f}")
print(f"Cumulative P&L: {results['cumulative_pnl']:.2f}")
print(f"Mean Return: {results['mean_return']:.2%}")
print(f"Sharpe Ratio: {results['sharpe']:.2f}")
print(f"Max Drawdown: {results['max_drawdown']:.2%}")
print(f"Win Rate: {results['win_rate']:.2%}")
```
Best Practices
Feature Engineering
Good features are critical for RL agent performance. Combine price-based and volume-based signals.
```python
import pandas as pd
import numpy as np

def create_features(data):
    """Create comprehensive feature set for RL agents."""
    features = pd.DataFrame()

    # Price-based
    features['returns'] = data['close'].pct_change()
    features['log_returns'] = np.log(data['close'] / data['close'].shift(1))

    # Moving averages
    features['sma_20'] = data['close'].rolling(20).mean()
    features['sma_50'] = data['close'].rolling(50).mean()
    features['sma_ratio'] = features['sma_20'] / features['sma_50']

    # Volatility
    features['volatility'] = data['close'].pct_change().rolling(20).std()

    # Volume
    features['volume_ma'] = data['volume'].rolling(20).mean()
    features['volume_ratio'] = data['volume'] / features['volume_ma']

    # Back-fill the leading NaNs produced by rolling-window warm-up
    return features.bfill().values
```
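Wiring the helper into the environment is then straightforward; the sketch below assumes the same CSV columns as the earlier example:

```python
import pandas as pd
from puffin.rl.trading_env import TradingEnvironment

data = pd.read_csv('data/market_data.csv')
features = create_features(data)

env = TradingEnvironment(
    prices=data['close'].values,
    features=features,
    commission=0.001,
    reward_type='risk_adjusted'
)
```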
Training Strategies
```python
from puffin.rl.dqn import DQNAgent
from puffin.rl.trading_env import TradingEnvironment
import numpy as np

# Curriculum learning: start simple (short series), then increase complexity.
# Keep the feature set fixed so the observation dimension does not change.
simple_env = TradingEnvironment(prices[:500], features=features[:500])
agent = DQNAgent(state_dim, action_dim=3)  # state_dim must match the environment's observation size
agent.train(simple_env, episodes=100)

complex_env = TradingEnvironment(prices, features=features)
agent.train(complex_env, episodes=400)

# Early stopping on validation performance
best_reward = -np.inf
patience = 50
patience_counter = 0

for episode in range(1000):
    # ... training loop ...
    avg_reward = np.mean(recent_rewards)  # rewards collected over recent episodes in the loop above
    if avg_reward > best_reward:
        best_reward = avg_reward
        patience_counter = 0
        agent.save('best_model.pt')
    else:
        patience_counter += 1
    if patience_counter >= patience:
        print("Early stopping triggered")
        break
```
Always split data into train/validation/test sets. Train on the first 60%, validate during training on the next 20%, and report final results only on the last 20%. Never tune hyperparameters on test data.
Validation and Testing
```python
from puffin.rl.evaluation import evaluate_agent
from puffin.rl.trading_env import TradingEnvironment

# Split data temporally (never shuffle time series!)
train_prices = prices[:int(0.6 * len(prices))]
val_prices = prices[int(0.6 * len(prices)):int(0.8 * len(prices))]
test_prices = prices[int(0.8 * len(prices)):]

# Train
train_env = TradingEnvironment(train_prices)
agent.train(train_env, episodes=500)

# Validate during training
val_env = TradingEnvironment(val_prices)
val_results = evaluate_agent(agent, val_env, n_episodes=10)
print(f"Validation return: {val_results['mean_return']:.2%}")

# Final test on held-out data
test_env = TradingEnvironment(test_prices)
test_results = evaluate_agent(agent, test_env, n_episodes=20)
print(f"Test return: {test_results['mean_return']:.2%}")
print(f"Test Sharpe: {test_results['sharpe']:.2f}")
```
Risk Management
```python
from puffin.rl.trading_env import TradingEnvironment

# Limit position size to prevent concentration risk
env = TradingEnvironment(
    prices=prices,
    max_position=0.3,            # Max 30% of portfolio
    commission=0.001,
    reward_type='risk_adjusted'  # Penalizes volatility and drawdowns
)
```
Combine position limits with risk-adjusted rewards for the most robust agents. An agent trained with `risk_adjusted` rewards and `max_position=0.3` will naturally learn conservative position sizing.
Source Code
| File | Description |
|---|---|
| `puffin/rl/trading_env.py` | `TradingEnvironment` Gymnasium environment |
| `puffin/rl/ppo_agent.py` | `PPOTrader` and `TradingCallback` classes |
| `puffin/rl/evaluation.py` | `evaluate_agent`, `compare_agents`, and plotting utilities |
| `tests/rl/test_q_learning.py` | Integration tests for RL agents and environment |
References
- Schulman et al. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347
- stable-baselines3 documentation
- Brockman et al. (2016). OpenAI Gym. arXiv:1606.01540