Benchmark Comparison
Overview
A strategy that returns 15% annually sounds impressive – until you learn the S&P 500 returned 20% over the same period. Benchmark comparison is the practice of measuring your strategy’s performance relative to a passive alternative so you can determine whether the complexity of active management is justified.
The BenchmarkComparison class in Puffin computes the standard set of relative performance metrics and provides visualization tools for cumulative returns, excess returns, and rolling statistics.
The most common benchmark for US equity strategies is the SPDR S&P 500 ETF (SPY). For sector-specific or international strategies, choose a benchmark that reflects the investable universe of your strategy.
Core Metrics
Before diving into code, it helps to understand what each metric tells you:
| Metric | Formula | Interpretation |
|---|---|---|
| Alpha | mean(strategy) - beta * mean(benchmark) |
Excess return not explained by market exposure |
| Beta | cov(strategy, benchmark) / var(benchmark) |
Sensitivity to benchmark moves |
| Information Ratio | mean(excess) / std(excess) |
Risk-adjusted outperformance (higher is better) |
| Tracking Error | std(excess) |
Volatility of the return difference |
| Correlation | corr(strategy, benchmark) |
How closely strategy follows benchmark |
| Outperformance | sum(strategy) - sum(benchmark) |
Total cumulative excess return |
An information ratio above 0.5 is generally considered good; above 1.0 is exceptional. Most active managers struggle to sustain an IR above 0.5 after costs.
Basic Comparison
Pass strategy and benchmark return series to compare() to get all metrics in a single dictionary.
from puffin.monitor.benchmark import BenchmarkComparison
import pandas as pd
bc = BenchmarkComparison()
# Your strategy returns
strategy_returns = pd.Series([0.01, 0.02, -0.01, 0.03, 0.015])
# Benchmark returns (e.g., SPY)
benchmark_returns = pd.Series([0.005, 0.015, -0.005, 0.02, 0.01])
# Compare
metrics = bc.compare(strategy_returns, benchmark_returns)
print(f"Alpha: {metrics['alpha']:.4f}")
print(f"Beta: {metrics['beta']:.4f}")
print(f"Information Ratio: {metrics['ir']:.4f}")
print(f"Tracking Error: {metrics['tracking_error']:.4f}")
print(f"Correlation: {metrics['correlation']:.4f}")
print(f"Outperformance: {metrics['outperformance']:.2f}%")
Both series must be aligned on the same index (dates). The compare() method drops rows where either series has NaN, but misaligned indices will silently produce wrong results. Always verify alignment before calling compare().
Interpreting the Results
Alpha and Beta
Alpha and beta come from the capital asset pricing model (CAPM). Beta tells you how much your strategy moves when the benchmark moves one percent. Alpha tells you how much excess return your strategy generates after adjusting for that market exposure.
# A beta of 1.2 means:
# If SPY goes up 1%, your strategy goes up ~1.2%
# If SPY goes down 1%, your strategy goes down ~1.2%
# A positive alpha means:
# Your strategy earns more than what beta alone would predict
Information Ratio and Tracking Error
The information ratio normalises your excess return by its volatility. A high IR means you are consistently beating the benchmark, not just occasionally getting lucky. The tracking error tells you how “different” your strategy is from the benchmark.
# High tracking error + high IR = actively different strategy that works
# High tracking error + low IR = actively different but not adding value
# Low tracking error + any IR = closet indexing (not much differentiation)
Visualization
The plot_comparison() method produces a four-panel figure: cumulative returns, excess returns, rolling correlation, and a metrics summary table.
import matplotlib.pyplot as plt
# Plot comparison
fig = bc.plot_comparison(strategy_returns, benchmark_returns)
plt.show()
The comparison plot includes:
- Cumulative returns – strategy vs benchmark equity curves on the same axes
- Excess returns – bar chart of per-period outperformance or underperformance
- Rolling correlation – 20-period rolling window showing regime changes
- Performance metrics – text summary of all computed statistics
Save the figure to disk for inclusion in client reports or strategy review decks: fig.savefig('benchmark_comparison.png', dpi=150, bbox_inches='tight').
Rolling Metrics
Point-in-time metrics can be misleading because they depend on the start and end dates. Rolling metrics show how alpha, beta, and the information ratio evolve over time, helping you detect strategy decay or regime changes.
# Calculate rolling performance metrics
rolling = bc.rolling_metrics(
strategy_returns,
benchmark_returns,
window=60 # 60-day rolling window
)
print(rolling.tail())
Plotting Rolling Alpha
import matplotlib.pyplot as plt
rolling['alpha'].plot(figsize=(12, 6))
plt.title('Rolling 60-Day Alpha')
plt.ylabel('Alpha')
plt.axhline(y=0, color='black', linestyle='--', alpha=0.3)
plt.grid(True, alpha=0.3)
plt.show()
Plotting Rolling Beta
rolling['beta'].plot(figsize=(12, 6))
plt.title('Rolling 60-Day Beta')
plt.ylabel('Beta')
plt.axhline(y=1, color='black', linestyle='--', alpha=0.3)
plt.grid(True, alpha=0.3)
plt.show()
A rolling beta that drifts significantly from its historical average may indicate that your strategy is changing its market exposure, which could be intentional (dynamic hedging) or a bug.
Integration with P&L Tracker
In practice, you derive your strategy returns from the PnLTracker and fetch benchmark returns from a data provider.
from puffin.monitor.pnl import PnLTracker
from puffin.monitor.benchmark import BenchmarkComparison
import pandas as pd
# Assume tracker has accumulated history
tracker = PnLTracker(initial_cash=100_000.0)
# ... trades and updates happen here ...
# Derive daily strategy returns from P&L history
strategy_returns = tracker.daily_pnl() / tracker.initial_cash
# Fetch benchmark returns (e.g., from yfinance)
# benchmark_returns = get_spy_returns()
benchmark_returns = pd.Series([0.005, 0.01, -0.003, 0.008, 0.012])
bc = BenchmarkComparison()
metrics = bc.compare(strategy_returns, benchmark_returns)
print(f"Alpha: {metrics['alpha']:.4f}")
print(f"Sharpe vs Benchmark (IR): {metrics['ir']:.2f}")
System Health Monitoring
Beyond benchmark comparison, Puffin includes a SystemHealth class for monitoring the operational health of your data feeds and broker connections.
Data Feed Health
from puffin.monitor.health import SystemHealth
# Create health monitor with alert callback
def alert_callback(message, level):
print(f"[{level.upper()}] {message}")
# Could send email, Slack message, etc.
health = SystemHealth(alert_callback=alert_callback)
# Check data feed
class DataProvider:
def get_latest_timestamp(self):
from datetime import datetime, timedelta
return datetime.now() - timedelta(seconds=30)
provider = DataProvider()
status = health.check_data_feed(provider)
print(f"Data feed status: {status['status']}")
print(f"Latency: {status['latency']:.1f}s")
Broker Connection Health
# Check broker connection
class Broker:
def is_connected(self):
return True
def last_heartbeat(self):
from datetime import datetime
return datetime.now()
broker = Broker()
status = health.check_broker_connection(broker)
print(f"Broker status: {status['status']}")
print(f"Connected: {status['connected']}")
Custom Alerts and Overall Health
# Send custom alerts
health.alert("Low liquidity detected", level='warning')
health.alert("Order execution failed", level='error')
health.alert("Daily loss limit reached", level='critical')
# Get overall system health
overall = health.get_overall_health()
print(f"Overall status: {overall['status']}")
print(f"Checks: {overall['checks']}")
Always wire up an alert_callback in production. The default behaviour is to log only; without a callback you will miss critical alerts if you are not watching the log stream.
Best Practices
- Benchmark Selection
- Choose a benchmark that reflects your investable universe
- For long-short strategies, use a risk-free rate or zero as the benchmark
- For sector strategies, use the corresponding sector ETF
- Rolling Windows
- Use a window large enough for statistical significance (60+ observations)
- Track rolling metrics daily and flag significant regime changes
- Monitor rolling correlation for unexpected decorrelation events
- System Health
- Check data feed and broker health on every bar
- Set up alert notifications (email, Slack, SMS) for production
- Monitor data feed latency – stale data leads to stale signals
- Track broker connection status and heartbeat intervals
Source Code
The benchmark comparison and system health implementations live in the following files:
puffin/monitor/benchmark.py–BenchmarkComparisonwithcompare(),plot_comparison(), androlling_metrics()puffin/monitor/health.py–SystemHealthwith data feed checks, broker checks, and alerting