Data Providers
Overview
A data provider is any service that delivers market data to your trading system – historical bars, real-time quotes, or streaming trades. Different providers have different strengths: yfinance is free but limited to delayed daily data, while Alpaca offers real-time streaming suitable for live trading. The key design decision is to abstract the data source behind a common interface so you can swap providers without rewriting your strategies.
The DataProvider Interface
We define an abstract base class that every data provider must implement. This is the Strategy pattern – it lets us add new data sources (Polygon.io, Binance, Interactive Brokers, etc.) without modifying any consuming code:
from puffin.data.provider import DataProvider
The interface looks like this:
from abc import ABC, abstractmethod
from datetime import datetime
import pandas as pd
class DataProvider(ABC):
"""Base class for market data providers."""
@abstractmethod
def fetch_historical(
self,
symbols: str | list[str],
start: str | datetime,
end: str | datetime | None = None,
interval: str = "1d",
) -> pd.DataFrame:
"""Fetch historical OHLCV data.
Args:
symbols: Ticker symbol(s) to fetch.
start: Start date.
end: End date (defaults to today).
interval: Bar interval (e.g., '1m', '5m', '1h', '1d').
Returns:
DataFrame with columns: Open, High, Low, Close, Volume.
MultiIndex (Date, Symbol) for multiple symbols.
"""
@abstractmethod
def get_supported_assets(self) -> list[str]:
"""Return list of supported asset types."""
def stream_realtime(self, symbols: list[str], callback):
"""Stream real-time market data. Optional."""
raise NotImplementedError
This pattern is called the Strategy pattern (not to be confused with trading strategies). It lets us add new data sources (Polygon.io, Binance, etc.) without modifying existing code.
Every provider guarantees that fetch_historical returns a pandas DataFrame with OHLCV columns (Open, High, Low, Close, Volume). When you request multiple symbols, the result uses a MultiIndex of (Date, Symbol) so you can easily slice by ticker.
The stream_realtime method is optional – providers that do not support streaming will raise NotImplementedError. This keeps the interface honest: you can check at runtime whether a provider supports real-time data rather than discovering the limitation in production.
Provider Architecture
flowchart TB
A[DataProvider ABC] --> B[YFinanceProvider]
A --> C[AlpacaProvider]
A --> D[Future: Polygon, Binance, ...]
B --> E["fetch_historical()"]
C --> E
C --> F["stream_realtime()"]
E --> G["DataFrame (OHLCV)"]
F --> H["Callback (tick data)"]
classDef abc fill:#1a3a5c,stroke:#0d2137,color:#e8e0d4
classDef impl fill:#2d5016,stroke:#1a3a1a,color:#e8e0d4
classDef method fill:#6b2d5b,stroke:#4a1e3f,color:#e8e0d4
classDef output fill:#8b4513,stroke:#5c2e0d,color:#e8e0d4
class A abc
class B,C,D impl
class E,F method
class G,H output
Fetching Data with yfinance
The simplest provider uses yfinance – free, no API key required, and covers equities, ETFs, indices, and crypto:
from puffin.data import YFinanceProvider
provider = YFinanceProvider()
# Single ticker
aapl = provider.fetch_historical("AAPL", start="2023-01-01", end="2024-01-01")
print(aapl.head())
The returned DataFrame looks like this:
Open High Low Close Volume
Date
2023-01-03 130.279999 130.899994 124.169998 125.070000 112117500
2023-01-04 126.889999 128.660004 125.080002 126.360001 89113600
2023-01-05 127.129997 127.769997 124.760002 125.019997 80962700
...
You can also fetch multiple tickers in a single call:
# Multiple tickers
data = provider.fetch_historical(
["AAPL", "MSFT", "GOOGL"],
start="2023-01-01",
end="2024-01-01"
)
yfinance is rate-limited and occasionally returns incomplete data. Always validate what you receive before using it in a backtest. The caching layer (covered in the next chapter) eliminates redundant API calls and protects you from rate limits.
Interval Options
The interval parameter controls bar granularity:
| Interval | Description | Max History |
|---|---|---|
1m |
1-minute bars | ~7 days |
5m |
5-minute bars | ~60 days |
1h |
Hourly bars | ~730 days |
1d |
Daily bars | Full history |
1wk |
Weekly bars | Full history |
Intraday data from yfinance has limited history. If you need months of minute-level data, you will need a paid provider like Polygon.io or Alpaca’s historical API.
Real-Time Data with Alpaca
For paper and live trading, you need real-time data. The AlpacaProvider connects to Alpaca’s WebSocket API and delivers streaming bars, quotes, and trades:
from puffin.data.alpaca_provider import AlpacaProvider
provider = AlpacaProvider() # Reads API keys from .env
# Stream real-time bars
def on_update(symbol, price, volume, timestamp):
print(f"{symbol}: ${price:.2f} ({volume} shares) at {timestamp}")
provider.stream_realtime(["AAPL", "MSFT"], callback=on_update)
The AlpacaProvider also implements fetch_historical, so you can use the same interface for both historical backtesting and live trading:
# Historical data via Alpaca (useful for verifying data quality against yfinance)
alpaca_data = provider.fetch_historical("AAPL", start="2023-01-01", end="2024-01-01")
Alpaca requires API keys. Create a free paper trading account at alpaca.markets and store your keys in a .env file. Never commit API keys to version control.
Comparing Providers
| Feature | YFinanceProvider | AlpacaProvider |
|---|---|---|
| Cost | Free | Free (paper), paid (live) |
| API key required | No | Yes |
| Historical data | Full history (daily) | 5+ years |
| Real-time streaming | No | Yes (WebSocket) |
| Asset coverage | Global equities, ETFs, crypto | US equities, crypto |
| Rate limits | Yes (undocumented) | 200 req/min |
| Best for | Research, backtesting | Paper trading, live trading |
Adding a Custom Provider
To add a new data source, simply subclass DataProvider:
from puffin.data.provider import DataProvider
class PolygonProvider(DataProvider):
def fetch_historical(self, symbols, start, end, interval="1d"):
# Your Polygon.io API logic here
...
def get_supported_assets(self):
return ["equity", "option", "crypto", "forex"]
def stream_realtime(self, symbols, callback):
# WebSocket streaming logic
...
No other code in the system needs to change. Your strategies, backtester, and risk manager all work through the DataProvider interface.
Exercises
- Fetch 5 years of daily SPY data using
YFinanceProviderand examine the data quality – how many missing days are there? Are there any outlier returns? - Write a
CSVProviderthat reads OHLCV data from local CSV files. Implementfetch_historicalandget_supported_assets. - Fetch the same ticker and date range from both
YFinanceProviderandAlpacaProvider(if you have an API key). Compare the close prices – are they identical? What accounts for the differences?
Summary
- The
DataProviderabstract interface decouples your trading system from any specific data source YFinanceProvidergives free historical data with no setup – ideal for research and backtestingAlpacaProvideradds real-time streaming for paper and live trading- New providers can be added by subclassing
DataProviderwithout changing existing code
Next Steps
With data flowing into your system, the next question is: where do you store it? In the next chapter, you will build a caching and storage layer that eliminates redundant API calls and persists data efficiently.