Part 2: Data Pipeline
Every trading system starts with data. Before you can build strategies, train models, or execute trades, you need a reliable pipeline that fetches market data, stores it efficiently, and cleans it for downstream consumption. A poorly built data pipeline leads to stale prices, missing bars, and ultimately wrong trading decisions.
This section walks you through building a production-quality data pipeline from scratch. You will implement a pluggable provider interface that abstracts away the data source, a caching layer that eliminates redundant API calls, and a preprocessing module that handles the messy realities of real-world market data.
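As a rough illustration of how a pluggable provider and a caching layer fit together, here is a minimal sketch that uses only the standard library. The `fetch_daily` method, the `FakeProvider` stand-in, the `CachedProvider` wrapper, and the `bars` table layout are all assumptions made for this example; the actual `DataProvider` interface in `puffin/data/` may look quite different. The completeness check also counts calendar days, which a real cache would need to replace with a trading calendar.

```python
from __future__ import annotations

import sqlite3
from abc import ABC, abstractmethod
from datetime import date, timedelta


class DataProvider(ABC):
    """Hypothetical minimal interface; the real DataProvider may differ."""

    @abstractmethod
    def fetch_daily(self, symbol: str, start: date, end: date) -> list[tuple[str, float]]:
        """Return (ISO date, close) rows for start <= day <= end."""


class FakeProvider(DataProvider):
    """Stand-in for a real source such as yfinance or Alpaca; counts its calls."""

    def __init__(self) -> None:
        self.calls = 0

    def fetch_daily(self, symbol, start, end):
        self.calls += 1
        days = (end - start).days + 1
        # Synthetic prices: one bar per calendar day, rising by 1.0 each day.
        return [((start + timedelta(days=i)).isoformat(), 100.0 + i) for i in range(days)]


class CachedProvider(DataProvider):
    """Wraps any provider and stores its rows in SQLite, keyed by (symbol, day)."""

    def __init__(self, inner: DataProvider, path: str = ":memory:") -> None:
        self.inner = inner
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS bars ("
            "symbol TEXT, day TEXT, close REAL, PRIMARY KEY (symbol, day))"
        )

    def fetch_daily(self, symbol, start, end):
        rows = self.db.execute(
            "SELECT day, close FROM bars "
            "WHERE symbol = ? AND day BETWEEN ? AND ? ORDER BY day",
            (symbol, start.isoformat(), end.isoformat()),
        ).fetchall()
        # Simplification: treat the range as cached only if every calendar day is present.
        if len(rows) == (end - start).days + 1:
            return rows
        fresh = self.inner.fetch_daily(symbol, start, end)
        self.db.executemany(
            "INSERT OR REPLACE INTO bars VALUES (?, ?, ?)",
            [(symbol, day, close) for day, close in fresh],
        )
        self.db.commit()
        return fresh


provider = FakeProvider()
cached = CachedProvider(provider)
first = cached.fetch_daily("AAPL", date(2024, 1, 1), date(2024, 1, 5))
second = cached.fetch_daily("AAPL", date(2024, 1, 1), date(2024, 1, 5))
```

Because `CachedProvider` is itself a `DataProvider`, downstream code does not need to know whether a request was served from the cache or from the underlying source. In this sketch the second call is answered from SQLite, so `provider.calls` stays at 1.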
Data Pipeline Architecture
flowchart LR
A[Market Source] --> B{DataProvider}
B --> C[YFinanceProvider]
B --> D[AlpacaProvider]
C --> E[SQLite Cache]
D --> E
E --> F[Preprocessing]
F --> G[Missing Values]
F --> H[Outlier Detection]
F --> I[Validation]
G --> J[Clean DataFrame]
H --> J
I --> J
J --> K[Strategies & Models]
classDef source fill:#1a3a5c,stroke:#0d2137,color:#e8e0d4
classDef provider fill:#2d5016,stroke:#1a3a1a,color:#e8e0d4
classDef cache fill:#6b2d5b,stroke:#4a1e3f,color:#e8e0d4
classDef process fill:#8b4513,stroke:#5c2e0d,color:#e8e0d4
classDef output fill:#2d5050,stroke:#1a3a3a,color:#e8e0d4
class A source
class B,C,D provider
class E cache
class F,G,H,I process
class J,K output
Chapters
- Data Providers – The DataProvider interface, yfinance for free historical data, and Alpaca for real-time streaming
- Caching & Storage – SQLite cache for fast lookups, HDF5 and Parquet for long-term storage with MarketDataStore
- Preprocessing – Handling missing values, outlier detection, data validation, and the preprocessing pipeline
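The preprocessing steps listed above can be sketched on a plain list of closing prices. This is a standard-library illustration only: the function names `forward_fill`, `flag_outliers`, `validate`, and `preprocess` are hypothetical, the median/MAD rule is one possible outlier test among several, and the real preprocessing module presumably works on DataFrames and may apply different rules (for example, testing returns instead of price levels).

```python
from __future__ import annotations

import math
from statistics import median


def forward_fill(closes: list[float | None]) -> list[float | None]:
    """Replace None/NaN with the last valid value; leading gaps stay None."""
    out: list[float | None] = []
    last = None
    for x in closes:
        if x is None or math.isnan(x):
            out.append(last)
        else:
            out.append(x)
            last = x
    return out


def flag_outliers(closes: list[float | None], threshold: float = 5.0) -> list[bool]:
    """Flag points farther than threshold * MAD from the median (a crude test)."""
    valid = [x for x in closes if x is not None]
    med = median(valid)
    mad = median([abs(x - med) for x in valid])
    if mad == 0:
        return [False] * len(closes)
    return [x is not None and abs(x - med) > threshold * mad for x in closes]


def validate(closes: list[float | None]) -> None:
    """Raise ValueError if any price is missing or non-positive."""
    for i, x in enumerate(closes):
        if x is None or x <= 0:
            raise ValueError(f"invalid price at index {i}: {x!r}")


def preprocess(closes: list[float | None]) -> list[float]:
    """Fill gaps, mask flagged outliers, refill the masked points, then validate."""
    filled = forward_fill(closes)
    flags = flag_outliers(filled)
    masked = [None if bad else x for x, bad in zip(filled, flags)]
    cleaned = forward_fill(masked)
    validate(cleaned)
    return cleaned


raw = [100.0, None, 101.0, 1000.0, 102.0, 99.0]
clean = preprocess(raw)  # the gap is filled and the 1000.0 spike is replaced
```

Running validation last means that anything the earlier steps could not repair, such as a gap at the very start of the series, surfaces as an error instead of silently reaching a strategy.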
Notebook: Run the examples interactively in data_pipeline.ipynb
Related Chapters
- Part 3: Alternative Data – Alternative data enters the pipeline as an additional, non-traditional source
- Part 4: Alpha Factors – Alpha factors consume clean data produced by the pipeline
- Part 7: Backtesting – The backtesting engine relies on historical data delivered by the pipeline
Source Code
Browse the implementation: puffin/data/