Synthetic Data with GANs

Overview

Generative Adversarial Networks (GANs) generate realistic synthetic data by training two neural networks in competition. In trading, GANs create synthetic market data for backtesting, data augmentation, and stress testing – addressing the fundamental challenge that historical data is always finite while the space of possible market outcomes is vast.

Chapter 21 of Machine Learning for Algorithmic Trading covers GANs for synthetic time-series data, including TimeGAN. This part implements those techniques in the puffin.deep.gan module.

Why Synthetic Data for Trading?

Financial ML suffers from a chronic data shortage. Markets generate one day of data per trading day – you cannot accelerate that. GANs offer a principled approach to creating additional realistic samples:

  • Data augmentation: Generate additional training examples for rare events (crashes, flash crashes, extreme volatility)
  • Stress testing: Test strategies under hypothetical scenarios not seen in the historical record
  • Privacy: Share synthetic data without exposing proprietary positions or alpha signals
  • Research: Experiment with different market dynamics to test strategy robustness

Synthetic data is only as good as its evaluation. Always validate generated samples against real data using the statistical tests covered in the evaluation sub-page.
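
As a first pass, a two-sample Kolmogorov–Smirnov test per feature catches mismatched marginals. A minimal sketch using scipy (the array names and same-width 2-D layout are illustrative assumptions):

import numpy as np
from scipy.stats import ks_2samp

def marginal_ks_pvalues(real: np.ndarray, synthetic: np.ndarray) -> list[float]:
    """Two-sample KS p-value for each feature column (2-D arrays, same width)."""
    return [ks_2samp(real[:, j], synthetic[:, j]).pvalue for j in range(real.shape[1])]

# p-values near zero flag features whose synthetic marginal diverges from the real one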

GAN Training Pipeline

graph TD
    A[Historical Market Data] --> B[Preprocessing & Normalization]
    B --> C[Sequence Windowing]
    C --> D[Embedder]
    D --> E[Latent Space]

    F[Random Noise z] --> G[Generator]
    G --> H[Fake Latent Sequences]

    E --> I[Supervisor]
    I --> J[Temporal Dynamics]

    H --> K[Discriminator]
    E --> K
    K --> L{Real or Fake?}

    L -->|Feedback| G
    L -->|Feedback| K

    E --> M[Recovery Network]
    M --> N[Synthetic Market Data]

    classDef data fill:#2d5016,stroke:#1a3a1a,color:#e8e0d4
    classDef network fill:#1a3a5c,stroke:#0d2137,color:#e8e0d4
    classDef latent fill:#6b2d5b,stroke:#3d1a35,color:#e8e0d4
    classDef output fill:#8b4513,stroke:#5c2e0d,color:#e8e0d4

    class A,B,C data
    class D,G,I,K,M network
    class E,F,H,J latent
    class L,N output

    linkStyle default stroke:#4a5568,stroke-width:2px
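
The first two stages of the pipeline, normalization and sequence windowing, are straightforward to sketch. A minimal version with numpy and scikit-learn (the 24-step window length and the random-walk input are illustrative assumptions, not puffin defaults):

import numpy as np
from sklearn.preprocessing import StandardScaler

def make_windows(series: np.ndarray, seq_len: int = 24) -> np.ndarray:
    """Scale features to zero mean / unit variance, then cut overlapping windows."""
    scaled = StandardScaler().fit_transform(series)
    # Result shape: (n_windows, seq_len, n_features)
    return np.stack([scaled[i:i + seq_len] for i in range(len(scaled) - seq_len + 1)])

# Illustrative input: 1,000 observations of 5 price-like features
data = np.random.randn(1000, 5).cumsum(axis=0)
print(make_windows(data).shape)  # (977, 24, 5)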

Chapter Contents

  • GAN Architecture Fundamentals: Generator, discriminator, adversarial training, data augmentation, training stability
  • TimeGAN for Financial Data: Temporal dependencies, embedder/recovery/supervisor architecture, scenario generation
  • Synthetic Data Evaluation: Distribution comparison, autocorrelation structure, PCA analysis, full evaluation pipeline

The Adversarial Game

At its core, a GAN is a two-player minimax game:

  1. Generator G(z) maps random noise to fake data samples
  2. Discriminator D(x) outputs the probability that x is real (not generated)
  3. G tries to maximize D’s error rate; D tries to minimize it
  4. At equilibrium, G produces samples indistinguishable from real data
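
Formally, the two networks optimize the original GAN value function:

    \min_G \max_D \, V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]

At the optimum, D(x) = 1/2 everywhere, which is exactly the D(G(z)) ≈ 0.5 convergence check in the snippet below.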
from puffin.deep.gan import GAN

# The generator and discriminator train together
gan = GAN(latent_dim=10, data_dim=20)
history = gan.train(real_data, epochs=100, batch_size=64, lr=0.0002)

# At convergence: D(G(z)) ~ 0.5 (discriminator cannot tell real from fake)

Monitor both generator and discriminator losses during training. A discriminator loss that collapses toward zero means the discriminator is winning decisively and the generator no longer receives useful gradients; a generator whose samples lose diversity while its loss stays low has suffered mode collapse.
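
A minimal monitoring sketch; the structure of history here is an assumption (per-epoch loss lists under "d_loss" and "g_loss" keys), so adapt it to whatever gan.train actually returns:

# Hypothetical history layout: {"d_loss": [...], "g_loss": [...]}, one entry per epoch
d_losses = history["d_loss"]
g_losses = history["g_loss"]

for epoch, (d, g) in enumerate(zip(d_losses, g_losses)):
    # A near-zero discriminator loss means D wins outright and G's gradients vanish
    if d < 1e-3:
        print(f"epoch {epoch}: discriminator loss ~0; generator updates have stalled")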

Common Pitfalls

  1. Mode collapse: The generator produces only a limited variety of samples. Use diverse training data, adjust learning rates, or try a Wasserstein GAN.

  2. Training instability: Losses oscillate wildly. Lower the learning rate, add gradient clipping, or use spectral normalization (see the sketch after this list).

  3. Poor-quality samples: Synthetic data doesn't match real data statistics. Train for more epochs, improve the architecture, or use more training data.

  4. Overfitting: GAN memorizes training data instead of learning the distribution. Use more diverse training data, regularization, or early stopping.
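
Two of the remedies above, spectral normalization and gradient clipping, are one-line additions in PyTorch. A minimal sketch, assuming a plain feed-forward discriminator (shapes and hyperparameters are illustrative, not puffin defaults):

import torch
import torch.nn as nn
import torch.nn.functional as F

# Spectral normalization constrains each layer's Lipschitz constant
disc = nn.Sequential(
    nn.utils.spectral_norm(nn.Linear(20, 64)),
    nn.LeakyReLU(0.2),
    nn.utils.spectral_norm(nn.Linear(64, 1)),
)
opt = torch.optim.Adam(disc.parameters(), lr=2e-4, betas=(0.5, 0.999))

def discriminator_step(real: torch.Tensor, fake: torch.Tensor) -> float:
    """One discriminator update with clipped gradients."""
    opt.zero_grad()
    loss = F.binary_cross_entropy_with_logits(disc(real), torch.ones(len(real), 1)) \
         + F.binary_cross_entropy_with_logits(disc(fake.detach()), torch.zeros(len(fake), 1))
    loss.backward()
    torch.nn.utils.clip_grad_norm_(disc.parameters(), max_norm=1.0)  # cap gradient norm
    opt.step()
    return loss.item()

A Wasserstein GAN would swap the cross-entropy loss for a critic score difference; the spectral normalization above approximately satisfies its Lipschitz requirement.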

Best Practices

  • Always normalize input data before training (StandardScaler)
  • Use conservative learning rates (0.0001 to 0.0003)
  • Balance discriminator and generator update frequency
  • Validate synthetic data quality with statistical tests before use
  • Check for mode collapse by visualizing sample diversity with PCA (a sketch follows this list)
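
A minimal PCA diversity check, fitting on real data only and projecting both sets into the same plane (the array names are illustrative):

import numpy as np
from sklearn.decomposition import PCA

def pca_overlay(real: np.ndarray, synthetic: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Fit PCA on real data, then project both sets into the same 2-D space."""
    pca = PCA(n_components=2).fit(real)
    return pca.transform(real), pca.transform(synthetic)

# Scatter-plot both projections: if the synthetic cloud collapses to a small
# region of the real cloud, the generator has likely mode-collapsed.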

Exercises

  1. Train a basic GAN on daily return data for 10 stocks. Compare the marginal distributions of real and synthetic returns using KS tests.

  2. Use TimeGAN to generate 50 synthetic 30-day market scenarios. Are the autocorrelation structures preserved?

  3. Augment a small training set (500 samples) with GAN-generated data and measure the impact on a RandomForest classifier’s accuracy.

  4. Generate stress-test scenarios by conditioning on high-volatility periods. How do your strategies perform?

  5. Apply PCA to both real and synthetic data. Do they share the same principal component structure?

Summary

  • Standard GAN: Cross-sectional market data generation using adversarial training
  • TimeGAN: Time series with temporal dependencies via embedder, supervisor, and recovery networks
  • Evaluation: Statistical tests (KS test, autocorrelation, PCA) validate synthetic data quality

Key applications: data augmentation for rare events, stress testing and scenario analysis, privacy-preserving data sharing, and research experimentation.

Notebook: Run the examples interactively in deep_learning.ipynb

Source Code

Browse the implementation: puffin/deep/gan.py

Next Steps

Part 21 explores Deep Reinforcement Learning: Q-learning, DQN, and trading agents that learn optimal policies through interaction with market environments.

