Q-Learning Fundamentals

Q-learning is a foundational reinforcement learning algorithm that learns action-value functions through temporal difference learning. It works by maintaining a table of Q-values — one entry for every state-action pair — and updating them toward better estimates as the agent interacts with the environment.

Q-learning is an off-policy algorithm: it learns about the optimal (greedy) policy regardless of the exploration strategy used during training. Because the learned values do not depend on how the agent explores, Q-learning can reuse experience freely, which often makes it more sample-efficient than on-policy methods like SARSA.
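
The difference shows up in the update target. A minimal sketch on a plain NumPy array (a toy Q-table, not the library's agent) contrasting the two targets:

import numpy as np

# Toy Q-table: 4 states x 2 actions, plus one observed transition
Q = np.zeros((4, 2))
gamma = 0.99
s, a, r, s_next, a_next = 0, 1, 1.0, 2, 0

# Q-learning (off-policy): bootstrap from the best action in s',
# no matter which action the behavior policy takes next
q_learning_target = r + gamma * np.max(Q[s_next])

# SARSA (on-policy): bootstrap from the action actually taken in s'
sarsa_target = r + gamma * Q[s_next, a_next]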

The Bellman Equation

Q-learning updates the Q-table with a temporal-difference rule derived from the Bellman optimality equation:

Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]

Where:

  • alpha: Learning rate controlling update magnitude (typically 0.01–0.1)
  • gamma: Discount factor weighting future vs immediate rewards (typically 0.95–0.99)
  • r: Immediate reward received after taking action a in state s
  • s': Next state after the transition
  • max_a' Q(s',a'): Best estimated future value from the next state

The term r + gamma * max_a' Q(s',a') - Q(s,a) is the temporal difference (TD) error — the gap between the current estimate and a better one.
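
Written out against a plain NumPy Q-table (a toy example, not the library's agent), a single update step looks roughly like this:

import numpy as np

# A single tabular update on a toy 16-state, 4-action Q-table
Q = np.zeros((16, 4))
alpha, gamma = 0.1, 0.99

# One transition: action a in state s gave reward r and next state s'
s, a, r, s_next = 3, 2, 1.0, 7

# TD error: bootstrapped target minus current estimate
td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]

# Nudge the current estimate toward the target by a fraction alpha
Q[s, a] += alpha * td_error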

QLearningAgent

The QLearningAgent class provides a complete tabular Q-learning implementation with epsilon-greedy exploration and configurable decay schedules.

from puffin.rl.q_learning import QLearningAgent
import gymnasium as gym

# Create simple discrete environment
env = gym.make('FrozenLake-v1', is_slippery=False)

# Initialize Q-learning agent
agent = QLearningAgent(
    n_states=16,           # Number of discrete states
    n_actions=4,           # Number of actions
    lr=0.1,                # Learning rate
    gamma=0.99,            # Discount factor
    epsilon=1.0,           # Initial exploration rate
    epsilon_decay=0.995    # Decay factor per episode
)

# Train the agent
rewards = agent.train(env, episodes=1000, verbose=True)

# Get learned policy (best action per state)
policy = agent.get_policy()
print("Learned policy:", policy)

# Save the Q-table for later use
agent.save("q_table.npy")
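
If save() writes a standard .npy array, as the filename suggests (an assumption, not confirmed here), the table can be reloaded with NumPy alone and turned into a greedy policy:

import numpy as np

# Reload the saved table (assuming save() wrote a plain .npy array)
q_table = np.load("q_table.npy")

# Greedy policy: best action for each state, no exploration
greedy_policy = np.argmax(q_table, axis=1)
print("Greedy action for state 0:", greedy_policy[0])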

Start with a high epsilon (1.0) for full exploration and decay it slowly. If epsilon decays too fast, the agent may converge to a suboptimal policy before discovering better actions.

Epsilon-Greedy Exploration

The epsilon-greedy strategy balances exploration and exploitation:

  • With probability epsilon, choose a random action (explore)
  • With probability 1 - epsilon, choose the action with highest Q-value (exploit)

Epsilon typically decays over training so the agent explores broadly early on and refines its policy later:

from puffin.rl.q_learning import QLearningAgent

# Aggressive exploration early, refined exploitation later
agent = QLearningAgent(
    n_states=100,
    n_actions=3,
    epsilon=1.0,           # Start with 100% exploration
    epsilon_decay=0.998,   # Slow decay
    epsilon_min=0.01       # Never fully stop exploring
)

# After 1000 episodes: epsilon ~ 1.0 * 0.998^1000 ~ 0.135
# After 2000 episodes: epsilon ~ 1.0 * 0.998^2000 ~ 0.018 (still above epsilon_min=0.01)
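
The selection rule itself is only a few lines. A standalone sketch of the idea, not the agent's actual method:

import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_table, state, epsilon):
    """Random action with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(q_table.shape[1]))   # explore
    return int(np.argmax(q_table[state]))            # exploit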

State Discretization

For continuous state spaces (like market prices), observations must be converted to discrete indices. The discretize_state function bins each dimension independently and returns a combined index.

from puffin.rl.q_learning import discretize_state
import numpy as np

# Define bins for each observation dimension
obs = np.array([0.52, 1.23, -0.45])
bins = [
    np.linspace(0, 1, 10),    # 10 bin edges for dimension 1
    np.linspace(0, 2, 10),    # 10 bin edges for dimension 2
    np.linspace(-1, 1, 10)    # 10 bin edges for dimension 3
]

# Convert to discrete state index
state_index = discretize_state(obs, bins)
print(f"Discrete state: {state_index}")

The number of states grows exponentially with dimensions. With 10 bins per dimension and 5 dimensions, you get 10^5 = 100,000 states. For high-dimensional observations, use DQN instead of tabular Q-learning.
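
The growth is easy to quantify; a quick sketch assuming 10 discrete levels per dimension and 3 actions:

# Rough Q-table growth as observation dimensionality increases
n_actions = 3
for n_dims in (1, 2, 3, 5):
    n_states = 10 ** n_dims
    print(f"{n_dims} dims: {n_states:,} states, "
          f"{n_states * n_actions:,} Q-table entries")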

Trading with Q-Learning

Applying Q-learning to trading requires wrapping the TradingEnvironment with a discretization layer. The agent learns a policy over discretized price bins.

from puffin.rl.q_learning import QLearningAgent, discretize_state
from puffin.rl.trading_env import TradingEnvironment
import numpy as np
import pandas as pd

# Load price data
prices = pd.read_csv('data/prices.csv')['close'].values

# Create bins for price discretization using percentiles
price_bins = [np.percentile(prices, q) for q in range(0, 101, 10)]
bins = [np.array(price_bins)]

# Create wrapper for discretization
class DiscreteWrapper:
    """Wraps a continuous environment with state discretization."""

    def __init__(self, env, bins):
        self.env = env
        self.bins = bins
        self.action_space = env.action_space
        self.observation_space = env.observation_space

    def reset(self):
        obs, info = self.env.reset()
        return discretize_state(obs[:1], self.bins), info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        discrete_obs = discretize_state(obs[:1], self.bins)
        return discrete_obs, reward, terminated, truncated, info

# Create trading environment with discrete actions (buy/hold/sell)
base_env = TradingEnvironment(prices, discrete_actions=True)
env = DiscreteWrapper(base_env, bins)

# Train Q-learning agent
n_states = len(bins[0]) - 1    # one discrete state per percentile interval
agent = QLearningAgent(
    n_states=n_states,
    n_actions=3,        # 0=sell, 1=hold, 2=buy
    lr=0.1,
    gamma=0.99,
    epsilon=1.0,
    epsilon_decay=0.995
)
rewards = agent.train(env, episodes=500)

print(f"Mean reward (last 100): {np.mean(rewards[-100:]):.2f}")

Q-Table Inspection

After training, the Q-table reveals what the agent has learned. Each row is a state, each column is an action, and the values represent expected cumulative rewards.

import numpy as np

# After training...
q_table = agent.q_table
action_names = ['sell', 'hold', 'buy']

# Show every state with learned (non-zero) Q-values and its preferred action
for state in range(agent.n_states):
    q_values = q_table[state]
    if np.max(q_values) > 0:
        best_action = np.argmax(q_values)
        print(f"State {state}: best={action_names[best_action]}, "
              f"Q={q_values}")

Inspect the Q-table after training to verify the agent has learned sensible policies. If all Q-values are near zero, the agent may need more training episodes or a different learning rate.
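
A related quick check is coverage: rows that are still all zeros usually correspond to states the agent rarely or never visited. A small sketch using the agent attributes shown above:

import numpy as np

# Fraction of states with at least one updated (non-zero) Q-value
visited = np.any(agent.q_table != 0, axis=1)
print(f"States with learned values: {visited.sum()}/{agent.n_states} "
      f"({visited.mean():.0%})")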

Limitations of Tabular Q-Learning

Tabular Q-learning has fundamental constraints that motivate Deep Q-Networks:

  1. Curse of dimensionality: The Q-table size is n_states * n_actions. With continuous or high-dimensional observations, the table becomes impractically large.

  2. No generalization: Similar states are treated independently. A Q-value learned for state 42 says nothing about state 43, even if they represent nearly identical market conditions.

  3. Discretization artifacts: Binning continuous data introduces quantization error. Fine bins create too many states; coarse bins lose information.

  4. Slow convergence: Every state-action pair must be visited multiple times. In large state spaces, many pairs are rarely or never encountered.

These limitations lead naturally to function approximation — using neural networks to estimate Q-values, which is the subject of the next section on DQN.

Source Code

File                           Description
puffin/rl/q_learning.py        QLearningAgent class and discretize_state function
puffin/rl/trading_env.py       TradingEnvironment Gymnasium wrapper
tests/rl/test_q_learning.py    Unit tests for Q-learning agent
