Deep Q Networks (DQN)
Deep Q Networks (DQN) combine Reinforcement Learning with deep neural networks to learn optimal policies directly from high-dimensional sensory input. Introduced by DeepMind in 2013, DQN was the first algorithm to successfully learn control policies from raw pixel input, famously achieving human-level performance on Atari games.
From Q-Learning to Deep Q-Learning
Q-Learning Recap
Q-learning learns a Q-function $Q(s, a)$ that estimates the expected return of taking action $a$ in state $s$ and then following the optimal policy. The Q-value is updated using the Bellman equation:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

Where:
- $\alpha$ is the learning rate
- $r$ is the immediate reward
- $\gamma$ is the discount factor
- $s'$ is the next state
Traditional Q-learning stores Q-values in a table, which becomes impractical when the state space is large or continuous (e.g., images with millions of pixels).
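For concreteness, here is a minimal tabular Q-learning update in Python; the learning rate, discount factor, action set, and example states are made up for illustration.

```python
from collections import defaultdict

# Hypothetical hyperparameters and action set for a small discrete problem.
alpha, gamma = 0.1, 0.99          # learning rate and discount factor
actions = [0, 1, 2, 3]
Q = defaultdict(float)            # Q[(state, action)] -> estimated return

def q_update(s, a, r, s_next, done):
    """Apply the Bellman update Q(s,a) += alpha * (target - Q(s,a))."""
    best_next = 0.0 if done else max(Q[(s_next, a_)] for a_ in actions)
    target = r + gamma * best_next
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Example: one update after observing reward 1.0 moving from state 5 to 6.
q_update(s=5, a=2, r=1.0, s_next=6, done=False)
```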
The Deep Learning Solution
DQN replaces the Q-table with a neural network that approximates the Q-function:

$$Q(s, a; \theta) \approx Q^*(s, a)$$

Where $\theta$ represents the network's weights. The network takes a state as input and outputs Q-values for all possible actions.
```mermaid
graph LR
    S[State s] --> NN[Neural Network]
    NN --> Q1[Q-value for action 1]
    NN --> Q2[Q-value for action 2]
    NN --> Q3[Q-value for action n]
```
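A minimal sketch of this state-in, Q-values-out structure, assuming PyTorch and a small fully connected network (not the original Atari architecture):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action."""
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, n_actions),   # one output head per action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

q_net = QNetwork(state_dim=4, n_actions=2)      # e.g. a CartPole-sized problem
q_values = q_net(torch.randn(1, 4))             # shape: (batch, n_actions)
greedy_action = q_values.argmax(dim=1)          # exploit: pick the best action
```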
Key Innovations
Naively combining neural networks with Q-learning is unstable. DQN introduced two crucial techniques to make it work:
1. Experience Replay
Instead of learning from consecutive experiences (which are highly correlated), DQN stores experiences in a replay buffer and samples random mini-batches from it for training.

Each experience is stored as a tuple $(s, a, r, s')$.
Benefits:
- Breaks correlation between consecutive samples
- Reuses experiences multiple times (data efficiency)
- Smooths out learning over many past behaviours
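A minimal replay buffer sketch in Python, using a bounded deque and uniform random sampling; the capacity and the inclusion of a `done` flag are illustrative choices:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, s', done) transitions."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are evicted

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size: int):
        # Uniform random sampling breaks the correlation between
        # consecutive transitions.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```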
2. Target Network
DQN uses two networks:
- Online network (weights $\theta$): updated at every step
- Target network (weights $\theta^-$): a periodic copy of the online network, updated every $C$ steps

The target for updating the Q-network becomes:

$$y = r + \gamma \max_{a'} Q(s', a'; \theta^-)$$
Benefits:
- Stabilises training by keeping the target fixed temporarily
- Prevents the “moving target” problem where updates chase a constantly changing goal
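A minimal sketch of the hard target-network update, assuming PyTorch and reusing the hypothetical `q_net` from the earlier sketch:

```python
import copy

target_net = copy.deepcopy(q_net)      # θ⁻ starts as an exact copy of θ
target_net.eval()                      # the target network is never trained directly

def sync_target(online, target):
    """Hard update: copy the online weights into the target network (θ⁻ ← θ)."""
    target.load_state_dict(online.state_dict())

# Called every C environment steps in the training loop, e.g.:
# if step % C == 0:
#     sync_target(q_net, target_net)
```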
The DQN Algorithm
```
Initialise replay buffer D
Initialise Q-network with random weights θ
Initialise target network with weights θ⁻ = θ

For each episode:
    Initialise state s
    For each step:
        With probability ε select a random action a
        Otherwise select a = argmax_a Q(s, a; θ)
        Execute action a, observe reward r and next state s'
        Store transition (s, a, r, s') in replay buffer D
        Sample a random mini-batch of transitions from D
        For each transition, compute the target:
            y = r                              (if s' is terminal)
            y = r + γ max_a' Q(s', a'; θ⁻)     (otherwise)
        Update θ by gradient descent on (y - Q(s, a; θ))²
        Every C steps: θ⁻ ← θ
        s ← s'
```
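A sketch of a single training step from the loop above, assuming PyTorch and the hypothetical `q_net`, `target_net`, and `ReplayBuffer` from the earlier sketches; batch size and learning rate are illustrative:

```python
import torch
import torch.nn.functional as F

gamma = 0.99
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)

def train_step(buffer, batch_size=32):
    states, actions, rewards, next_states, dones = buffer.sample(batch_size)
    states      = torch.as_tensor(states, dtype=torch.float32)
    actions     = torch.as_tensor(actions, dtype=torch.int64)
    rewards     = torch.as_tensor(rewards, dtype=torch.float32)
    next_states = torch.as_tensor(next_states, dtype=torch.float32)
    dones       = torch.as_tensor(dones, dtype=torch.float32)

    # Q(s, a; θ) for the actions actually taken
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # y = r + γ max_a' Q(s', a'; θ⁻), with no bootstrap on terminal states
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        y = rewards + gamma * q_next * (1.0 - dones)

    loss = F.mse_loss(q_pred, y)       # (y - Q(s, a; θ))²
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```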
ε-Greedy Exploration
DQN uses ε-greedy exploration to balance exploitation and exploration:
- With probability $\epsilon$: take a random action (explore)
- With probability $1 - \epsilon$: take the best action according to Q (exploit)

Typically, $\epsilon$ starts high (e.g., 1.0) and decays over time to a small value (e.g., 0.1), allowing more exploration early in training and more exploitation later.
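A sketch of ε-greedy action selection with a linear decay schedule, assuming PyTorch and the hypothetical `q_net` from above; the decay length and bounds are illustrative:

```python
import random
import torch

def epsilon_by_step(step, eps_start=1.0, eps_end=0.1, decay_steps=100_000):
    """Linearly decay epsilon from eps_start to eps_end over decay_steps."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def select_action(state, step, n_actions):
    eps = epsilon_by_step(step)
    if random.random() < eps:
        return random.randrange(n_actions)                  # explore
    with torch.no_grad():
        q_values = q_net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
    return int(q_values.argmax(dim=1).item())               # exploit
```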
Loss Function
The network is trained to minimise the mean squared error between predicted Q-values and target Q-values:

$$L(\theta) = \mathbb{E}_{(s, a, r, s') \sim D} \left[ \left( y - Q(s, a; \theta) \right)^2 \right]$$

Where:

$$y = r + \gamma \max_{a'} Q(s', a'; \theta^-)$$
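A worked example with made-up numbers, showing the target and squared error for a single transition:

```python
# Hypothetical values: reward 1.0, γ = 0.99, the target network's best
# next-state Q-value is 2.5, and the online prediction is 3.0.
r, gamma = 1.0, 0.99
max_q_next = 2.5                      # max_a' Q(s', a'; θ⁻)
q_pred = 3.0                          # Q(s, a; θ)

y = r + gamma * max_q_next            # target: 1.0 + 0.99 * 2.5 = 3.475
squared_error = (y - q_pred) ** 2     # (3.475 - 3.0)² ≈ 0.2256
print(y, squared_error)
```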
Network Architecture (Atari)
For Atari games, the original DQN used:
| Layer | Description |
|---|---|
| Input | 84×84×4 (4 stacked grayscale frames) |
| Conv1 | 32 filters, 8×8, stride 4, ReLU |
| Conv2 | 64 filters, 4×4, stride 2, ReLU |
| Conv3 | 64 filters, 3×3, stride 1, ReLU |
| FC | 512 units, ReLU |
| Output | One Q-value per action |
Stacking 4 frames allows the network to infer motion and velocity.
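A sketch of this architecture, assuming PyTorch; the pixel scaling and the action count of 6 are illustrative details:

```python
import torch
import torch.nn as nn

class AtariDQN(nn.Module):
    """The convolutional architecture from the table above."""
    def __init__(self, n_actions: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 4x84x84 -> 32x20x20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # -> 64x9x9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # -> 64x7x7
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(7 * 7 * 64, 512), nn.ReLU(),
            nn.Linear(512, n_actions),        # one Q-value per action
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x / 255.0))   # scale raw pixels to [0, 1]

net = AtariDQN(n_actions=6)
q = net(torch.zeros(1, 4, 84, 84))   # output shape: (1, 6)
```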
Limitations
- Overestimation bias — Q-learning tends to overestimate Q-values (addressed by Double DQN)
- Sample inefficiency — requires millions of frames to learn
- Discrete actions only — doesn’t handle continuous action spaces
- No prioritisation — treats all experiences equally (addressed by Prioritised Experience Replay)
Extensions and Improvements
| Variant | Key Idea |
|---|---|
| Double DQN | Uses online network to select actions, target network to evaluate (reduces overestimation) |
| Dueling DQN | Separates state value and action advantage estimation |
| Prioritised Experience Replay | Samples important transitions more frequently |
| Rainbow | Combines multiple improvements into one agent |
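To make the Double DQN row concrete, here is a sketch of its target computation, assuming PyTorch and networks shaped like the earlier sketches:

```python
import torch

def double_dqn_targets(rewards, next_states, dones, gamma, q_net, target_net):
    """Double DQN target: the online net selects a', the target net evaluates it."""
    with torch.no_grad():
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)        # a' = argmax_a Q(s', a; θ)
        q_eval = target_net(next_states).gather(1, best_actions).squeeze(1)  # Q(s', a'; θ⁻)
        return rewards + gamma * q_eval * (1.0 - dones)
```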
Applications
- Game playing (Atari, board games)
- Robotics control
- Resource management
- Recommendation systems
- Autonomous vehicles
See Also
- Reinforcement Learning — foundational concepts (states, actions, rewards, policy)