Deep Q Networks (DQN)

Deep Q Networks (DQN) combine Q-learning with deep neural networks to learn control policies directly from high-dimensional sensory input. Introduced by DeepMind in 2013, DQN was the first deep reinforcement learning algorithm to learn successful control policies directly from raw pixel input, famously reaching human-level performance on many Atari games.

From Q-Learning to Deep Q-Learning

Q-Learning Recap

Q-learning learns a Q-function Q(s, a) that estimates the expected return of taking action a in state s and then following the optimal policy. The Q-value is updated using the Bellman equation:

Q(s, a) ← Q(s, a) + α [ r + γ max_a' Q(s', a') - Q(s, a) ]

Where:

  • α is the learning rate
  • r is the immediate reward
  • γ is the discount factor
  • s' is the next state

Traditional Q-learning stores Q-values in a table, which becomes impractical when the state space is large or continuous (e.g., images with millions of pixels).
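
As a point of reference before moving to neural networks, here is a minimal sketch of the tabular update in Python; the dictionary-based Q-table and the function name are illustrative, not part of any particular library:

```python
# A minimal sketch of the tabular Q-learning update above.
from collections import defaultdict

alpha, gamma = 0.1, 0.99          # learning rate and discount factor (example values)
Q = defaultdict(float)            # Q[(state, action)] -> estimated return, default 0.0

def q_learning_update(s, a, r, s_next, actions):
    """Apply Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```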

The Deep Learning Solution

DQN replaces the Q-table with a neural network that approximates the Q-function:

Q(s, a; θ) ≈ Q*(s, a)

Where θ represents the network’s weights. The network takes a state as input and outputs Q-values for all possible actions.

graph LR
    S[State s] --> NN[Neural Network]
    NN --> Q1[Q value action 1]
    NN --> Q2[Q value action 2]
    NN --> Q3[Q value action n]
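
A minimal Q-network sketch in PyTorch, matching the diagram above: the network maps a state vector to one Q-value per action. The framework choice, layer sizes, and class name are assumptions for illustration; the original DQN used a convolutional network over pixels, shown later in this section.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to a vector of Q-values, one per action."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),   # Q(s, a; θ) for every action a
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)              # shape: (batch, n_actions)
```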

Key Innovations

Naively combining neural networks with Q-learning is unstable. DQN introduced two crucial techniques to make it work:

1. Experience Replay

Instead of learning from consecutive experiences (which are highly correlated), DQN stores experiences in a replay buffer and samples random mini-batches for training.

Each experience is a tuple: (s, a, r, s')

Benefits:

  • Breaks correlation between consecutive samples
  • Reuses experiences multiple times (data efficiency)
  • Smooths out learning over many past behaviours
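
A minimal replay-buffer sketch using only the Python standard library. The capacity and batch size are illustrative values, and a terminal flag (done) is stored alongside the (s, a, r, s') tuple because the algorithm below treats terminal states separately:

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are evicted first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size: int = 32):
        # Uniform random sampling breaks the correlation between consecutive transitions
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```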

2. Target Network

DQN uses two networks:

  • Online network (θ) — updated every step
  • Target network (θ⁻) — a periodic copy, updated every C steps

The target for updating the Q-network becomes:

y = r + γ max_a' Q(s', a'; θ⁻)

Benefits:

  • Stabilises training by keeping the target fixed temporarily
  • Prevents the “moving target” problem where updates chase a constantly changing goal
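
A short sketch of how the two networks are kept in sync, reusing the QNetwork class from the earlier sketch; the state/action sizes and the sync interval C are assumed values:

```python
import copy

online_net = QNetwork(state_dim=4, n_actions=2)   # θ  (illustrative sizes)
target_net = copy.deepcopy(online_net)            # θ⁻ = θ at initialisation
C = 1_000                                         # sync interval (assumed value)

def maybe_sync(step: int):
    """Every C steps, copy the online weights into the target network (θ⁻ ← θ)."""
    if step % C == 0:
        target_net.load_state_dict(online_net.state_dict())
```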

The DQN Algorithm

Initialise replay buffer D
Initialise Q-network with random weights θ
Initialise target network with weights θ⁻ = θ

For each episode:
    Initialise state s
    For each step:
        With probability ε select random action a
        Otherwise select a = argmax_a Q(s, a; θ)

        Execute action a, observe reward r and next state s'
        Store transition (s, a, r, s') in replay buffer D

        Sample random mini-batch of transitions from D

        For each transition, compute target:
            y = r                           (if s' is terminal)
            y = r + γ max_a' Q(s', a'; θ⁻)  (otherwise)

        Update θ by gradient descent on (y - Q(s, a; θ))²

        Every C steps: θ⁻ ← θ
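
A sketch of the sampled mini-batch update from the inner loop above, in PyTorch. It reuses the ReplayBuffer, online_net and target_net from the earlier sketches; the optimiser choice (Adam rather than the RMSProp used in the original paper), learning rate, and batch size are assumptions:

```python
import numpy as np
import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(online_net.parameters(), lr=1e-4)
gamma = 0.99

def train_step(buffer, batch_size: int = 32):
    s, a, r, s_next, done = zip(*buffer.sample(batch_size))
    s      = torch.as_tensor(np.asarray(s), dtype=torch.float32)
    a      = torch.as_tensor(np.asarray(a), dtype=torch.int64)
    r      = torch.as_tensor(np.asarray(r), dtype=torch.float32)
    s_next = torch.as_tensor(np.asarray(s_next), dtype=torch.float32)
    done   = torch.as_tensor(np.asarray(done), dtype=torch.float32)

    # Q(s, a; θ) for the actions actually taken
    q = online_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

    # y = r                            if s' is terminal
    # y = r + γ max_a' Q(s', a'; θ⁻)   otherwise
    with torch.no_grad():
        max_next = target_net(s_next).max(dim=1).values
        y = r + gamma * (1.0 - done) * max_next

    loss = F.mse_loss(q, y)            # (y - Q(s, a; θ))²
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```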

ε-Greedy Exploration

DQN uses ε-greedy exploration to balance exploitation and exploration:

  • With probability ε: take a random action (explore)
  • With probability 1 - ε: take the best action according to Q (exploit)

Typically, ε starts high (e.g., 1.0) and decays over time to a small value (e.g., 0.1), allowing more exploration early in training and more exploitation later.
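
A small sketch of ε-greedy action selection with linear decay; the schedule and the start/end values are assumptions consistent with the ranges mentioned above:

```python
import random
import torch

def select_action(state, net, n_actions, step,
                  eps_start=1.0, eps_end=0.1, decay_steps=100_000):
    # Linearly anneal ε from eps_start to eps_end over decay_steps environment steps
    eps = max(eps_end, eps_start - (eps_start - eps_end) * step / decay_steps)
    if random.random() < eps:
        return random.randrange(n_actions)                      # explore
    with torch.no_grad():
        q = net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
        return int(q.argmax(dim=1).item())                      # exploit
```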

Loss Function

The network is trained to minimise the mean squared error between predicted Q-values and target Q-values:

L(θ) = E[ (y - Q(s, a; θ))² ]

Where: y = r + γ max_a' Q(s', a'; θ⁻)

Network Architecture (Atari)

For Atari games, the original DQN used:

  • Input: 84×84×4 (4 stacked grayscale frames)
  • Conv1: 32 filters, 8×8, stride 4, ReLU
  • Conv2: 64 filters, 4×4, stride 2, ReLU
  • Conv3: 64 filters, 3×3, stride 1, ReLU
  • FC: 512 units, ReLU
  • Output: one Q-value per action

Stacking 4 frames allows the network to infer motion and velocity.
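
The architecture listed above, sketched in PyTorch (the framework is an assumption; the layer sizes follow the list, and the 7×7×64 flattened size follows from an 84×84 input):

```python
import torch
import torch.nn as nn

class AtariDQN(nn.Module):
    def __init__(self, n_actions: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 4×84×84 -> 32×20×20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # -> 64×9×9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # -> 64×7×7
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(7 * 7 * 64, 512), nn.ReLU(),
            nn.Linear(512, n_actions),   # one Q-value per action
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 4, 84, 84) stacked grayscale frames, typically scaled to [0, 1]
        return self.head(self.features(x))
```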

Limitations

  • Overestimation bias — Q-learning tends to overestimate Q-values (addressed by Double DQN)
  • Sample inefficiency — requires millions of frames to learn
  • Discrete actions only — doesn’t handle continuous action spaces
  • No prioritisation — treats all experiences equally (addressed by Prioritised Experience Replay)

Extensions and Improvements

  • Double DQN: uses the online network to select actions and the target network to evaluate them (reduces overestimation)
  • Dueling DQN: separates state-value and action-advantage estimation
  • Prioritised Experience Replay: samples important transitions more frequently
  • Rainbow: combines multiple improvements into one agent
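
To make the Double DQN entry concrete, here is a sketch of the two target computations side by side, reusing the tensors and networks from the training-step sketch above; only the action-selection step changes:

```python
import torch

with torch.no_grad():
    # Standard DQN target: the target network both selects and evaluates a'
    y_dqn = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values

    # Double DQN target: the online network selects a', the target network evaluates it
    a_star = online_net(s_next).argmax(dim=1, keepdim=True)
    y_ddqn = r + gamma * (1.0 - done) * target_net(s_next).gather(1, a_star).squeeze(1)
```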

Applications

  • Game playing (Atari, board games)
  • Robotics control
  • Resource management
  • Recommendation systems
  • Autonomous vehicles
