Deep Q Networks (DQN)

Deep Q Networks (DQN) combine Q-learning with deep neural networks to learn control policies directly from high-dimensional sensory input. Introduced by DeepMind in 2013, DQN was the first deep reinforcement learning algorithm to learn successful control policies directly from raw pixel input, famously reaching human-level performance on many Atari games.

From Q-Learning to Deep Q-Learning

Q-Learning Recap

Q-learning learns a Q-function Q(s, a) that estimates the expected return of taking action a in state s and then following the optimal policy. The Q-value is updated using the Bellman equation:

Q(s, a) ← Q(s, a) + α [ r + γ max_a' Q(s', a') - Q(s, a) ]

Where:

  • α is the learning rate
  • r is the immediate reward
  • γ is the discount factor
  • s' is the next state

Traditional Q-learning stores Q-values in a table, which becomes impractical when the state space is large or continuous (e.g., images with millions of pixels).
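
As a point of reference before moving to neural networks, here is a minimal sketch of the tabular update in Python; the dictionary-based Q-table and the function name are illustrative, not part of any particular library:

```python
# A minimal sketch of the tabular Q-learning update above.
from collections import defaultdict

alpha, gamma = 0.1, 0.99          # learning rate and discount factor (example values)
Q = defaultdict(float)            # Q[(state, action)] -> estimated return, default 0.0

def q_learning_update(s, a, r, s_next, actions):
    """Apply Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```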

The Deep Learning Solution

DQN replaces the Q-table with a neural network that approximates the Q-function:

Q(s, a; θ) ≈ Q*(s, a)

Where θ represents the network’s weights. The network takes a state as input and outputs Q-values for all possible actions.

graph LR
    S[State s] --> NN[Neural Network]
    NN --> Q1[Q value action 1]
    NN --> Q2[Q value action 2]
    NN --> Q3[Q value action n]
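
A minimal Q-network sketch in PyTorch, matching the diagram above: the network maps a state vector to one Q-value per action. The framework choice, layer sizes, and class name are assumptions for illustration; the original DQN used a convolutional network over pixels, shown later in this section.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to a vector of Q-values, one per action."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),   # Q(s, a; θ) for every action a
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)              # shape: (batch, n_actions)
```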

Key Innovations

Naively combining neural networks with Q-learning is unstable. DQN introduced two crucial techniques to make it work:

1. Experience Replay

Instead of learning from consecutive experiences (which are highly correlated), DQN stores experiences in a replay buffer and samples random mini-batches for training.

Each experience is a tuple: (s, a, r, s')

Benefits:

  • Breaks correlation between consecutive samples
  • Reuses experiences multiple times (data efficiency)
  • Smooths out learning over many past behaviours
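
A minimal replay-buffer sketch using only the Python standard library. The capacity and batch size are illustrative values, and a terminal flag (done) is stored alongside the (s, a, r, s') tuple because the algorithm below treats terminal states separately:

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are evicted first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size: int = 32):
        # Uniform random sampling breaks the correlation between consecutive transitions
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```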

2. Target Network

DQN uses two networks:

  • Online network (θ) — updated every step
  • Target network (θ⁻) — a periodic copy, updated every C steps

The target for updating the Q-network becomes:

y = r + γ max_a' Q(s', a'; θ⁻)

Benefits:

  • Stabilises training by keeping the target fixed temporarily
  • Prevents the “moving target” problem where updates chase a constantly changing goal
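
A short sketch of how the two networks are kept in sync, reusing the QNetwork class from the earlier sketch; the state/action sizes and the sync interval C are assumed values:

```python
import copy

online_net = QNetwork(state_dim=4, n_actions=2)   # θ  (illustrative sizes)
target_net = copy.deepcopy(online_net)            # θ⁻ = θ at initialisation
C = 1_000                                         # sync interval (assumed value)

def maybe_sync(step: int):
    """Every C steps, copy the online weights into the target network (θ⁻ ← θ)."""
    if step % C == 0:
        target_net.load_state_dict(online_net.state_dict())
```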

The DQN Algorithm

Initialise replay buffer D
Initialise Q-network with random weights θ
Initialise target network with weights θ⁻ = θ

For each episode:
    Initialise state s
    For each step:
        With probability ε select random action a
        Otherwise select a = argmax_a Q(s, a; θ)

        Execute action a, observe reward r and next state s'
        Store transition (s, a, r, s') in replay buffer D

        Sample random mini-batch of transitions from D

        For each transition, compute target:
            y = r                           (if s' is terminal)
            y = r + γ max_a' Q(s', a'; θ⁻)  (otherwise)

        Update θ by gradient descent on (y - Q(s, a; θ))²

        Every C steps: θ⁻ ← θ
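
A sketch of the sampled mini-batch update from the inner loop above, in PyTorch. It reuses the ReplayBuffer, online_net and target_net from the earlier sketches; the optimiser choice (Adam rather than the RMSProp used in the original paper), learning rate, and batch size are assumptions:

```python
import numpy as np
import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(online_net.parameters(), lr=1e-4)
gamma = 0.99

def train_step(buffer, batch_size: int = 32):
    s, a, r, s_next, done = zip(*buffer.sample(batch_size))
    s      = torch.as_tensor(np.asarray(s), dtype=torch.float32)
    a      = torch.as_tensor(np.asarray(a), dtype=torch.int64)
    r      = torch.as_tensor(np.asarray(r), dtype=torch.float32)
    s_next = torch.as_tensor(np.asarray(s_next), dtype=torch.float32)
    done   = torch.as_tensor(np.asarray(done), dtype=torch.float32)

    # Q(s, a; θ) for the actions actually taken
    q = online_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

    # y = r                            if s' is terminal
    # y = r + γ max_a' Q(s', a'; θ⁻)   otherwise
    with torch.no_grad():
        max_next = target_net(s_next).max(dim=1).values
        y = r + gamma * (1.0 - done) * max_next

    loss = F.mse_loss(q, y)            # (y - Q(s, a; θ))²
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```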

ε-Greedy Exploration

DQN uses ε-greedy exploration to balance exploitation and exploration:

  • With probability ε: take a random action (explore)
  • With probability 1 - ε: take the best action according to Q (exploit)

Typically, ε starts high (e.g., 1.0) and decays over time to a small value (e.g., 0.1), allowing more exploration early in training and more exploitation later.
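
A small sketch of ε-greedy action selection with linear decay; the schedule and the start/end values are assumptions consistent with the ranges mentioned above:

```python
import random
import torch

def select_action(state, net, n_actions, step,
                  eps_start=1.0, eps_end=0.1, decay_steps=100_000):
    # Linearly anneal ε from eps_start to eps_end over decay_steps environment steps
    eps = max(eps_end, eps_start - (eps_start - eps_end) * step / decay_steps)
    if random.random() < eps:
        return random.randrange(n_actions)                      # explore
    with torch.no_grad():
        q = net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
        return int(q.argmax(dim=1).item())                      # exploit
```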

Loss Function

The network is trained to minimise the mean squared error between predicted Q-values and target Q-values:

L(θ) = E[ (y - Q(s, a; θ))² ]

Where: y = r + γ max_a' Q(s', a'; θ⁻)

Network Architecture (Atari)

For Atari games, the original DQN used:

  • Input: 84×84×4 (4 stacked grayscale frames)
  • Conv1: 32 filters, 8×8, stride 4, ReLU
  • Conv2: 64 filters, 4×4, stride 2, ReLU
  • Conv3: 64 filters, 3×3, stride 1, ReLU
  • FC: 512 units, ReLU
  • Output: one Q-value per action

Stacking 4 frames allows the network to infer motion and velocity.
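
The architecture listed above, sketched in PyTorch (the framework is an assumption; the layer sizes follow the list, and the 7×7×64 flattened size follows from an 84×84 input):

```python
import torch
import torch.nn as nn

class AtariDQN(nn.Module):
    def __init__(self, n_actions: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 4×84×84 -> 32×20×20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # -> 64×9×9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # -> 64×7×7
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(7 * 7 * 64, 512), nn.ReLU(),
            nn.Linear(512, n_actions),   # one Q-value per action
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 4, 84, 84) stacked grayscale frames, typically scaled to [0, 1]
        return self.head(self.features(x))
```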

Limitations

  • Overestimation bias — Q-learning tends to overestimate Q-values (addressed by Double DQN)
  • Sample inefficiency — requires millions of frames to learn
  • Discrete actions only — doesn’t handle continuous action spaces
  • No prioritisation — treats all experiences equally (addressed by Prioritised Experience Replay)

Extensions and Improvements

  • Double DQN: uses the online network to select actions and the target network to evaluate them (reduces overestimation)
  • Dueling DQN: separates state-value and action-advantage estimation
  • Prioritised Experience Replay: samples important transitions more frequently
  • Rainbow: combines multiple improvements into one agent
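
To make the Double DQN entry concrete, here is a sketch of the two target computations side by side, reusing the tensors and networks from the training-step sketch above; only the action-selection step changes:

```python
import torch

with torch.no_grad():
    # Standard DQN target: the target network both selects and evaluates a'
    y_dqn = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values

    # Double DQN target: the online network selects a', the target network evaluates it
    a_star = online_net(s_next).argmax(dim=1, keepdim=True)
    y_ddqn = r + gamma * (1.0 - done) * target_net(s_next).gather(1, a_star).squeeze(1)
```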

Applications

  • Game playing (Atari, board games)
  • Robotics control
  • Resource management
  • Recommendation systems
  • Autonomous vehicles
