Reinforcement Learning is an exciting field of Machine Learning that’s attracting a lot of attention and popularity. An important reason for this popularity is due to breakthroughs in Reinforcement Learning where computer algorithms such as Alpha Go and OpenAI Five have been able to achieve human level performance on games such as Go and Dota 2. One of the core concepts in Reinforcement Learning is the Deep Q-Learning algorithm. Naturally, a lot of us want to learn more about the algorithms behind these impressive accomplishments. In this tutorial, we’ll be sharing a minimal Deep Q-Network implementation (minDQN) meant as a practical guide to help new learners code their own Deep Q-Networks.

- Reinforcement Learning Background
- The CartPole OpenAI Gym Environment
- Vanilla Q-Learning
- Deep Q-Learning
- Tips and Tricks
- Deep Q-Network Coding Implementation
- Resources

Reinforcement Learning can broadly be separated into two groups: **model free** and **model based** RL algorithms. Model free RL algorithms don’t learn a model of their environment’s transition function to make predictions of future states and rewards. **Q-Learning, Deep Q-Networks, **and** Policy Gradient** methods are model-free algorithms because they don’t create a model of the environment’s transition function.

The CartPole environment is a simple environment where the objective is to move a cart left or right in order to balance an upright pole for as long as possible. The state space is described with 4 values representing Cart Position, Cart Velocity, Pole Angle, and Pole Velocity at the Tip. The action space is described with 2 values (0 or 1) allowing the car to either move left or right at each time step.

## The Q-Learning Algorithm

1. Initialize your Q-table

2. Choose an action using the Epsilon-Greedy Exploration Strategy

3. Update the Q-table using the Bellman Equation

The **Q-table** is a simple data structure that we use to keep track of the states, actions, and their expected rewards. More specifically, the Q-table maps a state-action pair to a **Q-value** (the estimated optimal future value) which the agent will learn. At the start of the Q-Learning algorithm, the Q-table is initialized to all zeros indicating that the agent doesn’t know anything about the world. As the agent tries out different actions at different states through trial and error, the agent learns each state-action pair’s expected reward and updates the Q-table with the new Q-value. Using trial and error to learn about the world is called **Exploration**.

One of the goals of the Q-Learning algorithm is to learn the Q-Value for a new environment. The **Q-Value** is the maximum expected reward an agent can reach by taking a given action A from the state S. After an agent has learned the Q-value of each state-action pair, the agent at state **S** maximizes its expected reward by choosing the action **A** with the highest expected reward. Explicitly choosing the best known action at a state is called **Exploitation**.

A common strategy for tackling the **exploration-exploitation tradeoff** is the **Epsilon Greedy Exploration Strategy**.

- At every time step when it’s time to choose an action, roll a dice
- If the dice has a probability less than epsilon, choose a random action
- Otherwise take the best known action at the agent’s current state

Note that at the beginning of the algorithm, every step the agent takes will be random which is useful to help the agent learn about the environment it’s in. As the agent takes more and more steps, the value of epsilon decreases and the agent starts to try existing known good actions more and more. Note that epsilon is initialized to 1 meaning every step is random at the start. Near the end of the training process, the agent will be exploring much less and exploiting much more.

The **Bellman Equation** tells us how to update our Q-table after each step we take. To summarize this equation, the agent updates the current perceived value with the estimated optimal future reward which assumes that the agent takes the best current known action. In an implementation, the agent will search through all the actions for a particular state and choose the state-action pair with the highest corresponding Q-value.

**S** = the State or Observation

**A** = the Action the agent takes

**R** = the Reward from taking an Action

**t** = the time step

**Ɑ** = the Learning Rate

**ƛ** = the discount factor which causes rewards to lose their value over time so more immediate rewards are valued more highly

**Vanilla Q-Learning**: A table maps each state-action pair to its corresponding Q-value

**Deep Q-Learning**: A Neural Network maps input states to (action, Q-value) pairs

## The Deep Q-Network Algorithm

- Initialize your Main and Target neural networks
- Choose an action using the Epsilon-Greedy Exploration Strategy
- Update your network weights using the Bellman Equation

A core difference between Deep Q-Learning and Vanilla Q-Learning is the implementation of the Q-table. Critically, Deep Q-Learning replaces the regular Q-table with a neural network. Rather than mapping a state-action pair to a q-value, a neural network maps input states to (action, Q-value) pairs.

One of the interesting things about Deep Q-Learning is that the learning process uses 2 neural networks. These networks have the same architecture but different weights. Every N steps, the weights from the **main network** are copied to the **target network**. Using both of these networks leads to more stability in the learning process and helps the algorithm to learn more effectively. In our implementation, the main network weights replace the target network weights every 100 steps.

The main and target neural networks map input states to an (action, q-value) pair. In this case, each output node (representing an action) contains the action’s q-value as a floating point number. Note that the output nodes do not represent a probability distribution so they will not add up to 1. For the example above, one action has a Q-value of 8 and the other action has a Q-value of 5.

**def** agent(state_shape, action_shape):

learning_rate = 0.001

init = tf.keras.initializers.HeUniform()

model = keras.Sequential()

model.add(keras.layers.Dense(24, input_shape=state_shape, activation='relu', kernel_initializer=init))

model.add(keras.layers.Dense(12, activation='relu', kernel_initializer=init))

model.add(keras.layers.Dense(action_shape, activation='linear', kernel_initializer=init))

model.compile(loss=tf.keras.losses.Huber(), optimizer=tf.keras.optimizers.Adam(lr=learning_rate), metrics=['accuracy'])

**return** model

In our implementation, the main and target networks are quite simple consisting of 3 densely connected layers with Relu activation functions. The most notable features are that we use He uniform initialization as well as the Huber loss function to achieve better performance.

In the Epsilon-Greedy Exploration strategy, the agent chooses a random action with probability **epsilon** and exploits the best known action with probability **1 — epsilon**.

Both the Main model and the Target model map input states to output actions. These output actions actually represent the model’s predicted Q-value. In this case, the action that has the largest predicted Q-value is the best known action at that state.

After choosing an action, it’s time for the agent to perform the action and update the Main and Target networks according to the Bellman equation. Deep Q-Learning agents use Experience Replay to learn about their environment and update the Main and Target networks.

To summarize, the **main network** samples and trains on a batch of past experiences every 4 steps. The main network weights are then copied to the **target network** weights every 100 steps.

**Experience Replay** is the act of storing and replaying game states (the state, action, reward, next_state) that the RL algorithm is able to learn from. Experience Replay can be used in **Off-Policy** algorithms to learn in an offline fashion. Off-policy methods are able to update the algorithm’s parameters using saved and stored information from previously taken actions. Deep Q-Learning uses Experience Replay to learn in small **batches** in order to avoid skewing the dataset distribution of different states, actions, rewards, and next_states that the neural network will see. Importantly, the agent doesn’t need to train after each step. In our implementation, we use Experience Replay to train on small batches once every 4 steps rather than every single step. We found this trick to really help speed up our Deep Q-Learning implementation.

Just like with vanilla Q-Learning, the agent still needs to update our model weights according to the Bellman Equation.

From the original Bellman equation in Figure 3, we want to replicate the **Temporal Difference target** operation using our neural network rather than using a Q-table. Note that the **target** network and not the main network is used to calculate the Temporal Difference target. Assuming that the temporal difference target operation produces a value of 9 in the example above, we can update the **main** network weights by assigning 9 to the target q-value and fitting our **main** network weights to the new target values.

Our Deep Q-Network implementation needed a few tricks before the agent started to learn to solve the CartPole problem effectively. Here are some of the tips and tricks that really helped.

- Having the right model parameter update frequency is important. If you update model weights too often (e.g. after every step), the algorithm will learn very slowly when not much has changed. In this case, we perform main model weight updates every 4 steps which helps the algorithm to run significantly faster.
- Setting the correct frequency to copy weights from the Main Network to the Target Network also helps improve learning performance. We initially tried to update the Target Network every N episodes which turned out to be less stable because the episodes can have a different number of steps. Sometimes there would be 10 steps in an episode and other times there could be 200 steps. We found that updating the Target Network every 100 steps seemed to work relatively well.
- Using the Huber loss function rather than the Mean Squared Error loss function also helps the agent to learn. The Huber loss function weighs outliers less than the Mean Squared Error loss function.
- The right initialization strategy seems to help. In this case, we use He Initialization for initializing network weights. He Initialization is a good initialization strategy for networks that use the Relu activation function.

Putting it all together, you can find our minimal Deep Q-Network implementation solving the CartPole problem here. This implementation uses Tensorflow and Keras and should generally run in less than 15 minutes.