Teaching a robot how to walk with Q-Learning 


Today, we’re gonna learn how to create a virtual agent that discovers how to interact with the world. The technique we’re going to use is called Q-Learning, and it’s super cool.

The agent, the state and the goal

Let’s take a look at our agent!

Henry, the robot

This is Henry, a young virtual robot with a dream: to travel the world. The problem is he doesn’t know much about the world. In fact, he doesn’t even know how to move! He only knows his GPS location, the position of his feet, and whether they are on the ground.

What Henry perceives of the world

These elements can be split into two parts: the state and the goal. The state of our agent is the set of information about his body, while the goal is the quantity the agent wants to increase. In our case, the agent wants to use his feet to change his position.

State: [Right foot extension, Right foot on ground, Left foot extension, Left foot on ground]
Goal: [Position]
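To make this concrete, here is a minimal sketch of how the state and goal could be represented in Python. The field names and value ranges are assumptions for illustration, not the exact representation used in the original code.

```python
# A possible representation of Henry's state. Extensions are assumed
# to take values -1 (back), 0 (neutral) or 1 (forward).
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    right_foot_extension: int
    right_foot_on_ground: bool
    left_foot_extension: int
    left_foot_on_ground: bool

# The goal is simply Henry's position along the ground:
# he wants this number to grow.
position = 0.0

# Henry starts upright, both feet neutral and on the ground.
start = State(0, True, 0, True)
```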

The actions

Fortunately, Henry is capable of performing actions. Even though he cannot perceive much of the world, he knows what he is capable of doing, and that depends on his state. For example, when both his feet are on the ground, he can lift one. And when one foot is in the air, he can move it forward or backward.

The actions available to our robot, depending on his state. NB: In reality, the diagram is a bit more complex, as the robot takes into account the extension of the foot.
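The rules in the diagram can be sketched as a function from state to available actions. The state keys and action names below are assumed for illustration (and, as the caption notes, the real rules also account for foot extension):

```python
def available_actions(state):
    """Return the actions Henry may take in the given state.

    `state` is a dict with keys 'right_up' and 'left_up' (assumed
    names), each True when that foot is lifted off the ground.
    """
    actions = []
    if not state["right_up"] and not state["left_up"]:
        # Both feet on the ground: he can lift either one.
        actions += ["lift_right", "lift_left"]
    if state["right_up"]:
        # A lifted foot can swing or come back down.
        actions += ["move_right_forward", "move_right_backward", "put_right_down"]
    if state["left_up"]:
        actions += ["move_left_forward", "move_left_backward", "put_left_down"]
    return actions

# With both feet down, only lifting is possible.
print(available_actions({"right_up": False, "left_up": False}))
```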

Here’s Henry performing random actions


To reach his goal, the agent follows a set of instructions called a policy. These instructions can be learnt by the agent or written down manually by humans.

Following a hard-coded policy

Let’s write down a simple policy for our agent to follow.

A simple, hard-coded policy
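In code, such a hard-coded policy might look like this. The exact rules in the diagram above may differ; this is an assumed version where Henry lifts a foot, swings it forward, puts it down, then repeats with the other foot:

```python
def simple_policy(state):
    """Map a state to an action, with no learning involved.

    `state` is a dict with assumed keys: 'right_up'/'left_up'
    (foot lifted?) and 'right_ext'/'left_ext' (extension, -1..1).
    """
    if state["right_up"]:
        if state["right_ext"] < 1:
            return "move_right_forward"
        return "put_right_down"
    if state["left_up"]:
        if state["left_ext"] < 1:
            return "move_left_forward"
        return "put_left_down"
    # Both feet down: lift whichever foot is further back.
    if state["right_ext"] <= state["left_ext"]:
        return "lift_right"
    return "lift_left"
```

Note how brittle this is: the function assumes it always encounters states that the hand-written rules anticipated.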

The agent following the simple policy. Look at him go!

This looks promising, although the robot is losing a lot of time lifting and putting down his feet. It’s not very efficient.

In fact, we rarely write good policies manually. This is either because we don’t know any (e.g., for complex tasks), or because they are often not robust (if the agent somehow starts with the left foot up, the policy above would immediately fail, as the agent cannot execute the desired action).


To help our agent fulfil his dream, we will use a reinforcement learning technique called Q-Learning, which lets the robot learn a robust and efficient policy. This technique consists of assigning a number, or Q-Value, to the event of performing a certain action when the agent is in a certain state. This value represents the progress made with this action.

Some State-Action pairs associated with their Q-Value

This value is determined by a reward function, indicating whether the action had a positive or negative impact on reaching the goal. In our case, if Henry moves away from where he was, that’s good; if he moves back, that’s bad.

Our agent’s basic reward function
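Under the assumption that Henry’s position is a single number that grows as he moves forward, the reward function sketched above boils down to one line:

```python
def reward(old_position, new_position):
    """Positive when Henry moves forward, negative when he moves back."""
    return new_position - old_position

print(reward(2.0, 3.0))   # moving forward: positive reward
print(reward(3.0, 2.5))   # moving back: negative reward
```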

The key point here is that it’s not the action itself that has value, but rather the fact of performing an action when the agent is in a specific state.

With the knowledge of all Q-Values, the agent could, at each step, pick the action which would bring the best reward depending on his state. This would be his policy.
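If Henry knew every Q-Value, his policy would reduce to a lookup: pick the action with the highest Q-Value for the current state. Here the Q-table is a plain dict from (state, action) pairs to values, which is an assumed representation:

```python
def greedy_action(q_table, state, actions):
    """Return the action with the highest known Q-Value in `state`.

    Unknown State-Action pairs default to a Q-Value of 0.
    """
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))

# A tiny example Q-table with made-up values.
q_table = {("both_down", "lift_right"): 0.4,
           ("both_down", "lift_left"): 0.1}
print(greedy_action(q_table, "both_down", ["lift_right", "lift_left"]))
```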

The issue here is that Henry doesn’t have a clue what the Q-Values are! This means he cannot pick an action that will bring him closer to his goal!

Fortunately, we have a way of making him discover the Q-Values by himself.

Learning the Q-Values with the Epsilon-Greedy algorithm

Training the agent

To learn the Q-Values of his State-Action pairs, Henry must try these actions himself. The problem is that there might be billions of possible combinations, and the program cannot try them all. We want Henry to try as many combinations as possible, but focus on the best ones.

To achieve that, we are going to use the Epsilon-Greedy algorithm to train the robot. It works like this: at each step, the robot has a probability epsilon of performing a random available action; otherwise, it picks the best action according to the Q-Values it knows. The result of the action is used to update the Q-Values.
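One Epsilon-Greedy decision can be sketched in a few lines (using the same assumed dict-based Q-table as before):

```python
import random

def epsilon_greedy(q_table, state, actions, epsilon):
    """Pick a random action with probability epsilon, else the best one."""
    if random.random() < epsilon:
        return random.choice(actions)  # explore
    # Exploit: best known action, unknown pairs default to 0.
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))

# With epsilon = 0 the choice is always greedy:
q = {("s", "a"): 1.0, ("s", "b"): 0.0}
print(epsilon_greedy(q, "s", ["a", "b"], epsilon=0.0))
```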

Updating the Q-Values and the problem of distant rewards

A simple way to update a Q-Value would be to replace the value the robot has in memory with the value he has just experienced by performing the action. However, this brings a serious issue: the robot cannot see more than one step into the future. This makes the robot blind to any future reward.

Lifting a foot is necessary to walk, but how could Henry discover that it is a good action, when it doesn’t bring any immediate reward?

The solution to this problem is given by the Bellman equation, which proposes a way to account for future rewards.

Bellman equation

First, instead of replacing the old value with the new value, the old value fades away at a certain rate (alpha, the learning rate). This enables the robot to account for noise: some actions might work sometimes, and sometimes not. With this progressive evolution of the Q-Value, one faulty reward doesn’t mess up the whole system.

Also, the new value is calculated using not only the immediate reward, but also the maximal expected future value. This is the best possible reward we expect to receive from the actions available in the next state. This has a dramatic impact on the effectiveness of the learning process: with it, rewards are propagated back in time.
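Putting the two ingredients together, the update described above can be written out in code. `alpha` is the learning rate from the text; `gamma` is the standard discount factor that weights future rewards (assumed here, as in the usual Q-Learning formulation, along with the dict-based Q-table):

```python
def bellman_update(q_table, state, action, reward, next_state,
                   next_actions, alpha=0.1, gamma=0.9):
    """Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s',a'))"""
    old = q_table.get((state, action), 0.0)
    # Best value we expect from the next state's available actions.
    future = max((q_table.get((next_state, a), 0.0) for a in next_actions),
                 default=0.0)
    q_table[(state, action)] = (1 - alpha) * old + alpha * (reward + gamma * future)

# Lifting a foot gives no immediate reward, but leads to a state
# where stepping forward is valuable, so its Q-Value turns positive.
q = {("foot_up", "step_forward"): 1.0}
bellman_update(q, "both_down", "lift_right", 0.0, "foot_up", ["step_forward"])
print(q[("both_down", "lift_right")])  # now positive
```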

With this change, the Q-Value of raising a foot has become positive, as it benefits from the future optimal expected value of taking a step forward.

Q-Values updated with the Bellman equation. Note how putting a foot down is now seen as positive instead of neutral, as it benefits from the reward of taking the next step.

Exploration vs Exploitation dilemma

By playing with the value of epsilon, we face the dilemma of Exploration vs Exploitation. An agent with a high epsilon will mostly try random actions (exploration), while an agent with a low epsilon will rarely try new actions (exploitation). We need to find a sweet spot that lets our robot try many new things without wasting too much time on unpromising leads.

Two agents, trained with different epsilons, following their best policy.

Here, the agent that focused more on exploration discovered a very effective technique to move. In contrast, the yellow robot doesn’t take full advantage of the extension of his feet because he didn’t spend enough time trying random actions. Note that a balance must be found, as spending too much time on exploration prevents the agent from learning complex policies.


Look at that! Henry’s travelling the world at an astonishing pace! He managed to find, by himself, a better way to move than the simple policy we gave him. Plus, he can recover from small errors: from any state, he only has to follow his trusted Q-Values to guide him!

You can play with the code here:


🎉 You’ve reached the end! I hope you enjoyed this article. If you did, please like it, share it, subscribe to the newsletter, send me pizzas, follow me on medium, or do whatever you feel like doing! 🎉

If you like Artificial Intelligence, subscribe to the newsletter to receive updates on articles and much more!
