Live Online Training

Introduction to Reinforcement Learning


Why reinforcement learning is changing Machine Learning

Leonardo De Marchi

In this workshop we will see how it’s possible to use Reinforcement Learning to solve real-world problems. Reinforcement Learning has recently made incredible progress in industry and established itself as one of the best techniques for sequential decision making and control policies, and we are still only scratching the surface of its applications.

In this workshop we will start by exploring its fundamentals, like the Multi-Armed Bandit, a technique for learning the best next move an agent can perform in an unknown environment while maximizing the reward it receives. We will then build on these concepts and explore more complex techniques that find the best behavioural policy for our agent. As the course progresses, we will see how it’s possible to model more and more complex environments using more advanced methods: methods that explicitly estimate the total reward the agent can get, like Q-learning, and methods that focus on finding the best policy without explicitly calculating the rewards, like gradient methods.

What you'll learn and how you can apply it

By the end of this live, hands-on, online course, you’ll understand:

  • The differences between the main Reinforcement Learning methods.
  • The advantages and disadvantages of each model and how they can be successfully applied in different scenarios.
  • The theory behind the RL algorithms.

And you’ll be able to:

  • Understand where it’s possible to use RL algorithms.
  • Write RL models in Python.

This training course is for you because...

  • You’re someone who is passionate about Artificial Intelligence and innovation.
  • You are a Data Scientist or an Engineer working with Machine Learning algorithms and want to progress your career with innovative techniques.
  • You want to understand the fundamentals of Reinforcement Learning concepts.


Prerequisites:

  • Some Python knowledge, enough to be able to understand code, and familiarity with the data science stack (specifically numpy, Tensorflow, and Keras).

Recommended follow-up:

About your instructor

  • Leonardo De Marchi holds a master’s degree in Artificial Intelligence and has worked as a Data Scientist in the sports world, with clients such as the New York Knicks and Manchester United, and with large social networks, like Justgiving.

    He now works as Lead Data Scientist at Badoo, the largest dating site with over 360 million users. He is also the lead instructor at ideai.io, a company specialized in Deep Learning and Machine Learning training, and a contractor for the European Commission.


The timeframes are only estimates and may vary according to how the class is progressing.

Introduction to Reinforcement Learning (30 minutes)

  • Presentation: Introduction to the session and overview of basic Reinforcement Learning algorithms. We will define the basic Reinforcement Learning problem: an agent that wants to learn a policy that maximises its total reward. (20 minutes)
  • Poll: What do you hope to get out of today's course? (5 minutes)
  • Q&A (5 minutes)
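The basic problem described above can be sketched in a few lines of Python. The environment below is a made-up example for illustration (not part of the course material): the agent repeatedly chooses an action, the environment returns a reward, and the agent's goal is to make the total reward as large as possible.

```python
import random

class CoinFlipEnv:
    """Toy environment: action 1 pays reward 1 with probability 0.7."""
    def step(self, action):
        return 1.0 if (action == 1 and random.random() < 0.7) else 0.0

env = CoinFlipEnv()
total_reward = 0.0
for t in range(100):
    action = random.choice([0, 1])    # a policy maps states to actions;
    total_reward += env.step(action)  # this placeholder acts at random
print("total reward:", total_reward)
```

A learning agent would replace the random choice with a policy that improves from the rewards it observes; the rest of the course is about how to do exactly that.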

Bandit methods (35 minutes)

  • Poll: What do you know about Bandit methods?
  • Presentation: Introducing bandit methods, a simple family of algorithms that balance exploration and exploitation in an unknown environment. During exploration the algorithm learns the best action to take, while during exploitation it uses what it has learned to maximize the reward. This approach can be applied in many different scenarios, like optimizing payments or marketing efforts. (20 minutes)
  • Poll: How are you planning to use Bandit methods?
  • Exercise: Complete a bandit algorithm with Thompson sampling (10 minutes)
  • Q&A (5 minutes)
  • Break (5 minutes)
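As a taste of the exercise, here is a minimal sketch of Thompson sampling on a Bernoulli bandit. The arm payoff rates and the number of steps are assumptions chosen for illustration: the agent keeps a Beta posterior per arm, samples a plausible payoff rate from each, and pulls the arm with the highest sample, which balances exploration and exploitation automatically.

```python
import numpy as np

true_probs = [0.3, 0.5, 0.7]          # hidden from the agent
rng = np.random.default_rng(0)

n_arms = len(true_probs)
successes = np.ones(n_arms)           # Beta prior: alpha = 1
failures = np.ones(n_arms)            # Beta prior: beta = 1

for step in range(2000):
    # Sample a plausible payoff rate per arm from its Beta posterior
    # and greedily pick the best sample.
    samples = rng.beta(successes, failures)
    arm = int(np.argmax(samples))

    reward = rng.random() < true_probs[arm]   # pull the arm
    successes[arm] += reward
    failures[arm] += 1 - reward

best_arm = int(np.argmax(successes / (successes + failures)))
print("estimated best arm:", best_arm)
```

Arms with uncertain posteriors still get sampled occasionally, so the agent never stops exploring entirely, but most pulls concentrate on the best arm as evidence accumulates.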

Monte Carlo and tree search (35 minutes)

  • Presentation: Introducing Monte Carlo methods. These methods sample sequences of states, actions, and rewards from interactions with an environment to find the optimal policy. They are an evolution of the Multi-Armed Bandit and can be used in several applications; Monte Carlo tree search was even used in DeepMind’s AlphaGo to defeat the human Go champions. (20 minutes)
  • Exercise: Complete a Monte Carlo method application (10 minutes)
  • Q&A (5 minutes)
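The sampling idea can be illustrated with first-visit Monte Carlo prediction on a toy problem. The environment below (a 5-state random walk with a reward only at the right end) is an assumption for the sketch, not the course exercise: the agent runs many episodes, then estimates each state's value as the average return observed after first visiting it.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states = 5                          # states 0..4; 0 and 4 are terminal
returns_sum = np.zeros(n_states)
returns_cnt = np.zeros(n_states)

for episode in range(5000):
    state = 2                          # always start in the middle
    trajectory = []
    while state not in (0, 4):
        trajectory.append(state)
        state += rng.choice([-1, 1])   # random policy: left or right
    reward = 1.0 if state == 4 else 0.0  # reward only at the right end

    # First-visit update: average the (undiscounted) return observed
    # after the first time each state appears in the episode.
    for s in dict.fromkeys(trajectory):  # unique states, first-visit order
        returns_sum[s] += reward
        returns_cnt[s] += 1

values = returns_sum / np.maximum(returns_cnt, 1)
print(np.round(values[1:4], 2))  # true values are 0.25, 0.5, 0.75
```

No model of the environment's transitions is needed; the averages alone converge to the true state values, which is what makes Monte Carlo methods so broadly applicable.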

Temporal Difference methods and SARSA (35 minutes)

  • Presentation: TD methods address the credit assignment problem: often we see the effect of an action only some time after it is performed, and our algorithm needs to figure out which action should be credited for the feedback the agent receives from the environment. These methods can be used for optimization tasks, like allocating cars between different dealership locations to maximize profits. (20 minutes)
  • Poll: How would you use the SARSA method?
  • Exercise: We will solve a video game in OpenAI’s gym using the SARSA algorithm, a TD method that estimates the future rewards of each action taken in a specific state. (10 minutes)
  • Q&A (5 minutes)
  • Break (5 minutes)
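The shape of the SARSA update can be previewed on a much smaller problem than the gym exercise. The corridor environment below is an assumption made for the sketch: states 0..5, action 0 moves left and 1 moves right, and a reward of 1 is paid on reaching state 5. SARSA is on-policy, so it bootstraps from the value of the action the agent actually takes next.

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions = 6, 2
Q = np.zeros((n_states, n_actions))         # state-action value table
alpha, gamma, eps = 0.5, 0.9, 0.1

def step(s, a):
    s2 = max(s - 1, 0) if a == 0 else min(s + 1, n_states - 1)
    return s2, float(s2 == 5), s2 == 5      # next state, reward, done

def policy(s):
    # epsilon-greedy with random tie-breaking
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(rng.choice(np.flatnonzero(Q[s] == Q[s].max())))

for episode in range(500):
    s, a, done = 0, policy(0), False
    while not done:
        s2, r, done = step(s, a)
        a2 = policy(s2)
        # on-policy update: bootstrap from the action actually taken next
        Q[s, a] += alpha * (r + gamma * Q[s2, a2] * (not done) - Q[s, a])
        s, a = s2, a2

greedy = np.argmax(Q, axis=1)
print(greedy[:5])   # the learned policy should point right everywhere
```

The reward at state 5 propagates backwards one TD update at a time, which is exactly how credit is assigned to the earlier actions that led there.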

Q-learning (35 minutes)

  • Presentation: We will present the theory behind Q-learning, another Temporal Difference method that estimates the state-action reward matrix. Unlike SARSA, it is not forced to follow the optimal policy that it estimates. (20 minutes)
  • Poll: Where do you think Q-learning can be used?
  • Exercise: We will solve a video game in OpenAI’s gym using the Q-learning algorithm (10 minutes)
  • Q&A (5 minutes)
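The off-policy character of Q-learning is visible in a sketch on the same kind of toy corridor used above (states 0..5, action 0 = left, 1 = right, reward 1 on reaching state 5; the setup is an assumption for illustration, not the gym exercise). The only change from SARSA is the update target: Q-learning bootstraps from the greedy maximum in the next state, regardless of what the exploratory behaviour policy does next.

```python
import numpy as np

rng = np.random.default_rng(3)
Q = np.zeros((6, 2))
alpha, gamma, eps = 0.5, 0.9, 0.1

def behaviour(s):
    # exploratory epsilon-greedy policy with random tie-breaking
    if rng.random() < eps:
        return int(rng.integers(2))
    return int(rng.choice(np.flatnonzero(Q[s] == Q[s].max())))

for episode in range(500):
    s, done = 0, False
    while not done:
        a = behaviour(s)
        s2 = max(s - 1, 0) if a == 0 else min(s + 1, 5)
        r, done = float(s2 == 5), s2 == 5
        # off-policy update: bootstrap from the greedy (max) value in s2,
        # regardless of which action the behaviour policy takes next
        Q[s, a] += alpha * (r + gamma * Q[s2].max() * (not done) - Q[s, a])
        s = s2

greedy = np.argmax(Q, axis=1)
print(greedy[:5])   # greedy policy should point right toward the reward
```

Because the target uses `max`, the estimates converge toward the optimal values even while the agent keeps exploring with a different policy.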

Gradient methods (35 minutes)

  • Presentation: Theory behind gradient methods. These methods don’t estimate the rewards; their goal is to directly find the optimal policy that the agent should follow. They are the only option when there are so many possible states that computing the state-action matrix would be too expensive. An example application is finding the optimal policy for a robotic arm. (20 minutes)
  • Poll: When are gradient methods useful?
  • Exercise: Complete an exercise with the REINFORCE algorithm, a gradient method that uses gradient descent to find the best parameters for the policy function. (10 minutes)
  • Q&A (5 minutes)
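The core of the REINFORCE exercise can be previewed in miniature. The two-action bandit and all numbers below are assumptions chosen for the sketch: the policy is a softmax over two preference parameters, and each update moves the parameters along the reward-weighted gradient of the log-probability of the chosen action.

```python
import numpy as np

rng = np.random.default_rng(4)
theta = np.zeros(2)                  # one preference parameter per action
lr = 0.1                             # learning rate
true_probs = np.array([0.2, 0.8])    # hidden payoff rate of each action

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for step in range(3000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)                    # sample from the policy
    reward = float(rng.random() < true_probs[a])  # play the action

    # REINFORCE update: move theta along reward * grad log pi(a),
    # where grad log pi(a) = one_hot(a) - probs for a softmax policy.
    grad_log = -probs
    grad_log[a] += 1.0
    theta += lr * reward * grad_log

final_probs = softmax(theta)
print(final_probs)   # most probability mass should end on action 1
```

Note that no value estimate appears anywhere: the parameters of the policy itself are adjusted directly, which is what lets gradient methods scale to state spaces where a state-action table would be infeasible.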

Summary of the session (15 minutes)

  • Presentation: Recap of the session and overview of other algorithms (10 minutes)
  • Q&A (5 minutes)