Live Online training

Active Learning: How to choose the right data records for your model

Building Machine Learning Models with Less Labels

Jennifer Prendki

In this course, you will start by learning why investing time in building the right training set is usually a safer bet than optimizing your model. Traditionally, data scientists have been trained to collect data before fully understanding the task at hand or its implications, because gathering enough historical data took a long time. The advent of Big Data has changed this, though old habits die hard. You will learn how to think about your problem differently and build your training set and your model concurrently with Active Learning. You’ll also discover how Active Learning can dramatically reduce the number of labels your models need to reach the desired accuracy.

Active Learning is gaining traction among companies with a high data labeling bill, but most data scientists have no working knowledge of the field. There is very little literature on how to build an Active Learning strategy in the real world: most papers report research performed in academic circles and offer case-by-case reviews of specific scenarios rather than a general methodology, which is of limited help in practice.

What you'll learn and how you can apply it

By the end of this live, hands-on, online course, you’ll understand:

  • Why building a training set while building your model is the way to go
  • How Active Learning works, and why it is attractive
  • The challenges and limitations of Active Learning

And you’ll be able to:

  • Identify whether a specific use case and dataset are appropriate for Active Learning
  • Understand how the various Active Learning approaches differ
  • Design a querying strategy appropriate for your own task (if relevant) to:
      • efficiently reduce the number of labels required to train a model
      • minimize the chances of inducing a bias
  • Visualize and compare the performance of various querying strategies
  • Understand the advantages but also the risks related to using Active Learning, and Semi-Supervised Learning in general, and know how to mitigate those risks

This training course is for you because...

  • You’re a data scientist/ML scientist and need high-quality training data for your model(s).
  • You’re a data science manager facing budgetary restrictions and you need to reduce your data labeling costs.
  • You’re a product or project manager managing the budget for a data science/ML team and you need to understand current market offerings for data labeling, possibly with the goal of reducing your labeling costs.


Prerequisites

  • Working knowledge of Python
  • Extensive understanding of Supervised Machine Learning; ideally, you routinely build models using a common ML framework such as scikit-learn or TensorFlow
  • Prior experience using Jupyter Notebooks will be useful, but not essential

About your instructor

  • Jennifer Prendki is the VP of Machine Learning for Figure Eight, a human-in-the-loop machine learning and artificial intelligence company that uses human intelligence to do simple tasks such as transcribing text or annotating images to train machine learning algorithms. She leads the Machine Learning department and manages a team of Machine Learning experts who develop Machine Learning-assisted data annotation solutions for any type of data on the market. These solutions are combined with Human-in-the-Loop validation to check the output of the annotation algorithms and guarantee reliable labels.

    Prior to working at Figure Eight, Jennifer ran the Human Evaluation team at Walmart Labs. In this role, she was tasked with labeling data on a very limited budget. She knew she needed a new way to solve the problem and started designing her first version of an active learning solution. Over the last few years, Jennifer has given many talks and run workshops and webinars at a wide range of conferences on the importance of Human-in-the-Loop labeling and on Active Learning, and she has established herself as one of the few specialists on the topic within the industry. Having been on both sides of the fence (as both a customer and a supplier of an Active Learning solution), she understands in depth the challenges and limitations of Active Learning, as well as the huge opportunity it brings to the data scientists of today and tomorrow.


Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Set-Up: Build preliminary classifier on 200 images (5 minutes)

  • The provided sample is deterministic; the seed is provided in the script

Segment 0: Supervised learning with an increasing amount of data (20 minutes)

We provide the function to draw the learning curve

  • Attendees draw a learning curve where data is added incrementally (with random sampling instead of a “smart” strategy)
  • When does it plateau?
  • What is a good incremental amount? (size of the pools?)
  • How sensitive is the learning curve to the seed?
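
As a rough illustration of the random-sampling baseline described above, the sketch below grows a training set in fixed-size increments and records the test accuracy at each step, once per seed. It is not the course script: the synthetic two-class data, the nearest-centroid stand-in classifier, and the names `nearest_centroid_accuracy` and `random_sampling_curve` are all assumptions chosen so the example runs with NumPy alone.

```python
import numpy as np


def nearest_centroid_accuracy(X_train, y_train, X_test, y_test):
    """Stand-in classifier (an assumption): predict the class of the closest class mean."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    # Squared Euclidean distance from every test point to every centroid.
    dists = ((X_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    y_pred = classes[dists.argmin(axis=1)]
    return float((y_pred == y_test).mean())


def random_sampling_curve(X_pool, y_pool, X_test, y_test, batch_size, seed):
    """Return (training set sizes, test accuracies) as records are added in random order."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X_pool))
    sizes, scores = [], []
    for n in range(batch_size, len(X_pool) + 1, batch_size):
        idx = order[:n]
        sizes.append(n)
        scores.append(nearest_centroid_accuracy(X_pool[idx], y_pool[idx], X_test, y_test))
    return sizes, scores


# Synthetic two-class data standing in for the course images (an assumption).
data_rng = np.random.default_rng(0)
X = np.vstack([data_rng.normal(0.0, 1.0, (300, 5)), data_rng.normal(1.5, 1.0, (300, 5))])
y = np.array([0] * 300 + [1] * 300)
shuffle = data_rng.permutation(len(X))
X, y = X[shuffle], y[shuffle]
X_pool, y_pool, X_test, y_test = X[:400], y[:400], X[400:], y[400:]

# One curve per seed: the spread between rows hints at the seed sensitivity.
for seed in (1, 2, 3):
    sizes, scores = random_sampling_curve(X_pool, y_pool, X_test, y_test, batch_size=50, seed=seed)
    print(seed, [round(s, 3) for s in scores])
```

Changing `batch_size` is one way to explore what a good incremental amount might be, and the point where the printed accuracies stop improving is where the curve plateaus for this toy data.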


Segment 1: Uncertainty sampling, confidence level-based (20 minutes)

  • The script provides a function for the pooling strategy; the querying strategy is left for you to fill in
  • Attendees write the querying strategy function (it is deterministic!)
  • Draw the learning curve: is it better than the supervised learning one?
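
A confidence-based querying strategy can be sketched in a few lines. The version below is only one possible reading of the exercise, not the course solution: it assumes `probs` holds the classifier's predicted class probabilities for each unlabeled record (for instance, the output of a scikit-learn classifier's `predict_proba`), and the function name `least_confidence_query` is hypothetical.

```python
import numpy as np


def least_confidence_query(probs, n_queries):
    """Return indices of the n_queries records whose top predicted probability is lowest."""
    confidence = probs.max(axis=1)
    # A stable sort keeps the selection deterministic when confidences tie.
    return np.argsort(confidence, kind="stable")[:n_queries]


# Made-up probabilities for four unlabeled records and three classes.
probs = np.array([
    [0.95, 0.03, 0.02],
    [0.40, 0.35, 0.25],
    [0.60, 0.30, 0.10],
    [0.34, 0.33, 0.33],
])
picked = least_confidence_query(probs, n_queries=2)
print(picked)  # [3 1]
```

Because the function only sorts the model's own confidences, the same model and pool always produce the same selection, which is why the strategy is deterministic.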

Segment 2: Uncertainty sampling, margin-based (20 minutes)

  • Attendees write the querying strategy function (it is deterministic!)
  • Draw the learning curve: is it better than the first two?
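
For comparison, a margin-based strategy ranks records by the gap between the two most likely classes instead of by the top probability alone. The sketch below makes the same assumptions as any probability-based strategy (`probs` is a matrix of predicted class probabilities) and uses a hypothetical function name, `margin_query`.

```python
import numpy as np


def margin_query(probs, n_queries):
    """Return indices of the n_queries records with the smallest top-two probability gap."""
    # Sort each row ascending; the last two columns are the two most likely classes.
    sorted_probs = np.sort(probs, axis=1)
    margin = sorted_probs[:, -1] - sorted_probs[:, -2]
    return np.argsort(margin, kind="stable")[:n_queries]


# Made-up probabilities for four unlabeled records and three classes.
probs = np.array([
    [0.90, 0.05, 0.05],
    [0.50, 0.48, 0.02],
    [0.40, 0.30, 0.30],
    [0.70, 0.20, 0.10],
])
print(margin_query(probs, n_queries=2))  # [1 2]
```

In this toy example the margin criterion ranks record 1 first (a near tie between two classes), whereas a confidence-based criterion would rank record 2 first (lowest top probability), so the two strategies can select different records from the same predictions.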


Segment 3: Query-by-Committee - preliminary (20 minutes)

  • In this segment, you will be provided with two additional classifiers. As a first step, we will compare the learning curve of each classifier against that of an ensemble classifier built by combining all three. How much does an ensemble method help the learning curve?
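
One simple way to combine three classifiers is soft voting, that is, averaging their predicted class probabilities. The source does not say which combination rule the course uses, so the sketch below is an assumption, and the probabilities are made up so that each classifier is wrong on a different record.

```python
import numpy as np


def accuracy(probs, y_true):
    """Fraction of records whose most likely class matches the true label."""
    return float((probs.argmax(axis=1) == y_true).mean())


def soft_vote(prob_list):
    """Average the class probabilities predicted by several classifiers."""
    return np.mean(np.stack(prob_list), axis=0)


y_true = np.array([0, 1, 1, 0])
probs_a = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.7, 0.3]])  # wrong on record 1
probs_b = np.array([[0.8, 0.2], [0.1, 0.9], [0.6, 0.4], [0.6, 0.4]])  # wrong on record 2
probs_c = np.array([[0.4, 0.6], [0.3, 0.7], [0.3, 0.7], [0.9, 0.1]])  # wrong on record 0

for name, p in (("A", probs_a), ("B", probs_b), ("C", probs_c)):
    print(name, accuracy(p, y_true))
print("ensemble", accuracy(soft_vote([probs_a, probs_b, probs_c]), y_true))
```

Here the ensemble corrects each member's single mistake, but that is a property of these hand-picked numbers. On real data an ensemble is not guaranteed to beat its best member, which is what the learning-curve comparison in this segment is meant to reveal.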

Segment 4: Query-by-Committee (20 minutes)

  • Armed with three classifiers, we can now build query-by-committee querying strategies and compare them to each other.
  • What threshold works best for you?
  • Draw a histogram to visualize the degree of disagreement among the different models
  • Write a querying strategy for QbC
  • Draw the learning curve and compare
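
The steps above can be sketched with vote entropy, one common way to measure committee disagreement. The source does not name the disagreement measure used in the course, so vote entropy, the threshold values, and the function names `vote_entropy` and `qbc_query` are assumptions. The histogram is computed as bin counts only; plotting is left out to keep the example dependency-free.

```python
import numpy as np


def vote_entropy(votes, n_classes):
    """Disagreement per record; votes is an (n_models, n_records) array of class labels."""
    n_models = votes.shape[0]
    # Fraction of committee members voting for each class, per record.
    fractions = np.stack([(votes == c).sum(axis=0) for c in range(n_classes)], axis=1) / n_models
    # Replace zeros by 1 inside the log so that 0 * log(0) contributes 0.
    safe = np.where(fractions > 0, fractions, 1.0)
    return -(fractions * np.log(safe)).sum(axis=1)


def qbc_query(votes, n_classes, threshold):
    """Return indices of records whose committee disagreement exceeds the threshold."""
    return np.flatnonzero(vote_entropy(votes, n_classes) > threshold)


# Made-up predictions of three classifiers on four unlabeled records.
votes = np.array([
    [0, 1, 2, 1],  # classifier A
    [0, 1, 0, 2],  # classifier B
    [0, 0, 1, 1],  # classifier C
])
disagreement = vote_entropy(votes, n_classes=3)
counts, edges = np.histogram(disagreement, bins=3)

print(np.round(disagreement, 3))
print(counts)
print(qbc_query(votes, n_classes=3, threshold=0.5))  # [1 2 3]
print(qbc_query(votes, n_classes=3, threshold=1.0))  # [2]
```

Raising the threshold narrows the query to records where all three classifiers disagree, so the choice of threshold trades batch size against how contested the selected records are.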

Segment 5: Make your own (20 minutes)

  • Create hybrid querying strategies!
  • Mix several querying strategies
  • Visualize the learning curve
  • Who wins?
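
As one example of what a hybrid might look like, the sketch below fills part of each batch with the smallest-margin records and the rest with random picks, which is one way to limit the sampling bias that pure uncertainty sampling can induce. The split, the parameter `random_fraction`, and the function name `hybrid_query` are all hypothetical; the course leaves the design open.

```python
import numpy as np


def hybrid_query(probs, n_queries, random_fraction, seed):
    """Select n_queries records: the smallest-margin ones plus a random share of the rest."""
    rng = np.random.default_rng(seed)
    n_random = int(round(n_queries * random_fraction))
    n_uncertain = n_queries - n_random

    # Margin-based part: smallest gap between the two most likely classes.
    sorted_probs = np.sort(probs, axis=1)
    margin = sorted_probs[:, -1] - sorted_probs[:, -2]
    uncertain = np.argsort(margin, kind="stable")[:n_uncertain]

    # Random part: drawn without replacement from records not already selected.
    remaining = np.setdiff1d(np.arange(len(probs)), uncertain)
    random_part = rng.choice(remaining, size=n_random, replace=False)
    return np.concatenate([uncertain, random_part])


# Made-up probabilities for 20 unlabeled records and three classes.
probs = np.random.default_rng(0).dirichlet(np.ones(3), size=20)
picked = hybrid_query(probs, n_queries=6, random_fraction=0.5, seed=42)
print(picked)
```

With a fixed seed the selection is reproducible, so learning curves for different values of `random_fraction` can be compared on equal footing.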


Segment 6: Confidence levels preliminary study (20 minutes)

  • Draw the distribution of the confidence levels on predicted values (on the “testing” set)
  • Draw the distribution of the margins on predicted values
  • Draw the learning curve as a function of the amount of data (it will differ from the supervised one because the data is “prioritized”)
  • Which learning curve is better?
  • What is a good cutoff?
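
The two distributions above can be computed directly from predicted class probabilities. In the sketch below the probabilities are random draws standing in for a real model's output, the cutoff of 0.6 is an arbitrary placeholder, and the helper name `confidence_and_margin` is hypothetical. The bin counts are what a histogram plot would display.

```python
import numpy as np


def confidence_and_margin(probs):
    """Return (top probability, gap between the top two probabilities) for each record."""
    sorted_probs = np.sort(probs, axis=1)
    confidence = sorted_probs[:, -1]
    margin = sorted_probs[:, -1] - sorted_probs[:, -2]
    return confidence, margin


# Random three-class probabilities standing in for model predictions (an assumption).
probs = np.random.default_rng(0).dirichlet(np.ones(3), size=1000)
confidence, margin = confidence_and_margin(probs)

bins = np.linspace(0.0, 1.0, 11)  # ten equal-width bins on [0, 1]
conf_counts, _ = np.histogram(confidence, bins=bins)
margin_counts, _ = np.histogram(margin, bins=bins)

cutoff = 0.6  # placeholder value, to be tuned by inspecting the distributions
share_below = float((confidence < cutoff).mean())

print("confidence counts:", conf_counts)
print("margin counts:    ", margin_counts)
print("share of records below cutoff:", round(share_below, 3))
```

With three classes the top probability can never fall below one third, so the lowest confidence bins stay empty, while margins can reach zero. Keeping those different ranges in mind helps when choosing a cutoff for either criterion.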

Segment 7: Design your own (20 minutes)

  • The querying strategy can be a hybrid
  • Final Q&A and wrap-up