
Protecting data privacy in a machine learning world

Practicing privacy-aware data science


Katharine Jarmul

The EU General Data Protection Regulation (GDPR) left many data scientists scrambling, since data privacy had not been a focal point for most teams in the industry. Yet protecting the data we collect from customers and other sources should be a core competency of every data science team, and it grows more pressing as we manage others' data and monetize it via models, marketing, or explicit reselling.

Join expert Katharine Jarmul for a hands-on, in-depth exploration of privacy best practices for data science and machine learning. Over three hours, Katharine walks you through applying privacy methods to your data science workflow and machine learning models. Along the way, you'll investigate approaches for anonymization and pseudonymization of datasets and the models that are built on that data. Don't miss this chance to explore key topics, tools, and research related to privacy best practices for data science and machine learning teams and learn how to implement them in your current workflows.

The course uses shared Jupyter notebooks, and many of the tools and solutions are Python or REST based.

What you'll learn and how you can apply it

By the end of this live online course, you’ll understand:

  • Data privacy best practices with regard to processing and storage under GDPR
  • Common approaches to data privacy and anonymization
  • Methods to increase data security and privacy in your ML system

And you’ll be able to:

  • Utilize open APIs for pseudonymizing your data
  • Determine how to integrate privacy best practices into your data workflows
  • Evaluate potential data privacy issues in your current data extraction and management

This training course is for you because...

  • You’re a data scientist or data engineer with at least one year of experience, and you need to integrate privacy best practices into your current data science workflows.

Prerequisites:

  • An intermediate knowledge of Python
  • Experience working with machine learning tools in Python

Recommended follow-up:

  • Take Security for machine learning (live online training course with Katharine Jarmul)

About your instructor

  • Katharine Jarmul is a cofounder of KI Protect, a data security company based in Berlin, Germany. She has worked with Python and data wrangling since 2008 for both small and large companies. Automated data workflows, natural language processing, and data tests are her passions. She is coauthor of Data Wrangling with Python and has authored several O'Reilly video courses focused on data analysis with Python.


Schedule

The time frames are only estimates and may vary according to how the class is progressing.

Data science and privacy (10 minutes)

  • Lecture: Real-world data science privacy problems; how increased data use has affected privacy best practices
  • Hands-on exercise: Share the current practices you follow for securing your data science pipelines and models and keeping their data private

Attacks on privacy and machine learning (15 minutes)

  • Lecture: Real-world data security issues related to secrets leaking from machine learning models
  • Hands-on exercise: Share your use of private data for machine learning

Identifying sensitive and private data (10 minutes)

  • Lecture: Strategies for properly identifying sensitive data and analyzing the requirements to secure it for different aspects of data science
  • Hands-on exercise: Explore what may be sensitive for different types of releases
  • Break (5 minutes)

Data pseudonymization (30 minutes)

  • Lecture: Theories, tools, and APIs for data pseudonymization
  • Hands-on exercise: Use pseudonymization strategies such as hashing and structure-preserving pseudonymization in a Jupyter notebook using example IoT data
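Hashing-based pseudonymization can be sketched in a few lines of standard-library Python. This is a minimal illustration of the idea, not the course notebooks: a keyed hash (HMAC) replaces an identifier with a stable pseudonym, so records for the same device still join together, while the key prevents an attacker from re-hashing guessed values to re-identify them. The field names and key handling here are illustrative assumptions.

```python
import hashlib
import hmac

def pseudonymize(value: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed-hash (HMAC-SHA256) pseudonym.

    Using a keyed hash rather than a plain hash defends against
    dictionary attacks: without the key, guessed identifiers cannot
    simply be re-hashed and matched against the pseudonyms.
    """
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()

# Illustrative example: pseudonymize a device ID in an IoT reading.
key = b"keep-this-secret"  # in practice, load from a secrets manager
record = {"device_id": "sensor-0042", "temp_c": 21.5}
record["device_id"] = pseudonymize(record["device_id"], key)
```

Because the mapping is deterministic for a given key, the same device always maps to the same pseudonym, which preserves joins and aggregations across the dataset.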

Data anonymization (30 minutes)

  • Lecture: Theories, tools, and APIs for data anonymization
  • Hands-on exercise: Implement k-anonymity in a Jupyter notebook using a popular income dataset for machine learning
  • Break (5 minutes)
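The core of k-anonymity can also be sketched briefly (again, a simplified illustration rather than the course notebooks): a dataset is k-anonymous when every combination of quasi-identifier values is shared by at least k records, and generalization (e.g., bucketing exact ages) is one way to get there. The column names and bucket width below are assumptions for the example.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every quasi-identifier combination appears
    in at least k records."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in counts.values())

def generalize_age(record, width=10):
    """Coarsen an exact age into a range bucket, e.g. 34 -> '30-39'."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}

rows = [
    {"age": 34, "zip": "10115", "income": ">50K"},
    {"age": 36, "zip": "10115", "income": "<=50K"},
    {"age": 51, "zip": "10117", "income": ">50K"},
    {"age": 58, "zip": "10117", "income": "<=50K"},
]

# With exact ages, each (age, zip) pair is unique, so k=2 fails;
# after bucketing ages, each pair covers two records and k=2 holds.
print(is_k_anonymous(rows, ["age", "zip"], 2))                             # False
print(is_k_anonymous([generalize_age(r) for r in rows], ["age", "zip"], 2))  # True
```

Real anonymization must also weigh information loss and stronger models such as l-diversity, which the lecture material covers in more depth.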

Privacy-preserving ML (40 minutes)

  • Lecture: Research on and tools for building privacy-preserving machine learning models
  • Hands-on exercise: Experiment with pseudonymization of inputs for an example privacy-preserving machine learning notebook
  • Break (5 minutes)

Case study: Protecting your data science (20 minutes)

  • Lecture: A real-world case study of using private data for data science
  • Hands-on exercise: Employ at least one of the methods learned in the course for privacy-preserving data science in a case study notebook with a new dataset

Wrap-up and Q&A (10 minutes)

  • Lecture: Future research and new pursuits in privacy in ML