Live Online Training

Building a robust Machine Learning pipeline

A practical approach to solve a real problem using Machine Learning

Fernando Damasio

In this course, you will code a complete machine learning pipeline, built around a supervised learning model, to solve a real problem, working side by side with your instructor and other students. You are invited to show off your creativity by creating an algorithm that accurately predicts claim severity. The practical project is a regression problem that uses an open, already labeled dataset from an insurance company. You can find more information about the dataset at the link shared in the materials and downloads section below.

We will also cover problems that you may face while coding your machine learning pipeline and discuss different options and approaches to get to the best solution.

The machine learning pipeline includes the following steps:

  1. Load the data
  2. Visualize the data
  3. Pre-process the data
  4. Implement different ML models
  5. Choose the best ML model
  6. Tune the ML model
  7. Visualize the results

You will be able to use your project to demonstrate your skills to the industry. You will implement your own solution while learning the classes and methods needed to code a machine learning pipeline, using Python notebooks and libraries for machine learning and data processing, such as:

  • Scikit-Learn
  • Pandas
  • NumPy
  • Matplotlib
  • SciPy
  • Seaborn

What you'll learn and how you can apply it

  • Build a Python Conda environment to work with Machine Learning models
  • Learn the best Python libraries related to Machine Learning
  • Apply the most important classes, methods, and functions to use in your day-to-day work
  • Learn different options and approaches you can take to solve real problems
  • Implement the solution by yourself and get the skills needed to work on different problems

This training course is for you because...

You are a developer, data scientist, or engineer who wants to solve real-life problems using supervised machine learning algorithms. You will need basic knowledge of Python or another object-oriented language, as well as a good understanding of linear algebra, calculus, statistics, and the basics of machine learning.

Prerequisites

  • A pre-configured Conda environment
  • Basic knowledge of Python or another object-oriented language
  • Basic knowledge of statistics and math

Materials, downloads, or Supplemental Content needed in advance

https://github.com/fernandodamasio/building-a-robust-machine-learning-pipeline

Recommended Preparation

About your instructor

  • Fernando Damasio is an accomplished Senior Executive and thought leader with more than 15 years of success across the technology, automotive, education, logistics, marketing, and steel industries. Drawing on extensive experience in competitive markets, he is a valuable asset for a business developing its digital transformation and go-to-market strategy. His broad areas of expertise include technical skills, leadership, relationship management, competitive analysis, and methodology.

    Throughout his career, Fernando has held various leadership positions, including Engineer at Odebrecht AS, Project Leader at Vale AS, Session Lead at Udacity, and CEO of CashFlix. Currently, he is the Product Leader and Founder of Skoods, Principal Consultant at Data Riders, and Mentor for Udacity and Singularity University.

    Fernando has served as a key contributor to numerous organizational achievements. He led three of the largest projects at Vale, all of which were delivered on time and within budget. He founded CashFlix in 2014 after raising investment for the company, which provides an innovative purchasing solution that uses machine learning to read text in photos of customer purchase vouchers. In 2017, he founded Data Riders, a digital transformation consultancy, and in 2018 he started a new company, Skoods, a crowdsourced self-racing car team.

    Fernando received his Bachelor’s Degree in Automation and Control Engineering from Pontificia Universidade Católica de Minas Gerais and his Master of Business Administration in Project Management from Fundação Dom Cabral. He regularly participates in continuing education and professional development opportunities and has completed programs in Port and Harbor Engineering, Machine Learning Engineering, Self-Driving Car Engineering and Digital Strategies for Business.

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Day 1

Section 1: Introductions (15 mins)

Section 2: Understanding and Classifying the problem (15 mins)

  • Regression vs Classification
  • Type of Machine Learning algorithm

Section 3: Python Notebooks and Managing environments (15 mins)

  • Creating an environment
  • Cloning an environment
  • Building identical environments
  • Activating an environment
  • Deactivating an environment
  • Determining your current environment
  • Viewing a list of your environments
  • Viewing a list of the packages in an environment
  • Sharing an environment
  • Removing an environment
  • CODING TIME! (15 mins)

Break: 10 mins

Section 4: Loading the data (15 mins)

  • What is a data frame?
  • Method: pandas.read_csv
  • Method: pandas.DataFrame.drop
  • Method: pandas.concat
  • CODING TIME! (15 mins)
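As a rough illustration of the three pandas calls listed in this section, the sketch below loads two small tables, drops the columns that are not features, and stacks the results. The inline CSV text and the column names (`id`, `cat1`, `cont1`, `loss`) are hypothetical stand-ins, not the course dataset.

```python
import io

import pandas as pd

# Hypothetical miniature train/test files; the real course data lives in the
# linked repository and will have different columns and sizes.
TRAIN_CSV = "id,cat1,cont1,loss\n1,A,0.72,2213.18\n2,B,0.33,1283.60\n3,A,0.26,3005.09\n"
TEST_CSV = "id,cat1,cont1\n4,B,0.32\n5,A,0.27\n"

# read_csv accepts a path or any file-like object.
train = pd.read_csv(io.StringIO(TRAIN_CSV))
test = pd.read_csv(io.StringIO(TEST_CSV))

# Keep the target aside, then drop the columns that are not features.
target = train["loss"]
train_features = train.drop(columns=["id", "loss"])
test_features = test.drop(columns=["id"])

# Stack both sets so later pre-processing sees the same categories in each.
all_features = pd.concat([train_features, test_features], ignore_index=True)
print(all_features.shape)  # (5, 2)
```

In a notebook you would pass a file path to `pd.read_csv` instead of an in-memory buffer.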

Section 5: Understanding the data

  • Numerical Features (15 mins)
  • CODING TIME! (15 mins)
  • Categorical Features (15 mins)
  • CODING TIME! (15 mins)
  • The target variable (15 mins)
  • CODING TIME! (15 mins)
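One possible way to take a first look at numerical features, categorical features, and the target is sketched below. The data is randomly generated for illustration, and the column names are assumptions rather than the course dataset's actual schema.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: one numerical feature, one categorical feature,
# and a right-skewed target (all names are hypothetical).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "cont1": rng.uniform(0, 1, 200),
    "cat1": rng.choice(["A", "B", "C"], 200),
    "loss": rng.lognormal(mean=7.0, sigma=1.0, size=200),
})

# Split the columns by type so each group can be inspected separately.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# Summary statistics for the numerical columns.
print(df[numeric_cols].describe())

# Frequency of each category.
print(df["cat1"].value_counts())

# A clearly positive skew suggests a long right tail in the target.
target_skew = df["loss"].skew()
print(f"target skewness: {target_skew:.2f}")
```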

Break: 10 mins

Section 6: Pre-processing the data (30 mins)

  • Why pre-process the data?
  • Processing skewed data
  • Method: .apply(lambda)
  • Method: scipy.stats.boxcox
  • Method: pandas.factorize
  • CODING TIME! (20 mins)
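The methods named in this section could be combined roughly as follows: a log transform through `.apply(lambda ...)`, a Box-Cox transform for a skewed positive feature, and integer encoding of a categorical column. This is a minimal sketch on synthetic data with hypothetical column names. The choice of `np.log1p` is an assumption, not necessarily the transform used in class.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic, right-skewed data (hypothetical columns).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "cont1": rng.lognormal(mean=0.0, sigma=1.0, size=200),
    "cat1": rng.choice(["A", "B", "C"], 200),
    "loss": rng.lognormal(mean=7.0, sigma=1.0, size=200),
})

# Reduce the skew of the target with an element-wise log transform.
df["log_loss"] = df["loss"].apply(lambda x: np.log1p(x))

# Box-Cox needs strictly positive input; with no lambda given, SciPy
# estimates it and returns it alongside the transformed values.
transformed, fitted_lambda = stats.boxcox(df["cont1"].to_numpy())
df["cont1_bc"] = transformed

# factorize maps each distinct category to an integer code.
codes, uniques = pd.factorize(df["cat1"])
df["cat1_code"] = codes

print(df[["loss", "log_loss", "cont1", "cont1_bc"]].skew())
```

Integer codes imply an ordering that the categories may not have, so whether this encoding is appropriate depends on the model you train afterwards.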

Day 2

Section 7: Scaling the data (40 mins)

  • Why scale the data?
  • Method: sklearn.preprocessing.StandardScaler
  • Method: .fit
  • Method: .transform
  • Method: .fit_transform
  • Method: .reshape
  • CODING TIME! (20 mins)
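The difference between `.fit`, `.transform`, and `.fit_transform`, and the reason `.reshape` appears in this list, can be seen in a small sketch like the one below. The arrays are made-up numbers chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrices (hypothetical values): two columns on very different scales.
X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_test = np.array([[2.0, 500.0]])

# fit_transform learns the mean and standard deviation from the training data
# and scales it in one step; it is equivalent to .fit followed by .transform.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# The test data is only transformed, reusing the training statistics.
X_test_scaled = scaler.transform(X_test)

# StandardScaler expects 2-D input, so a 1-D target is reshaped to a single
# column before scaling and flattened again afterwards.
y_train = np.array([10.0, 20.0, 30.0])
y_scaler = StandardScaler()
y_train_scaled = y_scaler.fit_transform(y_train.reshape(-1, 1)).ravel()

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```

Fitting the scaler on the training data only keeps information from the test set out of the pre-processing step.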

Break: 10 mins

Section 8: Defining the parameters (30 mins)

  • Common parameters for algorithms
  • Standard score functions
  • Method: sklearn.metrics.make_scorer
  • Method: sklearn.metrics.r2_score
  • Method: sklearn.model_selection.ShuffleSplit
  • Method: sklearn.model_selection.GridSearchCV
  • CODING TIME! (20 mins)
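The scorer, the cross-validation splitter, and the grid search listed in this section might fit together as in the sketch below. It imports `ShuffleSplit` and `GridSearchCV` from `sklearn.model_selection`, where current scikit-learn releases keep them. The `Ridge` estimator, its `alpha` grid, and the synthetic data are assumptions chosen to keep the example small, not the course's own settings.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer, r2_score
from sklearn.model_selection import GridSearchCV, ShuffleSplit

# Synthetic regression data standing in for the insurance dataset.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Wrap the R^2 metric so that it can be passed as a scoring function.
scorer = make_scorer(r2_score)

# Five random 80/20 train/validation splits.
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

# Try each candidate value of alpha on every split.
search = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": [0.1, 1.0, 10.0]},
    scoring=scorer,
    cv=cv,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Defining the scorer and splitter once makes it easy to reuse the same evaluation setup for every algorithm you compare.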

BREAK (10 mins)

Section 9: Experimenting with different Machine Learning algorithms (60 mins)

  • Method: sklearn.linear_model.BayesianRidge
  • Method: sklearn.ensemble.GradientBoostingRegressor
  • Other algorithms
  • CODING TIME! (30 mins)
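A simple way to experiment with the two algorithms named in this section is to score each one on the same cross-validation splits, as sketched here. The synthetic data, the default hyperparameters, and the use of `cross_val_score` are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic regression data standing in for the insurance dataset.
X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=42)

# Use the same splits for every model so the scores are comparable.
cv = ShuffleSplit(n_splits=3, test_size=0.25, random_state=42)

models = {
    "BayesianRidge": BayesianRidge(),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=42),
}

# Mean R^2 across the splits for each candidate model.
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, scoring="r2", cv=cv)
    results[name] = scores.mean()

best_name = max(results, key=results.get)
for name, score in results.items():
    print(f"{name}: mean R^2 = {score:.3f}")
print("best:", best_name)
```

Which model scores best depends on the data; the ranking on this synthetic set says nothing about the ranking on the course dataset.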

Section 10: Visualizing the results (15 mins)

CODING TIME! (15 mins)
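One common way to visualize regression results is a predicted-versus-actual scatter plot, sketched below with Matplotlib. The values are randomly generated placeholders, not output from a trained model, and the axis labels are assumptions.

```python
import matplotlib

matplotlib.use("Agg")  # off-screen rendering; omit this line inside a notebook

import matplotlib.pyplot as plt
import numpy as np

# Placeholder values: in practice y_pred would come from model.predict(X_test).
rng = np.random.default_rng(0)
y_true = rng.uniform(0, 100, 50)
y_pred = y_true + rng.normal(0, 10, 50)

fig, ax = plt.subplots(figsize=(5, 5))

# Each point is one sample; a perfect model would put every point on the diagonal.
ax.scatter(y_true, y_pred, alpha=0.6)

# Draw the diagonal as a visual reference.
low = min(y_true.min(), y_pred.min())
high = max(y_true.max(), y_pred.max())
ax.plot([low, high], [low, high], linestyle="--", color="gray")

ax.set_xlabel("Actual loss")
ax.set_ylabel("Predicted loss")
ax.set_title("Predicted vs. actual")
fig.tight_layout()
```

Points far from the dashed line are the samples where the model's error is largest.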

Final Q&A