Live Online Training

Accelerate and Migrate Your Data Science to GPU with RAPIDS

Achieve GPU performance the easy way with NVIDIA’s open-source Python framework, RAPIDS

Adam Breindel

Using GPUs (graphics hardware) for general computation on data is not a new idea. But until recently, it required special skills and was prohibitively complex and expensive for most businesses. The rise of deep learning made it easier for non-specialist engineers to solve common problems using basic Python code while leveraging GPU speed.

With the ongoing development and releases of RAPIDS, an open-source data science framework, it is now possible for typical data scientists and engineers, with typical Python knowledge and experience, to easily move their workflows to the GPU.

RAPIDS is designed to use skills you already have -- like working with tabular data in SQL or Pandas, and building models with scikit-learn -- and to deliver large speedups through GPU compute. RAPIDS users and partners include Anaconda, Walmart, Databricks, IBM, and Uber, among many others.

What you'll learn, and how you can apply it

By the end of this live, hands-on, online course, you’ll understand:

  • The enormous performance improvements enabled by GPU computation on “regular” business data
  • How RAPIDS enables extremely easy use of GPU compute in Python: no writing CUDA kernels, or working with low-level interfaces to hardware
  • How to scale up (with one or more GPUs) as well as scale out (to multiple servers) to handle any data challenge

And you’ll be able to:

  • Transform data and perform feature engineering and extraction with RAPIDS
  • Train or tune ML models with RAPIDS
  • Perform graph computations, clustering, dimensionality reduction, and other techniques
  • Realize huge improvements in compute speed

This training course is for you because...

  • You are a data scientist or engineer
  • You work with large amounts of data or need much faster performance on modest-sized datasets
  • You want to become a lead or architect for the next generation of data-intensive applications

Prerequisites

  • Basic knowledge of Python
  • At least basic familiarity with NumPy, Pandas, and scikit-learn
  • Some knowledge of SQL is helpful as well

Recommended preparation

  • Attendees will have access to all notebooks for use in-class and after the class. The delivery will be via a free online cloud-based platform, Google Colaboratory.

Recommended follow-up

https://docs.rapids.ai/

Common misunderstandings

  • Mistakenly thinking that complex, low-level coding in special languages will be required to leverage a GPU
  • Imagining that general data engineering on a GPU will be a very resource-intensive project, requiring many person-hours and a large financial investment
  • Being unaware that GPU vendors like NVIDIA are embracing open-source, easy-to-use tools

About your instructor

  • Adam Breindel consults and teaches courses on Apache Spark, data engineering, machine learning, AI, and deep learning. He supports instructional initiatives as a senior instructor at Databricks, has taught classes on Apache Spark and deep learning for O'Reilly, and runs a business helping large firms and startups implement data and ML architectures. Adam’s first full-time job in tech was neural net–based fraud detection, deployed at North America's largest banks; since then, he's worked with numerous startups, where he’s enjoyed getting to build things like mobile check-in for two of America's five biggest airlines years before the iPhone came out. He’s also worked in entertainment, insurance, and retail banking; on web, embedded, and server apps; and on clustering architectures, APIs, and streaming analytics.

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Intro/Overview (30 minutes)

  • How GPUs Differ from CPUs and Why They Offer Such Enormous Performance
  • Basic tabular computation on GPU
  • NumPy vs. CuPy/PyTorch
  • Exercise: What's on Our Wish List? What's Missing?
  • Q&A
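
As a rough illustration of the "NumPy vs. CuPy" item above, the sketch below writes one function against a generic array module and runs it on whichever backend is available. It is not taken from the course notebooks. The helper names `get_array_module` and `standardize` are hypothetical, and the code assumes CuPy is only usable when it imports cleanly and can allocate on a GPU; otherwise it falls back to NumPy.

```python
import numpy as np


def get_array_module():
    """Return CuPy if it imports and a GPU is usable, otherwise NumPy."""
    try:
        import cupy as cp
        cp.zeros(1)  # forces a device allocation; fails without a usable GPU
        return cp
    except Exception:
        return np


xp = get_array_module()


def standardize(a):
    """Scale each column to zero mean and unit variance.

    Uses only methods that NumPy and CuPy arrays share, so the same
    code runs on CPU or GPU depending on the array type passed in.
    """
    return (a - a.mean(axis=0)) / a.std(axis=0)


# A small 4x3 table of "regular" numeric data.
data = xp.arange(12, dtype=xp.float64).reshape(4, 3)
z = standardize(data)

# Copy the per-column means back to plain Python floats for inspection.
col_means = [float(m) for m in z.mean(axis=0)]
print(xp.__name__, col_means)
```

The point of the design is that the array-processing code itself does not change between backends; only the module bound to `xp` does.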

BlazingSQL (30 minutes)

  • GPU-Accelerated SQL Engine
  • Querying Files and Data Lakes
  • Exercise: Write a SQL Report and Run it on GPU
  • Q&A
  • 5-minute break

cuDF: The Heart of RAPIDS (30 minutes)

  • CUDA-enabled Data Frame that Works Like Pandas
  • Current and Future Features
  • Performing Common Data Engineering and Processing Tasks with cuDF
  • Exercise: Working With BlazingSQL Result Sets as CUDA Data Frames
  • Q&A

cuML: High-Level Machine Learning Tools (30 minutes)

  • ML on the GPU with the ease of scikit-learn
  • Current Algorithm Support
  • Feature Engineering Helpers
  • Exercise: Let’s Train a Model on GPU
  • Q&A
  • 5-minute break

cuGraph: Graph Analytics Overview (20 minutes)

  • Building Graphs
  • Built-in Algorithms
  • PageRank, Breadth-First Search
  • Exercise: Finding Shortest Paths
  • Q&A
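
For readers who want a reference point before the shortest-paths exercise, here is a minimal plain-Python sketch of breadth-first search on an unweighted, undirected graph. It runs on the CPU with the standard library only and does not use the cuGraph API; the function name `bfs_shortest_path` and the sample edge list are made up for illustration.

```python
from collections import deque


def bfs_shortest_path(edges, source, target):
    """Return one shortest path (fewest edges) from source to target, or None."""
    # Build an undirected adjacency list from (u, v) edge pairs.
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)

    # parent doubles as the visited set and lets us rebuild the path.
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk parent links back to the source, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbor in adjacency.get(node, []):
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return None  # target is not reachable from source


edges = [("a", "b"), ("b", "c"), ("a", "d"), ("d", "e"), ("e", "c"), ("c", "f")]
path = bfs_shortest_path(edges, "a", "f")
print(path)  # ['a', 'b', 'c', 'f']
```

A GPU graph library would typically take the edge list as a data frame and return distances for every vertex at once, but the underlying traversal idea is the same.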

End-to-End Problem Solving and Wrap-Up (30 minutes)

  • Scaling to Multiple GPUs or Nodes
  • Integrating with Existing Data Engineering Environments (e.g., Hadoop)
  • Generating Visualizations
  • Q&A