Live Online Training

Stream Processing with Apache Spark

Mastering Structured Streaming

Gerard Maas

Stream processing lets us analyze and extract value from data as soon as it becomes available. It gives businesses the ability to observe and react to changes in their environment as they happen, and to turn those changes into actionable insights and, ultimately, into a competitive advantage.

Apache Spark is a unified analytics engine offering batch and streaming capabilities that is compatible with a polyglot approach to data analytics, offering APIs in Scala, Java, Python, and the R language.

In this training, we focus on the streaming capabilities of Apache Spark and how they can be applied in practice to extract value from the streams of data available today.

What you'll learn and how you can apply it

By the end of this live, hands-on, online course, you’ll understand:

  • How the streaming Dataset abstraction and API let us operate on streaming data
  • The concept of a streaming Source to obtain data from a stream data producer
  • The concept of a streaming Sink to produce data to other systems
  • The importance of event time, how to use it in aggregations, and its requirements and limitations
  • How to use the stateful processing API to create arbitrary aggregations over a stream
  • How we can use Spark ML to apply machine learning models to a stream of data

And you’ll be able to:

  • Write Structured Streaming jobs that apply your business logic to streaming data by transforming and aggregating the data
  • Read from and write to Kafka as a streaming backend
  • Load and apply a pre-trained machine learning model to a data stream to score the data

This training course is for you because...

  • You are a Data Engineer who wants to move workloads to a streaming model.
  • You work with Spark and want to increase your understanding of its streaming capabilities.
  • You want to develop streaming super-powers as a Data Engineer.

Prerequisites

  • We use Scala as the main programming language for this course. A basic understanding of the language is recommended.
  • We build upon core Spark concepts in this stream-focused course. While we quickly introduce these concepts, it would be beneficial to have some knowledge of Spark SQL, Datasets, and DataFrames.

About your instructor

  • Gerard Maas is a Principal Engineer at Lightbend, where he contributes to the integration of stream processing technologies. Previously, he held leading roles at several startups and large enterprises, building data science governance, cloud-native IoT platforms, and scalable APIs. He is the author of Stream Processing with Apache Spark from O’Reilly. Gerard is a frequent speaker at conferences and meetups. He likes to contribute to small and large open-source projects.

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Introduction to Stream Processing with Apache Spark (25 minutes)

  • Presentation: An overview of the course and its structure
  • Presentation: Thinking in streams: The general streaming model
  • Q&A

Building your intuition: Batch vs Stream processing through an example (30 minutes)

  • Presentation: First exposure to the Spark APIs for data processing by moving a batch process into a streaming application. (10 minutes)
  • Exercise: Batch vs Streaming Analytics Notebook (20 minutes)
  • What makes stream processing different? In this exercise, we see how the Spark API offers a unified model for both processing styles. We analyze public weblogs from the NASA website using both approaches and discover the patterns around a launch.
  • Q&A
  • Break (5 minutes)
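
To give a feel for the unified model mentioned above, here is a minimal sketch in which the same DataFrame logic runs over a batch read and a streaming read. It is not the course notebook: the JSON log schema, the `errorsPerUrl` helper, and the `/tmp/weblogs` directory are assumptions made for illustration, and the job is meant to be submitted with `spark-submit`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

object BatchVsStreaming {
  // Hypothetical schema: web log records stored as JSON lines.
  val logSchema: StructType = new StructType()
    .add("host", StringType)
    .add("url", StringType)
    .add("status", IntegerType)

  // Business logic written once against the DataFrame API:
  // count the failed requests (HTTP status >= 400) per URL.
  def errorsPerUrl(logs: DataFrame): DataFrame =
    logs.where(col("status") >= 400).groupBy("url").count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("batch-vs-streaming").getOrCreate()
    val path = args.headOption.getOrElse("/tmp/weblogs") // assumed input directory

    // Batch: read the files present right now and compute the result once.
    errorsPerUrl(spark.read.schema(logSchema).json(path)).show()

    // Streaming: the same function, evaluated incrementally as new files arrive.
    val query = errorsPerUrl(spark.readStream.schema(logSchema).json(path))
      .writeStream
      .outputMode("complete") // re-emit the full counts table on every trigger
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```

Only the entry point (`read` vs `readStream`) and the way results are delivered (`show()` vs a running query) differ; the transformation itself is shared.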

Core components of the Structured Streaming API (35 minutes)

  • Presentation: The Structured Streaming API. Sources. Data processing, transformations, and joins. Sinks.
  • Exercise: IoT Stream processing using Kafka and Structured Streaming
  • In this exercise, we build an IoT-inspired data processing pipeline that consumes sensor data from Kafka, transforms it using the Structured Streaming APIs, and writes it back to Kafka for further processing.
  • Q&A
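
A pipeline of the kind described in this exercise could look roughly like the sketch below. The broker address, topic names, checkpoint path, JSON payload schema, and the Celsius-to-Fahrenheit conversion are all assumptions for illustration and are not taken from the course material. Running it requires the `spark-sql-kafka-0-10` connector on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_json, struct, to_json}
import org.apache.spark.sql.types.{DoubleType, StringType, StructType, TimestampType}

object SensorPipeline {
  // Hypothetical schema of the JSON payload carried in each Kafka record.
  val readingSchema: StructType = new StructType()
    .add("sensorId", StringType)
    .add("temperature", DoubleType)
    .add("timestamp", TimestampType)

  // Parse the Kafka `value` column, drop records that fail to parse,
  // convert the temperature, and re-encode the result as the
  // key/value columns that the Kafka sink expects.
  def transform(raw: DataFrame): DataFrame =
    raw
      .select(from_json(col("value").cast("string"), readingSchema).as("r"))
      .select("r.*")
      .where(col("sensorId").isNotNull && col("temperature").isNotNull)
      .withColumn("temperatureF", col("temperature") * 9 / 5 + 32)
      .select(
        col("sensorId").as("key"),
        to_json(struct(col("sensorId"), col("temperatureF"), col("timestamp"))).as("value"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sensor-pipeline").getOrCreate()

    // Source: subscribe to the raw sensor topic.
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker
      .option("subscribe", "sensor-raw")                   // assumed topic
      .load()

    // Sink: write the cleaned records to another topic.
    val query = transform(raw).writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "sensor-clean")                     // assumed topic
      .option("checkpointLocation", "/tmp/checkpoints/sensor-pipeline")
      .start()

    query.awaitTermination()
  }
}
```

Keeping the logic in `transform` separate from the source and sink wiring makes it possible to try it out on a small batch DataFrame before connecting it to Kafka.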

Stateful Computations in Structured Streaming (40 minutes)

  • Presentation: Stateful operations, their requirements, and challenges.
  • Presentation: Support for event time, window functions, and other built-in stateful operations, such as stream deduplication. (15 minutes)
  • Exercise: Continuing with our IoT use case, we implement several stateful aggregations that help us observe the behavior of our devices over time. (20 minutes)
  • Q&A
  • Break (5 minutes)
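
As a rough illustration of the event-time operations listed above, the sketch below shows a watermarked window aggregation and, separately, a watermarked deduplication. The column names, the 5-minute window, and the 10-minute watermark are assumptions chosen for the example. It uses Spark's built-in `rate` source to fabricate readings, so it runs without Kafka.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{avg, col, concat, lit, window}

object WindowedAverages {
  // Assumes input columns: sensorId (string), temperature (double),
  // timestamp (timestamp, used as the event time).
  def averagePerWindow(readings: DataFrame): DataFrame =
    readings
      .withWatermark("timestamp", "10 minutes") // tolerate events up to 10 minutes late
      .groupBy(window(col("timestamp"), "5 minutes"), col("sensorId"))
      .agg(avg("temperature").as("avgTemperature"))

  // Stream deduplication: including the event-time column in the keys
  // lets the watermark bound how long seen keys are kept in state.
  def deduplicate(readings: DataFrame): DataFrame =
    readings
      .withWatermark("timestamp", "10 minutes")
      .dropDuplicates("sensorId", "timestamp")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("windowed-averages").getOrCreate()

    // Synthetic readings derived from the rate source's `value` and `timestamp` columns.
    val readings = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()
      .select(
        concat(lit("sensor-"), (col("value") % 3).cast("string")).as("sensorId"),
        (lit(20.0) + col("value") % 10).as("temperature"),
        col("timestamp"))

    val query = averagePerWindow(readings).writeStream
      .outputMode("update") // emit only the windows that changed in each trigger
      .format("console")
      .option("truncate", "false")
      .start()

    query.awaitTermination()
  }
}
```

The watermark is what allows Spark to discard state for windows that can no longer receive data; without it, state for an unbounded stream would grow indefinitely.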

Applying Machine Learning (ML) models in Structured Streaming (40 minutes)

  • Presentation: Integrating Spark ML with Structured Streaming. Using a pre-trained ML model to score a data stream and predict certain conditions based on the new data. (10 minutes)
  • Exercise: In previous examples, we were processing IoT data to observe the behavior of the sensors. Now, we are going to apply machine learning techniques to predict room occupancy based on the data delivered by sensors in the room. (25 minutes)
  • Q&A (5 minutes)
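
Scoring a stream with a pre-trained model could be sketched as follows. The model path, input directory, and feature columns (`temperature`, `humidity`, `co2`) are hypothetical, and the sketch assumes the saved pipeline assembles its own feature vector and that each of its stages accepts streaming input.

```scala
import org.apache.spark.ml.{PipelineModel, Transformer}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StringType, StructType}

object OccupancyScoring {
  // Hypothetical schema of the incoming room sensor readings (JSON lines).
  val readingSchema: StructType = new StructType()
    .add("roomId", StringType)
    .add("temperature", DoubleType)
    .add("humidity", DoubleType)
    .add("co2", DoubleType)

  // Apply the model and keep only the room and its predicted occupancy.
  // `prediction` is the default output column of Spark ML predictors.
  def score(model: Transformer, readings: DataFrame): DataFrame =
    model.transform(readings).select("roomId", "prediction")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("occupancy-scoring").getOrCreate()
    val modelPath = args.lift(0).getOrElse("/tmp/models/occupancy") // assumed location
    val inputPath = args.lift(1).getOrElse("/tmp/room-readings")    // assumed location

    // Load a pipeline that was trained and saved in a separate batch job.
    val model = PipelineModel.load(modelPath)

    // Treat new JSON files in the directory as a stream of readings.
    val readings = spark.readStream.schema(readingSchema).json(inputPath)

    val query = score(model, readings).writeStream
      .outputMode("append")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```

Because a fitted pipeline is a `Transformer`, the same `score` function can be checked against a small batch DataFrame before it is attached to a stream.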