O'Reilly Live Online Training

Big Data Modeling

Ted Malaska

Recent advancements in distributed processing, including Spark, Impala, Spark Streaming, and Storm, are exciting. But if your design focuses only on the processing layer for speed and power, you may be missing half the story and leaving a significant amount of optimization untapped.

In this course, Ted Malaska looks down the stack and demonstrates a set of storage design patterns and schemas. By carefully tailoring how data is stored for each use case, you can reduce your processing and access times by two to three orders of magnitude.

While the strategies and principles you'll learn in this class can be applied in many software environments, examples will be shown using HDFS, HBase, Cassandra, Kudu, Kafka, Elasticsearch, and S3.
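
As a rough illustration of why storage layout matters, the toy sketch below compares answering a per-device query with a full scan against reading a single partition keyed by the access pattern. The record shape, the `device_id` key, and the helper names are hypothetical and are not taken from the course material. The counts show how many records each approach touches on this synthetic data and are not a benchmark of any real storage system.

```python
from collections import defaultdict

# Hypothetical event records: (device_id, timestamp, reading).
# 100,000 events spread evenly across 1,000 devices.
events = [(f"device-{i % 1000}", i, i * 0.5) for i in range(100_000)]


def scan_unpartitioned(records, device_id):
    """Answer a per-device query by examining every record."""
    touched = 0
    matches = []
    for rec in records:
        touched += 1
        if rec[0] == device_id:
            matches.append(rec)
    return matches, touched


def partition_by_device(records):
    """Lay the same records out keyed by the field the queries filter on."""
    partitions = defaultdict(list)
    for rec in records:
        partitions[rec[0]].append(rec)
    return partitions


def read_partition(partitions, device_id):
    """Answer the same query by reading only the matching partition."""
    matches = partitions.get(device_id, [])
    return matches, len(matches)


scan_matches, scan_touched = scan_unpartitioned(events, "device-42")
partitions = partition_by_device(events)
part_matches, part_touched = read_partition(partitions, "device-42")

print(f"full scan touched {scan_touched} records")
print(f"partition read touched {part_touched} records")
```

Both paths return the same records. The only difference is how much data has to be read to find them, which is the kind of trade-off the storage patterns in this course are meant to address.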

What you'll learn and how you can apply it

By the end of this course, you'll understand:

  • The differences among the major big data storage structures
  • The linkage between access patterns and data modeling selections
  • How latency and uniqueness are affected when selecting a storage system

And you'll be able to:

  • Optimize storage cost
  • Optimize for queries and multi-user workloads
  • Select storage systems and data models that hold up as long-term investments

This training course is for you because...

You are a data architect or engineer who needs to build out big data solutions such as IoT, cheap storage, deep learning, or SQL for multi-tenant workloads.

Prerequisites:


  • Basic RDBMS data modeling
  • An idea of what you want to get out of your data

Recommended Preparation:

About your instructor

  • Ted works on the Battle.net team at Blizzard, helping support great titles like World of Warcraft, Overwatch, Hearthstone, and more. Before that, he was a Principal Solutions Architect at Cloudera, helping clients succeed with Hadoop and the Hadoop ecosystem, and earlier he was a Lead Architect at the Financial Industry Regulatory Authority (FINRA). He has contributed code to Apache Flume, Apache Avro, Apache YARN, Apache HDFS, Apache Spark, Apache Sqoop, and many other projects. Ted is also a coauthor of O’Reilly’s “Hadoop Application Architectures,” a frequent conference speaker, and a frequent blogger on data architectures.


The timeframes are only estimates and may vary according to how the class is progressing.

Segment 1: Introduction to Big Data Modeling (10 minutes)

  • Fundamental things we need to keep in mind as we continue
  • Q&A (5 minutes)

Segment 2: Walking through the importance of access patterns (10 minutes)

  • Access patterns and the role they play in building big data systems
  • Participants will share their access patterns so the course can be tailored to them
  • Q&A (5 minutes)

Segment 2: Different Types of Storage Systems (Part 1) (30 minutes)

  • How storage systems work

Segment 3: Different Types of Storage Systems (Part 2) (30 minutes)

  • How storage systems work

Segment 4: Different Types of Storage Systems (Part 3) (30 minutes)

  • How storage systems work

Segment 5: Data Modeling: Starting from relational (30 minutes)

  • Data modeling from a relational system