Scale Your Python Processing with Dask
Crunch Big Data Easily in Python, From a Few Cores to a Few Thousand Machines
Python is a (maybe the) preeminent language for data science. And the SciPy ecosystem of tools enables hundreds of different use cases, from astronomy to financial time series analysis to natural language processing. Most Python tools assume your data fits in memory, and many do not support parallel execution. But today, we have much more data and much more compute power, so we want to scale our open source Python tools to huge datasets and huge compute clusters.
The open-source Dask project supports scaling the Python data ecosystem in a straightforward and understandable way, and works well from single laptops to thousand-machine clusters. Dask scales Pandas DataFrames, scikit-learn ML, and NumPy tensor operations, and it also allows lower-level, custom task scheduling for more unusual algorithms. Dask plays nice with all of the toys you want. Just a few examples include Kubernetes for scaling, GPUs for acceleration, Parquet for data ingestion, and Datashader for visualization.
What you'll learn and how you can apply it
By the end of this live, hands-on, online course, you’ll understand:
- What Dask is and why it exists
- How Dask fits into the Python and big data landscape
- How Dask can help you process more data faster, from a laptop up to a big cluster
And you’ll be able to:
- Get started building systems with Dask
- Add Dask and start migrating existing components incrementally
- Analyze data and train ML models with Dask
This training course is for you because...
- You are a data engineer, data scientist, or natural/social scientist
- You work with Python and data
- You want to become a practitioner or leader who focuses on pragmatic, effective solutions
Prerequisites
- Python, basic to intermediate level
- Python data science stack (Pandas, NumPy, scikit-learn) at a basic level
- Optionally, review portions of Python for Data Analysis, second edition (book):
  - If your Python is rusty, review Chapters 2-3.
  - Review NumPy and Pandas with Chapters 4-5.
- Optionally, review portions of Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow, second edition (book):
  - If you are new to ML, see Chapters 1-2.
  - Review the most common ML techniques with Chapters 3-7.
- Read Designing Data-Intensive Applications (book) to understand the wider set of concerns in architecting big data systems (not just using Python and Dask, specifically).
About your instructor
Adam Breindel consults and teaches widely on Apache Spark and other technologies. Adam's experience includes work with banks on neural-net fraud detection, streaming analytics, cluster management code, and web apps, as well as development at a variety of startup and established companies in the travel, productivity, and entertainment industries. He is excited by the way that Spark and other modern big-data tech remove so many old obstacles to system design and make it possible to explore new categories of interesting, fun, hard problems.
Schedule
The timeframes are only estimates and may vary according to how the class is progressing.
Introduction (55 minutes)
- Presentation: About Dask - What it is, where it came from, what problems it solves
- Discussion: Options for setting up and deploying Dask
- Presentation: Pandas-style Analytics with Pandas and Dask DataFrame
- Exercise: Try a Hands-on Analytics Example
- Break (5 minutes)
Dask Graphical User Interfaces (30 minutes)
- Presentation: Monitoring Workers, Tasks, and Memory
- Presentation: Using Dask’s Built-In Profiling to Understand Performance
- Exercise: Analyze the Performance of Data Transformation
Machine Learning (25 minutes)
- Presentation: Scikit-Style Featurization with Dask
- Discussion: Current Algorithm Support and Integration
- Presentation: Modeling Task
- Exercise: Try an Alternate Model
- Break (5 minutes)
Additional Data Structure Overview (25 minutes)
- Presentation: Dask Array
- Discussion: What Can We Do with Dask Array?
- Presentation: Dask Bag
- Exercise: Look at Lower-Level Task Graph Opportunities in the Docs
Best Practices and Extended Q&A (35 minutes)
- Presentation: Managing Partitions and Tasks
- Discussion: File Formats and Data Structures
- Presentation: Caching
Q&A (at least 15 minutes reserved)