Live online training

Understanding data science algorithms in R: Scaling, normalisation and clustering

Jamie Owen

Build on your foundational knowledge of R as a tool for data science. Expert Jamie Owen walks you through typical data science algorithms, such as clustering and regularization, explaining why and how certain datasets should be scaled and normalized and detailing the trade-offs between different clustering algorithms. You'll gain hands-on experience with key concepts using small toy datasets and understand the bigger picture through complex, interesting datasets such as OkCupid registrations and James Bond behavioral statistics. If you’re a programmer who is interested in data science, a manager who wants to summarize datasets, or simply someone who uses data and wants to learn how to analyze and summarize it, this course is for you.

What you'll learn and how you can apply it

By the end of this live, online course, you’ll understand:

  • Why datasets are normalized and scaled
  • The principles behind clustering algorithms such as k-means
  • The importance of choosing an appropriate distance measure

And you’ll be able to:

  • Scale and normalize datasets
  • Cluster data using appropriate algorithms

This training course is for you because...

  • You're a programmer who is interested in data science but has little or no background in statistics or mathematics.
  • You're a manager who wants to summarize datasets.
  • You use data but don't have the necessary training to analyze and summarize it.

Prerequisites

  • A working knowledge of any programming language (Python, MATLAB, C, Java, etc.)
  • Familiarity with R not required

Required materials and setup:

  • A machine with the latest version of R and RStudio installed
  • Download the course R package (link to come)

Recommended preparation:

About your instructor

  • Dr. Jamie Owen is a Senior Data Scientist and Lead Trainer at Jumping Rivers. Having obtained a PhD focusing on computational statistics, Jamie was one of the founding members of Jumping Rivers. He has been delivering R training since 2011 to a diverse range of audiences, at levels from beginner to advanced. Jamie has taught courses for universities, government agencies, and some of the largest UK companies, with clients including Newcastle University, Virgin Media, the NHS, the Ministry of Defence, and Shell.

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Introduction and course overview (20 minutes)

  • Lecture and discussion: How you might use data science in your role

Scaling and normalizing data (60 minutes)

  • Lecture: Why scale data; scaling algorithms; min-max; z-scores; Mahalanobis distance
  • Hands-on exercise

Break (10 minutes)

Clustering (25 minutes)

  • Lecture: What clustering is; the importance of scaling; hierarchical clustering; distance measures; clustering algorithms

Advanced clustering techniques (20 minutes)

  • Lecture: Advanced techniques; how to cope with the plethora of R packages
  • Hands-on exercise

Break (10 minutes)

K-means clustering (25 minutes)

  • Lecture: How k-means clustering works; alternatives
  • Hands-on exercise

Wrap-up and Q&A (10 minutes)