Evaluating Machine Learning Models

Book Description

Data science today is a lot like the Wild West: there’s endless opportunity and excitement, but also a lot of chaos and confusion. If you’re new to data science and applied machine learning, evaluating a machine-learning model can seem pretty overwhelming. Now you have help. With this O’Reilly report, machine-learning expert Alice Zheng takes you through the model evaluation basics.

In this overview, Zheng first introduces the machine-learning workflow, and then dives into evaluation metrics and model selection. The latter half of the report focuses on hyperparameter tuning and A/B testing, which may benefit more seasoned machine-learning practitioners.

With this report, you will:

  • Learn the stages involved when developing a machine-learning model for use in a software application
  • Understand the metrics used for supervised learning models, including classification, regression, and ranking
  • Walk through evaluation mechanisms, such as hold-out validation, cross-validation, and bootstrapping
  • Explore hyperparameter tuning in detail, and discover why it’s so difficult
  • Learn the pitfalls of A/B testing, and examine a promising alternative: multi-armed bandits
  • Get suggestions for further reading, as well as useful software packages

Table of Contents

  1. Preface
  2. 1. Orientation
    1. The Machine Learning Workflow
    2. Evaluation Metrics
      1. Offline Evaluation Mechanisms
    3. Hyperparameter Search
    4. Online Testing Mechanisms
  3. 2. Evaluation Metrics
    1. Classification Metrics
      1. Accuracy
      2. Confusion Matrix
      3. Per-Class Accuracy
      4. Log-Loss
      5. AUC
    2. Ranking Metrics
      1. Precision-Recall
      2. Precision-Recall Curve and the F1 Score
      3. NDCG
    3. Regression Metrics
      1. RMSE
      2. Quantiles of Errors
      3. “Almost Correct” Predictions
    4. Caution: The Difference Between Training Metrics and Evaluation Metrics
    5. Caution: Skewed Datasets—Imbalanced Classes, Outliers, and Rare Data
    6. Related Reading
    7. Software Packages
  4. 3. Offline Evaluation Mechanisms: Hold-Out Validation, Cross-Validation, and Bootstrapping
    1. Unpacking the Prototyping Phase: Training, Validation, Model Selection
    2. Why Not Just Collect More Data?
    3. Hold-Out Validation
    4. Cross-Validation
    5. Bootstrap and Jackknife
    6. Caution: The Difference Between Model Validation and Testing
    7. Summary
    8. Related Reading
    9. Software Packages
  5. 4. Hyperparameter Tuning
    1. Model Parameters Versus Hyperparameters
    2. What Do Hyperparameters Do?
    3. Hyperparameter Tuning Mechanism
    4. Hyperparameter Tuning Algorithms
      1. Grid Search
      2. Random Search
      3. Smart Hyperparameter Tuning
    5. The Case for Nested Cross-Validation
    6. Related Reading
    7. Software Packages
  6. 5. The Pitfalls of A/B Testing
    1. A/B Testing: What Is It?
    2. Pitfalls of A/B Testing
      1. 1. Complete Separation of Experiences
      2. 2. Which Metric?
      3. 3. How Much Change Counts as Real Change?
      4. 4. One-Sided or Two-Sided Test?
      5. 5. How Many False Positives Are You Willing to Tolerate?
      6. 6. How Many Observations Do You Need?
      7. 7. Is the Distribution of the Metric Gaussian?
      8. 8. Are the Variances Equal?
      9. 9. What Does the p-Value Mean?
      10. 10. Multiple Models, Multiple Hypotheses
      11. 11. How Long to Run the Test?
      12. 12. Catching Distribution Drift
    3. Multi-Armed Bandits: An Alternative
    4. Related Reading
    5. That’s All, Folks!