Introduction

This is it, real machine learning! You have finally arrived with enough background and context to use the Python sklearn machine learning framework, which we will use in Weeks 4, 5, 6, and 7 with different estimation algorithms. The popularity of Python for machine learning in business is primarily due to the sklearn framework and its relative ease of use. Once you implement one machine learning analysis, it is easy to apply the same framework to another machine learning algorithm with minimal coding changes.
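To make the "minimal coding changes" point concrete, here is a small sketch, not taken from the course templates. The data are synthetic and invented for illustration, and the decision tree is chosen only as an example of a second algorithm, not necessarily one covered in later weeks. The fit and score calls are identical for both models.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, invented for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3 + 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=100)

# Two different algorithms, same sklearn calls: only the constructor changes
fits = {}
for model in (LinearRegression(), DecisionTreeRegressor(max_depth=3, random_state=0)):
    model.fit(X, y)                                   # estimate the model
    fits[type(model).__name__] = model.score(X, y)    # R-squared on the same data

for name, r2 in fits.items():
    print(f"{name}: R-squared = {r2:.3f}")
```

Swapping in another estimator means changing one line, the constructor, while the rest of the analysis stays the same.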

To emphasize, you are learning some of the same procedures used by practicing data scientists, using the same Python code base. Of course, there is much more context to understand. What is the source of the data? Is it reliable? Expensive? How is the model deployed in practice? Are there ethical concerns, such as anonymity? There is more to learn, but the core machine learning process is what you are learning this week and in the following weeks.

Multiple Regression Machine Learning Concepts

Data Splitting

The Most Important Idea in Data Science by Cassie Kozyrkov, Head of Decision Intelligence, Google. (From medium.com, you get 5 free reads a month.)

video [8:56]: Split Data into Training/Testing Data, corresponds to Sec 4.1-4.2

[This video was made in 2019. Some revisions have been made to the written material since then, but the video remains relevant. What the video shows as Chapter 5 is now called Section 4.]
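As a rough illustration of splitting data into training and testing sets, here is a minimal sketch using sklearn's train_test_split. The data frame d, its columns x1, x2, and y, and the 25% test size are assumptions made up for this example, not values from the video or the text.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data, invented for illustration only
rng = np.random.default_rng(0)
d = pd.DataFrame({
    "x1": rng.normal(size=100),
    "x2": rng.normal(size=100),
})
d["y"] = 3 + 2 * d["x1"] - d["x2"] + rng.normal(scale=0.5, size=100)

X = d[["x1", "x2"]]   # features
y = d["y"]            # target

# Hold out 25% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

print(X_train.shape, X_test.shape)
```

The model is estimated only on the training rows, so the testing rows act as new data the model has never seen.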

Multiple Regression

Videos of Section 5 Material

Sec 5.1 Multiple Regression Model [12:13]

Sec 5.2 Feature Selection [9:35]

Sec 5.3 Collinearity [9:52]
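One common way to check for collinearity is the variance inflation factor (VIF), computed by regressing each feature on all the other features. The sketch below is an illustration under assumptions: the data are synthetic, the helper function vif is hypothetical, and whether the video uses this exact diagnostic is not stated here.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic features, invented for illustration only
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)  # nearly a copy of x1: collinear
x3 = rng.normal(size=n)                  # unrelated to the others
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

def vif(X):
    """VIF of each column: 1 / (1 - R^2) from regressing it on the other columns."""
    out = {}
    for col in X.columns:
        others = X.drop(columns=col)
        r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
        out[col] = 1.0 / (1.0 - r2)
    return pd.Series(out)

vifs = vif(X)
print(vifs.round(1))
```

Here x1 and x2 carry almost the same information, so each has a large VIF, while x3 stays close to 1. A frequent response is to drop one of the redundant features.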

Python sklearn Implementation

These Jupyter notebook templates contain much information about how to do supervised machine learning, here applied to linear models estimated with traditional least-squares regression. The goal is to provide guidance for real-world analysis. Throughout, I have also added custom programming to generate the analyses and results that I find useful when building and analyzing these models.

These templates are written so that you can transfer the underlying programming to your own analyses with minimal changes to the code. For example, when you do the homework, if you continue to name your data frame d, then some code cells from these templates transfer to your own homework template with just copy and paste. For the other cells, changing one or more variable names should be all that is needed to apply them to the homework data.

The same principle should apply to an analysis you encounter on the job. One summer course does not make you a machine learning expert, but you will be able to build basic predictive models with the Python machine learning framework, which is the way most of the business world does machine learning. Moreover, you will have the foundation to build your skills as needed in the future.

Machine Learning for Multiple Regression

pdf [template]

video [39:19]

00:00 Intro
00:33 Data
07:38 X,y Data Structures
14:13 Estimate
17:59 Predict
19:40 Assess Fit
27:45 k-Fold
36:16 Strategy
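The chapters listed above follow the usual sklearn workflow. The following is a compact sketch of those steps, not the template itself: the data frame d is filled with synthetic values invented for illustration, and choices such as 5 folds and the R-squared scoring are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold, cross_validate, train_test_split

# Data: synthetic, invented for illustration only
rng = np.random.default_rng(0)
d = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
d["y"] = 3 + 2 * d["x1"] - d["x2"] + rng.normal(scale=0.5, size=200)

# X, y data structures: features in X, target in y
X = d[["x1", "x2"]]
y = d["y"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# Estimate: fit the model on the training data only
reg = LinearRegression()
reg.fit(X_train, y_train)

# Predict: apply the estimated model to the held-out testing data
y_hat = reg.predict(X_test)

# Assess fit: compare predictions to the actual testing values
mse = mean_squared_error(y_test, y_hat)
r2 = r2_score(y_test, y_hat)
print(f"test MSE = {mse:.3f}, test R-squared = {r2:.3f}")

# k-fold: repeat the train/test cycle on 5 different splits of the full data
kf = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_validate(
    LinearRegression(), X, y, cv=kf, scoring="r2", return_train_score=True
)
print("mean training R-squared:", scores["train_score"].mean().round(3))
print("mean testing R-squared: ", scores["test_score"].mean().round(3))
```

Comparing the mean training and testing scores across folds gives a more stable picture of how the model performs on new data than a single split does.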

Feature Selection for Multiple Regression

pdf [template]

video [22:15]

The video is organized by chapter. Click on the three horizontal bars to locate chapters, and click again to remove the menu.

00:00 Intro
04:23 Manual Selection
15:47 Automatic Selection
19:33 Postscript
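For automatic selection, sklearn offers several tools. The sketch below uses recursive feature elimination (RFE) as one example, and the template may use a different method. The data are synthetic and invented for illustration, with only x1 and x2 actually related to y.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data, invented for illustration: x3 and x4 are pure noise
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame(
    rng.normal(size=(n, 4)), columns=["x1", "x2", "x3", "x4"]
)
y = 3 + 2 * X["x1"] - 1.5 * X["x2"] + rng.normal(scale=0.5, size=n)

# Repeatedly fit the model and drop the feature with the smallest
# coefficient magnitude until only 2 features remain
selector = RFE(LinearRegression(), n_features_to_select=2)
selector.fit(X, y)

selected = list(X.columns[selector.support_])
print("selected features:", selected)
```

Because RFE with a linear model ranks features by coefficient size, it is most meaningful when the features are on comparable scales, as they are in this synthetic example.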

Outlier Detection in Linear Models

[There is already much material for this week, so we do not cover outlier detection in linear models. We do cover univariate outlier detection with a box plot, which is a worthwhile technique and better than nothing. Detecting outliers directly from the regression analysis, simultaneously considering all the variables in the model, is more refined, and is a procedure I always follow when building these models. On your own you can read about this in my online textbook, Section 6, titled Influence and Outliers, but it is not testable material in this course.]
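The univariate box plot approach flags values that lie more than 1.5 interquartile ranges beyond the quartiles, which is the standard box plot convention. Here is a minimal sketch of that rule computed directly with pandas. The data frame d and its single column x hold made-up values for illustration.

```python
import pandas as pd

# Made-up values for illustration; 40 is far from the rest
d = pd.DataFrame({"x": [10, 12, 11, 13, 12, 11, 14, 12, 13, 40]})

# Quartiles and interquartile range
q1 = d["x"].quantile(0.25)
q3 = d["x"].quantile(0.75)
iqr = q3 - q1

# Box plot fences: 1.5 IQRs below the first and above the third quartile
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Rows with values outside the fences
outliers = d[(d["x"] < lower) | (d["x"] > upper)]
print(outliers)
```

This check looks at one variable at a time, so it can miss a row that is unusual only in combination with the other variables in the model.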

Homework

Short-Answer Problems

Analysis Problems