To understand data analysis, you need to be able to answer the following question about a simple coin flipping experiment. The precise answer is not as important as being able to give an approximate answer from intuition. A fair coin is one for which the outcome of a Head is as likely as the outcome of a Tail, and those are the only two possible outcomes.
If you flip a fair coin 1000 times, how many Heads would you obtain?
The basis for answering this question follows from an understanding of sampling error, explored in the following online reading and corresponding video.
Sample vs. Population Sec 3.1
Sample vs. Population video [12:29] Sec 3.1
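As a quick intuition check before the reading, here is a small Python sketch (not part of the posted materials) that simulates the coin flipping experiment many times. The seed and the number of repeated experiments are arbitrary choices for illustration; the point is that the count of Heads varies from sample to sample around 500, which is sampling error.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # fixed seed so the sketch is reproducible

n_flips = 1000          # flips per experiment
n_experiments = 10_000  # repeat the 1000-flip experiment many times

# Each experiment: count the Heads in 1000 flips of a fair coin (p = 0.5)
heads = rng.binomial(n=n_flips, p=0.5, size=n_experiments)

print("mean number of Heads:", heads.mean())   # close to 500
print("standard deviation:", heads.std())      # close to sqrt(1000*0.5*0.5), about 15.8
print("range observed:", heads.min(), "to", heads.max())
```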
And so the machine learning part of this course begins. This material provides a concise, general overview of supervised machine learning. You might want to skim it first, paying more attention to the first three subsections to set up the linear regression material, and then move directly into the example of regression analysis that follows. Refer back to this overview as needed in the coming weeks; we will implement the processes described here in Weeks 4, 5, 6, and 7 with formal machine learning analysis using the Python machine learning framework.
The Prediction Equation [22:15] Sec 1
When the video plays, you will see three horizontal bars at the top left. Click those bars and a table of contents appears that lets you skip to any individual chapter. The resulting menu covers the left side of the video. To close the menu, click the second set of three horizontal bars at the top right of the menu. Note that against a light background these white bars may be difficult to see.
This material reviews the concept of a linear equation and then presents an example with the simplest possible machine learning model, a linear regression model with only a single predictor variable. This material mostly, if not entirely, reviews regression analysis from your previous stat class(es), such as STAT 241, GSCM 471, GSCM 571, or BTA 516.
Linear Models [21:38]
2.1 The form of the model: a linear prediction equation with slope b1 and intercept b0. [1:27]
2.2 An example with two columns of data: the predictor variable x and the variable to be predicted, y. [7:39]
2.3 The estimated model, that is, the estimates of the slope and intercept obtained by running computer software to do the regression analysis (see the code sketch below). [12:28]
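To make the estimation step concrete, here is a minimal Python sketch that computes least-squares estimates of the intercept b0 and slope b1 for a single predictor. The data values are hypothetical, chosen only for illustration; the videos develop their own example.

```python
import numpy as np

# Hypothetical example data: one predictor x and the variable to be predicted, y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

# Least-squares estimates of the slope b1 and intercept b0
b1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)
b0 = y.mean() - b1 * x.mean()

print(f"estimated model: y_hat = {b0:.3f} + {b1:.3f} * x")
```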
Residuals and Estimation [9:17]
2.4 How the machine learns by choosing values of b0 and b1 that minimize the sum of the (squared) residuals over all rows of data. The learning process is all about how far each forecasted value fitted (computed) by the prediction equation, ŷi, is from the actual value, yi, expressed as the residual error term for each row of data, yi - ŷi.
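The following short Python sketch, again with hypothetical data, computes the residuals and the sum of squared errors for two candidate pairs of parameter values; least squares selects, over all possible pairs, the pair with the smallest sum.

```python
import numpy as np

# Hypothetical data: predictor x and variable to be predicted, y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

def sse(b0, b1):
    """Sum of squared residuals for candidate intercept b0 and slope b1."""
    y_hat = b0 + b1 * x      # fitted (forecasted) value for each row
    residuals = y - y_hat    # residual error term, y_i - y_hat_i, for each row
    return np.sum(residuals ** 2)

# Least squares chooses the (b0, b1) pair with the smallest sum of squared errors
print("SSE for b0=0.00, b1=2.00:", round(sse(0.00, 2.00), 4))
print("SSE for b0=0.08, b1=1.98:", round(sse(0.08, 1.98), 4))
```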
Model Fit [13:40]
2.5 Evaluate the fit of a learned model based on the size of the residuals.
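As an illustration of evaluating fit from the size of the residuals, the sketch below uses hypothetical data and an assumed fitted equation to compute two standard regression fit summaries, the standard deviation of the residuals and R-squared.

```python
import numpy as np

# Hypothetical data and an assumed fitted prediction equation, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])
y_hat = 0.08 + 1.98 * x            # fitted values from the assumed estimated model

residuals = y - y_hat
sse = np.sum(residuals ** 2)

# Standard deviation of the residuals: typical size of a prediction error
se_residuals = np.sqrt(sse / (len(y) - 2))        # 2 estimated parameters: b0 and b1
# R-squared: proportion of the variability in y accounted for by the model
r_squared = 1 - sse / np.sum((y - y.mean()) ** 2)

print("standard deviation of residuals:", round(se_residuals, 3))
print("R-squared:", round(r_squared, 3))
```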
How does the machine (i.e., the solution algorithm) learn? The algorithm chooses parameter values of the model, such as the weights for a linear model, the y-intercept and slope, that optimize some statistical criterion. The popular least-squares analysis chooses values for the weights that minimize the sum of the squared residuals, or errors. For this situation an analytic solution is available, obtained by solving a set of equations. For most machine learning solution algorithms, however, the machine optimizes by choosing an initial, perhaps arbitrary, solution and then continually adjusts that solution, modifying the estimated parameter values step-by-step, with each step leading to parameter values that yield a more optimized solution. This more general solution method is called gradient descent.
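Here is a minimal Python sketch of gradient descent for the two weights of a simple linear regression model, minimizing the sum of squared errors. The data, starting values, learning rate, and number of steps are all assumptions chosen only for illustration; a formal analysis would use the machine learning framework introduced in later weeks.

```python
import numpy as np

# Hypothetical data: predictor x and variable to be predicted, y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

# Start from an arbitrary initial solution
b0, b1 = 0.0, 0.0
learning_rate = 0.005   # small step size chosen for this illustration

for step in range(5000):
    y_hat = b0 + b1 * x
    residuals = y - y_hat
    # Gradient of the sum of squared errors with respect to b0 and b1
    grad_b0 = -2 * np.sum(residuals)
    grad_b1 = -2 * np.sum(residuals * x)
    # Adjust the parameter values step-by-step in the downhill direction
    b0 -= learning_rate * grad_b0
    b1 -= learning_rate * grad_b1

print(f"gradient descent estimates: b0 = {b0:.3f}, b1 = {b1:.3f}")
```

After enough steps, these estimates match the analytic least-squares solution for the same data.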
The content linked below shows how to apply an informal, intuitive version of gradient descent, while illustrating the meaning of the sum of squared errors. The application is an example of successively choosing values of the y-intercept and the slope to center the line through a scatterplot.
[This same link appears in the on-line posted material, repeated here for convenience.]