Every time we make a forecast, we need to assess the quality of the forecast. Every time. A forecast may be stated as a single value, but around that value lies a range of likely values. The range is more important than the single value, which by itself is almost always wrong.
One such assessment is based on the concept of forecasting error, the difference between what occurred and what we forecasted would occur. This week we begin to learn how to assess the amount of this forecasting error, though keep in mind that actual forecasting error is the error we would obtain from forecasting new data; we will study this distinction more later. This assessment follows from concepts learned in your introductory statistics course, reviewed here in the context of forecasting error.
The goal is to develop intuition before pursuing a more analytic approach.
A large part of data analysis is the analysis of variability about the mean. The concepts of the mean, and of the standard deviation as an indicator of variability about the mean, are crucial to understanding and doing data analysis in general and forecasting in particular. Section 2.1 presents these concepts. As part of last week's material, we saw that random sampling variability appears in every measurement we make. So, even if we correctly model a process's structure, our forecasts will never be perfect.
Assessing forecasting error is an analysis of variability. How can we evaluate the amount of error in our forecasts? The key statistic is the standard deviation, or a closely related statistic, because the standard deviation is the primary summary of the variability of numerical data. Reviewing material from your previous stat class, Sec 2.1c of the posted slides and videos presents the definition and explanation of the standard deviation. The standard deviation is based on the deviation scores about the mean: average the squared deviation scores, then take the square root of that average to return to the original units (Sec 2.1, #16). Slide #24 in Section 2.1c of the video builds the formula for the standard deviation piece by piece.
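As a quick illustration of that piece-by-piece construction, here is a minimal base R sketch that computes the standard deviation from deviation scores, using a small made-up data set (the values are hypothetical, chosen only for illustration).

```r
# small hypothetical data set, for illustration only
y <- c(62, 68, 71, 74, 75, 79, 83)

m <- mean(y)          # the mean
dev <- y - m          # deviation scores about the mean
sq_dev <- dev^2       # squared deviation scores

# average the squared deviations (the sample version divides by n - 1),
# then take the square root to return to the original units
s <- sqrt(sum(sq_dev) / (length(y) - 1))

s
sd(y)                 # base R's sd() gives the same result
```

As noted in the video update below, ss_brief() or Histogram() from the course materials reports these summary statistics directly.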
A key index of forecasting error is the deviation of each data value from its value as forecasted by the underlying model, which can be called the target value. That makes sense: How far is each data value from our forecasted value, the target? In general, then, we look at deviations from the target. We can, however, apply the standard deviation directly when the target is the mean. When is that? As we saw last week, a stable process consists of random deviations about the mean, so the standard deviation becomes the basis for assessing forecasting error in that situation.
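To make the idea concrete, here is a small sketch with hypothetical data values and forecasted target values: the forecasting errors are the deviations from the target, and for a stable process whose target is the mean, those errors are simply the deviation scores that underlie the standard deviation.

```r
# hypothetical data values and their forecasted (target) values
y        <- c(62, 68, 71, 74, 75, 79, 83)
forecast <- c(65, 66, 70, 73, 77, 78, 82)

error <- y - forecast     # forecasting error: actual value minus target value
error

# for a stable process, the forecast of every value is the mean,
# so the errors are the deviation scores about the mean
error_stable <- y - mean(y)
error_stable
sd(y)                     # the standard deviation summarizes these errors
```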
Video Update: ss.brief(), referenced in Video 2.1a, is now ss_brief(), as identified in the pdf. Note that Histogram() gives the same summary stats, plus the histogram, so you can use either function.
| Section | Topic | Slides | Videos |
|---------|-------|--------|--------|
| 2.1 | Mean, Standard Deviation | slides [27] | 2.1a [5:26], 2.1b [8:26], 2.1c [14:35] |
Beyond getting an index of forecasting error, we need to understand the extent of the error in terms of probabilities. We have learned that data values fluctuate. The good news is that if the fluctuation is normally distributed, we can calculate probabilities for how much fluctuation about a forecasted value to expect. For each forecast, we also want to understand the range of likely values within which the actual data value will fall. That is, when we generate a forecast, we want to know how close the actual value will likely be to what we forecasted. Assuming the errors are normally distributed, we can calculate the 95% range of variation of the errors, which shows us the expected variability about any forecasted value.
95% range of variation about a forecasted value: 1.96 standard deviations above the forecasted value, and 1.96 standard deviations below.
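As a sketch of that computation with hypothetical numbers, suppose the forecasted value is 70 and the standard deviation of the forecasting errors is 5.

```r
forecast <- 70   # hypothetical forecasted value
s <- 5           # hypothetical standard deviation of the forecasting errors

lower <- forecast - 1.96 * s
upper <- forecast + 1.96 * s
c(lower, upper)  # 95% range of variation: 60.2 to 79.8
```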
That assessment of the range of likely values is a probability statement. These probabilities are typically based on the profound relationship between the standard deviation and the normal curve. Section 3.2c shows how to compute normal curve probabilities with my prob_norm() function, so there is no more need for the normal curve probability tables at the back of a textbook. Again, this is a review of material from your previous stat class, and it is the basis for evaluating the adequacy of a forecasting model.
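The video demonstrates prob_norm(); as a sketch of the same idea, base R's pnorm() gives the normal curve probability directly, here continuing the hypothetical forecast and standard deviation from above.

```r
forecast <- 70   # hypothetical forecasted value
s <- 5           # hypothetical standard deviation of the errors

# probability the actual value falls within 1.96 standard deviations
# of the forecasted value, assuming normally distributed errors
pnorm(forecast + 1.96 * s, mean = forecast, sd = s) -
  pnorm(forecast - 1.96 * s, mean = forecast, sd = s)
# approximately 0.95
```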
| Section | Topic | Slides | Videos |
|---------|-------|--------|--------|
| 3.2 | Normal Curve, Standard Scores, Probabilities | slides [41] | 3.2a [11:32], 3.2b [11:59], 3.2c [13:31] |
One of the fundamental concepts in statistics and data analysis is that 95% of normally distributed data values fall within 1.96 standard deviations of the mean. Much of applied statistical data analysis depends on knowing that basic fact, and we will use it several times throughout this course.
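That fact is easy to verify with base R: 1.96 is the standard score that cuts off the upper 2.5% of the normal curve, and the middle area between -1.96 and +1.96 is about 0.95.

```r
qnorm(0.975)                # about 1.96, the cutoff for the upper 2.5%
pnorm(1.96) - pnorm(-1.96)  # about 0.95, the area within 1.96 standard deviations of the mean
```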