Week-by-Week Course Summary
Week 1: R
- Separate Data from Code
- Unlike a worksheet app such as Excel, R and related analysis systems separate the data from the code that manipulates the data. Excel is extraordinarily useful, but also extraordinarily overused. Countless examples of complex, linked worksheets exist that require full-time staff just to understand and debug, and some are probably never fully debugged. Businesses that rely solely upon Excel for complex operations are behind the times, using 20th-century practices when more modern, more efficient, more powerful, and easier-to-use alternatives exist. R with lessR tops that list of better alternatives. Welcome to the world of real data science!
- Read and Analyze
- Read the data, usually from an Excel worksheet or a csv text file, then analyze as needed. The analysis includes visualizations as well as statistical analysis such as forecasting.
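In lessR this two-step workflow is a pair of function calls. A minimal sketch, assuming the lessR package is installed; the file name and variable names below are hypothetical.

```r
library(lessR)  # assumes the lessR package is installed

d <- Read("sales.csv")   # read the data, here from a hypothetical csv file
Histogram(Sales)         # visualization plus statistical summary of one variable
Plot(Month, Sales)       # visualize Sales across Month
```

lessR functions such as Histogram() and Plot() find their variables in the data frame named d by default, which is why Read() stores into d.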
Week 2: Plot a Time Series
- Randomness
- All data values are influenced by a random component. Understanding the nature of randomness and the associated sampling error is essential to understanding data analysis.
- Visualize
- The first step to forecasting visualizes time-oriented data to reveal its underlying structure in terms of any existing components: stability, trend, and seasonality. From this structure apply a proper analytic forecasting technique. One simple technique intuits the underlying structure from the visualization of the time series and then draws the extension into the future. Obviously not precise, and without error bands, but it can be useful.
- Date Type
- Achieve a visualization of time-oriented data with dates on the horizontal axis from one of two possibilities.
- Represent the dates as a variable of the R Date type.
- If the data file is an Excel file, then the dates will be properly represented if stored as a Date type in Excel.
- If the data are stored as a csv file, then convert the date values from character strings to the R Date type with the as.Date() function.
- Define a time series object with the R function ts(), with the dates specified by the parameters of the function.
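Both possibilities can be sketched in a few lines of base R; the variable names and values here are made up for illustration.

```r
# Made-up quarterly sales values for illustration
sales <- c(120, 135, 150, 170, 128, 142, 158, 180)

# Possibility 1: convert character strings to the R Date type
dates <- as.Date(c("2023-01-01", "2023-04-01", "2023-07-01", "2023-10-01",
                   "2024-01-01", "2024-04-01", "2024-07-01", "2024-10-01"))

# Possibility 2: define a time series object; the dates follow from
# the start= and frequency= parameters of ts()
sales.ts <- ts(sales, start = c(2023, 1), frequency = 4)  # quarterly from Q1 2023

plot(dates, sales, type = "l")  # dates on the horizontal axis
```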
Week 3: Variability
- Variability
- Data analysis is the analysis of variability. Assess the variability of a numerical variable according to the squared deviations about the mean. Take the average and then un-square with the square root, which yields the standard deviation.
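That computation can be verified directly in base R with a small made-up sample; the result matches the built-in sd() function.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)        # small made-up sample

dev2 <- (x - mean(x))^2               # squared deviations about the mean
v <- sum(dev2) / (length(x) - 1)      # average the squared deviations (over n - 1)
s <- sqrt(v)                          # un-square with the square root

s                                     # same value as sd(x)
```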
- Normally Distributed Variability
- Many natural phenomena, including random error distributions, are normally distributed. The key fact here is that almost 95% of all normally distributed values are within two standard deviations of the population mean. This concept allows the construction of error bands around a forecasted value from the mathematically derived standard deviation of the sample estimate over hypothetical repeated samples, the standard error of the statistic.
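The 95% figure follows directly from the normal distribution, evaluated here with base R's pnorm() function.

```r
# Proportion of a normal distribution within two standard deviations of the mean
within2 <- pnorm(2) - pnorm(-2)
round(within2, 3)   # approximately 0.954, just over 95%
```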
Week 4: Correlation and Regression
- Correlation
- The extent of a linear relationship between two variables can be expressed numerically with the correlation coefficient and geometrically with a scatterplot. Correlation varies from -1 for a perfect inverse relationship, to 0 for no relationship, to 1 for a perfect direct relationship.
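Both expressions of the relationship are available in base R; the two variables here are made up for illustration.

```r
x <- c(1, 2, 3, 4, 5)   # made-up data for illustration
y <- c(2, 4, 5, 4, 5)

cor(x, y)    # correlation coefficient, between -1 and 1
plot(x, y)   # scatterplot, the geometric expression of the relationship
```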
- Regression
- With one predictor variable, the regression model is a linear function (y-intercept, slope) of a single variable.
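In base R the one-predictor model is estimated with lm(); the data here are made up for illustration.

```r
x <- c(1, 2, 3, 4, 5)   # made-up data for illustration
y <- c(2, 4, 5, 4, 5)

fit <- lm(y ~ x)   # least-squares regression of y on x
coef(fit)          # y-intercept and slope of the linear function
```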
- Explanatory vs Time Series Models
- An explanatory model generates a forecast from the values of variables other than the variable forecasted. A time series model generates a forecast from values of the same variable from earlier time periods.
Week 6: One-Predictor Least-Squares Regression, Inference, Prediction Intervals
- Model Fit
- A line can be put through any scatterplot. But does the model (line) effectively summarize the relationship between the variables, that is, does the model fit the data? Evaluate fit with se and R2. More importantly, apply those fit indices to new data to best evaluate forecasting fit.
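Both fit indices are available from a fitted model's summary; this sketch uses R's built-in mtcars data set as a stand-in for real data.

```r
fit <- lm(mpg ~ wt, data = mtcars)  # built-in data set as a stand-in
s <- summary(fit)

s$sigma      # se: standard deviation of the residuals
s$r.squared  # R2: proportion of the variance of y accounted for by the model
```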
- Inference
- Does target y change on average as predictor or feature x changes? If yes, then the slope is nonzero. But in the sample the slope is always nonzero, even if there is no relationship. To evaluate the population slope, the value of interest, do statistical inference in the form of hypothesis testing and the confidence interval.
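Base R provides both forms of inference for the slope, again with the built-in mtcars data as a stand-in.

```r
fit <- lm(mpg ~ wt, data = mtcars)

summary(fit)$coefficients   # hypothesis test: t-value and p-value for the slope
confint(fit, level = 0.95)  # 95% confidence interval for the population slope
```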
- Prediction Intervals
- The forecasted value is almost always wrong. A proper forecast is not a point, a single value, but a range of values that likely contains the actual value when it occurs. That range is the prediction interval.
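Base R's predict() returns the prediction interval along with the point forecast; the new predictor value here is arbitrary.

```r
fit <- lm(mpg ~ wt, data = mtcars)

new <- data.frame(wt = 3.0)                  # arbitrary new value of the predictor
predict(fit, new, interval = "prediction")   # fit, lwr, upr: the prediction interval
```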
- Forecasting Error
- The standard deviation of the residuals, se, is too optimistic for evaluating forecasting error. The reason is that this se is a descriptive statistic that only describes fit to the data from which the model was estimated, the training data. Forecasts, however, occur with new data, and so must account for sampling error as well, so the standard error of prediction, spred, is always larger than se from the analysis of the data from which the model was estimated. Also, spred is a different value for each set of values for the predictor variables.
Week 7: Multiple Least-Squares Regression
- Multiple Predictors
- Explanatory regression models typically have multiple predictor variables, usually about three to nine, though any number is possible. New predictor variables that are relevant (correlate with y) and unique (do not correlate with other x's) contribute to better fit and forecasting accuracy.
- Ceteris Paribus
- An extraordinarily useful property of multiple regression is that the impact of a predictor variable on the forecasted response is how much the response changes, on average, for a unit increase in the value of the predictor, with the values of all other variables held constant.
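A multiple regression in base R illustrates the interpretation; mtcars again stands in for real data.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)   # two predictors

# Each slope is the average change in mpg per unit increase in that
# predictor, with the value of the other predictor held constant
coef(fit)
```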
- Collinearity
- When predictor variables correlate too highly, or more generally when they are linearly related, they inflate the standard errors of their respective slope coefficients. Predictive accuracy is not hurt, but the understanding of the relationships of the predictors to the response variable y is diminished. Also, predictive accuracy generally does not increase much by adding collinear predictors.
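One standard collinearity diagnostic, the variance inflation factor (VIF), can be computed directly in base R: regress each predictor on the other predictors and apply 1/(1 - R2). A sketch with mtcars as a stand-in:

```r
# VIF for wt as a predictor of mpg alongside hp and disp:
# regress wt on the other predictors, then VIF = 1 / (1 - R2)
r2 <- summary(lm(wt ~ hp + disp, data = mtcars))$r.squared
vif.wt <- 1 / (1 - r2)
vif.wt   # values much above about 5 signal troublesome collinearity
```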
- Feature Selection
- When constructing a multiple regression model, the predictor variables, the features, must be selected. The goal is to construct a parsimonious model, with almost the maximum available fit with the smallest number of features. One helpful technique is best-subsets regression, in which all, or most, possible combinations of predictor variables are analyzed for fit.
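Best-subsets regression is available in, for example, the regsubsets() function of the leaps package; this sketch assumes that package is installed and uses mtcars as a stand-in for real data.

```r
library(leaps)  # assumes the leaps package is installed

# Fit the best model of each size from all subsets of four candidate features
all <- regsubsets(mpg ~ wt + hp + disp + drat, data = mtcars)
summary(all)$adjr2  # adjusted R2 for the best model of each size
```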
Week 8: Trend and Seasonality
- Trend
- Often linear or approximated by linearity, trend is the general movement of the time series, either increasing or decreasing.
- Seasonality
- Time series data always fluctuates, but seasonal fluctuations are regular patterns of fluctuation that vary according to the relevant time period, such as four quarters over the year.
- Decomposition
- To decompose the time series is to isolate the trend and seasonal components from the random error fluctuations. The components can influence the data values with either an additive or a multiplicative relationship.
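Base R's decompose() performs this decomposition for either relationship; the built-in AirPassengers monthly series serves as a stand-in.

```r
# AirPassengers: built-in monthly series with trend and seasonality
dc <- decompose(AirPassengers, type = "multiplicative")

plot(dc)      # trend, seasonal, and random components
dc$seasonal   # the isolated seasonal component
```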
- Forecast
- Many forecasting techniques exist for a time series with trend and seasonality. One basic technique is to de-seasonalize the data, run the linear regression, project into the future, then add the seasonality back. This procedure can be done manually, but the tslm() function from the forecast package automates it, eliminating the extra work.
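A minimal sketch with tslm(), assuming the forecast package is installed and using the built-in AirPassengers series as a stand-in.

```r
library(forecast)  # assumes the forecast package is installed

fit <- tslm(AirPassengers ~ trend + season)  # regression with trend and seasonality
fc <- forecast(fit, h = 12)                  # project 12 months into the future
plot(fc)                                     # forecast with error bands
```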
Week 9: Exponential Smoothing
- Self Adjustment
- An exponential smoothing forecast self-adjusts according to the error in the current forecast. The amount of self-adjustment follows from the smoothing parameter alpha. A forecasting method based only on this smoothing parameter is simple exponential smoothing, ses, which yields only a flat forecast.
- Exponential Decay of Past Times
- The basic definition of the exponential smoothing model implies that the impact of past time periods is less than the impact of more recent time periods. The decrease in the weights is exponential. The larger the smoothing parameter, the faster the decay.
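The weights follow directly from the definition: the observation k periods back receives weight alpha(1 - alpha)^k, computed here in base R with an illustrative alpha.

```r
alpha <- 0.3                 # smoothing parameter (illustrative value)
k <- 0:5                     # number of periods back
w <- alpha * (1 - alpha)^k   # weight on the observation k periods back

round(w, 3)   # weights decay exponentially; larger alpha, faster decay
```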
- Holt's Adaptation
- The ses forecasting method does not account for trend or seasonality. Holt's adaptation adds a second smoothing parameter, beta, that accounts for linear trend.
- Holt-Winters Adaptation
- To use exponential smoothing to forecast a time series with trend and seasonality, apply the Holt-Winters adaptation, which adds a third smoothing parameter.
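Base R's HoltWinters() implements the full three-parameter method, estimating the smoothing parameters from the data; AirPassengers again stands in for real data.

```r
# Three smoothing parameters: alpha (level), beta (trend), gamma (seasonality)
fit <- HoltWinters(AirPassengers, seasonal = "multiplicative")

fc <- predict(fit, n.ahead = 12)  # forecast the next 12 months
plot(fit, fc)                     # fitted values plus the forecast
```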