Set-Up

This R Markdown document demonstrates running Python in R, here within RStudio. Reading data and bar charts are demonstrated, but the focus is multiple regression.

knitr::opts_chunk$set(echo=TRUE, fig.width=4.5, fig.height=3.25)
library(lessR)
## 
## lessR 3.9.0     feedback: gerbing@pdx.edu     web: lessRstats.com/new
## ---------------------------------------------------------------------
## 1. d <- Read("")           Read text, Excel, SPSS, SAS or R data file
##                            d: default data frame, no need for data=
## 2. l <- Read("", var_labels=TRUE)   Read variable labels into l,
##                            required name for data frame of labels
## 3. Help()                  Get help, and, e.g., Help(Read)
## 4. hs(), bc(), or ca()     All histograms, all bar charts, or both
## 5. Plot(X) or Plot(X,Y)    For continuous and categorical variables
## 6. by1= , by2=             Trellis graphics, a plot for each by1, by2
## 7. reg(Y ~ X, Rmd="eg")    Regression with full interpretative output
## 8. style("gray")           Grayscale theme, + many others available
##    style(show=TRUE)        all color/style options and current values
## 9. getColors()             create many styles of color palettes
## 
## lessR parameter names now use _'s. Names with a period are deprecated.
## Ex:  bin_width  instead of  bin.width

Reticulate Package

Get access to Python from R with functions from the reticulate package.

library(reticulate)

Mac users need to tell reticulate, with use_python(), to use the Anaconda distribution of Python. Otherwise, reticulate uses the pre-installed, older Python 2 distribution, which also lacks the packages needed for data analysis.

use_python("~/anaconda3/bin")

Windows users need to identify the path of a plugin that enables graphics, which requires the os package. Replace xxxx with your Windows user name. Mac users can ignore this step.

import os
os.environ['QT_QPA_PLATFORM_PLUGIN_PATH'] = 'C:/Users/xxxx/Anaconda3/Library/plugins/platforms'

For reference, py_config() lists the locations of Python on your system. This step is not needed with a standard, default installation of Anaconda, but the output is particularly useful if R does not properly reference Python.

py_config()
## python:         /Users/davidgerbing/anaconda3/bin/python
## libpython:      /Users/davidgerbing/anaconda3/lib/libpython3.7m.dylib
## pythonhome:     /Users/davidgerbing/anaconda3:/Users/davidgerbing/anaconda3
## version:        3.7.5 (default, Oct 25 2019, 10:52:18)  [Clang 4.0.1 (tags/RELEASE_401/final)]
## numpy:          /Users/davidgerbing/anaconda3/lib/python3.7/site-packages/numpy
## numpy_version:  1.17.3
## 
## python versions found: 
##  /Users/davidgerbing/anaconda3/bin/python
##  /usr/bin/python

Python Packages

Python consists of the core language plus many packages that extend the core language with additional functions. Every data analysis invokes functions from at least two packages, pandas and numpy, and every data visualization invokes functions from matplotlib, with increasing reference to the more elegant seaborn, which is built on matplotlib.

  • pandas, a general package for storing and manipulating data in data frames, standard row x column data tables, with provided operations for reading, organizing, and cleaning
  • numpy, a general package for numerical computations, upon which many other packages rely, such as the following
  • matplotlib, basic plotting library for data visualizations
  • seaborn, more elegant, higher-level plotting library based on matplotlib

The abbreviations pd, np, plt, and sns stand for the corresponding full package names. It is easier to reference a package in subsequent code with its abbreviation than with its full name.

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

To access a pandas function, for example, preface the function name with pd followed by a dot.

One complication of working with pandas is that it is built on the numpy package, yet the two packages define different data structures, with different functions that apply to each.
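
A small illustration of the distinction, a sketch assuming the imports above: a pandas Series wraps a numpy array, exposed with the values attribute, and each structure has its own type and its own applicable functions.

s = pd.Series([4, 2, 7])   # a pandas data structure
a = s.values               # the underlying numpy array
print(type(s))             # pandas.core.series.Series
print(type(a))             # numpy.ndarray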

Data

Read the data with the pandas read_excel() function into a pandas data frame df. The pd. in front of the function name identifies the function as from the pandas package.

df = pd.read_excel('http://lessRstats.com/data/employee.xlsx')
#df = pd.read_excel('data/employee.xlsx')

The shape attribute provides the dimensions, rows and columns, of the data frame. It is not a function per se because it does not compute anything; it just provides already stored attributes of the data frame. As such, no () follows its name.

The head() function is the same as in R: it lists the variable names and the first rows of data. Note that Python calls the function as part of a chain, the data frame name, then a period, then the function, a sequence that can contain many function calls.

The set_index() function is optional. It takes the employee names and re-defines them as row names instead of as a variable.

Missing data displays as NaN. The dropna() function removes all rows with any missing data, needed here because, when accessing the pandas data frame from R, missing data in the categorical variables disrupts the conversion.

The type() function provides information regarding the type of data structure. That information is also provided by info(), but type() focuses on just the type. As indicated, some data structures are from pandas, some from numpy, with different types of data structures within each general category. So make sure to understand the data structure you are working with.

df.shape
## (37, 9)
df.head()
##                Name  Years Gender  Dept     Salary JobSat  Plan  Pre  Post
## 0  Ritchie, Darnell    7.0      M  ADMN   53788.26    med     1   82    92
## 1         Wu, James    NaN      M  SALE   94494.58    low     1   62    74
## 2       Hoang, Binh   15.0      M  SALE  111074.86    low     3   96    97
## 3     Jones, Alissa    5.0      F   NaN   53772.58    NaN     1   65    62
## 4    Downs, Deborah    7.0      F  FINC   57139.90   high     2   90    86
df = df.set_index('Name')
df.head()
##                   Years Gender  Dept     Salary JobSat  Plan  Pre  Post
## Name                                                                   
## Ritchie, Darnell    7.0      M  ADMN   53788.26    med     1   82    92
## Wu, James           NaN      M  SALE   94494.58    low     1   62    74
## Hoang, Binh        15.0      M  SALE  111074.86    low     3   96    97
## Jones, Alissa       5.0      F   NaN   53772.58    NaN     1   65    62
## Downs, Deborah      7.0      F  FINC   57139.90   high     2   90    86
df.info()
## <class 'pandas.core.frame.DataFrame'>
## Index: 37 entries, Ritchie, Darnell to Cassinelli, Anastis
## Data columns (total 8 columns):
## Years     36 non-null float64
## Gender    37 non-null object
## Dept      36 non-null object
## Salary    37 non-null float64
## JobSat    35 non-null object
## Plan      37 non-null int64
## Pre       37 non-null int64
## Post      37 non-null int64
## dtypes: float64(2), int64(3), object(3)
## memory usage: 2.6+ KB
df = df.dropna()
df.info()
## <class 'pandas.core.frame.DataFrame'>
## Index: 34 entries, Ritchie, Darnell to Cassinelli, Anastis
## Data columns (total 8 columns):
## Years     34 non-null float64
## Gender    34 non-null object
## Dept      34 non-null object
## Salary    34 non-null float64
## JobSat    34 non-null object
## Plan      34 non-null int64
## Pre       34 non-null int64
## Post      34 non-null int64
## dtypes: float64(2), int64(3), object(3)
## memory usage: 2.4+ KB
print("df: ", type(df))
## df:  <class 'pandas.core.frame.DataFrame'>

Bar Chart

Not only can Python code run from within the R environment, R and Python can read each other’s data structures. To illustrate, here lessR BarChart() accesses the data in the pandas data frame. From R code, indicate a Python data structure with the prefix py$.

R (lessR)

The function db() is the brief version of the lessR details() function, which displays information about the data frame. Within the R code block, convert the Python data frame, named df, to an R data frame, named d, with the R data.frame() function. Then create the bar chart.

head(py$df)
##                  Years Gender Dept    Salary JobSat Plan Pre Post
## Ritchie, Darnell     7      M ADMN  53788.26    med    1  82   92
## Hoang, Binh         15      M SALE 111074.86    low    3  96   97
## Downs, Deborah       7      F FINC  57139.90   high    2  90   86
## Afshari, Anbar       6      F ADMN  69441.93   high    2 100  100
## Knox, Michael       18      M MKTG  99062.66    med    3  81   84
## Campagna, Justin     8      M SALE  72321.36    low    1  76   84
d <- data.frame(py$df)
db(d)
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
## 
##     Variable                  Missing  Unique 
##         Name     Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1     Years    double     34       0      16   7  15  7 ... 1  2  10
##  2    Gender character     34       0       2   M  M  F ... F  F  M
##  3      Dept character     34       0       5   ADMN  SALE  FINC ... MKTG  SALE  FINC
##  4    Salary    double     34       0      34   53788.26  111074.86 ... 56508.32  57562.36
##  5    JobSat character     34       0       3   med  low  high ... high  low  high
##  6      Plan    double     34       0       3   1  3  2 ... 2  2  1
##  7       Pre    double     34       0      25   82  96  90 ... 83  59  80
##  8      Post    double     34       0      20   92  97  86 ... 90  71  87
## ------------------------------------------------------------------------------------------
BarChart(Dept)

## >>> Suggestions
## BarChart(Dept, horiz=TRUE)  # horizontal bar chart
## BarChart(Dept, fill="greens")  # sequential green bars
## PieChart(Dept)  # doughnut (ring) chart
## Plot(Dept)  # bubble plot
## Plot(Dept, stat="count")  # lollipop plot 
## 
## 
## --- Dept ---
## 
## 
## Missing Values of Dept: 0 
## 
## 
##                 ACCT   ADMN   FINC   MKTG   SALE    Total 
## Frequencies:       4      6      4      6     14       34 
## Proportions:   0.118  0.176  0.118  0.176  0.412    1.000 
## 
## 
## Chi-squared test of null hypothesis of equal probabilities 
##   Chisq = 10.118, df = 4, p-value = 0.038

Python (seaborn)

Here is a data visualization, a bar chart with the seaborn function countplot(), which produces a bar chart similar to the output of lessR BarChart().

sns.countplot(x='Dept', saturation=.5, data=df)

Understand the distribution of the target variable, Salary, with its histogram and density estimate, obtained with the seaborn function distplot().

sns.distplot(df.Salary, color='steelblue')

Here the data frame, df, is referenced directly, as df.Salary, instead of with the data parameter as in the previous bar chart. Either style of reference works in either function.
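
For example, a sketch of the previous bar chart with the variable referenced directly rather than with the data parameter:

sns.countplot(x=df['Dept'], saturation=.5)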

lessR Regression

We have already seen lessR Regression(), abbreviated reg(), provided here for comparison.

reg(Salary ~ Years + Pre + Post)

## >>> Suggestion
## # Create an R markdown file for interpretative output with  Rmd = "file_name"
## reg(Salary ~ Years + Pre + Post, Rmd="eg")  
## 
## 
##   BACKGROUND
## 
## Data Frame:  d 
##  
## Response Variable: Salary 
## Predictor Variable 1: Years 
## Predictor Variable 2: Pre 
## Predictor Variable 3: Post 
##  
## Number of cases (rows) of data:  34 
## Number of cases retained for analysis:  34 
## 
## 
##   BASIC ANALYSIS
## 
##              Estimate    Std Err  t-value  p-value   Lower 95%   Upper 95% 
## (Intercept) 44009.758  15030.681    2.928    0.006   13313.013   74706.502 
##       Years  3350.607    362.606    9.240    0.000    2610.068    4091.147 
##         Pre   191.190    451.283    0.424    0.675    -730.454    1112.834 
##        Post  -221.560    468.878   -0.473    0.640   -1179.137     736.018 
## 
## 
## Standard deviation of residuals:  11444.349 for 30 degrees of freedom 
##  
## R-squared:  0.759    Adjusted R-squared:  0.734    PRESS R-squared:  0.676 
## 
## Null hypothesis that all population slope coefficients are 0:
##   F-statistic: 31.425     df: 3 and 30     p-value:  0.000 
## 
## 
##             df           Sum Sq          Mean Sq   F-value   p-value 
##     Years    1  12318195736.747  12318195736.747    94.051     0.000 
##       Pre    1       221689.330       221689.330     0.002     0.967 
##      Post    1     29244411.687     29244411.687     0.223     0.640 
## Model        3  12347661837.764   4115887279.255    31.425     0.000 
##  
## Residuals   30   3929193540.415    130973118.014 
##  
## Salary      33  16276855378.179    493238041.763 
## 
## 
##   K-FOLD CROSS-VALIDATION
## 
##   RELATIONS AMONG THE VARIABLES
## 
##          Salary Years   Pre  Post 
##   Salary   1.00  0.87 -0.00 -0.10 
##    Years   0.87  1.00  0.00 -0.10 
##      Pre  -0.00  0.00  1.00  0.92 
##     Post  -0.10 -0.10  0.92  1.00 
## 
## 
##         Tolerance       VIF 
##   Years     0.932     1.073 
##     Pre     0.137     7.285 
##    Post     0.136     7.353 
## 
## 
##  Years Pre Post    R2adj    X's 
##      1   0    0    0.749      1 
##      1   0    1    0.741      2 
##      1   1    0    0.741      2 
##      1   1    1    0.734      3 
##      0   1    1    0.012      2 
##      0   0    1   -0.020      1 
##      0   1    0   -0.031      1 
##  
## [based on Thomas Lumley's leaps function from the leaps package] 
##  
## 
## 
##   RESIDUALS AND INFLUENCE
## 
## Data, Fitted, Residual, Studentized Residual, Dffits, Cook's Distance 
##    [sorted by Cook's Distance] 
##    [res_rows = 20, out of 34 rows of data, or do res_rows="all"] 
## ------------------------------------------------------------------------------------------------- 
##                        Years     Pre    Post     Salary     fitted      resid rstdnt dffits cooks 
##       Correll, Trevon 21.000  97.000  94.000 134419.230 112091.334  22327.896  2.358  1.220 0.323 
##         Capelle, Adam 24.000  83.000  81.000 108138.430 122346.772 -14208.342 -1.432 -0.766 0.142 
##         James, Leslie 18.000  70.000  70.000 122563.380 102194.814  20368.566  1.984  0.720 0.118 
##           Hoang, Binh 15.000  96.000  97.000 111074.860  91131.821  19943.039  1.939  0.708 0.115 
##          Singh, Niral  2.000  59.000  59.000  61055.440  48919.162  12136.278  1.217  0.663 0.108 
##   Anastasiou, Crystal  2.000  59.000  71.000  56508.320  46260.446  10247.874  1.034  0.593 0.088 
##        Billing, Susan  4.000  91.000  90.000  72675.260  54870.107  17805.153  1.690  0.562 0.074 
##        Skrotzki, Sara 18.000  63.000  61.000  91352.330 102850.521 -11498.191 -1.114 -0.521 0.067 
##         Kralik, Laura 10.000  74.000  71.000  92681.190  75933.155  16748.035  1.566  0.475 0.054 
##   Cassinelli, Anastis 10.000  80.000  87.000  57562.360  73535.340 -15972.980 -1.481 -0.427 0.044 
##      Saechao, Suzanne  8.000  98.000 100.000  55545.250  67395.270 -11850.020 -1.100 -0.384 0.037 
##      Ritchie, Darnell  7.000  82.000  92.000  53788.260  62758.100  -8969.840 -0.842 -0.344 0.030 
##        Afshari, Anbar  6.000 100.000 100.000  69441.930  61076.435   8365.495  0.778  0.301 0.023 
##     Langston, Matthew  5.000  94.000  93.000  49188.960  58129.605  -8940.645 -0.820 -0.276 0.019 
##          Hamide, Bita  1.000  83.000  90.000  51036.850  43288.764   7748.086  0.719  0.275 0.019 
##          LaRoe, Maria 10.000  80.000  86.000  61961.290  73756.900 -11795.610 -1.066 -0.271 0.018 
##        Sheppard, Cory 14.000  66.000  73.000  95027.550  87362.945   7664.605  0.709  0.266 0.018 
##        Downs, Deborah  7.000  90.000  86.000  57139.900  65616.977  -8477.077 -0.778 -0.266 0.018 
##      Osterman, Pascal  5.000  69.000  70.000  49704.790  58445.728  -8740.938 -0.795 -0.249 0.016 
##       Cooper, Lindsay  4.000  78.000  91.000  56772.950  52163.077   4609.873  0.451  0.243 0.015 
## 
## 
##   FORECASTING ERROR
## 
## Data, Predicted, Standard Error of Forecast, 95% Prediction Intervals 
##    [sorted by lower bound of prediction interval] 
##    [to see all intervals do pred_rows="all"] 
## --------------------------------------------------------------------------------------------------------- 
##                        Years    Pre   Post     Salary       pred        sf    pi:lwr     pi:upr     width 
##          Hamide, Bita  1.000 83.000 90.000  51036.850  43288.764 12151.865 18471.345  68106.184 49634.839 
##   Anastasiou, Crystal  2.000 59.000 71.000  56508.320  46260.446 12783.281 20153.503  72367.389 52213.886 
##          Singh, Niral  2.000 59.000 59.000  61055.440  48919.162 12686.005 23010.883  74827.440 51816.557 
## ... 
##   Bellingar, Samantha 10.000 67.000 72.000  66337.830  74373.265 11842.748 50187.148  98559.383 48372.235 
##          Link, Thomas 10.000 83.000 83.000  66312.890  74995.149 11641.777 51219.468  98770.830 47551.362 
##         Kralik, Laura 10.000 74.000 71.000  92681.190  75933.155 11917.446 51594.484 100271.826 48677.341 
## ... 
##         James, Leslie 18.000 70.000 70.000 122563.380 102194.814 12092.033 77499.589 126890.039 49390.450 
##       Correll, Trevon 21.000 97.000 94.000 134419.230 112091.334 12595.276 86368.349 137814.319 51445.971 
##         Capelle, Adam 24.000 83.000 81.000 108138.430 122346.772 12652.859 96506.188 148187.357 51681.169 
## 
## 
## ---------------------------------- 
## Plot 1: Distribution of Residuals 
## Plot 2: Residuals vs Fitted Values 
## Plot 3: ScatterPlot Matrix 
## ----------------------------------

base R Regression

Base R regression involves calling many functions, which are packaged together in the lessR Regression() function.

Analysis of All Data

The lm() function, for linear model, is the core R function for least-squares model estimation. The companion summary() and confint() functions provide the usual estimates, with hypothesis tests and confidence intervals.

lm.out <- lm(Salary ~ Years + Pre, data=d)
summary(lm.out)
## 
## Call:
## lm(formula = Salary ~ Years + Pre, data = d)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -17149  -8310  -2268   6837  22477 
## 
## Coefficients:
##              Estimate Std. Error t value        Pr(>|t|)
## (Intercept) 41309.237  13726.537   3.009         0.00516
## Years        3395.231    345.679   9.822 0.0000000000491
## Pre            -6.879    165.094  -0.042         0.96703
## 
## Residual standard error: 11300 on 31 degrees of freedom
## Multiple R-squared:  0.7568, Adjusted R-squared:  0.7411 
## F-statistic: 48.24 on 2 and 31 DF,  p-value: 0.0000000003036
confint(lm.out)
##                 2.5 %     97.5 %
## (Intercept) 13313.781 69304.6930
## Years        2690.214  4100.2478
## Pre          -343.591   329.8331

Further functions provide influence measures. The main such function is influence.measures(). Its output is an R list, with the influence measures themselves going into the list component infmat, so reference with $infmat.

The Sort() function is from lessR. Base R provides no single function for sorting a data frame. (The Sort() function in the current version of lessR has a bug, so its call is commented out below. The version uploaded to CRAN in early January fixes that.)

yhat <- fitted(lm.out)
rstdnt <- rstudent(lm.out)
cooks <- cooks.distance(lm.out)
d_inf <- influence.measures(lm.out)$infmat
d_inf <- data.frame(d_inf)
head(d_inf)
##                       dfb.1_    dfb.Yers      dfb.Pre       dffit
## Ritchie, Darnell -0.01371453  0.08252407 -0.031625959 -0.19057160
## Hoang, Binh      -0.48792131  0.32375961  0.476518727  0.67322887
## Downs, Deborah    0.06691406  0.05740949 -0.101545896 -0.16500230
## Afshari, Anbar   -0.19720370 -0.09852781  0.252103594  0.30745399
## Knox, Michael     0.01437980 -0.06813447 -0.004377089 -0.08246780
## Campagna, Justin  0.03585067 -0.02091895 -0.022022042  0.07436278
##                      cov.r      cook.d        hat
## Ritchie, Darnell 1.0456068 0.012131836 0.03744441
## Hoang, Binh      0.8807424 0.139251833 0.11091927
## Downs, Deborah   1.1222740 0.009243034 0.05861211
## Afshari, Anbar   1.1905036 0.031884982 0.12961625
## Knox, Michael    1.2093212 0.002337414 0.09359480
## Campagna, Justin 1.1266792 0.001895178 0.03532794
#d_inf <- Sort(by=cook.d, data=d_inf, direction="-", quiet=TRUE)
d_inf <- round(d_inf, 3)
d_inf[1:10,]
##                  dfb.1_ dfb.Yers dfb.Pre  dffit cov.r cook.d   hat
## Ritchie, Darnell -0.014    0.083  -0.032 -0.191 1.046  0.012 0.037
## Hoang, Binh      -0.488    0.324   0.477  0.673 0.881  0.139 0.111
## Downs, Deborah    0.067    0.057  -0.102 -0.165 1.122  0.009 0.059
## Afshari, Anbar   -0.197   -0.099   0.252  0.307 1.191  0.032 0.130
## Knox, Michael     0.014   -0.068  -0.004 -0.082 1.209  0.002 0.094
## Campagna, Justin  0.036   -0.021  -0.022  0.074 1.127  0.002 0.035
## Kimball, Claire   0.090    0.033  -0.117 -0.160 1.146  0.009 0.069
## Cooper, Lindsay   0.021   -0.039  -0.006  0.055 1.169  0.001 0.061
## Saechao, Suzanne  0.264    0.066  -0.323 -0.390 1.080  0.050 0.103
## Pham, Scott       0.039   -0.027  -0.041 -0.068 1.167  0.002 0.061

Training and Testing Data

The k-fold cross-validation process evaluates a model estimated from training data on testing data. And, as we have seen, \(R^2_{PRESS}\) is the ultimate limit of k-fold, with the testing data taken one row at a time. Still, it is good to know how to do a one-time split of the full data set into training and testing data.

The key to dividing a sample into training and testing data is the base R sample() function. The first parameter is the set of values from which to sample, here the integers from 1 to the number of rows of data in the data table. The second parameter, size, specifies the sample size, here set to 75% of the original rows of data. The parameter replace is FALSE by default, so all sampled values are unique if the population from which they are sampled contains only unique values.

To illustrate the sample() function, randomly sample 3 integers out of the first 10 integers. You get a different result, in general, every time you run this function.

sample(1:10, size=3)
## [1]  5  1 10

Now apply sample() to the row numbers (indices) of the data frame, d, defining the rows of data to be in the training data frame.

propTrain <- 0.75
set.seed(23)  # set seed to reproduce the sample in additional runs
rowTrain <- sample(1:nrow(d), size=propTrain*nrow(d))  # train data rows
head(rowTrain)
## [1] 29 28 24 19  8 25

With the row indices of the training data established, create a separate data frame of training data. Then, as indicated by the - sign, take all rows that are not part of the training data into the testing data frame.

d_train <- d[rowTrain, ]  # train data
d_test <- d[-rowTrain, ]   # what is not train data, the test data
cat("Number of rows of test  data:", nrow(d_train))
## Number of rows of test  data: 25
cat("Number of rows of train data:", nrow(d_test))
## Number of rows of train data: 9

Estimate Training Data Model

Now estimate the model just from the training data. Save the results into R object lm_train.

lm_train <- lm(Salary ~ Years + Pre, data=d_train)  # build the model
summary(lm_train)
## 
## Call:
## lm(formula = Salary ~ Years + Pre, data = d_train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -14832  -8720  -2238   6660  18595 
## 
## Coefficients:
##             Estimate Std. Error t value     Pr(>|t|)
## (Intercept)  29338.4    15859.8   1.850       0.0778
## Years         3898.4      471.9   8.261 0.0000000345
## Pre            100.5      184.8   0.544       0.5919
## 
## Residual standard error: 11510 on 22 degrees of freedom
## Multiple R-squared:  0.7562, Adjusted R-squared:  0.7341 
## F-statistic: 34.12 on 2 and 22 DF,  p-value: 0.0000001807

Apply Training Model to Test Data

Get the forecasted values of \(y\), \(\hat y\), from the testing data by applying the model estimated from the training data. Use the R function predict().

yhat_test <- predict(lm_train, d_test)
head(yhat_test)
##    Ritchie, Darnell      Downs, Deborah       Knox, Michael 
##            64870.78            65675.02           107652.82 
## Gvakharia, Kimberly     Anderson, David       Capelle, Adam 
##            49377.64            73873.97           131244.37

Evaluate Training Model Applied to Test Data

Get the \(y\) and \(\hat y\) values for the testing data together, then calculate the residuals, the squared residuals, and then their sum, the sum of squared forecast errors, stored here as MSE_test. (Divide by the number of testing rows for the mean squared error.) Display with the R function cat(), which lists information at the console.

resid_test <- cbind(d_test$Salary, yhat_test)
resid_test <- data.frame(resid_test)
names(resid_test)[1] <- "y_test"
resid_test$e_test <- resid_test$y_test - resid_test$yhat_test
resid_test$e_sq <- resid_test$e_test^2
head(resid_test)
##                        y_test yhat_test      e_test        e_sq
## Ritchie, Darnell     53788.26  64870.78 -11082.5154 122822147.5
## Downs, Deborah       57139.90  65675.02  -8535.1199  72848272.5
## Knox, Michael        99062.66 107652.82  -8590.1558  73790776.0
## Gvakharia, Kimberly  49868.68  49377.64    491.0362    241116.5
## Anderson, David      69547.60  73873.97  -4326.3733  18717505.9
## Capelle, Adam       108138.43 131244.37 -23105.9401 533884469.5
MSE_test <- sum(resid_test$e_sq)
cat("MSE_test:", MSE_test, "\n")
## MSE_test: 1445941398

We are done with R now, so to keep things simple, read the data again, this time into a Python data frame also named d.

d = pd.read_excel('http://lessRstats.com/data/employee.xlsx')

Python data preparation

Unlike the R lm() regression function, Python regression analyses fail with missing data, so remove missing data with the dropna() function.

d = d.dropna()

Python regression functions do not automatically convert categorical variables to dummy variables, as does the R lm() function. So if the analysis contains one or more categorical variables, create the corresponding dummy variables with the pandas get_dummies() function. Here illustrate this function for the categorical variable Gender.

d = pd.get_dummies(d, columns=["Gender"])

Note, however, that this particular analysis does not involve Gender. This code is to illustrate the concept for analyses that do involve categorical predictor variables.
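
To see what get_dummies() produces in isolation, a minimal sketch on a small, hypothetical Series: each level of the categorical variable becomes its own indicator column.

g = pd.Series(['M', 'F', 'M'])
print(pd.get_dummies(g))   # columns F and M, one indicator column per level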

Python statsmodels Regression

Here use the OLS() function from the Python statsmodels package. OLS abbreviates Ordinary Least Squares, the usual version of the least-squares estimation algorithm. This Python regression analysis is basically a traditional analysis (compared to the machine learning approach).

Form X and y Data Structures

Store the features, the predictor variables, in data structure X. Store the target variable in data structure \(y\).

To define a vector, R uses the c() function, such as
c('Years', 'Pre', 'Post')
Python uses square brackets, such as
['Years', 'Pre', 'Post']
Both R and Python use square brackets to indicate subsetting of a data frame. So with Python, to subset a data frame by a vector of variable names, use double square brackets: the outer brackets subset the data frame, and the inner brackets define the Python list of names.

y = d['Salary']
X = d[['Years', 'Pre', 'Post']]
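
The brackets also determine the type of the result: single brackets extract one variable as a pandas Series, while double brackets always return a data frame.

print(type(d['Salary']))                   # pandas Series
print(type(d[['Years', 'Pre', 'Post']]))   # pandas DataFrame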

Estimate the Model

By default, the OLS() function assumes a \(y\)-intercept of 0 unless there is a constant value in the feature data. To compensate, before estimating the model, explicitly add a column of 1’s to the X data structure so that the estimated model has a \(y\)-intercept, and therefore fits better without requiring the assumption of a 0 intercept. Add the constant with the statsmodels function add_constant().

import statsmodels.api as sm
X = sm.add_constant(X)
## /Users/davidgerbing/anaconda3/lib/python3.7/site-packages/numpy/core/fromnumeric.py:2495: FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
##   return ptp(axis=axis, out=out, **kwargs)
X.head()
##    const  Years  Pre  Post
## 0    1.0    7.0   82    92
## 2    1.0   15.0   96    97
## 4    1.0    7.0   90    86
## 5    1.0    6.0  100   100
## 6    1.0   18.0   81    84

Estimate the model with the OLS() function from the statsmodels package, and then apply the fit method. Save not just the estimated coefficients of the model, but a reasonably complete analysis of the model, to a statsmodels data structure named results.

Note that there is no pre-analysis scaling here, such as standardization, to set the features to more or less the same scale, so the sizes of the estimated coefficients cannot be directly compared. For example, measuring a predictor in inches yields a coefficient much larger than measuring the same predictor in millimeters, but the relationship of that variable with all other variables remains exactly the same. (A sketch of standardizing the features follows the results below.)

results = sm.OLS(y, X).fit()
print(results.summary())
##                             OLS Regression Results                            
## ==============================================================================
## Dep. Variable:                 Salary   R-squared:                       0.759
## Model:                            OLS   Adj. R-squared:                  0.734
## Method:                 Least Squares   F-statistic:                     31.43
## Date:                Wed, 04 Dec 2019   Prob (F-statistic):           2.17e-09
## Time:                        11:20:12   Log-Likelihood:                -363.85
## No. Observations:                  34   AIC:                             735.7
## Df Residuals:                      30   BIC:                             741.8
## Df Model:                           3                                         
## Covariance Type:            nonrobust                                         
## ==============================================================================
##                  coef    std err          t      P>|t|      [0.025      0.975]
## ------------------------------------------------------------------------------
## const       4.401e+04    1.5e+04      2.928      0.006    1.33e+04    7.47e+04
## Years       3350.6074    362.606      9.240      0.000    2610.068    4091.147
## Pre          191.1900    451.283      0.424      0.675    -730.454    1112.834
## Post        -221.5597    468.878     -0.473      0.640   -1179.137     736.018
## ==============================================================================
## Omnibus:                        3.610   Durbin-Watson:                   2.504
## Prob(Omnibus):                  0.164   Jarque-Bera (JB):                2.946
## Skew:                           0.607   Prob(JB):                        0.229
## Kurtosis:                       2.221   Cond. No.                         886.
## ==============================================================================
## 
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
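
As noted before the estimation, the features are not standardized here. If comparable coefficient sizes are wanted, standardize before estimating, a minimal sketch in which X_std and results_std are new names for illustration:

X_std = X.drop(columns='const')
X_std = (X_std - X_std.mean()) / X_std.std()   # convert each feature to z-scores
X_std = sm.add_constant(X_std)                 # restore the intercept column
results_std = sm.OLS(y, X_std).fit()           # coefficients now share a common scale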

Collinearity

When computing the variance inflation factor for each feature, a technical requirement is to work with a matrix of numbers instead of the more sophisticated data frame, a higher-level data structure. To reduce the X data to just a matrix of numbers, use the values attribute.

print("X is a: ", type(X))
## X is a:  <class 'pandas.core.frame.DataFrame'>
print("X.values is a: ", type(X.values))
## X.values is a:  <class 'numpy.ndarray'>

Use the statsmodels function variance_inflation_factor() to compute the variance inflation factor for each predictor. The VIF’s are a property only of the X’s, so the target \(y\) is not used in this analysis. The variance_inflation_factor() function does not compute all the VIF’s at once, but only one at a time. Create a data frame named vif, then fill each row of the data frame with the name of the predictor variable and its corresponding variance inflation factor.

To systematically calculate and retrieve the VIF’s, one for each feature, traverse the variables in X one at a time with a programming structure known as a for loop, from the first X variable through the last, where X.shape[1] is the number of columns of the data frame. As an even more technical programming note, because the loop cannot traverse the original data frame, transfer the X data frame to a numeric matrix, obtained with the values attribute.

from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame()
vif["Predictor"] = X.columns
vif["VIF"] = [variance_inflation_factor(X.values, i) 
              for i in range(X.shape[1])]
vif
##   Predictor        VIF
## 0     const  58.648112
## 1     Years   1.072776
## 2       Pre   7.284852
## 3      Post   7.352515
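
Each VIF equals \(1/(1-R^2)\) from regressing that predictor on the remaining predictors. A sketch that verifies the Years VIF by hand, assuming the X data structure above with its const column:

r2_years = sm.OLS(X['Years'], X.drop(columns='Years')).fit().rsquared
print(1 / (1 - r2_years))   # matches the Years VIF above, about 1.073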

Influence and Outliers

Calculate the values of Cook’s Distance, D, and other residual-based statistics, with the method get_influence. To be useful, sort the data with sort_values according to D, then list the first 15 or so rows of D and related indices. Set the parameter ascending to False to begin the sort with the largest values, that is, a descending sort.

from statsmodels.stats.outliers_influence import OLSInfluence
infl = results.get_influence()
smry = infl.summary_frame()
smry = smry.loc[:, ['standard_resid', 'student_resid', 'cooks_d']]
smry.sort_values(by="cooks_d", ascending=False).head(15)
##     standard_resid  student_resid   cooks_d
## 17        2.196780       2.357806  0.323122
## 33       -1.407861      -1.432323  0.141680
## 18        1.893387       1.983866  0.118054
## 2         1.855212       1.938623  0.114792
## 27        1.207536       1.217189  0.108127
## 35        1.032383       1.033556  0.087721
## 32        1.639621       1.689540  0.074367
## 16       -1.109195      -1.113626  0.067304
## 15        1.529385       1.565965  0.053893
## 36       -1.452576      -1.481199  0.043860
## 10       -1.096571      -1.100419  0.036539
## 0        -0.846454      -0.842346  0.029792
## 5         0.783619       0.778456  0.022910
## 25       -0.824292      -0.819773  0.019243
## 34        0.724790       0.718930  0.019186

Python sklearn Machine Learning Framework

Create feature and target data structures

Create the X feature variable and \(y\) target variable data structures. Define the initial model by specifying the target and the features. Unlike the previous example, define the vector of variable names separately from creating the subset data frame X of just those variables.

y = d['Salary']
pred_vars = ['Years', 'Pre', 'Post']
X = d[pred_vars]

Model Validation with One Hold-Out Sample

Split data into train and test sets

Cross-validation tests a model on a new data set, testing data different from the data on which the model was estimated, the training data. The concept of cross-validation has applied to regression analysis for many decades, though perhaps more often recommended than actually accomplished. The machine learning framework provides easily accessible cross-validation methods, and cross-validation is considered a necessary component of the analysis.

Do the split

Function train_test_split() randomly divides the original data into two sets, training data and testing data, here called X_train and X_test for the features and y_train and y_test for the target. Parameter test_size specifies the proportion of the original data set allocated to the test split. Parameter random_state specifies the initial seed so that the same “random” split can be reproduced by train_test_split() in a future run.

The inputs into the train_test_split() function are the X and \(y\) data structures. The function provides four outputs: X training and testing data, and y training and testing data. Python has the convention of listing the names for multiple outputs on the left side of the equals sign, separated by commas, in the order of the output, as illustrated below.
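
To see the multiple-output convention in isolation, a minimal sketch with the built-in divmod() function, which returns two values captured by two comma-separated names:

quotient, remainder = divmod(17, 5)
print(quotient, remainder)   # 3 2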

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, random_state=7)
print("size of X data structures: ", X_train.shape, X_test.shape)
## size of X data structures:  (25, 3) (9, 3)
print("size of y data structures: ", y_train.shape, y_test.shape)
## size of y data structures:  (25,) (9,)

Fit the model

The sklearn package provides many machine learning algorithms, all implemented with the same general procedure illustrated here, a primary reason why Python has become the leading platform for machine learning. The LinearRegression module, called a class, provides the functions for the linear regression algorithm. Implement the general algorithm by creating a specific instance of it, referred to by a specific name in the analysis, a process called instantiation. Here instantiate with the name model. All subsequent references to the linear regression algorithm are via the name model.

from sklearn.linear_model import LinearRegression
model = LinearRegression()

Fit the model, that is, estimate the parameters, with the fit method applied to the linear regression model on the training data. Expressed another way, have the machine (i.e., the algorithm) learn from the training data. Use only the training data at this point in the analysis.

model.fit(X_train, y_train)
## LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

The analysis was on pandas data structures, but numpy data structures result. This is one of the potentially confusing issues of working with Python.

print("model.intercept_: ", type(model.intercept_))
## model.intercept_:  <class 'numpy.float64'>
print("model.coef_: ", type(model.coef_))
## model.coef_:  <class 'numpy.ndarray'>

Display the estimated model coefficients, stored in the model.coef_ object, a data structure created by the fit method. The name of a variable whose values the procedure creates ends in an underscore. The coefficients of the final model, estimated with all of the data, are needed to apply the model to other situations.

The machine learning implementation of regression is not directed towards understanding and interpreting the model coefficients, but instead focuses on evaluating the extent of forecasting error. The analysis does not provide the usual output with the coefficients listed along with their corresponding \(t\)-tests of the null hypothesis of 0, and the associated confidence interval (as obtained from the statsmodels package).

print("intercept: %.3f" % (model.intercept_), "\n")
## intercept: 48302.176
print(np.round(model.coef_, 3))
## [3364.043  298.155 -378.642]

Calculate the forecasts \(\hat y\)

Given the estimated model, generate forecasts. Use the predict method to compute two sets of \(\hat{y}\) values: y_fit when the model is applied to the data on which it trained, and y_pred when the model is applied to the test data.

y_fit = model.predict(X_train)
y_pred = model.predict(X_test)

If there is only one predictor variable, plot the scatterplot of X and y, as well as the least-squares regression line through the scatterplot. With this multiple regression, the code does not run.

The Python syntax for an if statement uses the double equal sign, ==, to evaluate equality, and a single equal sign, =, to create equality by assigning the value on the right to the variable on the left. Indicate the end of the conditional, here n_pred == 1, with a colon, :, and then indent two spaces for the statements that run if the conditional is true.

n_pred = len(pred_vars)
if n_pred == 1:
  plt.scatter(X_train, y_train, color='gray')
  plt.plot(X_train, y_fit, color='black', linewidth=2)
  plt.xlabel("Prices: $X_i$")
  plt.title("Y and Fitted $\hat{Y}_i$ Plotted Against X")

Assess fit

The basis of the assessment of the model is the comparison of the actual data values of \(y\) in the testing data, y_test, to the values of \(y\) calculated from the model, y_pred.

The metrics module in the sklearn package provides the computations for the fit indices. Compute fit with mean squared error, MSE, and \(R^2\) using metrics. For pedagogy, compute the standard deviation of the residuals two ways: manually from the data, and also as the square root of MSE.

This first application need not be done. It evaluates the fit of the model to the training data, comparing the actual data values, y_train, to the corresponding values computed by the model, y_fit. This is not the official evaluation of model fit and performance, but rather for pedagogy. It is useful, however, to compare the fit indices for the training data to those for the testing data. A large drop indicates overfitting the model to the training data.

from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(y_train, y_fit)
rsq = r2_score(y_train, y_fit)
e = y_train - y_fit

Display the fit indices. Display the standard deviation of the residuals two ways: computed directly from the residuals, and as the square root of the reported MSE.

The %.3f instructs the print() function to print a floating point number (numeric with decimal digits) with three decimal digits.

print("stdev of residuals: %.3f " % np.sqrt(np.mean(e**2)))
## stdev of residuals: 10517.662
print("stdev of residuals: %.3f " % np.sqrt(mse))
## stdev of residuals: 10517.662
print("R-squared: %.3f" % rsq)
## R-squared: 0.775

Here do the actual evaluation of model performance, by seeing how well the actual data values for \(y\), y_test, match the forecasted or predicted values of \(y\), \(\hat y\). For this split of the data, the value of \(R^2\) dropped from that obtained from the training data. (Sometimes, however, by chance, the testing data may outperform the training data.)

mse_f = mean_squared_error(y_test, y_pred)
rsq_f = r2_score(y_test, y_pred)
print('Forecasting Mean squared error: %.3f' % mse_f)
## Forecasting Mean squared error: 131830468.737
print('Forecasting R-squared: %.3f' % rsq_f)
## Forecasting R-squared: 0.680
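
For comparability with the standard deviation of residuals reported for the training data, take the square root of the forecasting MSE, a one-line sketch:

print('Forecasting stdev of residuals: %.3f' % np.sqrt(mse_f))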

Model Validation with Multiple Hold-Out Samples

K-fold analysis

As an alternative to the one hold-out sample cross-validation of the previous section, here evaluate model fit with cross-validation on multiple samples. The KFold procedure performs the cross-validation in which the model is estimated on \(k-1\) folds and then tested on the remaining fold, repeated for each fold.

Specify the number of splits (folds) of the data according to the parameter n_splits, which can vary from 2 all the way to \(n\), where \(n\) is the total number of rows of data. The parameter shuffle indicates whether to shuffle the data before splitting into the folds. The random_state parameter specifies the seed to recover the same “random” split in a future analysis.

Here instantiate KFold as kf, invoking the desired parameters.

from sklearn.model_selection import KFold, cross_validate
model = LinearRegression()
kf = KFold(n_splits=5, shuffle=True, random_state=1)

Evaluate k-fold analysis

The cross_validate function provides multiple evaluation scores from the same cross-validation folds without repeating the computations for each score. Computation times are also provided.

To estimate the model (here five different estimates from five different samples), we need to specify the estimation algorithm. We already instantiated the LinearRegression() estimator as model, but repeat here for clarity. The scoring parameter specifies obtaining \(R^2\) and MSE scores for each split, for the true forecasts of applying the model estimated from the \(k-1\) folds to the hold-out fold.

The MSE criterion has the odd property that MSE is reported as negative. The reason is that, across all scoring procedures and all estimation algorithms, the best score is always the largest, which is consistent with model-tuning algorithms that expect this behavior and with the internal code of the related sklearn functions. So here, the least negative value is the best. In reality, MSE must be a non-negative number; the sign of the real MSE is just flipped to negative.

The training evaluations are not needed for the evaluation per se, which occurs on the testing data, but they are sometimes useful to compare to the corresponding testing scores. Training scores much larger than the corresponding testing scores indicate overfitting. Obtain the training information with the parameter return_train_score.

scores = cross_validate(model, X, y, cv=kf, 
                        scoring=('r2', 'neg_mean_squared_error'),
                        return_train_score=True)

The scores array contains much information, but not in a directly readable form. Here convert to a data frame, rename the long column names, convert the MSE scores to positive numbers, and average the results. The display includes the time to fit the training data for each fold, and also the time to calculate the evaluation scores, which includes getting the predicted values.

The parameter inplace set to True makes the change within the specified data frame and saves the data frame with those changes. That is, there is no need to copy to a new data frame.

ds = pd.DataFrame(scores)
ds.rename(columns = {'test_neg_mean_squared_error': 'test_MSE',
                     'train_neg_mean_squared_error': 'train_MSE'},
         inplace=True)
ds['test_MSE'] = -ds['test_MSE']
ds['train_MSE'] = -ds['train_MSE']
print(ds.round(4))
##    fit_time  score_time  test_r2  train_r2      test_MSE     train_MSE
## 0    0.0042      0.0038   0.6116    0.7631  7.534565e+07  1.300825e+08
## 1    0.0050      0.0043   0.7222    0.7484  1.418617e+08  1.178293e+08
## 2    0.0058      0.0061  -0.1253    0.7884  1.090965e+08  1.185459e+08
## 3    0.0037      0.0033   0.4513    0.8214  5.449724e+08  4.670810e+07
## 4    0.0037      0.0037   0.7008    0.7608  6.751014e+07  1.267584e+08
print('\n')
print('Mean of test R-squared scores: %.3f' % ds['test_r2'].mean())
## Mean of test R-squared scores: 0.472
print('\n')
print('Mean of test MSE scores: %.3f' % ds['test_MSE'].mean())
## Mean of test MSE scores: 187757281.913
se = np.sqrt(ds['test_MSE'].mean())
print('Standard deviation of forecast error, root mean test MSE: %.3f' % se)
## Standard deviation of forecast error, root mean test MSE: 13702.455