This R Markdown document demonstrates running Python in R, here within RStudio. Reading data and creating bar charts are illustrated, but the focus is multiple regression.
knitr::opts_chunk$set(echo=TRUE, fig.width=4.5, fig.height=3.25)
library(lessR)
##
## lessR 3.9.0 feedback: gerbing@pdx.edu web: lessRstats.com/new
## ---------------------------------------------------------------------
## 1. d <- Read("") Read text, Excel, SPSS, SAS or R data file
## d: default data frame, no need for data=
## 2. l <- Read("", var_labels=TRUE) Read variable labels into l,
## required name for data frame of labels
## 3. Help() Get help, and, e.g., Help(Read)
## 4. hs(), bc(), or ca() All histograms, all bar charts, or both
## 5. Plot(X) or Plot(X,Y) For continuous and categorical variables
## 6. by1= , by2= Trellis graphics, a plot for each by1, by2
## 7. reg(Y ~ X, Rmd="eg") Regression with full interpretative output
## 8. style("gray") Grayscale theme, + many others available
## style(show=TRUE) all color/style options and current values
## 9. getColors() create many styles of color palettes
##
## lessR parameter names now use _'s. Names with a period are deprecated.
## Ex: bin_width instead of bin.width
Get access to Python from R with functions from the reticulate package.
library(reticulate)
Mac users need to inform reticulate with use_python() to use the Anaconda distribution of Python. Otherwise reticulate uses the pre-installed, older distribution of Python (version 2), which also lacks the packages needed for data analysis.
use_python("~/anaconda3/bin/python")
Windows users need to identify the path of a plugin that allows graphics, set with Python's os module. Replace xxxx with your Windows user name. Mac users can ignore this step.
import os
os.environ['QT_QPA_PLATFORM_PLUGIN_PATH'] = 'C:/Users/xxxx/Anaconda3/Library/plugins/platforms'
For reference, py_config() lists the locations of Python on your system. This is not needed after a standard, default installation of Anaconda, but the output is particularly useful if R is not properly referencing Python.
py_config()
## python: /Users/davidgerbing/anaconda3/bin/python
## libpython: /Users/davidgerbing/anaconda3/lib/libpython3.7m.dylib
## pythonhome: /Users/davidgerbing/anaconda3:/Users/davidgerbing/anaconda3
## version: 3.7.5 (default, Oct 25 2019, 10:52:18) [Clang 4.0.1 (tags/RELEASE_401/final)]
## numpy: /Users/davidgerbing/anaconda3/lib/python3.7/site-packages/numpy
## numpy_version: 1.17.3
##
## python versions found:
## /Users/davidgerbing/anaconda3/bin/python
## /usr/bin/python
Python consists of the core language plus many packages that extend the core language with additional functions. Every data analysis invokes functions from at least two packages, pandas and numpy, and every data visualization invokes functions from matplotlib with increasing reference to the more elegant seaborn, which is based on matplotlib.
- pandas: a general package for storing and manipulating data in data frames, standard row x column data tables, with operations provided for reading, organizing, and cleaning data
- numpy: a general package for numerical computations, upon which many other packages rely, such as the following
- matplotlib: the basic plotting library for data visualizations
- seaborn: a more elegant, higher-level plotting library based on matplotlib

The abbreviations pd, np, plt, and sns stand for the corresponding full package names. It is easier to reference a package in subsequent code with its abbreviation than with its full name.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
To access a pandas function, for example, preface the function name with pd followed by a dot.
An issue when working with Python pandas is that it is built on the numpy package, yet the two packages have different data structures, with different functions that apply to them.
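To see the distinction (a minimal sketch, not part of the analysis that follows), create a small pandas Series with the pd. prefix and extract its underlying numpy array.

s = pd.Series([1, 2, 3])   # a pandas data structure, stored with an index
a = s.values               # the underlying numpy array, no index
print(type(s))             # <class 'pandas.core.series.Series'>
print(type(a))             # <class 'numpy.ndarray'>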
Read the data with the pandas read_excel() function into a pandas data frame named df. The pd. in front of the function name identifies the function as from the pandas package.
df = pd.read_excel('http://lessRstats.com/data/employee.xlsx')
#df = pd.read_excel('data/employee.xlsx')
The shape attribute provides the dimensions, rows and columns, of the data frame. It is not a function per se because it does not compute anything; it just provides already computed attributes of the data frame. As such, no () follows its name.
The head() function works as in R, listing the variable names and the first rows of the data. Note that Python calls the functions as a chain: data frame name, then a period, then the function, a sequence that can contain many function calls.
The set_index() function is optional. It takes the employee names and redefines them as row names instead of as a variable.
pandas indicates missing data with NaN. The function dropna() removes all rows with any missing data. This is needed because when accessing the pandas data frame from R, any missing data in the categorical variables disrupts the conversion.
The type() function provides information regarding the type of data structure. That information is also provided by info(), but type() is more focused on just the type. As indicated, some data structures are from pandas, some from numpy, with different types of data structures within each general category. So make sure to understand the data structure you are working with.
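Because function calls chain, several of these steps can combine into a single statement, a sketch equivalent to the separate calls demonstrated below.

df = pd.read_excel('http://lessRstats.com/data/employee.xlsx').set_index('Name').dropna()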
df.shape
## (37, 9)
df.head()
## Name Years Gender Dept Salary JobSat Plan Pre Post
## 0 Ritchie, Darnell 7.0 M ADMN 53788.26 med 1 82 92
## 1 Wu, James NaN M SALE 94494.58 low 1 62 74
## 2 Hoang, Binh 15.0 M SALE 111074.86 low 3 96 97
## 3 Jones, Alissa 5.0 F NaN 53772.58 NaN 1 65 62
## 4 Downs, Deborah 7.0 F FINC 57139.90 high 2 90 86
df = df.set_index('Name')
df.head()
## Years Gender Dept Salary JobSat Plan Pre Post
## Name
## Ritchie, Darnell 7.0 M ADMN 53788.26 med 1 82 92
## Wu, James NaN M SALE 94494.58 low 1 62 74
## Hoang, Binh 15.0 M SALE 111074.86 low 3 96 97
## Jones, Alissa 5.0 F NaN 53772.58 NaN 1 65 62
## Downs, Deborah 7.0 F FINC 57139.90 high 2 90 86
df.info()
## <class 'pandas.core.frame.DataFrame'>
## Index: 37 entries, Ritchie, Darnell to Cassinelli, Anastis
## Data columns (total 8 columns):
## Years 36 non-null float64
## Gender 37 non-null object
## Dept 36 non-null object
## Salary 37 non-null float64
## JobSat 35 non-null object
## Plan 37 non-null int64
## Pre 37 non-null int64
## Post 37 non-null int64
## dtypes: float64(2), int64(3), object(3)
## memory usage: 2.6+ KB
df = df.dropna()
df.info()
## <class 'pandas.core.frame.DataFrame'>
## Index: 34 entries, Ritchie, Darnell to Cassinelli, Anastis
## Data columns (total 8 columns):
## Years 34 non-null float64
## Gender 34 non-null object
## Dept 34 non-null object
## Salary 34 non-null float64
## JobSat 34 non-null object
## Plan 34 non-null int64
## Pre 34 non-null int64
## Post 34 non-null int64
## dtypes: float64(2), int64(3), object(3)
## memory usage: 2.4+ KB
print("df: ", type(df))
## df: <class 'pandas.core.frame.DataFrame'>
Not only can Python code run from within the R environment, R and Python can read each other's data structures. To illustrate, here lessR BarChart() accesses the data in the pandas data frame. From R code, indicate a pandas data structure with the prefix py$.
The function db() is the brief version of the lessR details() function, which displays information about the data frame. Within the R code block, convert the Python data frame, named df, to an R data frame, named d, with the R data.frame() function. Then create the bar chart.
head(py$df)
## Years Gender Dept Salary JobSat Plan Pre Post
## Ritchie, Darnell 7 M ADMN 53788.26 med 1 82 92
## Hoang, Binh 15 M SALE 111074.86 low 3 96 97
## Downs, Deborah 7 F FINC 57139.90 high 2 90 86
## Afshari, Anbar 6 F ADMN 69441.93 high 2 100 100
## Knox, Michael 18 M MKTG 99062.66 med 3 81 84
## Campagna, Justin 8 M SALE 72321.36 low 1 76 84
d <- data.frame(py$df)
db(d)
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
##
## Variable Missing Unique
## Name Type Values Values Values First and last values
## ------------------------------------------------------------------------------------------
## 1 Years double 34 0 16 7 15 7 ... 1 2 10
## 2 Gender character 34 0 2 M M F ... F F M
## 3 Dept character 34 0 5 ADMN SALE FINC ... MKTG SALE FINC
## 4 Salary double 34 0 34 53788.26 111074.86 ... 56508.32 57562.36
## 5 JobSat character 34 0 3 med low high ... high low high
## 6 Plan double 34 0 3 1 3 2 ... 2 2 1
## 7 Pre double 34 0 25 82 96 90 ... 83 59 80
## 8 Post double 34 0 20 92 97 86 ... 90 71 87
## ------------------------------------------------------------------------------------------
BarChart(Dept)
## >>> Suggestions
## BarChart(Dept, horiz=TRUE) # horizontal bar chart
## BarChart(Dept, fill="greens") # sequential green bars
## PieChart(Dept) # doughnut (ring) chart
## Plot(Dept) # bubble plot
## Plot(Dept, stat="count") # lollipop plot
##
##
## --- Dept ---
##
##
## Missing Values of Dept: 0
##
##
## ACCT ADMN FINC MKTG SALE Total
## Frequencies: 4 6 4 6 14 34
## Proportions: 0.118 0.176 0.118 0.176 0.412 1.000
##
##
## Chi-squared test of null hypothesis of equal probabilities
## Chisq = 10.118, df = 4, p-value = 0.038
Here is a data visualization, a bar chart with the seaborn function countplot(), which produces a bar chart similar to the output of lessR BarChart().
sns.countplot(x='Dept', saturation=.5, data=df)
Understand the distribution of the target variable, Salary, with its histogram and density estimate, obtained with the seaborn function distplot().
sns.distplot(df.Salary, color='steelblue')
Here the data frame, df, is referenced directly, as df.Salary, rather than with the data parameter as in the previous bar chart. Either style of reference works in either function.
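For example, both styles of reference apply to either function, a sketch of the equivalent forms.

sns.countplot(x=df.Dept)      # direct reference, without the data parameter
sns.distplot(df['Salary'])    # bracket subsetting, equivalent to df.Salary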
We have already seen the lessR Regression() function, abbreviated reg(), provided here for comparison.
reg(Salary ~ Years + Pre + Post)
## >>> Suggestion
## # Create an R markdown file for interpretative output with Rmd = "file_name"
## reg(Salary ~ Years + Pre + Post, Rmd="eg")
##
##
## BACKGROUND
##
## Data Frame: d
##
## Response Variable: Salary
## Predictor Variable 1: Years
## Predictor Variable 2: Pre
## Predictor Variable 3: Post
##
## Number of cases (rows) of data: 34
## Number of cases retained for analysis: 34
##
##
## BASIC ANALYSIS
##
## Estimate Std Err t-value p-value Lower 95% Upper 95%
## (Intercept) 44009.758 15030.681 2.928 0.006 13313.013 74706.502
## Years 3350.607 362.606 9.240 0.000 2610.068 4091.147
## Pre 191.190 451.283 0.424 0.675 -730.454 1112.834
## Post -221.560 468.878 -0.473 0.640 -1179.137 736.018
##
##
## Standard deviation of residuals: 11444.349 for 30 degrees of freedom
##
## R-squared: 0.759 Adjusted R-squared: 0.734 PRESS R-squared: 0.676
##
## Null hypothesis that all population slope coefficients are 0:
## F-statistic: 31.425 df: 3 and 30 p-value: 0.000
##
##
## df Sum Sq Mean Sq F-value p-value
## Years 1 12318195736.747 12318195736.747 94.051 0.000
## Pre 1 221689.330 221689.330 0.002 0.967
## Post 1 29244411.687 29244411.687 0.223 0.640
## Model 3 12347661837.764 4115887279.255 31.425 0.000
##
## Residuals 30 3929193540.415 130973118.014
##
## Salary 33 16276855378.179 493238041.763
##
##
## K-FOLD CROSS-VALIDATION
##
## RELATIONS AMONG THE VARIABLES
##
## Salary Years Pre Post
## Salary 1.00 0.87 -0.00 -0.10
## Years 0.87 1.00 0.00 -0.10
## Pre -0.00 0.00 1.00 0.92
## Post -0.10 -0.10 0.92 1.00
##
##
## Tolerance VIF
## Years 0.932 1.073
## Pre 0.137 7.285
## Post 0.136 7.353
##
##
## Years Pre Post R2adj X's
## 1 0 0 0.749 1
## 1 0 1 0.741 2
## 1 1 0 0.741 2
## 1 1 1 0.734 3
## 0 1 1 0.012 2
## 0 0 1 -0.020 1
## 0 1 0 -0.031 1
##
## [based on Thomas Lumley's leaps function from the leaps package]
##
##
##
## RESIDUALS AND INFLUENCE
##
## Data, Fitted, Residual, Studentized Residual, Dffits, Cook's Distance
## [sorted by Cook's Distance]
## [res_rows = 20, out of 34 rows of data, or do res_rows="all"]
## -------------------------------------------------------------------------------------------------
## Years Pre Post Salary fitted resid rstdnt dffits cooks
## Correll, Trevon 21.000 97.000 94.000 134419.230 112091.334 22327.896 2.358 1.220 0.323
## Capelle, Adam 24.000 83.000 81.000 108138.430 122346.772 -14208.342 -1.432 -0.766 0.142
## James, Leslie 18.000 70.000 70.000 122563.380 102194.814 20368.566 1.984 0.720 0.118
## Hoang, Binh 15.000 96.000 97.000 111074.860 91131.821 19943.039 1.939 0.708 0.115
## Singh, Niral 2.000 59.000 59.000 61055.440 48919.162 12136.278 1.217 0.663 0.108
## Anastasiou, Crystal 2.000 59.000 71.000 56508.320 46260.446 10247.874 1.034 0.593 0.088
## Billing, Susan 4.000 91.000 90.000 72675.260 54870.107 17805.153 1.690 0.562 0.074
## Skrotzki, Sara 18.000 63.000 61.000 91352.330 102850.521 -11498.191 -1.114 -0.521 0.067
## Kralik, Laura 10.000 74.000 71.000 92681.190 75933.155 16748.035 1.566 0.475 0.054
## Cassinelli, Anastis 10.000 80.000 87.000 57562.360 73535.340 -15972.980 -1.481 -0.427 0.044
## Saechao, Suzanne 8.000 98.000 100.000 55545.250 67395.270 -11850.020 -1.100 -0.384 0.037
## Ritchie, Darnell 7.000 82.000 92.000 53788.260 62758.100 -8969.840 -0.842 -0.344 0.030
## Afshari, Anbar 6.000 100.000 100.000 69441.930 61076.435 8365.495 0.778 0.301 0.023
## Langston, Matthew 5.000 94.000 93.000 49188.960 58129.605 -8940.645 -0.820 -0.276 0.019
## Hamide, Bita 1.000 83.000 90.000 51036.850 43288.764 7748.086 0.719 0.275 0.019
## LaRoe, Maria 10.000 80.000 86.000 61961.290 73756.900 -11795.610 -1.066 -0.271 0.018
## Sheppard, Cory 14.000 66.000 73.000 95027.550 87362.945 7664.605 0.709 0.266 0.018
## Downs, Deborah 7.000 90.000 86.000 57139.900 65616.977 -8477.077 -0.778 -0.266 0.018
## Osterman, Pascal 5.000 69.000 70.000 49704.790 58445.728 -8740.938 -0.795 -0.249 0.016
## Cooper, Lindsay 4.000 78.000 91.000 56772.950 52163.077 4609.873 0.451 0.243 0.015
##
##
## FORECASTING ERROR
##
## Data, Predicted, Standard Error of Forecast, 95% Prediction Intervals
## [sorted by lower bound of prediction interval]
## [to see all intervals do pred_rows="all"]
## ---------------------------------------------------------------------------------------------------------
## Years Pre Post Salary pred sf pi:lwr pi:upr width
## Hamide, Bita 1.000 83.000 90.000 51036.850 43288.764 12151.865 18471.345 68106.184 49634.839
## Anastasiou, Crystal 2.000 59.000 71.000 56508.320 46260.446 12783.281 20153.503 72367.389 52213.886
## Singh, Niral 2.000 59.000 59.000 61055.440 48919.162 12686.005 23010.883 74827.440 51816.557
## ...
## Bellingar, Samantha 10.000 67.000 72.000 66337.830 74373.265 11842.748 50187.148 98559.383 48372.235
## Link, Thomas 10.000 83.000 83.000 66312.890 74995.149 11641.777 51219.468 98770.830 47551.362
## Kralik, Laura 10.000 74.000 71.000 92681.190 75933.155 11917.446 51594.484 100271.826 48677.341
## ...
## James, Leslie 18.000 70.000 70.000 122563.380 102194.814 12092.033 77499.589 126890.039 49390.450
## Correll, Trevon 21.000 97.000 94.000 134419.230 112091.334 12595.276 86368.349 137814.319 51445.971
## Capelle, Adam 24.000 83.000 81.000 108138.430 122346.772 12652.859 96506.188 148187.357 51681.169
##
##
## ----------------------------------
## Plot 1: Distribution of Residuals
## Plot 2: Residuals vs Fitted Values
## Plot 3: ScatterPlot Matrix
## ----------------------------------
Base R regression involves calling many functions, which are packaged together in the lessR Regression() function.
The lm() function, for linear model, is the core R function for least-squares model estimation. The companion summary() and confint() functions provide the usual estimates, with hypothesis tests and confidence intervals.
lm.out <- lm(Salary ~ Years + Pre, data=d)
summary(lm.out)
##
## Call:
## lm(formula = Salary ~ Years + Pre, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17149 -8310 -2268 6837 22477
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41309.237 13726.537 3.009 0.00516
## Years 3395.231 345.679 9.822 0.0000000000491
## Pre -6.879 165.094 -0.042 0.96703
##
## Residual standard error: 11300 on 31 degrees of freedom
## Multiple R-squared: 0.7568, Adjusted R-squared: 0.7411
## F-statistic: 48.24 on 2 and 31 DF, p-value: 0.0000000003036
confint(lm.out)
## 2.5 % 97.5 %
## (Intercept) 13313.781 69304.6930
## Years 2690.214 4100.2478
## Pre -343.591 329.8331
Further functions provide influence measures. The main such function is influence.measures(). Its output is an R list, with the influence measures themselves in the list component infmat, so reference them with $infmat.
The Sort() function is from lessR. Base R provides no single function for sorting a data frame. (The Sort() function in the current version of lessR has a bug, so it is commented out below; the version uploaded to CRAN in early January will fix that. A base R workaround with order() is sketched after the output below.)
yhat <- fitted(lm.out)
rstdnt <- rstudent(lm.out)
cooks <- cooks.distance(lm.out)
d_inf <- influence.measures(lm.out)$infmat
d_inf <- data.frame(d_inf)
head(d_inf)
## dfb.1_ dfb.Yers dfb.Pre dffit
## Ritchie, Darnell -0.01371453 0.08252407 -0.031625959 -0.19057160
## Hoang, Binh -0.48792131 0.32375961 0.476518727 0.67322887
## Downs, Deborah 0.06691406 0.05740949 -0.101545896 -0.16500230
## Afshari, Anbar -0.19720370 -0.09852781 0.252103594 0.30745399
## Knox, Michael 0.01437980 -0.06813447 -0.004377089 -0.08246780
## Campagna, Justin 0.03585067 -0.02091895 -0.022022042 0.07436278
## cov.r cook.d hat
## Ritchie, Darnell 1.0456068 0.012131836 0.03744441
## Hoang, Binh 0.8807424 0.139251833 0.11091927
## Downs, Deborah 1.1222740 0.009243034 0.05861211
## Afshari, Anbar 1.1905036 0.031884982 0.12961625
## Knox, Michael 1.2093212 0.002337414 0.09359480
## Campagna, Justin 1.1266792 0.001895178 0.03532794
#d_inf <- Sort(by=cook.d, data=d_inf, direction="-", quiet=TRUE)
d_inf <- round(d_inf, 3)
d_inf[1:10,]
## dfb.1_ dfb.Yers dfb.Pre dffit cov.r cook.d hat
## Ritchie, Darnell -0.014 0.083 -0.032 -0.191 1.046 0.012 0.037
## Hoang, Binh -0.488 0.324 0.477 0.673 0.881 0.139 0.111
## Downs, Deborah 0.067 0.057 -0.102 -0.165 1.122 0.009 0.059
## Afshari, Anbar -0.197 -0.099 0.252 0.307 1.191 0.032 0.130
## Knox, Michael 0.014 -0.068 -0.004 -0.082 1.209 0.002 0.094
## Campagna, Justin 0.036 -0.021 -0.022 0.074 1.127 0.002 0.035
## Kimball, Claire 0.090 0.033 -0.117 -0.160 1.146 0.009 0.069
## Cooper, Lindsay 0.021 -0.039 -0.006 0.055 1.169 0.001 0.061
## Saechao, Suzanne 0.264 0.066 -0.323 -0.390 1.080 0.050 0.103
## Pham, Scott 0.039 -0.027 -0.041 -0.068 1.167 0.002 0.061
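Until then, sort with the base R order() function, which returns the row ordering for a descending sort on cook.d, a sketch not run here.

d_inf <- d_inf[order(d_inf$cook.d, decreasing=TRUE), ]  # sort by Cook's distance
head(d_inf)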
The k-fold cross-validation process evaluates a model estimated from training data on testing data. And, as we have seen, \(R^2_{PRESS}\) is the ultimate limit of k-fold cross-validation, with testing data of one row at a time. Still, it is good to know how to do a one-time division of the full data set into training and testing data.
The key to dividing a sample into training and testing data is the base R sample() function. The first parameter is the values from which to sample, here the integers from 1 to the number of rows of the data table. The second parameter, size, specifies the sample size, here set to 75% of the original rows of data. The parameter replace is FALSE by default, so all sampled values are unique if the population from which they are sampled contains only unique values.
To illustrate the sample() function, randomly sample 3 integers out of the first 10 integers. You get a different result, in general, every time you run this function.
sample(1:10, size=3)
## [1] 5 1 10
Now apply sample() to the row numbers (indices) of the data frame d, defining the rows of data to be in the training data frame.
propTrain <- 0.75
set.seed(23) # set seed to reproduce the sample in additional runs
rowTrain <- sample(1:nrow(d), size=propTrain*nrow(d)) # train data rows
head(rowTrain)
## [1] 29 28 24 19 8 25
With the row indices of the training data established, create a separate data frame of training data. Then, as indicated by the - sign, take all rows that are not part of the training data into the testing data frame.
d_train <- d[rowTrain, ] # train data
d_test <- d[-rowTrain, ] # what is not train data, the test data
cat("Number of rows of test data:", nrow(d_train))
## Number of rows of test data: 25
cat("Number of rows of train data:", nrow(d_test))
## Number of rows of train data: 9
Now estimate the model just from the training data. Save the results into R object lm_train.
lm_train <- lm(Salary ~ Years + Pre, data=d_train) # build the model
summary(lm_train)
##
## Call:
## lm(formula = Salary ~ Years + Pre, data = d_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14832 -8720 -2238 6660 18595
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29338.4 15859.8 1.850 0.0778
## Years 3898.4 471.9 8.261 0.0000000345
## Pre 100.5 184.8 0.544 0.5919
##
## Residual standard error: 11510 on 22 degrees of freedom
## Multiple R-squared: 0.7562, Adjusted R-squared: 0.7341
## F-statistic: 34.12 on 2 and 22 DF, p-value: 0.0000001807
Get the forecasted values of \(y\), \(\hat y\), for the testing data by applying the model estimated from the training data. Use the R function predict().
yhat_test <- predict(lm_train, d_test)
head(yhat_test)
## Ritchie, Darnell Downs, Deborah Knox, Michael
## 64870.78 65675.02 107652.82
## Gvakharia, Kimberly Anderson, David Capelle, Adam
## 49377.64 73873.97 131244.37
Combine the \(y\) and \(\hat y\) values for the testing data, then calculate the residuals, the squared residuals, and their sum. Display with the R function cat(), which lists information at the console.
resid_test <- cbind(d_test$Salary, yhat_test)
resid_test <- data.frame(resid_test)
names(resid_test)[1] <- "y_test"
resid_test$e_test <- resid_test$y_test - resid_test$yhat_test
resid_test$e_sq <- resid_test$e_test^2
head(resid_test)
## y_test yhat_test e_test e_sq
## Ritchie, Darnell 53788.26 64870.78 -11082.5154 122822147.5
## Downs, Deborah 57139.90 65675.02 -8535.1199 72848272.5
## Knox, Michael 99062.66 107652.82 -8590.1558 73790776.0
## Gvakharia, Kimberly 49868.68 49377.64 491.0362 241116.5
## Anderson, David 69547.60 73873.97 -4326.3733 18717505.9
## Capelle, Adam 108138.43 131244.37 -23105.9401 533884469.5
SSE_test <- sum(resid_test$e_sq)
cat("SSE_test:", SSE_test, "\n")
## SSE_test: 1445941398
We are done with R now, so to keep things simple, also use the name d for the Python data frame.
d = pd.read_excel('http://lessRstats.com/data/employee.xlsx')
Unlike the R lm() regression function, Python regression analyses fail with missing data, so remove missing data with the dropna() function.
d = d.dropna()
Python regression functions do not automatically convert categorical variables to dummy variables, as does the R lm() function. So if the analysis contains one or more categorical variables, create the corresponding dummy variables with the pandas get_dummies() function. Here illustrate this function for the categorical variable Gender.
d = pd.get_dummies(d, columns=["Gender"])
Note, however, that this particular analysis does not involve Gender. This code is to illustrate the concept for analyses that do involve categorical predictor variables.
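To verify the recoding, list the column names; the Gender_F and Gender_M names shown in the comment assume the default get_dummies() naming.

print(d.columns.tolist())   # Gender replaced by dummy columns, e.g., Gender_F, Gender_M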
Here use the OLS() function from the Python statsmodels package. OLS abbreviates Ordinary Least Squares, that is, the usual version of the least-squares estimation algorithm. This Python regression analysis is basically a traditional analysis (compared to the machine learning approach).
Store the features, the predictor variables, in data structure X. Store the target variable in data structure \(y\).
To define a vector, R uses the c() function, such as
c('Years', 'Pre', 'Post')
Python uses square brackets, such as
['Years', 'Pre', 'Post']
Both R and Python use square brackets to indicate subsetting of a data frame. So with Python, to subset a data frame by a vector of variable names, use double square brackets.
y = d['Salary']
X = d[['Years', 'Pre', 'Post']]
By default, the OLS() function assumes a \(y\)-intercept of 0 unless there is a constant value in the feature data. To compensate, before estimating the model, explicitly add a column of 1's to the X data structure so that the estimated model has a \(y\)-intercept, and therefore fits better without requiring the assumption of a value of 0. Add the constant with the statsmodels function add_constant().
import statsmodels.api as sm
X = sm.add_constant(X)
## /Users/davidgerbing/anaconda3/lib/python3.7/site-packages/numpy/core/fromnumeric.py:2495: FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
## return ptp(axis=axis, out=out, **kwargs)
X.head()
## const Years Pre Post
## 0 1.0 7.0 82 92
## 2 1.0 15.0 96 97
## 4 1.0 7.0 90 86
## 5 1.0 6.0 100 100
## 6 1.0 18.0 81 84
Estimate the model with the OLS() function from the statsmodels package, and then apply the fit method. Save not just the estimated coefficients of the model, but also a reasonably complete analysis of the model. Save this analysis of the regression model to a statsmodels data structure named results.
Note that there is no pre-analysis scaling, such as standardization, to set the features to more or less the same scale. So the sizes of the estimated coefficients cannot be directly compared. For example, measuring a length in millimeters instead of inches shrinks its coefficient by a factor of 25.4, yet the relationship of that variable with all other variables remains exactly the same.
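If directly comparable coefficients were wanted, the features could be standardized before estimation. Here is a minimal sketch with the sklearn StandardScaler class, not part of this analysis.

from sklearn.preprocessing import StandardScaler
# standardize each predictor to mean 0, standard deviation 1,
# before adding the constant column
X_pred = d[['Years', 'Pre', 'Post']]
scaler = StandardScaler()
X_std = pd.DataFrame(scaler.fit_transform(X_pred),
                     columns=X_pred.columns, index=X_pred.index)

Returning to the unscaled analysis, estimate the model.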
results = sm.OLS(y, X).fit()
print(results.summary())
## OLS Regression Results
## ==============================================================================
## Dep. Variable: Salary R-squared: 0.759
## Model: OLS Adj. R-squared: 0.734
## Method: Least Squares F-statistic: 31.43
## Date: Wed, 04 Dec 2019 Prob (F-statistic): 2.17e-09
## Time: 11:20:12 Log-Likelihood: -363.85
## No. Observations: 34 AIC: 735.7
## Df Residuals: 30 BIC: 741.8
## Df Model: 3
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## const 4.401e+04 1.5e+04 2.928 0.006 1.33e+04 7.47e+04
## Years 3350.6074 362.606 9.240 0.000 2610.068 4091.147
## Pre 191.1900 451.283 0.424 0.675 -730.454 1112.834
## Post -221.5597 468.878 -0.473 0.640 -1179.137 736.018
## ==============================================================================
## Omnibus: 3.610 Durbin-Watson: 2.504
## Prob(Omnibus): 0.164 Jarque-Bera (JB): 2.946
## Skew: 0.607 Prob(JB): 0.229
## Kurtosis: 2.221 Cond. No. 886.
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
When computing the variance inflation factor for each feature, a technical requirement is the need to work with a matrix of numbers instead of the more sophisticated data frame, a higher-level data structure. To reduce the X data to just a matrix of numbers, use the values attribute.
print("X is a: ", type(X))
## X is a: <class 'pandas.core.frame.DataFrame'>
print("X.values is a: ", type(X.values))
## X.values is a: <class 'numpy.ndarray'>
Use the statsmodels function variance_inflation_factor() to compute the variance inflation factor for each predictor. The VIF's are a property only of the X's, so the target \(y\) is not used in this analysis. The variance_inflation_factor() function does not compute all the VIF's at once, only one at a time. Create a data frame named vif, then fill each row of the data frame with the name of the predictor variable and its corresponding variance inflation factor.
To systematically calculate and retrieve the VIF's, one for each feature, traverse the variables in X one at a time with a programming structure known as a for loop, from the first X variable through the last, where X.shape[1] is the number of columns of the data frame. As a more technical programming note, because the loop cannot traverse the original data frame, convert the X data frame to a numeric matrix with the values attribute.
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame()
vif["Predictor"] = X.columns
vif["VIF"] = [variance_inflation_factor(X.values, i)
for i in range(X.shape[1])]
vif
## Predictor VIF
## 0 const 58.648112
## 1 Years 1.072776
## 2 Pre 7.284852
## 3 Post 7.352515
Calculate the values of Cook’s Distance, D, and other residual based statistics, with the method get_influence. To be useful, sort the data with sort_values according to D, then list the first 15 or so rows of D and related indices. Set the parameter ascending to False to begin the sort with the largest values, that is, a descending sort.
from statsmodels.stats.outliers_influence import OLSInfluence
infl = results.get_influence()
smry = infl.summary_frame()
smry = smry.loc[:, ['standard_resid', 'student_resid', 'cooks_d']]
smry.sort_values(by="cooks_d", ascending=False).head(15)
## standard_resid student_resid cooks_d
## 17 2.196780 2.357806 0.323122
## 33 -1.407861 -1.432323 0.141680
## 18 1.893387 1.983866 0.118054
## 2 1.855212 1.938623 0.114792
## 27 1.207536 1.217189 0.108127
## 35 1.032383 1.033556 0.087721
## 32 1.639621 1.689540 0.074367
## 16 -1.109195 -1.113626 0.067304
## 15 1.529385 1.565965 0.053893
## 36 -1.452576 -1.481199 0.043860
## 10 -1.096571 -1.100419 0.036539
## 0 -0.846454 -0.842346 0.029792
## 5 0.783619 0.778456 0.022910
## 25 -0.824292 -0.819773 0.019243
## 34 0.724790 0.718930 0.019186
Create the X feature and \(y\) target data structures. Define the initial model by specifying the target and the features. Unlike the previous example, define the vector of variable names separately from creating the subset data frame X of just those variables.
y = d['Salary']
pred_vars = ['Years', 'Pre', 'Post']
X = d[pred_vars]
Cross-validation tests a model on a new data set, testing data different from the data on which the model was estimated, the training data. The concept of cross-validation has applied to regression analysis for many decades, though perhaps recommended more often than actually accomplished. The machine learning framework provides easily accessible cross-validation methods, where cross-validation is considered a necessary component of the analysis.
Function train_test_split() randomly divides the original data into two sets, training data and testing data, here called X_train and X_test for the features and y_train and y_test for the target. Parameter test_size specifies the proportion of the original data set allocated to the test split. Parameter random_state specifies the initial seed so that the same "random" process can be repeated at a future time, obtaining the same data split from train_test_split().
The inputs into the train_test_split() function are the X and \(y\) data structures. The function provides four outputs: X training and testing data, and y training and testing data. Python has the convention of listing the names for multiple outputs on the left side of the equals sign, separated by commas, in the order of the output.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, random_state=7)
print("size of X data structures: ", X_train.shape, X_test.shape)
## size of X data structures: (25, 3) (9, 3)
print("size of y data structures: ", y_train.shape, y_test.shape)
## size of y data structures: (25,) (9,)
The sklearn package provides many machine learning algorithms, all implemented with the same general procedure as illustrated here, a primary reason why Python has become the leading platform for machine learning. The LinearRegression class provides the functions for the linear regression algorithm. Implement a general algorithm by creating a specific instance of the algorithm, referred to by a specific name in the analysis. This process is called instantiation. Here instantiate with the name model. All subsequent references to the linear regression algorithm are made via the name model.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
Fit the model, that is, estimate the parameters, with the fit method applied to the training data for a linear regression model. Expressed yet another way, have the machine (i.e., the algorithm) learn from the training data. Only the training data is used at this point in the analysis.
model.fit(X_train, y_train)
## LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
The analysis was of pandas data structures, but numpy data structures result. This is one of the potentially confusing issues of working with Python.
print("model.intercept_: ", type(model.intercept_))
## model.intercept_: <class 'numpy.float64'>
print("model.coef_: ", type(model.coef_))
## model.coef_: <class 'numpy.ndarray'>
Display the estimated model coefficients, stored in the model.coef_ object, a data structure created by the fit method. The name of an attribute whose values the procedure creates ends in an underscore. The coefficients of the final model, estimated with all of the data, are needed to apply the model to other situations.
The machine learning implementation of regression is not directed toward understanding and interpreting the model coefficients, but instead focuses on evaluating the extent of forecasting error. The analysis does not provide the usual output of the coefficients listed with their corresponding \(t\)-tests of the null hypothesis of 0 and the associated confidence intervals (as obtained from the statsmodels package).
print("intercept: %.3f" % (model.intercept_), "\n")
## intercept: 48302.176
print(np.round(model.coef_, 3))
## [3364.043 298.155 -378.642]
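To pair each estimate with its feature name, a small sketch that relies on X_train retaining its pandas column names.

print(pd.Series(model.coef_, index=X_train.columns).round(3))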
Given the estimated model, generate forecasts. Use the predict method to compute two sets of \(\hat{y}\) values: y_fit when the model is applied to the data on which it trained, and y_pred when the model is applied to the test data.
y_fit = model.predict(X_train)
y_pred = model.predict(X_test)
If there is only one predictor variable, plot the scatterplot of X and y, as well as the least-squares regression line through the scatterplot. Because this is multiple regression, this code is not run.
The Python syntax for an if statement uses the double equal sign, ==, to evaluate equality, and a single equal sign, =, to assign the value on the right to the variable on the left. End the conditional, here n_pred == 1, with a colon, :, and then indent the statements that run if the conditional is true.
n_pred = len(pred_vars)
if n_pred == 1:
    plt.scatter(X_train, y_train, color='gray')
    plt.plot(X_train, y_fit, color='black', linewidth=2)
    plt.xlabel("Prices: $X_i$")
    plt.title("Y and Fitted $\hat{Y}_i$ Plotted Against X")
The basis of the assessment of the model is the comparison of the actual data values of \(y\) in the testing data, y_test, to the values of \(y\) calculated from the model, y_pred.
The metrics module in the sklearn package provides the computations for the fit indices. Compute fit with mean squared error, MSE, and \(R^2\), using metrics. For pedagogy, compute the standard deviation of the residuals two ways: manually from the data, and also as the square root of MSE.
This first application need not be done. It evaluates the fit of the model to the training data, comparing the actual data values, y_train, to the corresponding values computed by the model, y_fit. This is not the official evaluation of model fit and performance, but is shown for pedagogy. It is useful, however, to compare the fit indices for the training data to those for the testing data. A large drop indicates overfitting the model to the training data.
from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(y_train, y_fit)
rsq = r2_score(y_train, y_fit)
e = y_train - y_fit
Display the fit indices. Display the standard deviation of the residuals two ways: computed manually from the residuals, and as the square root of MSE.
The %.3f instructs the print() function to print a floating point number (numeric with decimal digits) with three decimal digits.
print("stdev of residuals: %.3f " % np.sqrt(np.mean(e**2)))
## stdev of residuals: 10517.662
print("stdev of residuals: %.3f " % np.sqrt(mse))
## stdev of residuals: 10517.662
print("R-squared: %.3f" % rsq)
## R-squared: 0.775
Here do the actual evaluation of model performance, by seeing how well the actual data values for \(y\), y_test, match the forecasted or predicted values of \(y\), \(\hat y\). For this split of the data, the value of \(R^2\) dropped from that obtained from the training data. (Sometimes, however, the testing data may outperform the training data, just due to chance.)
mse_f = mean_squared_error(y_test, y_pred)
rsq_f = r2_score(y_test, y_pred)
print('Forecasting Mean squared error: %.3f' % mse_f)
## Forecasting Mean squared error: 131830468.737
print('Forecasting R-squared: %.3f' % rsq_f)
## Forecasting R-squared: 0.680
As an alternative to the single hold-out cross-validation of the previous section, here evaluate model fit with cross-validation on multiple samples. The KFold procedure performs the cross-validation in which the model is estimated using \(k-1\) folds and then tested on the remaining fold, repeated for each fold.
Specify the number of splits (folds) of the training data with the parameter n_splits, which can vary from 2 all the way to \(n\), the total number of rows in the training data. The parameter shuffle indicates whether to shuffle the data before splitting into the folds. The random_state parameter specifies the seed to recover the same "random" data split in a future analysis.
Here instantiate KFold as kf, invoking the desired parameters.
from sklearn.model_selection import KFold, cross_validate
model = LinearRegression()
kf = KFold(n_splits=5, shuffle=True, random_state=1)
The cross_validate method provides multiple evaluation scores from the same cross-validation folds without having to repeat the computations for each score. Computation times are also provided.
To estimate the model (here five different estimates from five different samples), specify the estimation algorithm. We already instantiated the LinearRegression() estimator as model, but repeat here for clarity. The scoring parameter specifies obtaining \(R^2\) and MSE scores for each split, for the true forecasts of applying the model estimated from the \(k-1\) folds to the hold-out fold.
The MSE criterion has the weird condition that MSE is reported as negative. The reason is that, across all scoring procedures and estimation algorithms, the best score is always the largest, which keeps MSE consistent with other model-tuning algorithms that expect this behavior, and with the internal code of the related sklearn functions. So here, the least negative score is the largest value. In reality, MSE must be a non-negative number, so the sign of the real MSE is just flipped to negative.
The training evaluations are not needed for the evaluation per se, which occurs on the testing data, but they are sometimes useful to compare to the corresponding testing scores. Training scores much larger than the corresponding testing scores indicate overfitting. Obtain the training information with the parameter return_train_score.
scores = cross_validate(model, X, y, cv=kf,
scoring=('r2', 'neg_mean_squared_error'),
return_train_score=True)
The scores array contains much information, but is not directly readable. Here convert it to a data frame, rename the long column names, convert the MSE scores to positive numbers, and average the results. The display includes the time to fit the training data for each fold, and also the time to calculate the evaluation scores, which includes getting the predicted values.
The parameter inplace set to True means to make the change within the specified data frame and save the data frame with those changes; there is no need to copy to a new data frame.
ds = pd.DataFrame(scores)
ds.rename(columns = {'test_neg_mean_squared_error': 'test_MSE',
'train_neg_mean_squared_error': 'train_MSE'},
inplace=True)
ds['test_MSE'] = -ds['test_MSE']
ds['train_MSE'] = -ds['train_MSE']
print(ds.round(4))
## fit_time score_time test_r2 train_r2 test_MSE train_MSE
## 0 0.0042 0.0038 0.6116 0.7631 7.534565e+07 1.300825e+08
## 1 0.0050 0.0043 0.7222 0.7484 1.418617e+08 1.178293e+08
## 2 0.0058 0.0061 -0.1253 0.7884 1.090965e+08 1.185459e+08
## 3 0.0037 0.0033 0.4513 0.8214 5.449724e+08 4.670810e+07
## 4 0.0037 0.0037 0.7008 0.7608 6.751014e+07 1.267584e+08
print('\n')
print('Mean of test R-squared scores: %.3f' % ds['test_r2'].mean())
## Mean of test R-squared scores: 0.472
print('\n')
print('Mean of test MSE scores: %.3f' % ds['test_MSE'].mean())
## Mean of test MSE scores: 187757281.913
se = np.sqrt(ds['test_MSE'].mean())
print('Square root of mean test MSE: %.3f' % se)
## Square root of mean test MSE: 13702.455