library("lessR")
The Regression() function performs multiple aspects of a complete regression analysis. Abbreviate with reg(). To illustrate, first read the Employee data included as part of lessR. Read into the default lessR data frame d.
d <- Read("Employee")
As an option, also read the table of variable labels. Create the table formatted as two columns. The first column is the variable name and the second column is the corresponding variable label. Not all variables need to be entered into the table. The table can be a csv file or an Excel file.
Read the label file into the l data frame, currently the only permitted name. The labels are displayed on both the text and visualization output. Each displayed label consists of the variable name juxtaposed with the corresponding label, as shown in the display of the label file.
l <- rd("Employee_lbl")
l
## label
## Years Time of Company Employment
## Gender Man or Woman
## Dept Department Employed
## Salary Annual Salary (USD)
## JobSat Satisfaction with Work Environment
## Plan 1=GoodHealth, 2=GetWell, 3=BestCare
## Pre Test score on legal issues before instruction
## Post Test score on legal issues after instruction
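As a minimal sketch of the two-column layout described above, the following base R code writes a small label table to a temporary csv file and reads it back. The temporary file, the absence of a header row, and the use of read.csv() in place of Read() are assumptions for illustration only; the two labels are copied from the listing above.

```r
# Two-column label table: variable name, then its label
lbl <- data.frame(
  name  = c("Years", "Salary"),
  label = c("Time of Company Employment", "Annual Salary (USD)")
)

# Write to a temporary csv file (hypothetical location, no header row assumed)
f <- tempfile(fileext = ".csv")
write.table(lbl, f, sep = ",", row.names = FALSE, col.names = FALSE)

# Read back with the variable names as row names, similar to the display above
l_demo <- read.csv(f, header = FALSE, row.names = 1,
                   col.names = c("name", "label"))
l_demo
```

The object is named l_demo here so that it does not overwrite the l data frame used by lessR.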
The brief version provides just the basic analysis, what Excel provides, plus a scatterplot with the regression line, which becomes a scatterplot matrix with multiple regression. Because d is the default name of the data frame that contains the variables for analysis, the data parameter that names the input data frame need not be specified. Here, specify Salary as the target or response variable with features, or predictor variables, Years and Pre.
reg_brief(Salary ~ Years + Pre)
## >>> Suggestion
## # Create an R markdown file for interpretative output with Rmd = "file_name"
## reg(Salary ~ Years + Pre, Rmd="eg")
##
##
## BACKGROUND
##
## Data Frame: d
##
## Response Variable: Salary
## Predictor Variable 1: Years
## Predictor Variable 2: Pre
##
## Number of cases (rows) of data: 37
## Number of cases retained for analysis: 36
##
##
## BASIC ANALYSIS
##
## Estimate Std Err t-value p-value Lower 95% Upper 95%
## (Intercept) 54140.971 13666.115 3.962 0.000 26337.052 81944.891
## Years 3251.408 347.529 9.356 0.000 2544.355 3958.462
## Pre -18.265 167.652 -0.109 0.914 -359.355 322.825
##
## Standard deviation of Salary: 21,822.372
##
## Standard deviation of residuals: 11,753.478 for df=33
## 95% range of residuals: 47,825.260 = 2 * (2.035 * 11,753.478)
##
## R-squared: 0.726 Adjusted R-squared: 0.710 PRESS R-squared: 0.659
##
## Null hypothesis of all 0 population slope coefficients:
## F-statistic: 43.827 df: 2 and 33 p-value: 0.000
##
## -- Analysis of Variance
##
## df Sum Sq Mean Sq F-value p-value
## Years 1 12107157290.292 12107157290.292 87.641 0.000
## Pre 1 1639658.444 1639658.444 0.012 0.914
##
## Model 2 12108796948.736 6054398474.368 43.827 0.000
## Residuals 33 4558759843.773 138144237.690
## Salary 35 16667556792.508 476215908.357
##
##
## K-FOLD CROSS-VALIDATION
##
##
## RELATIONS AMONG THE VARIABLES
##
##
## RESIDUALS AND INFLUENCE
##
##
## PREDICTION ERROR
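The fit statistics in the output above follow from the ANOVA table by simple arithmetic. The sketch below recomputes them in base R from the displayed sums of squares and degrees of freedom. The variable names are illustrative, and the formulas are the standard textbook definitions rather than code taken from lessR.

```r
# Values copied from the BASIC ANALYSIS and ANOVA output above
ss_model <- 12108796948.736
ss_resid <- 4558759843.773
ss_total <- 16667556792.508
df_model <- 2
df_resid <- 33
df_total <- 35
se       <- 11753.478   # standard deviation of residuals

r2     <- 1 - ss_resid / ss_total                            # R-squared
r2_adj <- 1 - (ss_resid / df_resid) / (ss_total / df_total)  # adjusted R-squared
f_stat <- (ss_model / df_model) / (ss_resid / df_resid)      # F-statistic

# 95% range of residuals: twice the t cutoff times the residual SD
resid_range <- 2 * qt(0.975, df_resid) * se

round(c(r2 = r2, r2_adj = r2_adj, F = f_stat), 3)
resid_range
```

The results agree with the displayed 0.726, 0.710, 43.827 and 47,825.260 up to rounding of the inputs.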
The full output is extensive: Summary of the analysis, estimated model, fit indices, ANOVA, correlation matrix, collinearity analysis, best subset regression, residuals and influence statistics, and prediction intervals. The motivation is to provide virtually all of the information needed for a proper regression analysis.
reg(Salary ~ Years + Pre)
## >>> Suggestion
## # Create an R markdown file for interpretative output with Rmd = "file_name"
## reg(Salary ~ Years + Pre, Rmd="eg")
##
##
## BACKGROUND
##
## Data Frame: d
##
## Response Variable: Salary
## Predictor Variable 1: Years
## Predictor Variable 2: Pre
##
## Number of cases (rows) of data: 37
## Number of cases retained for analysis: 36
##
##
## BASIC ANALYSIS
##
## Estimate Std Err t-value p-value Lower 95% Upper 95%
## (Intercept) 54140.971 13666.115 3.962 0.000 26337.052 81944.891
## Years 3251.408 347.529 9.356 0.000 2544.355 3958.462
## Pre -18.265 167.652 -0.109 0.914 -359.355 322.825
##
## Standard deviation of Salary: 21,822.372
##
## Standard deviation of residuals: 11,753.478 for df=33
## 95% range of residuals: 47,825.260 = 2 * (2.035 * 11,753.478)
##
## R-squared: 0.726 Adjusted R-squared: 0.710 PRESS R-squared: 0.659
##
## Null hypothesis of all 0 population slope coefficients:
## F-statistic: 43.827 df: 2 and 33 p-value: 0.000
##
## -- Analysis of Variance
##
## df Sum Sq Mean Sq F-value p-value
## Years 1 12107157290.292 12107157290.292 87.641 0.000
## Pre 1 1639658.444 1639658.444 0.012 0.914
##
## Model 2 12108796948.736 6054398474.368 43.827 0.000
## Residuals 33 4558759843.773 138144237.690
## Salary 35 16667556792.508 476215908.357
##
##
## K-FOLD CROSS-VALIDATION
##
##
## RELATIONS AMONG THE VARIABLES
##
## Salary Years Pre
## Salary 1.00 0.85 0.03
## Years 0.85 1.00 0.05
## Pre 0.03 0.05 1.00
##
## Tolerance VIF
## Years 0.998 1.002
## Pre 0.998 1.002
##
## Years Pre R2adj X's
## 1 0 0.718 1
## 1 1 0.710 2
## 0 1 -0.028 1
##
## [based on Thomas Lumley's leaps function from the leaps package]
##
##
## RESIDUALS AND INFLUENCE
##
## -- Data, Fitted, Residual, Studentized Residual, Dffits, Cook's Distance
## [sorted by Cook's Distance]
## [n_res_rows = 20, out of 36 rows of data, or do n_res_rows="all"]
## -----------------------------------------------------------------------------------------
## Years Pre Salary fitted resid rstdnt dffits cooks
## Correll, Trevon 21 97 144419.230 120648.843 23770.387 2.424 1.217 0.430
## James, Leslie 18 70 132563.380 111387.773 21175.607 1.998 0.714 0.156
## Capelle, Adam 24 83 118138.430 130658.778 -12520.348 -1.211 -0.634 0.132
## Hoang, Binh 15 96 121074.860 101158.659 19916.201 1.860 0.649 0.131
## Korhalkar, Jessica 2 74 82502.500 59292.181 23210.319 2.171 0.638 0.122
## Billing, Susan 4 91 82675.260 65484.493 17190.767 1.561 0.472 0.071
## Singh, Niral 2 59 71055.440 59566.155 11489.285 1.064 0.452 0.068
## Skrotzki, Sara 18 63 101352.330 111515.627 -10163.297 -0.937 -0.397 0.053
## Saechao, Suzanne 8 98 65545.250 78362.271 -12817.021 -1.157 -0.390 0.050
## Kralik, Laura 10 74 102681.190 85303.447 17377.743 1.535 0.287 0.026
## Anastasiou, Crystal 2 59 66508.320 59566.155 6942.165 0.636 0.270 0.025
## Langston, Matthew 5 94 59188.960 68681.106 -9492.146 -0.844 -0.268 0.024
## Afshari, Anbar 6 100 79441.930 71822.925 7619.005 0.689 0.264 0.024
## Cassinelli, Anastis 10 80 67562.360 85193.857 -17631.497 -1.554 -0.265 0.022
## Osterman, Pascal 5 69 59704.790 69137.730 -9432.940 -0.826 -0.216 0.016
## Bellingar, Samantha 10 67 76337.830 85431.301 -9093.471 -0.793 -0.198 0.013
## LaRoe, Maria 10 80 71961.290 85193.857 -13232.567 -1.148 -0.195 0.013
## Ritchie, Darnell 7 82 63788.260 75403.102 -11614.842 -1.006 -0.190 0.012
## Sheppard, Cory 14 66 105027.550 98455.199 6572.351 0.579 0.176 0.011
## Downs, Deborah 7 90 67139.900 75256.982 -8117.082 -0.706 -0.174 0.010
##
##
## PREDICTION ERROR
##
## -- Data, Predicted, Standard Error of Prediction, 95% Prediction Intervals
## [sorted by lower bound of prediction interval]
## [to see all intervals add n_pred_rows="all"]
## ----------------------------------------------
##
## Years Pre Salary pred s_pred pi.lwr pi.upr width
## Hamide, Bita 1 83 61036.850 55876.388 12290.483 30871.211 80881.564 50010.352
## Singh, Niral 2 59 71055.440 59566.155 12619.291 33892.014 85240.296 51348.281
## Anastasiou, Crystal 2 59 66508.320 59566.155 12619.291 33892.014 85240.296 51348.281
## ...
## Link, Thomas 10 83 76312.890 85139.062 11933.518 60860.137 109417.987 48557.849
## LaRoe, Maria 10 80 71961.290 85193.857 11918.048 60946.405 109441.308 48494.903
## Cassinelli, Anastis 10 80 67562.360 85193.857 11918.048 60946.405 109441.308 48494.903
## ...
## Correll, Trevon 21 97 144419.230 120648.843 12881.876 94440.470 146857.217 52416.747
## Capelle, Adam 24 83 118138.430 130658.778 12955.608 104300.394 157017.161 52716.767
##
## ----------------------------------
## Plot 1: Distribution of Residuals
## Plot 2: Residuals vs Fitted Values
## ----------------------------------
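The tolerance and VIF values in the collinearity analysis can be reproduced with ordinary regressions among the predictors. The following is a minimal base R sketch under the standard definitions; the built-in mtcars data and the variable names stand in for the Employee data as an assumption for illustration.

```r
# Built-in mtcars data stands in for the Employee data (assumption)
x1 <- mtcars$wt
x2 <- mtcars$hp

# Tolerance of a predictor: 1 - R-squared from regressing it on the other predictors
tol_x1 <- 1 - summary(lm(x1 ~ x2))$r.squared
vif_x1 <- 1 / tol_x1

# With only two predictors, that R-squared is the squared correlation between them
c(tolerance = tol_x1, vif = vif_x1, check = 1 - cor(x1, x2)^2)
```

In the output above the correlation of Years and Pre is displayed as 0.05, so the tolerance is close to 1 - 0.05^2 = 0.9975, in line with the displayed 0.998 given that the correlation is itself rounded.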
Request a briefer output with the reg_brief() version of the function. Standardize the predictor variables in the model by setting the new_scale parameter to "z". Plot the residuals as a line connecting each data point to the corresponding point on the regression line as specified with the plot_errors parameter. To also standardize the response variable, set parameter scale_response to TRUE.
reg_brief(Salary ~ Years, new_scale="z", plot_errors=TRUE)
##
## Rescaled Data, First Six Rows
## Salary Years
## Hamide, Bita 61036.85 -1.466
## Singh, Niral 71055.44 -1.291
## Korhalkar, Jessica 82502.50 -1.291
## Anastasiou, Crystal 66508.32 -1.291
## Gvakharia, Kimberly 59868.68 -1.116
## Stanley, Emma 56124.97 -1.116
## >>> Suggestion
## # Create an R markdown file for interpretative output with Rmd = "file_name"
## reg(Salary ~ Years, new_scale="z", plot_errors=TRUE, Rmd="eg")
##
##
## BACKGROUND
##
## Data Frame: d
##
## Response Variable: Salary
## Predictor Variable: Years
##
## Number of cases (rows) of data: 37
## Number of cases retained for analysis: 36
##
## Data are Standardized
##
##
## BASIC ANALYSIS
##
## Estimate Std Err t-value p-value Lower 95% Upper 95%
## (Intercept) 83219.551 1930.348 43.111 0.000 79296.612 87142.490
## Years 18595.810 1957.448 9.500 0.000 14617.797 22573.823
##
## Standard deviation of Salary: 21,822.372
##
## Standard deviation of residuals: 11,582.088 for df=34
## 95% range of residuals: 47,075.271 = 2 * (2.032 * 11,582.088)
##
## R-squared: 0.726 Adjusted R-squared: 0.718 PRESS R-squared: 0.681
##
## Null hypothesis of all 0 population slope coefficients:
## F-statistic: 90.251 df: 1 and 34 p-value: 0.000
##
## -- Analysis of Variance
##
## df Sum Sq Mean Sq F-value p-value
## Model 1 12106634568.544 12106634568.544 90.251 0.000
## Residuals 34 4560922223.964 134144771.293
## Salary 35 16667556792.508 476215908.357
##
##
## K-FOLD CROSS-VALIDATION
##
##
## RELATIONS AMONG THE VARIABLES
##
##
## RESIDUALS AND INFLUENCE
##
##
## PREDICTION ERROR
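To see what standardizing a predictor does to the estimates, the sketch below fits the same model twice with base R lm(), once on the raw predictor and once on its z-scores. The built-in mtcars data is an assumption that stands in for the Employee data, and the code illustrates the general relationship rather than the internals of reg_brief().

```r
# Built-in mtcars data stands in for the Employee data (assumption)
fit_raw <- lm(mpg ~ wt, data = mtcars)

# z-score the predictor only; the response stays in its original units
mtcars_z <- transform(mtcars, wt = as.numeric(scale(wt)))
fit_z <- lm(mpg ~ wt, data = mtcars_z)

# Slope per 1 SD of wt equals the raw slope times sd(wt);
# the intercept becomes the mean of the response
coef(fit_z)
c(slope_check     = unname(coef(fit_raw)["wt"]) * sd(mtcars$wt),
  intercept_check = mean(mtcars$mpg))
```

The fit statistics, such as R-squared, are unchanged by the rescaling; only the units of the slope and the meaning of the intercept change.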
Specify a cross-validation with the kfold parameter. Here, specify three folds. The function automatically creates the training and testing data sets.
reg(Salary ~ Years, kfold=3)
##
## 3-FOLD CROSS-VALIDATION
##
## Model from Training Data Applied to Testing Data
## ---------------------------------- ----------------------------------
## fold n se MSE Rsq n sp MSE Rsq
## 1 | 24 12224.163 149430170.310 0.753 | 12 12160.862 147886559.193 0.418
## 2 | 24 11324.428 128242662.857 0.756 | 12 13918.926 193736507.380 0.620
## 3 | 24 10360.532 107340618.328 0.706 | 12 16699.502 278873362.318 0.657
## ---------------------------------- ----------------------------------
## Mean 11303.041 128337817.165 0.738 14259.763 206832142.964 0.565
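The idea behind the table above can be sketched by hand in base R: assign each row to one of k folds, estimate the model on the other folds, and evaluate it on the held-out fold. The built-in mtcars data, the seed, and the particular definitions of test MSE and test R-squared are assumptions for illustration; the source does not state exactly how lessR computes its test statistics.

```r
# Built-in mtcars data stands in for the Employee data (assumption)
set.seed(1)                                # arbitrary seed for a reproducible split
k <- 3
n <- nrow(mtcars)
fold <- sample(rep(1:k, length.out = n))   # random fold label for each row

cv <- sapply(1:k, function(i) {
  train <- mtcars[fold != i, ]
  test  <- mtcars[fold == i, ]
  fit   <- lm(mpg ~ wt, data = train)      # estimate on training rows only
  err   <- test$mpg - predict(fit, newdata = test)
  c(n_test = nrow(test),
    MSE    = mean(err^2),
    Rsq    = 1 - sum(err^2) / sum((test$mpg - mean(test$mpg))^2))
})

t(cv)          # one row per fold
rowMeans(cv)   # averages across folds
```

As in the lessR output, fit on the testing data is typically worse than fit on the training data.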
The standard output also includes \(R^2_{press}\), an estimate of \(R^2\) when the model is applied to new, previously unseen data, a value comparable to the average \(R^2\) on test data.
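A minimal sketch of the PRESS statistic under its usual definition: each residual is divided by one minus its leverage, which gives the leave-one-out prediction error without refitting the model. The built-in mtcars data is an assumption for illustration, and the brute-force loop is included only to confirm the shortcut.

```r
# Built-in mtcars data stands in for the Employee data (assumption)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# PRESS from the leverages (hat values): no refitting required
h     <- hatvalues(fit)
press <- sum((residuals(fit) / (1 - h))^2)
sst   <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r2_press <- 1 - press / sst

# Brute-force check: refit without row i, then predict row i
loo_err <- sapply(seq_len(nrow(mtcars)), function(i) {
  f <- lm(mpg ~ wt + hp, data = mtcars[-i, ])
  mtcars$mpg[i] - predict(f, newdata = mtcars[i, ])
})

c(r2 = summary(fit)$r.squared, r2_press = r2_press,
  press = press, press_loo = sum(loo_err^2))
```

Because every leave-one-out error is at least as large in magnitude as the ordinary residual, the PRESS version of R-squared is smaller than the usual R-squared.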
The output of Regression() can be stored into an R object, here named r. The output object consists of various components that together define the output of a comprehensive regression analysis. R refers to the resulting output structure as a list object.
r <- reg(Salary ~ Years + Pre)
Entering the name of the object displays the full output, the default output when the output is directed to the R console instead of saving into an R object.
r
## >>> Suggestion
## # Create an R markdown file for interpretative output with Rmd = "file_name"
## reg(Salary ~ Years + Pre, Rmd="eg")
##
##
## BACKGROUND
##
## Data Frame: d
##
## Response Variable: Salary
## Predictor Variable 1: Years
## Predictor Variable 2: Pre
##
## Number of cases (rows) of data: 37
## Number of cases retained for analysis: 36
##
##
## BASIC ANALYSIS
##
## Estimate Std Err t-value p-value Lower 95% Upper 95%
## (Intercept) 54140.971 13666.115 3.962 0.000 26337.052 81944.891
## Years 3251.408 347.529 9.356 0.000 2544.355 3958.462
## Pre -18.265 167.652 -0.109 0.914 -359.355 322.825
##
## Standard deviation of Salary: 21,822.372
##
## Standard deviation of residuals: 11,753.478 for df=33
## 95% range of residuals: 47,825.260 = 2 * (2.035 * 11,753.478)
##
## R-squared: 0.726 Adjusted R-squared: 0.710 PRESS R-squared: 0.659
##
## Null hypothesis of all 0 population slope coefficients:
## F-statistic: 43.827 df: 2 and 33 p-value: 0.000
##
## -- Analysis of Variance
##
## df Sum Sq Mean Sq F-value p-value
## Years 1 12107157290.292 12107157290.292 87.641 0.000
## Pre 1 1639658.444 1639658.444 0.012 0.914
##
## Model 2 12108796948.736 6054398474.368 43.827 0.000
## Residuals 33 4558759843.773 138144237.690
## Salary 35 16667556792.508 476215908.357
##
##
## K-FOLD CROSS-VALIDATION
##
##
## RELATIONS AMONG THE VARIABLES
##
## Salary Years Pre
## Salary 1.00 0.85 0.03
## Years 0.85 1.00 0.05
## Pre 0.03 0.05 1.00
##
## Tolerance VIF
## Years 0.998 1.002
## Pre 0.998 1.002
##
## Years Pre R2adj X's
## 1 0 0.718 1
## 1 1 0.710 2
## 0 1 -0.028 1
##
## [based on Thomas Lumley's leaps function from the leaps package]
##
##
## RESIDUALS AND INFLUENCE
##
## -- Data, Fitted, Residual, Studentized Residual, Dffits, Cook's Distance
## [sorted by Cook's Distance]
## [n_res_rows = 20, out of 36 rows of data, or do n_res_rows="all"]
## -----------------------------------------------------------------------------------------
## Years Pre Salary fitted resid rstdnt dffits cooks
## Correll, Trevon 21 97 144419.230 120648.843 23770.387 2.424 1.217 0.430
## James, Leslie 18 70 132563.380 111387.773 21175.607 1.998 0.714 0.156
## Capelle, Adam 24 83 118138.430 130658.778 -12520.348 -1.211 -0.634 0.132
## Hoang, Binh 15 96 121074.860 101158.659 19916.201 1.860 0.649 0.131
## Korhalkar, Jessica 2 74 82502.500 59292.181 23210.319 2.171 0.638 0.122
## Billing, Susan 4 91 82675.260 65484.493 17190.767 1.561 0.472 0.071
## Singh, Niral 2 59 71055.440 59566.155 11489.285 1.064 0.452 0.068
## Skrotzki, Sara 18 63 101352.330 111515.627 -10163.297 -0.937 -0.397 0.053
## Saechao, Suzanne 8 98 65545.250 78362.271 -12817.021 -1.157 -0.390 0.050
## Kralik, Laura 10 74 102681.190 85303.447 17377.743 1.535 0.287 0.026
## Anastasiou, Crystal 2 59 66508.320 59566.155 6942.165 0.636 0.270 0.025
## Langston, Matthew 5 94 59188.960 68681.106 -9492.146 -0.844 -0.268 0.024
## Afshari, Anbar 6 100 79441.930 71822.925 7619.005 0.689 0.264 0.024
## Cassinelli, Anastis 10 80 67562.360 85193.857 -17631.497 -1.554 -0.265 0.022
## Osterman, Pascal 5 69 59704.790 69137.730 -9432.940 -0.826 -0.216 0.016
## Bellingar, Samantha 10 67 76337.830 85431.301 -9093.471 -0.793 -0.198 0.013
## LaRoe, Maria 10 80 71961.290 85193.857 -13232.567 -1.148 -0.195 0.013
## Ritchie, Darnell 7 82 63788.260 75403.102 -11614.842 -1.006 -0.190 0.012
## Sheppard, Cory 14 66 105027.550 98455.199 6572.351 0.579 0.176 0.011
## Downs, Deborah 7 90 67139.900 75256.982 -8117.082 -0.706 -0.174 0.010
##
##
## PREDICTION ERROR
##
## -- Data, Predicted, Standard Error of Prediction, 95% Prediction Intervals
## [sorted by lower bound of prediction interval]
## [to see all intervals add n_pred_rows="all"]
## ----------------------------------------------
##
## Years Pre Salary pred s_pred pi.lwr pi.upr width
## Hamide, Bita 1 83 61036.850 55876.388 12290.483 30871.211 80881.564 50010.352
## Singh, Niral 2 59 71055.440 59566.155 12619.291 33892.014 85240.296 51348.281
## Anastasiou, Crystal 2 59 66508.320 59566.155 12619.291 33892.014 85240.296 51348.281
## ...
## Link, Thomas 10 83 76312.890 85139.062 11933.518 60860.137 109417.987 48557.849
## LaRoe, Maria 10 80 71961.290 85193.857 11918.048 60946.405 109441.308 48494.903
## Cassinelli, Anastis 10 80 67562.360 85193.857 11918.048 60946.405 109441.308 48494.903
## ...
## Correll, Trevon 21 97 144419.230 120648.843 12881.876 94440.470 146857.217 52416.747
## Capelle, Adam 24 83 118138.430 130658.778 12955.608 104300.394 157017.161 52716.767
##
## ----------------------------------
## Plot 1: Distribution of Residuals
## Plot 2: Residuals vs Fitted Values
## ----------------------------------
Or, work with the components individually. Use the base R names() function to identify all of the output components. Component names that begin with out_ are part of the standard output. Other components include just data and statistics designed to be input in additional procedures, including R markdown documents.
names(r)
## [1] "out_suggest" "call" "formula" "vars" "out_title_bck" "out_background"
## [7] "out_title_basic" "out_estimates" "out_fit" "out_anova" "out_title_mod" "out_mod"
## [13] "out_mdls" "out_title_kfold" "out_kfold" "out_title_rel" "out_cor" "out_collinear"
## [19] "out_subsets" "out_title_res" "out_residuals" "out_title_pred" "out_predict" "out_ref"
## [25] "out_Rmd" "out_Word" "out_pdf" "out_odt" "out_rtf" "out_plots"
## [31] "n.vars" "n.obs" "n.keep" "coefficients" "sterrs" "tvalues"
## [37] "pvalues" "cilb" "ciub" "anova_model" "anova_residual" "anova_total"
## [43] "se" "resid_range" "Rsq" "Rsqadj" "PRESS" "RsqPRESS"
## [49] "m_se" "m_MSE" "m_Rsq" "cor" "tolerances" "vif"
## [55] "resid.max" "pred_min_max" "residuals" "fitted" "cooks.distance" "model"
## [61] "terms"
Here, only display the estimates and their inferential analysis as part of the standard text output.
r$out_estimates
## Estimate Std Err t-value p-value Lower 95% Upper 95%
## (Intercept) 54140.971 13666.115 3.962 0.000 26337.052 81944.891
## Years 3251.408 347.529 9.356 0.000 2544.355 3958.462
## Pre -18.265 167.652 -0.109 0.914 -359.355 322.825
Here, display the numeric values of the coefficients.
r$coefficients
## (Intercept) Years Pre
## 54140.97140 3251.40825 -18.26496
An analysis of hundreds or thousands of rows of data can make it difficult to locate a specific prediction interval of interest. To search for a specific row, first do the regression and request all prediction intervals by setting the parameter n_pred_rows to "all". Then convert that output to a data frame named dp with base R read.table(). With the intervals in a data frame, do a standard search for the row of interest (see the Subset a Data Frame vignette for directions to subset).
This particular conversion to a data frame requires one more step. One or more spaces in the out_predict output delimit adjacent columns, but the names in this data set are formatted with a comma followed by a space. Use base R sub() to remove the space after the comma before converting to a data frame.
r <- reg(Salary ~ Years, n_pred_rows="all", graphics=FALSE)
r$out_predict = sub(", ", ",", r$out_predict, fixed=TRUE)
dp <- read.table(text=r$out_predict)
dp[.(row.names(dp) == "Pham,Scott"),]
## Years Salary pred s_pred pi.lwr pi.upr width
## Pham,Scott 13 91871.05 94955.08 11805.96 70962.48 118947.7 47985.2
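The conversion step can be tried without lessR on a few lines of text laid out like the prediction output. In the sketch below the character vector out is a stand-in for r$out_predict; the first data row copies values from the result above and the second row is hypothetical.

```r
# Stand-in for r$out_predict: a header line plus whitespace-delimited rows
out <- c(
  "             Years   Salary     pred",
  "Pham, Scott     13 91871.05 94955.08",
  "Doe, Jane       10 70000.00 85000.00"    # hypothetical row
)

# Remove the space after the comma so each name is a single field
out <- sub(", ", ",", out, fixed = TRUE)

# The header has one fewer field than the data rows, so read.table()
# treats the first line as column names and the first column as row names
dp_demo <- read.table(text = out)

hit <- dp_demo[row.names(dp_demo) == "Pham,Scott", ]
hit
```

Without the sub() step, each name would be split into two fields and the columns would no longer line up with the header.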
Because reg() accomplishes its computations with base R function lm(), lm() parameters can be passed to reg(), which then passes the values to lm() to define the corresponding indicator variables. Here, first use base R function contr.sum() to calculate an effect coding contrast matrix for a categorical variable with three levels, such as the variable Plan in the Employee data set.
cnt <- contr.sum(n=3)
cnt
## [,1] [,2]
## 1 1 0
## 2 0 1
## 3 -1 -1
Now use the lm() parameter contrasts to define the effect coding for JobSat, passed to reg_brief(). Contrasts only apply to factors, so convert JobSat to an R factor before the regression analysis, a task that should generally be done for all categorical variables in an analysis. The levels parameter of factor() also designates the order in which the levels appear on output displays such as a bar graph.
d$JobSat <- factor(d$JobSat, levels=c("low", "med", "high"))
reg_brief(Salary ~ JobSat, contrasts=list(JobSat=cnt))
##
## >>> JobSat is not numeric. Converted to indicator variables.
## Warning in model.matrix.default(mt, mf, contrasts): variable 'JobSat' is absent, its contrast will be ignored
## >>> Suggestion
## # Create an R markdown file for interpretative output with Rmd = "file_name"
## reg(Salary ~ JobSat, contrasts=list(JobSat=cnt), Rmd="eg")
##
##
## BACKGROUND
##
## Data Frame: d
##
## Response Variable: Salary
## Predictor Variable 1: JobSatmed
## Predictor Variable 2: JobSathigh
##
## Number of cases (rows) of data: 37
## Number of cases retained for analysis: 35
##
##
## BASIC ANALYSIS
##
## Estimate Std Err t-value p-value Lower 95% Upper 95%
## (Intercept) 95121.843 5521.853 17.226 0.000 83874.197 106369.490
## JobSatmed -8435.630 8156.317 -1.034 0.309 -25049.505 8178.245
## JobSathigh -25664.732 8156.317 -3.147 0.004 -42278.607 -9050.857
##
## Standard deviation of Salary: 22,157.414
##
## Standard deviation of residuals: 19,909.324 for df=32
## 95% range of residuals: 81,107.933 = 2 * (2.037 * 19,909.324)
##
## R-squared: 0.240 Adjusted R-squared: 0.193 PRESS R-squared: 0.098
##
## Null hypothesis of all 0 population slope coefficients:
## F-statistic: 5.056 df: 2 and 32 p-value: 0.012
##
## -- Analysis of Variance
##
## df Sum Sq Mean Sq F-value p-value
## JobSatmed 1 83510016.635 83510016.635 0.211 0.649
## JobSathigh 1 3924625926.927 3924625926.927 9.901 0.004
##
## Model 2 4008135943.563 2004067971.781 5.056 0.012
## Residuals 32 12684198101.098 396381190.659
## Salary 34 16692334044.661 490951001.314
##
##
## K-FOLD CROSS-VALIDATION
##
##
## RELATIONS AMONG THE VARIABLES
##
##
## RESIDUALS AND INFLUENCE
##
##
## PREDICTION ERROR
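For comparison, here is a minimal sketch of effect coding applied directly with base R lm(), where the contrasts parameter is documented. The built-in iris data, with its three-level factor Species, is an assumption that stands in for the Employee data.

```r
# Built-in iris data stands in for the Employee data (assumption)
cnt3 <- contr.sum(n = 3)
fit <- lm(Sepal.Length ~ Species, data = iris,
          contrasts = list(Species = cnt3))
coef(fit)

# With effect coding the intercept is the unweighted mean of the group means,
# and each slope is a group mean's deviation from that value
group_means <- tapply(iris$Sepal.Length, iris$Species, mean)
mean(group_means)
group_means - mean(group_means)
```

Under the default indicator (dummy) coding, by contrast, the intercept is the mean of the reference group and each slope is a difference from that group.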
The \(R^2\) fit statistic compares the sum of the squared errors of the model with the X predictor variables to the sum of squared errors of the null model. The baseline of comparison, the null model, is a model with no X variables such that the fitted value for each set of X values is the mean of response variable \(y\). The corresponding intercept is the mean of \(y\), and the standard deviation of the residuals is the standard deviation of \(y\).
The following submits the null model for Salary, and plots the errors. Compare the variability of the residuals to a regression model of Salary with one or more predictor variables. To the extent that the inclusion of one or more predictor variables in the model reduces the variability of the data about the regression line compared to the null model, the model fits the data.
reg_brief(Salary ~ 1, plot_errors=TRUE)
## >>> Suggestion
## # Create an R markdown file for interpretative output with Rmd = "file_name"
## reg(Salary ~ 1, plot_errors=TRUE, Rmd="eg")
##
##
## BACKGROUND
##
## Data Frame: d
##
## Response Variable: Salary
##
## Number of cases (rows) of data: 37
## Number of cases retained for analysis: 37
##
##
## BASIC ANALYSIS
##
## Estimate Std Err t-value p-value Lower 95% Upper 95%
## (Intercept) 83795.557 3583.821 23.382 0.000 76527.230 91063.883
##
## Standard deviation of Salary: NA
##
## Standard deviation of residuals: 21,799.534 for df=36
## 95% range of residuals: 88,423.006 = 2 * (2.028 * 21,799.534)
##
## -- Analysis of Variance
##
## df Sum Sq Mean Sq F-value p-value
## Residuals 36 17107907732.489 475219659.236
##
##
## K-FOLD CROSS-VALIDATION
##
##
## RELATIONS AMONG THE VARIABLES
##
##
## RESIDUALS AND INFLUENCE
##
##
## PREDICTION ERROR
The null model plot is also available from the lessR function Plot() with the fit parameter set to "null".
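The relationship between the null model and \(R^2\) can be verified with a few lines of base R. This sketch uses the built-in mtcars data as an assumption in place of the Employee data.

```r
# Built-in mtcars data stands in for the Employee data (assumption)
y <- mtcars$mpg
null_fit <- lm(y ~ 1)                     # null model: intercept only
fit      <- lm(mpg ~ wt, data = mtcars)   # model with one predictor

coef(null_fit)    # intercept equals mean(y)
sigma(null_fit)   # residual SD equals sd(y)

# R-squared as the proportional reduction in the sum of squared errors
sse_null <- sum(residuals(null_fit)^2)
sse_fit  <- sum(residuals(fit)^2)
1 - sse_fit / sse_null
```

The last value matches summary(fit)$r.squared.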
The scatterplot is displayed as a bubble plot when both variables consist of fewer than 10 unique integer values. With the bubble plot, identical points are not overprinted. Instead, each bubble indicates the number of data values at that point.
dd <- Read("Mach4")
reg_brief(m10 ~ m02, data=dd)
## >>> Suggestion
## # Create an R markdown file for interpretative output with Rmd = "file_name"
## reg(m10 ~ m02, data=dd, Rmd="eg")
##
##
## BACKGROUND
##
## Data Frame: dd
##
## Response Variable: m10
## Predictor Variable: m02
##
## Number of cases (rows) of data: 351
## Number of cases retained for analysis: 351
##
##
## BASIC ANALYSIS
##
## Estimate Std Err t-value p-value Lower 95% Upper 95%
## (Intercept) 4.285 0.092 46.642 0.000 4.104 4.466
## m02 -0.168 0.040 -4.184 0.000 -0.247 -0.089
##
## Standard deviation of m10: 1.1376
##
## Standard deviation of residuals: 1.1117 for df=349
## 95% range of residuals: 4.3731 = 2 * (1.967 * 1.1117)
##
## R-squared: 0.048 Adjusted R-squared: 0.045 PRESS R-squared: 0.037
##
## Null hypothesis of all 0 population slope coefficients:
## F-statistic: 17.502 df: 1 and 349 p-value: 0.000
##
## -- Analysis of Variance
##
## df Sum Sq Mean Sq F-value p-value
## Model 1 21.632 21.632 17.502 0.000
## Residuals 349 431.343 1.236
## m10 350 452.974 1.294
##
##
## K-FOLD CROSS-VALIDATION
##
##
## RELATIONS AMONG THE VARIABLES
##
##
## RESIDUALS AND INFLUENCE
##
##
## PREDICTION ERROR
Obtain an ANCOVA by entering categorical and continuous variables as predictor variables. For a single categorical variable and a single continuous variable, Regression() displays the regression line for each level of the categorical variable.
The ANCOVA assumes that the slopes for the different levels of the categorical variable are the same for the pairing of the continuous predictor variable and continuous response variable. Visually evaluate this assumption by plotting each separate slope and scatterplot.
Plot(Years, Salary, by=Dept, fit="lm")
## [Interactive chart from the Plotly R package (Sievert, 2020)]
Then, if the slopes are not too dissimilar, run the ANCOVA. The categorical variable must be interpretable as categorical, stored either as an R factor or as a character string variable. If the categorical variable is coded numerically, convert it to a factor, such as with d$CatVar <- factor(d$CatVar), which retains the original numerical values as the value labels.
The ANCOVA displays the appropriate Type II Sum of Squares in its ANOVA table for properly evaluating the group effect that corresponds to the entered categorical variable. Note that this SS is displayed only for an ANCOVA with a single categorical variable and a single covariate.
reg_brief(Salary ~ Dept + Years)
##
## >>> Dept is not numeric. Converted to indicator variables.
## >>> Suggestion
## # Create an R markdown file for interpretative output with Rmd = "file_name"
## reg(Salary ~ Dept + Years, Rmd="eg")
##
##
## BACKGROUND
##
## Data Frame: d
##
## Response Variable: Salary
## Predictor Variable 1: DeptADMN
## Predictor Variable 2: DeptFINC
## Predictor Variable 3: DeptMKTG
## Predictor Variable 4: DeptSALE
## Predictor Variable 5: Years
##
## Number of cases (rows) of data: 37
## Number of cases retained for analysis: 35
##
##
## BASIC ANALYSIS
##
## Estimate Std Err t-value p-value Lower 95% Upper 95%
## (Intercept) 53680.151 5634.233 9.527 0.000 42156.850 65203.452
## DeptADMN 4713.926 7302.390 0.646 0.524 -10221.137 19648.990
## DeptFINC -7822.049 8057.041 -0.971 0.340 -24300.547 8656.450
## DeptMKTG -5227.930 7275.625 -0.719 0.478 -20108.253 9652.394
## DeptSALE 762.933 6351.479 0.120 0.905 -12227.301 13753.167
## Years 3234.397 364.712 8.868 0.000 2488.478 3980.317
##
## Standard deviation of Salary: 21,881.046
##
## Standard deviation of residuals: 11,741.645 for df=29
## 95% range of residuals: 48,028.719 = 2 * (2.045 * 11,741.645)
##
## R-squared: 0.754 Adjusted R-squared: 0.712 PRESS R-squared: 0.623
##
## Null hypothesis of all 0 population slope coefficients:
## F-statistic: 17.815 df: 5 and 29 p-value: 0.000
##
## -- Analysis of Variance from Type II Sums of Squares
##
## df Sum Sq Mean Sq F-value p-value
## DeptADMN 4 534128871.049 133532217.762 0.969 0.440
## DeptFINC 1 47920138.873 47920138.873 0.348 0.560
## DeptMKTG 1 48610362.188 48610362.188 0.353 0.557
## DeptSALE 1 933561595.768 933561595.768 6.772 0.014
## Years 1 10842890397.792 10842890397.792 78.648 0.000
## Residuals 29 3998120326.431 137866218.153
##
## -- Test of Interaction
##
## Dept:Years df: 4 df resid: 25 SS: 784731402.138 F: 1.526 p-value: 0.225
##
## -- Assume parallel lines, no interaction of Dept with Years
##
## Level ACCT: y^_Salary = 53680.151 + 3234.397(x_Years)
## Level ADMN: y^_Salary = 58394.077 + 3234.397(x_Years)
## Level FINC: y^_Salary = 45858.103 + 3234.397(x_Years)
## Level MKTG: y^_Salary = 48452.221 + 3234.397(x_Years)
## Level SALE: y^_Salary = 54443.084 + 3234.397(x_Years)
##
## -- Visualize Separately Computed Regression Lines
##
## Plot(Years, Salary, by=Dept, fit="lm")
##
##
## K-FOLD CROSS-VALIDATION
##
##
## RELATIONS AMONG THE VARIABLES
##
##
## RESIDUALS AND INFLUENCE
##
##
## PREDICTION ERROR
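The test of interaction and the parallel-lines equations in the output above can be sketched in base R by comparing a common-slope model with a separate-slopes model. The built-in iris data is an assumption that stands in for the Employee data, with Species in the role of Dept and Petal.Length in the role of Years.

```r
# Built-in iris data stands in for the Employee data (assumption)
parallel <- lm(Sepal.Length ~ Species + Petal.Length, data = iris)  # common slope
separate <- lm(Sepal.Length ~ Species * Petal.Length, data = iris)  # slope per group

# F test of the interaction: do separate slopes reduce the error enough?
tab <- anova(parallel, separate)
tab

# Under the parallel-lines model each level has its own intercept:
# the reference intercept plus that level's indicator coefficient
b <- coef(parallel)
c(setosa     = unname(b["(Intercept)"]),
  versicolor = unname(b["(Intercept)"] + b["Speciesversicolor"]),
  virginica  = unname(b["(Intercept)"] + b["Speciesvirginica"]))
```

The same arithmetic applies to the output above. For example, the intercept for level ADMN is 53680.151 + 4713.926 = 58394.077.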
To do a moderation analysis, specify one of (only) two predictor variables with the parameter mod.
reg_brief(Salary ~ Years + Pre, mod=Pre)
## >>> Suggestion
## # Create an R markdown file for interpretative output with Rmd = "file_name"
## reg(Salary ~ Years + Pre, mod=Pre, Rmd="eg")
##
##
## BACKGROUND
##
## Data Frame: d
##
## Response Variable: Salary
## Predictor Variable 1: Years
## Predictor Variable 2: Pre
## Predictor Variable 3: Pre.Years
##
## Number of cases (rows) of data: 37
## Number of cases retained for analysis: 36
##
##
## BASIC ANALYSIS
##
## Estimate Std Err t-value p-value Lower 95% Upper 95%
## (Intercept) 83048.486 1891.523 43.906 0.000 79195.579 86901.392
## Years 3243.027 335.204 9.675 0.000 2560.238 3925.815
## Pre 4.246 162.142 0.026 0.979 -326.026 334.517
## Pre.Years 53.181 28.517 1.865 0.071 -4.907 111.268
##
## Standard deviation of Salary: 21,822.372
##
## Standard deviation of residuals: 11,335.624 for df=32
## 95% range of residuals: 46,179.821 = 2 * (2.037 * 11,335.624)
##
## R-squared: 0.753 Adjusted R-squared: 0.730 PRESS R-squared: 0.665
##
## Null hypothesis of all 0 population slope coefficients:
## F-statistic: 32.571 df: 3 and 32 p-value: 0.000
##
## -- Analysis of Variance
##
## df Sum Sq Mean Sq F-value p-value
## Years 1 12107157290.292 12107157290.292 94.222 0.000
## Pre 1 1639658.444 1639658.444 0.013 0.911
## Pre.Years 1 446876022.168 446876022.168 3.478 0.071
##
## Model 3 12555672970.904 4185224323.635 32.571 0.000
## Residuals 32 4111883821.605 128496369.425
## Salary 35 16667556792.508 476215908.357
##
##
## MODERATION ANALYSIS
##
## Mean of Pre: 0.000
## SD of Pre: 11.864
##
## mean+1SD for Pre: b0=83098.86 b1=3873.983
## mean for Pre: b0=83048.486 b1=3243.027
## mean-1SD for Pre: b0=82998.112 b1=2612.071
##
##
## K-FOLD CROSS-VALIDATION
##
##
## RELATIONS AMONG THE VARIABLES
##
##
## RESIDUALS AND INFLUENCE
##
##
## PREDICTION ERROR
In this analysis, Pre was not detected as a moderator of the impact of Years on Salary. The non-parallel lines in the visualization suggest a tendency toward moderation, but the interaction, with \(p=0.071\), is not significant at the \(\alpha=0.05\) level.
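The intercepts and slopes in the MODERATION ANALYSIS section follow from the estimated coefficients. A minimal sketch in base R, with all numbers copied from the output above and a helper function, simple(), introduced only for illustration:

```r
# Estimates copied from the output above; the mean of Pre is displayed as 0
b0      <- 83048.486   # intercept
b_years <- 3243.027    # slope of Years
b_pre   <- 4.246       # slope of Pre
b_int   <- 53.181      # slope of the Pre.Years interaction
sd_pre  <- 11.864

# Intercept and slope of Salary on Years at a chosen value of Pre
simple <- function(pre) c(b0 = b0 + b_pre * pre, b1 = b_years + b_int * pre)

rbind(`mean+1SD` = simple(sd_pre),
      mean       = simple(0),
      `mean-1SD` = simple(-sd_pre))
```

The results match the displayed values up to rounding of the coefficients. The slope of Years changes by about 631 for each standard deviation of Pre, the pattern that appears as non-parallel lines in the visualization.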
For a model with a binary response variable, \(y\), specify multiple logistic regression with the usual R formula syntax applied to the lessR function Logit(). The output includes the confusion matrix and various classification fit indices.
d <- Read("BodyMeas")
Logit(Gender ~ Hand)
##
## Response Variable: Gender
## Predictor Variable 1: Hand
##
## Number of cases (rows) of data: 340
## Number of cases retained for analysis: 340
##
##
## BASIC ANALYSIS
##
## -- Estimated Model of Gender for the Logit of Reference Group Membership
##
## Estimate Std Err z-value p-value Lower 95% Upper 95%
## (Intercept) -26.9237 2.7515 -9.785 0.000 -32.3166 -21.5308
## Hand 3.2023 0.3269 9.794 0.000 2.5615 3.8431
##
##
## -- Odds Ratios and Confidence Intervals
##
## Odds Ratio Lower 95% Upper 95%
## (Intercept) 0.0000 0.0000 0.0000
## Hand 24.5883 12.9547 46.6690
##
##
## -- Model Fit
##
## Null deviance: 471.340 on 339 degrees of freedom
## Residual deviance: 220.664 on 338 degrees of freedom
##
## AIC: 224.6641
##
## Number of iterations to convergence: 6
##
##
## ANALYSIS OF RESIDUALS AND INFLUENCE
## Data, Fitted, Residual, Studentized Residual, Dffits, Cook's Distance
## [sorted by Cook's Distance]
## [res_rows = 20 out of 340 cases (rows) of data]
## --------------------------------------------------------------------
## Hand Gender P(Y=1) residual rstudent dffits cooks
## 125 7.0 M 0.0109 0.9891 3.045 0.1930 0.11740
## 253 7.0 M 0.0109 0.9891 3.045 0.1930 0.11740
## 162 9.5 W 0.9706 -0.9706 -2.684 -0.2256 0.07555
## 313 9.5 W 0.9706 -0.9706 -2.684 -0.2256 0.07555
## 20 9.0 W 0.8695 -0.8695 -2.031 -0.2229 0.02611
## 33 9.0 W 0.8695 -0.8695 -2.031 -0.2229 0.02611
## 59 9.0 W 0.8695 -0.8695 -2.031 -0.2229 0.02611
## 67 9.0 W 0.8695 -0.8695 -2.031 -0.2229 0.02611
## 69 9.0 W 0.8695 -0.8695 -2.031 -0.2229 0.02611
## 87 9.0 W 0.8695 -0.8695 -2.031 -0.2229 0.02611
## 90 9.0 W 0.8695 -0.8695 -2.031 -0.2229 0.02611
## 150 9.0 W 0.8695 -0.8695 -2.031 -0.2229 0.02611
## 248 9.0 W 0.8695 -0.8695 -2.031 -0.2229 0.02611
## 276 9.0 W 0.8695 -0.8695 -2.031 -0.2229 0.02611
## 284 9.0 W 0.8695 -0.8695 -2.031 -0.2229 0.02611
## 308 9.0 W 0.8695 -0.8695 -2.031 -0.2229 0.02611
## 8 8.0 M 0.2132 0.7868 1.766 0.1948 0.01462
## 109 8.0 M 0.2132 0.7868 1.766 0.1948 0.01462
## 132 8.0 M 0.2132 0.7868 1.766 0.1948 0.01462
## 142 8.0 M 0.2132 0.7868 1.766 0.1948 0.01462
##
##
## PREDICTION
##
## Probability threshold for classification M: 0.5
##
## 0: W
## 1: M
##
## Data, Fitted Values, Standard Errors
## [sorted by fitted value]
## [pred_all=TRUE to see all intervals displayed]
## --------------------------------------------------------------------
## Hand Gender label fitted std.err
## 153 6.00 W 0 0.0004481 0.0003597
## 32 6.75 W 0 0.0049256 0.0027736
## 155 6.75 W 0 0.0049256 0.0027736
## 7 7.00 W 0 0.0109024 0.0052692
##
## ... for the rows of data where fitted is close to 0.5 ...
##
## Hand Gender label fitted std.err
## 293 8.25 M 0 0.3764 0.04188
## 311 8.25 M 0 0.3764 0.04188
## 1 8.50 W 1 0.5734 0.04274
## 9 8.50 M 1 0.5734 0.04274
## 16 8.50 M 1 0.5734 0.04274
##
## ... for the last 4 rows of sorted data ...
##
## Hand Gender label fitted std.err
## 151 11 M 1 0.9998 0.0002152
## 196 11 M 1 0.9998 0.0002152
## 257 11 M 1 0.9998 0.0002152
## 299 11 M 1 0.9998 0.0002152
## --------------------------------------------------------------------
##
##
## ----------------------------
## Specified confusion matrices
## ----------------------------
##
## Probability threshold for predicting M: 0.5
## Corresponding cutoff threshold for Hand: 8.408
##
## Baseline Predicted
## ---------------------------------------------------
## Total %Tot 0 1 %Correct
## ---------------------------------------------------
## 1 170 50.0 17 153 90.0
## Gender 0 170 50.0 147 23 86.5
## ---------------------------------------------------
## Total 340 88.2
##
## Accuracy: 88.24
## Sensitivity: 90.00
## Precision: 86.93
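As a check on how the displayed indices relate to one another, the following base R sketch recomputes the odds ratios and the classification fit indices from the numbers shown in the output. The variable names are illustrative. Because the displayed coefficients are rounded, the recomputed odds ratios can differ from the output in the last digit or two.

```r
# Slope coefficient for Hand and its 95% limits, copied from the output
est <- c(estimate = 3.2023, lower = 2.5615, upper = 3.8431)

# An odds ratio is the exponentiated logit coefficient
odds_ratio <- exp(est)
print(round(odds_ratio, 2))

# Cells of the confusion matrix at threshold 0.5, with M scored as 1
tp <- 153   # actual M, predicted M
fn <- 17    # actual M, predicted W
tn <- 147   # actual W, predicted W
fp <- 23    # actual W, predicted M

accuracy    <- 100 * (tp + tn) / (tp + fn + tn + fp)  # percent correct overall
sensitivity <- 100 * tp / (tp + fn)                   # percent of actual M detected
precision   <- 100 * tp / (tp + fp)                   # percent of predicted M that are M
print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              precision = precision), 2))
```

The odds ratio for Hand of about 24.6 means that each additional unit of Hand multiplies the estimated odds of membership in the group scored 1 by about 24.6.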
Specify additional probability thresholds for classification beyond just the default 0.5 with the prob_cut parameter.
Logit(Gender ~ Hand, prob_cut=c(.3, .5, .7))
##
## Response Variable: Gender
## Predictor Variable 1: Hand
##
## Number of cases (rows) of data: 340
## Number of cases retained for analysis: 340
##
##
## BASIC ANALYSIS
##
## -- Estimated Model of Gender for the Logit of Reference Group Membership
##
## Estimate Std Err z-value p-value Lower 95% Upper 95%
## (Intercept) -26.9237 2.7515 -9.785 0.000 -32.3166 -21.5308
## Hand 3.2023 0.3269 9.794 0.000 2.5615 3.8431
##
##
## -- Odds Ratios and Confidence Intervals
##
## Odds Ratio Lower 95% Upper 95%
## (Intercept) 0.0000 0.0000 0.0000
## Hand 24.5883 12.9547 46.6690
##
##
## -- Model Fit
##
## Null deviance: 471.340 on 339 degrees of freedom
## Residual deviance: 220.664 on 338 degrees of freedom
##
## AIC: 224.6641
##
## Number of iterations to convergence: 6
##
##
## ANALYSIS OF RESIDUALS AND INFLUENCE
## Data, Fitted, Residual, Studentized Residual, Dffits, Cook's Distance
## [sorted by Cook's Distance]
## [res_rows = 20 out of 340 cases (rows) of data]
## --------------------------------------------------------------------
## Hand Gender P(Y=1) residual rstudent dffits cooks
## 125 7.0 M 0.0109 0.9891 3.045 0.1930 0.11740
## 253 7.0 M 0.0109 0.9891 3.045 0.1930 0.11740
## 162 9.5 W 0.9706 -0.9706 -2.684 -0.2256 0.07555
## 313 9.5 W 0.9706 -0.9706 -2.684 -0.2256 0.07555
## 20 9.0 W 0.8695 -0.8695 -2.031 -0.2229 0.02611
## 33 9.0 W 0.8695 -0.8695 -2.031 -0.2229 0.02611
## 59 9.0 W 0.8695 -0.8695 -2.031 -0.2229 0.02611
## 67 9.0 W 0.8695 -0.8695 -2.031 -0.2229 0.02611
## 69 9.0 W 0.8695 -0.8695 -2.031 -0.2229 0.02611
## 87 9.0 W 0.8695 -0.8695 -2.031 -0.2229 0.02611
## 90 9.0 W 0.8695 -0.8695 -2.031 -0.2229 0.02611
## 150 9.0 W 0.8695 -0.8695 -2.031 -0.2229 0.02611
## 248 9.0 W 0.8695 -0.8695 -2.031 -0.2229 0.02611
## 276 9.0 W 0.8695 -0.8695 -2.031 -0.2229 0.02611
## 284 9.0 W 0.8695 -0.8695 -2.031 -0.2229 0.02611
## 308 9.0 W 0.8695 -0.8695 -2.031 -0.2229 0.02611
## 8 8.0 M 0.2132 0.7868 1.766 0.1948 0.01462
## 109 8.0 M 0.2132 0.7868 1.766 0.1948 0.01462
## 132 8.0 M 0.2132 0.7868 1.766 0.1948 0.01462
## 142 8.0 M 0.2132 0.7868 1.766 0.1948 0.01462
##
##
## PREDICTION
##
## Probability threshold for classification M: 0.5
##
## 0: W
## 1: M
##
## Data, Fitted Values, Standard Errors
## [sorted by fitted value]
## [pred_all=TRUE to see all intervals displayed]
## --------------------------------------------------------------------
## Hand Gender label fitted std.err
## 153 6.00 W 0 0.0004481 0.0003597
## 32 6.75 W 0 0.0049256 0.0027736
## 155 6.75 W 0 0.0049256 0.0027736
## 7 7.00 W 0 0.0109024 0.0052692
##
## ... for the rows of data where fitted is close to 0.5 ...
##
## Hand Gender label fitted std.err
## 293 8.25 M 0 0.3764 0.04188
## 311 8.25 M 0 0.3764 0.04188
## 1 8.50 W 1 0.5734 0.04274
## 9 8.50 M 1 0.5734 0.04274
## 16 8.50 M 1 0.5734 0.04274
##
## ... for the last 4 rows of sorted data ...
##
## Hand Gender label fitted std.err
## 151 11 M 1 0.9998 0.0002152
## 196 11 M 1 0.9998 0.0002152
## 257 11 M 1 0.9998 0.0002152
## 299 11 M 1 0.9998 0.0002152
## --------------------------------------------------------------------
##
##
## ----------------------------
## Specified confusion matrices
## ----------------------------
##
## Probability threshold for predicting M: 0.3
## Corresponding cutoff threshold for Hand: 8.143
##
## Baseline Predicted
## ---------------------------------------------------
## Total %Tot 0 1 %Correct
## ---------------------------------------------------
## 1 170 50.0 17 153 90.0
## Gender 0 170 50.0 147 23 86.5
## ---------------------------------------------------
## Total 340 88.2
##
## Accuracy: 88.24
## Sensitivity: 90.00
## Precision: 86.93
##
##
##
## Probability threshold for predicting M: 0.5
## Corresponding cutoff threshold for Hand: 8.408
##
## Baseline Predicted
## ---------------------------------------------------
## Total %Tot 0 1 %Correct
## ---------------------------------------------------
## 1 170 50.0 17 153 90.0
## Gender 0 170 50.0 147 23 86.5
## ---------------------------------------------------
## Total 340 88.2
##
## Accuracy: 88.24
## Sensitivity: 90.00
## Precision: 86.93
##
##
##
## Probability threshold for predicting M: 0.7
## Corresponding cutoff threshold for Hand: 8.672
##
## Baseline Predicted
## ---------------------------------------------------
## Total %Tot 0 1 %Correct
## ---------------------------------------------------
## 1 170 50.0 39 131 77.1
## Gender 0 170 50.0 156 14 91.8
## ---------------------------------------------------
## Total 340 84.4
##
## Accuracy: 84.41
## Sensitivity: 77.06
## Precision: 90.34
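Each probability threshold corresponds to a cutoff value of Hand, obtained by solving the fitted logit equation for the predictor. The base R sketch below assumes the rounded coefficients displayed in the output; the names `b0`, `b1`, and `hand_cut` are illustrative.

```r
# Coefficients copied from the estimated model in the output
b0 <- -26.9237   # intercept
b1 <- 3.2023     # slope for Hand

prob_cut <- c(0.3, 0.5, 0.7)

# logit(p) = b0 + b1 * Hand, so Hand = (logit(p) - b0) / b1
# qlogis() is the base R logit function, log(p / (1 - p))
hand_cut <- (qlogis(prob_cut) - b0) / b1
print(round(hand_cut, 3))
```

The computed values agree with the three reported cutoff thresholds for Hand. Raising the probability threshold moves the cutoff to a larger hand size, so fewer cases are classified as M.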
Categorize Hand size into six bins. Compute the conditional mean of Gender, scored as 0 and 1, at each level of Hand size. Both variables must be numeric. The visualization approximates the form of the sigmoid function from logistic regression. The point (bubble) size depends on the sample size for the corresponding bin.
d$Gender <- ifelse(d$Gender == "M", 1, 0)
Plot(Hand, Gender, n_bins=6)
## [Interactive chart from the Plotly R package (Sievert, 2020)]
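The binned conditional means can be sketched in base R to show what the visualization summarizes. This is not the lessR implementation. The data are simulated from the fitted coefficients rather than read from the BodyMeas file, and equal-width bins from `cut()` are an assumption about how the bins are formed.

```r
set.seed(1)   # simulated data, not the BodyMeas data

n <- 340
hand <- round(runif(n, 6, 11) * 4) / 4        # hypothetical hand sizes
p_male <- plogis(-26.9237 + 3.2023 * hand)    # coefficients copied from the Logit() output
gender <- rbinom(n, 1, p_male)                # 1 = M, 0 = W

bins <- cut(hand, breaks = 6)                 # six equal-width bins (an assumption)
bin_means <- tapply(gender, bins, mean)       # proportion scored 1 within each bin
bin_n <- table(bins)                          # bin sample sizes, the basis for bubble size

print(round(bin_means, 3))
print(bin_n)
```

Because Gender is scored 0 and 1, each bin mean is the proportion of cases scored 1 in that bin, which is why the plotted means trace an approximately sigmoid curve.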
The Rmd parameter automatically generates an R markdown file and then knits it into the corresponding html document, which combines the various output components with a full interpretation. The result is a new, much more complete form of computer output.
Not run here.
reg(Salary ~ Years + Pre, Rmd="eg")
Use the base R help() function to view the full manual for Regression(). Simply enter a question mark followed by the name of the function, or its abbreviation.
?reg