Regression Analysis

David Gerbing

library("lessR")

The Regression() function, abbreviated reg(), performs the multiple analyses that together make up a complete regression analysis. To illustrate, first read the Employee data included as part of lessR into d, the default lessR data frame.

d <- Read("Employee")
## 
## >>> Suggestions
## Recommended binary format for data files: feather
##   Create with Write(d, "your_file", format="feather")
## More details about your data, Enter:  details()  for d, or  details(name)
## 
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
## 
##     Variable                  Missing  Unique 
##         Name     Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1     Years   integer     36       1      16   7  NA  7 ... 1  2  10
##  2    Gender character     37       0       2   M  M  W ... W  W  M
##  3      Dept character     36       1       5   ADMN  SALE  FINC ... MKTG  SALE  FINC
##  4    Salary    double     37       0      37   53788.26  94494.58 ... 56508.32  57562.36
##  5    JobSat character     35       2       3   med  low  high ... high  low  high
##  6      Plan   integer     37       0       3   1  1  2 ... 2  2  1
##  7       Pre   integer     37       0      27   82  62  90 ... 83  59  80
##  8      Post   integer     37       0      22   92  74  86 ... 90  71  87
## ------------------------------------------------------------------------------------------

As an option, also read the table of variable labels. Create the table formatted as two columns. The first column is the variable name and the second column is the corresponding variable label. Not all variables need to be entered into the table. The table can be a csv file or an Excel file.

Read the label file into the l data frame, currently the only permitted name. The labels are displayed on both the text and visualization output. Each displayed label consists of the variable name juxtaposed with the corresponding label, as shown in the display of the label file.

l <- rd("Employee_lbl")
## 
## >>> Suggestions
## Recommended binary format for data files: feather
##   Create with Write(d, "your_file", format="feather")
## More details about your data, Enter:  details()  for d, or  details(name)
## 
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## ------------------------------------------------------------
## 
##     Variable                  Missing  Unique 
##         Name     Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1     label character      8       0       8   Time of Company Employment ... Test score on legal issues after instruction
## ------------------------------------------------------------------------------------------
l
##                                                label
## Years                     Time of Company Employment
## Gender                                  Man or Woman
## Dept                             Department Employed
## Salary                           Annual Salary (USD)
## JobSat            Satisfaction with Work Environment
## Plan             1=GoodHealth, 2=GetWell, 3=BestCare
## Pre    Test score on legal issues before instruction
## Post    Test score on legal issues after instruction

Default Analysis

Brief Output

The brief version provides just the basic analysis, comparable to what Excel provides, plus a scatterplot with the regression line. For multiple regression, the scatterplot becomes a scatterplot matrix. Because d is the default name of the data frame that contains the variables for analysis, the data parameter that names the input data frame need not be specified. Here, specify Salary as the target or response variable, with Years and Pre as the features or predictor variables.

reg_brief(Salary ~ Years + Pre)

## >>> Suggestion
## # Create an R markdown file for interpretative output with  Rmd = "file_name"
## reg(Salary ~ Years + Pre, Rmd="eg")  
## 
## 
##   BACKGROUND 
## 
## Data Frame:  d 
##  
## Response Variable: Salary 
## Predictor Variable 1: Years 
## Predictor Variable 2: Pre 
##  
## Number of cases (rows) of data:  37 
## Number of cases retained for analysis:  36 
## 
## 
##   BASIC ANALYSIS 
## 
##              Estimate    Std Err  t-value  p-value   Lower 95%   Upper 95% 
## (Intercept) 44140.971  13666.115    3.230    0.003   16337.052   71944.891 
##       Years  3251.408    347.529    9.356    0.000    2544.355    3958.462 
##         Pre   -18.265    167.652   -0.109    0.914    -359.355     322.825 
## 
## Standard deviation of Salary: 21,822.372 
##  
## Standard deviation of residuals:  11,753.478 for df=33 
## 95% range of residuals:  47,825.260 = 2 * (2.035 * 11,753.478) 
##  
## R-squared: 0.726    Adjusted R-squared: 0.710    PRESS R-squared: 0.659 
## 
## Null hypothesis of all 0 population slope coefficients:
##   F-statistic: 43.827     df: 2 and 33     p-value:  0.000 
## 
## -- Analysis of Variance 
##  
##             df           Sum Sq          Mean Sq   F-value   p-value 
##     Years    1  12107157290.292  12107157290.292    87.641     0.000 
##       Pre    1      1639658.444      1639658.444     0.012     0.914 
##  
## Model        2  12108796948.736   6054398474.368    43.827     0.000 
## Residuals   33   4558759843.773    138144237.690 
## Salary      35  16667556792.508    476215908.357 
## 
## 
##   K-FOLD CROSS-VALIDATION 
## 
## 
##   RELATIONS AMONG THE VARIABLES 
## 
## 
##   RESIDUALS AND INFLUENCE 
## 
## 
##   PREDICTION ERROR
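
The fit indices in the basic analysis follow directly from the Analysis of Variance table. As a check on how the pieces relate, the following sketch recomputes them in base R from the sums of squares and degrees of freedom displayed above. The variable names are chosen here for illustration, and the formulas are the standard textbook definitions, which the displayed values appear to follow.

```r
# Values copied from the Analysis of Variance table for Salary ~ Years + Pre
ss_model <- 12108796948.736
ss_resid <- 4558759843.773
ss_total <- 16667556792.508
df_model <- 2
df_resid <- 33
df_total <- 35

r2      <- ss_model / ss_total                                  # R-squared
r2_adj  <- 1 - (ss_resid / df_resid) / (ss_total / df_total)    # adjusted R-squared
f_stat  <- (ss_model / df_model) / (ss_resid / df_resid)        # overall F-statistic
s_e     <- sqrt(ss_resid / df_resid)                            # sd of residuals
range95 <- 2 * qt(0.975, df_resid) * s_e                        # 95% range of residuals

round(c(Rsq = r2, Rsq_adj = r2_adj, F = f_stat), 3)
round(c(s_e = s_e, range95 = range95), 3)
```

Up to rounding, the results should agree with the displayed R-squared, adjusted R-squared, F-statistic, standard deviation of residuals, and 95% range of residuals.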

Full Output

The full output is extensive: a summary of the analysis, the estimated model, fit indices, ANOVA, correlation matrix, collinearity analysis, best subset regression, residuals and influence statistics, and prediction intervals. The motivation is to provide virtually all of the information needed for a proper regression analysis.

reg(Salary ~ Years + Pre)

## >>> Suggestion
## # Create an R markdown file for interpretative output with  Rmd = "file_name"
## reg(Salary ~ Years + Pre, Rmd="eg")  
## 
## 
##   BACKGROUND 
## 
## Data Frame:  d 
##  
## Response Variable: Salary 
## Predictor Variable 1: Years 
## Predictor Variable 2: Pre 
##  
## Number of cases (rows) of data:  37 
## Number of cases retained for analysis:  36 
## 
## 
##   BASIC ANALYSIS 
## 
##              Estimate    Std Err  t-value  p-value   Lower 95%   Upper 95% 
## (Intercept) 44140.971  13666.115    3.230    0.003   16337.052   71944.891 
##       Years  3251.408    347.529    9.356    0.000    2544.355    3958.462 
##         Pre   -18.265    167.652   -0.109    0.914    -359.355     322.825 
## 
## Standard deviation of Salary: 21,822.372 
##  
## Standard deviation of residuals:  11,753.478 for df=33 
## 95% range of residuals:  47,825.260 = 2 * (2.035 * 11,753.478) 
##  
## R-squared: 0.726    Adjusted R-squared: 0.710    PRESS R-squared: 0.659 
## 
## Null hypothesis of all 0 population slope coefficients:
##   F-statistic: 43.827     df: 2 and 33     p-value:  0.000 
## 
## -- Analysis of Variance 
##  
##             df           Sum Sq          Mean Sq   F-value   p-value 
##     Years    1  12107157290.292  12107157290.292    87.641     0.000 
##       Pre    1      1639658.444      1639658.444     0.012     0.914 
##  
## Model        2  12108796948.736   6054398474.368    43.827     0.000 
## Residuals   33   4558759843.773    138144237.690 
## Salary      35  16667556792.508    476215908.357 
## 
## 
##   K-FOLD CROSS-VALIDATION 
## 
## 
##   RELATIONS AMONG THE VARIABLES 
## 
##          Salary Years  Pre 
##   Salary   1.00  0.85 0.03 
##    Years   0.85  1.00 0.05 
##      Pre   0.03  0.05 1.00 
## 
##         Tolerance       VIF 
##   Years     0.998     1.002 
##     Pre     0.998     1.002 
## 
##  Years Pre    R2adj    X's 
##      1   0    0.718      1 
##      1   1    0.710      2 
##      0   1   -0.028      1 
##  
## [based on Thomas Lumley's leaps function from the leaps package] 
## 
## 
##   RESIDUALS AND INFLUENCE 
## 
## -- Data, Fitted, Residual, Studentized Residual, Dffits, Cook's Distance 
##    [sorted by Cook's Distance] 
##    [n_res_rows = 20, out of 36 rows of data, or do n_res_rows="all"] 
## ----------------------------------------------------------------------------------------- 
##                        Years     Pre     Salary     fitted      resid rstdnt dffits cooks 
##       Correll, Trevon     21      97 134419.230 110648.843  23770.387  2.424  1.217 0.430 
##         James, Leslie     18      70 122563.380 101387.773  21175.607  1.998  0.714 0.156 
##         Capelle, Adam     24      83 108138.430 120658.778 -12520.348 -1.211 -0.634 0.132 
##           Hoang, Binh     15      96 111074.860  91158.659  19916.201  1.860  0.649 0.131 
##    Korhalkar, Jessica      2      74  72502.500  49292.181  23210.319  2.171  0.638 0.122 
##        Billing, Susan      4      91  72675.260  55484.493  17190.767  1.561  0.472 0.071 
##          Singh, Niral      2      59  61055.440  49566.155  11489.285  1.064  0.452 0.068 
##        Skrotzki, Sara     18      63  91352.330 101515.627 -10163.297 -0.937 -0.397 0.053 
##      Saechao, Suzanne      8      98  55545.250  68362.271 -12817.021 -1.157 -0.390 0.050 
##         Kralik, Laura     10      74  92681.190  75303.447  17377.743  1.535  0.287 0.026 
##   Anastasiou, Crystal      2      59  56508.320  49566.155   6942.165  0.636  0.270 0.025 
##     Langston, Matthew      5      94  49188.960  58681.106  -9492.146 -0.844 -0.268 0.024 
##        Afshari, Anbar      6     100  69441.930  61822.925   7619.005  0.689  0.264 0.024 
##   Cassinelli, Anastis     10      80  57562.360  75193.857 -17631.497 -1.554 -0.265 0.022 
##      Osterman, Pascal      5      69  49704.790  59137.730  -9432.940 -0.826 -0.216 0.016 
##   Bellingar, Samantha     10      67  66337.830  75431.301  -9093.471 -0.793 -0.198 0.013 
##          LaRoe, Maria     10      80  61961.290  75193.857 -13232.567 -1.148 -0.195 0.013 
##      Ritchie, Darnell      7      82  53788.260  65403.102 -11614.842 -1.006 -0.190 0.012 
##        Sheppard, Cory     14      66  95027.550  88455.199   6572.351  0.579  0.176 0.011 
##        Downs, Deborah      7      90  57139.900  65256.982  -8117.082 -0.706 -0.174 0.010 
## 
## 
##   PREDICTION ERROR 
## 
## -- Data, Predicted, Standard Error of Prediction, 95% Prediction Intervals 
##    [sorted by lower bound of prediction interval] 
##    [to see all intervals add n_pred_rows="all"] 
##  ---------------------------------------------- 
## 
##                        Years    Pre     Salary       pred    s_pred    pi.lwr     pi.upr     width 
##          Hamide, Bita      1     83  51036.850  45876.388 12290.483 20871.211  70881.564 50010.352 
##          Singh, Niral      2     59  61055.440  49566.155 12619.291 23892.014  75240.296 51348.281 
##   Anastasiou, Crystal      2     59  56508.320  49566.155 12619.291 23892.014  75240.296 51348.281 
## ... 
##          Link, Thomas     10     83  66312.890  75139.062 11933.518 50860.137  99417.987 48557.849 
##          LaRoe, Maria     10     80  61961.290  75193.857 11918.048 50946.405  99441.308 48494.903 
##   Cassinelli, Anastis     10     80  57562.360  75193.857 11918.048 50946.405  99441.308 48494.903 
## ... 
##       Correll, Trevon     21     97 134419.230 110648.843 12881.876 84440.470 136857.217 52416.747 
##         Capelle, Adam     24     83 108138.430 120658.778 12955.608 94300.394 147017.161 52716.767 
## 
## ---------------------------------- 
## Plot 1: Distribution of Residuals 
## Plot 2: Residuals vs Fitted Values 
## ----------------------------------

Standardize the Variables

Request a briefer output with the reg_brief() version of the function. Standardize the predictor variables in the model by setting the new_scale parameter to "z". Plot the residuals as a line connecting each data point to the corresponding point on the regression line as specified with the plot_errors parameter. To also standardize the response variable, set parameter scale_response to TRUE.

reg_brief(Salary ~ Years, new_scale="z", plot_errors=TRUE)
## 
## Rescaled Data, First Six Rows
##                       Salary  Years
## Hamide, Bita        51036.85 -1.466
## Singh, Niral        61055.44 -1.291
## Korhalkar, Jessica  72502.50 -1.291
## Anastasiou, Crystal 56508.32 -1.291
## Gvakharia, Kimberly 49868.68 -1.116
## Stanley, Emma       46124.97 -1.116

## >>> Suggestion
## # Create an R markdown file for interpretative output with  Rmd = "file_name"
## reg(Salary ~ Years, new_scale="z", plot_errors=TRUE, Rmd="eg")  
## 
## 
##   BACKGROUND 
## 
## Data Frame:  d 
##  
## Response Variable: Salary 
## Predictor Variable: Years 
##  
## Number of cases (rows) of data:  37 
## Number of cases retained for analysis:  36 
##  
## Data are Standardized 
## 
## 
##   BASIC ANALYSIS 
## 
##              Estimate    Std Err  t-value  p-value   Lower 95%   Upper 95% 
## (Intercept) 73219.551   1930.348   37.931    0.000   69296.612   77142.490 
##       Years 18595.810   1957.448    9.500    0.000   14617.797   22573.823 
## 
## Standard deviation of Salary: 21,822.372 
##  
## Standard deviation of residuals:  11,582.088 for df=34 
## 95% range of residuals:  47,075.271 = 2 * (2.032 * 11,582.088) 
##  
## R-squared: 0.726    Adjusted R-squared: 0.718    PRESS R-squared: 0.681 
## 
## Null hypothesis of all 0 population slope coefficients:
##   F-statistic: 90.251     df: 1 and 34     p-value:  0.000 
## 
## -- Analysis of Variance 
##  
##             df           Sum Sq          Mean Sq   F-value   p-value 
## Model        1  12106634568.544  12106634568.544    90.251     0.000 
## Residuals   34   4560922223.964    134144771.293 
## Salary      35  16667556792.508    476215908.357 
## 
## 
##   K-FOLD CROSS-VALIDATION 
## 
## 
##   RELATIONS AMONG THE VARIABLES 
## 
## 
##   RESIDUALS AND INFLUENCE 
## 
## 
##   PREDICTION ERROR
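
In the output above, the slope of 18595.810 is the expected change in Salary for a one standard deviation increase in Years, while R-squared remains 0.726 because standardizing a predictor does not change the fit. The following base R sketch illustrates the relation between the raw and standardized slopes. It uses the built-in mtcars data rather than the Employee data, and it computes the z-scores by hand, so it is an illustration of the idea and not a reproduction of the lessR internals.

```r
# Illustration with the built-in mtcars data, not the Employee data
x <- mtcars$wt
y <- mtcars$mpg
z <- (x - mean(x)) / sd(x)      # z-scores of the predictor

fit_raw <- lm(y ~ x)            # predictor in original units
fit_z   <- lm(y ~ z)            # predictor standardized

coef(fit_raw)
coef(fit_z)

# The standardized slope equals the raw slope times sd(x),
# and the intercept equals the mean of y because z has mean 0
c(slope_check = coef(fit_raw)[["x"]] * sd(x), intercept_check = mean(y))
```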

k-fold Cross-validation

Specify a cross-validation with the kfold parameter. Here, specify three folds. The function automatically creates the training and testing data sets.

reg(Salary ~ Years, kfold=3)
## 
##   3-FOLD CROSS-VALIDATION 
## 
##        Model from Training Data              Applied to Testing Data 
##        ----------------------------------   ---------------------------------- 
## fold   n        se           MSE    Rsq     n        sp           MSE    Rsq 
##   1 | 24 11660.989 135978665.027  0.801 |  12 12838.368 164823701.898 -0.217 
##   2 | 24  8999.503  80991061.305  0.671 |  12 19788.803 391596728.186  0.632 
##   3 | 24 12329.144 152007787.595  0.734 |  12 13144.990 172790769.294  0.571 
##       ----------------------------------    ---------------------------------- 
## Mean     10996.545 122992504.642  0.735       15257.387 243070399.793  0.329
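
To make the fold construction concrete, here is a minimal base R sketch of the same idea: randomly assign each row to one of k folds, fit the model on the remaining folds, and evaluate it on the held-out fold. It uses the built-in mtcars data and an arbitrary seed. The test R-squared shown is one common definition, which can be negative when the model predicts the held-out data worse than the test mean, as in fold 1 above. The exact computations in reg() may differ.

```r
set.seed(123)                        # arbitrary seed, for reproducibility only
dat <- mtcars[, c("mpg", "wt")]      # built-in data, for illustration
k <- 3
n <- nrow(dat)
fold <- sample(rep(seq_len(k), length.out = n))   # random fold label for each row

cv <- t(sapply(seq_len(k), function(i) {
  train <- dat[fold != i, ]          # training data: all folds except fold i
  test  <- dat[fold == i, ]          # testing data: fold i only
  fit   <- lm(mpg ~ wt, data = train)
  err   <- test$mpg - predict(fit, newdata = test)
  mse   <- mean(err^2)
  # one common definition of test R-squared, which can be negative
  rsq   <- 1 - sum(err^2) / sum((test$mpg - mean(test$mpg))^2)
  c(fold = i, n_train = nrow(train), n_test = nrow(test), MSE = mse, Rsq = rsq)
}))

cv
colMeans(cv[, c("MSE", "Rsq")])      # average performance on the testing data
```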

The standard output also includes \(R^2_{press}\), the value of \(R^2\) when applied to new, previously unseen data, a value comparable to the average \(R^2\) on test data.
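
The PRESS statistic is the sum of squared leave-one-out prediction errors, which can be obtained from a single fit using the hat values. The sketch below shows the standard computation in base R with the built-in mtcars data. It assumes the usual definition of PRESS R-squared as one minus PRESS divided by the total sum of squares, which may or may not match every detail of the lessR computation.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)    # built-in data, for illustration
y <- mtcars$mpg

# Leave-one-out prediction errors without refitting the model n times
loo_err <- resid(fit) / (1 - hatvalues(fit))
press <- sum(loo_err^2)
rsq_press <- 1 - press / sum((y - mean(y))^2)

c(Rsq = summary(fit)$r.squared, Rsq_PRESS = rsq_press)
```

Because each error is computed for a row that did not contribute to its own prediction, the PRESS version of R-squared is smaller than the ordinary R-squared.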

Output as a Stored Object

The output of Regression() can be stored into an R object, here named r. The output object consists of various components that together define the output of a comprehensive regression analysis. R refers to the resulting output structure as a list object.

r <- reg(Salary ~ Years + Pre)

Entering the name of the object displays the full output, the same output that appears by default when the result is directed to the R console instead of saved into an R object.

r
## >>> Suggestion
## # Create an R markdown file for interpretative output with  Rmd = "file_name"
## reg(Salary ~ Years + Pre, Rmd="eg")  
## 
## 
##   BACKGROUND 
## 
## Data Frame:  d 
##  
## Response Variable: Salary 
## Predictor Variable 1: Years 
## Predictor Variable 2: Pre 
##  
## Number of cases (rows) of data:  37 
## Number of cases retained for analysis:  36 
## 
## 
##   BASIC ANALYSIS 
## 
##              Estimate    Std Err  t-value  p-value   Lower 95%   Upper 95% 
## (Intercept) 44140.971  13666.115    3.230    0.003   16337.052   71944.891 
##       Years  3251.408    347.529    9.356    0.000    2544.355    3958.462 
##         Pre   -18.265    167.652   -0.109    0.914    -359.355     322.825 
## 
## Standard deviation of Salary: 21,822.372 
##  
## Standard deviation of residuals:  11,753.478 for df=33 
## 95% range of residuals:  47,825.260 = 2 * (2.035 * 11,753.478) 
##  
## R-squared: 0.726    Adjusted R-squared: 0.710    PRESS R-squared: 0.659 
## 
## Null hypothesis of all 0 population slope coefficients:
##   F-statistic: 43.827     df: 2 and 33     p-value:  0.000 
## 
## -- Analysis of Variance 
##  
##             df           Sum Sq          Mean Sq   F-value   p-value 
##     Years    1  12107157290.292  12107157290.292    87.641     0.000 
##       Pre    1      1639658.444      1639658.444     0.012     0.914 
##  
## Model        2  12108796948.736   6054398474.368    43.827     0.000 
## Residuals   33   4558759843.773    138144237.690 
## Salary      35  16667556792.508    476215908.357 
## 
## 
##   K-FOLD CROSS-VALIDATION 
## 
## 
##   RELATIONS AMONG THE VARIABLES 
## 
##          Salary Years  Pre 
##   Salary   1.00  0.85 0.03 
##    Years   0.85  1.00 0.05 
##      Pre   0.03  0.05 1.00 
## 
##         Tolerance       VIF 
##   Years     0.998     1.002 
##     Pre     0.998     1.002 
## 
##  Years Pre    R2adj    X's 
##      1   0    0.718      1 
##      1   1    0.710      2 
##      0   1   -0.028      1 
##  
## [based on Thomas Lumley's leaps function from the leaps package] 
## 
## 
##   RESIDUALS AND INFLUENCE 
## 
## -- Data, Fitted, Residual, Studentized Residual, Dffits, Cook's Distance 
##    [sorted by Cook's Distance] 
##    [n_res_rows = 20, out of 36 rows of data, or do n_res_rows="all"] 
## ----------------------------------------------------------------------------------------- 
##                        Years     Pre     Salary     fitted      resid rstdnt dffits cooks 
##       Correll, Trevon     21      97 134419.230 110648.843  23770.387  2.424  1.217 0.430 
##         James, Leslie     18      70 122563.380 101387.773  21175.607  1.998  0.714 0.156 
##         Capelle, Adam     24      83 108138.430 120658.778 -12520.348 -1.211 -0.634 0.132 
##           Hoang, Binh     15      96 111074.860  91158.659  19916.201  1.860  0.649 0.131 
##    Korhalkar, Jessica      2      74  72502.500  49292.181  23210.319  2.171  0.638 0.122 
##        Billing, Susan      4      91  72675.260  55484.493  17190.767  1.561  0.472 0.071 
##          Singh, Niral      2      59  61055.440  49566.155  11489.285  1.064  0.452 0.068 
##        Skrotzki, Sara     18      63  91352.330 101515.627 -10163.297 -0.937 -0.397 0.053 
##      Saechao, Suzanne      8      98  55545.250  68362.271 -12817.021 -1.157 -0.390 0.050 
##         Kralik, Laura     10      74  92681.190  75303.447  17377.743  1.535  0.287 0.026 
##   Anastasiou, Crystal      2      59  56508.320  49566.155   6942.165  0.636  0.270 0.025 
##     Langston, Matthew      5      94  49188.960  58681.106  -9492.146 -0.844 -0.268 0.024 
##        Afshari, Anbar      6     100  69441.930  61822.925   7619.005  0.689  0.264 0.024 
##   Cassinelli, Anastis     10      80  57562.360  75193.857 -17631.497 -1.554 -0.265 0.022 
##      Osterman, Pascal      5      69  49704.790  59137.730  -9432.940 -0.826 -0.216 0.016 
##   Bellingar, Samantha     10      67  66337.830  75431.301  -9093.471 -0.793 -0.198 0.013 
##          LaRoe, Maria     10      80  61961.290  75193.857 -13232.567 -1.148 -0.195 0.013 
##      Ritchie, Darnell      7      82  53788.260  65403.102 -11614.842 -1.006 -0.190 0.012 
##        Sheppard, Cory     14      66  95027.550  88455.199   6572.351  0.579  0.176 0.011 
##        Downs, Deborah      7      90  57139.900  65256.982  -8117.082 -0.706 -0.174 0.010 
## 
## 
##   PREDICTION ERROR 
## 
## -- Data, Predicted, Standard Error of Prediction, 95% Prediction Intervals 
##    [sorted by lower bound of prediction interval] 
##    [to see all intervals add n_pred_rows="all"] 
##  ---------------------------------------------- 
## 
##                        Years    Pre     Salary       pred    s_pred    pi.lwr     pi.upr     width 
##          Hamide, Bita      1     83  51036.850  45876.388 12290.483 20871.211  70881.564 50010.352 
##          Singh, Niral      2     59  61055.440  49566.155 12619.291 23892.014  75240.296 51348.281 
##   Anastasiou, Crystal      2     59  56508.320  49566.155 12619.291 23892.014  75240.296 51348.281 
## ... 
##          Link, Thomas     10     83  66312.890  75139.062 11933.518 50860.137  99417.987 48557.849 
##          LaRoe, Maria     10     80  61961.290  75193.857 11918.048 50946.405  99441.308 48494.903 
##   Cassinelli, Anastis     10     80  57562.360  75193.857 11918.048 50946.405  99441.308 48494.903 
## ... 
##       Correll, Trevon     21     97 134419.230 110648.843 12881.876 84440.470 136857.217 52416.747 
##         Capelle, Adam     24     83 108138.430 120658.778 12955.608 94300.394 147017.161 52716.767 
## 
## ---------------------------------- 
## Plot 1: Distribution of Residuals 
## Plot 2: Residuals vs Fitted Values 
## ----------------------------------

Or, work with the components individually. Use the base R names() function to identify all of the output components. Component names that begin with out_ are part of the standard text output. The other components contain only data and statistics, intended as input to additional procedures, including R markdown documents.

names(r)
##  [1] "out_suggest"     "call"            "formula"         "vars"            "out_title_bck"   "out_background" 
##  [7] "out_title_basic" "out_estimates"   "out_fit"         "out_anova"       "out_title_mod"   "out_mod"        
## [13] "out_mdls"        "out_title_kfold" "out_kfold"       "out_title_rel"   "out_cor"         "out_collinear"  
## [19] "out_subsets"     "out_title_res"   "out_residuals"   "out_title_pred"  "out_predict"     "out_ref"        
## [25] "out_Rmd"         "out_Word"        "out_pdf"         "out_odt"         "out_rtf"         "out_plots"      
## [31] "n.vars"          "n.obs"           "n.keep"          "coefficients"    "sterrs"          "tvalues"        
## [37] "pvalues"         "cilb"            "ciub"            "anova_model"     "anova_residual"  "anova_total"    
## [43] "se"              "resid_range"     "Rsq"             "Rsqadj"          "PRESS"           "RsqPRESS"       
## [49] "m_se"            "m_MSE"           "m_Rsq"           "cor"             "tolerances"      "vif"            
## [55] "resid.max"       "pred_min_max"    "residuals"       "fitted"          "cooks.distance"  "model"          
## [61] "terms"

Here, only display the estimates and their inferential analysis as part of the standard text output.

r$out_estimates
##              Estimate    Std Err  t-value  p-value   Lower 95%   Upper 95%
## (Intercept) 44140.971  13666.115    3.230    0.003   16337.052   71944.891
##       Years  3251.408    347.529    9.356    0.000    2544.355    3958.462
##         Pre   -18.265    167.652   -0.109    0.914    -359.355     322.825

Here, display the numeric values of the coefficients.

r$coefficients
## (Intercept)       Years         Pre 
## 44140.97140  3251.40825   -18.26496
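
As an example of using the numeric coefficients in a further computation, the following sketch reproduces by hand the fitted value and residual for the first row of the residuals table, Correll, Trevon, with Years of 21, Pre of 97 and Salary of 134419.230. The coefficients are typed in from the display above. With the stored object, r$coefficients could be used in their place. The names b and x_new are chosen here for illustration.

```r
# Coefficients as displayed by r$coefficients
b <- c("(Intercept)" = 44140.97140, Years = 3251.40825, Pre = -18.26496)

# Predictor values for Correll, Trevon, with a leading 1 for the intercept
x_new <- c(1, Years = 21, Pre = 97)

fitted_value <- sum(b * x_new)             # b0 + b1*Years + b2*Pre
residual <- 134419.230 - fitted_value      # observed Salary minus fitted value

c(fitted = fitted_value, resid = residual)
```

The results agree with the fitted and resid columns of the residuals table to within the rounding of the displayed coefficients.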

An analysis of hundreds or thousands of rows of data can make it difficult to locate a specific prediction interval of interest. To initiate a search for a specific row, first do the regression and request all prediction intervals with parameter pred_rows. Then convert that output to a data frame named dp with base R read.table(). As a data frame, do a standard search for an individual row for a specific prediction interval (see the Subset a Data Frame vignette for directions to subset).

This particular conversion to a data frame requires one more step. One or more spaces in the out_predict output delimit adjacent columns, but the names in this data set are formatted with a comma followed by a space. Use base R sub() to remove the space after the comma before converting to a data frame.

r <- reg(Salary ~ Years, pred_rows="all", graphics=FALSE)
r$out_predict = sub(", ", ",", r$out_predict, fixed=TRUE)
dp <- read.table(text=r$out_predict)
dp[.(row.names(dp) == "Pham,Scott"),]
##            Years   Salary     pred   s_pred   pi.lwr   pi.upr   width
## Pham,Scott    13 81871.05 84955.08 11805.96 60962.48 108947.7 47985.2
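
To see why the sub() step is needed, the following sketch applies the same conversion to three lines copied from the prediction interval display of the Salary ~ Years + Pre analysis shown earlier. It assumes the text is held as a character vector with one line per element. Because the header line has one fewer field than the data lines, read.table() treats the first line as the header and the first field of each data line as the row name.

```r
# Three lines in the layout of the prediction interval display
out <- c(
  "                       Years    Pre     Salary       pred    s_pred    pi.lwr     pi.upr     width",
  "         Hamide, Bita      1     83  51036.850  45876.388 12290.483 20871.211  70881.564 50010.352",
  "         Singh, Niral      2     59  61055.440  49566.155 12619.291 23892.014  75240.296 51348.281"
)

# Without this step, "Hamide," and "Bita" would be read as two separate fields
out <- sub(", ", ",", out, fixed = TRUE)

dp <- read.table(text = out)
dp[row.names(dp) == "Singh,Niral", c("pred", "pi.lwr", "pi.upr")]
```

The subsetting here uses plain base R logical indexing, an alternative to the lessR .() notation in the example above.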

Contrasts

Because reg() accomplishes its computations with base R function lm(), lm() parameters can be passed to reg(), which then passes the values to lm() to define the corresponding indicator variables. Here, first use base R function contr.sum() to calculate an effect coding contrast matrix for a categorical variable with three levels, such as the variable Plan in the Employee data set.

cnt <- contr.sum(n=3)
cnt
##   [,1] [,2]
## 1    1    0
## 2    0    1
## 3   -1   -1
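
To illustrate what this contrast matrix implies for the estimated coefficients, the following base R sketch fits lm() directly with effect coding on a small set of made-up values. The data, and the names y and g, are purely illustrative. With effect coding in a one-factor model, the intercept is the unweighted mean of the group means, and each remaining coefficient is the deviation of a group mean from that overall mean.

```r
# Made-up scores for three ordered groups, for illustration only
y <- c(10, 12, 20, 22, 24, 33, 35)
g <- factor(c("low", "low", "med", "med", "med", "high", "high"),
            levels = c("low", "med", "high"))

fit <- lm(y ~ g, contrasts = list(g = contr.sum(n = 3)))
group_means <- tapply(y, g, mean)

coef(fit)

# Intercept: mean of the three group means
# Other coefficients: deviations of the first two group means from that mean
c(mean(group_means), group_means[1:2] - mean(group_means))
```

The deviation for the last level is not estimated directly. It is the negative of the sum of the other deviations, which corresponds to the row of -1 values in the contrast matrix.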

Now use the lm() parameter contrasts to define the effect coding for JobSat, passed to reg_brief(). Contrasts only apply to factors, so convert JobSat to an R factor before the regression analysis, a task that should generally be done for all categorical variables in an analysis. The levels parameter of factor() also designates the order of the levels on output displays such as a bar graph.

d$JobSat <- factor(d$JobSat, levels=c("low", "med", "high"))
reg_brief(Salary ~ JobSat, contrasts=list(JobSat=cnt))
## 
## >>>  JobSat is not numeric. Converted to indicator variables.

## >>> Suggestion
## # Create an R markdown file for interpretative output with  Rmd = "file_name"
## reg(Salary ~ JobSat, contrasts=list(JobSat=cnt), Rmd="eg")  
## 
## 
##   BACKGROUND 
## 
## Data Frame:  d 
##  
## Response Variable: Salary 
## Predictor Variable 1: JobSatmed 
## Predictor Variable 2: JobSathigh 
##  
## Number of cases (rows) of data:  37 
## Number of cases retained for analysis:  35 
## 
## 
##   BASIC ANALYSIS 
## 
##               Estimate    Std Err  t-value  p-value    Lower 95%   Upper 95% 
## (Intercept)  85121.843   5521.853   15.415    0.000    73874.197   96369.490 
##   JobSatmed  -8435.630   8156.317   -1.034    0.309   -25049.505    8178.245 
##  JobSathigh -25664.732   8156.317   -3.147    0.004   -42278.607   -9050.857 
## 
## Standard deviation of Salary: 22,157.414 
##  
## Standard deviation of residuals:  19,909.324 for df=32 
## 95% range of residuals:  81,107.933 = 2 * (2.037 * 19,909.324) 
##  
## R-squared: 0.240    Adjusted R-squared: 0.193    PRESS R-squared: 0.098 
## 
## Null hypothesis of all 0 population slope coefficients:
##   F-statistic: 5.056     df: 2 and 32     p-value:  0.012 
## 
## -- Analysis of Variance 
##  
##              df           Sum Sq         Mean Sq   F-value   p-value 
##  JobSatmed    1     83510016.635    83510016.635     0.211     0.649 
## JobSathigh    1   3924625926.927  3924625926.927     9.901     0.004 
##  
## Model         2   4008135943.563  2004067971.781     5.056     0.012 
## Residuals    32  12684198101.098   396381190.659 
## Salary       34  16692334044.661   490951001.314 
## 
## 
##   K-FOLD CROSS-VALIDATION 
## 
## 
##   RELATIONS AMONG THE VARIABLES 
## 
## 
##   RESIDUALS AND INFLUENCE 
## 
## 
##   PREDICTION ERROR

Null Model

The \(R^2\) fit statistic compares the sum of the squared errors of the model with the X predictor variables to the sum of squared errors of the null model. The null model, the baseline of comparison, has no X variables, so the fitted value for every row of data is the mean of the response variable \(y\). The corresponding intercept is the mean of \(y\), there is no slope, and the standard deviation of the residuals is the standard deviation of \(y\).
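
These properties of the null model can be verified in base R. The sketch below uses the built-in mtcars data, chosen only for illustration, to fit an intercept-only model with lm(), compare its intercept and residual standard deviation to the mean and standard deviation of the response, and then compute \(R^2\) for a one-predictor model as one minus the ratio of the two sums of squared errors.

```r
null_fit <- lm(mpg ~ 1, data = mtcars)     # null model: intercept only
fit      <- lm(mpg ~ wt, data = mtcars)    # model with one predictor

# Intercept of the null model is the mean of the response
c(intercept = coef(null_fit)[[1]], mean_y = mean(mtcars$mpg))

# Residual standard deviation of the null model is the sd of the response
c(sd_resid = sigma(null_fit), sd_y = sd(mtcars$mpg))

# R-squared compares the squared errors of the model to those of the null model
sse_null <- sum(resid(null_fit)^2)
sse_fit  <- sum(resid(fit)^2)
c(by_hand = 1 - sse_fit / sse_null, from_lm = summary(fit)$r.squared)
```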

The following submits the null model for Salary, and plots the errors. Compare the variability of the residuals to a regression model of Salary with one or more predictor variables. To the extent that the inclusion of one or more predictor variables in the model reduces the variability of the data about the regression line compared to the null model, the model fits the data.

reg_brief(Salary ~ 1, plot_errors=TRUE)

## >>> Suggestion
## # Create an R markdown file for interpretative output with  Rmd = "file_name"
## reg(Salary ~ 1, plot_errors=TRUE, Rmd="eg")  
## 
## 
##   BACKGROUND 
## 
## Data Frame:  d 
##  
## Response Variable: Salary 
##  
## Number of cases (rows) of data:  37 
## Number of cases retained for analysis:  37 
## 
## 
##   BASIC ANALYSIS 
## 
##              Estimate    Std Err  t-value  p-value   Lower 95%   Upper 95% 
## (Intercept) 73795.557   3583.821   20.591    0.000   66527.230   81063.883 
## 
## Standard deviation of Salary: 21,799.533 
##  
## Standard deviation of residuals:  21,799.533 for df=36 
## 95% range of residuals:  88,423.006 = 2 * (2.028 * 21,799.533) 
## 
## -- Analysis of Variance 
##  
##             df           Sum Sq        Mean Sq   F-value   p-value 
## Residuals   36  17107907732.489  475219659.236 
## 
## 
##   K-FOLD CROSS-VALIDATION 
## 
## 
##   RELATIONS AMONG THE VARIABLES 
## 
## 
##   RESIDUALS AND INFLUENCE 
## 
## 
##   PREDICTION ERROR

The null model plot is also available from the lessR function Plot() with the fit parameter set to "null".

Likert Type Data

The scatterplot is displayed as a bubble plot when both variables consist of fewer than 10 unique integer values. The bubble plot avoids overprinting the same point: each bubble reflects the number of data values located at that point.

dd <- Read("Mach4")
## 
## >>> Suggestions
## Recommended binary format for data files: feather
##   Create with Write(d, "your_file", format="feather")
## More details about your data, Enter:  details()  for d, or  details(name)
## 
## Data Types
## ------------------------------------------------------------
## integer: Numeric data values, integers only
## ------------------------------------------------------------
## 
##     Variable                  Missing  Unique 
##         Name     Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1    Gender   integer    351       0       2   0  0  1 ... 0  0  1
##  2       m01   integer    351       0       6   0  0  2 ... 2  1  3
##  3       m02   integer    351       0       6   4  1  1 ... 3  4  3
##  4       m03   integer    351       0       6   1  4  0 ... 3  4  3
##  5       m04   integer    351       0       6   5  4  5 ... 3  4  4
##  6       m05   integer    351       0       6   0  0  4 ... 2  3  3
##  7       m06   integer    351       0       6   5  3  4 ... 4  4  2
##  8       m07   integer    351       0       6   4  3  0 ... 4  4  2
##  9       m08   integer    351       0       6   1  0  5 ... 3  2  3
## 10       m09   integer    351       0       6   5  4  3 ... 3  3  3
## 11       m10   integer    351       0       6   4  4  4 ... 3  4  3
## 12       m11   integer    351       0       6   0  0  1 ... 1  1  2
## 13       m12   integer    351       0       6   0  1  4 ... 2  1  3
## 14       m13   integer    351       0       6   0  1  0 ... 3  1  2
## 15       m14   integer    351       0       6   0  1  0 ... 2  2  2
## 16       m15   integer    351       0       6   4  2  2 ... 3  5  3
## 17       m16   integer    351       0       6   0  4  0 ... 0  2  5
## 18       m17   integer    351       0       6   1  4  2 ... 0  2  2
## 19       m18   integer    351       0       6   3  3  4 ... 4  4  3
## 20       m19   integer    351       0       6   2  1  0 ... 0  0  1
## 21       m20   integer    351       0       6   4  0  1 ... 1  0  3
## ------------------------------------------------------------------------------------------
reg_brief(m10 ~ m02, data=dd)

## >>> Suggestion
## # Create an R markdown file for interpretative output with  Rmd = "file_name"
## reg(m10 ~ m02, data=dd, Rmd="eg")  
## 
## 
##   BACKGROUND 
## 
## Data Frame:  dd 
##  
## Response Variable: m10 
## Predictor Variable: m02 
##  
## Number of cases (rows) of data:  351 
## Number of cases retained for analysis:  351 
## 
## 
##   BASIC ANALYSIS 
## 
##              Estimate    Std Err  t-value  p-value   Lower 95%   Upper 95% 
## (Intercept)     4.285      0.092   46.642    0.000       4.104       4.466 
##         m02    -0.168      0.040   -4.184    0.000      -0.247      -0.089 
## 
## Standard deviation of m10: 1.138 
##  
## Standard deviation of residuals:  1.112 for df=349 
## 95% range of residuals:  4.373 = 2 * (1.967 * 1.112) 
##  
## R-squared: 0.048    Adjusted R-squared: 0.045    PRESS R-squared: 0.037 
## 
## Null hypothesis of all 0 population slope coefficients:
##   F-statistic: 17.502     df: 1 and 349     p-value:  0.000 
## 
## -- Analysis of Variance 
##  
##             df    Sum Sq   Mean Sq   F-value   p-value 
## Model        1    21.632    21.632    17.502     0.000 
## Residuals  349   431.343     1.236 
## m10        350   452.974     1.294 
## 
## 
##   K-FOLD CROSS-VALIDATION 
## 
## 
##   RELATIONS AMONG THE VARIABLES 
## 
## 
##   RESIDUALS AND INFLUENCE 
## 
## 
##   PREDICTION ERROR

Analysis of Covariance

Obtain an ANCOVA by entering categorical and continuous variables as predictor variables. For a single categorical variable and a single continuous variable, Regression() displays the regression line for each level of the categorical variable.

The ANCOVA assumes that the slopes for the different levels of the categorical variable are the same for the pairing of the continuous predictor variable and continuous response variable. Visually evaluate this assumption by plotting each separate slope and scatterplot.

Plot(Salary, Years, by=Dept, fit="lm")

## 
## Dept: ACCT   Line: b0 = -0.8    b1 = 0.0    Fit: MSE = 12   Rsq = 0.161
##  
## Dept: ADMN   Line: b0 = -12.3    b1 = 0.0    Fit: MSE = 24   Rsq = 0.752
##  
## Dept: FINC   Line: b0 = 0.2    b1 = 0.0    Fit: MSE = 2   Rsq = 0.824
##  
## Dept: MKTG   Line: b0 = -14.0    b1 = 0.0    Fit: MSE = 5   Rsq = 0.916
##  
## Dept: SALE   Line: b0 = -4.6    b1 = 0.0    Fit: MSE = 5   Rsq = 0.813
## 
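
The visual check of equal slopes can be complemented with a formal comparison of a parallel-lines model against a separate-slopes model. The following is a minimal base R sketch of that idea, not the lessR implementation. The built-in iris data stands in for the Employee data only so that the example runs without lessR, and the names fit_parallel, fit_separate, and slopes_test are illustrative.

```r
# Base R sketch: compare a parallel-lines model with a separate-slopes model.
# Assumption: the built-in iris data is a stand-in for the Employee data.
fit_parallel <- lm(Sepal.Length ~ Species + Petal.Length, data = iris)
fit_separate <- lm(Sepal.Length ~ Species * Petal.Length, data = iris)

# F-test of the added interaction terms: a small p-value suggests that the
# slopes differ across the levels of the categorical variable
slopes_test <- anova(fit_parallel, fit_separate)
print(slopes_test)
```

A non-significant interaction is consistent with the equal-slopes assumption, though it does not prove it, particularly in small samples.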

Then, if the slopes are not too dissimilar, run the ANCOVA. The categorical variable must be interpretable as categorical: either an R factor or a non-numeric character variable. If the categorical variable is coded numerically, convert it to a factor, such as with d$CatVar <- factor(d$CatVar), which retains the original numerical values as the value labels.
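
As a small illustration of the conversion, the sketch below uses a toy data frame with made-up values. The variable name Plan echoes the Employee data listing, but the data frame toy, the column PlanLabeled, and the descriptive labels are hypothetical.

```r
# Toy data frame with a numerically coded categorical variable.
# Assumption: the values are made up for illustration only.
toy <- data.frame(Plan = c(1, 2, 3, 2, 1))

# Convert in place: the numeric codes become the value labels of the factor
toy$Plan <- factor(toy$Plan)
levels(toy$Plan)

# Optionally attach descriptive labels (hypothetical names)
toy$PlanLabeled <- factor(c(1, 2, 3, 2, 1), levels = 1:3,
                          labels = c("basic", "standard", "premium"))
table(toy$PlanLabeled)
```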

The ANCOVA displays the appropriate Type II Sum of Squares in its ANOVA table for properly evaluating the group effect that corresponds to the entered categorical variable. Note that this SS is displayed only for an ANCOVA with a single categorical variable and a single covariate.

reg_brief(Salary ~ Dept + Years)
## 
## >>>  Dept is not numeric. Converted to indicator variables.

## >>> Suggestion
## # Create an R markdown file for interpretative output with  Rmd = "file_name"
## reg(Salary ~ Dept + Years, Rmd="eg")  
## 
## 
##   BACKGROUND 
## 
## Data Frame:  d 
##  
## Response Variable: Salary 
## Predictor Variable 1: DeptADMN 
## Predictor Variable 2: DeptFINC 
## Predictor Variable 3: DeptMKTG 
## Predictor Variable 4: DeptSALE 
## Predictor Variable 5: Years 
##  
## Number of cases (rows) of data:  37 
## Number of cases retained for analysis:  35 
## 
## 
##   BASIC ANALYSIS 
## 
##              Estimate    Std Err  t-value  p-value    Lower 95%   Upper 95% 
## (Intercept) 43680.151   5634.233    7.753    0.000    32156.850   55203.452 
##    DeptADMN  4713.926   7302.390    0.646    0.524   -10221.137   19648.990 
##    DeptFINC -7822.049   8057.041   -0.971    0.340   -24300.547    8656.450 
##    DeptMKTG -5227.930   7275.625   -0.719    0.478   -20108.253    9652.394 
##    DeptSALE   762.933   6351.479    0.120    0.905   -12227.301   13753.167 
##       Years  3234.397    364.712    8.868    0.000     2488.478    3980.317 
## 
## Standard deviation of Salary: 21,881.046 
##  
## Standard deviation of residuals:  11,741.645 for df=29 
## 95% range of residuals:  48,028.719 = 2 * (2.045 * 11,741.645) 
##  
## R-squared: 0.754    Adjusted R-squared: 0.712    PRESS R-squared: NA 
## 
## Null hypothesis of all 0 population slope coefficients:
##   F-statistic: 17.815     df: 5 and 29     p-value:  0.000 
## 
## -- Analysis of Variance from Type II Sums of Squares 
##  
##             df           Sum Sq          Mean Sq   F-value   p-value 
##  DeptADMN    4    534128871.049    133532217.762     0.969     0.440 
##  DeptFINC    1     47920138.873     47920138.873     0.348     0.560 
##  DeptMKTG    1     48610362.188     48610362.188     0.353     0.557 
##  DeptSALE    1    933561595.768    933561595.768     6.772     0.014 
##     Years    1  10842890397.792  10842890397.792    78.648     0.000 
## Residuals   29   3998120326.431    137866218.153 
## 
## -- Test of Interaction 
##   
## Dept:Years  df: 4  df resid: 25  SS: 784731402.138  F: 1.526  p-value: 0.225 
##   
## -- Assume parallel lines, no interaction of Dept with Years 
##  
## Level ACCT: y^_Salary = 43680.151 + 3234.397(x_Years) 
## Level ADMN: y^_Salary = 48394.077 + 3234.397(x_Years) 
## Level FINC: y^_Salary = 35858.103 + 3234.397(x_Years) 
## Level MKTG: y^_Salary = 38452.221 + 3234.397(x_Years) 
## Level SALE: y^_Salary = 44443.084 + 3234.397(x_Years) 
##   
## -- Visualize Separately Computed Regression Lines 
## 
## Plot(Years, Salary, by=Dept, fit="lm") 
## 
## 
##   K-FOLD CROSS-VALIDATION 
## 
## 
##   RELATIONS AMONG THE VARIABLES 
## 
## 
##   RESIDUALS AND INFLUENCE 
## 
## 
##   PREDICTION ERROR
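
The parallel lines in the output follow directly from the estimated coefficients: each level's intercept is the reference-level intercept plus the coefficient of that level's indicator variable. The sketch below redoes that arithmetic with the values copied from the BASIC ANALYSIS table. The object names are illustrative.

```r
# Coefficients copied from the BASIC ANALYSIS table of the ANCOVA output
b0   <- 43680.151                      # intercept, reference level ACCT
dept <- c(ADMN = 4713.926, FINC = -7822.049,
          MKTG = -5227.930, SALE = 762.933)

# Intercept per level: reference intercept plus the indicator coefficient
intercepts <- c(ACCT = b0, b0 + dept)
round(intercepts, 3)
```

The results agree with the listed level equations up to rounding of the displayed coefficients. The common slope, 3234.397, is the coefficient of Years.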

Moderation Analysis

For a moderation analysis of a model with exactly two predictor variables, identify the moderator variable with the parameter mod.

reg_brief(Salary ~ Years + Pre, mod=Pre)

## >>> Suggestion
## # Create an R markdown file for interpretative output with  Rmd = "file_name"
## reg(Salary ~ Years + Pre, mod=Pre, Rmd="eg")  
## 
## 
##   BACKGROUND 
## 
## Data Frame:  d 
##  
## Response Variable: Salary 
## Predictor Variable 1: Years 
## Predictor Variable 2: Pre 
## Predictor Variable 3: Pre.Years 
##  
## Number of cases (rows) of data:  37 
## Number of cases retained for analysis:  36 
## 
## 
##   BASIC ANALYSIS 
## 
##              Estimate    Std Err  t-value  p-value   Lower 95%   Upper 95% 
## (Intercept) 73048.486   1891.523   38.619    0.000   69195.579   76901.392 
##       Years  3243.027    335.204    9.675    0.000    2560.238    3925.815 
##         Pre     4.246    162.142    0.026    0.979    -326.026     334.517 
##   Pre.Years    53.181     28.517    1.865    0.071      -4.907     111.268 
## 
## Standard deviation of Salary: 21,822.372 
##  
## Standard deviation of residuals:  11,335.624 for df=32 
## 95% range of residuals:  46,179.821 = 2 * (2.037 * 11,335.624) 
##  
## R-squared: 0.753    Adjusted R-squared: 0.730    PRESS R-squared: 0.665 
## 
## Null hypothesis of all 0 population slope coefficients:
##   F-statistic: 32.571     df: 3 and 32     p-value:  0.000 
## 
## -- Analysis of Variance 
##  
##             df           Sum Sq          Mean Sq   F-value   p-value 
##     Years    1  12107157290.292  12107157290.292    94.222     0.000 
##       Pre    1      1639658.444      1639658.444     0.013     0.911 
## Pre.Years    1    446876022.168    446876022.168     3.478     0.071 
##  
## Model        3  12555672970.904   4185224323.635    32.571     0.000 
## Residuals   32   4111883821.605    128496369.425 
## Salary      35  16667556792.508    476215908.357 
## 
## 
##   MODERATION ANALYSIS 
## 
## Mean of Pre: 0.000 
## SD of   Pre: 11.864 
##  
## mean+1SD for Pre:  b0=73098.86  b1=3873.983 
## mean     for Pre:  b0=73048.486  b1=3243.027 
## mean-1SD for Pre:  b0=72998.112  b1=2612.071 
## 
## 
##   K-FOLD CROSS-VALIDATION 
## 
## 
##   RELATIONS AMONG THE VARIABLES 
## 
## 
##   RESIDUALS AND INFLUENCE 
## 
## 
##   PREDICTION ERROR

In this analysis, Pre is not detected as a moderator of the impact of Years on Salary. The non-parallel lines in the visualization show a tendency toward moderation, and the interaction is almost significant (p-value = 0.071), but it is not detected at the \(\alpha=0.05\) level.
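
The intercepts and slopes listed under MODERATION ANALYSIS can be recovered from the coefficient table. The output reports a mean of 0 for Pre, which suggests that the moderator is mean-centered, so the sketch below evaluates the model at Pre equal to -1 SD, 0, and +1 SD. The values are copied from the output. The object names are illustrative.

```r
# Estimates copied from the BASIC ANALYSIS and MODERATION ANALYSIS output
b0      <- 73048.486   # intercept
b_years <- 3243.027    # slope of Years at the mean of Pre
b_pre   <- 4.246       # slope of Pre
b_int   <- 53.181      # coefficient of the Pre.Years interaction
sd_pre  <- 11.864      # SD of Pre (Pre is reported with mean 0)

# Intercept and slope of Salary on Years at three values of the moderator
pre_values <- c(minus1SD = -sd_pre, mean = 0, plus1SD = sd_pre)
simple <- data.frame(
  b0 = b0 + b_pre * pre_values,
  b1 = b_years + b_int * pre_values
)
round(simple, 3)
```

The computed values match the reported ones up to rounding of the displayed coefficients. The slope of Years rises by about 631 for each standard deviation increase in Pre, which is the tendency visible in the non-parallel lines.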

Logistic Regression

For a model with a binary response variable, \(y\), specify multiple logistic regression with the usual R formula syntax applied to the lessR function Logit(). The output includes the confusion matrix and various classification fit indices.

Default Analysis

d <- Read("BodyMeas")
## 
## >>> Suggestions
## Recommended binary format for data files: feather
##   Create with Write(d, "your_file", format="feather")
## More details about your data, Enter:  details()  for d, or  details(name)
## 
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
## 
##     Variable                  Missing  Unique 
##         Name     Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1    Gender character    340       0       2   W  M  W ... M  W  M
##  2    Weight   integer    340       0     100   200  200  155 ... 185  114  180
##  3    Height   integer    340       0      20   71  71  66 ... 72  63  69
##  4     Waist   integer    340       0      35   43  40  31 ... 37  32  35
##  5      Hips   integer    340       0      28   46  42  43 ... 38  39  44
##  6     Chest   integer    340       0      27   45  42  37 ... 44  36  40
##  7      Hand    double    340       0      18   8.5  9.75  8 ... 9  7.5  8.75
##  8      Shoe    double    340       0      20   7.5  11  8 ... 10.5  7.5  8
## ------------------------------------------------------------------------------------------
Logit(Gender ~ Hand)
## 
## Response Variable:   Gender
## Predictor Variable 1:  Hand
## 
## Number of cases (rows) of data:  340 
## Number of cases retained for analysis:  340 
## 
## 
##    BASIC ANALYSIS 
## 
## -- Estimated Model of Gender for the Logit of Reference Group Membership
## 
##              Estimate    Std Err  z-value  p-value   Lower 95%   Upper 95%
## (Intercept)  -26.9237     2.7515   -9.785    0.000    -32.3166    -21.5308 
##        Hand    3.2023     0.3269    9.794    0.000      2.5615      3.8431 
## 
## 
## -- Odds Ratios and Confidence Intervals
## 
##              Odds Ratio   Lower 95%   Upper 95%
## (Intercept)      0.0000      0.0000      0.0000 
##        Hand     24.5883     12.9547     46.6690 
## 
## 
## -- Model Fit
## 
##     Null deviance: 471.340 on 339 degrees of freedom
## Residual deviance: 220.664 on 338 degrees of freedom
## 
## AIC: 224.6641 
## 
## Number of iterations to convergence: 6 
## 
## 
##    ANALYSIS OF RESIDUALS AND INFLUENCE 
## Data, Fitted, Residual, Studentized Residual, Dffits, Cook's Distance
##    [sorted by Cook's Distance]
##    [res_rows = 20 out of 340 cases (rows) of data]
## --------------------------------------------------------------------
##     Hand Gender P(Y=1) residual rstudent  dffits   cooks
## 125  7.0      M 0.0109   0.9891    3.045  0.1930 0.11740
## 253  7.0      M 0.0109   0.9891    3.045  0.1930 0.11740
## 162  9.5      W 0.9706  -0.9706   -2.684 -0.2256 0.07555
## 313  9.5      W 0.9706  -0.9706   -2.684 -0.2256 0.07555
## 20   9.0      W 0.8695  -0.8695   -2.031 -0.2229 0.02611
## 33   9.0      W 0.8695  -0.8695   -2.031 -0.2229 0.02611
## 59   9.0      W 0.8695  -0.8695   -2.031 -0.2229 0.02611
## 67   9.0      W 0.8695  -0.8695   -2.031 -0.2229 0.02611
## 69   9.0      W 0.8695  -0.8695   -2.031 -0.2229 0.02611
## 87   9.0      W 0.8695  -0.8695   -2.031 -0.2229 0.02611
## 90   9.0      W 0.8695  -0.8695   -2.031 -0.2229 0.02611
## 150  9.0      W 0.8695  -0.8695   -2.031 -0.2229 0.02611
## 248  9.0      W 0.8695  -0.8695   -2.031 -0.2229 0.02611
## 276  9.0      W 0.8695  -0.8695   -2.031 -0.2229 0.02611
## 284  9.0      W 0.8695  -0.8695   -2.031 -0.2229 0.02611
## 308  9.0      W 0.8695  -0.8695   -2.031 -0.2229 0.02611
## 8    8.0      M 0.2132   0.7868    1.766  0.1948 0.01462
## 109  8.0      M 0.2132   0.7868    1.766  0.1948 0.01462
## 132  8.0      M 0.2132   0.7868    1.766  0.1948 0.01462
## 142  8.0      M 0.2132   0.7868    1.766  0.1948 0.01462
## 
## 
##    PREDICTION 
## 
## Probability threshold for classification M: 0.5
## 
##  0: W
##  1: M
## 
## Data, Fitted Values, Standard Errors
##    [sorted by fitted value]
##    [pred_all=TRUE to see all intervals displayed]
## --------------------------------------------------------------------
##     Hand Gender label    fitted   std.err
## 153 6.00      W     0 0.0004481 0.0003597
## 32  6.75      W     0 0.0049256 0.0027736
## 155 6.75      W     0 0.0049256 0.0027736
## 7   7.00      W     0 0.0109024 0.0052692
## 
## ... for the rows of data where fitted is close to 0.5 ...
## 
##     Hand Gender label fitted std.err
## 293 8.25      M     0 0.3764 0.04188
## 311 8.25      M     0 0.3764 0.04188
## 1   8.50      W     1 0.5734 0.04274
## 9   8.50      M     1 0.5734 0.04274
## 16  8.50      M     1 0.5734 0.04274
## 
## ... for the last 4 rows of sorted data ...
## 
##     Hand Gender label fitted   std.err
## 151   11      M     1 0.9998 0.0002152
## 196   11      M     1 0.9998 0.0002152
## 257   11      M     1 0.9998 0.0002152
## 299   11      M     1 0.9998 0.0002152
## --------------------------------------------------------------------
## 
## 
## ----------------------------
## Specified confusion matrices
## ----------------------------
## 
## Probability threshold for predicting M: 0.5
## Corresponding cutoff threshold for Hand: 8.408
## 
##                Baseline         Predicted 
## ---------------------------------------------------
##               Total  %Tot        0      1  %Correct 
## ---------------------------------------------------
##          1      170  50.0       17    153     90.0 
## Gender   0      170  50.0      147     23     86.5 
## ---------------------------------------------------
##        Total    340                           88.2 
## 
## Accuracy: 88.24 
## Sensitivity: 90.00 
## Precision: 86.93
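
The classification fit indices follow from the four counts in the confusion matrix. The sketch below recomputes them from the counts copied from the output for the 0.5 threshold, with M as the positive class (1). The object names are illustrative.

```r
# Counts copied from the confusion matrix at threshold 0.5
# (rows are the actual groups, columns the predicted groups)
tp <- 153; fn <- 17    # actual M (1): predicted 1, predicted 0
fp <- 23;  tn <- 147   # actual W (0): predicted 1, predicted 0

accuracy    <- 100 * (tp + tn) / (tp + tn + fp + fn)  # percent correct overall
sensitivity <- 100 * tp / (tp + fn)                   # percent of actual M found
precision   <- 100 * tp / (tp + fp)                   # percent of predicted M correct
round(c(accuracy = accuracy, sensitivity = sensitivity,
        precision = precision), 2)
```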

Change Classification Threshold

Specify additional probability thresholds for classification beyond just the default 0.5 with the prob_cut parameter.

Logit(Gender ~ Hand, prob_cut=c(.3, .5, .7))
## 
## Response Variable:   Gender
## Predictor Variable 1:  Hand
## 
## Number of cases (rows) of data:  340 
## Number of cases retained for analysis:  340 
## 
## 
##    BASIC ANALYSIS 
## 
## -- Estimated Model of Gender for the Logit of Reference Group Membership
## 
##              Estimate    Std Err  z-value  p-value   Lower 95%   Upper 95%
## (Intercept)  -26.9237     2.7515   -9.785    0.000    -32.3166    -21.5308 
##        Hand    3.2023     0.3269    9.794    0.000      2.5615      3.8431 
## 
## 
## -- Odds Ratios and Confidence Intervals
## 
##              Odds Ratio   Lower 95%   Upper 95%
## (Intercept)      0.0000      0.0000      0.0000 
##        Hand     24.5883     12.9547     46.6690 
## 
## 
## -- Model Fit
## 
##     Null deviance: 471.340 on 339 degrees of freedom
## Residual deviance: 220.664 on 338 degrees of freedom
## 
## AIC: 224.6641 
## 
## Number of iterations to convergence: 6 
## 
## 
##    ANALYSIS OF RESIDUALS AND INFLUENCE 
## Data, Fitted, Residual, Studentized Residual, Dffits, Cook's Distance
##    [sorted by Cook's Distance]
##    [res_rows = 20 out of 340 cases (rows) of data]
## --------------------------------------------------------------------
##     Hand Gender P(Y=1) residual rstudent  dffits   cooks
## 125  7.0      M 0.0109   0.9891    3.045  0.1930 0.11740
## 253  7.0      M 0.0109   0.9891    3.045  0.1930 0.11740
## 162  9.5      W 0.9706  -0.9706   -2.684 -0.2256 0.07555
## 313  9.5      W 0.9706  -0.9706   -2.684 -0.2256 0.07555
## 20   9.0      W 0.8695  -0.8695   -2.031 -0.2229 0.02611
## 33   9.0      W 0.8695  -0.8695   -2.031 -0.2229 0.02611
## 59   9.0      W 0.8695  -0.8695   -2.031 -0.2229 0.02611
## 67   9.0      W 0.8695  -0.8695   -2.031 -0.2229 0.02611
## 69   9.0      W 0.8695  -0.8695   -2.031 -0.2229 0.02611
## 87   9.0      W 0.8695  -0.8695   -2.031 -0.2229 0.02611
## 90   9.0      W 0.8695  -0.8695   -2.031 -0.2229 0.02611
## 150  9.0      W 0.8695  -0.8695   -2.031 -0.2229 0.02611
## 248  9.0      W 0.8695  -0.8695   -2.031 -0.2229 0.02611
## 276  9.0      W 0.8695  -0.8695   -2.031 -0.2229 0.02611
## 284  9.0      W 0.8695  -0.8695   -2.031 -0.2229 0.02611
## 308  9.0      W 0.8695  -0.8695   -2.031 -0.2229 0.02611
## 8    8.0      M 0.2132   0.7868    1.766  0.1948 0.01462
## 109  8.0      M 0.2132   0.7868    1.766  0.1948 0.01462
## 132  8.0      M 0.2132   0.7868    1.766  0.1948 0.01462
## 142  8.0      M 0.2132   0.7868    1.766  0.1948 0.01462
## 
## 
##    PREDICTION 
## 
## Probability threshold for classification M: 0.5
## 
##  0: W
##  1: M
## 
## Data, Fitted Values, Standard Errors
##    [sorted by fitted value]
##    [pred_all=TRUE to see all intervals displayed]
## --------------------------------------------------------------------
##     Hand Gender label    fitted   std.err
## 153 6.00      W     0 0.0004481 0.0003597
## 32  6.75      W     0 0.0049256 0.0027736
## 155 6.75      W     0 0.0049256 0.0027736
## 7   7.00      W     0 0.0109024 0.0052692
## 
## ... for the rows of data where fitted is close to 0.5 ...
## 
##     Hand Gender label fitted std.err
## 293 8.25      M     0 0.3764 0.04188
## 311 8.25      M     0 0.3764 0.04188
## 1   8.50      W     1 0.5734 0.04274
## 9   8.50      M     1 0.5734 0.04274
## 16  8.50      M     1 0.5734 0.04274
## 
## ... for the last 4 rows of sorted data ...
## 
##     Hand Gender label fitted   std.err
## 151   11      M     1 0.9998 0.0002152
## 196   11      M     1 0.9998 0.0002152
## 257   11      M     1 0.9998 0.0002152
## 299   11      M     1 0.9998 0.0002152
## --------------------------------------------------------------------
## 
## 
## ----------------------------
## Specified confusion matrices
## ----------------------------
## 
## Probability threshold for predicting M: 0.3
## Corresponding cutoff threshold for Hand: 8.143
## 
##                Baseline         Predicted 
## ---------------------------------------------------
##               Total  %Tot        0      1  %Correct 
## ---------------------------------------------------
##          1      170  50.0       17    153     90.0 
## Gender   0      170  50.0      147     23     86.5 
## ---------------------------------------------------
##        Total    340                           88.2 
## 
## Accuracy: 88.24 
## Sensitivity: 90.00 
## Precision: 86.93 
## 
## 
## 
## Probability threshold for predicting M: 0.5
## Corresponding cutoff threshold for Hand: 8.408
## 
##                Baseline         Predicted 
## ---------------------------------------------------
##               Total  %Tot        0      1  %Correct 
## ---------------------------------------------------
##          1      170  50.0       17    153     90.0 
## Gender   0      170  50.0      147     23     86.5 
## ---------------------------------------------------
##        Total    340                           88.2 
## 
## Accuracy: 88.24 
## Sensitivity: 90.00 
## Precision: 86.93 
## 
## 
## 
## Probability threshold for predicting M: 0.7
## Corresponding cutoff threshold for Hand: 8.672
## 
##                Baseline         Predicted 
## ---------------------------------------------------
##               Total  %Tot        0      1  %Correct 
## ---------------------------------------------------
##          1      170  50.0       39    131     77.1 
## Gender   0      170  50.0      156     14     91.8 
## ---------------------------------------------------
##        Total    340                           84.4 
## 
## Accuracy: 84.41 
## Sensitivity: 77.06 
## Precision: 90.34
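
Each probability threshold corresponds to a cutoff value of Hand, obtained by solving the estimated logit equation for Hand. The sketch below uses the coefficients copied from the output and the base R function qlogis() for the logit. The object names are illustrative.

```r
# Coefficients copied from the estimated logistic model
b0 <- -26.9237
b1 <-   3.2023

# Solve b0 + b1 * Hand = logit(p) for Hand; qlogis(p) is log(p / (1 - p))
prob_cut <- c(0.3, 0.5, 0.7)
hand_cut <- (qlogis(prob_cut) - b0) / b1
round(hand_cut, 3)

# The odds ratio for Hand is the exponentiated slope coefficient
odds_ratio <- exp(b1)
odds_ratio
```

The cutoffs agree with the reported thresholds for Hand, and the odds ratio agrees with the reported value up to rounding of the displayed coefficients.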

Plot Conditional Means across Bins

Categorize Hand size into six bins. Compute the conditional mean of Gender, scored as 0 and 1, at each level of Hand size. Both variables must be numeric. The visualization approximates the form of the sigmoid function from logistic regression. The point (bubble) size depends on the sample size for the corresponding bin.

d$Gender <- ifelse (d$Gender == "M", 1, 0)
Plot(Hand, Gender, n_bins=6)

## 
## Table: Summary Stats 
##  
##             Hand   Gender 
## -------  -------  ------- 
## n            340      340 
## n.miss         0        0 
## min        6.000        0 
## max       11.000        1 
## mean       8.437    0.500 
## 
##  
## Table: mean of Gender for levels of Hand 
##  
##                   bin     n    midpt    mean 
## ---  ----------------  ----  -------  ------ 
## 1       [5.995,6.833]     3    6.414   0.000 
## 2       (6.833,7.667]    81    7.250   0.025 
## 3       (7.667,8.500]   111    8.083   0.333 
## 4       (8.500,9.333]    84    8.917   0.857 
## 5      (9.333,10.167]    51    9.750   0.961 
## 6     (10.167,11.005]    10   10.586   1.000
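
The bin labels in the output are consistent with equal-width bins of the kind produced by the base R function cut(). The following sketch shows the same idea, the mean of a 0/1 variable within each bin of a continuous variable, with base R only. It is not the lessR implementation. The built-in mtcars data stands in for BodyMeas so that the example runs without lessR, and four bins are used instead of six because the data set is small.

```r
# Assumption: built-in mtcars is a stand-in for BodyMeas;
# am is a 0/1 variable and wt is continuous.
bins <- cut(mtcars$wt, breaks = 4)   # four equal-width bins

# Conditional mean of the 0/1 variable within each bin, plus the bin sizes
bin_mean <- tapply(mtcars$am, bins, mean)
bin_n    <- table(bins)
data.frame(n = as.vector(bin_n),
           mean = round(as.vector(bin_mean), 3),
           row.names = names(bin_mean))
```

Because the variable is scored 0 and 1, each bin mean is the proportion of cases scored 1 in that bin, which is why the plotted means trace an approximately sigmoid curve.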

Interpreted Output

The parameter Rmd automatically generates an R markdown file and then knits it into the corresponding html document, which combines the various output components with full interpretation: a new, much more complete form of computer output.

Not run here.

reg(Salary ~ Years + Pre, Rmd="eg")

Full Manual

Use the base R help() function to view the full manual for Regression(). Simply enter a question mark followed by the name of the function, or its abbreviation.

?reg