Set up analyses.

suppressPackageStartupMessages(library(lessR))
suppressPackageStartupMessages(library(tidyverse))
d <- Read("Employee")  # get some data
## 
## >>> Suggestions
## Details about your data, Enter:  details()  for d, or  details(name)
## 
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
## 
##     Variable                  Missing  Unique 
##         Name     Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1     Years   integer     36       1      16   7  NA  15 ... 1  2  10
##  2    Gender character     37       0       2   M  M  M ... F  F  M
##  3      Dept character     36       1       5   ADMN  SALE  SALE ... MKTG  SALE  FINC
##  4    Salary    double     37       0      37   53788.26  94494.58 ... 56508.32  57562.36
##  5    JobSat character     35       2       3   med  low  low ... high  low  high
##  6      Plan   integer     37       0       3   1  1  3 ... 2  2  1
##  7       Pre   integer     37       0      27   82  62  96 ... 83  59  80
##  8      Post   integer     37       0      22   92  74  97 ... 90  71  87
## ------------------------------------------------------------------------------------------

Standard Numeric x-y Scatterplot

With the lessR function Plot(), scatterplots are defined quite generally, as the plot of any combination of numeric (continuous) or categorical variables. Here we focus on the traditional concept: the plot of two continuous variables.

Default Scatterplots

The base R function is plot(), with a lower-case p.

plot(d$Years, d$Salary)

The lessR function is Plot(), here specified with two variables.

Plot(Years, Salary)

## >>> Suggestions
## Plot(Years, Salary, fit="lm", fit_se=c(.90,.99))  # fit line, standard errors
## Plot(Years, Salary, out_cut=.10)  # label top 10% potential outliers
## Plot(Years, Salary, ellipse=0.95, add="means")  # 0.95 ellipse with means
## Plot(Years, Salary, enhance=TRUE)  # many options, including the above
## Plot(Years, Salary, shape="diamond")  # change plot character 
## 
## 
## >>> Pearson's product-moment correlation 
##  
## Number of paired values with neither missing, n = 36 
## 
## 
## Sample Correlation of Years and Salary: r = 0.852 
## 
## 
## Alternative Hypothesis: True correlation is not equal to 0 
##   t-value: 9.501,  df: 34,  p-value: 0.000 
##  
## 95% Confidence Interval of Population Correlation 
##   Lower Bound: 0.727      Upper Bound: 0.923
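
The correlation analysis that accompanies the plot can also be obtained with the base R function cor.test(). The following is a minimal sketch with simulated data: the variables years and salary and their values are made up for illustration and are not the Employee data. It also shows how the t-value follows from r and n.

```r
# Simulated data for illustration only; not the Employee data
set.seed(1)
years  <- round(runif(36, 1, 20))
salary <- 40000 + 3000 * years + rnorm(36, sd = 10000)

ct <- cor.test(years, salary)   # Pearson correlation is the default
r  <- unname(ct$estimate)       # sample correlation
n  <- length(years)

# t statistic computed directly from r, with n - 2 degrees of freedom
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)

ct$conf.int                     # 95% confidence interval of the population correlation
```

The value of t_manual matches ct$statistic, the t-value that cor.test() reports.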

Enhance the scatterplot with several attributes at once with the enhance parameter: detect outliers, get the 0.95 data ellipse, divide the plot into quadrants based on the means, and display the best-fit least-squares line, both with and without the outliers. The dotted line is the fit without the outliers.

Plot(Years, Salary, enhance=TRUE)
## [Ellipse with Murdoch and Chow's function ellipse from the ellipse package]

## >>> Suggestions
## Plot(Years, Salary, fit="lm", fit_se=c(.90,.99))  # fit line, standard errors
## Plot(Years, Salary, out_cut=.10)  # label top 10% potential outliers
## Plot(Years, Salary, ellipse=0.95, add="means")  # 0.95 ellipse with means
## Plot(Years, Salary, shape="diamond")  # change plot character 
## 
## 
## >>> Pearson's product-moment correlation 
##  
## Number of paired values with neither missing, n = 36 
## 
## 
## Sample Correlation of Years and Salary: r = 0.852 
## 
## 
## Alternative Hypothesis: True correlation is not equal to 0 
##   t-value: 9.501,  df: 34,  p-value: 0.000 
##  
## 95% Confidence Interval of Population Correlation 
##   Lower Bound: 0.727      Upper Bound: 0.923
## >>> Outlier analysis with Mahalanobis Distance 
##  
##   MD                  ID 
## -----               ----- 
## 8.14     Correll, Trevon 
## 7.84       Capelle, Adam 
##  
## 5.63  Korhalkar, Jessica 
## 5.58       James, Leslie 
## 3.75         Hoang, Binh 
## ...                 ...
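
To see the kind of computation behind such an outlier listing, the base R function mahalanobis() returns the squared Mahalanobis distance of each row from the variable means. This is a sketch with made-up data in a hypothetical matrix xy. Whether lessR reports the squared or unsquared distance is not checked here.

```r
# Made-up bivariate data for illustration only
set.seed(2)
xy <- cbind(x = rnorm(36), y = rnorm(36))

# Squared Mahalanobis distance of each row from the vector of means
md <- mahalanobis(xy, center = colMeans(xy), cov = cov(xy))

# The five rows farthest from the center, largest distance first
head(sort(md, decreasing = TRUE), 5)
```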

With ggplot(), the first argument is the data frame, and the second argument is a call to the aes() (aesthetics) function. At a minimum, specify the x variable, and any y variable, with aes(). For a scatterplot, add the points with geom_point().

ggplot(d, aes(Years, Salary)) + geom_point()

Add an Ellipse

lessR

Now that we have seen the console output for the accompanying statistical analysis, turn off console output for the remaining lessR output for a more compact presentation.

style(quiet=TRUE)

The parameter ellipse specifies a data ellipse. Setting its value to TRUE sets the default level value of 0.95. A translucent fill color is also provided, consistent with the current color theme. Potential outliers, consistent with the level of the ellipse, are labeled according to their row name by default.

Plot(Years, Salary, ellipse=TRUE)

One or more individual levels of the ellipse can be specified to obtain one or more data ellipses. Here define a vector of three values with the R combine function c() to generate three ellipses at three different levels.

Plot(Years, Salary, ellipse=c(.3,.5,.9))

Or, specify many ellipses, here with the base R function seq(). Remove the points from the plot by setting the size parameter, for point size, to 0.

Plot(Years, Salary, ellipse=seq(.1,.9,.1), size=0)
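
What the level of a data ellipse means can be checked numerically. Under a bivariate normal model, the ellipse at level p contains the points whose squared Mahalanobis distance from the means is no larger than the p quantile of the chi-square distribution with 2 degrees of freedom. The sketch below uses simulated data and hypothetical names (xy, d2, lev). The ellipse that lessR draws may differ in detail.

```r
# Simulated, correlated bivariate normal data for illustration only
set.seed(3)
n  <- 2000
x  <- rnorm(n)
y  <- 0.8 * x + rnorm(n, sd = 0.6)
xy <- cbind(x, y)

# Squared Mahalanobis distance of each point from the means
d2 <- mahalanobis(xy, colMeans(xy), cov(xy))

# Proportion of points that fall inside the ellipse at each level
lev <- c(.3, .5, .9)
coverage <- sapply(lev, function(p) mean(d2 <= qchisq(p, df = 2)))
round(coverage, 2)
```

Each proportion should be close to its nominal level.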

Can also turn off the drawing of the ellipse border, leaving just the translucent fill color, which darkens as more ellipses overlap. Here only the overlapping translucent interiors of the 0.05, 0.10, 0.15 … 0.90, 0.95 data ellipses are displayed. The result is a kind of idealized bivariate relationship estimated from the data.

style(ellipse_color="off")
Plot(Years, Salary, ellipse=seq(.05,.95,.05), size=0)

Reset the style back to default, which is theme="colors".

style()
## theme set to "colors"

The number of labeled points farthest from the ellipse center can be changed with the out_cut parameter. Here the most extreme 20% of the plotted points are identified and labeled.

Plot(Years, Salary, ellipse=0.95, out_cut=.20)
## [Ellipse with Murdoch and Chow's function ellipse from the ellipse package]

## >>> Suggestions
## Plot(Years, Salary, fit="lm", fit_se=c(.90,.99))  # fit line, standard errors
## Plot(Years, Salary, enhance=TRUE)  # many options, including the above
## Plot(Years, Salary, shape="diamond")  # change plot character 
## 
## 
## >>> Pearson's product-moment correlation 
##  
## Number of paired values with neither missing, n = 36 
## 
## 
## Sample Correlation of Years and Salary: r = 0.852 
## 
## 
## Alternative Hypothesis: True correlation is not equal to 0 
##   t-value: 9.501,  df: 34,  p-value: 0.000 
##  
## 95% Confidence Interval of Population Correlation 
##   Lower Bound: 0.727      Upper Bound: 0.923
## >>> Outlier analysis with Mahalanobis Distance 
##  
##   MD                  ID 
## -----               ----- 
## 8.14     Correll, Trevon 
## 7.84       Capelle, Adam 
## 5.63  Korhalkar, Jessica 
## 5.58       James, Leslie 
## 3.75         Hoang, Binh 
## 3.10      Skrotzki, Sara 
## 2.95      Billing, Susan 
##  
## 2.64        Singh, Niral 
## 2.56 Cassinelli, Anastis 
## 2.32       Knox, Michael 
## ...                 ...

To label the points with the values of a column of the data table instead of the row names, specify the name of that variable with the ID parameter.

ggplot2

Add default 0.95 data ellipse with the stat_ellipse function. Here save the layers one at a time in a ggplot object, named p.

p <- ggplot(d, aes(Years, Salary)) + geom_point() + stat_ellipse()
p

Add the 0.75 data ellipse as another layer to p.

p <- p + stat_ellipse(level=0.75)

Display the graph with all its layers.

p

Filled ellipse with narrow border (size=.5).

ggplot(d, aes(Years, Salary)) + geom_point() +
  stat_ellipse(geom="polygon", color="black", size=.5, alpha=.1)

Multiple filled ellipses.

ggplot(d, aes(Years, Salary)) + geom_point() +
  stat_ellipse(type="norm", level=.9,
      geom="polygon", color="black", size=.5, alpha=.05) +
  stat_ellipse(type="norm", level=.6,
      geom="polygon", color="black", size=.5, alpha=.05) +
  stat_ellipse(type="norm", level=.3,
      geom="polygon", color="black", size=.5, alpha=.05)
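
The repeated stat_ellipse() calls can be generated from a vector of levels, because a list of layers can be added to a ggplot object with +. A minimal sketch, in which the built-in mtcars data stands in for the Employee data and the name ellipses is made up for illustration:

```r
library(ggplot2)

# One filled ellipse layer per level; mtcars stands in for the Employee data
ellipses <- lapply(c(.9, .6, .3), function(lev)
  stat_ellipse(type = "norm", level = lev,
               geom = "polygon", color = "black", alpha = .05))

# Adding the list adds all three layers at once
p <- ggplot(mtcars, aes(wt, mpg)) + geom_point() + ellipses
p
```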

Change the plot background color and border of the background with the theme() function.

ggplot(d, aes(Years, Salary)) +
  geom_point() +
  theme(panel.background = element_rect(fill="powderblue", color="blue"))

Fit Lines

lessR

Request a fit line with the fit parameter. Setting fit to TRUE yields the default, the loess fit line.

Plot(Years, Salary, fit=TRUE)

## >>> Suggestions
## Plot(Years, Salary, out_cut=.10)  # label top 10% potential outliers
## Plot(Years, Salary, ellipse=0.95, add="means")  # 0.95 ellipse with means
## Plot(Years, Salary, enhance=TRUE)  # many options, including the above
## Plot(Years, Salary, shape="diamond")  # change plot character 
## 
## 
## >>> Pearson's product-moment correlation 
##  
## Number of paired values with neither missing, n = 36 
## 
## 
## Sample Correlation of Years and Salary: r = 0.852 
## 
## 
## Alternative Hypothesis: True correlation is not equal to 0 
##   t-value: 9.501,  df: 34,  p-value: 0.000 
##  
## 95% Confidence Interval of Population Correlation 
##   Lower Bound: 0.727      Upper Bound: 0.923

Add standard errors, here for three confidence intervals, with the fit_se parameter. If standard errors of the fit line are specified, there is no need to specify the parameter fit.

Plot(Years, Salary, fit_se=c(.90, .95, .99))

## >>> Suggestions
## Plot(Years, Salary, out_cut=.10)  # label top 10% potential outliers
## Plot(Years, Salary, ellipse=0.95, add="means")  # 0.95 ellipse with means
## Plot(Years, Salary, enhance=TRUE)  # many options, including the above
## Plot(Years, Salary, shape="diamond")  # change plot character 
## 
## 
## >>> Pearson's product-moment correlation 
##  
## Number of paired values with neither missing, n = 36 
## 
## 
## Sample Correlation of Years and Salary: r = 0.852 
## 
## 
## Alternative Hypothesis: True correlation is not equal to 0 
##   t-value: 9.501,  df: 34,  p-value: 0.000 
##  
## 95% Confidence Interval of Population Correlation 
##   Lower Bound: 0.727      Upper Bound: 0.923
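
To see what a standard error band around a loess line represents, the fit and its standard errors can be computed with the base R functions loess() and predict(). This is a sketch with simulated data, and it is an assumption that the band lessR draws is computed in a comparable way.

```r
# Simulated data for illustration only
set.seed(4)
x <- sort(runif(60, 0, 20))
y <- 40000 + 3000 * x + rnorm(60, sd = 8000)

fit  <- loess(y ~ x)
pred <- predict(fit, se = TRUE)      # fitted values and their standard errors

# 0.95 band: fitted value plus or minus the t cutoff times the standard error
crit  <- qt(0.975, pred$df)
lower <- pred$fit - crit * pred$se.fit
upper <- pred$fit + crit * pred$se.fit

plot(x, y)
lines(x, pred$fit)
lines(x, lower, lty = 2)
lines(x, upper, lty = 2)
```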

Obtain a brightly colored fit line in the ggplot2 default style, here with the points removed by setting size to 0.

style(fit_color="darkred")
Plot(Years, Salary, size=0, fit_se=.95)

## >>> Suggestions
## Plot(Years, Salary, out_cut=.10)  # label top 10% potential outliers
## Plot(Years, Salary, ellipse=0.95, add="means")  # 0.95 ellipse with means
## Plot(Years, Salary, enhance=TRUE)  # many options, including the above
## Plot(Years, Salary, shape="diamond")  # change plot character 
## 
## 
## >>> Pearson's product-moment correlation 
##  
## Number of paired values with neither missing, n = 36 
## 
## 
## Sample Correlation of Years and Salary: r = 0.852 
## 
## 
## Alternative Hypothesis: True correlation is not equal to 0 
##   t-value: 9.501,  df: 34,  p-value: 0.000 
##  
## 95% Confidence Interval of Population Correlation 
##   Lower Bound: 0.727      Upper Bound: 0.923
style()
## theme set to "colors"

ggplot2

The default stat for geom_smooth() is stat_smooth(), which provides the smoothed conditional mean with a 0.95 confidence interval. Here the points are not plotted as there is no geom_point().

ggplot(d, aes(Years, Salary)) + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Do both the smoothed line and the scatterplot.

ggplot(d, aes(Years, Salary)) + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Add a smoothed line with a 0.5 confidence interval.

ggplot(d, aes(Years, Salary)) + geom_smooth(level=0.5)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Can specify a linear regression line.

ggplot(d, aes(Years, Salary)) + geom_point() + geom_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'
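
The line that geom_smooth(method=lm) draws is the least-squares fit from the base R function lm(). The sketch below, with simulated data in a hypothetical data frame dd, obtains the estimated coefficients and the 0.95 confidence limits of the fitted values at a few values of Years.

```r
# Simulated data for illustration only; not the Employee data
set.seed(5)
dd <- data.frame(Years = round(runif(36, 1, 20)))
dd$Salary <- 40000 + 3000 * dd$Years + rnorm(36, sd = 10000)

fit <- lm(Salary ~ Years, data = dd)
coef(fit)                            # intercept and slope of the fit line

# Fitted values with 0.95 confidence limits at selected values of Years
new  <- data.frame(Years = c(5, 10, 15))
band <- predict(fit, newdata = new, interval = "confidence", level = 0.95)
band
```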

Custom linear regression line, remove shaded confidence region.

ggplot(d, aes(Years, Salary)) + geom_point() +
   geom_smooth(method=lm, color="darkgray", linetype="dashed", se=FALSE)
## `geom_smooth()` using formula 'y ~ x'

Add the 0.95 confidence ellipse with the quadrants delineated using geom_vline() and geom_hline().

ggplot(d, aes(Years, Salary)) + geom_point() +
  stat_ellipse(type="norm") +
  geom_vline(aes(xintercept=mean(Years, na.rm=TRUE)), color="gray70") +
  geom_hline(aes(yintercept=mean(Salary, na.rm=TRUE)), color="gray70")
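
An alternative is to compute the two means once, outside of aes(), and pass them as constants. A sketch of this approach, with the built-in mtcars data standing in for the Employee data and the names mx and my made up for illustration:

```r
library(ggplot2)

# Means computed once, with missing values removed
mx <- mean(mtcars$wt,  na.rm = TRUE)
my <- mean(mtcars$mpg, na.rm = TRUE)

# Constants are passed directly as xintercept and yintercept, no aes() needed
p <- ggplot(mtcars, aes(wt, mpg)) + geom_point() +
  stat_ellipse(type = "norm") +
  geom_vline(xintercept = mx, color = "gray70") +
  geom_hline(yintercept = my, color = "gray70")
p
```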

Scatterplot with a Third Variable: Categorical

Can create visualizations of a relationship between two variables across levels of a third, categorical variable. Here, plot multiple scatterplots between two continuous variables, one scatterplot for each level, that is, group, for the third variable, the grouping variable.

lessR

With lessR, introduce the grouping variable with the by parameter.

Plot(Years, Salary, by=Gender)

## >>> Suggestions
## Plot(Years, Salary, fit="lm", fit_se=c(.90,.99))  # fit line, standard errors
## Plot(Years, Salary, out_cut=.10)  # label top 10% potential outliers
## Plot(Years, Salary, ellipse=0.95, add="means")  # 0.95 ellipse with means
## Plot(Years, Salary, enhance=TRUE)  # many options, including the above
## Plot(Years, Salary, shape="diamond")  # change plot character 
## 
## 
## >>> Pearson's product-moment correlation 
##  
## Number of paired values with neither missing, n = 36 
## 
## 
## Sample Correlation of Years and Salary: r = 0.852 
## 
## 
## Alternative Hypothesis: True correlation is not equal to 0 
##   t-value: 9.501,  df: 34,  p-value: 0.000 
##  
## 95% Confidence Interval of Population Correlation 
##   Lower Bound: 0.727      Upper Bound: 0.923

The parameter pt_fill overrides the color from the current color theme. Specify as many colors as there are levels.

style(pt_fill=c("green", "brown"))

The by parameter plots points from different levels of the corresponding variable with different fill colors. For only two fill colors the current color theme is maintained by plotting the points from one group with the fill color and the other set of points without.

Plot(Years, Salary, by=Gender)

## >>> Suggestions
## Plot(Years, Salary, fit="lm", fit_se=c(.90,.99))  # fit line, standard errors
## Plot(Years, Salary, out_cut=.10)  # label top 10% potential outliers
## Plot(Years, Salary, ellipse=0.95, add="means")  # 0.95 ellipse with means
## Plot(Years, Salary, enhance=TRUE)  # many options, including the above
## Plot(Years, Salary, shape="diamond")  # change plot character 
## 
## 
## >>> Pearson's product-moment correlation 
##  
## Number of paired values with neither missing, n = 36 
## 
## 
## Sample Correlation of Years and Salary: r = 0.852 
## 
## 
## Alternative Hypothesis: True correlation is not equal to 0 
##   t-value: 9.501,  df: 34,  p-value: 0.000 
##  
## 95% Confidence Interval of Population Correlation 
##   Lower Bound: 0.727      Upper Bound: 0.923
style()
## theme set to "colors"

Can also specify a separate shape for each group with the shape parameter.

Plot(Years, Salary, by=Gender, shape=c("F","M"))

## >>> Suggestions
## Plot(Years, Salary, fit="lm", fit_se=c(.90,.99))  # fit line, standard errors
## Plot(Years, Salary, out_cut=.10)  # label top 10% potential outliers
## Plot(Years, Salary, ellipse=0.95, add="means")  # 0.95 ellipse with means
## Plot(Years, Salary, enhance=TRUE)  # many options, including the above
## Plot(Years, Salary, shape="diamond")  # change plot character 
## 
## 
## >>> Pearson's product-moment correlation 
##  
## Number of paired values with neither missing, n = 36 
## 
## 
## Sample Correlation of Years and Salary: r = 0.852 
## 
## 
## Alternative Hypothesis: True correlation is not equal to 0 
##   t-value: 9.501,  df: 34,  p-value: 0.000 
##  
## 95% Confidence Interval of Population Correlation 
##   Lower Bound: 0.727      Upper Bound: 0.923

Fit lines can be added as well, with a fit line for each group. Here with just two levels the color theme is preserved as well.

Plot(Years, Salary, by=Gender, fit="ls")

## >>> Suggestions
## Plot(Years, Salary, out_cut=.10)  # label top 10% potential outliers
## Plot(Years, Salary, ellipse=0.95, add="means")  # 0.95 ellipse with means
## Plot(Years, Salary, enhance=TRUE)  # many options, including the above
## Plot(Years, Salary, shape="diamond")  # change plot character 
## 
## 
## >>> Pearson's product-moment correlation 
##  
## Number of paired values with neither missing, n = 36 
## 
## 
## Sample Correlation of Years and Salary: r = 0.852 
## 
## 
## Alternative Hypothesis: True correlation is not equal to 0 
##   t-value: 9.501,  df: 34,  p-value: 0.000 
##  
## 95% Confidence Interval of Population Correlation 
##   Lower Bound: 0.727      Upper Bound: 0.923

ggplot2

Specify color as the aspect of the graph that demonstrates the grouping.

ggplot(d, aes(Years, Salary)) + geom_point(aes(color=Gender))

Or, specify shape as the grouping feature.

ggplot(d, aes(Years, Salary)) + geom_point(aes(shape=Gender))
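
To also get a separate fit line for each group, one approach is to map the grouping variable to color in the main aes() call, so that all layers inherit the grouping. A sketch, with the built-in mtcars data and its transmission indicator am standing in for the Employee data and Gender:

```r
library(ggplot2)

# am recoded as a factor to serve as the grouping variable
cars <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))

# color in the main aes() applies to both the points and the fit lines
p <- ggplot(cars, aes(wt, mpg, color = am)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
p
```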

Scatterplot with a Third Variable: Continuous

lessR

Pre is another continuous variable in the data table. The parameter size can be a constant, which changes the size of all the plotted points, or it can be set to a variable, in which case the size of each plotted bubble corresponds to that point's value of the variable.

Plot(Years, Salary, size=3)

## >>> Suggestions
## Plot(Years, Salary, fit="lm", fit_se=c(.90,.99))  # fit line, standard errors
## Plot(Years, Salary, out_cut=.10)  # label top 10% potential outliers
## Plot(Years, Salary, ellipse=0.95, add="means")  # 0.95 ellipse with means
## Plot(Years, Salary, enhance=TRUE)  # many options, including the above
## Plot(Years, Salary, shape="diamond")  # change plot character 
## 
## 
## >>> Pearson's product-moment correlation 
##  
## Number of paired values with neither missing, n = 36 
## 
## 
## Sample Correlation of Years and Salary: r = 0.852 
## 
## 
## Alternative Hypothesis: True correlation is not equal to 0 
##   t-value: 9.501,  df: 34,  p-value: 0.000 
##  
## 95% Confidence Interval of Population Correlation 
##   Lower Bound: 0.727      Upper Bound: 0.923
Plot(Years, Salary, size=Pre)

## >>> Suggestions
## Plot(Years, Salary, fit="lm", fit_se=c(.90,.99))  # fit line, standard errors
## Plot(Years, Salary, out_cut=.10)  # label top 10% potential outliers
## Plot(Years, Salary, ellipse=0.95, add="means")  # 0.95 ellipse with means
## Plot(Years, Salary, enhance=TRUE)  # many options, including the above
## Plot(Years, Salary, shape="diamond")  # change plot character 
## 
## 
## >>> Pearson's product-moment correlation 
##  
## Number of paired values with neither missing, n = 36 
## 
## 
## Sample Correlation of Years and Salary: r = 0.852 
## 
## 
## Alternative Hypothesis: True correlation is not equal to 0 
##   t-value: 9.501,  df: 34,  p-value: 0.000 
##  
## 95% Confidence Interval of Population Correlation 
##   Lower Bound: 0.727      Upper Bound: 0.923

Reduce the absolute size of the bubbles, scaled in inches, with the radius parameter. The default is 0.25, apparently the radius, though the documentation does not specify. Also, the power parameter changes the size of the bubbles relative to each other. Increasing its value more effectively differentiates among values.

Plot(Years, Salary, size=Pre, radius=0.10, power=2)

## >>> Suggestions
## Plot(Years, Salary, fit="lm", fit_se=c(.90,.99))  # fit line, standard errors
## Plot(Years, Salary, out_cut=.10)  # label top 10% potential outliers
## Plot(Years, Salary, ellipse=0.95, add="means")  # 0.95 ellipse with means
## Plot(Years, Salary, enhance=TRUE)  # many options, including the above
## Plot(Years, Salary, shape="diamond")  # change plot character
## Plot(x=Years, y=Salary, size=Pre, radius=0.1, power=2, radius=0.15) # larger bubbles 
## 
## 
## >>> Pearson's product-moment correlation 
##  
## Number of paired values with neither missing, n = 36 
## 
## 
## Sample Correlation of Years and Salary: r = 0.852 
## 
## 
## Alternative Hypothesis: True correlation is not equal to 0 
##   t-value: 9.501,  df: 34,  p-value: 0.000 
##  
## 95% Confidence Interval of Population Correlation 
##   Lower Bound: 0.727      Upper Bound: 0.923

ggplot2

ggplot2 allows several different mappings from variable to aesthetic. Scale by area.

ggplot(d, aes(Years, Salary)) + geom_point(aes(size=Pre))

Scale by color.

ggplot(d, aes(Years, Salary)) + geom_point(aes(color=Pre))
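
The overall size of the bubbles can be controlled with a size scale. The sketch below uses scale_size_area(), which makes the area of each point proportional to the value and sets the size of the largest point with max_size. The built-in mtcars data stands in for the Employee data, and the value 8 is arbitrary.

```r
library(ggplot2)

# Bubble area proportional to hp; max_size sets the largest bubble
p <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point(aes(size = hp), alpha = 0.5) +
  scale_size_area(max_size = 8)
p
```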

Multiple Plots on One Graph with lessR

Specify either the x-variable or the y-variable as a vector, specified with the standard R c() function. The input to the function still follows the listing of two variables, but now the first variable is a vector, here defined as the variables Pre and Post.

Plot(c(Pre, Post), Salary)

Generate a fit line for each plotted pair of variables.

Plot(c(Pre, Post), Salary, fit="ls")

Can customize, such as by increasing the size of the plotted points, turning off the background color, turning off the grid.

style(panel_fill="off", grid_color="off")
Plot(c(Pre, Post), Salary, size=3, fit=TRUE)

style()
## theme set to "colors"

Can specify the y-variable as the vector.

Plot(Salary, c(Pre, Post))

## >>> Suggestions
## Plot(Salary, c(Pre, Post), fit="lm", fit_se=c(.90,.99))  # fit line, standard errors
## Plot(Salary, c(Pre, Post), out_cut=.10)  # label top 10% potential outliers
## Plot(Salary, c(Pre, Post), ellipse=0.95, add="means")  # 0.95 ellipse with means
## Plot(Salary, c(Pre, Post), enhance=TRUE)  # many options, including the above
## Plot(Salary, c(Pre, Post), shape="diamond")  # change plot character 
## 
## 
## >>> Pearson's product-moment correlation 
##  
## Number of paired values with neither missing, n = 37 
## 
## 
## Sample Correlation of Salary and Pre: r = -0.007 
## 
## 
## Alternative Hypothesis: True correlation is not equal to 0 
##   t-value: -0.043,  df: 35,  p-value: 0.966 
##  
## 95% Confidence Interval of Population Correlation 
##   Lower Bound: -0.330      Upper Bound:  0.318 
## 
## >>> Pearson's product-moment correlation 
##  
## Number of paired values with neither missing, n = 37 
## 
## 
## Sample Correlation of Salary and Post: r = -0.070 
## 
## 
## Alternative Hypothesis: True correlation is not equal to 0 
##   t-value: -0.416,  df: 35,  p-value: 0.680 
##  
## 95% Confidence Interval of Population Correlation 
##   Lower Bound: -0.385      Upper Bound:  0.260
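
One common way to get a similar display with ggplot2 is to first reshape the data to long form, with one row per score and a column that identifies the test. The following sketch uses simulated values and hypothetical names (wide, long, Test, Score), not the Employee data.

```r
library(ggplot2)

# Simulated wide-form data for illustration only
set.seed(6)
wide <- data.frame(Salary = rnorm(37, 70000, 15000),
                   Pre    = round(runif(37, 60, 100)),
                   Post   = round(runif(37, 60, 100)))

# Long form: stack Pre and Post into one column, labeled by Test
long <- data.frame(Salary = rep(wide$Salary, times = 2),
                   Test   = rep(c("Pre", "Post"), each = nrow(wide)),
                   Score  = c(wide$Pre, wide$Post))

# One set of points and one least-squares line per test
p <- ggplot(long, aes(Score, Salary, color = Test)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
p
```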

Scatterplot Matrix

The scatterplot matrix plots the scatterplot, and perhaps additional information such as a fit line, for each pairwise combination of the specified variables. lessR multiple regression also automatically provides a scatterplot matrix of all the variables in the model, as part of a comprehensive regression analysis.

To obtain a scatterplot matrix, use the lessR Plot() function. Specify a vector of numeric variables.

Plot(c(Salary, Years, Pre, Post))

Many of the standard lessR visualization parameters also transfer over to the scatterplot matrix. Here change the color of the background and put a red least squares line through each plot.

style("darkred")
style(panel_fill="aliceblue", fit_color="red")
Plot(c(Salary, Years, Pre, Post), fit="ls")

style()
## theme set to "colors"
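
For comparison, base R provides a scatterplot matrix with the pairs() function, which accepts a custom panel function. A sketch with the built-in mtcars data, where panel_ls is a made-up name for a panel function that adds a red least-squares line to each plot:

```r
# Panel function: plot the points, then add the least-squares line
panel_ls <- function(x, y, ...) {
  points(x, y, ...)
  abline(lm(y ~ x), col = "red")
}

# Four numeric variables from the built-in mtcars data
vars <- mtcars[, c("mpg", "wt", "hp", "disp")]
pairs(vars, panel = panel_ls)
```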

This plot is also generated by the lessR Regression() function, abbreviated reg(). The function call is shown here but not run, as we are not doing this analysis.

Regression(Salary ~ Years + Pre + Post)

Trellis Graphics

One common scenario is to examine a relationship between two variables at different levels of one or more other categorical variables. We have already seen this type of display in which the relationship is plotted on the same scatterplot for each of the levels of a third variable, a categorical variable. Trellis graphics extend this idea by plotting the relationship of the variables of interest for each level of another variable in a separate plot called a panel. A primary purpose of the lattice package is to provide these Trellis graphs. The categorical variables are called conditioning variables.

lessR and lattice

The lessR package directly accesses the lattice functions for Trellis graphs, which provides an interface consistent with all the regular lessR visualization functions, including the current color theme. Activate Trellis graphics for one categorical variable with the by1 parameter.

Plot(Years, Salary, by1=Gender)
## [Trellis graphics from Deepayan Sarkar's lattice package]

Most of the standard lessR parameters transfer over to the Trellis panels. Here change the background color and put a regression line through each panel. Nothing changes from the standard 1-panel lessR display, except for the inclusion of the by1 parameter.

style(panel_fill="aliceblue")
Plot(Years, Salary, by1=Gender, fit="ls")
## [Trellis graphics from Deepayan Sarkar's lattice package]

Reset style to default if desired

style()
## theme set to "colors"

Control the layout of the panels with the n_col and n_row parameters. Here stack the two panels into a single column, instead of the default of a single row.

Plot(Years, Salary, by1=Gender, fit="ls", n_col=1)
## [Trellis graphics from Deepayan Sarkar's lattice package]

Compare these parameters with the by parameter, which groups all of the plots on the same graph.

Plot(Years, Salary, by=Gender, fit="ls")

## >>> Suggestions
## Plot(Years, Salary, out_cut=.10)  # label top 10% potential outliers
## Plot(Years, Salary, ellipse=0.95, add="means")  # 0.95 ellipse with means
## Plot(Years, Salary, enhance=TRUE)  # many options, including the above
## Plot(Years, Salary, shape="diamond")  # change plot character 
## 
## 
## >>> Pearson's product-moment correlation 
##  
## Number of paired values with neither missing, n = 36 
## 
## 
## Sample Correlation of Years and Salary: r = 0.852 
## 
## 
## Alternative Hypothesis: True correlation is not equal to 0 
##   t-value: 9.501,  df: 34,  p-value: 0.000 
##  
## 95% Confidence Interval of Population Correlation 
##   Lower Bound: 0.727      Upper Bound: 0.923

Can also combine conditioning with by1 and by2, with grouping specified by by. Here we have by1 and by. The variable Plan has three levels.

Plot(Years, Salary, by1=Gender, by=Plan, fit="ls")
## [Trellis graphics from Deepayan Sarkar's lattice package]

Running out of data values for such a small data set, but here we have two levels of conditioning and grouping, along with the least squares line.

Plot(Years, Salary, by1=Dept, by2=Gender, by=Plan, fit="ls")
## [Trellis graphics from Deepayan Sarkar's lattice package]

Now for two conditioning variables without grouping. A panel is produced for each cross-classification of the levels of the two categorical variables.

Plot(Years, Salary, by1=Dept, by2=Gender)
## [Trellis graphics from Deepayan Sarkar's lattice package]

ggplot2

The ggplot2 version of the same concept of Trellis graphics invokes what it calls facets. Get the numeric-numeric scatterplot separated with Gender facets, first with the facets arranged in rows and then in columns.

ggplot(d, aes(Years, Salary)) + geom_point() + facet_grid(rows=vars(Gender))

ggplot(d, aes(Years, Salary)) + geom_point() + facet_grid(cols=vars(Gender))

Include linear best fit line without standard errors for each plot.

ggplot(d, aes(Years, Salary)) +
  geom_point() + geom_smooth(method="lm", se=FALSE) + facet_grid(rows=vars(Gender))
## `geom_smooth()` using formula 'y ~ x'
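
Facets can also be formed from two categorical variables, one for the rows and one for the columns, which parallels two conditioning variables in a Trellis plot. A sketch with the built-in mtcars data, where cyl and am stand in for Dept and Gender:

```r
library(ggplot2)

# Two categorical variables, recoded as factors
cars <- transform(mtcars,
                  am  = factor(am, labels = c("automatic", "manual")),
                  cyl = factor(cyl))

# One panel for each combination of cyl (rows) and am (columns)
p <- ggplot(cars, aes(wt, mpg)) + geom_point() +
  facet_grid(rows = vars(cyl), cols = vars(am))
p
```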

Smoothed Scatterplots

lessR

For many, many points to plot, the over-plotting problem becomes severe, even with translucent points. By default, when there are at least 2500 rows of data to analyze, the Plot function implements bivariate smoothing across the scatterplot.

This example also illustrates that the data entered into a lessR function need not be in a data frame. Here x and y are defined in the R user workspace, more formally called the global environment. The contents of this workspace are shown in RStudio with the Environment tab at the top-right corner, its default location.

x <- rnorm(5000)
y <- rnorm(5000)

First the base R plot() function.

plot(x,y)

The lessR Plot() function automatically smooths from 2500 points on. It also references variables in the general R user workspace, as in this example, as well as variables in a data frame.

Plot(x, y)
## >>> Note: x is from the workspace, not in a data frame (table)
## >>> Note: y is from the workspace, not in a data frame (table)

## >>> Suggestions
## Plot(x, y, fit="lm", fit_se=c(.90,.99))  # fit line, standard errors
## Plot(x, y, out_cut=.10)  # label top 10% potential outliers
## Plot(x, y, ellipse=0.95, add="means")  # 0.95 ellipse with means
## Plot(x, y, enhance=TRUE)  # many options, including the above
## Plot(x, y, shape="diamond")  # change plot character 
## 
## 
## >>> Pearson's product-moment correlation 
##  
## Number of paired values with neither missing, n = 5000 
## 
## 
## Sample Correlation of x and y: r = -0.044 
## 
## 
## Alternative Hypothesis: True correlation is not equal to 0 
##   t-value: -3.092,  df: 4998,  p-value: 0.002 
##  
## 95% Confidence Interval of Population Correlation 
##   Lower Bound: -0.071      Upper Bound: -0.016

Blue smoothing cloud.

style("dodgerblue")
Plot(x, y)
## >>> Note: x is from the workspace, not in a data frame (table)
## >>> Note: y is from the workspace, not in a data frame (table)

## >>> Suggestions
## Plot(x, y, fit="lm", fit_se=c(.90,.99))  # fit line, standard errors
## Plot(x, y, out_cut=.10)  # label top 10% potential outliers
## Plot(x, y, ellipse=0.95, add="means")  # 0.95 ellipse with means
## Plot(x, y, enhance=TRUE)  # many options, including the above
## Plot(x, y, shape="diamond")  # change plot character 
## 
## 
## >>> Pearson's product-moment correlation 
##  
## Number of paired values with neither missing, n = 5000 
## 
## 
## Sample Correlation of x and y: r = -0.044 
## 
## 
## Alternative Hypothesis: True correlation is not equal to 0 
##   t-value: -3.092,  df: 4998,  p-value: 0.002 
##  
## 95% Confidence Interval of Population Correlation 
##   Lower Bound: -0.071      Upper Bound: -0.016
style()
## theme set to "colors"

If desired, smoothing can be turned off, which follows the general lessR guideline that defaults are provided but can always be overridden.

Plot(x, y, smooth=FALSE)
## >>> Note: x is from the workspace, not in a data frame (table)
## >>> Note: y is from the workspace, not in a data frame (table)

## >>> Suggestions
## Plot(x, y, fit="lm", fit_se=c(.90,.99))  # fit line, standard errors
## Plot(x, y, out_cut=.10)  # label top 10% potential outliers
## Plot(x, y, ellipse=0.95, add="means")  # 0.95 ellipse with means
## Plot(x, y, enhance=TRUE)  # many options, including the above 
## 
## 
## >>> Pearson's product-moment correlation 
##  
## Number of paired values with neither missing, n = 5000 
## 
## 
## Sample Correlation of x and y: r = -0.044 
## 
## 
## Alternative Hypothesis: True correlation is not equal to 0 
##   t-value: -3.092,  df: 4998,  p-value: 0.002 
##  
## 95% Confidence Interval of Population Correlation 
##   Lower Bound: -0.071      Upper Bound: -0.016

Besides the general smooth parameter for turning smoothing on or off, three other parameters control how the smoothing is drawn.

  • smooth.points: Number of points superimposed on the density plot in the areas of the lowest density to help identify outliers. Defaults to 100.
  • smooth.exp: The exponent of the function that maps the density scale to the color scale, which controls the darkness of the smoothed points. Default is 0.20.
  • smooth.bins: Number of bins in both directions for the density estimation. The more bins, the smoother the plot.
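These three parameters resemble the nrpoints, transformation, and nbin arguments of the base R function smoothScatter(). The following is only a sketch of that analogy, assuming (this document does not confirm it) that the two behave similarly. The simulated x and y are hypothetical stand-ins for the workspace vectors plotted above, and smoothScatter() needs the recommended KernSmooth package.

```r
# Hypothetical stand-ins for the workspace vectors x and y
set.seed(123)
x <- rnorm(5000)
y <- rnorm(5000)

# Effect of the exponent on relative densities between 0 and 1:
# a smaller exponent raises low densities more, so sparse regions
# are mapped further along the color scale and appear darker
dens <- c(0.01, 0.10, 0.50, 1.00)
round(dens^0.25, 2)
round(dens^0.10, 2)

# Base R analog (an assumption, not lessR's documented internals):
#   nbin           ~ number of bins for the density estimate
#   nrpoints       ~ points drawn in the lowest-density regions
#   transformation ~ mapping from the density scale to the color scale
smoothScatter(x, y, nbin=128, nrpoints=100,
              transformation=function(z) z^0.10)
```

Rerunning the last call with a different exponent, nbin, or nrpoints shows the effect of each setting in isolation.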

Here, darken the smoothing gradient by decreasing the value of smooth.exp from its default of 0.20 to 0.10.

Plot(x, y, smooth.exp=.1)
## >>> Note: x is from the workspace, not in a data frame (table)
## >>> Note: y is from the workspace, not in a data frame (table)

## >>> Suggestions
## Plot(x, y, fit="lm", fit_se=c(.90,.99))  # fit line, standard errors
## Plot(x, y, out_cut=.10)  # label top 10% potential outliers
## Plot(x, y, ellipse=0.95, add="means")  # 0.95 ellipse with means
## Plot(x, y, enhance=TRUE)  # many options, including the above 
## 
## 
## >>> Pearson's product-moment correlation 
##  
## Number of paired values with neither missing, n = 5000 
## 
## 
## Sample Correlation of x and y: r = -0.044 
## 
## 
## Alternative Hypothesis: True correlation is not equal to 0 
##   t-value: -3.092,  df: 4998,  p-value: 0.002 
##  
## 95% Confidence Interval of Population Correlation 
##   Lower Bound: -0.071      Upper Bound: -0.016

ggplot2

ggplot2 requires the input data to be in a data frame. Create one with the standard R function data.frame().

myd <- data.frame(x,y)
names(myd) <- c("x", "y")

Do 2-dimensional binning with the stat_binhex() function.

ggplot(myd, aes(x,y)) + stat_binhex(bins=10, aes(alpha = ..count..))
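In ggplot2 3.4.0 and later, the ..count.. notation is deprecated in favor of after_stat(count). The sketch below shows the same hexagonal binning in the newer notation. The simulated data frame is a hypothetical stand-in for myd, and stat_bin_hex(), of which stat_binhex() is an alias, requires the hexbin package to be installed.

```r
library(ggplot2)

# Hypothetical stand-in for the myd data frame
set.seed(123)
myd <- data.frame(x=rnorm(5000), y=rnorm(5000))

# Map the computed bin count to transparency with after_stat()
p <- ggplot(myd, aes(x, y)) +
  stat_bin_hex(bins=10, aes(alpha=after_stat(count)))
print(p)
```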

Or draw contour curves with the stat_density2d() function.

ggplot(myd, aes(x,y)) + stat_density2d()

Text Annotation

d <- rd("http://lessRstats.com/data/employee.csv")
## 
## >>> Suggestions
## To read a csv or Excel file of variable labels, var_labels=TRUE
##    Each row of the file:  Variable Name, Variable Label
## Details about your data, Enter:  details()  for d, or  details(name)
## 
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
## 
##     Variable                  Missing  Unique 
##         Name     Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1      Name character     37       0      37   Ritchie, Darnell ... Cassinelli, Anastis
##  2     Years   integer     36       1      16   7  NA  15 ... 1  2  10
##  3    Gender character     37       0       2   M  M  M ... F  F  M
##  4      Dept character     36       1       5   ADMN  SALE  SALE ... MKTG  SALE  FINC
##  5    Salary    double     37       0      37   53788.26  94494.58 ... 56508.32  57562.36
##  6    JobSat character     35       2       3   med  low  low ... high  low  high
##  7      Plan   integer     37       0       3   1  1  3 ... 2  2  1
##  8       Pre   integer     37       0      27   82  62  96 ... 83  59  80
##  9      Post   integer     37       0      22   92  74  97 ... 90  71  87
## ------------------------------------------------------------------------------------------
style(quiet=TRUE)

lessR

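The add parameter specifies an annotation, such as a vertical line ("v.line"), a horizontal line ("h.line"), or a rectangle ("rect"). The parameters x1, y1, x2, and y2 give its coordinates, either as numbers or as the keywords "mean.x" and "mean.y".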

Plot(Years, Salary, add="v.line", x1="mean.x")

Plot(Years, Salary, add="v.line", x1=seq(6,18,3))

Plot(Years, Salary, add=c("v.line", "h.line"), x1="mean.x", y1="mean.y")
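For comparison, the same mean lines can be sketched with the base R functions plot() and abline(). The small data frame emp is made up for illustration and is not the Employee data. Missing values are dropped with na.rm=TRUE, which is an assumption about how the "mean.x" and "mean.y" keywords are evaluated.

```r
# Made-up rows standing in for the Employee data, with one missing value
emp <- data.frame(Years=c(7, NA, 15, 1, 2, 10),
                  Salary=c(53788, 94494, 96000, 56508, 57562, 61000))

# Means of each variable, ignoring missing values
mx <- mean(emp$Years, na.rm=TRUE)
my <- mean(emp$Salary, na.rm=TRUE)

# Scatterplot with a vertical line at mean(x) and a horizontal line at mean(y)
plot(emp$Years, emp$Salary, xlab="Years", ylab="Salary")
abline(v=mx, h=my, lty="dashed")
```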

style(add.trans=.8, add.fill="gold", add.color="gold4", add.lwd=3.5)
Plot(Years, Salary, add="rect", x1=12, y1=80000, x2=16, y2=115000)

style()
## theme set to "colors"
style(add.fill=c("gold3", "green"), add.trans=c(.8,.4))
Plot(Years, Salary, add=c("rect", "rect"),
     x1=c(10, 2), y1=c(60000, 45000), x2=c(12, 75000), y2=c(80000, 55000))

## >>> Suggestions
## Plot(Years, Salary, fit="lm", fit_se=c(.90,.99))  # fit line, standard errors
## Plot(Years, Salary, out_cut=.10)  # label top 10% potential outliers
## Plot(Years, Salary, ellipse=0.95, add="means")  # 0.95 ellipse with means
## Plot(Years, Salary, enhance=TRUE)  # many options, including the above
## Plot(Years, Salary, shape="diamond")  # change plot character 
## 
## 
## >>> Pearson's product-moment correlation 
##  
## Number of paired values with neither missing, n = 36 
## 
## 
## Sample Correlation of Years and Salary: r = 0.852 
## 
## 
## Alternative Hypothesis: True correlation is not equal to 0 
##   t-value: 9.501,  df: 34,  p-value: 0.000 
##  
## 95% Confidence Interval of Population Correlation 
##   Lower Bound: 0.727      Upper Bound: 0.923
style()
## theme set to "colors"

ggplot2

geom_text() draws a text label at a given x and y coordinate, here for all points. vjust is a vertical adjustment to separate the text from the point.

ggplot(d, aes(Years, Salary)) + geom_point() +
  geom_text(aes(label=Name), size=3, color="darkgray", vjust=1.5)

The annotate() function places a single annotation at a specified location. Available types include text, segment, rect, and pointrange.

ggplot(d, aes(Years, Salary)) + geom_point() +
  annotate("text", x=20, y=121000, label="Highest Earner", color="blue")

ggplot(d, aes(Years, Salary)) + geom_point() +
  annotate("segment", x=5, xend=14, y=22000, yend=30000, color="blue")

ggplot(d, aes(Years, Salary)) + geom_point() +
  annotate("rect", xmin=3, xmax=8, ymin=40000, ymax=65000, alpha=.2)

ggplot(d, aes(Years, Salary)) + geom_point() +
  annotate("pointrange", x=3, y=45000, ymin=30000, ymax=60000, alpha=.4, color="blue")
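The coordinates in the text annotation above are typed in by hand. As a sketch of an alternative, the location can be computed from the data with which.max(). The data frame emp and its values are made up for illustration and are not the Employee data.

```r
library(ggplot2)

# Made-up rows standing in for the Employee data
emp <- data.frame(
  Name=c("A", "B", "C", "D"),
  Years=c(7, 15, 2, 21),
  Salary=c(53788, 94494, 57562, 124000)
)

# Row with the largest Salary
top <- emp[which.max(emp$Salary), ]

# Place the label at that row's coordinates, shifted below the point
p <- ggplot(emp, aes(Years, Salary)) + geom_point() +
  annotate("text", x=top$Years, y=top$Salary, label="Highest Earner",
           color="blue", vjust=1.5)
print(p)
```

Computing the position keeps the label attached to the correct point if the data change.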

Heat Maps

A correlation matrix organizes all the pair-wise correlations among a set of variables into a square, symmetric matrix, a summary of the linear relationships among the variables. To search for patterns among this table of numbers, represent the correlations visually, here with a heat map, which replaces each correlation in a correlation matrix with a colored tile, a small square of color. The larger the correlation, the more intense the tile color.
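As a minimal sketch of the idea, the base R functions cor() and heatmap() can draw such a tile display. The simulated responses and item names are hypothetical, and the color palette is an arbitrary choice.

```r
# Simulated 0-5 responses to six hypothetical items
set.seed(123)
resp <- as.data.frame(matrix(sample(0:5, 600, replace=TRUE), ncol=6))
names(resp) <- paste0("item", 1:6)

# Correlation matrix: square, symmetric, with 1s on the diagonal
R <- cor(resp)

# One colored tile per correlation, without reordering or dendrograms;
# the reversed palette runs from light to dark as the value increases
heatmap(R, Rowv=NA, Colv=NA, symm=TRUE, scale="none",
        col=hcl.colors(20, "Blues", rev=TRUE))
```

Because the simulated items are independent, only the diagonal tiles stand out. With real scale items, blocks of darker tiles indicate groups of related items.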

First subset the responses to just the first 16 items of the Mach IV scale. The responses are on a Likert scale from 0 to 5, Strongly Disagree to Strongly Agree.

d <- rd("Mach4", quiet=TRUE)
d <- subset(d, select=c(m01:m16))
head(d)
##   m01 m02 m03 m04 m05 m06 m07 m08 m09 m10 m11 m12 m13 m14 m15 m16
## 1   0   4   1   5   0   5   4   1   5   4   0   0   0   0   4   0
## 2   0   1   4   4   0   3   3   0   4   4   0   1   1   1   2   4
## 3   2   1   0   5   4   4   0   5   3   4   1   4   0   0   2   0
## 4   0   5   2   4   0   4   4   4   5   2   0   0   0   1   1   1
## 5   2   2   3   3   2   3   2   3   1   4   1   3   1   2   2   2
## 6   1   3   3   5   1   3   3   1   5   3   0   0   0   3   2   5

Now compute the correlation matrix with the lessR function Correlation(), abbreviated cr().

cr(d)

##  
## The correlation matrix contains 16 variables 
## 
## 
## Missing data deletion:  pairwise 
## 
## >>> No missing data 
## 
## 
## Variables in the correlation matrix: 
## m01, m02, m03, m04, m05, m06, m07, m08, m09, m10, m11, m12, m13, m14, m15, m16 
## 
## To view the correlation matrix, enter the name of the returned object
## followed by  $R  such as  mycor$R
## 
cr(d, bottom=5, right=5)

##  
## The correlation matrix contains 16 variables 
## 
## 
## Missing data deletion:  pairwise 
## 
## >>> No missing data 
## 
## 
## Variables in the correlation matrix: 
## m01, m02, m03, m04, m05, m06, m07, m08, m09, m10, m11, m12, m13, m14, m15, m16 
## 
## To view the correlation matrix, enter the name of the returned object
## followed by  $R  such as  mycor$R
## 

Reorder the correlations to provide a more convenient heat map. Then perform a hierarchical cluster analysis. Specify four clusters.

R <- cr(d)$R

corReorder(R)

corReorder(R, n.clusters=4)
## 
##  4 Cluster Solution 
## -------------------
## m01 m05 m08 m12 m13 m02 m15 m03 m06 m07 m09 m10 m04 m11 m14 m16 
##   1   1   1   1   1   2   2   3   3   3   3   3   4   4   4   4
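The general idea of reordering by hierarchical clustering can be sketched in base R with hclust() and cutree(). This is an illustration only: the dissimilarity 1 - r and the default complete linkage are assumptions, corReorder() may use a different method, and the simulated items are hypothetical.

```r
# Simulated 0-5 responses to eight hypothetical items
set.seed(123)
resp <- as.data.frame(matrix(sample(0:5, 800, replace=TRUE), ncol=8))
names(resp) <- paste0("item", 1:8)
R <- cor(resp)

# Treat 1 - r as a dissimilarity, so highly correlated items are close
hc <- hclust(as.dist(1 - R))

# Order of the items along the dendrogram, applied to rows and columns
ord <- hc$order
R_ordered <- R[ord, ord]

# Cluster membership for a four-cluster solution, listed in the new order
clusters <- cutree(hc, k=4)
clusters[ord]
```

Applying the same ordering to rows and columns keeps the matrix symmetric and places items from the same cluster next to each other in a heat map.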