Set up analyses.
suppressPackageStartupMessages(library(lessR))
suppressPackageStartupMessages(library(tidyverse))
d <- Read("Employee") # get some data
##
## >>> Suggestions
## Details about your data, Enter: details() for d, or details(name)
##
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
##
##     Variable                  Missing  Unique
##         Name      Type Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1     Years   integer     36       1      16   7 NA 15 ... 1 2 10
##  2    Gender character     37       0       2   M M M ... F F M
##  3      Dept character     36       1       5   ADMN SALE SALE ... MKTG SALE FINC
##  4    Salary    double     37       0      37   53788.26 94494.58 ... 56508.32 57562.36
##  5    JobSat character     35       2       3   med low low ... high low high
##  6      Plan   integer     37       0       3   1 1 3 ... 2 2 1
##  7       Pre   integer     37       0      27   82 62 96 ... 83 59 80
##  8      Post   integer     37       0      22   92 74 97 ... 90 71 87
## ------------------------------------------------------------------------------------------
With the lessR function Plot(), scatterplots are defined quite generally, as any combination of numeric (continuous) or categorical variables. Here we focus on the traditional concept: the plot of two continuous variables.
The base R function is plot(), with a lower-case p.
plot(d$Years, d$Salary)
The lessR function is Plot(), here specified with two variables.
Plot(Years, Salary)
## >>> Suggestions
## Plot(Years, Salary, fit="lm", fit_se=c(.90,.99)) # fit line, standard errors
## Plot(Years, Salary, out_cut=.10) # label top 10% potential outliers
## Plot(Years, Salary, ellipse=0.95, add="means") # 0.95 ellipse with means
## Plot(Years, Salary, enhance=TRUE) # many options, including the above
## Plot(Years, Salary, shape="diamond") # change plot character
##
##
## >>> Pearson's product-moment correlation
##
## Number of paired values with neither missing, n = 36
##
##
## Sample Correlation of Years and Salary: r = 0.852
##
##
## Alternative Hypothesis: True correlation is not equal to 0
## t-value: 9.501, df: 34, p-value: 0.000
##
## 95% Confidence Interval of Population Correlation
## Lower Bound: 0.727 Upper Bound: 0.923
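The t-value and the confidence interval in this output follow from r and n alone. As a minimal sketch, recompute them in base R from the rounded values printed above (r = 0.852, n = 36). Because r is rounded, expect agreement with the output only to about two decimals. The interval uses the Fisher z transformation, which is an assumption about the method, and the variable names are illustrative.

```r
# Recompute the test statistic and the 95% interval from the reported r and n
r <- 0.852   # sample correlation, rounded as printed in the output
n <- 36      # number of paired values with neither missing

# t statistic for the null hypothesis of zero correlation, n - 2 df
df <- n - 2
t_value <- r * sqrt(df) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_value), df)

# Confidence interval via the Fisher z transformation (assumed method)
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

round(c(t = t_value, lower = ci[1], upper = ci[2]), 3)
```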
Enhance the scatterplot with the enhance parameter, which adds several features at once: detect outliers, get the 95% confidence ellipse, divide the plot into quadrants based on the means, and display the best-fit least-squares line, both with and without the outliers. The dotted line is the fit without the outliers.
Plot(Years, Salary, enhance=TRUE)
## [Ellipse with Murdoch and Chow's function ellipse from the ellipse package]
## >>> Suggestions
## Plot(Years, Salary, fit="lm", fit_se=c(.90,.99)) # fit line, standard errors
## Plot(Years, Salary, out_cut=.10) # label top 10% potential outliers
## Plot(Years, Salary, ellipse=0.95, add="means") # 0.95 ellipse with means
## Plot(Years, Salary, shape="diamond") # change plot character
##
##
## >>> Pearson's product-moment correlation
##
## Number of paired values with neither missing, n = 36
##
##
## Sample Correlation of Years and Salary: r = 0.852
##
##
## Alternative Hypothesis: True correlation is not equal to 0
## t-value: 9.501, df: 34, p-value: 0.000
##
## 95% Confidence Interval of Population Correlation
## Lower Bound: 0.727 Upper Bound: 0.923
## >>> Outlier analysis with Mahalanobis Distance
##
## MD ID
## ----- -----
## 8.14 Correll, Trevon
## 7.84 Capelle, Adam
##
## 5.63 Korhalkar, Jessica
## 5.58 James, Leslie
## 3.75 Hoang, Binh
## ... ...
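The MD column lists Mahalanobis distances, which measure how far each point lies from the center of the data after accounting for the correlation between the two variables. As a rough illustration on simulated data (not the Employee data), the base R function mahalanobis() returns the squared distance. Whether Plot() reports the squared or unsquared value is not stated here, and the names xy and md2 are made up for the example.

```r
set.seed(1)
# Simulated, positively correlated pairs standing in for Years and Salary
n  <- 36
x  <- rnorm(n)
y  <- 0.8 * x + rnorm(n, sd = 0.6)
xy <- cbind(x, y)

# Squared Mahalanobis distance of each row from the sample centroid
md2 <- mahalanobis(xy, center = colMeans(xy), cov = cov(xy))

# The five rows farthest from the center, largest first
head(sort(md2, decreasing = TRUE), 5)
```

With the sample mean and sample covariance, the squared distances always sum to (n - 1) times the number of variables, a handy check on the computation.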
With ggplot, the first argument is the data frame, and the second argument is the aes() (aesthetics) function. At a minimum, specify the x variable, and any y variable, with aes(). For a scatterplot, use geom_point().
ggplot(d, aes(Years, Salary)) + geom_point()
Now that we have seen the console output for the accompanying statistical analysis, turn off console output for the remaining lessR output for a more compact presentation.
style(quiet=TRUE)
The parameter ellipse specifies a data ellipse. Setting its value to TRUE sets the default level of 0.95. A translucent fill color, consistent with the current color theme, is also provided. Potential outliers, consistent with the level of the ellipse, are labeled by their row names by default.
Plot(Years, Salary, ellipse=TRUE)
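For bivariate normal data, the data ellipse at a given level is the set of points whose squared Mahalanobis distance from the centroid equals the chi-square quantile with 2 degrees of freedom at that level. The sketch below draws such an ellipse in base R on simulated data. It only illustrates the idea and is not the code that Plot() or the ellipse package uses; all object names are hypothetical.

```r
set.seed(2)
x  <- rnorm(200)
y  <- 0.7 * x + rnorm(200, sd = 0.7)
xy <- cbind(x, y)

level  <- 0.95
center <- colMeans(xy)
S      <- cov(xy)

# Map the unit circle onto the ellipse: scale by the chi-square radius,
# transform by the Cholesky factor of the covariance, then shift to the center
theta  <- seq(0, 2 * pi, length.out = 200)
circle <- cbind(cos(theta), sin(theta))
radius <- sqrt(qchisq(level, df = 2))
ell    <- sweep(radius * circle %*% chol(S), 2, center, "+")

plot(xy, pch = 16, col = "gray40")
lines(ell, lwd = 2)
```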
One or more individual levels of the ellipse can be specified to obtain one or more data ellipses. Here define a vector of three values with the R combine function c() to generate three ellipses at three different levels.
Plot(Years, Salary, ellipse=c(.3,.5,.9))
Or, specify many ellipses, here with the base R function seq(). Remove the points from the plot by setting the size parameter, for point size, to 0.
Plot(Years, Salary, ellipse=seq(.1,.9,.1), size=0)
Can also turn off the drawing of the ellipse itself, leaving just the translucent fill color, which darkens as more ellipses overlap. Here only the overlapping translucent interiors of the 0.05, 0.10, 0.15 … 0.90, 0.95 data ellipses are displayed. The result is a kind of idealized bivariate relationship estimated from the data.
style(ellipse_color="off")
Plot(Years, Salary, ellipse=seq(.05,.95,.05), size=0)
Reset the style back to the default, which is theme="colors".
style()
## theme set to "colors"
The number of labeled points farthest from the ellipse center can be changed with the out_cut parameter. Here the most extreme 20% of the plotted points are identified and labeled.
Plot(Years, Salary, ellipse=0.95, out_cut=.20)
## [Ellipse with Murdoch and Chow's function ellipse from the ellipse package]
## >>> Outlier analysis with Mahalanobis Distance
##
## MD ID
## ----- -----
## 8.14 Correll, Trevon
## 7.84 Capelle, Adam
## 5.63 Korhalkar, Jessica
## 5.58 James, Leslie
## 3.75 Hoang, Binh
## 3.10 Skrotzki, Sara
## 2.95 Billing, Susan
##
## 2.64 Singh, Niral
## 2.56 Cassinelli, Anastis
## 2.32 Knox, Michael
## ... ...
To label the points with the values of a variable in the data table instead of the row names, specify the variable name with the ID parameter.
Add the default 0.95 data ellipse with the stat_ellipse() function. Here save the layers one at a time in a ggplot object, named p.
p <- ggplot(d, aes(Years, Salary)) + geom_point() + stat_ellipse()
p
Add the 0.75 data ellipse as another layer to p.
p <- p + stat_ellipse(level=0.75)
Display the graph with all its layers.
p
Filled ellipse with narrow border (size=.5).
ggplot(d, aes(Years, Salary)) + geom_point() +
  stat_ellipse(geom="polygon", color="black", size=.5, alpha=.1)
Multiple filled ellipses.
ggplot(d, aes(Years, Salary)) + geom_point() +
  stat_ellipse(type="norm", level=.9,
               geom="polygon", color="black", size=.5, alpha=.05) +
  stat_ellipse(type="norm", level=.6,
               geom="polygon", color="black", size=.5, alpha=.05) +
  stat_ellipse(type="norm", level=.3,
               geom="polygon", color="black", size=.5, alpha=.05)
Change the plot background color and border of the background with the theme() function.
ggplot(d, aes(Years, Salary)) +
  geom_point() +
  theme(panel.background = element_rect(fill="powderblue", color="blue"))
The fit parameter adds a fit line to the plot. With fit=TRUE, the default is the loess fit line.
Plot(Years, Salary, fit=TRUE)
Add standard error bands, here for three confidence levels, with the fit_se parameter. When fit_se is specified, there is no need to also specify the fit parameter.
Plot(Years, Salary, fit_se=c(.90, .95, .99))
For a brightly colored fit line without the points, in the ggplot2 default style, set the fit color and set size to 0.
style(fit_color="darkred")
Plot(Years, Salary, size=0, fit_se=.95)
style()
## theme set to "colors"
The default stat for geom_smooth() is stat_smooth(), which provides the smoothed conditional mean with a 0.95 confidence interval. Here the points are not plotted as there is no geom_point().
ggplot(d, aes(Years, Salary)) + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
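The message reports that geom_smooth() chose loess as the smoother. A comparable curve can be computed directly with the base R functions loess() and predict(). The sketch below uses simulated data and the default loess settings, which may differ in detail from what ggplot2 passes along; the object names are made up.

```r
set.seed(3)
dat <- data.frame(x = runif(60, 0, 20))
dat$y <- 40000 + 3000 * dat$x + rnorm(60, sd = 8000)

# Local regression with the default span and degree
fit <- loess(y ~ x, data = dat)

# Fitted curve and pointwise standard errors on a grid of x values
grid <- data.frame(x = seq(min(dat$x), max(dat$x), length.out = 50))
pred <- predict(fit, newdata = grid, se = TRUE)

plot(dat$x, dat$y, pch = 16, col = "gray40")
lines(grid$x, pred$fit, lwd = 2)
```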
Do both the smoothed line and the scatterplot.
ggplot(d, aes(Years, Salary)) + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Add a smoothed line with a 0.5 confidence interval.
ggplot(d, aes(Years, Salary)) + geom_smooth(level=0.5)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Can specify a linear regression line.
ggplot(d, aes(Years, Salary)) + geom_point() + geom_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'
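The shaded region around a linear fit is a pointwise confidence interval for the fitted mean. As a sketch on simulated data, predict() applied to an lm object returns such intervals at any level. The three levels match those of the earlier fit_se example, and the helper function band() is hypothetical.

```r
set.seed(4)
dat <- data.frame(x = runif(40, 0, 20))
dat$y <- 40000 + 3000 * dat$x + rnorm(40, sd = 8000)

fit  <- lm(y ~ x, data = dat)
grid <- data.frame(x = seq(0, 20, by = 1))

# Confidence interval for the mean response at each grid value
band <- function(level) {
  predict(fit, newdata = grid, interval = "confidence", level = level)
}
b90 <- band(0.90)
b95 <- band(0.95)
b99 <- band(0.99)

# Width of each band at the first grid point: the higher the level, the wider
c(w90 = b90[1, "upr"] - b90[1, "lwr"],
  w95 = b95[1, "upr"] - b95[1, "lwr"],
  w99 = b99[1, "upr"] - b99[1, "lwr"])
```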
Custom linear regression line, remove shaded confidence region.
ggplot(d, aes(Years, Salary)) + geom_point() +
  geom_smooth(method=lm, color="darkgray", linetype="dashed", se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
Add the 0.95 confidence ellipse with the quadrants delineated using geom_vline() and geom_hline().
ggplot(d, aes(Years, Salary)) + geom_point() +
  stat_ellipse(type="norm") +
  geom_vline(aes(xintercept=mean(Years, na.rm=TRUE)), color="gray70") +
  geom_hline(aes(yintercept=mean(Salary, na.rm=TRUE)), color="gray70")
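Splitting the plot at the two means relates directly to the sign of the correlation: with a positive relationship, most points fall in the upper-right and lower-left quadrants. A small base R sketch on simulated data, with made-up variable names, counts the points in each quadrant.

```r
set.seed(5)
x <- rnorm(100)
y <- 0.8 * x + rnorm(100, sd = 0.6)

# Classify each point by its side of the mean on each axis
side_x <- ifelse(x > mean(x), "right", "left")
side_y <- ifelse(y > mean(y), "upper", "lower")
quad   <- table(side_y, side_x)
quad

# Share of points in the two quadrants that signal a positive relationship
concordant <- (quad["upper", "right"] + quad["lower", "left"]) / sum(quad)
concordant
```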
Can create visualizations of a relationship between two variables across levels of a third, categorical variable. Here, plot multiple scatterplots between two continuous variables, one scatterplot for each level, that is, group, for the third variable, the grouping variable.
With lessR, introduce the grouping variable with the by parameter.
Plot(Years, Salary, by=Gender)
The parameter pt_fill overrides the color from the current color theme. Specify as many colors as there are levels.
style(pt_fill=c("green", "brown"))
The by parameter plots points from different levels of the corresponding variable with different fill colors. For only two fill colors, the current color theme is maintained by plotting the points from one group with the fill color and the other set of points without.
Plot(Years, Salary, by=Gender)
style()
## theme set to "colors"
Can also specify a separate shape for each group with the shape parameter.
Plot(Years, Salary, by=Gender, shape=c("F","M"))
Fit lines can be added as well, with a fit line for each group. Here with just two levels the color theme is preserved as well.
Plot(Years, Salary, by=Gender, fit="ls")
Specify color as the aspect of the graph that demonstrates the grouping.
ggplot(d, aes(Years, Salary)) + geom_point(aes(color=Gender))
Or, specify shape as the grouping feature.
ggplot(d, aes(Years, Salary)) + geom_point(aes(shape=Gender))
Pre is another continuous variable in the data table. The parameter size can be a constant, to change the size of the plotted points, but can also be set to a variable, in which case the size of each plotted bubble corresponds to the value of that variable.
Plot(Years, Salary, size=3)
Plot(Years, Salary, size=Pre)
The bubbles are large, so reduce their absolute size with the radius parameter, which is scaled in inches. The default is 0.25, apparently the radius, though the documentation does not specify. Also, the power parameter changes the size of the bubbles relative to each other. Increasing its value more effectively differentiates among values.
Plot(Years, Salary, size=Pre, radius=0.10, power=2)
ggplot2 allows several different mappings from variable to aesthetic. Scale by area.
ggplot(d, aes(Years, Salary)) + geom_point(aes(size=Pre))
Scale by color.
ggplot(d, aes(Years, Salary)) + geom_point(aes(color=Pre))
Specify either the x-variable or the y-variable as a vector, specified with the standard R c() function. The input to the function still follows the listing of two variables, but now the first variable is a vector, here defined as the variables Pre and Post.
Plot(c(Pre, Post), Salary)
Generate a fit line for each plotted pair of variables.
Plot(c(Pre, Post), Salary, fit="ls")
Can customize, such as by increasing the size of the plotted points, turning off the background color, and turning off the grid.
style(panel_fill="off", grid_color="off")
Plot(c(Pre, Post), Salary, size=3, fit=TRUE)
style()
## theme set to "colors"
Can specify the y-variable as the vector.
Plot(Salary, c(Pre, Post))
## >>> Suggestions
## Plot(Salary, c(Pre, Post), fit="lm", fit_se=c(.90,.99)) # fit line, standard errors
## Plot(Salary, c(Pre, Post), out_cut=.10) # label top 10% potential outliers
## Plot(Salary, c(Pre, Post), ellipse=0.95, add="means") # 0.95 ellipse with means
## Plot(Salary, c(Pre, Post), enhance=TRUE) # many options, including the above
## Plot(Salary, c(Pre, Post), shape="diamond") # change plot character
##
##
## >>> Pearson's product-moment correlation
##
## Number of paired values with neither missing, n = 37
##
##
## Sample Correlation of Salary and Pre: r = -0.007
##
##
## Alternative Hypothesis: True correlation is not equal to 0
## t-value: -0.043, df: 35, p-value: 0.966
##
## 95% Confidence Interval of Population Correlation
## Lower Bound: -0.330 Upper Bound: 0.318
##
## >>> Pearson's product-moment correlation
##
## Number of paired values with neither missing, n = 37
##
##
## Sample Correlation of Salary and Post: r = -0.070
##
##
## Alternative Hypothesis: True correlation is not equal to 0
## t-value: -0.416, df: 35, p-value: 0.680
##
## 95% Confidence Interval of Population Correlation
## Lower Bound: -0.385 Upper Bound: 0.260
The scatterplot matrix plots the scatterplot, and perhaps additional information such as a fit line, for each pairwise combination of the specified variables. lessR multiple regression also automatically provides a scatterplot matrix of all the variables in the model, as part of a comprehensive regression analysis.
To obtain a scatterplot matrix, use the lessR Plot() function. Specify a vector of numeric variables.
Plot(c(Salary, Years, Pre, Post))
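Base R offers a comparable display with pairs(). The following is a minimal sketch with simulated columns that only borrow the variable names from the example. The panel function panel_fit is hypothetical and adds a least-squares line to each panel.

```r
set.seed(6)
n <- 37
dat <- data.frame(Years = round(runif(n, 1, 20)))
dat$Salary <- 40000 + 3000 * dat$Years + rnorm(n, sd = 8000)
dat$Pre    <- round(rnorm(n, 80, 10))
dat$Post   <- round(dat$Pre + rnorm(n, 3, 5))

# Panel function: points plus a least-squares line
panel_fit <- function(x, y, ...) {
  points(x, y, ...)
  abline(lm(y ~ x), col = "red")
}

pairs(dat[, c("Salary", "Years", "Pre", "Post")], panel = panel_fit, pch = 16)

# The matching correlation matrix
R <- cor(dat)
round(R, 2)
```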
Many of the standard lessR visualization parameters also transfer over to the scatterplot matrix. Here change the color of the background and put a red least squares line through each plot.
style("darkred")
style(panel_fill="aliceblue", fit_color="red")
Plot(c(Salary, Years, Pre, Post), fit="ls")
style()
## theme set to "colors"
This plot also is generated by the lessR Regression() function, abbreviated reg(). The call is commented out here as we are not running this analysis.
# Regression(Salary ~ Years + Pre + Post)
One common scenario is to examine a relationship between two variables at different levels of one or more other categorical variables. We have already seen this type of display, in which the relationship is plotted on the same scatterplot for each of the levels of a third variable, a categorical variable. Trellis graphics extend this idea by plotting the relationship of the variables of interest for each level of another variable in a separate plot called a panel. A primary purpose of the lattice package is to provide these Trellis graphs. The categorical variables are called conditioning variables.
The lessR package directly accesses the lattice functions for Trellis graphs, which provides an interface consistent with all the regular lessR visualization functions, including the current color theme. Activate Trellis graphics for one categorical variable with the by1 parameter.
Plot(Years, Salary, by1=Gender)
## [Trellis graphics from Deepayan Sarkar's lattice package]
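Because these panels come from lattice, the same kind of display can be sketched with lattice directly. The example uses simulated data that only borrow the variable names from the example. In the lattice formula, the conditioning variable follows the vertical bar.

```r
library(lattice)

set.seed(7)
dat <- data.frame(
  Years  = round(runif(40, 1, 20)),
  Gender = factor(rep(c("F", "M"), each = 20))
)
dat$Salary <- 40000 + 3000 * dat$Years + rnorm(40, sd = 8000)

# One panel per level of Gender, each with points ("p") and a regression line ("r")
tp <- xyplot(Salary ~ Years | Gender, data = dat, type = c("p", "r"))
print(tp)

# Same plot with the panels stacked: 1 column, 2 rows
print(update(tp, layout = c(1, 2)))
```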
Most of the standard lessR parameters transfer over to the Trellis panels. Here change the background color and put a regression line through each panel. Nothing changes from the standard 1-panel lessR display, except for the inclusion of the by1 parameter.
style(panel_fill="aliceblue")
Plot(Years, Salary, by1=Gender, fit="ls")
## [Trellis graphics from Deepayan Sarkar's lattice package]
Reset style to default if desired
style()
## theme set to "colors"
Control the layout of the panels with the n_col and n_row parameters. Here stack the two panels into a single column, instead of the default of a single row.
Plot(Years, Salary, by1=Gender, fit="ls", n_col=1)
## [Trellis graphics from Deepayan Sarkar's lattice package]
Compare these parameters with the by parameter, which groups all of the plots on the same graph.
Plot(Years, Salary, by=Gender, fit="ls")
Can also combine conditioning with by1 and by2, with grouping specified by by. Here we have by1 and by. The variable Plan has three levels.
Plot(Years, Salary, by1=Gender, by=Plan, fit="ls")
## [Trellis graphics from Deepayan Sarkar's lattice package]
Running out of data values for such a small data set, but here we have two levels of conditioning and grouping, along with the least squares line.
Plot(Years, Salary, by1=Dept, by2=Gender, by=Plan, fit="ls")
## [Trellis graphics from Deepayan Sarkar's lattice package]
Now for two conditioning variables. A panel is produced for each cross-classification of the levels of the two categorical variables.
Plot(Years, Salary, by1=Dept, by2=Gender)
## [Trellis graphics from Deepayan Sarkar's lattice package]
The ggplot2 version of the same concept of Trellis graphics invokes what it calls facets. Get the numeric-numeric scatterplot separated into Gender facets, first as rows and then as columns.
ggplot(d, aes(Years, Salary)) + geom_point() + facet_grid(rows=vars(Gender))
ggplot(d, aes(Years, Salary)) + geom_point() + facet_grid(cols=vars(Gender))
Include linear best fit line without standard errors for each plot.
ggplot(d, aes(Years, Salary)) +
  geom_point() + geom_smooth(method="lm", se=FALSE) + facet_grid(rows=vars(Gender))
## `geom_smooth()` using formula 'y ~ x'
For many, many points to plot, the over-plotting problem becomes severe, even with translucent points. By default, when there are at least 2500 rows of data to analyze, the Plot function implements bivariate smoothing across the scatterplot.
This example also illustrates that the data entered into a lessR function need not be in a data frame. Here x and y are defined in the R user workspace, more formally called the global environment. The contents of this workspace are shown in RStudio with the Environment tab at the top-right corner, its default location.
x <- rnorm(5000)
y <- rnorm(5000)
First the base R plot() function.
plot(x,y)
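With 5000 points, the base R plot is a solid blob in the middle. Base R also provides smoothScatter(), which replaces the points with a smoothed density and marks only the points in the sparsest regions. It is shown here as a point of comparison, not as a description of how Plot() works internally. The data are regenerated with a seed so that the sketch stands alone.

```r
set.seed(8)
x <- rnorm(5000)
y <- rnorm(5000)

# Smoothed density in place of individual points;
# nrpoints marks that many points from the lowest-density regions
smoothScatter(x, y, nbin = 128, nrpoints = 100)

# For reference, the sample correlation of two independent normal samples
r <- cor(x, y)
r
```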
The lessR Plot() function automatically smooths from 2500 points on. It also references variables in the general R user workspace, as in this example, as well as variables in a data frame.
Plot(x, y)
## >>> Note: x is from the workspace, not in a data frame (table)
## >>> Note: y is from the workspace, not in a data frame (table)
## >>> Suggestions
## Plot(x, y, fit="lm", fit_se=c(.90,.99)) # fit line, standard errors
## Plot(x, y, out_cut=.10) # label top 10% potential outliers
## Plot(x, y, ellipse=0.95, add="means") # 0.95 ellipse with means
## Plot(x, y, enhance=TRUE) # many options, including the above
## Plot(x, y, shape="diamond") # change plot character
##
##
## >>> Pearson's product-moment correlation
##
## Number of paired values with neither missing, n = 5000
##
##
## Sample Correlation of x and y: r = -0.044
##
##
## Alternative Hypothesis: True correlation is not equal to 0
## t-value: -3.092, df: 4998, p-value: 0.002
##
## 95% Confidence Interval of Population Correlation
## Lower Bound: -0.071 Upper Bound: -0.016
Blue smoothing cloud.
style("dodgerblue")
Plot(x, y)
style()
## theme set to "colors"
If desired, smoothing can be turned off, which follows the general lessR guideline that defaults are provided but can always be overridden.
Plot(x, y, smooth=FALSE)
Besides the general smooth parameter for turning smoothing on or off, there are three other parameters.
smooth_points: Number of points superimposed on the density plot in the areas of the lowest density to help identify outliers. Defaults to 100.
smooth_exp: The exponent of the function that maps the density scale to the color scale, which controls the darkness of the smoothed points. Default is 0.20.
smooth_bins: Number of bins in both directions for the density estimation. The more bins, the smoother the plot.
Here darken the smoothing gradient by decreasing the value of smooth_exp from its default.
Plot(x, y, smooth_exp=.1)
ggplot2 wants the input data in a data frame. Create with the standard R function data.frame().
myd <- data.frame(x,y)
names(myd) <- c("x", "y")
Do 2-dimensional binning with the stat_binhex() function.
ggplot(myd, aes(x,y)) + stat_binhex(bins=10, aes(alpha = after_stat(count)))
Or draw contour curves with the stat_density2d() function.
ggplot(myd, aes(x,y)) + stat_density2d()
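The idea behind 2-dimensional binning can be sketched in base R. This sketch uses rectangular bins rather than hexagons, purely to stay within base R, and the simulated variables u and v are illustrative only.

```r
set.seed(1)            # reproducible simulated data, for illustration only
u <- rnorm(500)
v <- rnorm(500)

# Cut each variable into 10 equal-width intervals
u_bin <- cut(u, breaks = 10)
v_bin <- cut(v, breaks = 10)

# Cross-tabulate the intervals: each cell counts the points in one 2-D bin
counts <- table(u_bin, v_bin)

dim(counts)   # a 10 x 10 grid of bins
sum(counts)   # every point falls in exactly one bin
```

A binned plot then maps each cell count to a color or transparency level instead of drawing the individual points.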
d <- rd("http://lessRstats.com/data/employee.csv")
##
## >>> Suggestions
## To read a csv or Excel file of variable labels, var_labels=TRUE
## Each row of the file: Variable Name, Variable Label
## Details about your data, Enter: details() for d, or details(name)
##
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
##
## Variable Missing Unique
## Name Type Values Values Values First and last values
## ------------------------------------------------------------------------------------------
## 1 Name character 37 0 37 Ritchie, Darnell ... Cassinelli, Anastis
## 2 Years integer 36 1 16 7 NA 15 ... 1 2 10
## 3 Gender character 37 0 2 M M M ... F F M
## 4 Dept character 36 1 5 ADMN SALE SALE ... MKTG SALE FINC
## 5 Salary double 37 0 37 53788.26 94494.58 ... 56508.32 57562.36
## 6 JobSat character 35 2 3 med low low ... high low high
## 7 Plan integer 37 0 3 1 1 3 ... 2 2 1
## 8 Pre integer 37 0 27 82 62 96 ... 83 59 80
## 9 Post integer 37 0 22 92 74 97 ... 90 71 87
## ------------------------------------------------------------------------------------------
style(quiet=TRUE)
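The add parameter specifies an annotation, such as a vertical line ("v.line"), a horizontal line ("h.line"), or a rectangle ("rect"). Position the annotation with the coordinate parameters x1, y1, x2, and y2.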
Plot(Years, Salary, add="v.line", x1="mean.x")
Plot(Years, Salary, add="v.line", x1=seq(6,18,3))
Plot(Years, Salary, add=c("v.line", "h.line"), x1="mean.x", y1="mean.y")
style(add.trans=.8, add.fill="gold", add.color="gold4", add.lwd=3.5)
Plot(Years, Salary, add="rect", x1=12, y1=80000, x2=16, y2=115000)
style()
## theme set to "colors"
style(add.fill=c("gold3", "green"), add.trans=c(.8,.4))
Plot(Years, Salary, add=c("rect", "rect"),
     x1=c(10, 2), y1=c(60000, 45000), x2=c(12, 75000), y2=c(80000, 55000))
## >>> Suggestions
## Plot(Years, Salary, fit="lm", fit_se=c(.90,.99)) # fit line, standard errors
## Plot(Years, Salary, out_cut=.10) # label top 10% potential outliers
## Plot(Years, Salary, ellipse=0.95, add="means") # 0.95 ellipse with means
## Plot(Years, Salary, enhance=TRUE) # many options, including the above
## Plot(Years, Salary, shape="diamond") # change plot character
##
##
## >>> Pearson's product-moment correlation
##
## Number of paired values with neither missing, n = 36
##
##
## Sample Correlation of Years and Salary: r = 0.852
##
##
## Alternative Hypothesis: True correlation is not equal to 0
## t-value: 9.501, df: 34, p-value: 0.000
##
## 95% Confidence Interval of Population Correlation
## Lower Bound: 0.727 Upper Bound: 0.923
style()
## theme set to "colors"
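For comparison, similar annotations can be drawn with base R graphics. This is a rough sketch, not how lessR implements the add parameter. The data frame emp is hypothetical, made up for illustration only.

```r
# Hypothetical small data set, for illustration only
emp <- data.frame(Years  = c(2, 5, 8, 12, 15, NA),
                  Salary = c(45000, 52000, 61000, 83000, 98000, 70000))

mx <- mean(emp$Years, na.rm = TRUE)   # drop the missing value before averaging
my <- mean(emp$Salary)

plot(emp$Years, emp$Salary, xlab = "Years", ylab = "Salary")
abline(v = mx, h = my, lty = "dashed")          # lines at the two means
rect(12, 80000, 16, 115000, border = "gold4",   # corners (x1, y1) and (x2, y2)
     col = adjustcolor("gold", alpha.f = 0.2))  # mostly transparent fill
```

The first two arguments of rect() give the lower-left corner and the next two the upper-right corner, the same roles played by x1, y1, x2, and y2 above.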
geom_text() draws a text label at a given x and y coordinate, here for all points. The vjust parameter is a vertical adjustment that separates the text from the point.
ggplot(d, aes(Years, Salary)) + geom_point() +
  geom_text(aes(label=Name), size=3, color="darkgray", vjust=1.5)
The annotate() function places a single annotation at a specified location: text, a segment, a rect, or a pointrange.
ggplot(d, aes(Years, Salary)) + geom_point() +
  annotate("text", x=20, y=121000, label="Highest Earner", color="blue")
ggplot(d, aes(Years, Salary)) + geom_point() +
  annotate("segment", x=5, xend=14, y=22000, yend=30000, color="blue")
ggplot(d, aes(Years, Salary)) + geom_point() +
  annotate("rect", xmin=3, xmax=8, ymin=40000, ymax=65000, alpha=.2)
ggplot(d, aes(Years, Salary)) + geom_point() +
  annotate("pointrange", x=3, y=45000, ymin=30000, ymax=60000, alpha=.4, color="blue")
A correlation matrix organizes all the pairwise correlations among a set of variables into a square, symmetric matrix, a summary of the linear relationships among the variables. To search for patterns in this table of numbers, represent the correlations visually, here with a heat map. A heat map replaces each correlation in the matrix with a colored tile, a small square of color. The larger the correlation, the more intense the tile color.
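As a minimal base R sketch of the idea, compute a correlation matrix with cor() and display it with heatmap(). The simulated responses and the item names a through d are hypothetical, for illustration only.

```r
set.seed(2)   # simulated responses, for illustration only
sim <- as.data.frame(matrix(sample(0:5, 200 * 4, replace = TRUE), ncol = 4))
names(sim) <- c("a", "b", "c", "d")   # hypothetical item names

R_sim <- cor(sim)   # square, symmetric matrix with 1s on the diagonal

# One colored tile per correlation; Rowv = NA and Colv = NA keep the
# original variable order, and scale = "none" plots the raw correlations
heatmap(R_sim, Rowv = NA, Colv = NA, scale = "none")
```

Because the simulated items are independent, the off-diagonal tiles here are all pale. Real scale items typically show blocks of more intense color.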
First subset the responses to just the first 16 items of the Mach IV scale. The responses are on a Likert scale from 0 to 5, Strongly Disagree to Strongly Agree.
d <- rd("Mach4", quiet=TRUE)
d <- subset(d, select=c(m01:m16))
head(d)
## m01 m02 m03 m04 m05 m06 m07 m08 m09 m10 m11 m12 m13 m14 m15 m16
## 1 0 4 1 5 0 5 4 1 5 4 0 0 0 0 4 0
## 2 0 1 4 4 0 3 3 0 4 4 0 1 1 1 2 4
## 3 2 1 0 5 4 4 0 5 3 4 1 4 0 0 2 0
## 4 0 5 2 4 0 4 4 4 5 2 0 0 0 1 1 1
## 5 2 2 3 3 2 3 2 3 1 4 1 3 1 2 2 2
## 6 1 3 3 5 1 3 3 1 5 3 0 0 0 3 2 5
Now compute the correlation matrix with the lessR function Correlation(), abbreviated cr().
cr(d)
##
## The correlation matrix contains 16 variables
##
##
## Missing data deletion: pairwise
##
## >>> No missing data
##
##
## Variables in the correlation matrix:
## m01, m02, m03, m04, m05, m06, m07, m08, m09, m10, m11, m12, m13, m14, m15, m16
##
## To view the correlation matrix, enter the name of the returned object
## followed by $R such as mycor$R
##
cr(d, bottom=5, right=5)
##
## The correlation matrix contains 16 variables
##
##
## Missing data deletion: pairwise
##
## >>> No missing data
##
##
## Variables in the correlation matrix:
## m01, m02, m03, m04, m05, m06, m07, m08, m09, m10, m11, m12, m13, m14, m15, m16
##
## To view the correlation matrix, enter the name of the returned object
## followed by $R such as mycor$R
##
Reorder the correlations to provide a more convenient heat map. Then perform a hierarchical cluster analysis. Specify four clusters.
R <- cr(d)$R
corReorder(R)
corReorder(R, n.clusters=4)
##
## 4 Cluster Solution
## -------------------
## m01 m05 m08 m12 m13 m02 m15 m03 m06 m07 m09 m10 m04 m11 m14 m16
## 1 1 1 1 1 2 2 3 3 3 3 3 4 4 4 4
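The general technique, hierarchical clustering of a correlation matrix, can be sketched with base R. The distance and linkage choices that corReorder() makes are not stated here, so treat the distance 1 - r and the default complete linkage of hclust() as assumptions. The simulated items are hypothetical.

```r
set.seed(3)   # simulated data, for illustration only
f1 <- rnorm(300)   # two hypothetical underlying factors
f2 <- rnorm(300)
sim_items <- data.frame(a1 = f1 + rnorm(300, sd = 0.5),
                        a2 = f1 + rnorm(300, sd = 0.5),
                        b1 = f2 + rnorm(300, sd = 0.5),
                        b2 = f2 + rnorm(300, sd = 0.5))
R_items <- cor(sim_items)

# Treat highly correlated items as close: distance = 1 - correlation
hc <- hclust(as.dist(1 - R_items))

# Reorder rows and columns by the dendrogram order
R_ordered <- R_items[hc$order, hc$order]

# Cluster membership for a two-cluster solution
cutree(hc, k = 2)
```

Items built from the same factor land in the same cluster, and reordering places them in adjacent rows and columns, which is what makes the blocks visible in a heat map.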