lessR provides many versions of a scatterplot with its Plot() function
for one or two variables, with an option to provide a separate
scatterplot for each level of one or two categorical variables. Access
all scatterplots with the same simple syntax. The first variable listed
without a parameter name, the x parameter, is plotted along the x-axis.
Any second variable listed without a parameter name, the y parameter,
is plotted along the y-axis. Each parameter may be represented by a
continuous or categorical variable, a single variable or a vector of
variables.
Plot() also plots time series data when the x-axis variable is a Date
variable. See the Time vignette for those examples.
Illustrate with the Employee data included as part of lessR.
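The output that follows is the kind of summary Read() reports when the data are loaded. A minimal sketch of such a call, assuming lessR is installed and that the bundled data set is requested by the name "Employee":

```r
library(lessR)

# Read the Employee data set bundled with lessR into the default data frame d
d <- Read("Employee")

# Inspect the first rows to confirm the variables listed in the summary
head(d)
```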
##
## >>> Suggestions
## Recommended binary format for data files: feather
## Create with Write(d, "your_file", format="feather")
## More details about your data, Enter: details() for d, or details(name)
##
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
##
## Variable Missing Unique
## Name Type Values Values Values First and last values
## ------------------------------------------------------------------------------------------
## 1 Years integer 36 1 16 7 NA 7 ... 1 2 10
## 2 Gender character 37 0 2 M M W ... W W M
## 3 Dept character 36 1 5 ADMN SALE FINC ... MKTG SALE FINC
## 4 Salary double 37 0 37 53788.26 94494.58 ... 56508.32 57562.36
## 5 JobSat character 35 2 3 med low high ... high low high
## 6 Plan integer 37 0 3 1 1 2 ... 2 2 1
## 7 Pre integer 37 0 27 82 62 90 ... 83 59 80
## 8 Post integer 37 0 22 92 74 86 ... 90 71 87
## ------------------------------------------------------------------------------------------
As an option, lessR also supports variable labels.
The labels are displayed on both the text and visualization output. Each
displayed label consists of the variable name juxtaposed with the
corresponding label. Create the table formatted as two columns. The
first column is the variable name and the second column is the
corresponding variable label. Not all variables need to be entered into
the table. The table can be stored as either a csv file or an Excel
file.
Read the variable label file into the l data frame, currently the only permissible name for the label file.
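As a sketch only, the labels might be read as follows. The name "Employee_lbl" for the bundled label file is an assumption here, not something the text above states.

```r
library(lessR)

# Read the variable labels into l, the name lessR expects for the label table
# (assumption: the bundled label file is available as "Employee_lbl")
l <- Read("Employee_lbl")
```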
##
## >>> Suggestions
## Recommended binary format for data files: feather
## Create with Write(d, "your_file", format="feather")
## More details about your data, Enter: details() for d, or details(name)
##
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## ------------------------------------------------------------
##
## Variable Missing Unique
## Name Type Values Values Values First and last values
## ------------------------------------------------------------------------------------------
## 1 label character 8 0 8 Time of Company Employment ... Test score on legal issues after instruction
## ------------------------------------------------------------------------------------------
Display the available labels.
## label
## Years Time of Company Employment
## Gender Man or Woman
## Dept Department Employed
## Salary Annual Salary (USD)
## JobSat Satisfaction with Work Environment
## Plan 1=GoodHealth, 2=GetWell, 3=BestCare
## Pre Test score on legal issues before instruction
## Post Test score on legal issues after instruction
A typical scatterplot visualizes the relationship of two continuous
variables, here Years worked at a company, and annual
Salary. Following is the function call to Plot()
for the default visualization.
Because d is the default name of the data frame that contains the
variables for analysis, the data parameter that names the input data
frame need not be specified. That is, there is no need to specify
data=d, though this parameter can be explicitly included in the
function call if desired.
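A minimal sketch of the default call, with and without the data parameter, assuming the Employee data have been read into d:

```r
library(lessR)
d <- Read("Employee")

# Default scatterplot: Years on the x-axis, Salary on the y-axis,
# with the variables found in the default data frame d
Plot(Years, Salary)

# Equivalent call with the data frame named explicitly
Plot(Years, Salary, data=d)
```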
##
## >>> Suggestions or enter: style(suggest=FALSE)
## Plot(Years, Salary, enhance=TRUE) # many options
## Plot(Years, Salary, color="red") # exterior edge color of points
## Plot(Years, Salary, fit="lm", fit_se=c(.90,.99)) # fit line, stnd errors
## Plot(Years, Salary, MD_cut=6) # Mahalanobis distance from center > 6 is an outlier
##
##
## >>> Pearson's product-moment correlation
##
## Years: Time of Company Employment
## Salary: Annual Salary (USD)
##
## Number of paired values with neither missing, n = 36
## Sample Correlation of Years and Salary: r = 0.852
##
## Hypothesis Test of 0 Correlation: t = 9.501, df = 34, p-value = 0.000
## 95% Confidence Interval for Correlation: 0.727 to 0.923
##
Enhance the default scatterplot with the enhance parameter.
The visualization includes the mean of each variable indicated by the
respective line through the scatterplot, the 95% confidence ellipse,
labeled outliers, least-squares regression line with 95% confidence
interval, and the corresponding regression line with the outliers
removed.
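A sketch of the enhanced call, using the enhance=TRUE form that the lessR suggestions in the output also list:

```r
library(lessR)
d <- Read("Employee")

# Adds the means, confidence ellipse, labeled outliers, and fit lines
Plot(Years, Salary, enhance=TRUE)
```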
## [Ellipse with Murdoch and Chow's function ellipse from their ellipse package]
##
##
## >>> Suggestions or enter: style(suggest=FALSE)
## Plot(Years, Salary, fill="skyblue") # interior fill color of points
## Plot(Years, Salary, fit="lm", fit_se=c(.90,.99)) # fit line, stnd errors
## Plot(Years, Salary, out_cut=.10) # label top 10% from center as outliers
##
## >>> Outlier analysis with Mahalanobis Distance
##
## MD ID
## ----- -----
## 8.14 18
## 7.84 34
##
## 5.63 31
## 5.58 19
## 3.75 4
## ... ...
##
##
## >>> Pearson's product-moment correlation
##
## Years: Time of Company Employment
## Salary: Annual Salary (USD)
##
## Number of paired values with neither missing, n = 36
## Sample Correlation of Years and Salary: r = 0.852
##
## Hypothesis Test of 0 Correlation: t = 9.501, df = 34, p-value = 0.000
## 95% Confidence Interval for Correlation: 0.727 to 0.923
##
A variety of fit lines can be plotted. The available values are
"loess" for a general non-linear fit, "lm" for linear least squares,
"null" for the null (flat line) model, "exp" for exponential growth
and decay, "quad" for the quadratic model, and "power" for the general
power beyond 2. Setting fit to TRUE plots the "loess" line. With the
value of "power", specify the value of the root with parameter
fit_power.
Here, plot the general non-linear fit. For emphasis, set plot_errors to
TRUE to plot the residuals from the line. The mean squared error (MSE)
is displayed to facilitate the comparison of different models.
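A sketch of such a call, built only from the parameter names and values given above:

```r
library(lessR)
d <- Read("Employee")

# General non-linear (loess) fit, with the residuals drawn from the curve
Plot(Years, Salary, fit="loess", plot_errors=TRUE)
```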
##
##
## >>> Suggestions or enter: style(suggest=FALSE)
## Plot(Years, Salary, enhance=TRUE) # many options
## Plot(Years, Salary, color="red") # exterior edge color of points
## Plot(Years, Salary, MD_cut=6) # Mahalanobis distance from center > 6 is an outlier
##
## Fit: Mean Squared Error, MSE = 100,834,065
##
Next, plot the exponential fit and show the residuals from the exponential curve. These data are approximately linear, so the exponential curve does not vary far from a straight line. The function displays the corresponding mean squared error (MSE) to assist in comparing models.
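A sketch of the exponential version, changing only the value of fit from the previous example:

```r
library(lessR)
d <- Read("Employee")

# Exponential fit, with the residuals drawn from the fitted curve
Plot(Years, Salary, fit="exp", plot_errors=TRUE)
```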
##
##
## >>> Suggestions or enter: style(suggest=FALSE)
## Plot(Years, Salary, enhance=TRUE) # many options
## Plot(Years, Salary, color="red") # exterior edge color of points
## Plot(Years, Salary, MD_cut=6) # Mahalanobis distance from center > 6 is an outlier
##
## Regressed linearized data of transformed data values of Salary with log()
## For predicted values, back transform with exp() of regression model
##
## Line: b0 = 10.777 b1 = 0.041 Fit: MSE = 0.022 Rsq = 0.722
##
The parameter transforms the y variable to the specified power, from
the default of 1, before doing the regression
analysis. The availability of this parameter provides for a wide range
of modifications to the underlying functional form of the fit curve.
Map a continuous variable, such as Pre, to the size of the plotted
points with the size parameter, which creates a bubble plot.
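A sketch of a bubble plot call. The choice of Years and Salary for the two axes is an assumption carried over from the earlier examples.

```r
library(lessR)
d <- Read("Employee")

# Bubble plot: the size of each point reflects the value of Pre
Plot(Years, Salary, size=Pre)
```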
##
##
##
## Some Parameter values (can be manually set)
## -------------------------------------------------------
## radius: 0.12 size of largest bubble
## power: 0.50 relative bubble sizes
Indicate multiple variables to plot along either axis with a vector
defined according to the base R function c(). Plot the linear model for
each variable according to the fit parameter set to "lm". By default,
when multiple lines are plotted on the same panel, the confidence
interval is turned off by internally setting the parameter fit_se to 0.
Explicitly override this parameter value as needed.
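A sketch with two x-variables, consistent with the Plot(c(Pre, Post), Salary, ...) form shown in the output suggestions. The 0.95 confidence level in the second call is an illustrative value.

```r
library(lessR)
d <- Read("Employee")

# Two x-variables against Salary, one linear fit line per variable
Plot(c(Pre, Post), Salary, fit="lm")

# Override the default fit_se of 0 to show the confidence intervals
Plot(c(Pre, Post), Salary, fit="lm", fit_se=0.95)
```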
##
##
## >>> Suggestions or enter: style(suggest=FALSE)
## Plot(c(Pre, Post), Salary, enhance=TRUE) # many options
## Plot(c(Pre, Post), Salary, color="red") # exterior edge color of points
## Plot(c(Pre, Post), Salary, out_cut=.10) # label top 10% from center as outliers
##
##
## >>> Pearson's product-moment correlation
##
## Post: Test score on legal issues after instruction
## Salary: Annual Salary (USD)
##
## Number of paired values with neither missing, n = 37
## Sample Correlation of Post and Salary: r = -0.070
##
## Hypothesis Test of 0 Correlation: t = -0.416, df = 35, p-value = 0.680
## 95% Confidence Interval for Correlation: -0.385 to 0.260
##
Multiple variables for the first parameter value, x, and no values for
y, plot as a scatterplot matrix. Pass a single vector, such as defined
by c(). Request the non-linear fit line and corresponding confidence
interval by specifying TRUE or "loess" for the fit parameter. Request a
linear fit line with the value of "lm".
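A sketch of a scatterplot matrix call. The four variables chosen here are an assumption; any set of continuous variables in d would serve.

```r
library(lessR)
d <- Read("Employee")

# Scatterplot matrix: a single vector for x, no y, linear fit in each panel
Plot(c(Salary, Years, Pre, Post), fit="lm")
```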
Smoothing and binning are two procedures for visualizing a relationship with many data values.
To obtain a larger data set, in this example generate random data with
base R rnorm(), then plot. Plot() first checks for the presence of the
specified variables in the global environment (workspace). If they are
not there, Plot() looks for them in a data frame, of which the default
value is d. Here, randomly generate values from normal populations for
x and y in the workspace.
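A sketch of the data generation and plot. The seed, slope, and noise level are hypothetical values chosen for illustration, so the numbers will not match the output shown.

```r
library(lessR)

set.seed(123)                     # hypothetical seed, for reproducibility only
x <- rnorm(4000)                  # 4000 standard normal values
y <- 8 * x + rnorm(4000, 1, 30)   # linear in x plus normal noise (assumed values)

# x and y are found in the workspace, so no data frame is needed
Plot(x, y)
```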
## >>> Note: x is not in a data frame (table)
## >>> Note: y is not in a data frame (table)
##
## >>> Suggestions or enter: style(suggest=FALSE)
## Plot(x, y, enhance=TRUE) # many options
## Plot(x, y, color="red") # exterior edge color of points
## Plot(x, y, fit="lm", fit_se=c(.90,.99)) # fit line, stnd errors
## Plot(x, y, out_cut=.10) # label top 10% from center as outliers
##
##
## >>> Pearson's product-moment correlation
##
## Number of paired values with neither missing, n = 4000
## Sample Correlation of x and y: r = 0.251
##
## Hypothesis Test of 0 Correlation: t = 16.397, df = 3998, p-value = 0.000
## 95% Confidence Interval for Correlation: 0.222 to 0.280
##
With large data sets, even for continuous variables there can be much
over-plotting of points. One strategy to address this issue smooths the
scatterplot by turning on the smooth parameter. The individual points
superimposed on the smoothed plot are potential outliers. The default
number of plotted outliers is 100. Turn off the plotting of outliers
completely by setting parameter smooth_points to 0. Show the linear
trend with fit set to "lm".
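A sketch of the smoothed plot, regenerating the random data with hypothetical values so that the example stands alone:

```r
library(lessR)

set.seed(123)                     # hypothetical seed
x <- rnorm(4000)
y <- 8 * x + rnorm(4000, 1, 30)   # assumed slope and noise level

# Smoothed scatterplot with a linear fit line
Plot(x, y, smooth=TRUE, fit="lm")

# Same plot without the superimposed outlier points
Plot(x, y, smooth=TRUE, smooth_points=0, fit="lm")
```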
## >>> Note: x is not in a data frame (table)
## >>> Note: y is not in a data frame (table)
##
##
## >>> Suggestions or enter: style(suggest=FALSE)
## Plot(x, y, enhance=TRUE) # many options
## Plot(x, y, fill="skyblue") # interior fill color of points
## Plot(x, y, MD_cut=6) # Mahalanobis distance from center > 6 is an outlier
##
##
## >>> Pearson's product-moment correlation
##
## Number of paired values with neither missing, n = 4000
## Sample Correlation of x and y: r = 0.251
##
## Hypothesis Test of 0 Correlation: t = 16.397, df = 3998, p-value = 0.000
## 95% Confidence Interval for Correlation: 0.222 to 0.280
##
##
## Line: b0 = 1.03068757 b1 = 7.91963664 Fit: MSE = 917.032 Rsq = 0.063
##
Another strategy for alleviating over-plotting makes the fill color
mostly transparent with the transparency parameter, or turns the fill
off completely by setting fill to "off". The closer the value of
transparency is to 1, the more transparent is the fill.
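A sketch of both options. The value 0.95 is illustrative, and the parameter name transparency follows the text above; some lessR versions may use a different name for it.

```r
library(lessR)

set.seed(123)                     # hypothetical seed
x <- rnorm(4000)
y <- 8 * x + rnorm(4000, 1, 30)   # assumed slope and noise level

# Mostly transparent fill: dense regions appear darker
Plot(x, y, transparency=0.95)

# No fill at all, only the point edges
Plot(x, y, fill="off")
```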
## >>> Note: x is not in a data frame (table)
## >>> Note: y is not in a data frame (table)
##
## >>> Suggestions or enter: style(suggest=FALSE)
## Plot(x, y, enhance=TRUE) # many options
## Plot(x, y, color="red") # exterior edge color of points
## Plot(x, y, fit="lm", fit_se=c(.90,.99)) # fit line, stnd errors
## Plot(x, y, MD_cut=6) # Mahalanobis distance from center > 6 is an outlier
##
##
## >>> Pearson's product-moment correlation
##
## Number of paired values with neither missing, n = 4000
## Sample Correlation of x and y: r = 0.251
##
## Hypothesis Test of 0 Correlation: t = 16.397, df = 3998, p-value = 0.000
## 95% Confidence Interval for Correlation: 0.222 to 0.280
##
Another way to visualize a relationship when there are many data
points is to bin the x-axis. Specify the number of bins with parameter
n_bins. Plot() then computes the mean of y for each bin and connects
the means by line segments. This procedure plots the conditional means
by default without any assumption of form such as linearity. Set the
stat parameter to "median" to compute the median of y for each bin
instead. The standard Plot() parameters fill, color, size, and segments
also apply.
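A sketch of the binned plot. Five bins matches the table in the output that follows; the data values are hypothetical, and the "median" value for stat is taken from the description above.

```r
library(lessR)

set.seed(123)                     # hypothetical seed
x <- rnorm(4000)
y <- 8 * x + rnorm(4000, 1, 30)   # assumed slope and noise level

# Mean of y within each of five bins of x, connected by line segments
Plot(x, y, n_bins=5)

# Median of y within each bin instead of the mean
Plot(x, y, n_bins=5, stat="median")
```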
## >>> Note: x is not in a data frame (table)
## >>> Note: y is not in a data frame (table)
##
## Table: Summary Stats
##
## x y
## ------- ------- ---------
## n 4000 4000
## n.miss 0 0
## min -3.239 -104.740
## max 3.589 112.460
## mean -0.003 1.006
##
##
## Table: mean of y for levels of x
##
## bin n midpt mean
## --- ---------------- ----- ------- --------
## 1 [-3.246,-1.873] 116 -2.560 -16.734
## 2 (-1.873,-0.508] 1090 -1.191 -5.699
## 3 (-0.508,0.858] 2001 0.175 0.848
## 4 (0.858,2.223] 743 1.541 12.374
## 5 (2.223,3.596] 50 2.909 25.696
The default plot for a single continuous variable includes not only the scatterplot, but also the superimposed violin plot and box plot, with outliers identified. Call this plot the VBS plot.
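A minimal sketch of the single-variable call that yields a VBS plot:

```r
library(lessR)
d <- Read("Employee")

# One continuous variable: superimposed violin, box, and scatter plots
Plot(Salary)
```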
## [Violin/Box/Scatterplot graphics from Deepayan Sarkar's lattice package]
##
## >>> Suggestions
## Plot(Salary, out_cut=2, fences=TRUE, vbs_mean=TRUE) # Label two outliers ...
## Plot(Salary, box_adj=TRUE) # Adjust boxplot whiskers for asymmetry
## --- Salary ---
## Present: 37
## Missing: 0
## Total : 37
##
## Mean : 73795.557
## Stnd Dev : 21799.533
## IQR : 31012.560
## Skew : 0.190 [medcouple, -1 to 1]
##
## Minimum : 46124.970
## Lower Whisker: 46124.970
## 1st Quartile : 56772.950
## Median : 69547.600
## 3rd Quartile : 87785.510
## Upper Whisker: 122563.380
## Maximum : 134419.230
##
##
## --- Outliers --- from the box plot: 1
##
## Small Large
## ----- -----
## 134419.23
##
## Number of duplicated values: 0
##
## Parameter values (can be manually set)
## -------------------------------------------------------
## size: 0.61 size of plotted points
## out_size: 0.82 size of plotted outlier points
## jitter_y: 0.45 random vertical movement of points
## jitter_x: 0.00 random horizontal movement of points
## bw: 9529.04 set bandwidth higher for smoother edges
Control the choice of the three superimposed plots (violin, box, and
scatter) with the vbs_plot parameter. The default setting is "vbs" for
all three plots. Here, for example, obtain just the box plot. Or, use
the alias BoxPlot() in place of Plot().
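A sketch of the two equivalent ways to request only the box plot, using the "b" value for vbs_plot that this vignette names:

```r
library(lessR)
d <- Read("Employee")

# Box plot only, selected with the vbs_plot parameter
Plot(Salary, vbs_plot="b")

# The same plot requested with the alias
BoxPlot(Salary)
```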
## [Violin/Box/Scatterplot graphics from Deepayan Sarkar's lattice package]
##
## >>> Suggestions
## Plot(Salary, out_cut=2, fences=TRUE, vbs_mean=TRUE) # Label two outliers ...
## Plot(Salary, box_adj=TRUE) # Adjust boxplot whiskers for asymmetry
## --- Salary ---
## Present: 37
## Missing: 0
## Total : 37
##
## Mean : 73795.557
## Stnd Dev : 21799.533
## IQR : 31012.560
## Skew : 0.190 [medcouple, -1 to 1]
##
## Minimum : 46124.970
## Lower Whisker: 46124.970
## 1st Quartile : 56772.950
## Median : 69547.600
## 3rd Quartile : 87785.510
## Upper Whisker: 122563.380
## Maximum : 134419.230
##
##
## --- Outliers --- from the box plot: 1
##
## Small Large
## ----- -----
## 134419.23
##
## Number of duplicated values: 0
Obtain a frequency distribution by specifying the value of parameter
stat_x: either "count" or, to plot proportions on the y-axis,
"proportion" or "%". Specify a custom bin width, if desired, with the
parameter bin_width.
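A sketch of the frequency distribution calls. The bin width of 13000 is taken from the output that follows, and the second call mirrors the form listed in the lessR suggestions there.

```r
library(lessR)
d <- Read("Employee")

# Frequency distribution of Salary as counts
Plot(Salary, stat_x="count")

# Percentages instead of counts, with an explicit bin width
Plot(Salary, stat_x="%", bin_width=13000)
```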
## >>> Suggestions or enter: style(suggest=FALSE)
## Plot(Salary, stat_x="%", bin_width=13000, size=0) # just line segments, no points
##
## --- Salary ---
##
## n miss mean sd min mdn max
## 37 0 73795.557 21799.533 46124.970 69547.600 134419.230
##
##
##
## Bin Width: 13000
## Number of Bins: 8
##
## Bin Midpnt Count Prop Cumul.c Cumul.p
## ---------------------------------------------------------
## 40000 > 53000 46500 5 0.14 5 0.14
## 53000 > 66000 59500 10 0.27 15 0.41
## 66000 > 79000 72500 10 0.27 25 0.68
## 79000 > 92000 85500 4 0.11 29 0.78
## 92000 > 105000 98500 4 0.11 33 0.89
## 105000 > 118000 111500 2 0.05 35 0.95
## 118000 > 131000 124500 1 0.03 36 0.97
## 131000 > 144000 137500 1 0.03 37 1.00
##
##
## No (Box plot) outliers
Create a Cleveland dot plot when one of the variables has unique (ID)
values. In this example, for a single variable, row names are on the
y-axis. The default plot sorts by the value plotted, with the default
value of parameter sort_yx of "+" for an ascending plot. Set to "-" for
a descending plot and "0" for no sorting.
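A sketch of the Cleveland dot plot, using the Plot(Salary, row_names) form that appears in the output suggestions:

```r
library(lessR)
d <- Read("Employee")

# Cleveland dot plot, sorted in ascending order of Salary by default
Plot(Salary, row_names)

# Descending order instead
Plot(Salary, row_names, sort_yx="-")
```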
##
## >>> Suggestions or enter: style(suggest=FALSE)
## Plot(Salary, row_names, sort_yx="0") # do not sort y-axis variable by x-axis variable
## Plot(Salary, row_names, segments_y=FALSE) # drop the line segments
## Plot(Salary, row_names, fill="red") # red point interiors
##
The standard scatterplot version of a Cleveland dot plot follows, with no sorting and no line segments.
##
## >>> Suggestions or enter: style(suggest=FALSE)
## Plot(Salary, row_names, segments_y=FALSE, sort_yx="0", fill="red") # red point interiors
##
This Cleveland dot plot has two x-variables, indicated as a standard R
vector with the c() function. In this situation, the two points on each
row are connected with a line segment. By default the rows are sorted
by distance between the successive points.
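A sketch of the two-variable version, again following the form shown in the output suggestions:

```r
library(lessR)
d <- Read("Employee")

# Two points per row, Pre and Post, connected by a line segment
Plot(c(Pre, Post), row_names)
```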
##
## >>> Suggestions or enter: style(suggest=FALSE)
## Plot(c(Pre, Post), row_names, sort_yx="0") # do not sort y-axis variable by x-axis variable
## Plot(c(Pre, Post), row_names, segments_y=FALSE) # drop the line segments
## Plot(c(Pre, Post), row_names, fill="red") # red point interiors
##
A mixture of categorical and continuous variables can be plotted a variety of ways, as illustrated below.
Plot a scatterplot of two continuous variables for each level of a
categorical variable on the same panel with the by parameter. Here,
plot Years and Salary for each of the two levels of Gender in the data.
Colors and geometric plot shapes can distinguish between the plots. For
all variables except an ordered factor, the default plots according to
the default qualitative color palette, "hues", with the geometric shape
of a point.
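A minimal sketch of the grouped scatterplot with default colors and shapes:

```r
library(lessR)
d <- Read("Employee")

# One set of points per level of Gender, all on the same panel
Plot(Years, Salary, by=Gender)
```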
##
## >>> Suggestions or enter: style(suggest=FALSE)
## Plot(Years, Salary, enhance=TRUE) # many options
## Plot(Years, Salary, fill="skyblue") # interior fill color of points
## Plot(Years, Salary, fit="lm", fit_se=c(.90,.99)) # fit line, stnd errors
## Plot(Years, Salary, out_cut=.10) # label top 10% from center as outliers
Change the plot colors with the fill (interior) and color (exterior or
edge) parameters. Because there are two levels of the by variable,
specify two fill colors and two edge colors, each with an R vector
defined by the c() function. Also, include the regression line for each
group with the fit parameter and increase the size of the plotted
points with the size parameter.
Plot(Years, Salary, by=Gender, size=2, fit="lm",
     fill=c("olivedrab3", "gold1"),
     color=c("darkgreen", "gold4")
)
##
##
## >>> Suggestions or enter: style(suggest=FALSE)
## Plot(Years, Salary, enhance=TRUE) # many options
## Plot(Years, Salary, MD_cut=6) # Mahalanobis distance from center > 6 is an outlier
##
## Gender: M Line: b0 = 30842.335 b1 = 4047.307 Fit: MSE = 107,647,877 Rsq = 0.819
##
## Gender: W Line: b0 = 47109.787 b1 = 2882.272 Fit: MSE = 144,700,625 Rsq = 0.598
##
Change the plotted shapes with the shape parameter. The default value
is "circle" with both an exterior color and filled interior, specified
with color and fill. Possible values with fillable interiors are
"circle", "square", "diamond", "triup" (triangle up), and "tridown"
(triangle down). Other possible values include all uppercase and
lowercase letters, all digits, and most punctuation characters. The
numbers 0 through 25 defined by the R points() function also apply. If
plotting levels according to by, then list one shape for each level to
be plotted. Or, request default shapes across the different by groups
by setting parameter shape to "vary".
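A sketch of both ways to set shapes by group. The letters "M" and "W" in the second call are an illustrative choice that happens to match the Gender codes in the data.

```r
library(lessR)
d <- Read("Employee")

# Let Plot() choose a different default shape for each level of Gender
Plot(Years, Salary, by=Gender, shape="vary")

# Or list one shape per level, here a letter for each group
Plot(Years, Salary, by=Gender, shape=c("M", "W"))
```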
##
## >>> Suggestions or enter: style(suggest=FALSE)
## Plot(Years, Salary, enhance=TRUE) # many options
## Plot(Years, Salary, color="red") # exterior edge color of points
## Plot(Years, Salary, fit="lm", fit_se=c(.90,.99)) # fit line, stnd errors
## Plot(Years, Salary, MD_cut=6) # Mahalanobis distance from center > 6 is an outlier
A Trellis (facet) plot creates a separate panel for the plot of each
level of the categorical variable. Generate Trellis plots with the
facet1 parameter. In this example, plot the best-fit linear model for
the data in each panel according to the fit parameter. By default, the
95% confidence interval for each line is also displayed.
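A sketch of the Trellis call, assuming Gender as the conditioning variable, as the output that follows suggests. Older lessR releases may name this parameter differently.

```r
library(lessR)
d <- Read("Employee")

# One panel per level of Gender, each with its own linear fit line
Plot(Years, Salary, facet1=Gender, fit="lm")
```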
## [Trellis (facet) graphics from Deepayan Sarkar's lattice package]
##
## Regression analysis of linearized Salary values
## Need back transformation of regression model to compute predicted values
##
## Gender 1 Line: b0 = 30842.335 b1 = 4047.307 Fit: MSE = 107,647,877 Rsq = 0.819
##
## Gender 2 Line: b0 = 47109.787 b1 = 2882.272 Fit: MSE = 144,700,625 Rsq = 0.598
Turn off the confidence interval by setting the parameter fit_se to 0
for the value of the confidence level.
A categorical variable plotted with a continuous variable results in a traditional scatterplot, though the scatter is confined to the straight lines that represent the levels of the categorical variable.
The first two parameters of Plot() are x and y. In this example, the
categorical variable, Dept, listed second, specifies the y variable, as
in y=Dept. There is no distinction in this function call between two
continuous variables and one continuous and one categorical variable.
The Plot() function evaluates each variable for continuity and responds
appropriately.
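A sketch of the call with one continuous and one categorical variable, in both positional and named form:

```r
library(lessR)
d <- Read("Employee")

# Continuous Salary on the x-axis, categorical Dept on the y-axis
Plot(Salary, Dept)

# The same call with the parameter names written out
Plot(x=Salary, y=Dept)
```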
##
## >>> Suggestions or enter: style(suggest=FALSE)
## Plot(Salary, Dept, means=FALSE) # do not plot means
## Plot(Salary, Dept, stat="mean") # only plot means
## ANOVA(Salary ~ Dept) # inferential analysis
##
## Salary: Annual Salary (USD)
## - by levels of -
## Dept: Department Employed
##
## n miss mean sd min mdn max
## ACCT 5 0 61792.776 12774.606 46124.970 69547.600 72502.500
## ADMN 6 0 81277.117 27585.151 53788.260 71058.595 122563.380
## FINC 4 0 69010.675 17852.498 57139.900 61937.625 95027.550
## MKTG 6 0 70257.128 19869.812 51036.850 61658.990 99062.660
## SALE 15 0 78830.065 23476.839 49188.960 77714.850 134419.230
##
To avoid point overlap, if there is at least one duplicated value of
continuous y for any level of categorical x, by default some horizontal
jitter is added to each plotted point, which was not needed in this
example. Manually adjust the jitter with either parameter jitter_x or,
if x is continuous and y categorical, the jitter_y parameter. In
addition, if the categorical variable is an R factor or a variable of
type character, by default the mean of the continuous variable is
displayed at each level of the categorical variable, as well as in the
text output. If the categorical variable is numeric, it is better to
convert the variable to a factor to have just the categories on the
axis and not a continuous scale. For example,
d$Gender <- factor(d$Gender).
Another helpful technique for large data sets is to add some fill
transparency with the transparency parameter, with values such as 0.8
and 0.9. The combination of jitter and transparency allows for plotting
many thousands of points.
An alternative display of the distribution of a continuous variable
across the levels of a categorical variable is a superimposed violin,
box, and scatter plot, a VBS plot. To plot the points in different
colors according to the level of the categorical variable, invoke the
by parameter. Here, plot Salary across the levels of Gender on the same
panel.
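A minimal sketch of the grouped VBS plot described above:

```r
library(lessR)
d <- Read("Employee")

# VBS plot of Salary, with points colored by the level of Gender
Plot(Salary, by=Gender)
```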
## [Violin/Box/Scatterplot graphics from Deepayan Sarkar's lattice package]
##
## >>> Suggestions
## Plot(Salary, out_cut=2, fences=TRUE, vbs_mean=TRUE) # Label two outliers ...
## Plot(Salary, box_adj=TRUE) # Adjust boxplot whiskers for asymmetry
## ttest(Salary ~ Gender) # Add the data parameter if not the d data frame
## Salary: Annual Salary (USD)
## - by levels of -
## Gender
##
## n miss mean sd min mdn max
## M 18 0 81147.458 23128.436 49188.960 79792.950 134419.230
## W 19 0 66830.598 18438.456 46124.970 61356.690 122563.380
##
##
## Max Dupli-
## Level cations Values
## ------------------------------
## M 0
## W 0
##
## Parameter values (can be manually set)
## -------------------------------------------------------
## size: 0.56 size of plotted points
## out_size: 0.80 size of plotted outlier points
## jitter_y: 0.54 random vertical movement of points
## jitter_x: random horizontal movement of points
## bw: 9529.04 set bandwidth higher for smoother edges
Or, create a Trellis plot that consists of a VBS plot on a separate
panel for each level of the categorical variable. Accomplish the
Trellis plot with the facet1 parameter. Here, plot Salary across the
levels of Gender. Again, specify one, two, or, by default, all three
superimposed plots: violin, box, and scatter. In this example, the
categorical variable, Gender, specifies the facet1 variable.
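A sketch of the Trellis version of the VBS plot, with Gender as the facet1 variable:

```r
library(lessR)
d <- Read("Employee")

# A separate VBS panel of Salary for each level of Gender
Plot(Salary, facet1=Gender)
```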
## [Trellis (facet) graphics from Deepayan Sarkar's lattice package]
##
## >>> Suggestions
## Plot(Salary, out_cut=2, fences=TRUE, vbs_mean=TRUE) # Label two outliers ...
## Plot(Salary, box_adj=TRUE) # Adjust boxplot whiskers for asymmetry
## ttest(Salary ~ Gender) # Add the data parameter if not the d data frame
## Salary: Annual Salary (USD)
## - by levels of -
## Gender: Man or Woman
##
## n miss mean sd min mdn max
## M 18 0 81147.458 23128.436 49188.960 79792.950 134419.230
## W 19 0 66830.598 18438.456 46124.970 61356.690 122563.380
##
##
## Max Dupli-
## Level cations Values
## ------------------------------
## M 0
## W 0
##
## Parameter values (can be manually set)
## -------------------------------------------------------
## size: 0.56 size of plotted points
## out_size: 0.80 size of plotted outlier points
## jitter_y: 0.54 random vertical movement of points
## jitter_x: random horizontal movement of points
## bw: 9529.04 set bandwidth higher for smoother edges
The default coloring of the boxes for variables other than an ordered
factor follows the default qualitative palette, "hues". For an ordered
factor, the fill color follows the default sequential palette of the
corresponding theme, such as "blues". Customize colors with the
box_fill parameter.
Show just the box plots with the vbs_plot parameter, which has a
default setting of "vbs" for the superimposed violin, box, and scatter
plots. Set vbs_plot to "b". Or, use the alias BoxPlot(). Change the
fill color of each box with the box_fill parameter. In addition to the
traditional median for a box plot, show the mean as well with the
vbs_mean parameter. If specifying just one fill color, then all boxes
are filled with that color.
Or, drop the box plot and only plot the violins and the scatterplots.
Without the boxes, the violins take on the default colors. Specify a
value of "vs" for the vbs_plot parameter. If only plotting the violins,
then the alias ViolinPlot() can also be used. Change the fill color of
the violins with parameter violin_fill.
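A sketch of the violin and scatter combination. The fill color "lightsteelblue" in the second call is an arbitrary illustrative choice.

```r
library(lessR)
d <- Read("Employee")

# Violin and scatter plots only, one panel per level of Gender
Plot(Salary, facet1=Gender, vbs_plot="vs")

# The same plot with a custom violin fill color
Plot(Salary, facet1=Gender, vbs_plot="vs", violin_fill="lightsteelblue")
```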
## [Trellis (facet) graphics from Deepayan Sarkar's lattice package]
##
## >>> Suggestions
## Plot(Salary, out_cut=2, fences=TRUE, vbs_mean=TRUE) # Label two outliers ...
## Plot(Salary, box_adj=TRUE) # Adjust boxplot whiskers for asymmetry
## ttest(Salary ~ Gender) # Add the data parameter if not the d data frame
## Salary: Annual Salary (USD)
## - by levels of -
## Gender: Man or Woman
##
## n miss mean sd min mdn max
## M 18 0 81147.458 23128.436 49188.960 79792.950 134419.230
## W 19 0 66830.598 18438.456 46124.970 61356.690 122563.380
##
##
## Max Dupli-
## Level cations Values
## ------------------------------
## M 0
## W 0
##
## Parameter values (can be manually set)
## -------------------------------------------------------
## size: 0.56 size of plotted points
## out_size: 0.80 size of plotted outlier points
## jitter_y: 0.54 random vertical movement of points
## jitter_x: random horizontal movement of points
## bw: 9529.04 set bandwidth higher for smoother edges
The following plot uses the alias BoxPlot() to generate a Trellis plot
of boxplots only across the levels of Gender.
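A sketch of that alias call, assuming BoxPlot() accepts the same facet1 parameter as Plot():

```r
library(lessR)
d <- Read("Employee")

# Box plots only, one panel per level of Gender
BoxPlot(Salary, facet1=Gender)
```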
## [Trellis (facet) graphics from Deepayan Sarkar's lattice package]
##
## >>> Suggestions
## Plot(Salary, out_cut=2, fences=TRUE) # Label two outliers ...
## Plot(Salary, box_adj=TRUE) # Adjust boxplot whiskers for asymmetry
## ttest(Salary ~ Gender) # Add the data parameter if not the d data frame
## Salary: Annual Salary (USD)
## - by levels of -
## Gender: Man or Woman
##
## n miss mean sd min mdn max
## M 18 0 81147.458 23128.436 49188.960 79792.950 134419.230
## W 19 0 66830.598 18438.456 46124.970 61356.690 122563.380
##
##
## Max Dupli-
## Level cations Values
## ------------------------------
## M 0
## W 0
Show the different distributions of the continuous variable across the levels of the categorical variable with a scatterplot. Here, show the distribution of Salary for Males and Females across the various departments.
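A sketch of this call, following the Plot(Salary, Dept, by=Gender, ...) form listed in the output suggestions:

```r
library(lessR)
d <- Read("Employee")

# Salary within each department, with points colored by Gender
Plot(Salary, Dept, by=Gender)
```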
##
## >>> Suggestions or enter: style(suggest=FALSE)
## Plot(Salary, Dept, by=Gender, means=FALSE) # do not plot means
## Plot(Salary, Dept, by=Gender, stat="mean") # only plot means
## ANOVA(Salary ~ Dept) # inferential analysis
##
## Salary: Annual Salary (USD)
## - by levels of -
## Dept: Department Employed
##
## n miss mean sd min mdn max
## ACCT 5 0 61792.776 12774.606 46124.970 69547.600 72502.500
## ADMN 6 0 81277.117 27585.151 53788.260 71058.595 122563.380
## FINC 4 0 69010.675 17852.498 57139.900 61937.625 95027.550
## MKTG 6 0 70257.128 19869.812 51036.850 61658.990 99062.660
## SALE 15 0 78830.065 23476.839 49188.960 77714.850 134419.230
##
To do a Trellis plot with two categorical variables, invoke the facet2
parameter in addition to the facet1 parameter. By default, the box fill
colors are unique for each level of the facet1 variable, and then the
colors cycle through all the values of the facet2 variable. With so
many panels to plot, explicitly set them in a single column by setting
parameter n_col to 1. The corresponding row parameter is n_row.
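A sketch of a two-way Trellis plot. Assigning Gender to facet1 and Dept to facet2 is an assumption based on the output that follows.

```r
library(lessR)
d <- Read("Employee")

# One panel per combination of Gender and Dept, stacked in a single column
Plot(Salary, facet1=Gender, facet2=Dept, n_col=1)
```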
## [Trellis (facet) graphics from Deepayan Sarkar's lattice package]
##
## >>> Suggestions
## Plot(Salary, out_cut=2, fences=TRUE, vbs_mean=TRUE) # Label two outliers ...
## Plot(Salary, box_adj=TRUE) # Adjust boxplot whiskers for asymmetry
## ttest(Salary ~ Gender) # Add the data parameter if not the d data frame
## Salary: Annual Salary (USD)
## - by levels of -
## Gender: Man or Woman
##
## n miss mean sd min mdn max
## M 18 0 81147.458 23128.436 49188.960 79792.950 134419.230
## W 19 0 66830.598 18438.456 46124.970 61356.690 122563.380
##
##
## Max Dupli-
## Level cations Values
## ------------------------------
## M 0
## W 0
##
## Parameter values (can be manually set)
## -------------------------------------------------------
## size: 0.56 size of plotted points
## out_size: 0.80 size of plotted outlier points
## jitter_y: 0.54 random vertical movement of points
## jitter_x: random horizontal movement of points
## bw: 9529.04 set bandwidth higher for smoother edges
To specify custom colors with the box_fill parameter, specify the
number of colors according to the number of levels of the facet1
variable. The colors for facet1 then cycle over the facet2 values.
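As an illustration, assuming the Employee data with Gender as the facet1 variable and Dept as the facet2 variable, two fill colors cover the two levels of Gender. The color names are arbitrary choices for this sketch, not values taken from the original example:

```r
library(lessR)

# Employee data included with lessR
d <- Read("Employee")

# One fill color per level of the facet1 variable (Gender has two levels)
fills <- c("lightsteelblue", "thistle")

# The two colors repeat for each level of the facet2 variable, Dept
Plot(Salary, facet1=Gender, facet2=Dept, n_col=1, box_fill=fills)
```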
Alternatively, invoke the by parameter and the facet1 parameter. The
values of the facet1 variable plot as separate panels, the Trellis
part, and the levels of the by variable plot as groups within each
panel.
## [Trellis (facet) graphics from Deepayan Sarkar's lattice package]
##
## >>> Suggestions
## Plot(Salary, out_cut=2, fences=TRUE, vbs_mean=TRUE) # Label two outliers ...
## Plot(Salary, box_adj=TRUE) # Adjust boxplot whiskers for asymmetry
## ttest(Salary ~ Gender) # Add the data parameter if not the d data frame
## ANOVA(Salary ~ Dept) # Add the data parameter if not the d data frame
## Salary: Annual Salary (USD)
## - by levels of -
## Gender: Department Employed
##
## n miss mean sd min mdn max
## ACCT 5 0 61792.776 12774.606 46124.970 69547.600 72502.500
## ADMN 6 0 81277.117 27585.151 53788.260 71058.595 122563.380
## FINC 4 0 69010.675 17852.498 57139.900 61937.625 95027.550
## MKTG 6 0 70257.128 19869.812 51036.850 61658.990 99062.660
## SALE 15 0 78830.065 23476.839 49188.960 77714.850 134419.230
##
##
## Max Dupli-
## Level cations Values
## ------------------------------
## ACCT 0
## ADMN 0
## FINC 0
## MKTG 0
## SALE 0
##
## Parameter values (can be manually set)
## -------------------------------------------------------
## size: 0.52 size of plotted points
## out_size: 0.79 size of plotted outlier points
## jitter_y: 0.50 random vertical movement of points
## jitter_x: random horizontal movement of points
## bw: 9529.04 set bandwidth higher for smoother edges
Plot() can display the relationships for up to five variables. The two
primary variables, x and y, that form the basis of the scatter plot,
are continuous. Usually these two variables are listed first in the
function call and so do not need their parameter names specified.
Indicate two categorical variables that form the Trellis panels with
parameters facet1 and facet2. Call these two variables the Trellis
variables, which define a Trellis panel for each combination of their
values. Finally, there can be a categorical grouping variable, the by
variable, which plots different groups within each Trellis panel.
To illustrate, first read the data. Use the Cars93 data set installed with lessR, which describes characteristics of 1993 cars.
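The read step itself is not shown in this text. A minimal sketch, assuming the data set is available by name through the lessR Read() function and stored in the default data frame d:

```r
library(lessR)

# Read the Cars93 data installed with lessR into the default data frame d
d <- Read("Cars93")
```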
##
## >>> Suggestions
## Recommended binary format for data files: feather
## Create with Write(d, "your_file", format="feather")
## More details about your data, Enter: details() for d, or details(name)
##
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
##
## Variable Missing Unique
## Name Type Values Values Values First and last values
## ------------------------------------------------------------------------------------------
## 1 Make character 93 0 32 Acura Acura ... Volvo Volvo
## 2 Type character 93 0 6 Small Midsize ... Compact Midsize
## 3 MinPrice double 93 0 79 12.9 29.2 25.9 ... 22.9 21.8 24.8
## 4 MidPrice double 93 0 81 15.9 33.9 29.1 ... 23.3 22.7 26.7
## 5 MaxPrice double 93 0 79 18.8 38.7 32.3 ... 23.7 23.5 28.5
## 6 MPGcity integer 93 0 21 25 18 20 ... 18 21 20
## 7 MPGhiway integer 93 0 22 31 25 26 ... 25 28 28
## 8 Airbags integer 93 0 3 0 2 1 ... 0 1 2
## 9 DriveTrain integer 93 0 3 1 1 1 ... 1 0 1
## 10 Cylinders integer 92 1 5 4 6 6 ... 6 4 5
## 11 Engine double 93 0 26 1.8 3.2 2.8 ... 2.8 2.3 2.4
## 12 HP integer 93 0 57 140 200 172 ... 178 114 168
## 13 RPM integer 93 0 24 6300 5500 5500 ... 5800 5400 6200
## 14 RevMile integer 93 0 78 2890 2335 2280 ... 2385 2215 2310
## 15 Manual integer 93 0 2 1 1 1 ... 1 1 1
## 16 FuelCap double 93 0 38 13.2 18 16.9 ... 18.5 15.8 19.3
## 17 PassCap integer 93 0 6 5 5 5 ... 4 5 5
## 18 Length integer 93 0 51 177 195 180 ... 159 190 184
## 19 Wheelbase integer 93 0 27 102 115 102 ... 97 104 105
## 20 Width integer 93 0 16 68 71 67 ... 66 67 69
## 21 Uturn integer 93 0 14 37 38 37 ... 36 37 38
## 22 RearSeat double 91 2 24 26.5 30 28 ... 26 29.5 30
## 23 LugCap integer 82 11 16 11 15 14 ... 15 14 15
## 24 Weight integer 93 0 81 2705 3560 3375 ... 2810 2985 3245
## 25 Source integer 93 0 2 0 0 0 ... 0 0 0
## ------------------------------------------------------------------------------------------
Two of the categorical variables, Manual and Source, are integer coded as 0 and 1, so recode them to R factors to obtain more descriptive labels.
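# Recode the 0/1 Manual indicator into a new factor, Transmission
d$Transmission <- factor(d$Manual, levels=0:1, labels=c("Auto", "Manual"))
# Recode the 0/1 Source indicator in place as a labeled factor
d$Source <- factor(d$Source, levels=0:1, labels=c("Foreign", "Domestic"))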
Plot MPGcity against Weight. Specify Cylinders, the number of cylinders, and Transmission, manual or automatic, as the Trellis conditioning variables that form the Trellis plot. Specify Source, the origin of the vehicle as Foreign or Domestic, as the grouping variable, plotted with a separate color within each panel.
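The plotting call is not shown in this text. A sketch of a five-variable call consistent with the description, assuming the Cars93 data and the factor recoding given earlier:

```r
library(lessR)

# Cars93 data installed with lessR
d <- Read("Cars93")

# Recode the 0/1 indicators to labeled factors, as in the text
d$Transmission <- factor(d$Manual, levels=0:1, labels=c("Auto", "Manual"))
d$Source <- factor(d$Source, levels=0:1, labels=c("Foreign", "Domestic"))

# x and y are continuous, facet1 and facet2 define the Trellis panels,
# and by defines the groups plotted within each panel
Plot(Weight, MPGcity, facet1=Cylinders, facet2=Transmission, by=Source)
```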
## [Trellis (facet) graphics from Deepayan Sarkar's lattice package]
Several patterns emerge from the visualization. As Weight increases, city MPG decreases. Domestic cars tend to weigh more. Foreign cars tend to have fewer cylinders, which also leads to better fuel mileage.
To avoid over-plotting, the plot of two categorical variables results in a bubble plot of their joint frequencies.
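A minimal sketch of such a call, assuming the Employee data are read back into d. The variable order follows the suggestions listed in the output:

```r
library(lessR)

# Return to the Employee data included with lessR
d <- Read("Employee")

# Two categorical variables: a bubble plot of their joint frequencies
Plot(Dept, Gender)
```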
##
## >>> Suggestions or enter: style(suggest=FALSE)
## Plot(Dept, Gender, size_cut=FALSE)
## Plot(Dept, Gender, trans=.8, bg="off", grid="off")
## SummaryStats(Dept, Gender) # or ss
##
## Joint and Marginal Frequencies
## ------------------------------
##
## Dept
## Gender ACCT ADMN FINC MKTG SALE Sum
## M 2 2 3 1 10 18
## W 3 4 1 5 5 18
## Sum 5 6 4 6 15 36
##
## Cramer's V: 0.415
##
## Chi-square Test of Independence:
## Chisq = 6.200, df = 4, p-value = 0.185
## >>> Low cell expected frequencies, chi-squared approximation may not be accurate
##
## Some Parameter values (can be manually set)
## -------------------------------------------------------
## radius: 0.22 size of largest bubble
## power: 0.50 relative bubble sizes
The radius parameter sets the size, in inches, of the largest displayed
bubble, and the other bubbles scale relative to it. The power parameter
sets the relative size of the bubbles. The default power value of 0.5
scales the bubbles so that the area of each bubble is proportional to
the value of the corresponding sizing variable. A value of 1 scales the
bubbles so that the radius of each bubble is proportional to the value
of the sizing variable, which increases the discrepancy in size among
the bubbles.
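The effect of power can be worked through with plain arithmetic. The sketch below uses base R only and assumes that the radius of each bubble is proportional to its count raised to power, with the largest bubble drawn at radius inches. The helper bubble_radius() is hypothetical, written for this illustration, and is not a lessR function:

```r
# Joint frequencies of Dept by Gender from the output above (row M, then row W)
counts <- c(2, 2, 3, 1, 10, 3, 4, 1, 5, 5)

# Hypothetical helper: radius in inches of each bubble, assuming the radius
# is proportional to count^power and the largest bubble has radius `radius`
bubble_radius <- function(counts, radius, power) {
  radius * (counts / max(counts))^power
}

r_default <- bubble_radius(counts, radius = 0.22, power = 0.5)
r_custom  <- bubble_radius(counts, radius = 0.60, power = 0.8)

# Ratio of the largest to the smallest radius under each setting
max(r_default) / min(r_default)   # equals sqrt(10)
max(r_custom) / min(r_custom)     # equals 10^0.8
```

Under this assumption, a power of 0.5 makes the squared radius, and so the area, proportional to the count, while a larger power widens the gap between the smallest and largest bubbles.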
In this example, increase the absolute size of the bubbles as well as
the relative discrepancy in their sizes. If the bubbles become too
large, so that the largest bubbles become truncated, increase the
spacing of the respective axes with the pad_x and/or pad_y parameters.
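A sketch of such a call, assuming the Employee data in d. The parameter values are those echoed in the suggestions of the output:

```r
library(lessR)

# Employee data included with lessR
d <- Read("Employee")

# Larger bubbles (radius), more size contrast (power), and extra axis
# padding (pad_x, pad_y) so the largest bubbles are not truncated
Plot(Dept, Gender, radius=0.6, power=0.8, pad_x=0.05, pad_y=0.05)
```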
##
## >>> Suggestions or enter: style(suggest=FALSE)
## Plot(Dept, Gender, radius=0.6, power=0.8, pad_x=0.05, pad_y=0.05, size_cut=FALSE)
## Plot(Dept, Gender, radius=0.6, power=0.8, pad_x=0.05, pad_y=0.05, trans=.8, bg="off", grid="off")
## SummaryStats(Dept, Gender) # or ss
##
## Joint and Marginal Frequencies
## ------------------------------
##
## Dept
## Gender ACCT ADMN FINC MKTG SALE Sum
## M 2 2 3 1 10 18
## W 3 4 1 5 5 18
## Sum 5 6 4 6 15 36
##
## Cramer's V: 0.415
##
## Chi-square Test of Independence:
## Chisq = 6.200, df = 4, p-value = 0.185
## >>> Low cell expected frequencies, chi-squared approximation may not be accurate
##
## Some Parameter values (can be manually set)
## -------------------------------------------------------
## radius: 0.60 size of largest bubble
## power: 0.80 relative bubble sizes
Alternatively, plot two categorical variables with a Trellis (facet)
chart by invoking the facet1 parameter. If the first listed variable in
the function call, the x parameter, is categorical, the result is a dot
chart for each level of the facet1 variable.
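The call is not shown in this text. One possible form, assuming the Employee data with Dept as the x variable and Gender as the facet1 variable (the choice of variables is an assumption):

```r
library(lessR)

# Employee data included with lessR
d <- Read("Employee")

# Categorical x with a facet1 variable: one dot chart of Dept per Gender
Plot(Dept, facet1=Gender)
```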
## [Trellis graphics from Deepayan Sarkar's lattice package]
Plotting a single categorical variable yields the corresponding bubble plot of frequencies.
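A minimal sketch, assuming the Employee data in d. The suggestions in the output indicate Dept as the plotted variable:

```r
library(lessR)

# Employee data included with lessR
d <- Read("Employee")

# One categorical variable: a bubble plot of its frequencies
Plot(Dept)
```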
##
## >>> Suggestions or enter: style(suggest=FALSE)
## Plot(Dept, color_low="lemonchiffon2", color_hi="maroon3")
## Plot(Dept, stat="count") # scatter plot of counts
##
## --- Dept ---
##
## ACCT ADMN FINC MKTG SALE Total
## Frequencies: 5 6 4 6 15 36
## Proportions: 0.139 0.167 0.111 0.167 0.417 1.000
##
## Chi-squared test of null hypothesis of equal probabilities
## Chisq = 10.944, df = 4, p-value = 0.027
##
##
##
## Some Parameter values (can be manually set)
## -------------------------------------------------------
## radius: 0.22 size of largest bubble
## power: 0.50 relative bubble sizes
An interactive visualization lets the user change parameter values in
real time to alter characteristics of the visualization. To create an
interactive two-variable scatterplot of continuous variables with the
Employee data that displays the corresponding parameters, run the
function interact() with "ScatterPlot" specified.
interact("ScatterPlot")
To create an interactive Trellis plot as a combined violin, box, and
scatter plot with the five values of Dept from the Employee data set
that displays the corresponding parameters, run the function interact()
with "Trellis" specified.
interact("Trellis")
These functions are not run here because interactivity requires running them directly from the R console.
Use the base R help() function to view the full manual for Plot(). As a
shortcut, enter a question mark followed by the name of the function.
?Plot
More on scatterplots, time series plots, and other visualizations from lessR and other packages such as ggplot2 is available in:
Gerbing, D., R Visualizations: Derive Meaning from Data, CRC Press, May, 2020, ISBN 978-1138599635.