Scatter, Box, and Violin Plots

David Gerbing

lessR provides many versions of a scatter plot with its Plot() function for one or two variables with an option to provide a separate scatterplot for each level of one or two categorical variables. Access all scatterplots with the same simple syntax. The first variable listed without a parameter name, the x parameter, is plotted along the x-axis. Any second variable listed without a parameter name, the y parameter, is plotted along the y-axis. Each parameter may be represented by a continuous or categorical variable, a single variable or a vector of variables.

Plot() also plots time series data when the x-axis variable is a Date variable. See the Time vignette for those examples.

The Data

Illustrate with the Employee data included as part of lessR.

d <- Read("Employee")
## 
## >>> Suggestions
## Recommended binary format for data files: feather
##   Create with Write(d, "your_file", format="feather")
## More details about your data, Enter:  details()  for d, or  details(name)
## 
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
## 
##     Variable                  Missing  Unique 
##         Name     Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1     Years   integer     36       1      16   7  NA  7 ... 1  2  10
##  2    Gender character     37       0       2   M  M  W ... W  W  M
##  3      Dept character     36       1       5   ADMN  SALE  FINC ... MKTG  SALE  FINC
##  4    Salary    double     37       0      37   53788.26  94494.58 ... 56508.32  57562.36
##  5    JobSat character     35       2       3   med  low  high ... high  low  high
##  6      Plan   integer     37       0       3   1  1  2 ... 2  2  1
##  7       Pre   integer     37       0      27   82  62  90 ... 83  59  80
##  8      Post   integer     37       0      22   92  74  86 ... 90  71  87
## ------------------------------------------------------------------------------------------

As an option, lessR also supports variable labels. The labels are displayed on both the text and visualization output. Each displayed label consists of the variable name juxtaposed with the corresponding label. Create the table formatted as two columns. The first column is the variable name and the second column is the corresponding variable label. Not all variables need to be entered into the table. The table can be stored as either a csv file or an Excel file.

Read the variable label file into the l data frame, currently the only permissible name for the data frame of labels.

l <- rd("Employee_lbl")
## 
## >>> Suggestions
## Recommended binary format for data files: feather
##   Create with Write(d, "your_file", format="feather")
## More details about your data, Enter:  details()  for d, or  details(name)
## 
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## ------------------------------------------------------------
## 
##     Variable                  Missing  Unique 
##         Name     Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1     label character      8       0       8   Time of Company Employment ... Test score on legal issues after instruction
## ------------------------------------------------------------------------------------------

Display the available labels.

l
##                                                label
## Years                     Time of Company Employment
## Gender                                  Man or Woman
## Dept                             Department Employed
## Salary                           Annual Salary (USD)
## JobSat            Satisfaction with Work Environment
## Plan             1=GoodHealth, 2=GetWell, 3=BestCare
## Pre    Test score on legal issues before instruction
## Post    Test score on legal issues after instruction
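A two-column label table of the kind described above can be drafted in base R before it is read into lessR. This is only a minimal sketch: the column names `name` and `label` and the file name `my_labels.csv` are hypothetical, and whether the file should carry a header row is an assumption not stated here.

```r
# Two columns: variable name, then its label (only some variables listed)
labels <- data.frame(
  name  = c("Years", "Salary"),
  label = c("Time of Company Employment", "Annual Salary (USD)")
)

# Write to a temporary csv file without a header row (an assumption)
lbl_file <- file.path(tempdir(), "my_labels.csv")
write.table(labels, lbl_file, sep = ",", row.names = FALSE, col.names = FALSE)

# Inspect the raw lines of the file
readLines(lbl_file)
```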

Continuous Variables

Two Variables

Linear Relationship

A typical scatterplot visualizes the relationship of two continuous variables, here Years worked at a company and annual Salary. Following is the function call to Plot() for the default visualization.

Because d is the default name of the data frame that contains the variables for analysis, the data parameter that names the input data frame need not be specified. That is, there is no need to specify data=d, though this parameter can be explicitly included in the function call if desired.

Plot(Years, Salary)

## 
## >>> Suggestions  or  enter: style(suggest=FALSE)
## Plot(Years, Salary, enhance=TRUE)  # many options
## Plot(Years, Salary, fill="skyblue")  # interior fill color of points
## Plot(Years, Salary, fit="lm", fit_se=c(.90,.99))  # fit line, stnd errors
## Plot(Years, Salary, MD_cut=6)  # Mahalanobis distance from center > 6 is an outlier 
## 
## 
## >>> Pearson's product-moment correlation 
##  
## Years: Time of Company Employment 
## Salary: Annual Salary (USD) 
##  
## Number of paired values with neither missing, n = 36 
## Sample Correlation of Years and Salary: r = 0.852 
##   
## Hypothesis Test of 0 Correlation:  t = 9.501,  df = 34,  p-value = 0.000 
## 95% Confidence Interval for Correlation:  0.727 to 0.923 
## 
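The inferential statistics in this output can be approximated from the reported r and n alone. The sketch below assumes the usual t-test for a correlation and the Fisher z transformation for the confidence interval, the same approach as base R cor.test(). Because r is entered as the rounded value 0.852, the results agree with the output only to about two decimal places.

```r
r <- 0.852   # sample correlation as displayed above (rounded)
n <- 36      # number of paired values

# t-statistic for the test of zero correlation
df <- n - 2
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval from the Fisher z transformation
z <- atanh(r)
se_z <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se_z)

round(c(t = t_stat, df = df, lower = ci[1], upper = ci[2]), 3)
```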

Enhance the default scatterplot with parameter enhance. The visualization includes the mean of each variable indicated by the respective line through the scatterplot, the 95% confidence ellipse, labeled outliers, least-squares regression line with 95% confidence interval, and the corresponding regression line with the outliers removed.

Plot(Years, Salary, enhance=TRUE)
## [Ellipse with Murdoch and Chow's function ellipse from their ellipse package]

## 
## 
## >>> Suggestions  or  enter: style(suggest=FALSE)
## Plot(Years, Salary, color="red")  # exterior edge color of points
## Plot(Years, Salary, fit="lm", fit_se=c(.90,.99))  # fit line, stnd errors
## Plot(Years, Salary, out_cut=.10)  # label top 10% from center as outliers 
## 
## >>> Outlier analysis with Mahalanobis Distance 
##  
##   MD  ID 
## ----- ----- 
## 8.14  18 
## 7.84  34 
##  
## 5.63  31 
## 5.58  19 
## 3.75   4 
## ...  ... 
## 
## 
## >>> Pearson's product-moment correlation 
##  
## Years: Time of Company Employment 
## Salary: Annual Salary (USD) 
##  
## Number of paired values with neither missing, n = 36 
## Sample Correlation of Years and Salary: r = 0.852 
##   
## Hypothesis Test of 0 Correlation:  t = 9.501,  df = 34,  p-value = 0.000 
## 95% Confidence Interval for Correlation:  0.727 to 0.923 
## 
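To illustrate the idea behind the outlier analysis, the following sketch computes Mahalanobis distances with the base R function mahalanobis() on simulated data with one planted outlier. The data and variable names are hypothetical. Base R returns squared distances; whether lessR reports squared or unsquared distances is not stated here, so the values are not directly comparable to the MD column above.

```r
set.seed(42)

# 50 simulated points with a strong positive relationship
x_sim <- rnorm(50)
y_sim <- 2 * x_sim + rnorm(50)

# Add one point that runs against the overall pattern (row 51)
X <- cbind(x = c(x_sim, 4), y = c(y_sim, -8))

# Squared Mahalanobis distance of each row from the centroid
md <- mahalanobis(X, center = colMeans(X), cov = cov(X))

# Rows with the largest distances, most extreme first
head(order(md, decreasing = TRUE))
```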

A variety of fit lines can be plotted with the fit parameter. The available values: "loess" for general non-linear fit, "lm" for linear least squares, "null" for the null (flat line) model, "exp" for exponential growth and decay, "quad" for the quadratic model, and "power" for a general power beyond 2. Setting fit to TRUE plots the "loess" line. With the value of "power", specify the exponent with parameter fit_power.

Here, plot the general non-linear fit. For emphasis, set plot_errors to TRUE to plot the residuals from the line. The mean squared error (MSE) is displayed to facilitate the comparison of different models.

Plot(Years, Salary, fit="loess", plot_errors=TRUE)

## 
## 
## >>> Suggestions  or  enter: style(suggest=FALSE)
## Plot(Years, Salary, enhance=TRUE)  # many options
## Plot(Years, Salary, color="red")  # exterior edge color of points
## Plot(Years, Salary, MD_cut=6)  # Mahalanobis distance from center > 6 is an outlier 
## 
##    Loess Model MSE = 100,834,065.368
## 
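The comparison of models by their error can be sketched in base R. The example below uses simulated data and a hypothetical helper `mse()`, defined here as the mean of the squared residuals. The divisor lessR uses for its reported MSE is not stated here, so treat this as an illustration of the idea rather than a reproduction of the value above.

```r
set.seed(1)

# Simulated, roughly linear data
x_sim <- runif(100, 0, 10)
y_sim <- 3 + 2 * x_sim + rnorm(100)

# Hypothetical helper: mean of the squared residuals of a fitted model
mse <- function(fit) mean(residuals(fit)^2)

fit_lm <- lm(y_sim ~ x_sim)      # linear least squares
fit_lo <- loess(y_sim ~ x_sim)   # general non-linear fit

mse_values <- c(lm = mse(fit_lm), loess = mse(fit_lo))
mse_values
```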

Non-Linear Relationship

Next, plot the quadratic fit curve through the data. These data are approximately linear so the quadratic curve does not vary far from a straight line. The function displays the corresponding mean squared error to assist in comparing various models to each other. Activating the plot_errors parameter visualizes the discrepancy between the data points and the plotted curve. The fit_new parameter specifies values of the x-variable from which to compute the corresponding value from the estimated quadratic function.

Plot(Years, Salary, fit="quad", plot_errors=TRUE, fit_new=c(1:3, 55, 30, 20, 45))

## 
## 
## >>> Suggestions  or  enter: style(suggest=FALSE)
## Plot(Years, Salary, enhance=TRUE)  # many options
## Plot(Years, Salary, color="red")  # exterior edge color of points
## Plot(Years, Salary, out_cut=.10)  # label top 10% from center as outliers 
## 
##  Years Salary_Fit                                      
##      1  48320.366                                      
##      2  50874.345                                      
##      3  53494.093                                      
##     20 108092.381                                      
##     30 149087.771 Prediction from beyond the data range
##     45 222912.453 Prediction from beyond the data range
##     55 280349.973 Prediction from beyond the data range
## 
## Regressed linearized data of transformed data values of Salary with sqrt() 
##   Line: b0 = 214.084    b1 = 5.734    Linear Model MSE = 412.447   Rsq = 0.729
##  
## Fit to the data with back transform square of linear regression model 
##  Model MSE = 128,253,592.199 
## 
## 
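According to the output above, the curve is obtained by regressing the square root of Salary on Years and then squaring the fitted values. A minimal sketch of that back-transformation follows, using the rounded coefficients as displayed, so the results differ slightly from the Salary_Fit column.

```r
# Coefficients of the linearized model, as displayed (rounded) above
b0 <- 214.084
b1 <- 5.734

# Values of Years passed to fit_new, in sorted order
years_new <- c(1, 2, 3, 20, 30, 45, 55)

# Fitted value on the square-root scale, then back-transform by squaring
sqrt_fit <- b0 + b1 * years_new
salary_fit <- sqrt_fit^2

data.frame(Years = years_new, Salary_Fit = round(salary_fit, 1))
```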

Other functional fits are available, such as "exp" for exponential.

Three Variables

Map a continuous variable, such as Pre, to the size of the plotted points with the size parameter to create a bubble plot.

Plot(Years, Salary, size=Pre)

##  
##  
## 
## Some Parameter values (can be manually set) 
## ------------------------------------------------------- 
## radius: 0.12    size of largest bubble 
## power: 0.50     relative bubble sizes

Indicate multiple variables to plot along either axis with a vector defined according to the base R function c(). Plot the linear model for each variable with the fit parameter set to "lm". By default, when multiple lines are plotted on the same panel, the confidence interval is turned off by internally setting the parameter fit_se to 0. Explicitly override this parameter value as needed.

Plot(c(Pre, Post), Salary, fit="lm", fit_se=0)

## 
## 
## >>> Suggestions  or  enter: style(suggest=FALSE)
## Plot(c(Pre, Post), Salary, enhance=TRUE)  # many options
## Plot(c(Pre, Post), Salary, color="red")  # exterior edge color of points
## Plot(c(Pre, Post), Salary, out_cut=.10)  # label top 10% from center as outliers 
## 
## 
## >>> Pearson's product-moment correlation 
##  
## Post: Test score on legal issues after instruction 
## Salary: Annual Salary (USD) 
##  
## Number of paired values with neither missing, n = 37 
## Sample Correlation of Post and Salary: r = -0.070 
##   
## Hypothesis Test of 0 Correlation:  t = -0.416,  df = 35,  p-value = 0.680 
## 95% Confidence Interval for Correlation:  -0.385 to 0.260 
## 

Scatterplot Matrix

Multiple variables for the first parameter value, x, and no values for y, plot as a scatterplot matrix. Pass a single vector, such as defined by c(). Request the non-linear fit line and corresponding confidence interval by specifying TRUE or "loess" for the fit parameter. Request a linear fit line with the value of "lm".

Plot(c(Salary, Years, Pre, Post), fit="lm")

Smoothed and Binned Scatterplots

Smoothing and binning are two procedures for visualizing a relationship with many data values.

To obtain a larger data set, this example generates random data with base R rnorm() and then plots. Plot() first checks for the specified variables in the global environment (workspace). If they are not found there, Plot() looks for them in a data frame, by default d. Here, randomly generate values from normal populations for x and y in the workspace.

set.seed(13)
x <- rnorm(4000)
y <- 8*x + rnorm(4000, 1, 30)
Plot(x, y)
## >>> Note: x is not in a data frame (table)
## >>> Note: y is not in a data frame (table)

## 
## >>> Suggestions  or  enter: style(suggest=FALSE)
## Plot(x, y, enhance=TRUE)  # many options
## Plot(x, y, color="red")  # exterior edge color of points
## Plot(x, y, fit="lm", fit_se=c(.90,.99))  # fit line, stnd errors
## Plot(x, y, out_cut=.10)  # label top 10% from center as outliers 
## 
## 
## >>> Pearson's product-moment correlation 
##  
## Number of paired values with neither missing, n = 4000 
## Sample Correlation of x and y: r = 0.251 
##   
## Hypothesis Test of 0 Correlation:  t = 16.397,  df = 3998,  p-value = 0.000 
## 95% Confidence Interval for Correlation:  0.222 to 0.280 
## 

With large data sets, even for continuous variables there can be much over-plotting of points. One strategy to address this issue smooths the scatterplot by turning on the smooth parameter. The individual points superimposed on the smoothed plot are potential outliers. The default number of plotted outliers is 100. Turn off the plotting of outliers completely by setting parameter smooth_points to 0. Show the linear trend with fit set to "lm".

Plot(x, y, smooth=TRUE, fit="lm")
## >>> Note: x is not in a data frame (table)
## >>> Note: y is not in a data frame (table)

## 
## 
## >>> Suggestions  or  enter: style(suggest=FALSE)
## Plot(x, y, enhance=TRUE)  # many options
## Plot(x, y, fill="skyblue")  # interior fill color of points
## Plot(x, y, MD_cut=6)  # Mahalanobis distance from center > 6 is an outlier 
## 
## 
## >>> Pearson's product-moment correlation 
##  
## Number of paired values with neither missing, n = 4000 
## Sample Correlation of x and y: r = 0.251 
##   
## Hypothesis Test of 0 Correlation:  t = 16.397,  df = 3998,  p-value = 0.000 
## 95% Confidence Interval for Correlation:  0.222 to 0.280 
##   
## 
##   Line: b0 = 1.03068757    b1 = 7.91963664    Linear Model MSE = 917.03180812   Rsq = 0.063
## 
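The fitted line can be cross-checked with base R lm() on data generated the same way. This sketch assumes the same seed and generating statements as above, and it also confirms the general identity that R-squared of a simple linear regression equals the squared correlation.

```r
# Regenerate the simulated data with the same seed as above
set.seed(13)
x <- rnorm(4000)
y <- 8*x + rnorm(4000, 1, 30)

# Least-squares line of y on x
fit <- lm(y ~ x)
coef(fit)

# R-squared of the simple regression equals the squared correlation
r_sq <- summary(fit)$r.squared
all.equal(r_sq, cor(x, y)^2)
```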

Another strategy for alleviating over-plotting makes the fill color mostly transparent with the transparency parameter, or turns off the fill completely by setting fill to "off". The closer the value of transparency is to 1, the more transparent the fill.

Plot(x, y, transparency=0.95)
## >>> Note: x is not in a data frame (table)
## >>> Note: y is not in a data frame (table)

## 
## >>> Suggestions  or  enter: style(suggest=FALSE)
## Plot(x, y, enhance=TRUE)  # many options
## Plot(x, y, color="red")  # exterior edge color of points
## Plot(x, y, fit="lm", fit_se=c(.90,.99))  # fit line, stnd errors
## Plot(x, y, MD_cut=6)  # Mahalanobis distance from center > 6 is an outlier 
## 
## 
## >>> Pearson's product-moment correlation 
##  
## Number of paired values with neither missing, n = 4000 
## Sample Correlation of x and y: r = 0.251 
##   
## Hypothesis Test of 0 Correlation:  t = 16.397,  df = 3998,  p-value = 0.000 
## 95% Confidence Interval for Correlation:  0.222 to 0.280 
## 

Another way to visualize a relationship when there are many data points is to bin the x-axis. Specify the number of bins with parameter n_bins. Plot() then computes the mean of y for each bin and connects the means by line segments. By default, this procedure plots the conditional means without any assumption of form such as linearity. Set the stat parameter to "median" to instead compute the median of y for each bin. The standard Plot() parameters fill, color, size and segments also apply.

Plot(x, y, n_bins=5)
## >>> Note: x is not in a data frame (table)
## >>> Note: y is not in a data frame (table)

## 
## Table: Summary Stats 
##  
##                x          y 
## -------  -------  --------- 
## n           4000       4000 
## n.miss         0          0 
## min       -3.239   -104.740 
## max        3.589    112.460 
## mean      -0.003      1.006 
## 
##  
## Table: mean of y for levels of x 
##  
##                   bin      n    midpt      mean 
## ---  ----------------  -----  -------  -------- 
## 1     [-3.246,-1.873]    116   -2.560   -16.734 
## 2     (-1.873,-0.508]   1090   -1.191    -5.699 
## 3      (-0.508,0.858]   2001    0.175     0.848 
## 4       (0.858,2.223]    743    1.541    12.374 
## 5       (2.223,3.596]     50    2.909    25.696
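The binning itself can be sketched with base R cut() and tapply(). The bin limits in the table above are consistent with five equal-width intervals over the range of x, which is what cut() produces with breaks=5, though this is an inference from the output rather than a statement about how Plot() is implemented. The name `binned` is hypothetical.

```r
# Regenerate the simulated data with the same seed as above
set.seed(13)
x <- rnorm(4000)
y <- 8*x + rnorm(4000, 1, 30)

# Divide the range of x into five equal-width bins
bins <- cut(x, breaks = 5, include.lowest = TRUE)

# Count and conditional mean of y within each bin
binned <- data.frame(
  n    = as.vector(table(bins)),
  mean = as.vector(tapply(y, bins, mean)),
  row.names = levels(bins)
)
binned
```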

One Variable

The default plot for a single continuous variable includes not only the scatterplot, but also the superimposed violin plot and box plot, with outliers identified. Call this plot the VBS plot.

Plot(Salary)
## [Violin/Box/Scatterplot graphics from Deepayan Sarkar's lattice package]
## 
## >>> Suggestions
## Plot(Salary, out_cut=2, fences=TRUE, vbs_mean=TRUE)  # Label two outliers ...
## Plot(Salary, box_adj=TRUE)  # Adjust boxplot whiskers for asymmetry

## --- Salary --- 
## Present: 37 
## Missing: 0 
## Total  : 37 
##  
## Mean         : 73795.557 
## Stnd Dev     : 21799.533 
## IQR          : 31012.560 
## Skew         : 0.190   [medcouple, -1 to 1] 
##  
## Minimum      : 46124.970 
## Lower Whisker: 46124.970 
## 1st Quartile : 56772.950 
## Median       : 69547.600 
## 3rd Quartile : 87785.510 
## Upper Whisker: 122563.380 
## Maximum      : 134419.230 
## 
##   
## --- Outliers ---     from the box plot: 1 
##  
## Small      Large 
## -----      ----- 
##             134419.23 
## 
## Number of duplicated values: 0 
## 
## Parameter values (can be manually set) 
## ------------------------------------------------------- 
## size: 0.61      size of plotted points 
## out_size: 0.82  size of plotted outlier points 
## jitter_y: 0.45 random vertical movement of points 
## jitter_x: 0.00  random horizontal movement of points 
## bw: 9529.04       set bandwidth higher for smoother edges
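The single labeled outlier can be verified by hand from the quartiles in this output. The sketch below assumes the conventional Tukey rule of 1.5 IQRs beyond the quartiles, which is not stated explicitly here.

```r
# Quartiles and extremes as displayed in the output above
q1 <- 56772.95
q3 <- 87785.51
min_salary <- 46124.97
max_salary <- 134419.23

# Interquartile range and the fences (assumed 1.5 * IQR rule)
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

c(IQR = iqr, lower_fence = lower_fence, upper_fence = upper_fence)

# The maximum lies beyond the upper fence; the minimum is within the lower
c(large_outlier = max_salary > upper_fence,
  small_outlier = min_salary < lower_fence)
```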

Control the choice of the three superimposed plots – violin, box, and scatter – with the vbs_plot parameter. The default setting is "vbs" for all three plots. Here, for example, obtain just the box plot. Or, use the alias BoxPlot() in place of Plot().

Plot(Salary, vbs_plot="b")
## [Violin/Box/Scatterplot graphics from Deepayan Sarkar's lattice package]
## 
## >>> Suggestions
## Plot(Salary, out_cut=2, fences=TRUE, vbs_mean=TRUE)  # Label two outliers ...
## Plot(Salary, box_adj=TRUE)  # Adjust boxplot whiskers for asymmetry

## --- Salary --- 
## Present: 37 
## Missing: 0 
## Total  : 37 
##  
## Mean         : 73795.557 
## Stnd Dev     : 21799.533 
## IQR          : 31012.560 
## Skew         : 0.190   [medcouple, -1 to 1] 
##  
## Minimum      : 46124.970 
## Lower Whisker: 46124.970 
## 1st Quartile : 56772.950 
## Median       : 69547.600 
## 3rd Quartile : 87785.510 
## Upper Whisker: 122563.380 
## Maximum      : 134419.230 
## 
##   
## --- Outliers ---     from the box plot: 1 
##  
## Small      Large 
## -----      ----- 
##             134419.23 
## 
## Number of duplicated values: 0

Obtain a frequency distribution by specifying the value of parameter stat_x: "count" to plot counts on the y-axis, or "proportion" or "%" to plot proportions. If desired, specify a custom bin width with the parameter bin_width.

Plot(Salary, stat_x="%", bin_width=13000)
## >>> Suggestions or enter: style(suggest=FALSE)
## Plot(Salary, stat_x="%", bin_width=13000, size=0)  # just line segments, no points 
## 
## --- Salary --- 
##  
##      n   miss         mean           sd          min          mdn          max 
##      37      0    73795.557    21799.533    46124.970    69547.600   134419.230 
##  
## 
## 
## Bin Width: 13000 
## Number of Bins: 8 
##  
##              Bin  Midpnt  Count    Prop  Cumul.c  Cumul.p 
## --------------------------------------------------------- 
##   40000 >  53000   46500      5    0.14        5     0.14 
##   53000 >  66000   59500     10    0.27       15     0.41 
##   66000 >  79000   72500     10    0.27       25     0.68 
##   79000 >  92000   85500      4    0.11       29     0.78 
##   92000 > 105000   98500      4    0.11       33     0.89 
##  105000 > 118000  111500      2    0.05       35     0.95 
##  118000 > 131000  124500      1    0.03       36     0.97 
##  131000 > 144000  137500      1    0.03       37     1.00 
##  
## 
## No (Box plot) outliers
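The derived columns of the frequency table follow directly from the bin counts. As a check, this sketch rebuilds them in base R from the counts displayed above. The data frame `freq` and its column names are hypothetical.

```r
# Bin counts as displayed in the table above
counts <- c(5, 10, 10, 4, 4, 2, 1, 1)
bin_width <- 13000

# Lower bin limits start at 40000 and step by the bin width
lower <- seq(40000, by = bin_width, length.out = length(counts))

freq <- data.frame(
  lower   = lower,
  upper   = lower + bin_width,
  midpt   = lower + bin_width / 2,
  count   = counts,
  prop    = round(counts / sum(counts), 2),
  cumul_c = cumsum(counts),
  cumul_p = round(cumsum(counts) / sum(counts), 2)
)
freq
```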

Cleveland Dot Plot

Create a Cleveland dot plot when one of the variables has unique (ID) values. In this example, for a single variable, row names are on the y-axis. By default, the plot sorts the rows by the plotted value in ascending order, according to the default value of "+" for parameter sort_yx. Set to "-" for a descending plot and "0" for no sorting.

Plot(Salary, row_names)

## 
## >>> Suggestions or enter: style(suggest=FALSE)
## Plot(Salary, row_names, sort_yx="0")  # do not sort y-axis variable by x-axis variable
## Plot(Salary, row_names, segments_y=FALSE)  # drop the line segments
## Plot(Salary, row_names, fill="red")  # red point interiors
## 

The standard scatterplot version of a Cleveland dot plot follows, with no sorting and no line segments.

Plot(Salary, row_names, sort_yx="0", segments_y=FALSE)

## 
## >>> Suggestions or enter: style(suggest=FALSE)
## Plot(Salary, row_names, segments_y=FALSE, sort_yx="0", fill="red")  # red point interiors
## 

This Cleveland dot plot has two x-variables, indicated as a standard R vector with the c() function. In this situation, the two points on each row are connected with a line segment. By default the rows are sorted by distance between the successive points.

Plot(c(Pre, Post), row_names)