2. Visualize: Scatter, Box, and Violin Plots

lessR provides many versions of a scatter plot with its XY() function for one or two variables with an option to provide a separate scatterplot for each level of one or two categorical variables. Access all scatterplots with the same simple syntax. The first variable listed without a parameter name, the x parameter, is plotted along the x-axis. Any second variable listed without a parameter name, the y parameter, is plotted along the y-axis. Each parameter may be represented by a continuous or categorical variable, a single variable or a vector of variables.

XY() also plots time series data when the x-axis variable is a Date variable. See the Time vignette for those examples.

The Data

Illustrate with the Employee data included as part of lessR.

d <- Read("Employee")

## 
## >>> Suggestions
## Recommended binary format for data files: feather
##   Create with Write(d, "your_file", format="feather")
## More details about your data, Enter:  details()  for d, or  details(name)
## 
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
## 
##     Variable                  Missing  Unique 
##         Name     Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1     Years   integer     36       1      16   7  NA  7 ... 1  2  10
##  2    Gender character     37       0       2   M  M  W ... W  W  M
##  3      Dept character     36       1       5   ADMN  SALE  FINC ... MKTG  SALE  FINC
##  4    Salary    double     37       0      37   63788.26  104494.58 ... 66508.32  67562.36
##  5    JobSat character     35       2       3   med  low  high ... high  low  high
##  6      Plan   integer     37       0       3   1  1  2 ... 2  2  1
##  7       Pre   integer     37       0      27   82  62  90 ... 83  59  80
##  8      Post   integer     37       0      22   92  74  86 ... 90  71  87
## ------------------------------------------------------------------------------------------

As an option, lessR also supports variable labels. The labels are displayed on both the text and visualization output. Each displayed label consists of the variable name juxtaposed with the corresponding label. Create the table formatted as two columns. The first column is the variable name and the second column is the corresponding variable label. Not all variables need to be entered into the table. The table can be stored as either a csv file or an Excel file.

Read the variable label file into the l data frame, currently the only permissible name for the label file.

l <- rd("Employee_lbl")

## 
## >>> Suggestions
## Recommended binary format for data files: feather
##   Create with Write(d, "your_file", format="feather")
## More details about your data, Enter:  details()  for d, or  details(name)
## 
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## ------------------------------------------------------------
## 
##     Variable                  Missing  Unique 
##         Name     Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1     label character      8       0       8   Time of Company Employment ... Test score on legal issues after instruction
## ------------------------------------------------------------------------------------------

Display the available labels.

##                                                label
## Years                     Time of Company Employment
## Gender                                  Man or Woman
## Dept                             Department Employed
## Salary                           Annual Salary (USD)
## JobSat            Satisfaction with Work Environment
## Plan             1=GoodHealth, 2=GetWell, 3=BestCare
## Pre    Test score on legal issues before instruction
## Post    Test score on legal issues after instruction

Continuous Variables

Two Variables

A typical scatterplot visualizes the relationship of two continuous variables, here Years worked at a company, and annual Salary. Following is the function call to XY() for the default visualization.

Because d is the default name of the data frame that contains the variables for analysis, the data parameter that names the input data frame need not be specified. That is, no need to specify data=d, though this parameter can be explicitly included in the function call if desired.

XY(Years, Salary)

## [Interactive chart from the Plotly R package (Sievert, 2020)]

## 
## >>> Suggestions  or  enter: style(suggest=FALSE)
## XY(Years, Salary, enhance=TRUE)  # many options
## XY(Years, Salary, fill="skyblue")  # interior fill color of points
## XY(Years, Salary, fit="lm", fit_se=c(.90,.99))  # fit line, stnd errors
## XY(Years, Salary, out_cut=.10)  # label top 10% from center as outliers 
## 
## 
## >>> Pearson's product-moment correlation 
##  
## Years: Time of Company Employment 
## Salary: Annual Salary (USD) 
##  
## Number of paired values with neither missing, n = 36 
## Sample Correlation of Years and Salary: r = 0.852 
##   
## Hypothesis Test of 0 Correlation:  t = 9.501,  df = 34,  p-value = 0.000 
## 95% Confidence Interval for Correlation:  0.727 to 0.923 
##
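The correlation summary above is a standard Pearson analysis. As a rough base R sketch of the same kind of computation (not the lessR source code), the following uses the built-in cars data set as a stand-in for the Employee data, so the variable names and the resulting numbers are assumptions for illustration only. It also shows that the t statistic follows directly from r and n.

```r
# Stand-in data: the built-in cars data set (speed, dist), not the Employee data
x <- cars$speed
y <- cars$dist

# Pearson correlation with its t-test and 95% confidence interval
ct <- cor.test(x, y, method = "pearson", conf.level = 0.95)

r  <- unname(ct$estimate)          # sample correlation
n  <- sum(complete.cases(x, y))    # paired values with neither missing
df <- n - 2

# The t statistic computed by hand from r and the degrees of freedom
t_manual <- r * sqrt(df) / sqrt(1 - r^2)

cat("n =", n, " r =", round(r, 3), "\n")
cat("t =", round(t_manual, 3), " df =", df, "\n")
cat("95% CI:", round(ct$conf.int[1], 3), "to", round(ct$conf.int[2], 3), "\n")
```

The hand-computed t agrees with the statistic reported by cor.test(), which is why a strong correlation in a modest sample still yields a large t value.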

Enhance the default scatterplot with parameter enhance. The visualization includes the mean of each variable indicated by the respective line through the scatterplot, the 95% confidence ellipse, labeled outliers, least-squares regression line with 95% confidence interval, and the corresponding regression line with the outliers removed.

XY(Years, Salary, enhance=TRUE)

## [Interactive chart from the Plotly R package (Sievert, 2020)] 
## [Ellipse with Murdoch and Chow's function ellipse from their ellipse package]

## 
## 
## >>> Suggestions  or  enter: style(suggest=FALSE)
## XY(Years, Salary, color="red")  # exterior edge color of points
## XY(Years, Salary, fit="lm", fit_se=c(.90,.99))  # fit line, stnd errors
## XY(Years, Salary, out_cut=.10)  # label top 10% from center as outliers 
## 
## >>> Outlier analysis with Mahalanobis Distance 
##  
##   MD                   ID 
## -----                ----- 
## 8.34       Skrotzki, Sara 
## 7.73       Billing, Susan 
##  
## 5.83        Fulton, Scott 
## 5.77      Correll, Trevon 
## 3.92       Downs, Deborah 
## ...                   ... 
## 
## 
## >>> Pearson's product-moment correlation 
##  
## Years: Time of Company Employment 
## Salary: Annual Salary (USD) 
##  
## Number of paired values with neither missing, n = 36 
## Sample Correlation of Years and Salary: r = 0.852 
##   
## Hypothesis Test of 0 Correlation:  t = 9.501,  df = 34,  p-value = 0.000 
## 95% Confidence Interval for Correlation:  0.727 to 0.923 
##
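The outlier listing above ranks rows by Mahalanobis distance from the center of the two variables. A minimal base R sketch of that idea follows, using two columns of the built-in mtcars data as a stand-in. Note that the base R function mahalanobis() returns the squared distance; whether lessR reports the squared or unsquared value is not stated here, and the cutoff of 6 simply echoes the MD_cut=6 suggestion in the output.

```r
# Stand-in data: two numeric columns from the built-in mtcars data set
xy <- mtcars[, c("wt", "mpg")]

# Squared Mahalanobis distance of each row from the bivariate center
center <- colMeans(xy)
S      <- cov(xy)
md     <- mahalanobis(xy, center = center, cov = S)

# List the rows farthest from the center, largest distance first
top <- sort(md, decreasing = TRUE)
print(round(head(top, 5), 2))

# Illustrative cutoff only: flag rows whose distance exceeds 6
cutoff   <- 6
outliers <- names(md)[md > cutoff]
print(outliers)
```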

By default, both axes abbreviate numeric values in the thousands, such as 100000 displayed as 100K. Change this default of "K" with parameter axis_fmt. Specify "," to insert commas in large numbers that use a decimal point, "." to insert periods instead, or "" to turn off formatting. The value "K" can also be combined with "," or "." by forming a vector of values, such as c("K", ",").

Axis labels can also be formatted by adding a prefix to a numeric value with the parameters axis_x_prefix and axis_y_prefix, such as $ or €. The specified value can be multiple characters, such as for the Brazilian currency, R$.

XY(Years, Salary, axis_fmt=",", axis_y_prefix="$")

## [Interactive chart from the Plotly R package (Sievert, 2020)]

## 
## >>> Suggestions  or  enter: style(suggest=FALSE)
## XY(Years, Salary, enhance=TRUE)  # many options
## XY(Years, Salary, fill="skyblue")  # interior fill color of points
## XY(Years, Salary, fit="lm", fit_se=c(.90,.99))  # fit line, stnd errors
## XY(Years, Salary, out_cut=.10)  # label top 10% from center as outliers 
## 
## 
## >>> Pearson's product-moment correlation 
##  
## Years: Time of Company Employment 
## Salary: Annual Salary (USD) 
##  
## Number of paired values with neither missing, n = 36 
## Sample Correlation of Years and Salary: r = 0.852 
##   
## Hypothesis Test of 0 Correlation:  t = 9.501,  df = 34,  p-value = 0.000 
## 95% Confidence Interval for Correlation:  0.727 to 0.923 
##
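To make the label styles concrete, here is a small base R sketch that builds the same kinds of strings by hand. It only imitates the appearance described above and is not how XY() formats its axes internally; the tick values are made up for illustration.

```r
# Hypothetical tick values for a salary axis
ticks <- c(50000, 100000, 150000)

# Thousands abbreviated with K
lab_K <- paste0(ticks / 1000, "K")

# Commas as thousands separators, plus a currency prefix
lab_comma <- paste0("$", format(ticks, big.mark = ",",
                                scientific = FALSE, trim = TRUE))

# Periods as thousands separators; a multi-character prefix also works
lab_period <- paste0("R$", format(ticks, big.mark = ".", decimal.mark = ",",
                                  scientific = FALSE, trim = TRUE))

print(lab_K)
print(lab_comma)
print(lab_period)
```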

A variety of fit lines can be plotted. The available values of the fit parameter are "loess" for a general non-linear fit, "lm" for linear least squares, "null" for the null (flat line) model, "exp" for exponential growth and decay, "quad" for the quadratic model, and "power" for the general power beyond 2. Setting fit to TRUE plots the "loess" line. With the value of "power", specify the value of the root with parameter fit_power.

Here, plot the general non-linear fit. For emphasis, set plot_errors to TRUE to plot the residuals from the line. The mean squared error (MSE) of the fit is displayed to facilitate the comparison of different models.

XY(Years, Salary, fit="loess", plot_errors=TRUE)

## [Interactive chart from the Plotly R package (Sievert, 2020)]

## 
## 
## >>> Suggestions  or  enter: style(suggest=FALSE)
## XY(Years, Salary, enhance=TRUE)  # many options
## XY(Years, Salary, color="red")  # exterior edge color of points
## XY(Years, Salary, MD_cut=6)  # Mahalanobis distance from center > 6 is an outlier 
## 
##    Loess Model MSE = 100,834,065.368
##

Next, plot the exponential fit and show the residuals from the exponential curve. These data are approximately linear, so the exponential curve does not vary far from a straight line. The function displays the corresponding mean squared error (MSE) to assist in comparing the various models to each other.

XY(Years, Salary, fit="exp", plot_errors=TRUE)

## [Interactive chart from the Plotly R package (Sievert, 2020)]

## 
## 
## >>> Suggestions  or  enter: style(suggest=FALSE)
## XY(Years, Salary, enhance=TRUE)  # many options
## XY(Years, Salary, fill="skyblue")  # interior fill color of points
## XY(Years, Salary, out_cut=.10)  # label top 10% from center as outliers 
## 
## Regressed linearized data of transformed data values of Salary with log() 
##   Line: b0 = 10.959    b1 = 0.036
##   Linear Model MSE = 0.0168   Rsq = 0.725
##  
## Fit to the data with back transform exp() of linear regression model 
##  Model MSE = 127,930,074.484 
## 
##
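The output above describes two steps: a linear regression on the log of the response, then a back transformation with exp() to obtain the fit on the original scale. A minimal base R sketch of that procedure follows. It runs on simulated stand-in data rather than the Employee data, and the divisor n - 2 for the MSE is an assumption made here because two coefficients are estimated, so the numbers will not match the lessR output.

```r
set.seed(1)
# Simulated stand-in data with roughly exponential growth (not the Employee data)
x <- runif(40, min = 1, max = 20)
y <- 50000 * exp(0.04 * x) * exp(rnorm(40, sd = 0.1))

# Step 1: linear regression on the log-transformed response
fit_log <- lm(log(y) ~ x)
b0 <- unname(coef(fit_log)[1])
b1 <- unname(coef(fit_log)[2])

# Step 2: back-transform with exp() to get fitted values on the original scale
y_hat <- exp(b0 + b1 * x)

# Step 3: mean squared error on the original scale (divisor n - 2 is an assumption)
mse_exp <- sum((y - y_hat)^2) / (length(y) - 2)

# Same error measure for a straight-line fit, for comparison
fit_lin <- lm(y ~ x)
mse_lin <- sum(residuals(fit_lin)^2) / (length(y) - 2)

cat("exp fit: b0 =", round(b0, 3), " b1 =", round(b1, 3),
    " MSE =", round(mse_exp), "\n")
cat("lm fit:  MSE =", round(mse_lin), "\n")
```

Comparing the two MSE values on the original scale of y is what makes different functional forms comparable, since the MSE of the linearized model is expressed in log units.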

The parameter fit_power transforms the y variable to the specified power from the default of 1 before doing the regression analysis. The availability of this parameter provides for a wide range of modifications to the underlying functional form of the fit curve.

Three Variables

Map a continuous variable, such as Pre, to the plotted points with the size parameter, a bubble plot.

XY(Years, Salary, size=Pre)

## [Interactive chart from the Plotly R package (Sievert, 2020)]

##  
##  
## 
## Some Parameter values (can be manually set) 
## ------------------------------------------------------- 
## radius: 0.12    size of largest bubble 
## power: 0.50     relative bubble sizes

Indicate multiple variables to plot along either axis with a vector defined according to the base R function c(). Plot the linear model for each variable according to the fit parameter set to "lm". By default, when multiple lines are plotted on the same panel, the confidence interval is turned off by internally setting the parameter fit_se to 0. Explicitly override this parameter value as needed.

XY(c(Pre, Post), Salary, fit="lm", fit_se=0)

## [Interactive chart from the Plotly R package (Sievert, 2020)]

## 
## 
## >>> Suggestions  or  enter: style(suggest=FALSE)
## XY(c(Pre, Post), Salary, enhance=TRUE)  # many options
## XY(c(Pre, Post), Salary, color="red")  # exterior edge color of points
## XY(c(Pre, Post), Salary, MD_cut=6)  # Mahalanobis distance from center > 6 is an outlier 
## 
## 
## >>> Pearson's product-moment correlation 
##  
## Post: Test score on legal issues after instruction 
## Salary: Annual Salary (USD) 
##  
## Number of paired values with neither missing, n = 37 
## Sample Correlation of Post and Salary: r = -0.070 
##   
## Hypothesis Test of 0 Correlation:  t = -0.416,  df = 35,  p-value = 0.680 
## 95% Confidence Interval for Correlation:  -0.385 to 0.260 
##

VBS Plot of 5 Variables

Read the data and convert the values of numerically valued categorical variables to meaningful labels.

d <- Read("Cars93")

## 
## >>> Suggestions
## Recommended binary format for data files: feather
##   Create with Write(d, "your_file", format="feather")
## More details about your data, Enter:  details()  for d, or  details(name)
## 
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
## 
##       Variable                  Missing  Unique 
##           Name     Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1        Make character     93       0      32   Acura  Acura ... Volvo  Volvo
##  2        Type character     93       0       6   Small  Midsize ... Compact  Midsize
##  3    MinPrice    double     93       0      79   12.9  29.2  25.9 ... 22.9  21.8  24.8
##  4    MidPrice    double     93       0      81   15.9  33.9  29.1 ... 23.3  22.7  26.7
##  5    MaxPrice    double     93       0      79   18.8  38.7  32.3 ... 23.7  23.5  28.5
##  6     MPGcity   integer     93       0      21   25  18  20 ... 18  21  20
##  7    MPGhiway   integer     93       0      22   31  25  26 ... 25  28  28
##  8     Airbags   integer     93       0       3   0  2  1 ... 0  1  2
##  9  DriveTrain   integer     93       0       3   1  1  1 ... 1  0  1
## 10   Cylinders   integer     92       1       5   4  6  6 ... 6  4  5
## 11      Engine    double     93       0      26   1.8  3.2  2.8 ... 2.8  2.3  2.4
## 12          HP   integer     93       0      57   140  200  172 ... 178  114  168
## 13         RPM   integer     93       0      24   6300  5500  5500 ... 5800  5400  6200
## 14     RevMile   integer     93       0      78   2890  2335  2280 ... 2385  2215  2310
## 15      Manual   integer     93       0       2   1  1  1 ... 1  1  1
## 16     FuelCap    double     93       0      38   13.2  18  16.9 ... 18.5  15.8  19.3
## 17     PassCap   integer     93       0       6   5  5  5 ... 4  5  5
## 18      Length   integer     93       0      51   177  195  180 ... 159  190  184
## 19   Wheelbase   integer     93       0      27   102  115  102 ... 97  104  105
## 20       Width   integer     93       0      16   68  71  67 ... 66  67  69
## 21       Uturn   integer     93       0      14   37  38  37 ... 36  37  38
## 22    RearSeat    double     91       2      24   26.5  30  28 ... 26  29.5  30
## 23      LugCap   integer     82      11      16   11  15  14 ... 15  14  15
## 24      Weight   integer     93       0      81   2705  3560  3375 ... 2810  2985  3245
## 25      Source   integer     93       0       2   0  0  0 ... 0  0  0
## ------------------------------------------------------------------------------------------

d$Airbags <- factor(d$Airbags, levels=0:2, labels=c("none", "driver", "drv+pas"))
d$DriveTrain <- factor(d$DriveTrain, levels=0:2, labels=c("rear", "front", "all"))
d$Manual <- factor(d$Manual, levels=0:1, labels=c("Not_Avail", "Available"))

Visualize the scatterplot of MPGhiway and HP, stratified against three categorical variables: Airbags plotted in different colors for each scatterplot, and separate scatterplots for all six combinations of the levels of DriveTrain and Manual.

XY(x=MPGhiway, y=HP, by=Airbags, facet=c(DriveTrain, Manual))

## [Interactive chart from the Plotly R package (Sievert, 2020)] 
## [Trellis (facet) graphics from Deepayan Sarkar's lattice package]

## 
## ---------- Summary Statistics for MPGhiway
## 
## Airbags   n  Mean  Median  SD  IQR  Min  Max
##    none  34    31      30   6    6   20   50
##  driver  43    28      28   5    4   20   46
## drv+pas  16    27      28   2    2   23   31
## 
## DriveTrain   n  Mean  Median  SD  IQR  Min  Max
##       rear  16    26      26   2    3   22   30
##      front  67    30      29   5    6   21   50
##        all  10    26      24   6    9   20   37
## 
##    Manual   n  Mean  Median  SD  IQR  Min  Max
## Not_Avail  32    26      26   3    3   20   31
## Available  61    31      30   6    6   20   50

Scatterplot Matrix

To plot a scatterplot matrix, specify multiple variables for the first parameter value, x, repeated for the second parameter, y. Define these multiple variables as a vector, such as defined by c(). Request the non-linear fit line and corresponding confidence interval by specifying TRUE or "loess" for the fit parameter. Request a linear fit line with the value of "lm".

d <- Read("Employee")

## 
## >>> Suggestions
## Recommended binary format for data files: feather
##   Create with Write(d, "your_file", format="feather")
## More details about your data, Enter:  details()  for d, or  details(name)
## 
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
## 
##     Variable                  Missing  Unique 
##         Name     Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1     Years   integer     36       1      16   7  NA  7 ... 1  2  10
##  2    Gender character     37       0       2   M  M  W ... W  W  M
##  3      Dept character     36       1       5   ADMN  SALE  FINC ... MKTG  SALE  FINC
##  4    Salary    double     37       0      37   63788.26  104494.58 ... 66508.32  67562.36
##  5    JobSat character     35       2       3   med  low  high ... high  low  high
##  6      Plan   integer     37       0       3   1  1  2 ... 2  2  1
##  7       Pre   integer     37       0      27   82  62  90 ... 83  59  80
##  8      Post   integer     37       0      22   92  74  86 ... 90  71  87
## ------------------------------------------------------------------------------------------

XY(c(Salary, Years, Pre, Post), c(Salary, Years, Pre, Post), fit="lm")

## [Interactive chart from the Plotly R package (Sievert, 2020)]

Smoothed, Contoured, and Binned Scatterplots

Smoothing and binning are two procedures for visualizing a relationship with many data values.

To obtain a larger data set, in this example generate random data with base R rnorm(), then plot. XY() first checks for the presence of the specified variables in the global environment (workspace). If the variables are not found there, XY() looks for them in a data frame, of which the default name is d. Here, randomly generate values from normal populations for x and y in the workspace.

set.seed(13)
x <- rnorm(4000)
y <- 8*x + rnorm(4000, 1, 30)
XY(x, y)

## [Interactive chart from the Plotly R package (Sievert, 2020)] 
## >>> Note: x is not in a data frame (table)
## >>> Note: y is not in a data frame (table)

## 
## >>> Suggestions  or  enter: style(suggest=FALSE)
## XY(x, y, enhance=TRUE)  # many options
## XY(x, y, color="red")  # exterior edge color of points
## XY(x, y, fit="lm", fit_se=c(.90,.99))  # fit line, stnd errors
## XY(x, y, out_cut=.10)  # label top 10% from center as outliers 
## 
## 
## >>> Pearson's product-moment correlation 
##  
## Number of paired values with neither missing, n = 4000 
## Sample Correlation of x and y: r = 0.251 
##   
## Hypothesis Test of 0 Correlation:  t = 16.397,  df = 3998,  p-value = 0.000 
## 95% Confidence Interval for Correlation:  0.222 to 0.280 
##

With large data sets, even for continuous variables, there can be much over-plotting of points. One strategy to address this issue smooths the scatterplot by setting the type parameter to "smooth". The individual points superimposed on the smoothed plot are potential outliers. The default number of plotted outliers is 100. Turn off the plotting of outliers completely by setting parameter smooth_points to 0. Show the linear trend with fit set to "lm".

XY(x, y, type="smooth", fit="lm")

## [Interactive chart from the Plotly R package (Sievert, 2020)] 
## >>> Note: x is not in a data frame (table)
## >>> Note: y is not in a data frame (table)

## 
## 
## >>> Suggestions  or  enter: style(suggest=FALSE)
## XY(x, y, enhance=TRUE)  # many options
## XY(x, y, fill="skyblue")  # interior fill color of points
## XY(x, y, MD_cut=6)  # Mahalanobis distance from center > 6 is an outlier 
## 
## 
## >>> Pearson's product-moment correlation 
##  
## Number of paired values with neither missing, n = 4000 
## Sample Correlation of x and y: r = 0.251 
##   
## Hypothesis Test of 0 Correlation:  t = 16.397,  df = 3998,  p-value = 0.000 
## 95% Confidence Interval for Correlation:  0.222 to 0.280 
##   
## 
##   Line: b0 = 1.03068757    b1 = 7.91963664
##   Linear Model MSE = 917.03180812   Rsq = 0.063
##

Another strategy for alleviating over-plotting makes the fill color mostly transparent with the transparency parameter, or turns the fill off completely by setting fill to "off". The closer the value of transparency is to 1, the more transparent the fill.

XY(x, y, transparency=0.95)

## [Interactive chart from the Plotly R package (Sievert, 2020)] 
## >>> Note: x is not in a data frame (table)
## >>> Note: y is not in a data frame (table)

## 
## >>> Suggestions  or  enter: style(suggest=FALSE)
## XY(x, y, enhance=TRUE)  # many options
## XY(x, y, color="red")  # exterior edge color of points
## XY(x, y, fit="lm", fit_se=c(.90,.99))  # fit line, stnd errors
## XY(x, y, MD_cut=6)  # Mahalanobis distance from center > 6 is an outlier 
## 
## 
## >>> Pearson's product-moment correlation 
##  
## Number of paired values with neither missing, n = 4000 
## Sample Correlation of x and y: r = 0.251 
##   
## Hypothesis Test of 0 Correlation:  t = 16.397,  df = 3998,  p-value = 0.000 
## 95% Confidence Interval for Correlation:  0.222 to 0.280 
##

Contour plots are another effective way to visualize scatterplots with much data. By default, the parameter contours_n is set at 10. XY() applies a threshold that excludes points from consideration when plotting the contour curves. Otherwise, if there are extreme outliers, the axes extend to their maximum and minimum values, typically resulting in much white space that surrounds the visible contour plot. The extreme values of outlier points with low density round down to zero on the color scale. The parameter contours_pad, with a default value of 0, adjusts the white space that pads the resulting contour curves. Increase the parameter value to add more padding to the plot.

XY(x, y, type="contour")

## [Interactive chart from the Plotly R package (Sievert, 2020)] 
## >>> Note: x is not in a data frame (table)
## >>> Note: y is not in a data frame (table)

## 
## >>> Suggestions  or  enter: style(suggest=FALSE)
## XY(x, y, enhance=TRUE)  # many options
## XY(x, y, color="red")  # exterior edge color of points
## XY(x, y, fit="lm", fit_se=c(.90,.99))  # fit line, stnd errors
## XY(x, y, out_cut=.10)  # label top 10% from center as outliers 
## 
## 
## >>> Pearson's product-moment correlation 
##  
## Number of paired values with neither missing, n = 4000 
## Sample Correlation of x and y: r = 0.251 
##   
## Hypothesis Test of 0 Correlation:  t = 16.397,  df = 3998,  p-value = 0.000 
## 95% Confidence Interval for Correlation:  0.222 to 0.280 
##

Another way to visualize a relationship when there are many data points is to bin the x-axis. Specify the number of bins with parameter n_bins. XY() then computes the mean of y for each bin and connects the means by line segments. By default, this procedure plots the conditional means without any assumption of form, such as linearity. Use the stat parameter to request the median of y for each bin instead. The standard XY() parameters fill, color, size and segments also apply.

XY(x, y, n_bins=5)

## [Interactive chart from the Plotly R package (Sievert, 2020)] 
## >>> Note: x is not in a data frame (table)
## >>> Note: y is not in a data frame (table)

## 
## Table: Summary Stats 
##  
##                x          y 
## -------  -------  --------- 
## n           4000       4000 
## n.miss         0          0 
## min       -3.239   -104.740 
## max        3.589    112.460 
## mean      -0.003      1.006 
## 
##  
## Table: mean of y for levels of x 
##  
##                   bin      n    midpt      mean 
## ---  ----------------  -----  -------  -------- 
## 1     [-3.246,-1.873]    116   -2.560   -16.734 
## 2     (-1.873,-0.508]   1090   -1.191    -5.699 
## 3      (-0.508,0.858]   2001    0.175     0.848 
## 4       (0.858,2.223]    743    1.541    12.374 
## 5       (2.223,3.596]     50    2.909    25.696
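The binning table above can be approximated in a few lines of base R. The sketch below regenerates the same simulated data and uses cut() to form five equal-width bins, then computes the count and the conditional mean and median of y within each bin. It is an illustration of the idea, not the code XY() runs, and small differences from the lessR table, such as how the interval endpoints are labeled, are possible.

```r
set.seed(13)
x <- rnorm(4000)
y <- 8 * x + rnorm(4000, 1, 30)

# Split the range of x into 5 equal-width bins
bins <- cut(x, breaks = 5)

# Count of points and conditional mean and median of y within each bin
binned <- data.frame(
  bin    = levels(bins),
  n      = as.vector(table(bins)),
  mean_y = as.vector(tapply(y, bins, mean)),
  med_y  = as.vector(tapply(y, bins, median))
)
print(binned)
```

Because each bin mean is computed separately, the resulting line of means can bend freely, which is what allows the plot to reveal non-linearity that a single regression line would hide.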

Cleveland Dot Plot

Create a Cleveland dot plot when one of the variables has unique (ID) values. In this example, for a single variable, row names are on the y-axis. By default, the plot is sorted by the plotted value in ascending order, according to the default value "+" of parameter sort. Set sort to "-" for a descending plot and to "0" for no sorting.

XY(Salary, row_names)

## [Interactive chart from the Plotly R package (Sievert, 2020)]

## 
## >>> Suggestions or enter: style(suggest=FALSE)
## XY(Salary, row_names, sort="0")  # do not sort y-axis variable by x-axis variable
## XY(Salary, row_names, segments_y=FALSE)  # drop the line segments
## XY(Salary, row_names, fill="red")  # red point interiors
##

The standard scatterplot version of a Cleveland dot plot follows, with no sorting and no line segments.

XY(Salary, row_names, sort="0", segments_y=FALSE)

## [Interactive chart from the Plotly R package (Sievert, 2020)]

## 
## >>> Suggestions or enter: style(suggest=FALSE)
## XY(Salary, row_names, segments_y=FALSE, sort="0", fill="red")  # red point interiors
##

This Cleveland dot plot has two x-variables, indicated as a standard R vector with the c() function. In this situation, the two points on each row are connected with a line segment. By default the rows are sorted by distance between the successive points.

XY(c(Pre, Post), row_names)

## [Interactive chart from the Plotly R package (Sievert, 2020)]

## 
## >>> Suggestions or enter: style(suggest=FALSE)
## XY(c(Pre, Post), row_names, sort="0")  # do not sort y-axis variable by x-axis variable
## XY(c(Pre, Post), row_names, segments_y=FALSE)  # drop the line segments
## XY(c(Pre, Post), row_names, fill="red")  # red point interiors
##  
## 
##  n  diff  Row 
## --------------------------- 
##  1 -4.0 Gvakharia, Kimberly 
##  2 -4.0 Downs, Deborah 
##  3 -3.0 Anderson, David 
##  4 -3.0 Correll, Trevon 
##  5 -3.0 Kralik, Laura 
##  6 -3.0 Jones, Alissa 
##  7 -2.0 Capelle, Adam 
##  8 -2.0 Stanley, Emma 
##  9 -2.0 Adib, Hassan 
## 10 -2.0 Skrotzki, Sara 
## 28  6.0 LaRoe, Maria 
## 29  7.0 Cassinelli, Anastis 
## 30  7.0 Hamide, Bita 
## 31  7.0 Sheppard, Cory 
## 32  8.0 Campagna, Justin 
## 33 10.0 Ritchie, Darnell 
## 34 12.0 Anastasiou, Crystal 
## 35 12.0 Wu, James 
## 36 13.0 Korhalkar, Jessica 
## 37 13.0 Cooper, Lindsay
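The sorted listing above orders the rows by the difference between the two plotted variables. A small base R sketch of that ordering follows, using made-up scores for five hypothetical people. It assumes the difference is the second variable minus the first, which the output suggests but does not state.

```r
# Hypothetical scores for five people (not the Employee data)
scores <- data.frame(
  Pre  = c(82, 62, 90, 70, 59),
  Post = c(92, 74, 86, 70, 71),
  row.names = c("A", "B", "C", "D", "E")
)

# Difference between the two plotted points on each row (assumed Post - Pre)
scores$diff <- scores$Post - scores$Pre

# Sort rows in ascending order of the difference
sorted <- scores[order(scores$diff), ]
print(sorted)
```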

Categorical and Continuous Variables

A mixture of categorical and continuous variables can be plotted a variety of ways, as illustrated below.

Two Continuous, One Categorical

Plot a scatterplot of two continuous variables for each level of a categorical variable on the same panel with the by parameter. Here, plot Years and Salary each for the two levels of Gender in the data. Colors and geometric plot shapes can distinguish between the plots. For all variables except an ordered factor, the default plots according to the default qualitative color palette, "hues", with the geometric shape of a point.

XY(Years, Salary, by=Gender)

## [Interactive chart from the Plotly R package (Sievert, 2020)]

## 
## >>> Suggestions  or  enter: style(suggest=FALSE)
## XY(Years, Salary, enhance=TRUE)  # many options
## XY(Years, Salary, color="red")  # exterior edge color of points
## XY(Years, Salary, fit="lm", fit_se=c(.90,.99))  # fit line, stnd errors
## XY(Years, Salary, out_cut=.10)  # label top 10% from center as outliers

Change the plot colors with the fill (interior) and color (exterior or edge) parameters. Because there are two levels of the by variable, specify two fill colors and two edge colors each with an R vector defined by the c() function. Also, include the regression line for each group with the fit parameter and increase the size of the plotted points with the size parameter.

XY(Years, Salary, by=Gender, size=2, fit="lm",
     fill=c(M="olivedrab3", W="gold1"), 
     color=c(M="darkgreen", W="gold4")
)

## [Interactive chart from the Plotly R package (Sievert, 2020)]

## 
## 
## >>> Suggestions  or  enter: style(suggest=FALSE)
## XY(Years, Salary, enhance=TRUE)  # many options
## XY(Years, Salary, MD_cut=6)  # Mahalanobis distance from center > 6 is an outlier 
## 
## Gender: M  Line: b0 = 40842.335    b1 = 4047.307
##   Linear Model MSE = 107,647,877.258   Rsq = 0.819
##  
## Gender: W  Line: b0 = 57109.787    b1 = 2882.272
##   Linear Model MSE = 144,700,624.695   Rsq = 0.598
##
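The per-group lines reported above amount to fitting a separate least-squares regression within each level of the by variable. The base R sketch below shows one way to do that, using the built-in mtcars data with transmission type as a stand-in grouping variable. The data, variable names, and the group_stats object are illustrative assumptions, not part of lessR.

```r
# Stand-in data: built-in mtcars, with transmission type as the grouping variable
dat <- mtcars
dat$am <- factor(dat$am, levels = 0:1, labels = c("automatic", "manual"))

# Fit a separate least-squares line within each level of the grouping variable
fits <- lapply(split(dat, dat$am), function(g) lm(mpg ~ wt, data = g))

# Intercept, slope, and R-squared for each group, one row per group
group_stats <- t(sapply(fits, function(f) {
  c(b0 = unname(coef(f)[1]),
    b1 = unname(coef(f)[2]),
    Rsq = summary(f)$r.squared)
}))
print(round(group_stats, 3))
```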

Change the plotted shapes with the shape parameter. The default value is "circle" with both an exterior color and filled interior, specified with "color" and "fill". The possible values with fillable interiors are "circle", "square", "diamond", "triup" (triangle up), and "tridown" (triangle down). Other possible values include all uppercase and lowercase letters, all digits, and most punctuation characters. The numbers 0 through 25 defined by the R points() function also apply. If plotting levels according to by, then list one shape for each level to be plotted.

Or, request default shapes across the different by groups by setting parameter shape to "vary".

XY(Years, Salary, by=Gender, shape="vary")

## [Interactive chart from the Plotly R package (Sievert, 2020)]

## 
## >>> Suggestions  or  enter: style(suggest=FALSE)
## XY(Years, Salary, enhance=TRUE)  # many options
## XY(Years, Salary, fill="skyblue")  # interior fill color of points
## XY(Years, Salary, fit="lm", fit_se=c(.90,.99))  # fit line, stnd errors
## XY(Years, Salary, MD_cut=6)  # Mahalanobis distance from center > 6 is an outlier

A Trellis (facet) plot creates a separate panel for the plot of each level of the categorical variable. Generate Trellis plots with the facet parameter. In this example, plot the best-fit linear model for the data in each panel according to the fit parameter. By default, the 95% confidence interval for each line is also displayed.

XY(Years, Salary, facet=Gender, fit="lm")

## [Interactive chart from the Plotly R package (Sievert, 2020)] 
## [Trellis (facet) graphics from Deepayan Sarkar's lattice package]

## 
## Regression analysis of linearized Salary values
## Need back transformation of regression model to compute predicted values
## 
## Gender 1  Line: b0 = 40842.335  b1 = 4047.307   Fit: MSE =    Rsq = 0.819
## 
## Gender 2  Line: b0 = 57109.787  b1 = 2882.272   Fit: MSE =    Rsq = 0.598
## 
## ---------- Summary Statistics for Years
## 
## Gender   n  Mean  Median  SD  IQR  Min  Max
##      M  17    12      13   5    5    5   24
##      W  19     7       6   5    6    1   18

Turn off the confidence interval by setting the parameter fit_se to 0 for the value of the confidence level.

One Continuous, One Categorical

A categorical variable plotted with a continuous variable results in a traditional scatterplot though, of course, the scatter is confined to the straight lines that represent the levels of the categorical variable, its values.

The first two parameters of XY() are x and y. In this example, the categorical variable, Dept, listed second, specifies the y variable, as in y=Dept. The function call is the same for two continuous variables as for one continuous and one categorical variable. The XY() function evaluates each variable for continuity and responds appropriately.

XY(Salary, Dept)

## [Interactive chart from the Plotly R package (Sievert, 2020)]

## 
## Salary: Annual Salary (USD) 
##   - by levels of - 
## Dept: Department Employed 
##  
##        n   miss         mean           sd          min          mdn          max 
## ACCT    5      0    71792.776    12774.606    56124.970    79547.600    82502.500 
## ADMN    6      0    91277.117    27585.151    63788.260    81058.595   132563.380 
## FINC    4      0    79010.675    17852.498    67139.900    71937.625   105027.550 
## MKTG    6      0    80257.128    19869.812    61036.850    71658.990   109062.660 
## SALE   15      0    88830.065    23476.839    59188.960    87714.850   144419.230 
##

To avoid point overlap, if there is at least one duplicated value of continuous y for any level of categorical x, some horizontal jitter is added by default to each plotted point. Jitter was not needed in this example. Manually adjust the jitter with the jitter_x parameter or, if x is continuous and y categorical, the jitter_y parameter. In addition, if the categorical variable is an R factor or a variable of type character, by default the mean of the continuous variable is displayed at each level of the categorical variable, as well as in the text output. If the categorical variable is numeric, it is better to convert the variable to a factor so that only the categories appear on the axis instead of a continuous scale. For example, d$Gender <- factor(d$Gender).
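As a sketch of that conversion applied to a numeric categorical variable, the Employee data lists Plan as an integer with three unique values. The example assumes lessR is installed and that Plan is meant to be read as categorical. It re-reads the data so that it runs on its own.

```r
library(lessR)
d <- Read("Employee")

# Plan is stored as an integer, so convert it to a factor before plotting
d$Plan <- factor(d$Plan)

# Salary is continuous and Plan is now categorical
XY(Salary, Plan)
```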

Another helpful technique for large data sets is to add some fill transparency with the transparency parameter, with values such as 0.8 and 0.9. The combination of jitter and transparency allows for plotting many thousands of points.
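A sketch of the syntax only, assuming lessR is installed. The Employee data set is far too small to need transparency, and the value 0.8 is simply one of the values mentioned above.

```r
library(lessR)
d <- Read("Employee")

# add fill transparency to the plotted points
XY(Years, Salary, transparency=0.8)
```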

Show the different distributions of the continuous variable across the levels of the categorical variable with a scatterplot. Here, show the distribution of Salary for Males and Females across the various departments.

XY(Salary, Dept, by=Gender)

## [Interactive chart from the Plotly R package (Sievert, 2020)]

## 
## Salary: Annual Salary (USD) 
##   - by levels of - 
## Dept: Department Employed 
##  
##        n   miss         mean           sd          min          mdn          max 
## ACCT    5      0    71792.776    12774.606    56124.970    79547.600    82502.500 
## ADMN    6      0    91277.117    27585.151    63788.260    81058.595   132563.380 
## FINC    4      0    79010.675    17852.498    67139.900    71937.625   105027.550 
## MKTG    6      0    80257.128    19869.812    61036.850    71658.990   109062.660 
## SALE   15      0    88830.065    23476.839    59188.960    87714.850   144419.230 
##

Two Continuous, Three Categorical

To illustrate, first, the data. Use the Cars93 data set that is installed with lessR, which describes characteristics of 1993 cars.

d <- Read("Cars93")

## 
## >>> Suggestions
## Recommended binary format for data files: feather
##   Create with Write(d, "your_file", format="feather")
## More details about your data, Enter:  details()  for d, or  details(name)
## 
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
## 
##       Variable                  Missing  Unique 
##           Name     Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1        Make character     93       0      32   Acura  Acura ... Volvo  Volvo
##  2        Type character     93       0       6   Small  Midsize ... Compact  Midsize
##  3    MinPrice    double     93       0      79   12.9  29.2  25.9 ... 22.9  21.8  24.8
##  4    MidPrice    double     93       0      81   15.9  33.9  29.1 ... 23.3  22.7  26.7
##  5    MaxPrice    double     93       0      79   18.8  38.7  32.3 ... 23.7  23.5  28.5
##  6     MPGcity   integer     93       0      21   25  18  20 ... 18  21  20
##  7    MPGhiway   integer     93       0      22   31  25  26 ... 25  28  28
##  8     Airbags   integer     93       0       3   0  2  1 ... 0  1  2
##  9  DriveTrain   integer     93       0       3   1  1  1 ... 1  0  1
## 10   Cylinders   integer     92       1       5   4  6  6 ... 6  4  5
## 11      Engine    double     93       0      26   1.8  3.2  2.8 ... 2.8  2.3  2.4
## 12          HP   integer     93       0      57   140  200  172 ... 178  114  168
## 13         RPM   integer     93       0      24   6300  5500  5500 ... 5800  5400  6200
## 14     RevMile   integer     93       0      78   2890  2335  2280 ... 2385  2215  2310
## 15      Manual   integer     93       0       2   1  1  1 ... 1  1  1
## 16     FuelCap    double     93       0      38   13.2  18  16.9 ... 18.5  15.8  19.3
## 17     PassCap   integer     93       0       6   5  5  5 ... 4  5  5
## 18      Length   integer     93       0      51   177  195  180 ... 159  190  184
## 19   Wheelbase   integer     93       0      27   102  115  102 ... 97  104  105
## 20       Width   integer     93       0      16   68  71  67 ... 66  67  69
## 21       Uturn   integer     93       0      14   37  38  37 ... 36  37  38
## 22    RearSeat    double     91       2      24   26.5  30  28 ... 26  29.5  30
## 23      LugCap   integer     82      11      16   11  15  14 ... 15  14  15
## 24      Weight   integer     93       0      81   2705  3560  3375 ... 2810  2985  3245
## 25      Source   integer     93       0       2   0  0  0 ... 0  0  0
## ------------------------------------------------------------------------------------------

Two of the categorical variables, Manual and Source, are integer coded 0 and 1, so recode them as R factors to obtain more descriptive labels. For consistency, also convert Cylinders, the number of cylinders for a car, to a factor. Only the levels 4, 6, and 8 are listed, so cars with any other number of cylinders become missing on this variable.

d$Trans <- factor(d$Manual, levels=0:1, labels=c("Auto", "Manual"))
d$Source <- factor(d$Source, levels=0:1, labels=c("Foreign", "Domestic"))
d$Cylinders <- factor(d$Cylinders, levels=c(4,6,8))

Two Continuous, Three Categorical

XY() can display the relationships for up to five variables. The two primary variables, x and y, that form the basis of the scatter plot, are continuous. Usually these two variables are listed first in the function call and so do not need their parameter names specified. Indicate two categorical variables that form the Trellis panels with parameter facet. Call these two variables the Trellis variables, which define a Trellis panel for each combination of their values. Finally, there can be a categorical grouping variable, the by variable, which plots different groups within each Trellis panel.

Plot MPGcity according to Weight. Specify Cylinders, the number of cylinders, and Trans, manual transmission or not, as the Trellis conditioning variables that form the Trellis plot. Specify the Source of the vehicle, Foreign or Domestic, as a grouping variable to plot with separate colors on each panel. Because room on the x-axis is limited, set n_axis_x_skip=2 to display only every third axis tick label and so avoid overlapping labels.

XY(Weight, MPGcity, by=Source, facet=c(Cylinders,Trans), n_axis_x_skip=2, n_row=2)

## [Interactive chart from the Plotly R package (Sievert, 2020)] 
## [Trellis (facet) graphics from Deepayan Sarkar's lattice package]

## 
## ---------- Summary Statistics for Weight
## 
##   Source   n  Mean  Median   SD  IQR   Min   Max
##  Foreign  39  2990    2970  538  940  2055  4100
## Domestic  48  3195    3282  565  934  1845  4105
## 
## Cylinders   n  Mean  Median   SD  IQR   Min   Max
##         4  49  2710    2705  373  520  1845  3785
##         6  31  3559    3515  265  255  2810  4105
##         8   7  3836    3935  244  210  3380  4055
## 
##  Trans   n  Mean  Median   SD  IQR   Min   Max
##   Auto  32  3569    3590  357  345  2880  4105
## Manual  55  2832    2785  471  585  1845  3805

Several patterns emerge from the visualization. As Weight increases, city MPG decreases. Domestic cars tend to weigh more. Foreign cars tend to have fewer cylinders, which also leads to better fuel mileage.

David Gerbing
