Here, we focus on lessR visualizations because they are easier to implement and more comprehensive than those provided by any other available R visualization system (and Python as well). The lessR visualization functions are more comprehensive because they also offer statistical output without having to invoke yet another function.
Histogram
No surprise here. If you have already done lessR bar charts then you already know how to do lessR histograms. Just replace the word BarChart with the word Histogram in the function call and specify a continuous variable for the analysis. Figure 1 presents the result.
Histogram(Salary)
Figure 1: lessR default histogram for Salary from the Employee data set.
As with the bar graph, a frequency distribution of a continuous variable can also be presented as a visualization or as a table. For this analysis, bin width is $10,000. The following frequency table displayed by Histogram() lists each bin with the corresponding Count, Proportion, Cumulative Count, and Cumulative Proportion.
Also provided are the summary statistics of the distribution.
--- Salary ---
     n   miss        mean          sd         min         mdn         max
    37      0   73795.557   21799.533   46124.970   69547.600  134419.230
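To make the columns of the frequency table concrete, here is a minimal base R sketch that builds the same kind of table by hand. It is not how Histogram() computes its output. The salary values, the starting edge of 40,000, and the left-closed bins are all assumptions chosen only for illustration.

```r
# Hypothetical salaries in USD (not the Employee data set)
salary <- c(46125, 52000, 58500, 61000, 63750, 69500,
            72000, 77000, 84250, 95000, 134419)

bin_width <- 10000
# Bin edges from an assumed start of 40,000 up past the largest value
breaks <- seq(40000, 140000, by = bin_width)

# Assign each value to a bin; right = FALSE gives bins such as [40000,50000)
bins <- cut(salary, breaks = breaks, right = FALSE, dig.lab = 6)

count <- as.vector(table(bins))
freq <- data.frame(
  Bin      = levels(bins),
  Count    = count,
  Prop     = round(count / sum(count), 3),
  CumCount = cumsum(count),
  CumProp  = round(cumsum(count) / sum(count), 3)
)
print(freq)
```

The last row of CumCount equals the sample size and the last row of CumProp equals 1, which is a quick check that no value fell outside the bins.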
An outlier analysis should always be done for each variable in an analysis. Since it should always be done, my lessR Histogram() function does that analysis by default. It is the same outlier analysis that the box plot provides, but without the box plot visualization.
--- Outliers --- from the box plot: 1
Small      Large
-----      -----
           134419.2
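For readers who want to see a box plot outlier analysis outside of lessR, the following sketch uses the base R function boxplot.stats(), which flags values more than 1.5 interquartile ranges beyond the hinges. Whether lessR applies exactly the same rule is an assumption here, and the salary values are hypothetical.

```r
# Hypothetical salaries in USD (not the Employee data set)
salary <- c(46125, 52000, 58500, 61000, 63750, 69500,
            72000, 77000, 84250, 95000, 134419)

# boxplot.stats() returns the five numbers behind the box plot and the
# values beyond the whiskers (coef = 1.5 is the base R default)
bs <- boxplot.stats(salary, coef = 1.5)

five_numbers <- bs$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
outliers     <- bs$out     # values flagged as outliers

print(five_numbers)
print(outliers)
```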
Bin Width and Start
Any data visualization system will provide a default histogram with default values for the bin width and bin starting value that it computes for you. To override these defaults, explicitly specify the bin width and bin starting value with the corresponding parameters.
bin_width and bin_start
Histogram() parameters for specifying the width of the bins and the starting point of the bins.
After you create a histogram, such as with the default bin_width and bin_start values computed for you, experiment with at least one or two other bin widths.
Histogram(Salary, bin_width=13000)
Figure 2: lessR histogram for Salary from the Employee data set with a customized bin width of $13,000.
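To see how a bin start and a bin width together determine the bins, here is a small base R sketch that builds the bin edges by hand and counts the values in each bin with hist(). The variable names bin_start and bin_width mirror the lessR parameter names, but the code is a base R analogy on hypothetical data, not the lessR implementation.

```r
# Hypothetical salaries in USD (not the Employee data set)
salary <- c(46125, 52000, 58500, 61000, 63750, 69500,
            72000, 77000, 84250, 95000, 134419)

bin_start <- 35000   # assumed left edge of the first bin
bin_width <- 13000

# Number of bins needed to reach the largest value, then the bin edges
n_bins <- ceiling((max(salary) - bin_start) / bin_width)
breaks <- bin_start + bin_width * (0:n_bins)

# plot = FALSE returns the counts without drawing the histogram
h <- hist(salary, breaks = breaks, plot = FALSE)

bins <- data.frame(lower = head(h$breaks, -1),
                   upper = tail(h$breaks, -1),
                   count = h$counts)
print(bins)
```

Changing either value shifts which data values share a bin, which is why trying a few alternatives can change the apparent shape of the distribution.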
Histogram Scale
To re-scale a variable, that is, to transform its scale, you can do the transformation in Excel if the data are stored there. However, the R implementation is straightforward: simply enter the equation that defines the transformation into the R console. The primary “gotcha” here is that the variable’s reference needs to include the name of its containing data frame.
R Reference for a Variable
data_frame_name$variable_name
An example follows for the variable Salary in the d data frame.
You can have as many active data frames as your computer’s memory can accommodate. Each data frame can contain a variable of the same name. So, in many functions, specify the data frame that contains each variable with the $ notation. However, some analysis functions, such as my lessR functions, use the data parameter to specify the data frame that contains the relevant variables.
d$Salary <- d$Salary / 1000
The resulting histogram in Figure 3 is expressed in units of thousands of dollars. The xlab parameter specifies the label on the x-axis.
Histogram(Salary, xlab="Salary (USD in thousands)")
Figure 3: Histogram with re-scaled units for improved readability.
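The following sketch illustrates why the data frame name matters in the variable reference. Two small hypothetical data frames each contain a variable named Salary, and only the one named on the left and right of the assignment is re-scaled. The data frame named other and all values are made up for illustration.

```r
# Two hypothetical data frames that both contain a variable named Salary
d     <- data.frame(Name = c("A", "B", "C"),
                    Salary = c(46125, 69500, 134419))
other <- data.frame(Salary = c(1, 2, 3))

# The $ notation identifies which Salary is transformed
d$Salary <- d$Salary / 1000

print(d$Salary)       # now in thousands of dollars
print(other$Salary)   # untouched
```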
Density Curve
To obtain the density curve superimposed upon the histogram, set the density parameter to TRUE, as shown in Figure 4.
Histogram(Salary, density=TRUE)
Figure 4: Density plot of Salary.
The text output to the R console includes the calculated setting of bandwidth used to generate the density plot. To customize bandwidth, set the bandwidth parameter. Set bandwidth higher for a smoother plot and lower for a wavier, more irregular plot. Figure 5 illustrates.
Histogram(Salary, density=TRUE, bandwidth=15000)
Figure 5: Smoothness of the density curve customized with the bandwidth parameter.
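The effect of bandwidth can also be checked numerically. This sketch uses the base R density() function, not lessR, on hypothetical data, and compares two bandwidths by the total up-and-down movement of each estimated curve. The helper total_variation() is a name introduced here only for illustration.

```r
# Hypothetical salaries in USD (not the Employee data set)
salary <- c(46125, 52000, 58500, 61000, 63750, 69500,
            72000, 77000, 84250, 95000, 134419)

# Same data, two Gaussian kernel bandwidths expressed in the units of the data
wavy   <- density(salary, bw = 5000)
smooth <- density(salary, bw = 15000)

# Total vertical movement along the curve: larger means more irregular
total_variation <- function(dens) sum(abs(diff(dens$y)))

tv_wavy   <- total_variation(wavy)
tv_smooth <- total_variation(smooth)

cat("bw =  5000, total variation:", tv_wavy, "\n")
cat("bw = 15000, total variation:", tv_smooth, "\n")
```

The larger bandwidth spreads each data value over a wider range, so the resulting curve rises and falls less.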
Box Plot and Related
As you can likely guess, you can create a box plot with the BoxPlot() function. The dimensions of the box plot, or any other visualization generated with R in RStudio, are easily controlled by adjusting the size of the Plots window pane or by specifying the dimensions when exporting the visualization from the Export tab. Figure 6 shows a narrow box plot because the vertical height of a horizontally presented box plot is arbitrary.
BoxPlot(Salary)
Figure 6: Boxplot of Salary from the Employee data set.
Although that function call works just fine, BoxPlot() is actually an alias, a stand-in, for the Plot() function with a specific parameter setting. That parameter is vbs_plot with values that are some combination of v, b, and s. The default value of vbs_plot is "vbs", which means create the full Violin/Box/Scatterplot. To call Plot() directly to create only a box plot, set that parameter to "b".
The following function call produces the same box plot as shown in Figure 6. The variable for the analysis needs to be continuous.
Plot(Salary, vbs_plot="b")
The text output includes a variety of summary statistics, including the statistics used to create the box plot, the quartiles and IQR. Also provided are some parameter values that were used to create the visualization, which the user can adjust, as shown in the next section.
One-Variable Scatterplot
Use the Plot() function to create a one-variable scatterplot, shown in Figure 7. Set the vbs_plot parameter to the value of s.
Plot(Salary, vbs_plot="s")
Figure 7: One-dimensional scatterplot for Salary from the Employee data set.
By default, the plotted values are jittered vertically approximately as much as needed, an amount that can then be customized. The values of the jitter parameters used to create the scatterplot are listed as part of the output at the R console. The relevant parameters are jitter_x for horizontal jitter and jitter_y for vertical jitter.
Parameter values (can be manually set)
-------------------------------------------------------
size:       0.61      size of plotted points
out_size:   0.82      size of plotted outlier points
jitter_y:   0.45      random vertical movement of points
jitter_x:   0.00      random horizontal movement of points
bw:         9529.04   set bandwidth higher for smoother edges
The following function call customizes vertical jitter by setting jitter_y to 6 from its original value of 0.45. Figure 8 illustrates.
Plot(Salary, vbs_plot="s", jitter_y=6)
Figure 8: One-dimensional scatterplot for Salary from the Employee data set with customized vertical jitter.
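To illustrate the idea behind vertical jitter, here is a minimal base R sketch that gives each point a random vertical offset so that points with similar values no longer overlap. How lessR scales its jitter_y value internally is not shown in this section, so treating the value as the half-range of a uniform random offset is an assumption, as are the helper name jitter_points() and the data.

```r
set.seed(42)  # make the random offsets reproducible

# Hypothetical salaries in USD (not the Employee data set)
salary <- c(46125, 52000, 58500, 61000, 63750, 69500,
            72000, 77000, 84250, 95000, 134419)

# Random vertical offsets drawn uniformly from [-amount, amount]
jitter_points <- function(n, amount) runif(n, min = -amount, max = amount)

y_small <- jitter_points(length(salary), amount = 0.45)
y_large <- jitter_points(length(salary), amount = 6)

# Draw the one-variable scatterplot only in an interactive session
if (interactive()) {
  plot(salary, y_large, ylim = c(-6, 6), yaxt = "n", ylab = "", pch = 16)
}
```

The horizontal positions, the actual data values, are unchanged. Only the arbitrary vertical positions move.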
Box Plot and Scatterplot
Create a box plot with the plotted points from which the box plot is computed. Exclude the violin plot simply by not including the v in the specification of vbs_plot. The result appears in Figure 9.
Plot(Salary, vbs_plot="bs")
Figure 9: Box plot of Salary from the Employee data set with the plotted individual data values.
If only the box plot is specified, the outlier points are still plotted.
Outliers
The lessR boxplot analysis detects both the moderate or potential outliers and the actual outliers. The potential outliers are displayed with a relatively dark red dot. Identify the actual outliers by a stronger red dot.
To illustrate, here we create a strong outlier. Consider the fourth data value for the Salary variable. Indicate a specific value by enclosing its position in the Salary variable within square brackets. Also indicate that Salary is in the d data frame.
d$Salary[4]
[1] 111074.9
Now, change its value from $111,074.9 to $190,000. To do so, apply the usual R assignment operator, <-.
The <- notation has the advantage of showing the direction of the assignment, here from right to left. In general, you can also use the regular equals sign, =, in place of <-, but then some meaning is lost.
d$Salary[4] <- 190000
The resulting box plot in Figure 10 now displays a potential outlier and an actual outlier, the value of $190,000.
BoxPlot(Salary)
Figure 10: A potential and actual outlier flagged by the box plot visualization.
As described, the strong outlier is plotted with a more intense color of red. The potential outlier is visualized with a darker shade of red.
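The distinction between potential and actual outliers can be sketched in base R. The cutoffs used here follow a common box plot convention, more than 1.5 interquartile ranges beyond the quartiles for a potential outlier and more than 3 for an actual outlier. That lessR uses exactly these cutoffs is an assumption, and the data and the helper classify_outliers() are hypothetical.

```r
# Hypothetical salaries in USD (not the Employee data set)
d <- data.frame(Salary = c(46125, 52000, 58500, 61000, 63750, 69500,
                           72000, 77000, 84250, 95000, 134419))

# Replace the fourth value to create a strong outlier
d$Salary[4] <- 190000

# Label each value by its distance beyond the quartiles, in IQR units
classify_outliers <- function(x, inner = 1.5, outer = 3) {
  q   <- unname(quantile(x, c(0.25, 0.75)))
  iqr <- q[2] - q[1]
  dist <- pmax(q[1] - x, x - q[2], 0) / iqr
  cut(dist, breaks = c(-Inf, inner, outer, Inf),
      labels = c("none", "potential", "actual"))
}

d$Outlier <- classify_outliers(d$Salary)
print(d[d$Outlier != "none", ])
```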
VBS Plot
The VBS plot (Gerbing, 2020) integrates three streamlined visualizations of the distribution of a continuous variable: violin, box, and scatter plots. The default visualization for the lessR Plot() function of a continuous variable is the full VBS plot, shown in Figure 11. Creating the plot only requires specifying a continuous variable to the Plot() function.
Plot(Salary)
Figure 11: VBS plot of Salary from the Employee data set.
This combination of violin/box/scatterplot is understood and applied in the data analysis community. That said, other systems require customization of several parameters to obtain the optimal plot. Instead, Plot() tunes those parameters in the background, saving you the programming.
To illustrate, generate 1000 normal curve simulated data values. Plot() automatically adjusts the size of the plotted points to accommodate the relatively large number of plotted points, as shown in Figure 12.
y <- rnorm(n=1000, mean=0, sd=1)
Plot(y)
Figure 12: Plot of 1000 simulated normal curve values.
The detection of outliers at the two extremes of the normal curve provides an excellent example of how outliers can occur when generated from the same process as the remaining data values. Weird things happen from time to time. Occasionally, you flip a fair coin 10 times and get exactly nine heads (probability of 0.00977, or almost 1%). That is another reason you need to use subjective judgment to evaluate whether an outlier is:
a coding error
from a different process than the other data values
a weird result from the same process
All the data in Figure 12 are properly sampled from the same normal distribution. The detected outliers represent a weird result from the same process, expected for such a large sample, here with 1000 simulated normally distributed data values.
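Both probabilities in this discussion can be checked with a few lines of base R. The sketch below computes the chance of exactly nine heads in 10 flips of a fair coin, and the expected number of values in a normal sample of 1000 that fall beyond the usual box plot fences of 1.5 interquartile ranges past the quartiles. That lessR flags outliers with this same 1.5 rule is an assumption.

```r
# Probability of exactly 9 heads in 10 flips of a fair coin: 10 / 2^10
p_nine <- dbinom(9, size = 10, prob = 0.5)

# Quartiles and upper fence of the standard normal distribution
q1 <- qnorm(0.25)
q3 <- qnorm(0.75)
upper_fence <- q3 + 1.5 * (q3 - q1)

# Probability that a normal value lies beyond either fence (by symmetry)
p_out <- 2 * pnorm(upper_fence, lower.tail = FALSE)

# Expected count of flagged values in a sample of 1000
expected_outliers <- 1000 * p_out

cat("P(9 heads in 10 flips):     ", round(p_nine, 5), "\n")
cat("P(beyond a fence):          ", round(p_out, 5), "\n")
cat("Expected outliers in 1000:  ", round(expected_outliers, 1), "\n")
```

Under these assumptions, a handful of flagged values in 1000 properly sampled normal values is the expected result, not a sign of bad data.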
Trellis Plots
To compare distributions across groups with Trellis plots, use parameter by1 to define the first conditioning variable, the categorical variable for which the plots will be created for each level of the variable. The result is shown in Figure 13 for categorical variable Gender.
Plot(Salary, by1=Gender)
Figure 13: VBS plot of Salary from the Employee data set for Men and Women.
The Trellis plot can be created for any combination of the violin, box, and scatterplot. Figure 14 shows the Trellis plot for Gender for the box plot with plotted points by also specifying the vbs_plot parameter.
Plot(Salary, by1=Gender, vbs_plot="bs")
Figure 14: Trellis plot of Salary as box plots with the plotted points.
Use parameter by2 to define a second conditioning variable. See Figure 15. Here, Plan has three levels: 1, 2, and 3.
Plot(Salary, by1=Gender, by2=Plan)
Figure 15: VBS plot of Salary from the Employee data set across combinations of Men and Women and health Plan.
With only 37 data values spread across six groups, no group has an adequate sample size. A larger sample would be more practical for a Trellis plot with two conditioning variables.
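Before creating a Trellis plot with two conditioning variables, it helps to count how many data values fall into each combination of levels. The following base R sketch does so with table() and summarizes each cell with tapply(). The small data frame, including the Gender codes and Salary values, is made up for illustration and is not the Employee data set.

```r
# Hypothetical data: 12 employees, 2 genders, 3 health plans
d <- data.frame(
  Gender = c("M", "W", "M", "W", "M", "W", "M", "W", "M", "W", "M", "W"),
  Plan   = c(1, 1, 2, 2, 3, 3, 1, 2, 3, 1, 2, 3),
  Salary = c(52, 61, 70, 48, 95, 66, 58, 73, 81, 55, 64, 90)
)

# Number of data values in each Gender by Plan cell
cell_n <- table(d$Gender, d$Plan)

# Median Salary in each cell
cell_median <- tapply(d$Salary, list(d$Gender, d$Plan), median)

print(cell_n)
print(cell_median)
```

Cells with only a few values, as in this toy example, yield violin and box plots that say little about the underlying distribution.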