Histogram

Data Values into Bins

From lessR version 4.4.1.

In contrast to the relatively few unique values of a categorical variable, a continuous variable has many potential values. How many potential values? Generally, there are too many unique data values to plot individually on a single graph. Consider annual salary, where every single value to the nearest penny must be considered from about $20,000 to $500,000 or so. Because there are so many potential data values, most of the possible values never occur in any one data set. For example, a specific annual salary of $83,924.79 would likely not occur unless the sample size were extremely large, and even then it would be unlikely to occur to the nearest penny.

What to do with all the unique data values that cannot be individually plotted?

Bins

Sequence of adjacent intervals, each generally the same size, which forms groups of values of a continuous variable.

Partition the range of values into bins, sometimes called classes. Figure 1 presents an example of a bin that contains values from $50,000 to $60,000 for annual salaries.

Figure 1: An example of a bin defined over the range of data values from $50,000 up to $60,000.

Each bin contains similar data values, defined by a lower and upper boundary, which specifies the width of the bin. In Figure 1, bin width is $10,000, and the midpoint is $55,000.

To evaluate the distribution of a continuous variable, first define the bins and then place each data value into its respective bin, as illustrated in Figure 2. For example, assign an annual salary of $57,358 to the bin $50,000 to $60,000.

Figure 2: The assignment of data value $57538 to the bin $50,000 to $60,000.

Consistently assign values precisely equal to a bin boundary to either the adjacent lower bin or the adjacent higher bin. Each computer application that generates histograms defaults to one of those choices.
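The boundary rule can be sketched in base R with cut(), which is separate from lessR. The bin edges and the two salaries below are illustrative assumptions chosen to match the bin in Figure 1. The right argument selects which adjacent bin receives a value that falls exactly on a boundary.

```r
# Illustrative bin edges from $40,000 to $70,000 in steps of $10,000 (assumed)
breaks <- seq(40000, 70000, by = 10000)

# One salary inside a bin, and one salary exactly on a bin boundary
salaries <- c(57358, 60000)

# right = FALSE: bins are [lower, upper), so 60000 goes to the higher bin
bin_low_closed <- cut(salaries, breaks = breaks, right = FALSE, labels = FALSE)

# right = TRUE: bins are (lower, upper], so 60000 goes to the lower bin
bin_high_closed <- cut(salaries, breaks = breaks, right = TRUE, labels = FALSE)

print(bin_low_closed)   # index of the bin for each salary under the first rule
print(bin_high_closed)  # index of the bin for each salary under the second rule
```

Either rule is acceptable as long as it is applied consistently across all bins.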

Example

The most typical graphical display of the variation of a continuous variable is the histogram.

Histogram

The visualization of a continuous variable with values distributed across a set of adjacent bins, usually of the same width, plotted on the horizontal, x-axis, with the count of the number of data values in each bin plotted on the vertical, y-axis.

Like the bar graph, the histogram consists of a set of bars. However, instead of associating a single data value with a number, the histogram associates each bin with a number, the corresponding frequency or count. Another distinction is that the adjacent bars of a histogram share a common side to indicate the underlying continuity of a continuous variable.

For example, consider the histogram shown in Figure 3 of the annual salaries of 37 different employees from the Employee data table included with lessR.

lessR Histogram

Histogram(Salary)

Unless you specify quiet=TRUE in the function call, the following text output also results.

The resulting frequency table lists each bin with the corresponding Count, Proportion, Cumulative Count, and Cumulative Proportion.

             Bin  Midpnt  Count    Prop  Cumul.c  Cumul.p 
--------------------------------------------------------- 
  40000 >  50000   45000      4    0.11        4     0.11 
  50000 >  60000   55000      8    0.22       12     0.32 
  60000 >  70000   65000      8    0.22       20     0.54 
  70000 >  80000   75000      5    0.14       25     0.68 
  80000 >  90000   85000      3    0.08       28     0.76 
  90000 > 100000   95000      5    0.14       33     0.89 
 100000 > 110000  105000      1    0.03       34     0.92 
 110000 > 120000  115000      1    0.03       35     0.95 
 120000 > 130000  125000      1    0.03       36     0.97 
 130000 > 140000  135000      1    0.03       37     1.00
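
As a rough check on how the columns relate, the proportions and cumulative columns can be recomputed in base R from the counts listed in the table above. This is only a sketch of the arithmetic, not the code that lessR runs internally; the column names simply mirror the printed output.

```r
# Counts copied from the frequency table above
count <- c(4, 8, 8, 5, 3, 5, 1, 1, 1, 1)

# Lower bin boundaries; each bin in the table is 10000 wide
lower <- seq(40000, 130000, by = 10000)

freq <- data.frame(
  Lower   = lower,
  Upper   = lower + 10000,
  Midpnt  = lower + 5000,                          # midpoint of each bin
  Count   = count,
  Prop    = round(count / sum(count), 2),          # share of all 37 values
  Cumul.c = cumsum(count),                         # running total of counts
  Cumul.p = round(cumsum(count) / sum(count), 2)   # running share
)
print(freq)
```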

Also provided are the summary statistics of the distribution.

--- Salary ---  
     n   miss         mean           sd          min          mdn          max 
     37      0    73795.557    21799.533    46124.970    69547.600   134419.230

An outlier analysis should always be done for each variable in an analysis, so the function does that analysis by default. The analysis is the same as the outlier analysis of the box plot, but without that visualization.

--- Outliers ---     from the box plot: 1 
 
Small      Large 
-----      ----- 
            134419.2

Figure 3: A histogram for the variable Salary.

The width of the bins for this histogram is 8000. Here we see, for example, that there is only one person who receives a salary between $40,000 and $48,000. Because the data values are organized by bins, from the histogram alone we do not know the precise salary of that person, just that it is somewhere between $40,000 and $48,000.

Artifacts and Issues

Optimal Bin Width

The choice of bins is critical for discerning the overall shape of a distribution. Yet, this choice often relies upon potentially subjective decisions made by the analyst.

No one correct histogram

Unlike many statistical analyses, there is no one correct, best histogram for a given data set, so we run several histograms with different bin widths.

Although there is not just one correct answer for choosing the bin width for a given data set, some choices provide more clarity than others to discern the basic shape of the distribution. The default histograms generated by analysis systems presume a default bin width. However, these initial histograms may need adjusting because the underlying algorithms do not always identify an optimal bin width. Through a process of trial and error, select a bin width that displays as much detail as possible for the available sample size without excessive random noise.

What guidance do we have for choosing a bin width? Although virtually any distribution shape is possible, most distributions of a continuous variable encountered in practice tend to be continuously increasing or decreasing in value, or like the normal or “bell” shaped curve, with increasing frequencies up to a value and then decreasing frequencies after that value (or vice versa). In particular, most real-life distributions, as opposed to mathematical oddities dreamed up by mathematicians, are not characterized by a pattern of up and down zigzags.

As shown in the following examples, random ups and downs of a histogram’s bars usually indicate sampling error resulting from a bin width that is too small. Yes, that problem can be corrected by increasing bin width, but only to a certain extent. Bins that are too large obscure information, not leveraging all the information in the data.

Bin Width

One artifact involves bin width.

Bin width artifact

Changing the width of the bins not only changes their width, it can also change the overall shape of the histogram.

Choosing the optimal bin width depends on several factors with no one precise best answer. Still, some choices are better than others. One problem is bins that are too small.

A bin width can be too small relative to the amount of available data, which yields an undersmoothed histogram.

Undersmoothed histogram

Too many bins result in too much detail that displays random noise.

The result is that the undersmoothed histogram indicates random ups and downs that would not reproduce if a new set of comparable data were collected and analyzed. The previously displayed histogram in Figure 3 is somewhat undersmoothed. To clarify that characteristic, consider the frequency polygon of the same data.

Frequency polygon

Similar to a histogram but, instead of bars, plots a point at what would be the midpoint of each bar and connects each pair of adjacent points with a line segment.

Figure 4 shows the frequency polygon that corresponds to the histogram in Figure 3.

lessR Frequency polygon

Plot(Salary, stat_x="count")

Use the Plot() function because we are plotting points, which are then, by default, connected with line segments.

stat_x: Specify a value of "count" to mimic the action of a histogram, which computes counts.

Note: Can also control the bin width as shown shortly. The following frequency polygon is based on a bin width of 8000.

Figure 4: Frequency polygon of the Salary data that displays the irregularities of the estimated distribution shape.
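
The idea behind a frequency polygon can also be sketched in base R. The sketch below uses the midpoints and counts from the frequency table listed earlier, which has bins of width 10,000 rather than the 8000 of Figure 4, so it will not match the figure exactly. The count of direction changes is an informal, assumed measure of irregularity, not a lessR statistic.

```r
# Midpoints and counts taken from the frequency table of Salary
midpoint <- seq(45000, 135000, by = 10000)
count <- c(4, 8, 8, 5, 3, 5, 1, 1, 1, 1)

# Direction of each step between adjacent bins: +1 up, -1 down (flat steps dropped)
steps <- sign(diff(count))
steps <- steps[steps != 0]

# Number of changes of direction along the polygon
n_turns <- sum(diff(steps) != 0)
print(n_turns)

# Draw the polygon only in an interactive session
if (interactive()) {
  plot(midpoint, count, type = "b", xlab = "Salary", ylab = "Count")
}
```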

Could those irregularities characteristic of that histogram or its associated frequency polygon be real? Could those irregularities consistently repeat in other samples of data from the same population? The answer is yes, those zigzags could be an actual characteristic of the underlying population. However, particularly for such a small sample of 37 employees, repeatability of those irregular fluctuations is very unlikely. Instead, the underlying true distributional shape is likely much smoother.

To be more dramatic, we can decrease the bin width even more, as in Figure 5.

lessR Custom bin width

Histogram(Salary, bin_width=5000)

bin_width: Specify the width of the bins to provide for custom values.

These undersmoothed histograms reflect too much random sampling variability – too many random ups and downs – relative to the likely much smoother true shape of the underlying distribution. The too narrow bin width obscures the actual shape of the underlying distribution. Imagine trying to interpret this histogram, having to continually repeat statements such as: Salaries just under $80,000 tend to be not encountered so much, but then over $80,000 there is a rise in the number of salaries followed by a quick decline in the number with salaries a little more than $85,000. True enough for this particular sample but irrelevant regarding the population as a whole, which means also for any other sample.

In practice, when encountering undersmoothed histograms, increase bin width and experiment. The following histogram in Figure 6, with a bin width of 13,000, is an excellent candidate for the final histogram.

Figure 6: Histogram with a bin width of $13,000 that could be the histogram accepted as the best display of the underlying distribution.

To further illustrate the more smoothed estimation of the underlying distribution, consider the corresponding frequency polygon in Figure 7.

Figure 7: Frequency polygon that estimates a distribution free of irregular ups and downs.

Interpretation. The histogram in Figure 6 indicates a distribution with a relatively small number of employees at salaries that begin around $40,000. The frequency of employees then increases up to about $80,000, after which the number of employees diminishes rapidly for larger salaries.

This histogram shows the same general form as the histogram in Figure 3, with a bin width of $8000. The improvement is that the revised histogram with a bin width of $13,000 reveals the shape of the distribution without the irregular fluctuations.

Just as with an undersmoothed histogram that obscures the underlying distribution, a bin width that is too wide also does not properly display the shape of the underlying distribution.

Oversmoothed histogram

Bins that are too large obscure details of the underlying distribution.

Oversmoothed histograms do not utilize the data efficiently. They bury some of the information inherent in the data.

The following oversmoothed histogram in Figure 8 illustrates the loss of information.

For this histogram, the bin width is $25,000. If we only had this histogram by which to describe the underlying distribution, we would misleadingly conclude that the largest number of salaries is between $40,000 and $65,000 and that the number of people with higher salaries steadily declines. True, but from the more optimal histogram in Figure 6, we know that the shape of the distribution is more nuanced: the number of people first increases as salary increases from the lowest end of the salary spectrum, and then decreases.

To summarize, in general the default bin width given by a computer algorithm for a specific histogram function is usually more or less decent. However, this initial bin width can often be improved by experimenting with different bin widths. The goal is to display as much information as possible given the size of the sample from which the histogram is estimated.
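
The trial-and-error process can be sketched in base R with simulated data. The simulated values, the helper names bin_counts() and n_turns(), and the idea of counting direction changes as a rough gauge of undersmoothing are all assumptions for illustration, not part of lessR or of the Employee data.

```r
set.seed(123)  # reproducible simulated data
salary_sim <- rnorm(37, mean = 74000, sd = 22000)  # hypothetical salaries

# Counts per bin for a given bin width, with bins aligned to multiples of the width
bin_counts <- function(x, width) {
  lo <- floor(min(x) / width) * width
  hi <- ceiling(max(x) / width) * width
  if (hi == lo) hi <- lo + width
  hist(x, breaks = seq(lo, hi, by = width), right = FALSE, plot = FALSE)$counts
}

# Number of changes of direction (up to down, or down to up) across the bars
n_turns <- function(counts) {
  steps <- sign(diff(counts))
  steps <- steps[steps != 0]
  if (length(steps) < 2) return(0)
  sum(diff(steps) != 0)
}

# Compare several candidate bin widths
for (w in c(5000, 8000, 13000, 25000)) {
  counts <- bin_counts(salary_sim, w)
  cat("width", w, " bins", length(counts), " turns", n_turns(counts), "\n")
}
```

Many turns suggest an undersmoothed histogram, while very few bins suggest an oversmoothed one. The final choice still rests on the analyst's judgment.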

Bin Shift

Another example of the potential arbitrariness of a histogram is the bin shift artifact.

Bin shift artifact

Change the starting point of a histogram, and the shape of the histogram likely changes.

Particularly in small samples, the shape of the histogram may be overly dependent on the starting position of the first bin. A histogram of the same data with the same bin width can substantively change shape simply by changing the starting point of the first bin.

To illustrate the impact of bin shift, consider a sample of 15 shipment times, assessed as the number of days from the time of the order until the shipment is received, shown in Figure 9.

Figure 9: Default histogram of ship time, the first bin starts at 5.

The default starting point of the histogram in Figure 9 is 5. The histogram indicates steadily increasing frequencies for longer ship times. To revise the histogram, retain the bin width of 1 but instead start the first bin at 4.5, as shown in Figure 10.

lessR Custom bin start

Histogram(ShipTime, bin_start=4.5)

bin_start: Specify the starting value of the bins to provide for custom values.

Note: The data are available at: https://web.pdx.edu/~gerbing/data/shiptime.csv

Figure 10: Histogram of ship time with specified bin start of 4.5.

Compare the histograms in Figure 9 and Figure 10 of ship time, which have the same bin width but with start values that differ by 0.5. The resulting histograms differ to the extent that, at first glance, they appear to describe different data sets. The reason for this change is that for this small sample size there are relatively large gaps between some of the adjacent data values. The continuity of the possible underlying values is not well represented by the relatively few values in the data set. The one-dimensional scatterplot, explained later, is usually a better alternative to visualize the distribution of data values in a small data set.
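
The bin shift artifact can be reproduced in a small base R sketch. The 15 ship times below are made-up values, not the data at the URL above, and counts_from() is a hypothetical helper. The bin width stays at 1 and only the starting point changes.

```r
# Hypothetical ship times in days (15 values, invented for illustration)
ship_days <- c(5.7, 6.2, 6.8, 7.1, 7.3, 7.6, 7.9, 8.1, 8.2, 8.4,
               8.6, 8.7, 8.8, 8.9, 9.3)

# Counts per bin for a given starting point and bin width
counts_from <- function(x, start, width = 1) {
  n_bins <- ceiling((max(x) - start) / width)
  breaks <- start + width * (0:n_bins)
  hist(x, breaks = breaks, right = FALSE, plot = FALSE)$counts
}

counts_start_5   <- counts_from(ship_days, start = 5)    # bins 5-6, 6-7, ...
counts_start_4_5 <- counts_from(ship_days, start = 4.5)  # bins 4.5-5.5, 5.5-6.5, ...

print(counts_start_5)
print(counts_start_4_5)
```

With these invented values, the first set of counts rises to a single sharp peak while the second levels off across its last two bins, even though the data and the bin width are unchanged.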

Histogram Scale

For any type of visualization plotted with axes, readability suffers when the values displayed on an axis include too many zeros. To illustrate, consider a histogram of the data values of the variable Income. The histogram is shown in Figure 11.

Figure 11: Histogram with too many zeros in the numbers on the x-axis to efficiently read.

After dividing the values of the variable Income by 1000, its values now range between 0 and 600 instead of 0 and 600,000. The rescaling of the variable should be noted, such as with a new variable label on the horizontal or x-axis for the histogram, shown in Figure 12.

R Variable transformation

To transform the scale of a variable, that is, to re-scale, if the data are in Excel you can transform the data in Excel. However, the R implementation is straightforward: simply enter the equation that defines the transformation into the R console. The primary “gotcha” here is that the variable’s reference needs to include the name of its containing data frame.

data_frame_name$variable_name

Why do you have to include the data frame name? You can have as many active data frames as your computer’s memory can accommodate. Each data frame can contain a variable of the same name. In many functions, specify the data frame that contains each variable with the $ notation. However, some analysis functions, such as my lessR functions, use the data parameter to specify the data frame that contains the relevant variables. In that situation, specify the variable names only.

d$Salary <- d$Salary / 1000

The resulting histogram in Figure 12 is expressed in units of thousands of dollars.

Histogram(Salary, xlab="Salary (USD in thousands)")

xlab: Specifies the label on the x-axis to override the default value of the variable name.

Another term for a variable transformation is calculated field.
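
A minimal base R sketch of the same kind of transformation follows. The data frame d here is a tiny made-up stand-in, and the new variable name Salary_k is a hypothetical choice. Storing the result under a new name, rather than overwriting the original as in the call shown earlier, is one option that keeps the values in dollars available.

```r
# Tiny illustrative data frame named d with Salary in dollars (assumed values)
d <- data.frame(Salary = c(46124.97, 69547.60, 134419.23))

# Reference the variable with the data frame name and the $ notation
d$Salary_k <- d$Salary / 1000   # Salary_k is in thousands of dollars

print(d)
```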

Figure 12: Histogram with rescaled x-axis.

Density Curve

The histogram is a 19th-century innovation. Its notable limitation is that the underlying distribution of a continuous variable is, well, continuous but the histogram plots the distribution in discrete chunks, the bins. Ideally, a continuous distribution would be represented by a smooth curve. Modern computer technology delivers that representation, directly estimating that smooth curve.

Density curve

The smooth curve that directly visualizes continuity.

As with the histogram, the density curve indicates where the values of the variable tend to occur more or less frequently than the other values, as shown in Figure 13. Here, return to the data from the Employee data set.

lessR Density curve

Histogram(Salary, density=TRUE)

density: Set to TRUE to display the smoothed density curve over the histogram.

Note: Can still specify bin_width and bin_start if desired.

The density curve in Figure 13 is a bit wobbly. Its smoothness can be adjusted.

Bandwidth

Determines the smoothness of the estimated density curve.

The value of each point of the density curve is computed by averaging the values of nearby data values. A larger bandwidth results in a smoother curve because it averages over a broader range of data points, leading to less sensitivity to fluctuations in individual data points. A smaller bandwidth produces a more jagged curve that closely follows the individual data points.

The lack of smoothness in Figure 13 is likely due to random sampling fluctuations characteristic of such small samples. The bandwidth applied to that histogram is 9000. To derive a smoother curve more impervious to this likely random error, the curve in Figure 14 was computed with a larger bandwidth than the density curve in Figure 13, a bandwidth value of 15000.

lessR Custom bandwidth

Histogram(Salary, density=TRUE, bandwidth=15000)

bandwidth: Specify the bandwidth for the density curve to provide for custom values.

Figure 14: Density curve of Salary with an increased bandwidth.

As with the underlying histogram, there is no one best density curve. The goal is to find the optimal value of bandwidth that balances too much smoothness against over-emphasizing random fluctuations of the data with lots of usually random ups and downs in the density curve.
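
The effect of bandwidth can be explored with the base R density() function. This sketch uses simulated values rather than the Employee data, and it assumes that the bw argument of density() plays a role comparable to the lessR bandwidth parameter, which may not hold exactly. The helpers n_peaks() and area() are hypothetical names.

```r
set.seed(42)  # reproducible simulated data
salary_sim <- rnorm(37, mean = 74000, sd = 22000)  # hypothetical salaries

# Estimate the density curve with a smaller and a larger bandwidth
dens_narrow <- density(salary_sim, bw = 9000)
dens_wide   <- density(salary_sim, bw = 15000)

# Count the local maxima (peaks) of an estimated curve
n_peaks <- function(y) sum(diff(sign(diff(y))) < 0)

# Approximate area under an estimated curve on its equally spaced grid
area <- function(dens) sum(dens$y) * diff(dens$x[1:2])

cat("bw  9000: peaks", n_peaks(dens_narrow$y), " area", round(area(dens_narrow), 3), "\n")
cat("bw 15000: peaks", n_peaks(dens_wide$y), " area", round(area(dens_wide), 3), "\n")
```

Fewer peaks at the larger bandwidth would indicate a smoother curve. The area under each curve should stay close to 1 regardless of bandwidth.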

Box Plot and Related

The primary alternative to the histogram for displaying a continuous variable is the box plot, plus related visualizations that can enhance the basic box plot. To understand the box plot, we need another set of statistics based on the position of a data value within the set of data values sorted from smallest to largest.

Order Statistics

There are many positions within a set of ordered data values that can be identified, which helps us understand the characteristics of a continuous variable’s distribution of data values.

Order statistic

Specify the position in an ordered set of data values.

One use for order statistics is that because their values depend only on the order of the sorted values in a distribution, they can be applied to ordinal data. Statistics such as the mean and the standard deviation require the higher quality interval or ratio data.

An essential order statistic indicates the middle of the distribution.

Median

The value in the middle of the sorted distribution, with half of the data values below it and half above it.

A distribution with an odd number of values has a data value in the middle. For a distribution with an even number of values, the median is generally not a data value but the mean of the two values closest to the middle.

The median requires one split of the sorted data, but any number of splits can be made. The quartiles split the distribution into four equal parts: the 1st, 2nd, 3rd, and 4th quarters.

Quartile

One of three values that together separate the values of an ordered distribution into four equal parts.

The median is the second quartile. The first quartile cuts off the bottom 25% of the distribution, and the third quartile cuts off the bottom 75%.

The more variable the data values in the distribution, the wider the box. From the quartiles, we can define an order statistic that describes the variability of the data values because its value is the width of the box in the box plot.

Interquartile range (IQR)

The range of the middle 50% of the data values in a distribution, computed as the difference between the third and first quartiles.

We have the median, counterpart to the mean, and the IQR, counterpart to the standard deviation.
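
These order statistics are available in base R. The values below are invented for illustration. Note that quantile() supports several definitions of a quartile, and its default may differ slightly from the one a given box plot function applies, so treat the output as one reasonable convention.

```r
# Hypothetical data values, listed in sorted order for readability
x <- c(2, 3, 5, 7, 8, 9, 10, 11, 12, 15, 20, 30, 40)

med <- median(x)                                  # middle value of the sorted data
q   <- quantile(x, probs = c(0.25, 0.50, 0.75))   # first, second, third quartiles
iqr <- IQR(x)                                     # third quartile minus first quartile

print(med)
print(q)
print(iqr)

# The mean is pulled toward the large values, the median is not
print(mean(x))
```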

Statistics such as the mean and standard deviation are called parametric statistics, which are compared to order statistics, sometimes referred to as non-parametric statistics.

The Box

Construct the box in the box plot from the three quartiles. The length of the box is the interquartile range or IQR, the range of data that contains the middle 50% of all the data values. A line through the box marks the median, and lines extend out perpendicular to the edges of the box. Figure 15 shows the box plot based on these quartiles.

For a symmetric distribution, such as the normal curve, the right side is a mirror image of the left side. The median will split the box evenly down the middle. However, many distributions lack symmetry, with one tail larger than the other.

Skewness

Indicator of lack of distribution symmetry.

An asymmetric distribution has a longer tail on either the right side or the left side on a histogram or density plot. Asymmetry reveals itself in the box plot such that the side of the distribution with the most extended tail occupies a larger box area from the median to the end of the box than on the other side. Figure 16 illustrates.

The implications of skew in decision-making can be influential. For example, consider a decision to set a re-order time. If the visualizations in Figure 16 represent ship time, it is apparent that some shipments are received much later than the others. The average ship time may not be a sufficient indicator of the amount of lag expected from the time of the order to the time of receiving the shipment. When setting a reorder point based on inventory levels, the cost of not having adequate inventory due to a particularly late order must be considered.

Whiskers and Outliers

As readily seen from the box plot visualizations, lines extend out from either side of the box. To understand the meaning of those lines we first need to understand the concept of an outlier. Some values in a distribution may be anomalies. An essential task of virtually every data analysis is identifying these anomalies and then understanding why they occurred.

Outlier

A value far from most of the remaining data values.

Outliers should always be identified for each variable of interest because their values could represent something as simple and as important as a coding error. Alternatively, an outlier could represent the outcome of a process that is different from the process that generated all or most of the other values of the distribution.

Sample from one population

A summary statistic should summarize data sampled from a single population, that is, data generated by a single process.

Data are only meaningful when the same process generates all the data values for a variable. Mixing the values from multiple processes into data values for a single variable may yield accurate numerical results but does not represent any real-world process. The process that generates an outlier can be different from that which generated the remaining values.

An analysis of salaries may include some part-time employees as well as full-time employees. Any resulting conclusions, such as the average salary, may not be representative of either group.
The mean of nine salaries and last year’s annual GNP can be correctly calculated, but this mean has no meaningful interpretation.

The lack of meaning that results from mixing data from different processes follows from the fact that no single concept or entity is shared by all the data values. Identify and then analyze outliers from a different population as a separate group, and then generalize the results to the population of interest.

The box plot is the classic visualization for detecting outliers in a single distribution of values for a variable. In the context of a box plot, outliers are defined relative to the size of the box. The definition of an outlier pertains to two different categories. As you read these definitions, remember that the IQR is the width of the box.

Well-known and influential statistician John Tukey developed the concept of the box plot in the 1970’s and developed these definitions of an outlier.

Moderate or potential outlier

Values between 1.5 IQR’s and 3.0 IQR’s from the edges of the box.

Outlier

Values more than 3.0 IQR’s from either edge of the box.

The purpose of those lines extending out from the central box of the box plot is to assist in identifying outliers.

Whisker

A line from a box’s edge that extends to the most extreme data value that is not a potential outlier, that is, within 1.5 IQR’s of the edges, with a perpendicular line segment at the end of the line.
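
The three definitions above can be applied directly in base R. This is a sketch with invented data values and the default quartile definition of quantile(), which is an assumption; lessR or another box plot function may compute the quartiles slightly differently.

```r
# Hypothetical data values
x <- c(2, 3, 5, 7, 8, 9, 10, 11, 12, 15, 20, 30, 40)

q1  <- quantile(x, 0.25, names = FALSE)   # left edge of the box
q3  <- quantile(x, 0.75, names = FALSE)   # right edge of the box
iqr <- q3 - q1                            # width of the box

# Distance of each value beyond the nearest edge of the box (0 if inside the box)
beyond <- pmax(q1 - x, x - q3, 0)

potential <- x[beyond > 1.5 * iqr & beyond <= 3 * iqr]  # potential outliers
outliers  <- x[beyond > 3 * iqr]                        # outliers
whiskers  <- range(x[beyond <= 1.5 * iqr])              # ends of the two whiskers

print(potential)
print(outliers)
print(whiskers)
```

Each whisker ends at an actual data value, the most extreme one that is still within 1.5 IQR's of the box.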

Figure 17 shows the box plot of Salary from the Employee data set.

lessR Boxplot

BoxPlot(Salary)

Although that function call works just fine, BoxPlot() is actually an alias, a stand-in, for the Plot() function with a specific parameter setting.

vbs_plot: With values that are some combination of v, b, and s. The default value is "vbs", which means create the full Violin/Box/Scatterplot.

To call Plot() directly to create only a box plot, set that parameter to "b". The following function call produces the same box plot as shown in Figure 17. The variable for the analysis needs to be continuous.

Plot(Salary, vbs_plot="b")

Note: The lessR box plot analysis displays potential outliers with a relatively dark red dot and actual outliers with a stronger red dot.

Notice the red point displayed past the right whisker of the box plot in Figure 17. That point is a potential outlier, the largest salary in the employee data set, almost $134,500. That value is more than 1.5 times the width of the box beyond the third quartile. The data analyst who is doing this salary analysis needs to explore why that potential outlier exists.

Interpretation. Is there some characteristic that distinguishes the outlier employee from the remaining 36 employees? For example, is the employee in a more upper echelon of management than the remaining employees? If so, then that data point should be dropped from the analysis, and the conclusions from the salary analysis applied to the 36 employees at the same level of the hierarchy within the company. Whether or not to delete that data value from the analysis cannot be determined from the data alone. Instead, the analyst must supply knowledge of how salaries are defined within the company and how different management levels are allocated.

Subjective judgment and business domain knowledge cannot be separated from data analytic conclusions. Many pure mathematical or statistical types prefer to live in a world of their geeked mathematical objective reality. However, the reality in which we are working, doing data analysis, requires domain knowledge and judgment. When you encounter an outlier, always use your subjective judgment to evaluate the reason for the outlier:

a coding error
from a different process than the other data values
a weird result from the same process

Statistics alone cannot answer that essential question.

Here is an example of outliers that result from the same process. To illustrate, generate 1000 normal curve simulated data values, as shown in Figure 18.

Figure 18: Plot of 1000 simulated normal curve values.

The detection of outliers at the two extremes of the normal curve provides an excellent example of how outliers can occur when generated from the same process as the remaining data values. Weird things happen from time to time. Occasionally, you flip a fair coin 10 times and get exactly nine heads (probability of 0.00977, or almost 1%). All the data in Figure 18 are properly sampled from the same normal distribution. The detected outliers represent a weird result from the same process, expected for such a large sample, here with 1000 simulated normally distributed data values.
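
Both claims can be checked with base R probability functions. The second calculation is a sketch that assumes the population quartiles of the normal curve stand in for the sample quartiles, so the expected count is approximate.

```r
# Probability of exactly nine heads in 10 flips of a fair coin: 10/1024
p_nine <- dbinom(9, size = 10, prob = 0.5)

# Probability of nine or more heads: 11/1024
p_nine_or_more <- pbinom(8, size = 10, prob = 0.5, lower.tail = FALSE)

# For a standard normal curve, the box runs from -q3 to q3
q3 <- qnorm(0.75)
upper_fence <- q3 + 1.5 * (2 * q3)   # 1.5 IQR's beyond the edge of the box

# Proportion of values expected beyond either whisker limit
p_beyond <- 2 * pnorm(upper_fence, lower.tail = FALSE)

print(p_nine)
print(p_nine_or_more)
print(p_beyond)
print(1000 * p_beyond)   # expected number of flagged values in a sample of 1000
```

Under these assumptions, roughly 0.7% of normally distributed values fall beyond the whisker limits, so several flagged points in a sample of 1000 are to be expected.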

One-Variable Scatterplot

The scatterplot provides a direct visualization of the data values and corresponding sample size. Each data value plots as a point, according to its location along the corresponding value axis.

lessR One-variable scatterplot

Plot(Salary, vbs_plot="s")

vbs_plot: Set to "s" to create only the scatterplot for, in this example, the one specified variable.

Figure 19: One-dimensional scatterplot for Salary.

Points corresponding to the same value overlap. What happens if points overlap each other? Plot overlapping points either with partial transparency or with random perturbations to distinguish between multiple points of the same value.

Jitter

Random displacement of the horizontal and/or vertical coordinates of a plotted point.

More overlap results in the need for more jitter. If possible, apply only vertical jitter to retain the value of the plotted point. If there are too many points with the same value, add horizontal jitter as well.

The scatterplot in Figure 19 displays slight jitter. To emphasize, Figure 20 displays more vertical jitter for the same data values.

lessR Jitter specification

jitter_y: Specify the extent of vertical jitter for a scatterplot to provide for custom values.

jitter_x: Specify the extent of horizontal jitter for a scatterplot to provide for custom values.

By default, the plotted values will be jittered vertically approximately as much as needed, which can then be customized. The values of the jitter parameters used in creating the scatterplot are listed as part of the output at the R console.

Parameter values (can be manually set)
-------------------------------------------------------
size: 0.61      size of plotted points
out_size: 0.82  size of plotted outlier points
jitter_y: 0.45  random vertical movement of points
jitter_x: 0.00  random horizontal movement of points
bw: 9529.04     set bandwidth higher for smoother edges

Figure 20: One-dimensional scatterplot of Salary with added vertical jitter.

Again, there is no one correct answer for the amount of jitter to apply to the points in a scatterplot. The amount to apply depends on the subjective perception of the analyst, usually just enough to separate the points.
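
The idea of vertical jitter can be sketched in base R. The data values and the jitter amount of 0.45 are illustrative assumptions, and the uniform random displacement used here is not necessarily how lessR scales or applies its jitter_y parameter.

```r
set.seed(1)  # reproducible random displacements

# Hypothetical data values with many ties
x <- c(5, 5, 5, 6, 6, 7, 7, 7, 7, 8)

# Maximum vertical displacement (assumed scale, for illustration only)
jitter_amount <- 0.45

# Vertical jitter only: the x values are untouched, so each point
# still sits at its true value along the value axis
jittered <- data.frame(
  x = x,
  y = runif(length(x), min = -jitter_amount, max = jitter_amount)
)
print(jittered)

# Draw the one-dimensional scatterplot only in an interactive session
if (interactive()) {
  plot(jittered$x, jittered$y, ylim = c(-1, 1), yaxt = "n",
       xlab = "Value", ylab = "")
}
```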

VBS Plot

Further Development

John Tukey developed the box plot just before computer technology became dominant and widely accessible for data visualization. As such, it is a pre-computer technology, which, however, still performs quite well even with our modern sophisticated computer visualizations. That said, we have a modern update that enhances the box plot.

Violin plot

Replace the top and bottom flat edges of the box plot with a density plot, which often, but not necessarily, resembles a violin.

Figure 21 shows an example of a violin plot.

Those flat edges of the box in the box plot do not provide any information. Replacing those edges with the corresponding density plot offers more information regarding the location of the data values of the continuous variable of interest.

With modern computer technology we can go further to combine the violin plot, the box plot, and the one-dimensional scatterplot into one comprehensive visualization. Contributing to the development of this type of visualization (Gerbing 2020), I call this combination the VBS plot for Violin/Box/Scatterplot, illustrated in Figure 22.

Gerbing, David. 2020. R Visualizations: Derive Meaning from Data. CRC Press.

lessR VBS visualization

Plot(Salary)

The default value of the vbs_plot parameter is "vbs".

The VBS plot simultaneously retains the advantages of all three of its constituent visualizations. We have the density plot to estimate the smooth distribution of the continuous variable over the range of data values, we have the box plot with the median and first and third quartiles along with the whiskers to identify outliers, and we have a direct visualization of the data points themselves.

Trellis Graphics

A common and essential task of data visualization is comparing the values of a distribution across different groups.

Stratification

Divide a data table into subgroups, or strata, based on the levels of one or more categorical variables to analyze how the distribution of a single variable or the relationship between two variables varies across these groups.
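Stratification can be sketched in a few lines of base R. The small data frame below is invented for illustration and is not the employee data set; split() forms one stratum per level of the categorical variable, and each stratum can then be summarized or plotted separately.

```r
# Hypothetical data table; the values are made up for illustration.
d <- data.frame(
  Salary = c(52000, 61000, 75000, 48000, 67000, 83000),
  Gender = c("Female", "Male", "Male", "Female", "Female", "Male")
)

# One stratum of Salary values for each level of Gender.
strata <- split(d$Salary, d$Gender)

# Summarize each stratum separately, here with the median.
medians <- sapply(strata, median)
medians   # Female 52000, Male 75000
```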

One Conditioning Variable

In the employee data set, we have the salaries for 37 employees and their gender. In this data set, only male and female values surfaced. Now compare salaries across these two groups.

Trellis (or facet) graphics

Plot the same visualization in different panels, each panel displaying the data values for a different group.

The Trellis plot, often referred to as a facet plot, visualizes a stratification based on the levels of at least one categorical variable. The visualization of each level of the categorical variable is called a facet.

The Trellis plot is so named because it resembles a garden trellis, also called a lattice. The garden trellis is a structure that supports climbing plants and vines. The trellis provides a vertical surface on which plants can grow upward. Figure 23 shows four examples.

Figure 23: Four garden trellis images created by DALL-E, OpenAI.

Figure 24 illustrates a simple Trellis visualization: Generate a VBS plot separately for Men and Women. The categorical variable that specifies the levels for which to create the visualization is called a conditioning variable. For Figure 24, there are two levels of the conditioning variable Gender present in these data: Male and Female.

lessR VBS Trellis visualization

Plot(Salary, facet1=Gender)

facet1: Specify the conditioning variable, the categorical variable whose levels define the different panels, or facets, of the Trellis plot.

Note: Can also specify the vbs_plot parameter to obtain any combination of violin, box, and scatterplots in the Trellis plot.

Note: Interact with colors and other parameters with Trellis VBS plots by entering interact("Trellis").

Figure 24: Trellis VBS plot of Salary plotted separately for two Genders.

Interpretation. We conclude that a higher percentage of women occupy the lower salary ranges. For example, the median Salary for women is considerably lower than that for men. More of the data points are shifted lower for women, even though the highest salary belongs to a woman. The pattern in these visualizations encourages further investigation to establish any significant differences in Salary, and perhaps a pattern of discrimination.

Current discrimination, however, is not a conclusion that these data can support. The pattern is potentially suggestive, but alternative explanations exist. Perhaps women in the past did not apply for the available jobs in equal numbers with men. Perhaps in the past, maternity leave and support were insufficient and have since been upgraded. Or, perhaps management used to be dominated by cigar-smoking, whiskey-drinking chauvinists who did not hire women, but now all the old guys are retired, dead, or in jail.

Two Conditioning Variables

Further subsetting of the data values occurs when we condition over two categorical variables. In that situation, groups are formed for every combination of levels for the two variables. For the Employee data, we have two levels of Gender and three levels of health Plan, numbered 1, 2, and 3. There is a total of six groups to plot, as shown in Figure 25.
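The count of six groups follows from crossing the two levels of Gender with the three levels of Plan. The base R sketch below illustrates this with an invented data frame, not the actual Employee data: passing a list of two variables to split() forms one group for every combination of their levels.

```r
# Hypothetical data table; the values are made up for illustration.
d <- data.frame(
  Salary = c(52000, 61000, 48000, 75000, 67000, 83000, 58000, 70000),
  Gender = c("Female", "Male", "Female", "Male",
             "Female", "Male", "Female", "Male"),
  Plan   = c(1, 1, 2, 2, 3, 3, 1, 2)
)

# Cross-tabulate the two conditioning variables: 2 rows by 3 columns.
groups <- table(d$Gender, d$Plan)
n_groups <- nrow(groups) * ncol(groups)   # 2 * 3 = 6

# One group of Salary values per combination of Gender and Plan.
cells <- split(d$Salary, list(d$Gender, d$Plan))
names(cells)
```

Each element of cells corresponds to one panel of a two-way Trellis plot. With real data, some combinations may contain few or no values, which is worth checking before interpreting the individual panels.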

lessR VBS Two-Way Trellis visualization

Plot(Salary, facet1=Gender, facet2=Plan)

facet1: Specify the first conditioning variable, the categorical variable whose levels define the different panels of the Trellis plot.

facet2: Specify the value of the second conditioning variable.

Figure 25: Trellis VBS plot of Salary plotted separately for each combination of Gender and health Plan.