Doing data analysis involves analyzing variables. We use the term “variables” because their data values vary. Co-variability, or relatedness, of two or more variables is the topic addressed here.

Data analysis

The analysis of variability applies to the values of a single variable. The analysis of co-variability assesses the extent to which the values of two or more variables vary together.

The methods for analyzing variability and co-variability differ for categorical and continuous variables. We begin by analyzing categorical variables.

Categorical Variables

Joint Distributions

From lessR version 4.5.4.

Visualizations for two categorical variables show the relationship between the two variables with respect to a numerical variable. For example, how is salary related to an employee’s gender and department? If there is no relationship among the categorical groups with respect to salary, then each group would have the same mean salary. If there are real differences in salary across the groups, they will be evident in the group means, for example, visually in a bar chart.

As with a single-variable bar chart, a bar chart for two categorical variables associates a numerical value with each group. The distinction is that each group is defined by a combination of levels of the two categorical variables. As with the one-variable chart, the bar chart is constructed from a summary table, either entered directly as the data into the bar chart function or implicitly computed by the function from the original, raw data.

The source of the numeric values associated with the groups can be anything, including random nonsense. However, the summary table is usually computed from a data aggregation for the specified numerical variable. As with the one-categorical-variable bar chart, compute the numerical value as an aggregation of either:

the count of the membership in each group
a statistic such as the mean of another numerical variable
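As a minimal sketch of these two aggregations, written in base R rather than lessR, the following computes a count and a mean for each Gender-by-Dept group. The data frame `emp` and all of its values are hypothetical, invented only for illustration, and are not taken from the Employee data.

```r
# Hypothetical raw data: one row per employee (values invented for illustration)
emp <- data.frame(
  Gender = c("M", "M", "W", "W", "W", "M"),
  Dept   = c("ACCT", "ACCT", "ACCT", "SALE", "SALE", "SALE"),
  Salary = c(50000, 60000, 58000, 70000, 74000, 90000)
)

# Aggregation 1: the count of the membership in each group
counts <- aggregate(Salary ~ Gender + Dept, data = emp, FUN = length)
names(counts)[3] <- "n"

# Aggregation 2: the mean of the numerical variable within each group
means <- aggregate(Salary ~ Gender + Dept, data = emp, FUN = mean)
names(means)[3] <- "Salary_mean"

# Combine into one long-format summary table, one row per group
summary_tbl <- merge(counts, means, by = c("Gender", "Dept"))
print(summary_tbl)
```

Either column, `n` or `Salary_mean`, could then serve as the numerical value associated with each group.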

An example summary, or pivot, table follows for the 10 groups defined by the combination of two levels of Gender and five levels of Dept. This kind of summary table has its own name.

Joint distribution

A summary table of two variables that lists the numerical value associated with each group.

Each group is associated with its mean Salary. Table 1 shows the joint distribution with three variables presented in long format, the format in which the data can be read into a data analysis application for analysis, such as from a corresponding Excel data file. In this format, data values for a single variable are organized into a single column, and each row contains the values for a single entity, such as one employee.

Table 1: Mean Salary by Gender and Department in long format, suitable for reading into a data analysis application.

Gender	Dept	Salary_mean
M	ACCT	69626.20
W	ACCT	73237.16
M	ADMN	90963.34
W	ADMN	91434.00
M	FINC	82967.60
W	FINC	67139.90
M	MKTG	109062.66
W	MKTG	74496.02
M	SALE	96150.97
W	SALE	74188.25

When presenting the results, comparing the numbers that define the joint distribution becomes easier with an alternative table organization. This organization arranges the levels of one categorical variable horizontally and the levels of the other categorical variable vertically. Each cell contains the corresponding group mean, resulting in a wide format table. However, without additional programming, the horizontal tabular presentation in Table 2 is not a data table that can be directly imported into a data analysis application.
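The reorganization from long to wide format can be sketched in base R. This is one possible approach, not the method used to produce the tables in this text. The object names `long` and `wide` are assumptions, and the values are those listed in Table 1.

```r
# Long-format summary table, values as listed in Table 1
long <- data.frame(
  Gender = rep(c("M", "W"), times = 5),
  Dept   = rep(c("ACCT", "ADMN", "FINC", "MKTG", "SALE"), each = 2),
  Salary_mean = c(69626.20, 73237.16, 90963.34, 91434.00, 82967.60,
                  67139.90, 109062.66, 74496.02, 96150.97, 74188.25)
)

# Cross Dept (rows) with Gender (columns).
# xtabs() sums within each cell; with one row per group,
# each cell simply holds that group's mean.
wide <- xtabs(Salary_mean ~ Dept + Gender, data = long)
print(wide)
```

The result is a two-way table for presentation. The long-format data frame remains the form to pass to an analysis function.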

Visualizations such as the bar chart display descriptive statistics. To evaluate whether the differences likely generalize beyond a single sample, inferential statistics are required. The chi-square test of independence assesses the independence of two categorical variables, with the null hypothesis that the variables are unrelated.

Table 2: Mean Salary by Gender and Department in wide format.

	Gender
Dept	M	W
ACCT	59626.19	63237.16
ADMN	80963.35	81434.00
FINC	72967.60	57139.90
MKTG	99062.66	64496.02
SALE	86150.97	64188.25

The descriptive statistics in Table 2 communicate how salary relates to gender and department. We can clearly see the average salary in each department separately for men and women. For example, we can see that the highest earning group is men in marketing, with an average salary of $99,063. To better evaluate this relationship, visualize the data. The corresponding bar chart constructed from that table of statistics more effectively communicates the underlying relationship.

Bar Charts

Stacked Bar Chart

There are different forms of the bar chart. Here, consider the stacked bar chart.

Stacked bar chart

For each level of the first categorical variable, the levels of the second categorical variable are stacked, one on top of the other.

A stacked bar chart shows the relationship of two categorical variables with respect to a numerical, continuous variable.

Purpose of the stacked bar chart

Compare the levels of the first categorical variable while observing the subdivision by the second categorical variable within each level of the first variable.

Directly from the Summary Table

Figure 1 shows a bar chart of the mean salary for employees in each department, separately for men and women, constructed entirely from the information in Table 1. Figure 1 is the visual representation of Table 1. All three variables in the summary table are processed by the bar chart function in this example: the categorical variables Gender and Dept, and the continuous variable Salary_mean, which contains the mean salary for each group.

As such, read the data shown in Table 1 from a data file containing the three variables, then create the bar chart.

lessR stacked bar chart from the summary table

d <- Read("https://web.pdx.edu/~gerbing/data/GenDptSalary.xlsx")
Chart(Dept, y=Salary_mean, by=Gender)

The data frame that contains the variables of interest is not specified with the data parameter in the call to Chart(), so the function relies on the default data frame name, d. The x-axis variable is the categorical variable Dept, and the y-axis variable is the continuous variable Salary_mean.

by: The second categorical variable.

Note: This specification for the call to Chart() is the same as the function call for the one-categorical-variable visualization, except that the by variable is also specified.

Note: Experiment interactively with different parameter settings of the two-variable bar chart for your chosen data with the lessR function interact("BarChart").

Figure 1: Stacked bar chart of the mean Salary of employees in each department, further subdivided into Gender, computed from the summary table as the data.

Each bar in the stacked bar chart has the same height as if there were no by variable, but each bar interior is now divided according to the relative proportions of the by levels. In this example, the height of each Dept bar is the same as in the one-variable plot of Dept, but the bars are now subdivided by Gender.

From the Original Data

Perhaps the more common way to create the bar chart is to enter the raw, original data table of measurements into the bar chart function. The function then implicitly computes the summary, or pivot, table from the original measurements and displays the corresponding bar chart. However, in this situation, a parameter is needed to indicate the statistic by which to aggregate the numerical variable. Here, compute the mean salary.

lessR stacked bar chart from the original data

d <- Read("Employee")
Chart(Dept, y=Salary, by=Gender, stat="mean")

by: The second categorical variable.

stat: The statistic by which to aggregate the specified numerical variable. Applicable values include "sum", "mean", "sd", "dev" for mean deviations, "min", "median", and "max".

Figure 2: Stacked bar chart of the mean Salary of employees in each department, further subdivided into Gender, computed from the original, raw data.

The only distinction between the bar chart created from the summary table, Figure 1, and the bar chart computed from the original data of measurements, Figure 2, is the label on the vertical, $y$, axis. In the summary table, the previously computed mean is entered as the variable Salary_mean. In the bar chart created from the original data, the output indicates that the mean was computed from Salary. Of course, the data analysis application also allows custom axis labels.

Unstacked Bar Chart

The alternative to the stacked bar chart is the unstacked version. The unstacked bar chart presents the same information as the stacked bar chart but with a different emphasis.

Unstacked or grouped bar chart

For each level of the first categorical variable, plot the levels of the second categorical variable as separate bars grouped together.

The unstacked bar chart presents the levels of the second categorical variable, the subcategories, side by side.

Purpose of the unstacked bar chart

Directly visualize the differences between the levels of the second categorical variable at each level of the first variable.

Figure 3 shows a bar chart of the mean salary of the employees in each department, with adjacent bars for men and women. The comparison of men’s and women’s mean salaries in each department is straightforward: compare the heights of the adjacent bars.

lessR unstacked bar chart

Chart(Dept, y=Salary, by=Gender, stat="mean", beside=TRUE)

beside: Set to TRUE to create the unstacked version.

labels: Optional, set to "off" to turn labels off, or "%" to show cell percentages, or "input" to show the actual input values. The default for the unstacked bar chart is "%" because the bars are often too narrow to display the full input value.

Figure 3: Unstacked bar chart of the mean Salary of employees in each department, further subdivided into Gender.

The unstacked bar chart explicitly compares the mean Salary of each Gender within each department.

Interpretation. Men and women employees in accounting and administration, on average, have about the same salaries. However, in finance and sales, and especially in marketing, men make considerably more. This discrepancy should be further investigated. The result may be benign; for example, perhaps many men have worked longer at the company. Or, there may be overt or covert discrimination.
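The comparison that the adjacent bars invite can also be computed directly. The following base R sketch uses the group means as listed in Table 2. The names `means` and `gap` are assumptions introduced here for illustration.

```r
# Group means by Dept (rows) and Gender (columns), values as listed in Table 2
means <- matrix(
  c(59626.19, 63237.16,
    80963.35, 81434.00,
    72967.60, 57139.90,
    99062.66, 64496.02,
    86150.97, 64188.25),
  nrow = 5, byrow = TRUE,
  dimnames = list(Dept   = c("ACCT", "ADMN", "FINC", "MKTG", "SALE"),
                  Gender = c("M", "W"))
)

# Difference in mean salary, men minus women, within each department
gap <- means[, "M"] - means[, "W"]
print(round(gap))

# Department with the largest difference in favor of men
print(names(which.max(gap)))
```

A difference near zero corresponds to adjacent bars of about equal height. A large positive difference corresponds to a much taller bar for men.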

Charts from Counts

When constructing a bar chart from counts, the numerical variable in the visualization is the count of observations in each group.

Do men and women employees differ in their job satisfaction? Are gender and job satisfaction related? To answer this question, employees were administered a self-report survey in which they rated their job satisfaction on a 3-point scale: low, medium, and high.

The resulting summary table of counts in Table 3 follows for the 35 employees with data recorded for both corresponding variables, Gender and JobSat.

Table 3: Counts of job satisfaction categories by gender.

	Gender
JobSat	M	W
low	11	2
med	4	7
high	3	8

The joint distribution of counts, that is, frequencies, has its own name.

Cross-tabulation table

A table that lists the number of occurrences for each group defined by two or more categorical variables.

The table’s name comes from the meaning of the word tabulate, which means to count. A bar chart of counts is constructed from the numbers in this cross-tabulation table.
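To illustrate tabulation, the following base R sketch counts the employees in each group with `table()`. The one-row-per-employee data frame `raw` is reconstructed here from the counts in Table 3 for illustration only. It is not the original data file.

```r
# Rebuild one row per employee from the Table 3 counts
# (rows are reconstructed for illustration, not read from the original data)
raw <- data.frame(
  JobSat = rep(c("low", "med", "high", "low", "med", "high"),
               times = c(11, 4, 3, 2, 7, 8)),
  Gender = rep(c("M", "W"), times = c(18, 17))
)

# Keep the ordinal order low < med < high instead of alphabetical order
raw$JobSat <- factor(raw$JobSat, levels = c("low", "med", "high"))

# Tabulate: count the employees in each JobSat-by-Gender group
xt <- table(raw$JobSat, raw$Gender, dnn = c("JobSat", "Gender"))
print(xt)

# The same table with row and column sums added
print(addmargins(xt))
```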

Figure 4 shows a two-categorical-variable bar chart of the number of men and women employees, subdivided by level of job satisfaction.

lessR two categorical variable bar chart from counts

Chart(Gender, by=JobSat)

Note: No numerical variables are entered into the analysis; only the two categorical variables. The numerical variable for the analysis, the Count of employees in each group, is obtained directly from an analysis of the two categorical variables without reference to a third variable.

Figure 4: Visualization of the relation between gender and job satisfaction among employees.

The largest response category is men with low job satisfaction, 11 employees, or 31% of all employees. The next-largest response category is women with high job satisfaction, 8 employees, or 23% of all employees. The smallest response category is women with low job satisfaction, 2 employees, or 6% of all employees.
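These percentages are each cell count divided by the total of 35 employees. A minimal base R sketch of the calculation, using the counts from Table 3 (the names `xt` and `pct` are assumptions):

```r
# Cross-tabulation of counts from Table 3
xt <- matrix(c(11, 2,
                4, 7,
                3, 8),
             nrow = 3, byrow = TRUE,
             dimnames = list(JobSat = c("low", "med", "high"),
                             Gender = c("M", "W")))

# Each cell as a rounded percentage of all employees in the table
pct <- round(100 * prop.table(xt))
print(pct)
```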

Interpretation. Job satisfaction and gender are related for the employees of this company. Among the 35 employees for whom data for both variables were recorded, men tend to have lower job satisfaction than women.

How should we explain the relationship between gender and job satisfaction? These data cannot answer that question, but the pattern is clearly a topic for further investigation by management.
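Before seeking an explanation, it may help to check whether the pattern plausibly extends beyond these 35 employees. The chi-square test of independence noted earlier applies to the counts in Table 3. The following is a minimal sketch with the base R function `chisq.test()`, not a lessR function call. With a sample this small, treat the result as tentative.

```r
# Cross-tabulation of counts from Table 3
xt <- matrix(c(11, 2,
                4, 7,
                3, 8),
             nrow = 3, byrow = TRUE,
             dimnames = list(JobSat = c("low", "med", "high"),
                             Gender = c("M", "W")))

# Chi-square test of independence; null hypothesis: JobSat and Gender unrelated
test <- chisq.test(xt)

# Expected counts if the two variables were independent
print(round(test$expected, 2))

# Test statistic, degrees of freedom, and p-value
print(test)
```

A small p-value indicates that counts this far from the expected counts would be unlikely if gender and job satisfaction were unrelated.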

100% Stacked Bar Chart

In general, there are different numbers of observations in each category or level on the x-axis. In this example, the number of men and women is almost the same, but the numbers can vary dramatically in other situations. To compensate for unequal sample sizes, the 100% stacked bar chart compares the distribution of the by variable within each level of the x-variable.

A motorcycle clothing manufacturer sells jackets in three different thicknesses: Lite, Medium, and Thick. One venue for selling is a motorcycle rally, which features a specific motorcycle brand. The clothing manufacturer must decide the product mix to bring to each rally. Different rallies have different levels of attendance, so the manufacturer should not base the comparison only on the raw number of jackets sold. Instead, the relevant question is whether the percentage distribution of jacket types differs across rallies.

To compare the product mix across motorcycle brands with different sample sizes, use the 100% stacked bar chart. Each bar has the same height, 100%, and the segments within each bar show the relative percentage of each jacket type for that motorcycle brand.

Consider how many Lite jackets to inventory at a BMW rally. The bar chart in Figure 5 shows that 9% of all purchases are made by BMW owners buying a Lite jacket. However, this 9% applies to all purchases in the entire cross-tabulation table, and is also affected by the number of jackets purchased by Honda owners.
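For reference, the 9% is the count of Lite jackets bought by BMW owners divided by the total of all purchases, using the counts listed in Table 4:

\[ \frac{89}{1025} \approx 0.0868 \]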

Instead, the interest is the percentage of observations within each level of the first categorical variable, here motorcycle brand. The 100% stacked bar chart facilitates comparison of the second categorical variable across the levels of the first variable by computing percentages within each level.

Each bar’s height represents 100% of the observations within one category of the first categorical variable. Figure 6 shows both the traditional and 100% stacked bar charts of past rally sales for two motorcycle brands. The data are from the lessR file Jackets.

lessR stacked 100% bar chart

Chart(Bike, by=Jacket, stack100=TRUE)

stack100: Set to TRUE to display the stacked 100% bar chart.

As with the traditional bar chart of two categorical variables, the analysis begins with the table of joint frequencies, also including the row and column sums shown in Table 4.

Table 4: Jacket Sales by Jacket Type and Bike Brand

	Bike
Jacket	BMW	Honda	Sum
Lite	89	283	372
Med	135	207	342
Thick	194	117	311
Sum	418	607	1025

To construct the 100% stacked bar chart from the joint frequencies, divide each cell frequency not by the overall sum but by the corresponding column sum, the column marginal sum. For example, the proportion of BMW riders who buy a Lite jacket:

\[ \frac{89}{418} \approx 0.2129 \]

Find the corresponding value of 21% at the bottom of the first bar in Figure 6, the percentage of Lite jacket purchases at BMW rallies.

Interpretation. When going to a motorcycle rally with most or all attendees being BMW owners, bring about 21% of the total inventory for Lite jackets, 32% Medium jackets, and the remaining 46% Thick jackets.

The different product mixes likely follow from differences in riding style. The BMW bike is more of a sport bike than the Honda bikes. The thicker jacket provides more protection in case of a spill.
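The two kinds of percentages discussed in this section can be contrasted with a short base R sketch. It uses the joint frequencies from Table 4. The object names are assumptions, and the code only reproduces the arithmetic, not the lessR chart itself.

```r
# Joint frequencies from Table 4, without the marginal sums
sales <- matrix(c( 89, 283,
                  135, 207,
                  194, 117),
                nrow = 3, byrow = TRUE,
                dimnames = list(Jacket = c("Lite", "Med", "Thick"),
                                Bike   = c("BMW", "Honda")))

# Percent of the grand total: depends on how many jackets each brand's owners bought
pct_total <- round(100 * prop.table(sales))
print(pct_total)

# Percent within each Bike column: the basis of the 100% stacked bar chart
pct_within <- round(100 * prop.table(sales, margin = 2))
print(pct_within)
```

Because each column of within-brand proportions sums to 1, the product mix can be compared across brands regardless of how many jackets each brand's owners bought.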

Bar Chart Alternatives

To illustrate alternatives to the bar chart for visualizing the same data, return to Table 1, computed from the Employee data table. This table contains two categorical variables, Dept and Gender, and the continuous variable Salary_mean. Table 5 expands this table by including the number of employees in each group, the variable n.

Table 5: Mean Salary and Count by Gender and Department

Gender	Dept	n	Salary_mean
M	ACCT	2	69626.20
W	ACCT	3	73237.16
M	ADMN	2	90963.34
W	ADMN	4	91434.00
M	FINC	3	82967.60
W	FINC	1	67139.90
M	MKTG	1	109062.66
W	MKTG	5	74496.02
M	SALE	10	96150.97
W	SALE	5	74188.25

The following sections apply this data table to the treemap, sunburst chart, and bubble plot visualizations.

Treemap

Just as the one-categorical-variable bar chart generalizes to a bar chart of two categorical variables, the treemap visualization also generalizes to two categorical variables.

Treemap characteristics

The values of the categorical variables determine the structure: rectangular boxes that, together, define the treemap, one box for each group. The numerical variable, with a value for each group, determines the size of the corresponding box.

lessR Treemap

Chart(Dept, by=Gender, type="treemap")
Chart(Dept, by=Gender, y=Salary_mean, type="treemap")

First, create a treemap of the count of occurrences of each group using the variable n already computed in the given summary table. Find the result in Figure 7.

Figure 7: Treemap of counts over Dept and Gender.

Second, as in Figure 8, create the treemap of the mean of salary distributed across departments according to gender within each department.

Figure 8: Treemap of Salary distributed over department and gender.
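To see how the numerical variable could translate into box sizes, the following base R sketch computes each group's share of the total count from Table 5. It assumes that box area is proportional to the value, as is conventional for treemaps. The names `d5` and `area_share` are hypothetical.

```r
# Group counts as listed in Table 5
d5 <- data.frame(
  Gender = rep(c("M", "W"), times = 5),
  Dept   = rep(c("ACCT", "ADMN", "FINC", "MKTG", "SALE"), each = 2),
  n      = c(2, 3, 2, 4, 3, 1, 1, 5, 10, 5)
)

# Share of the total area for each box when size encodes the count
# (assumes area proportional to value)
d5$area_share <- d5$n / sum(d5$n)

# Largest boxes first
print(d5[order(-d5$area_share), ])
```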

Sunburst Chart

The sunburst chart extends the pie chart analogously to how the stacked or unstacked bar chart extends the bar chart of one categorical variable. The sunburst chart adds another layer, displayed as a concentric ring, for each additional stratification variable. Unlike the bar chart for two categorical variables, whether stacked or unstacked, the sunburst chart can add as many concentric rings as specified.

Consistent with the other charts, the sunburst chart can apply to counts, Figure 9, or to another variable aggregated over the categories, Figure 10.

lessR Sunburst

Chart(Dept, by=Gender, type="sunburst")
Chart(Dept, by=Gender, y=Salary_mean, type="sunburst")

Figure 9: Sunburst chart of counts over Dept and Gender.

Figure 10: Sunburst of Salary distributed over Dept and Gender.

In these sunburst charts, the initial entering pie chart of Dept is further subdivided into Gender values of M and W in the outer ring.
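The two rings of the count version can be sketched numerically in base R using the group counts from Table 5. The names `inner` and `outer` are assumptions used only to mirror the rings. The code reproduces the underlying totals, not the chart.

```r
# Group counts as listed in Table 5
d5 <- data.frame(
  Gender = rep(c("M", "W"), times = 5),
  Dept   = rep(c("ACCT", "ADMN", "FINC", "MKTG", "SALE"), each = 2),
  n      = c(2, 3, 2, 4, 3, 1, 1, 5, 10, 5)
)

# Inner ring: one slice per department, sized by the department total
inner <- tapply(d5$n, d5$Dept, sum)
print(inner)

# Outer ring: each department slice split into its M and W counts
outer <- xtabs(n ~ Dept + Gender, data = d5)
print(outer)

# Each inner slice equals the sum of its outer segments
print(all(as.numeric(inner) == as.numeric(rowSums(outer))))
```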

Bubble Plot

The bubble plot displays a bubble for each group. Figure 11 shows the bubble plot of the count of employees for each combination of department and gender.

lessR Bubble plot

Chart(Dept, by=Gender, type="bubble")
Chart(Dept, by=Gender, y=Salary, stat="mean", type="bubble")

Note: Pass two categorical variables to Plot() to automatically create the bubble plot.

Note: Applies only to counts of groups defined by combinations of levels of the two variables.

Figure 11: Bubble plot of counts across Dept and Gender.

Figure 12 shows the bubble plot of Salary distributed over Dept and Gender.

Figure 12: Bubble plot of Salary distributed over Dept and Gender.

Business Applications

The types of applications that apply to the one-categorical-variable bar chart also apply to the two-categorical-variable versions, except that in this case, the categorical variables are analyzed simultaneously.

Sales - Decision focus: Identify how different products perform in different regions.
Plot sales results for each group defined by the two categorical variables, product and region.
Marketing - Decision focus: Understand which services are favored by different genders and where improvements might be needed.
Plot customer satisfaction by service type, retail vs online, and gender.
Marketing - Decision focus: Understand market share by product category of our company and competitors.
Plot market share by product category and company, our own and competitors.
Marketing - Decision focus: Evaluate when certain online marketing channels are most effective to increase advertising effectiveness.
Plot online sales revenue by origin, such as direct, referral, and social media, and by time of day.
HR - Decision focus: Identify departments with high turnover rates and reveal any gender disparities in employee turnover.
Plot employee turnover by department and gender.
Advertising - Decision focus: Evaluate different ad campaign styles, serious or funny, across multiple online platforms, such as Instagram, Google, and Facebook, for return on investment.
Plot sales revenue by platform and style.