Visualize Categorical Data

The Family of Visualizations

An essential data visualization relates the size of a numerical value to each level of a categorical variable. These visualizations involve two variables: the categorical variable of interest, generically referred to as \(x\), and the associated numerical variable, generically referred to as \(y\). Refer to this family of visualizations as categorical data visualizations, of which the most well-known examples are bar charts and pie charts.

Categorical data visualization

A data visualization that associates a value of a numerical variable with a corresponding level of a categorical variable.

What benefit do we derive from this type of visualization?

Purpose of a categorical data visualization

Assess the extent of the numerical value associated with each level of a categorical variable and compare the numerical values across levels.

What information do we need to construct a categorical data visualization?

Data from which to construct the categorical data visualization

Construct the bar chart or related visualization from the values of the two variables expressed as a table in which one column contains the name of each level of the categorical variable and the second column contains the number associated with the corresponding level.

For example, consider the number of employees, n, who work in each department of a small company. The values of the categorical variable, Dept, refer to the departments of the company. There are five departments, so there are five unique values of Dept: ACCT (accounting), ADMN (administration), FINC (finance), MKTG (marketing), and SALE (sales). Each unique value of Dept defines a group of employees.

Table 1 of the paired data values of the variables Dept and n displays the counts of the employees in each department. We see, for example, that 15 people work in sales, by far the department with the most employees.

Table 1: Visualize categorical data from a table that pairs each category with its corresponding numerical value.

Dept	n
ACCT	5
ADMN	6
FINC	4
MKTG	6
SALE	15

How can we visually relate the different levels and numbers in this straightforward and compact table? The bar chart is the most frequently encountered categorical data visualization, illustrated in Figure 1 (a).

Bar chart

A bar chart plots a bar for each level of a categorical variable, the height of each bar proportional to a numeric value associated with the corresponding level.
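To make the mapping from number to bar length concrete, the following minimal sketch draws a text version of a bar chart from the values in Table 1. It uses plain Python only for illustration and is not the bar chart function of any particular analysis system; the names `table_1` and `text_bars` are hypothetical.

```python
# Table 1 as paired data: each department (level) with its employee count.
table_1 = {"ACCT": 5, "ADMN": 6, "FINC": 4, "MKTG": 6, "SALE": 15}


def text_bars(summary):
    """Return one text line per level, with bar length equal to the count."""
    lines = []
    for level, count in summary.items():
        bar = "#" * count  # one character per employee, so length encodes n
        lines.append(f"{level} | {bar} {count}")
    return lines


bars = text_bars(table_1)
print("\n".join(bars))
```

The only input is the table of paired levels and numbers. The bar for SALE is three times as long as the bar for ACCT because 15 is three times 5.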

Comparing the lengths of the bars with each other facilitates a better understanding of the differences among the groups on the numerical variable of interest. Similar logic applies to the related visualizations. Figure 1 displays six alternative categorical data visualizations.

Each corresponding visualization function processes the data in a simple table, as shown in Table 1, to obtain a visualization. If you are not sure which visualization to choose, generally choose the bar chart, or perhaps the dot plot, over visualizations that rely on area as the primary visual aesthetic, such as the pie chart shown in Figure 1 (b).

Pie chart

Relate each level of a categorical variable to the area of a slice of a circle (pie), with each slice scaled according to the associated numerical value.

The issue is the ease and accuracy by which people can evaluate the extent of the data represented by a visual aesthetic.

Visual aesthetic length generally preferred for comparing levels

People are more accurate and precise at detecting differences in length or position than at detecting differences in angle or area.

The bar chart and the dot plot depend on the visual aesthetic of length, which provides more accurate comparisons among levels than the visual aesthetic of area, upon which the other categorical data visualizations rely. The effectiveness of length for visual comparison does not imply that the other visualizations should not be used. Still, bar charts and dot plots should probably be considered as the first contenders for visualization.

Of course, the goal of a visualization is to summarize meaningful results. However, note that the levels can be for any categorical variable, and the associated number can be anything, meaningful or not. You are free to open a worksheet app such as Excel and make up a similar table of total randomness with nonsense categories and associated made-up numbers. You can submit that table to a bar chart function or any related function and create the resulting visualization. All you need is a similar table of paired levels and numbers with access to a corresponding visualization function.

The Input Summary Table

In practice, the levels of each categorical variable define groups, such as the employees who work in different departments of a company shown in Table 1. Another question of potential interest: What is the mean salary for employees of the company across different departments? A bar chart or related visualization displays these results. The count of the number of employees in each department, or the calculation of the mean salary for the employees in each department, is an example of data aggregation, a standard method for obtaining the numerical value associated with each category.

Data aggregation

Compute a statistic over groups defined by the levels of one or more categorical variables.

You may know an alternative name for a summary table of an aggregated statistic: the pivot table.

Pivot table

The name by which Excel refers to a summary table computed as a data aggregation.

An example of data aggregation is Table 1, the summary table that shows the number of employees in each department, indicated by the variable n. The statistical procedure aggregated the count of employees across departments from the original data, which has one record or observation for each employee. The same original data could instead be aggregated to obtain the mean salary in each department.
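The two aggregations just described, a count per department and a mean salary per department, can be sketched as follows. This is a minimal illustration in plain Python, not the procedure of any specific analysis system. The employee records are hypothetical and are not the data behind Table 1.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical raw data: one (department, salary) record per employee.
employees = [
    ("ACCT", 50000), ("ACCT", 60000),
    ("FINC", 70000),
    ("SALE", 40000), ("SALE", 50000), ("SALE", 60000),
]

# Aggregation 1: count the records at each level of the categorical variable.
counts = Counter(dept for dept, _ in employees)

# Aggregation 2: compute a statistic (the mean) of a numerical variable
# separately for each level.
salaries_by_dept = defaultdict(list)
for dept, salary in employees:
    salaries_by_dept[dept].append(salary)
mean_salary = {dept: mean(values) for dept, values in salaries_by_dept.items()}

print(dict(counts))
print(mean_salary)
```

Both results are summary tables that pair each department with one number, so either can serve as the input to a bar chart.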

Summary tables computed from different aggregations and potential subsequent processing of that information lead to different bar charts and related visualizations.

Aggregate a numerical variable that corresponds to each level of a categorical variable.
- The count of the number of occurrences of each level.
- Any statistic computed for another numerical variable for each level.
Optionally, further processing of the results after the aggregation is also possible.
- Show the deviation of each level from the center, such as the mean.
- Sort the levels according to their corresponding numerical values.
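As a minimal sketch of the two optional processing steps, the following plain Python code starts from the counts in Table 1, computes the deviation of each count from the mean count, and sorts the levels by their counts. The variable names are hypothetical and chosen only for illustration.

```python
from statistics import mean

# Table 1: number of employees in each department.
table_1 = {"ACCT": 5, "ADMN": 6, "FINC": 4, "MKTG": 6, "SALE": 15}

# Center of the five counts: 36 employees over 5 departments.
center = mean(list(table_1.values()))

# Deviation of each level from the center, rounded for display.
deviations = {dept: round(n - center, 1) for dept, n in table_1.items()}

# Levels sorted from the largest count to the smallest; ties keep table order.
sorted_table = dict(
    sorted(table_1.items(), key=lambda pair: pair[1], reverse=True)
)

print(deviations)
print(sorted_table)
```

Each result is again a table of paired levels and numbers, so each leads to its own bar chart: one with bars above and below the center, the other with bars in descending order.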

How many categorical variables define the groups over which to aggregate?

Granularity

Level of detail in the data depending on the extent of aggregation.

The more categorical variables that define the groups, the more detailed the resulting summary table. At one extreme, no categorical variable defines groups, so the statistic of interest is calculated once over all the individual data values, the least detailed, least granular result. For example, compute the mean salary for all employees in a company. One step more detailed is aggregation over groups defined by a single categorical variable, such as the average salary in each department, visualized according to the examples in Figure 1.

At the other extreme, define the groups by the combinations of levels over several categorical variables, with the limitation that each additional categorical variable defines additional groups, which lowers the amount of available data per group. Define one such group as women working in the Finance Department, employed more than 10 years, and within 10 years of retirement age. What is their mean salary? The complete aggregation would consider all possible combinations of groups based on the different values of these categorical variables, resulting in a high level of granularity.
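The trade-off between the number of grouping variables and the amount of data per group can be sketched with a small example. The records, the field names, and the helper `mean_salary_by` are all hypothetical, written in plain Python only to illustrate the idea.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw data: (department, gender, salary) for eight employees.
records = [
    ("FINC", "W", 72000), ("FINC", "M", 68000), ("FINC", "W", 80000),
    ("SALE", "M", 51000), ("SALE", "W", 55000), ("SALE", "M", 49000),
    ("SALE", "W", 60000), ("SALE", "M", 53000),
]
FIELDS = {"dept": 0, "gender": 1}  # position of each categorical variable


def mean_salary_by(data, group_vars):
    """Map each combination of levels to (group size, mean salary)."""
    groups = defaultdict(list)
    for rec in data:
        key = tuple(rec[FIELDS[var]] for var in group_vars)
        groups[key].append(rec[2])
    return {key: (len(vals), mean(vals)) for key, vals in groups.items()}


overall = mean_salary_by(records, [])                   # no grouping variable
by_dept = mean_salary_by(records, ["dept"])             # one variable
by_dept_gender = mean_salary_by(records, ["dept", "gender"])  # two variables

for table in (overall, by_dept, by_dept_gender):
    print(len(table), "group(s):", table)
```

With no grouping variable there is one group of eight employees. Two grouping variables yield four groups, one of which contains a single employee, so its mean rests on very little data.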

Construct the Visualization

The chosen categorical variable provides the list of categories, or levels, from which to construct the chart. Then, pair each category with a numerical variable presumably of interest. The provided example is the number of people who work in each department.

What is the source of the values of the numerical variable associated with each category? There are three possibilities: Provide the table that pairs categories with numbers directly or use one of two different ways to have the bar chart or related function implicitly calculate the summary table from the original data.

Raw data

The original data values that were recorded for each observation, an instance of the unit of analysis, such as a person, before summarizing or transforming the data.

The raw or original data file of measurements is data at its most detailed, granular level. An example of raw data is the employee data table with variables such as Salary and Gender. The data values for each employee are the original data values from which the analysis begins. To begin the analysis, read these data values into an R data frame, usually named \(d\).

Provide the summary table directly to the bar chart function. The numbers could be anything, including a prior data aggregation result.
- Enter this table from some web page or printed quarterly report into a worksheet app such as Excel.
- Read the summary table from the worksheet app into the data analysis system.
- Enter the summary table into a bar chart function as its data to analyze.
Example: From a table of the amount of wine grapes produced in tons for different varietals (types of grapes) during a given year in Oregon, visualize the production number for each varietal with a bar chart.
Obtain the counts directly from the values of the categorical variable. The numerical variable is the count of how many times each category or level of the categorical variable appeared in the data, or could have appeared. The bar chart or related function implicitly does the aggregation by calculating the summary table from the original data. A separate numerical variable is not specified in the function call that creates the visualization.
Example: From the class grade book, how many students in the class received an A for their course grade? An A-? … and so on. From the original data, the class grade book, the bar chart function first computes the summary table, which associates the count for each level with the corresponding level, the letter grade in this example, and then draws the bar chart.

Frequency distribution

A summary (pivot) table of the counts of the number of times each categorical value (level) occurred in the data for a variable.

Define the data aggregation by computing a statistic separately for each level of the categorical variable. The bar chart or related function implicitly aggregates by constructing the summary (pivot) table from the original data, with the statistic computed for a specified continuous variable within each level.
Example: The bar chart function computes the average salary for each department in a company, and then plots the bar chart from the resulting summary table.
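For the case in which the counts come directly from the categorical variable, the frequency distribution that a bar chart function computes behind the scenes can be sketched as follows. The grades and the list of possible levels are hypothetical, and the code is plain Python used only for illustration.

```python
from collections import Counter

# Hypothetical grade book: one course grade per student.
grades = ["A", "B", "A-", "A", "B+", "B", "A", "B"]

# Levels that could have appeared, listed in the order to display them.
levels = ["A", "A-", "B+", "B", "B-"]

# Count the occurrences, then report every level, including any with zero.
observed = Counter(grades)
frequency = {level: observed[level] for level in levels}

print(frequency)
```

No separate numerical variable is supplied. The count for each level is derived from the categorical values alone, and a level that could have appeared but did not, here B-, receives a count of zero.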

Functions and Parameters

Accomplish each data analysis operation by calling a function. This discussion applies to all analysis and visualization functions in all data analysis systems. It is introduced here because bar charts and related visualizations are the first that we encounter.

Function

Procedure to accomplish a specific, repeatable task that provides the same output given the same input, here a specific statistical set of computations and instructions for plotting a visualization from the data.

Various settings control the output of each analysis, whether text or visualization. For example, every data visualization system, such as Excel, R, or Tableau, has at least one bar chart function. Depending on the system, access the bar chart function with written instructions or by clicking somewhere on the display. Further, characteristics such as the interior color of the bars or the maximum line width can be customized. Customizing the visualization requires access to the function’s parameters.

Parameter

A user-controlled value of a function’s code, a placeholder, that specifies some characteristic of the way the data is processed or the output of the function is displayed.

Each function includes parameters to customize input or output. For any bar chart function from any analysis system, one parameter sets the color of the bars, and another sets the color of the bar edges.

Data analysis functions can have from a few to tens of parameters. To manually set all the parameter values to process data with a function would be much too tedious.

Default parameter value

A preset value of a parameter that can be explicitly changed when invoking the function.

Regardless of whether functions such as a bar chart function are invoked with a written instruction at a command line or by clicking on a button in a GUI, there needs to be a way to customize the values of the relevant parameters. Override a default parameter value by explicitly setting its value in the function call. For example, the bar chart function in any visualization system provides a way to change the default interior bar colors to whatever the user specifies.
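The relationship among a function, its parameters, and their default values can be sketched with a small text-based bar chart function. This function and its parameters `symbol`, `sort`, and `show_values` are hypothetical, written in plain Python for illustration; they do not correspond to the bar chart function of any actual system. The `symbol` parameter plays the role that a bar color parameter would play in a graphical chart.

```python
# Table 1: number of employees in each department.
table_1 = {"ACCT": 5, "ADMN": 6, "FINC": 4, "MKTG": 6, "SALE": 15}


def bar_chart(summary, symbol="#", sort=False, show_values=True):
    """Return text bars; every parameter after `summary` has a default."""
    items = list(summary.items())
    if sort:
        # Largest value first; ties keep their original order.
        items.sort(key=lambda pair: pair[1], reverse=True)
    lines = []
    for level, value in items:
        label = f" {value}" if show_values else ""
        lines.append(f"{level} | {symbol * value}{label}")
    return lines


# Rely on the defaults: only the data is supplied.
default_chart = bar_chart(table_1)

# Override two defaults by naming the parameters in the function call.
custom_chart = bar_chart(table_1, symbol="*", sort=True)

print("\n".join(default_chart))
print("\n".join(custom_chart))
```

The first call needs only the data because every other parameter has a preset value. The second call changes just the two parameters of interest and leaves `show_values` at its default.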

Business Applications

We have an example of the number of employees in each department, and the average salary in each department is mentioned as another possibility for a categorical data visualization. However, there are almost limitless examples of comparing numerical values across categories in business analysis. These examples involve both the variety of categorical variables and their corresponding numerical variables of interest. Apply categorical data visualization, such as bar charts, to better perceive the extent and differences of each category regarding the numerical variable of interest. Some examples follow.

Sales - Decision focus: Identify over- and under-performing areas for further investigation to understand the reasons for success and the opportunities for improvement.
Plot sales performance across different:
- products
- regions
Marketing - Decision focus: Consider which products and geographical regions may need additional advertising or price adjustment.
Plot market share of a product:
- with competitors
- across regions
Marketing - Decision focus: Understand where customers are most and least satisfied to generate guidance on product or store design.
Plot customer satisfaction for
- different products
- in-store vs online shopping experience
Supply Chain - Decision focus: Evaluate each supplier on salient attributes.
Plot supplier performance across different suppliers for:
- quality of goods
- delivery time
Supply Chain - Decision focus: To employ just-in-time inventory management more effectively, identify over- and under-stocked units.
Plot amounts of inventory across:
- products
- retail outlets
Finance - Decision focus: Understand what products are over-performing and which are under-performing or perhaps even generating negative returns to consider revising the product mix or product marketing.
Plot revenue, expenses, and profit margins across:
- products
- regions
Human Resources - Decision focus: Decide who receives promotions and bonuses and who needs additional training.
Plot evaluations of employee performance across:
- departments
- genders
Operations Management - Decision focus: Identify bottlenecks and opportunities for process optimization.
Plot process efficiency across:
- products
- regions