Visualize Aggregated Data
A categorical data visualization displays a visual aesthetic that reflects the magnitude of a numerical variable associated with each level of a categorical variable. For a single categorical variable, these visualizations involve two variables: the categorical variable of interest, generically denoted \(x\), and the associated numerical variable, generically denoted \(y\). The most well-known examples of categorical data visualizations are bar charts and pie charts.
Create categorical data visualizations, such as the standard bar chart or pie chart, to answer questions such as:
- How many employees (y) work in each department (x)?
- How does the average salary (y) vary across departments (x)?
Categorical data visualization: A data visualization that associates a value of a numerical variable with each corresponding level of the categorical variable.
What benefit do we derive from these types of visualizations?
Assess the extent of the numerical value associated with each level of a categorical variable and compare the numerical values across levels.
A data visualization transforms a data table into an image. What information do we need to construct a categorical data visualization? What are the characteristics of the input data table that we will transform?
Construct the bar chart or related visualization from the values of the two variables expressed as a table in which one column contains the name of each level of the categorical variable and the second column contains the number associated with the corresponding level.
The numerical values used to construct the visualization can be any numbers, but in data analysis, they are usually computed by aggregating values across categories.
Aggregation: From the original data table, compute a single numerical summary, such as a count or a mean, separately for each category or combination of categories.
The result is a summary table of aggregated data values. Examples include:
- counting the number of employees in each department
- computing the mean salary of employees in each department
In Excel, aggregation is called pivoting, and the resulting summary table is called a pivot table. However, aggregation is the more general and widely used term in data analysis.
It is that summary table of aggregated data from which we create the categorical data visualization.
For example, consider the number of employees, n, who work in each department of a small company. The values of the categorical variable Dept are the labels for the company’s departments. There are five departments, each with its own label: ACCT (accounting), ADMN (administration), FINC (finance), MKTG (marketing), and SALE (sales). Each unique value of Dept defines a group of employees.
The values of a categorical variable are also referred to as levels or labels.
Table 1 pairs the values of the variables Dept and n to display the count of employees in each department. Sales, with 15 people, is by far the department with the most employees.
| Dept | n |
|---|---|
| ACCT | 5 |
| ADMN | 6 |
| FINC | 4 |
| MKTG | 6 |
| SALE | 15 |
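As a minimal sketch of how such a table of counts arises from row-level data, the base R code below rebuilds one row per employee from the counts in Table 1 and then aggregates them back with `table()`. The data frame name `d` and the reconstruction with `rep()` are assumptions for illustration; the actual employee data set is not shown here.

```r
# Rebuild one row per employee from the counts in Table 1.
# Only the Dept column is reconstructed; no other employee data is assumed.
dept_levels <- c("ACCT", "ADMN", "FINC", "MKTG", "SALE")
d <- data.frame(Dept = rep(dept_levels, times = c(5, 6, 4, 6, 15)))

# Aggregate: count the rows that fall in each level of Dept.
counts <- as.data.frame(table(Dept = d$Dept), responseName = "n")
print(counts)
```

The result has one row per level of Dept and one column of counts, the same two-column shape that a bar chart function needs as input.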
The visualizations need not be restricted to counting the occurrences of each value of a categorical variable in the data. Another form of aggregation is to compute a statistic, such as the mean of a continuous variable, for each level, as in the following table of mean salary by department.
| Dept | Salary_mean |
|---|---|
| ACCT | 71792.78 |
| ADMN | 91277.12 |
| FINC | 79010.68 |
| MKTG | 80257.13 |
| SALE | 88830.07 |
Aggregation with lessR:
pivot(d, mean, Salary, by=Dept)
with Salary and Dept in the \(d\) data frame.
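For comparison, the same kind of aggregation can be sketched in base R with `aggregate()`. The salaries below are hypothetical values chosen only to make the arithmetic easy to check; they are not the data behind the table of mean salaries above.

```r
# Hypothetical salaries for illustration only.
d <- data.frame(
  Dept   = c("ACCT", "ACCT", "FINC", "FINC", "FINC"),
  Salary = c(60000, 70000, 80000, 90000, 100000)
)

# One mean per level of Dept, returned as a two-column summary table.
salary_mean <- aggregate(Salary ~ Dept, data = d, FUN = mean)
print(salary_mean)
```

Replacing `mean` with another summary function, such as `median` or `sum`, yields a different aggregation of the same grouping.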
Regardless of how or where the summary table is computed, any categorical data visualization is based on this type of aggregated data, such as the tabulated counts of employees in each department. Either this summary (pivot) table is already computed and presented to the data visualization function, or the function implicitly computes the table from the original data.
The Family of Visualizations
How can we visually relate the different levels and numbers in this straightforward and compact table? One commonly used visual aesthetic is length. The bar chart is the most frequently encountered categorical data visualization, illustrated in Figure 1 (a).
Plot a bar for each level of a categorical variable, with the height of each bar proportional to the numeric value associated with that level.
Closely related to the bar chart is the dot plot, sometimes called a lollipop plot. Replace the bars with line segments, each connected to a dot that marks the corresponding numerical value.
Plot a line segment with a dot at the end for each level of a categorical variable, with the length of each segment proportional to the numeric value associated with that level.
Figure 1 presents two examples of charts based on the visual aesthetic of length: the bar chart and dot plot.
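A rough base R sketch of both length-based charts follows, using the counts from Table 1. This is not the code that produced Figure 1; the variable names `dept` and `n` and the choice of base graphics functions are assumptions made for illustration.

```r
dept <- c("ACCT", "ADMN", "FINC", "MKTG", "SALE")
n    <- c(5, 6, 4, 6, 15)

# When run as a script, draw to a null device instead of writing a file.
if (!interactive()) pdf(NULL)

# Bar chart: the height of each bar is proportional to n.
bar_mid <- barplot(n, names.arg = dept, xlab = "Dept", ylab = "n")

# Dot (lollipop) plot: a vertical segment from 0 up to n with a dot on top.
x <- seq_along(n)
plot(x, n, type = "h", xaxt = "n", ylim = c(0, max(n)),
     xlab = "Dept", ylab = "n")
points(x, n, pch = 19)
axis(1, at = x, labels = dept)
```

Both charts start the length scale at zero, which keeps the drawn lengths proportional to the counts.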
Another visual aesthetic for categorical data visualizations is area. The most common area-based chart for comparing the magnitudes of the numbers associated with the categories is the ring chart version of the pie chart.
Relate each level of a categorical variable to the area of a slice of a circle (pie), scaled according to the associated numerical value.
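To make the scaling concrete, the sketch below computes each department's share of the total from the Table 1 counts and passes the counts to base R's `pie()`. Note the assumption: base `pie()` draws a full pie rather than the ring chart version, so this only illustrates the area scaling, not the exact chart shown in the figure.

```r
dept <- c("ACCT", "ADMN", "FINC", "MKTG", "SALE")
n    <- c(5, 6, 4, 6, 15)

# Each slice's share of the circle equals its share of the total count.
share <- n / sum(n)
names(share) <- dept
print(round(share, 3))

# When run as a script, draw to a null device instead of writing a file.
if (!interactive()) pdf(NULL)
pie(n, labels = dept)
```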
Other types of area-based categorical data visualizations include the radar chart, the bubble chart, the icicle chart, and the treemap. Figure 2 illustrates five different charts from the family of categorical data visualizations based on the visual aesthetic of area.
Set the chart size by resizing the window that displays it. In the treemap, the inner rectangles reorient with the display window size. For example, as shown in Figure 2 (e) and (f), reducing the display window height can flatten the treemap into a single row of rectangles.
The visual aesthetic of length on which the bar chart and the dot plot depend provides more exact comparisons among levels than the visual aesthetic of area. Perceiving area results in a more qualitative comparison, which may be perfectly suitable or even preferred for a given application. However, rely upon length for a more exact comparison.
Of course, the goal of a visualization is to summarize meaningful results. However, note that the levels can be for any categorical variable, and the associated number can be anything, meaningful or not. You are free to open a worksheet app, such as Excel, and create a similar table of total randomness with nonsense categories and associated made-up numbers. You can submit that table to a bar chart function or any related function and create the resulting visualization. All you need is a similar table of paired levels and numbers with access to a corresponding visualization function.
Functions and Parameters
Accomplish each data analysis procedure by calling a function. This discussion applies to all analysis and visualization functions in all data analysis systems, GUI or command-based. It is introduced here because bar charts and related visualizations are the first that we encounter.
Function: A procedure to accomplish a specific, repeatable task that provides the same output given the same input, here a specific set of statistical computations and instructions, such as for plotting a visualization from the data.
Various settings control the output of each analysis, whether text or visualization. For example, every data visualization system, such as Excel, R, or Tableau, has at least one bar chart function. Access the bar chart function, depending on the system, either by entering a written instruction or by clicking on a spot on the display. Further, characteristics such as the interior color of the bars or maximum line width can be customized. Customizing the visualization requires access to the function’s parameters.
Parameter: A user-controlled value of a function’s code, a placeholder, that specifies some characteristic of the way the data is processed or the output of the function is displayed.
Each function includes parameters to customize input or output. For any bar chart function from any analysis system, one parameter sets the color of the bars, and another sets the color of the bar edges.
Data analysis functions can have from a few to tens of parameters. Manually setting all parameter values to process data with a function would be much too tedious.
Default value: A preset value of a parameter that can be explicitly changed when invoking the function.
Regardless of whether functions such as a bar chart function are invoked by a typed command or by clicking a GUI button, there needs to be a way to customize the values of the relevant parameters. Override a default parameter value by explicitly setting its value in the function call. For example, the bar chart function in any visualization system allows the user to change the default bar colors to whatever the user specifies.
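The idea of parameters with default values can be sketched with a small wrapper around base R's `barplot()`. The function `dept_bars` and its parameter names `fill` and `edge` are hypothetical, introduced only for this illustration; `col` and `border` are the corresponding `barplot()` parameters.

```r
# Hypothetical wrapper: fill and edge have defaults the caller may override.
dept_bars <- function(n, labels, fill = "gray70", edge = "black") {
  barplot(n, names.arg = labels, col = fill, border = edge)
  # Return the settings actually used, without printing them.
  invisible(list(fill = fill, edge = edge))
}

dept <- c("ACCT", "ADMN", "FINC", "MKTG", "SALE")
n    <- c(5, 6, 4, 6, 15)

# When run as a script, draw to a null device instead of writing a file.
if (!interactive()) pdf(NULL)

used_default  <- dept_bars(n, dept)                      # all defaults
used_override <- dept_bars(n, dept, fill = "steelblue")  # override one
```

The second call changes only the bar color; the edge color keeps its default because it is not mentioned in the call.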
Business Applications
The examples so far are the number of employees in each department and the average salary in each department. Business analysis, however, offers almost limitless opportunities to compare numerical values across categories, varying in both the categorical variable and the corresponding numerical variable of interest. Apply categorical data visualizations, such as bar charts, to show the magnitude of the numerical variable within each category and the differences across categories. Some examples follow.
- Sales - Decision focus: Identify over- and under-performing areas for further investigation to understand reasons for success and reasons for improvement.
  Plot sales performance across different:
  - products
  - regions
- Marketing - Decision focus: Consider which products and geographical regions may need additional advertising or price adjustment.
  Plot market share of a product:
  - with competitors
  - across regions
- Marketing - Decision focus: Understand where customers are most and least satisfied to generate guidance on product or store design.
  Plot customer satisfaction for:
  - different products
  - in-store vs online shopping experience
- Supply Chain - Decision focus: Evaluate each supplier on salient attributes.
  Plot supplier performance across different suppliers for:
  - quality of goods
  - delivery time
- Supply Chain - Decision focus: To employ just-in-time inventory management more effectively, identify over- and under-stocked units.
  Plot amounts of inventory across:
  - products
  - retail outlets
- Finance - Decision focus: Understand what products are over-performing and which are under-performing or perhaps even generating negative returns to consider revising the product mix or product marketing.
  Plot revenue, expenses, and profit margins across:
  - products
  - regions
- Human Relations - Decision focus: Decide who receives promotions and bonuses and who needs additional training.
  Plot evaluations of employee performance across:
  - departments
  - genders
- Operations Management - Decision focus: Identify bottlenecks and opportunities for process optimization.
  Plot process efficiency across:
  - products
  - regions