Visualize Aggregated Data
A categorical data visualization displays a visual aesthetic that reflects the magnitude of a numerical variable associated with each level of a categorical variable. For a single categorical variable, these visualizations involve two variables: the categorical variable of interest, generically denoted \(x\), and the associated numerical variable, generically denoted \(y\). The most well-known examples of categorical data visualizations are bar charts and pie charts.
Create categorical data visualizations, such as the standard bar chart or pie chart, to answer questions such as:
- How many employees (y) work in each department (x)?
- How does the average salary (y) vary across departments (x)?
A data visualization that associates a value of a numerical variable with each corresponding level of the categorical variable.
What benefit do we derive from these types of visualizations?
Assess the extent of the numerical value associated with each level of a categorical variable and compare the numerical values across levels.
A data visualization transforms a data table into an image. What information do we need to construct a categorical data visualization? What are the characteristics of the input data table that we will transform?
Construct the bar chart or related visualization from the values of the two variables expressed as a table in which one column contains the name of each level of the categorical variable and the second column contains the number associated with the corresponding level.
The numerical values used to construct the visualization can be any numbers, but in data analysis, they are usually computed by aggregating values across categories.
From the original data table, compute a single numerical summary, such as a count or a mean, separately for each category or combination of categories.
The result is a summary table of aggregated data values. Examples include:
- counting the number of employees in each department
- computing the mean salary of employees in each department
In Excel, aggregation is called pivoting, and the resulting summary table is called a pivot table. However, aggregation is the more general and widely used term in data analysis.
It is that summary table of aggregated data from which we create the categorical data visualization.
For example, consider the number of employees, n, who work in each department of a small company. The values of the categorical variable Dept are the labels for the company’s departments. There are five departments, each with its own label: ACCT (accounting), ADMN (administration), FINC (finance), MKTG (marketing), and SALE (sales). Each unique value of Dept defines a group of employees.
The values of a categorical variable are also referred to as levels or labels.
Table 1 pairs the values of the variables Dept and n, displaying the count of employees in each department. Sales, with 15 people, is by far the department with the most employees.
| Dept | n |
|---|---|
| ACCT | 5 |
| ADMN | 6 |
| FINC | 4 |
| MKTG | 6 |
| SALE | 15 |
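As a rough illustration of where such counts come from, the following base R sketch tabulates a Dept column. The column is reconstructed from the counts in Table 1 rather than read from the original employee data, which is not shown here, and the data frame name `d` is an assumption chosen to match the name used later in this section.

```r
# Rebuild a Dept column consistent with the counts in Table 1
# (an assumption for illustration; the original employee data is not shown)
d <- data.frame(
  Dept = rep(c("ACCT", "ADMN", "FINC", "MKTG", "SALE"),
             times = c(5, 6, 4, 6, 15))
)

# Aggregate: count the rows that fall in each level of Dept
counts <- as.data.frame(table(Dept = d$Dept), responseName = "n")
print(counts)
```

Each row of `counts` pairs one level of Dept with its count, the same two-column layout as Table 1.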
The visualizations need not be restricted to counting the occurrences of each value of a categorical variable in the data. Another form of aggregation is to compute a statistic, such as the mean of a continuous variable, for each level. Table 2 shows the mean salary in each department.
| Dept | Salary_mean |
|---|---|
| ACCT | 71792.78 |
| ADMN | 91277.12 |
| FINC | 79010.68 |
| MKTG | 80257.13 |
| SALE | 88830.07 |
Aggregation with lessR:
pivot(d, mean, Salary, by=Dept)
with Salary and Dept in the \(d\) data frame.
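For comparison, the same kind of aggregation can be sketched in base R with aggregate(). The data frame below is a made-up toy example, not the employee data behind Table 2, so the resulting means differ from those in the table.

```r
# Hypothetical toy data: two departments, five employees
d <- data.frame(
  Dept   = c("ACCT", "ACCT", "FINC", "FINC", "FINC"),
  Salary = c(50000, 60000, 70000, 80000, 90000)
)

# Compute the mean of Salary separately for each level of Dept
agg <- aggregate(Salary ~ Dept, data = d, FUN = mean)

# Rename the summary column to follow the naming style of Table 2
names(agg)[names(agg) == "Salary"] <- "Salary_mean"
print(agg)
```

The result has one row per department and one column for the computed statistic, which is the structure a categorical data visualization needs.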
Regardless of how or where the summary table is computed, any categorical data visualization is based on this type of aggregated data, here the tabulated counts of employees in each department. Either this summary (pivot) table is already computed and presented to the data visualization function, or the function implicitly computes the table from the original data.
The Family of Visualizations
How can we visually relate the different levels and numbers in this straightforward and compact table? One commonly used visual aesthetic is length. The bar chart is the most frequently encountered categorical data visualization, illustrated in Figure 1 (a).
Plot a bar for each level of a categorical variable, with the height of each bar proportional to the numeric value associated with that level.
Closely related to the bar chart is the dot plot, sometimes called a lollipop plot. Replace the bars with line segments, each connected to a dot that marks the corresponding numerical value.
Plot a line segment with a dot at the end for each level of a categorical variable, with the length of each segment proportional to the numeric value associated with that level.
Figure 1 presents two examples of charts based on the visual aesthetic of length: the bar chart and dot plot.
Another visual aesthetic for categorical data visualizations is area. The most common comparison based on area of the differing magnitudes of the numbers associated with the categories is the ring chart version of a pie chart.
Relate each level of a categorical variable to the area of a circle (pie) scaled according to the associated numerical value.
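To make the scaling concrete, the sketch below converts the counts from Table 1 into the share of the circle that each slice would occupy. It is plain arithmetic in base R, not a description of how any particular charting function computes its slices.

```r
# Counts from Table 1
n <- c(ACCT = 5, ADMN = 6, FINC = 4, MKTG = 6, SALE = 15)

# Each slice's share of the circle is its proportion of the total
prop <- n / sum(n)

# The same shares expressed as angles in degrees
angle <- 360 * prop

print(round(prop, 3))
print(angle)
```

With 36 employees in total, each employee corresponds to 10 degrees of the circle, so the 15 employees in SALE occupy 150 degrees.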
Other types of area-based categorical data visualizations include the radar chart, the bubble chart, the icicle chart, and the treemap. Figure 2 illustrates five different charts from the family of categorical data visualizations based on the visual aesthetic of area.
Set the chart size by resizing the window that displays it. In the treemap, the inner rectangles reorient with the display window size. For example, as shown in Figure 2 (e) and (f), reducing the display window height can flatten the treemap into a single row of rectangles.
The visual aesthetic of length on which the bar chart and the dot plot depend provides more exact comparisons among levels than the visual aesthetic of area. Perceiving area results in a more qualitative comparison, which may be perfectly suitable or even preferred for a given application. However, rely upon length for a more exact comparison.
Chart of Counts
For a categorical variable Dept, create the bar chart of counts in Figure 1 (a) with the function call:
⇒ Bar chart of the number of employees in each department: Chart(Dept)
Note that the default
Chart() can create seven different types of charts. Specify the type of chart with the type parameter.
- Bar chart with type="bar", the default value
- Dot chart with type="dot"
- Pie chart with type="pie"
- Radar chart with type="radar"
- Bubble chart with type="bubble"
- Icicle chart with type="icicle"
- Treemap chart with type="treemap"
For example, create the pie chart in Figure 2 with:
⇒ Pie chart of the number of employees in each department: Chart(Dept, type="pie")
Chart of Aggregating a Numerical Variable
To aggregate with a statistic computed from another numerical variable, add the y= parameter to specify the numerical variable and the stat= parameter to specify the statistic by which to aggregate. For example, do the following to plot a radar chart of the mean salary for each Dept.
⇒ Radar chart of the mean salary of employees in each department:
Chart(Dept, y=Salary, stat="mean", type="radar")
Available statistics for aggregation follow.
- Mean with stat="mean"
- Mean deviation with stat="deviation"
- Median with stat="median"
- Minimum with stat="min"
- Maximum with stat="max"
- Standard deviation with stat="sd"
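To show what several of these statistics produce when computed by group, here is a base R sketch on made-up toy data. It is not how Chart() is implemented, and it leaves out the mean deviation because its exact definition is not given in this section.

```r
# Hypothetical toy data: two departments, three employees each
d <- data.frame(
  Dept   = c("ACCT", "ACCT", "ACCT", "FINC", "FINC", "FINC"),
  Salary = c(50000, 60000, 70000, 70000, 80000, 120000)
)

# Names of base R functions that correspond to some of the stat values
stats <- c("mean", "median", "min", "max", "sd")

# For each statistic, aggregate Salary separately within each level of Dept
summary_tbl <- sapply(stats, function(s) {
  tapply(d$Salary, d$Dept, match.fun(s))
})
print(summary_tbl)
```

Each column of `summary_tbl` is one candidate set of \(y\) values, with one value per department.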
Chart from Data Already Aggregated
Another possibility creates the chart from data already aggregated, either counts or a statistic such as the mean of a continuous variable. That is, the aggregation is already complete, so specify the summary (pivot) table as the input data frame. Maybe you have a table from a web site or a company financial report that you wish to visualize.
Chart() then analyzes the summary table directly rather than the original data. In that situation, specify the numeric variable with y= in addition to the categorical variable, but omit stat= because there is no further transformation to perform. Chart() simply plots the two given columns of data, generically \(x\) and \(y\), which are also the formal parameter names.
For example, create the treemap of mean salary across departments directly from the summary data in Table 2.
⇒ Treemap of the mean salary of employees in each department in which the summary table is in data frame \(a\):
Chart(Dept, y=Salary_mean, data=a, type="treemap")
For a treemap generated from the original data, the required aggregation is performed implicitly. The size of each rectangle then reflects the number of observations in the corresponding category. This is intentional, because it allows the displayed mean or other summary statistic to be interpreted relative to the sample size used to compute it.
In contrast, when a summary table is provided directly to the visualization function, the size of each treemap rectangle reflects the magnitude of the supplied numerical values. With only the summarized \(x\) and \(y\) columns available, the original sample sizes cannot be recovered.
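When the summary values come from a report or web page, the table can be entered by hand. The sketch below builds a data frame named `a` from the values in Table 2 using base R. Typing the values in directly is one assumed way to obtain the table, since reading it from a file would work equally well.

```r
# Enter the already aggregated values from Table 2 by hand
a <- data.frame(
  Dept        = c("ACCT", "ADMN", "FINC", "MKTG", "SALE"),
  Salary_mean = c(71792.78, 91277.12, 79010.68, 80257.13, 88830.07)
)
print(a)

# With lessR loaded, the call given in the text then plots these values directly:
# Chart(Dept, y=Salary_mean, data=a, type="treemap")
```

Because `a` holds only the two summary columns, the sizes of the treemap rectangles can reflect only the mean salaries, not the number of employees in each department.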
Charts are Interactive
These charts are interactive (based on Plotly graphics), appearing in the RStudio Viewer window. Hover the mouse over a part of the chart and get more detailed information about that particular section. (The bar chart also renders as a static chart in the RStudio Plots window.)
These interactive charts, from the plotly system, are actually mini-webpages, called htmlwidgets. Moreover, you can save each such creation as a full, interactive webpage from the Export tab in the RStudio Viewer window where the interactive charts are displayed.
Unfortunately, a bug in the RStudio-Plotly interface prevents two options on the Export tab from working with an interactive plot: saving the chart directly as a .png image and copying the image to the clipboard. Fortunately, there are two alternatives for saving a non-interactive rendering of the chart.
- Take a screenshot of the Viewer window
- Use the direct save to a .png image that Plotly provides, accessed by moving the mouse to the top right corner of the chart window
Chart() Manual
As with any function in the R ecosystem, to view the official manual, enter a ? followed by the function name in response to the command prompt >.
?Chart
Every function distributed in an R package is expected to include a manual page that explains all available options and specifies the values of the available parameters.
Of course, the goal of a visualization is to summarize meaningful results. However, note that the levels can be for any categorical variable, and the associated number can be anything, meaningful or not. You are free to open a worksheet app, such as Excel, and create a similar table of total randomness with nonsense categories and associated made-up numbers. You can submit that table to a bar chart function or any related function and create the resulting visualization. All you need is a similar table of paired levels and numbers with access to a corresponding visualization function.
Functions and Parameters
Accomplish each data analysis procedure by calling a function. This discussion applies to all analysis and visualization functions in all data analysis systems, GUI or command-based. It is introduced here because bar charts and related visualizations are the first that we encounter.
Procedure to accomplish a specific, repeatable task that provides the same output given the same input, here a specific statistical set of computations and instructions, such as for plotting a visualization from the data.
Various conditions control the output of each analysis, text, or visualization. For example, every data visualization system, such as Excel, R, or Tableau, has at least one bar chart function. Access the bar chart function, depending on the system, either by following written instructions or by clicking on a spot on the display. Further, characteristics such as the interior color of the bars or maximum line width can be customized. Customizing the visualization requires access to the function’s parameters.
A user-controlled value of a function’s code, a placeholder, that specifies some characteristic of the way the data is processed or the output of the function is displayed.
Each function includes parameters to customize input or output. For any bar chart function from any analysis system, one parameter sets the color of the bars, and another sets the color of the bar edges.
Data analysis functions can have from a few to tens of parameters. Manually setting all parameter values to process data with a function would be much too tedious.
A preset value of a parameter that can be explicitly changed when invoking the function.
The parameter fill, which applies to all the lessR visualization functions, sets the color that fills the bars. By default, Chart() displays each bar in a different color, but the bars can also be set to the same color, or any combination of user-specified colors. To change the color of all the bars to a blue shade, provide an argument to the fill parameter, such as "steelblue", shown in Figure 3. Set the parameter quiet to TRUE to suppress the statistical output.
Chart(Dept, fill="steelblue", quiet=TRUE)
View all R color names and colors with the lessR function showColors().
Regardless of whether a function such as a bar chart function is invoked by a typed command or by clicking a GUI button, there needs to be a way to customize the values of the relevant parameters. Override a default parameter value by explicitly setting its value in the function call. For example, the bar chart function in any visualization system allows the user to change the default bar colors to whatever the user specifies.
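The relationship between a parameter, its default value, and an explicit override can be sketched with a small user-defined R function. The function `bar_label()` and its default of "gray" are hypothetical, invented only for this illustration, and are not part of lessR.

```r
# Hypothetical helper (not part of lessR): fill is a parameter with a default
bar_label <- function(level, value, fill = "gray") {
  paste0(level, ": ", value, " (fill = ", fill, ")")
}

# No fill given, so the default value applies
default_call <- bar_label("SALE", 15)

# Explicitly setting fill in the call overrides the default
custom_call <- bar_label("SALE", 15, fill = "steelblue")

print(default_call)
print(custom_call)

# Inspect the preset value stored in the function definition
print(formals(bar_label)$fill)
```

The same pattern applies to the fill parameter described above: leave it out to accept the default colors, or supply a value to replace them.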
Business Applications
So far, the examples have been the number of employees in each department and the average salary in each department. However, there are almost limitless examples of comparing numerical values across categories in business analysis. These examples vary in both the categorical variables and the corresponding numerical variables of interest. Apply categorical data visualizations, such as bar charts, to visualize the extent of, and the differences among, the values of the numerical variable of interest across categories. Some examples follow.
- Sales - Decision focus: Identify over- and under-performing areas for further investigation to understand reasons for success and reasons for improvement.
  Plot sales performance across different:
  - products
  - regions
- Marketing - Decision focus: Consider which products and geographical regions may need additional advertising or price adjustment.
  Plot market share of a product:
  - with competitors
  - across regions
- Marketing - Decision focus: Understand where customers are most and least satisfied to generate guidance on product or store design.
  Plot customer satisfaction for:
  - different products
  - in-store vs online shopping experience
- Supply Chain - Decision focus: Evaluate each supplier on salient attributes.
  Plot supplier performance across different suppliers for:
  - quality of goods
  - delivery time
- Supply Chain - Decision focus: To employ just-in-time inventory management more effectively, identify over- and under-stocked units.
  Plot amounts of inventory across:
  - products
  - retail outlets
- Finance - Decision focus: Understand what products are over-performing and which are under-performing or perhaps even generating negative returns to consider revising the product mix or product marketing.
  Plot revenue, expenses, and profit margins across:
  - products
  - regions
- Human Relations - Decision focus: Decide who receives promotions and bonuses and who needs additional training.
  Plot evaluations of employee performance across:
  - departments
  - genders
- Operations Management - Decision focus: Identify bottlenecks and opportunities for process optimization.
  Plot process efficiency across:
  - products
  - regions