R/lessR: Visualize Categorical Data

This document presents the R implementation of the more general, conceptual discussion of categorical data visualizations.

The lessR function Chart() creates a variety of charts for a categorical variable, generically referred to as x. If variable x is in the data frame of original data named d, then just provide the name of the variable to create a bar chart of the number of values in each category (level).

Chart Types

Chart() can create seven different types of charts. Specify the type of chart with the type parameter.

Bar chart with type="bar", the default value
Dot chart with type="dot"
Pie chart with type="pie"
Radar chart with type="radar"
Bubble chart with type="bubble"
Icicle chart with type="icicle"
Treemap chart with type="treemap"

To be explicit when creating a bar chart, include type = "bar" in the call to Chart(). However, there is no need to include the type parameter for a bar chart because "bar" is the default value. For any other type of chart, specify the type with type=.

Available from lessR version 4.5.4.

Aggregate for Counts

Chart() implicitly creates the summary table that pairs each level of the categorical variable with a number, such as the count of the number of values for each level in the data table. Here, the variable is Dept, which identifies the department in which each employee represented in the data table works. How many employees work in each department?

Chart(Dept)

The result is a frequency distribution and a bar chart of that distribution, the frequency (count) at which each category (level) occurred in the data table. If the data frame is named something other than d, then include the data argument, such as data=mydata where mydata is the name of the data frame that contains the variable of interest.

Or, create a pie chart of the number of employees in each department:

Chart(Dept, type="pie")
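As a rough illustration of the aggregation that Chart() performs implicitly for counts, the following base R sketch builds the same kind of summary table. The data frame is a hypothetical stand-in that contains only a Dept column, constructed to match the department frequencies reported in the Statistical Output section; it is not the original data file.

```r
# Hypothetical stand-in data: one row per employee, Dept column only
d <- data.frame(
  Dept = rep(c("ACCT", "ADMN", "FINC", "MKTG", "SALE"),
             times = c(5, 6, 4, 6, 15))
)

# Count the number of rows at each level of the categorical variable
dept_counts <- table(d$Dept)

# Express the counts as proportions of the total
dept_props <- round(prop.table(dept_counts), 3)

print(dept_counts)
print(dept_props)
```

The bar chart and the pie chart both display this one table of counts. Only the visual encoding differs.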

Aggregate a Continuous Variable

To create a bar chart based on a statistic, such as the mean of some other continuous (numerical) variable, specify the name of that variable with y= and specify the statistic by which to aggregate with stat=. The following example creates the bar chart for the mean salary for each department in the company.

Chart(Dept, y=Salary, stat="mean")

Available statistics for aggregation follow.

Mean with stat="mean"
Mean deviation with stat="deviation"
Median with stat="median"
Minimum with stat="min"
Maximum with stat="max"
Standard deviation with stat="sd"

Or, visualize the median salary of the employees in each department with a pie chart.

Chart(Dept, y=Salary, stat="median", type="pie")
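The aggregation requested with stat= can be sketched in base R with tapply(), which applies a statistic to a numeric variable separately at each level of a categorical variable. The toy data below are invented for illustration only, and the sketch is an analogy to what Chart() computes, not a description of its internals.

```r
# Hypothetical toy data, invented for illustration only
d <- data.frame(
  Dept   = c("ACCT", "ACCT", "FINC", "FINC", "FINC"),
  Salary = c(50000, 60000, 70000, 80000, 120000)
)

# Mean of Salary at each level of Dept, analogous to stat="mean"
salary_mean <- tapply(d$Salary, d$Dept, mean)

# Other statistics follow the same pattern
salary_median <- tapply(d$Salary, d$Dept, median)  # analogous to stat="median"
salary_min    <- tapply(d$Salary, d$Dept, min)     # analogous to stat="min"
salary_max    <- tapply(d$Salary, d$Dept, max)     # analogous to stat="max"
salary_sd     <- tapply(d$Salary, d$Dept, sd)      # analogous to stat="sd"

print(salary_mean)
print(salary_median)
```

Each result pairs one number with each department, which is the summary table that the chart then displays.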

Pre-aggregated Data

Another possibility creates the chart from data already aggregated, either counts or a statistic such as the mean of a continuous variable. That is, the aggregation is already complete, so specify the summary (pivot) table as the input data frame. Maybe you have a table from a web site or a company financial report that you wish to visualize.

Chart() then analyzes the summary table directly rather than the original data. In that situation, specify the numeric variable with y= in addition to the categorical variable. However, do not include a stat reference: the data are already aggregated, so there is no further transformation to accomplish. Chart() simply plots the two given columns of data, generically \(x\) and \(y\) (the parameter names).

For example, create the treemap of mean salary across departments directly from the summary data in Table 1.

Table 1: Visualize categorical data from a summary (pivot) table of means that pairs each category (level) with its corresponding numerical value.

Dept	Salary_mean
ACCT	71792.78
ADMN	91277.12
FINC	79010.68
MKTG	80257.13
SALE	88830.07

Suppose you create this table in Excel. Then read the Excel table into R. One approach is to read the summary table of aggregated data into the data frame named a, a mnemonic for aggregation. Here, create the treemap of the mean salary of employees in each department in which the summary table is in data frame \(a\):

Chart(Dept, y=Salary_mean, data=a, type="treemap")

Of course, you can name the data frame of the aggregated data any valid name.
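For a small summary table, an alternative to reading a file is to enter the values directly. The sketch below creates the data frame a from the values in Table 1 with the base R function data.frame(). The Chart() call is shown as a comment because it assumes that lessR is installed and loaded.

```r
# Summary (pivot) table with the values of Table 1, entered by hand
a <- data.frame(
  Dept        = c("ACCT", "ADMN", "FINC", "MKTG", "SALE"),
  Salary_mean = c(71792.78, 91277.12, 79010.68, 80257.13, 88830.07)
)

# One row per category, so nothing is left to aggregate
str(a)

# With lessR loaded, the call from the text then applies:
# Chart(Dept, y=Salary_mean, data=a, type="treemap")
```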

Charts are Interactive

These charts are interactive, appearing in the RStudio Viewer window. Hover the mouse over a part of the chart and get more detailed information about that particular section. (The bar chart renders both as an interactive chart in the RStudio Viewer window and as a standard, static chart in the RStudio Plots window.)

lessR charts display additional information interactively when the mouse hovers over a section.

These interactive charts, from the lessR interface to the plotly system, are actually mini-webpages, called htmlwidgets.

Export the Visualization

Static visualizations

RStudio provides a simple and effective process for exporting visualizations, which appear in the lower right window pane under the Plots tab. For static charts, from the RStudio Plots window, after creating a visualization, click the Export button, then save it as a .png image, a .pdf file, or copy it directly to the clipboard for pasting into any relevant application, such as MS Word.

Figure 1 illustrates these choices.

Figure 1: Export a *static* visualization to a .png file, .pdf file, or to the clipboard.

Interactive visualizations

Interactive Plotly visualizations are displayed in the RStudio Viewer window as htmlwidgets, essentially a kind of mini web page, which provides the framework for their interactivity. However, these visualizations do not behave like standard static R plots. In particular, the Viewer pane Export options are not a reliable general mechanism for saving these charts as displayed. Instead, use one of the following practical options.

Take a screenshot of the chart in the Viewer window, then paste that image directly into a document such as Microsoft Word.
Use the Plotly chart’s own toolbar in the upper-right corner of the chart window to download a .png image when that option is available as shown below.

Plotly button to save the visualization to a .png image.

Moreover, you can save each such creation as a full, interactive webpage from the Export tab in the RStudio Viewer window where the interactive charts are displayed. This option is not useful for exporting an interactive visualization into Word, but does present many possibilities for users who know how to work with and display webpages.

Statistical Output

Unlike most visualization systems, lessR also displays statistics related to the visualization. Chart() displays the summary table and related statistics. The following is the resulting text output for counting the number of employees in each department.

[Interactive chart from the Plotly R package (Sievert, 2020)] 

--- Dept --- 

Missing Values: 0 

                ACCT   ADMN   FINC   MKTG   SALE    Total 
Frequencies:       5      6      4      6     15       36 
Proportions:   0.139  0.167  0.111  0.167  0.417    1.000 

Chi-squared test of null hypothesis of equal probabilities 
  Chisq = 10.944, df = 4, p-value = 0.027
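The proportions and the chi-squared test in this output can be checked with base R. This is a minimal sketch that starts from the reported frequencies rather than from the original data, using the standard function chisq.test(), which tests against equal probabilities by default when given a single vector of counts.

```r
# Observed department frequencies, copied from the output above
counts <- c(ACCT = 5, ADMN = 6, FINC = 4, MKTG = 6, SALE = 15)

# Proportions of the total
props <- round(counts / sum(counts), 3)

# Goodness-of-fit test of the null hypothesis of equal probabilities
fit <- chisq.test(counts)

print(props)
print(round(unname(fit$statistic), 3))  # chi-squared statistic
print(unname(fit$parameter))            # degrees of freedom
print(round(fit$p.value, 3))            # p-value
```

With five departments and 36 employees, the expected count under the null hypothesis is 36/5 = 7.2 per department, which is the baseline against which the statistic is computed.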

And, the output for the mean salary for each department.

[Interactive chart from the Plotly R package (Sievert, 2020)] 

Salary 
  - by levels of - 
Dept 
 
       n   miss       mean         sd        min        mdn        max 
ACCT    5      0   71792.78   12774.61   56124.97   79547.60   82502.50 
ADMN    6      0   91277.12   27585.15   63788.26   81058.60  132563.38 
FINC    4      0   79010.68   17852.50   67139.90   71937.62  105027.55 
MKTG    6      0   80257.13   19869.81   61036.85   71658.99  109062.66 
SALE   15      0   88830.06   23476.84   59188.96   87714.85  144419.23 
 
 Plotted Values 
 -------------- 
      ACCT      ADMN      FINC      MKTG      SALE 
  71792.78  91277.12  79010.68  80257.13  88830.06

One straightforward function call generates a variety of charts and corresponding statistical output.

Parameter Values

Example

A data analysis function in any data analysis system is defined with a set of parameters that customize its output. For example, the lessR function Chart() and other data visualization functions have the parameter fill that sets the color that fills the bars. By default, Chart() displays each bar in a different color, but the bars can also be set to the same color or other colors depending on the colors passed to the fill parameter. To change the color of all the bars to a blue shade, as in Figure 2, set the fill parameter to "steelblue", one of many R defined color names.¹

¹ Create a pdf file that shows all R color names and colors with the lessR function showColors().

Feel free to explore the Chart() manual if you wish by entering ?Chart into R. Functions provided by R packages come with a manual (help page) that explains the available options and specifies the values of the available parameters.

The full function call to obtain the blue bars appears below. To suppress the statistical output, also set the parameter quiet to TRUE.

Chart(Dept, fill="steelblue")

Figure 2: Bar chart with bars filled a shade of blue.

As is true of Excel and other analysis systems, such as R, the general format for setting a parameter value within the function call follows in Figure 3. The three dots, ..., in the figure indicate the other parts of the function call, such as a variable name.

Figure 3: A parameter and its value in a function call.

In the Chart() example above, fill names the parameter. The value of "steelblue" is the specific value set for that parameter. Explicitly setting that parameter value overrides the default value of fill for Chart(), which provides a different color for each bar.

Function Input

All input into a function is input into the function’s parameters. Each parameter has a name. For example, the parameter name of the first categorical variable entered into a Chart() function call is x. The name of the parameter that specifies the data frame is data. Parameter values can be numeric or a character string, such as a word or a letter. As is true of all computer analysis systems, such as Excel and R, if a parameter value is a character string, enclose its value in quotes, either double or single, such as "steelblue". Specify numbers without quotes.

Order of listed parameters

In a function call, if the parameter values are entered in the same order as the parameters are presented in the definition of the function, then the parameter values do not need to be named.

For example, because the first parameter of Chart() is \(x\) in the definition of the function, if the value for that parameter is listed first in the function call, that value does not have to be named with an x= preceding the value.

Chart(Dept) is the same function call as Chart(x=Dept)
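The matching of values to parameters by position or by name applies to any R function. The following sketch uses a hypothetical function, describe_call(), invented here only to make the matching visible. Unlike Chart(), it takes its first value as a quoted character string.

```r
# Hypothetical function with two parameters: x first, then fill
describe_call <- function(x, fill = "default colors") {
  paste("x =", x, "| fill =", fill)
}

# Positional: the first value is matched to the first parameter, x
by_position <- describe_call("Dept")

# Named: the same call with the parameter name written out
by_name <- describe_call(x = "Dept")

# Named values can be listed in any order
reordered <- describe_call(fill = "steelblue", x = "Dept")

print(identical(by_position, by_name))
print(reordered)
```

Naming a parameter is required only when its value is not in the position given by the function definition, which is why fill is named in the Chart() call while the variable Dept is not.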

Parameters control many aspects of how a function processes data, far more than just color. You can rely on the default parameter values, or add as many pairs of parameter names and values as there are parameters to specify.

Ease of identifying available parameters

GUI systems, such as Tableau, may be considered easier to learn and use in part because the available parameters are listed on the computer display and accessed by clicking on their respective buttons. This accessibility leads to the misperception that when using a system with written instructions, there is much information to memorize, requiring the dedicated neuroticism of a super-geek. In reality, it is easy to access the list of available parameters using R functions, and the meaning of the parameters is also explained.
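As one way to list the parameters of a function from within R, the base functions formals() and args() report a function's parameters and their default values. The sketch below applies them to the base R function sd() so that it runs without any added packages. The commented lines show the analogous calls for Chart(), which assume that lessR is installed and loaded.

```r
# names(formals()) lists the parameters of a function, in order
sd_params <- names(formals(sd))
print(sd_params)

# args() displays the full signature, including default values
print(args(sd))

# The same tools apply to a lessR function once the package is loaded:
# library(lessR)
# names(formals(Chart))
# ?Chart
```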

Summary

Data analysis with R requires at least three separate R functions: one to load the package, one to read the data, and one to analyze the data. Run R either on your computer or in the cloud.

Retrieve the lessR functions from your R library: library(lessR)
Read the data from a file into R:
d <- Read("") to browse for the data file
or,
d <- Read("path name" or "web address") to specify the location of the data file
Analyze the data values for generically named categorical variable x and possibly numerical variable y. For a bar chart, enter one of the following three function calls, depending on the type of data and the purpose of the analysis. For any other chart, add a value for the type parameter, such as type="pie".
- Chart(x) to calculate the counts of the levels of categorical variable x from the original data
- Chart(x, y=y, stat="mean") to calculate the mean of numeric variable y for each level of categorical variable x from the original data; other statistics are available
- Chart(x, y=y, data=a) to create the chart from a pre-calculated summary table, usually of data y aggregated over categorical variable x, here stored in data frame \(a\)