2. Visualize: Histogram and Related

Data

Most of the following examples analyze data from the Employee data set, included with lessR. First read the Employee data into the data frame d. See the Read and Write vignette for more details.

d <- Read("Employee")

## 
## >>> Suggestions
## Recommended binary format for data files: feather
##   Create with Write(d, "your_file", format="feather")
## More details about your data, Enter:  details()  for d, or  details(name)
## 
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
## 
##     Variable                  Missing  Unique 
##         Name     Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1     Years   integer     36       1      16   7  NA  7 ... 1  2  10
##  2    Gender character     37       0       2   M  M  W ... W  W  M
##  3      Dept character     36       1       5   ADMN  SALE  FINC ... MKTG  SALE  FINC
##  4    Salary    double     37       0      37   63788.26  104494.58 ... 66508.32  67562.36
##  5    JobSat character     35       2       3   med  low  high ... high  low  high
##  6      Plan   integer     37       0       3   1  1  2 ... 2  2  1
##  7       Pre   integer     37       0      27   82  62  90 ... 83  59  80
##  8      Post   integer     37       0      22   92  74  86 ... 90  71  87
## ------------------------------------------------------------------------------------------

As an option, also read the table of variable labels. Create the table formatted as two columns. The first column is the variable name and the second column is the corresponding variable label. Not all variables need be entered into the table. The table can be a csv file or an Excel file.

Read the label file into the l data frame, currently the only permitted name. The labels will be displayed on both the text and visualization output. Each displayed label is the variable name juxtaposed with the corresponding label, as shown in the following output.

l <- rd("Employee_lbl")

## 
## >>> Suggestions
## Recommended binary format for data files: feather
##   Create with Write(d, "your_file", format="feather")
## More details about your data, Enter:  details()  for d, or  details(name)
## 
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## ------------------------------------------------------------
## 
##     Variable                  Missing  Unique 
##         Name     Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1     label character      8       0       8   Time of Company Employment ... Test score on legal issues after instruction
## ------------------------------------------------------------------------------------------

##                                                label
## Years                     Time of Company Employment
## Gender                                  Man or Woman
## Dept                             Department Employed
## Salary                           Annual Salary (USD)
## JobSat            Satisfaction with Work Environment
## Plan             1=GoodHealth, 2=GetWell, 3=BestCare
## Pre    Test score on legal issues before instruction
## Post    Test score on legal issues after instruction

Histogram

One of the most frequently encountered visualizations for continuous variables is the histogram, which outlines the general shape of the underlying distribution.

Histogram: Bin similar values into groups, then plot a bar for each bin, with the height of the bar proportional to the frequency of the data values in that bin.
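The binning step in this definition can be sketched in base R, independent of lessR. The values below are made up for illustration and are not the Employee data, and the bin width of 10000 is an arbitrary choice. This is a minimal sketch of the idea, not how X() is implemented.

```r
# Made-up values for illustration only (not the Employee data)
x <- c(52, 58, 61, 64, 67, 71, 73, 78, 85, 96) * 1000

# Bin edges: 50000, 60000, ..., 100000
breaks <- seq(50000, 100000, by = 10000)

# Assign each value to a bin; right = FALSE gives bins of the form [a, b)
bins <- cut(x, breaks = breaks, right = FALSE, dig.lab = 6)

# Frequency of each bin, which becomes the height of the corresponding bar
counts <- table(bins)
print(counts)

# Draw adjacent bars from the tabulated counts
barplot(counts, space = 0, xlab = "x", ylab = "Count")
```

Every value falls into exactly one bin, so the counts sum to the number of values.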

A call to a function that creates a histogram includes the name of the continuous variable with the values to plot. With the X() function, that variable name is the first parameter value passed to the function. In the simplest case, the variable name is the only parameter value passed, with the data read from the default data frame named d. For a continuous variable named x, the call is X(x).

To illustrate, consider the continuous variable Salary in the Employee data table. Use X() to tabulate and display the number of employees in each bin of Salary, here relying upon the default data frame (table) named d, so the data= parameter is not needed.

X(Salary)

## [Interactive plot from the Plotly R package (Sievert, 2020)]

Histogram of tabulated counts for the bins of Salary.

## >>> Suggestions 
## bin_width: set the width of each bin 
## bin_start: set the start of the first bin 
## bin_end: set the end of the last bin 
## X(Salary, type="density")  # smoothed curve + histogram 
## X(Salary, type="vbs")  # Violin/Box/Scatterplot (VBS) plot 
## 
## --- Salary --- 
##  
##      n   miss         mean           sd          min          mdn          max 
##      37      0    83795.557    21799.533    56124.970    79547.600   144419.230 
##  
## 
##   
## --- Outliers ---     from the box plot: 1 
##  
## Small      Large 
## -----      ----- 
##             144419.2 
## 
## 
## Bin Width: 10000 
## Number of Bins: 10 
##  
##              Bin  Midpnt  Count    Prop  Cumul.c  Cumul.p 
## --------------------------------------------------------- 
##   50000 >  60000   55000      4    0.11        4     0.11 
##   60000 >  70000   65000      8    0.22       12     0.32 
##   70000 >  80000   75000      8    0.22       20     0.54 
##   80000 >  90000   85000      5    0.14       25     0.68 
##   90000 > 100000   95000      3    0.08       28     0.76 
##  100000 > 110000  105000      5    0.14       33     0.89 
##  110000 > 120000  115000      1    0.03       34     0.92 
##  120000 > 130000  125000      1    0.03       35     0.95 
##  130000 > 140000  135000      1    0.03       36     0.97 
##  140000 > 150000  145000      1    0.03       37     1.00 
##
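The Prop, Cumul.c, and Cumul.p columns of the frequency table follow directly from the Count column. A short base R sketch, using the counts copied from the output above, shows the arithmetic. The variable names are chosen here for illustration.

```r
# Count column copied from the frequency table above
count <- c(4, 8, 8, 5, 3, 5, 1, 1, 1, 1)

n <- sum(count)            # total number of data values
prop <- count / n          # Prop: proportion of values in each bin
cumul_c <- cumsum(count)   # Cumul.c: running total of the counts
cumul_p <- cumul_c / n     # Cumul.p: running total as a proportion

print(data.frame(Count = count, Prop = round(prop, 2),
                 Cumul.c = cumul_c, Cumul.p = round(cumul_p, 2)))
```

The cumulative proportions are computed from the cumulative counts rather than by summing the rounded proportions, which is why the second bin shows 0.32 and not 0.11 + 0.22 = 0.33.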

By default, the X() function colors the histogram according to the currently active theme. The function also provides summary statistics, the frequency distribution table that lists the count of each bin, from which the histogram is constructed, and an outlier analysis based on Tukey’s outlier detection rules for box plots.
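As a worked check of the outlier analysis, the following sketch applies the standard Tukey fences, 1.5 interquartile ranges beyond the quartiles, to the quartiles that lessR reports for Salary in the VBS output later in this document. The 1.5 multiplier is an assumption here. The exact rule that lessR applies may differ, for example by adjusting for skew.

```r
# Quartiles of Salary as reported in the VBS output of this document
q1 <- 66772.95
q3 <- 97785.51
iqr <- q3 - q1   # interquartile range, 31012.56

# Standard Tukey fences (assumed): 1.5 IQRs beyond each quartile
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
print(c(lower = lower_fence, upper = upper_fence))

# Minimum and maximum Salary from the summary statistics
salary_min <- 56124.97
salary_max <- 144419.23

# A value outside the fences is flagged as a potential outlier
print(c(min_is_outlier = salary_min < lower_fence,
        max_is_outlier = salary_max > upper_fence))
```

Under this assumption the upper fence is 144304.35, so the maximum of 144419.23 falls just beyond it, which is consistent with the single large outlier in the output.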

Customize the Histogram

style(quiet=TRUE)

Use the parameters bin_start, bin_width, and bin_end to customize the histogram.

X(Salary, bin_start=35000, bin_width=14000)

Customized histogram.
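To see what these two parameter values imply, the bin edges can be worked out in base R. This is a sketch under the assumption that bins of equal width continue from bin_start until the largest data value is covered. How lessR chooses the final edge is not stated here.

```r
bin_start <- 35000
bin_width <- 14000
x_max <- 144419.23   # maximum Salary from the summary statistics

# Number of equal-width bins needed to reach the maximum value
n_bins <- ceiling((x_max - bin_start) / bin_width)

# Bin edges: bin_start, bin_start + bin_width, and so on
breaks <- bin_start + bin_width * (0:n_bins)
print(n_bins)
print(breaks)
```

Under this assumption, the wider bins reduce the ten default bins to eight, with the last edge at 147000.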

The color is easy to change, either by changing the color theme with style() or by changing only the fill color with the fill parameter. The value of fill can refer to standard R colors, as shown with the lessR function showColors(), or implicitly invoke the lessR color palette generating function getColors(). Each 30 degrees of the color wheel is named, such as "greens" and "rusts", and each name implements a sequential color palette.

Use the color parameter to set the border color, here turned off.

X(Salary, fill="reds", color="transparent")

Customized histogram.

The default for formatting both axis labels is to abbreviate numeric values in the thousands, such as 100000 to 100K. The axis_fmt parameter changes this default of "K". Specify "," to insert commas in large numbers with a decimal point, "." to insert periods, or "" to turn off formatting. The value of "K" can also be combined with "," or "." by forming a vector of values, such as c("K", ",").

Axis labels can also be formatted by adding a prefix, such as $ or €, to a numeric value with the parameter axis_x_pre. The prefix can be multiple characters, such as for the Brazilian currency, R$.

X(Salary, axis_fmt=",", axis_x_pre="$")

Formatted axis values.
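The effect of these formatting options on individual tick labels can be sketched in base R. The tick values are chosen here for illustration, and the code only mimics the resulting labels. It is not the lessR implementation.

```r
# Illustrative tick values (chosen for this sketch)
ticks <- c(60000, 100000, 140000)

# "K" style: express thousands with a K suffix
k_labels <- paste0(ticks / 1000, "K")

# "," style: insert commas as the thousands separator
comma_labels <- formatC(ticks, format = "d", big.mark = ",")

# Prefix a currency symbol to the comma-formatted labels
dollar_labels <- paste0("$", comma_labels)

print(k_labels)
print(comma_labels)
print(dollar_labels)
```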

Density Plot

The histogram portrays a continuous distribution with discrete bins. More modern visualizations directly display the estimated underlying smooth curve.

Density plot: A smooth curve that estimates the underlying continuous distribution.

To create a density plot, set the type parameter to "density". The result is the filled density curve superimposed on the histogram.

X(Salary, type="density")

Histogram with density plot.

The kind parameter indicates the type of density curve. The default is "general". Options are "normal" for a normal density curve and "both" for both.
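The distinction between the two kinds of curve can be sketched in base R. The sketch assumes that "general" corresponds to a kernel density estimate, which follows the data, and that "normal" corresponds to a normal curve with the sample mean and standard deviation. The data are simulated for illustration and are not the Employee data.

```r
set.seed(1)
# Simulated values for illustration only (not the Employee data)
x <- rnorm(200, mean = 80000, sd = 20000)

# General curve (assumed): kernel density estimate with the default bandwidth
d_general <- density(x)

# Normal curve: normal density with the sample mean and standard deviation
grid <- seq(min(x), max(x), length.out = 200)
d_normal <- dnorm(grid, mean = mean(x), sd = sd(x))

# Histogram on the density scale with both curves superimposed
hist(x, freq = FALSE, main = "Histogram with density curves")
lines(d_general, lwd = 2)
lines(grid, d_normal, lty = 2)
```

A larger bandwidth, set with the bw argument of density(), gives a smoother kernel estimate.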

VBS Plot

A more modern version of the density plot combines the violin plot, box plot, and scatter plot into a single visualization, here called the VBS plot.

X(Salary, type="vbs")

## 
## --- Salary --- 
## Present: 37 
## Missing: 0 
## Total  : 37 
##  
## Mean         : 83795.557 
## Stnd Dev     : 21799.533 
## IQR          : 31012.560 
## Skew         : 0.190   [medcouple, -1 to 1] 
##  
## Minimum      : 56124.970 
## Lower Whisker: 56124.970 
## 1st Quartile : 66772.950 
## Median       : 79547.600 
## 3rd Quartile : 97785.510 
## Upper Whisker: 132563.380 
## Maximum      : 144419.230 
## 
##   
## --- Outliers ---     from the box plot: 1 
##  
## Small      Large 
## -----      ----- 
##             144419.23 
## 
## Number of duplicated values: 0 
## 
## 
## ---------- Parameter values (can be manually set)
##  
## size: 0.61      size of plotted points 
## out_size: 0.82  size of plotted outlier points 
## jitter_y: 0.45 random vertical movement of points 
## jitter_x: 0.00  random horizontal movement of points 
## bw: 9529.04       set bandwidth higher for smoother edges 
## 
## 
## ---------- Summary Statistics for Salary

VBS plot.

Interactive Input to Create a Histogram

An interactive visualization lets the user change parameter values in real time to change characteristics of the visualization. To create an interactive histogram of the variable Salary that displays the corresponding parameters, run the function interact() with "Histogram" specified.

interact("Histogram")

The interact() function is not run here because interactivity requires running the function directly from the R console.