Most of the following examples analyze data in the Employee data set,
included with lessR. First read the Employee data into the data frame d.
See the Read and Write vignette for more details.
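The output that follows is consistent with a call along these lines. This is a minimal sketch, assuming lessR is installed and that the bundled data set is available to Read() under the name "Employee".

```r
library(lessR)

# Read the Employee data set bundled with lessR into d,
# the default data frame name for lessR analysis functions
d <- Read("Employee")
```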
##
## >>> Suggestions
## Recommended binary format for data files: feather
## Create with Write(d, "your_file", format="feather")
## More details about your data, Enter: details() for d, or details(name)
##
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
##
## Variable Missing Unique
## Name Type Values Values Values First and last values
## ------------------------------------------------------------------------------------------
## 1 Years integer 36 1 16 7 NA 7 ... 1 2 10
## 2 Gender character 37 0 2 M M W ... W W M
## 3 Dept character 36 1 5 ADMN SALE FINC ... MKTG SALE FINC
## 4 Salary double 37 0 37 53788.26 94494.58 ... 56508.32 57562.36
## 5 JobSat character 35 2 3 med low high ... high low high
## 6 Plan integer 37 0 3 1 1 2 ... 2 2 1
## 7 Pre integer 37 0 27 82 62 90 ... 83 59 80
## 8 Post integer 37 0 22 92 74 86 ... 90 71 87
## ------------------------------------------------------------------------------------------
As an option, also read the table of variable labels. Format the table
as two columns: the first column contains the variable name and the
second column the corresponding variable label. Not every variable needs
an entry in the table. The table can be a csv file or an Excel file.
Read the label file into the l data frame, currently the only permitted name. The labels will be displayed on both the text and visualization output. Each displayed label is the variable name juxtaposed with the corresponding label, as shown in the following output.
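A sketch of reading and listing the labels, assuming the label table that accompanies the Employee data is bundled with lessR under the name "Employee_lbl" (an assumption; for your own labels, pass the path to your csv or Excel file instead).

```r
library(lessR)

# Read the variable labels into l, the only name lessR recognizes for labels
# "Employee_lbl" is assumed to be the bundled label table for Employee
l <- Read("Employee_lbl")

# List the labels, one row per labeled variable
l
```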
##
## >>> Suggestions
## Recommended binary format for data files: feather
## Create with Write(d, "your_file", format="feather")
## More details about your data, Enter: details() for d, or details(name)
##
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## ------------------------------------------------------------
##
## Variable Missing Unique
## Name Type Values Values Values First and last values
## ------------------------------------------------------------------------------------------
## 1 label character 8 0 8 Time of Company Employment ... Test score on legal issues after instruction
## ------------------------------------------------------------------------------------------
## label
## Years Time of Company Employment
## Gender Man or Woman
## Dept Department Employed
## Salary Annual Salary (USD)
## JobSat Satisfaction with Work Environment
## Plan 1=GoodHealth, 2=GetWell, 3=BestCare
## Pre Test score on legal issues before instruction
## Post Test score on legal issues after instruction
One of the most frequently encountered visualizations for continuous variables is the histogram, which outlines the general shape of the underlying distribution.
Histogram: Bin similar values into a group, then plot the frequency of occurrence of the data values in each bin proportional to the height of the corresponding bar.
A call to a function that creates a histogram includes the name of the
continuous variable whose values are plotted. With Histogram(), that
variable name is the first argument passed to the function. In the
simplest call, the variable name is the only argument, and the data are
read from the data frame named d, the default value. For a continuous
variable named x, the call is Histogram(x).
To illustrate, consider the continuous variable Salary in the Employee
data table. Use Histogram() to tabulate and display the number of
employees in each salary bin, here relying upon the default data frame
(table) named d, so the data= parameter is not needed.
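A minimal sketch of that call, assuming lessR is installed. The data are re-read here only so that the example runs on its own.

```r
library(lessR)

# Employee data in the default data frame d
d <- Read("Employee")

# Histogram of Salary with default bins; data=d is implied
Histogram(Salary)
```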
## >>> Suggestions
## bin_width: set the width of each bin
## bin_start: set the start of the first bin
## bin_end: set the end of the last bin
## Histogram(Salary, density=TRUE) # smoothed curve + histogram
## Plot(Salary) # Violin/Box/Scatterplot (VBS) plot
##
## --- Salary ---
##
## n miss mean sd min mdn max
## 37 0 73795.557 21799.533 46124.970 69547.600 134419.230
##
##
## --- Outliers --- from the box plot: 1
##
## Small Large
## ----- -----
## 134419.2
##
##
## Bin Width: 10000
## Number of Bins: 10
##
## Bin Midpnt Count Prop Cumul.c Cumul.p
## ---------------------------------------------------------
## 40000 > 50000 45000 4 0.11 4 0.11
## 50000 > 60000 55000 8 0.22 12 0.32
## 60000 > 70000 65000 8 0.22 20 0.54
## 70000 > 80000 75000 5 0.14 25 0.68
## 80000 > 90000 85000 3 0.08 28 0.76
## 90000 > 100000 95000 5 0.14 33 0.89
## 100000 > 110000 105000 1 0.03 34 0.92
## 110000 > 120000 115000 1 0.03 35 0.95
## 120000 > 130000 125000 1 0.03 36 0.97
## 130000 > 140000 135000 1 0.03 37 1.00
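The Midpnt, Prop, Cumul.c, and Cumul.p columns in the table above follow directly from the bin boundaries and counts. The following base R sketch reproduces that arithmetic from the counts listed in the output. It illustrates the calculation only and is not how lessR computes the table internally.

```r
# Bin lower boundaries, bin width, and counts copied from the output above
lower  <- seq(40000, 130000, by = 10000)
width  <- 10000
counts <- c(4, 8, 8, 5, 3, 5, 1, 1, 1, 1)

midpnt   <- lower + width / 2          # center of each bin
prop     <- counts / sum(counts)       # proportion of the 37 employees per bin
cumul_c  <- cumsum(counts)             # cumulative counts
cumul_p  <- cumul_c / sum(counts)      # cumulative proportions

round(prop, 2)     # 0.11 0.22 0.22 0.14 0.08 0.14 0.03 0.03 0.03 0.03
cumul_c            # 4 12 20 25 28 33 34 35 36 37
round(cumul_p, 2)  # 0.11 0.32 0.54 0.68 0.76 0.89 0.92 0.95 0.97 1.00
```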
By default, Histogram() colors the bars according to the current, active
color theme. The function also reports summary statistics, the frequency
distribution that lists the count of each bin, from which the histogram
is constructed, and an outlier analysis based on Tukey’s outlier
detection rules for box plots.
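To see why only the largest salary is flagged, the following base R sketch applies Tukey's rule by hand, assuming the conventional fences at 1.5 times the interquartile range beyond the quartiles. The quartiles, minimum, and maximum are the values that lessR reports for Salary in the VBS plot output later in this section.

```r
# Values of Salary as reported in the lessR output
q1         <- 56772.95
q3         <- 87785.51
salary_min <- 46124.97
salary_max <- 134419.23

iqr         <- q3 - q1            # interquartile range, 31012.56
lower_fence <- q1 - 1.5 * iqr     # 10254.11
upper_fence <- q3 + 1.5 * iqr     # 134304.35

salary_min < lower_fence   # FALSE: no small outliers
salary_max > upper_fence   # TRUE: the maximum is a large outlier
```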
Use the parameters bin_start, bin_width, and bin_end to customize the
histogram.
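A sketch of a call with custom bins, assuming lessR is installed. The values 35000 and 14000 are taken from the output that follows, which reports a bin width of 14000 and a first bin that starts at 35000.

```r
library(lessR)

# Employee data in the default data frame d
d <- Read("Employee")

# Start the first bin at 35000 and make each bin 14000 wide;
# bin_end is left at its default
Histogram(Salary, bin_start=35000, bin_width=14000)
```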
## >>> Suggestions
## bin_end: set the end of the last bin
## Histogram(Salary, density=TRUE) # smoothed curve + histogram
## Plot(Salary) # Violin/Box/Scatterplot (VBS) plot
##
## --- Salary ---
##
## n miss mean sd min mdn max
## 37 0 73795.557 21799.533 46124.970 69547.600 134419.230
##
##
## --- Outliers --- from the box plot: 1
##
## Small Large
## ----- -----
## 134419.2
##
##
## Bin Width: 14000
## Number of Bins: 8
##
## Bin Midpnt Count Prop Cumul.c Cumul.p
## ---------------------------------------------------------
## 35000 > 49000 42000 1 0.03 1 0.03
## 49000 > 63000 56000 14 0.38 15 0.41
## 63000 > 77000 70000 9 0.24 24 0.65
## 77000 > 91000 84000 4 0.11 28 0.76
## 91000 > 105000 98000 5 0.14 33 0.89
## 105000 > 119000 112000 2 0.05 35 0.95
## 119000 > 133000 126000 1 0.03 36 0.97
## 133000 > 147000 140000 1 0.03 37 1.00
The color is easy to change, either by changing the color theme with
style(), or by changing just the fill color with fill. Refer to standard
R colors, as shown with the lessR function showColors(), or implicitly
invoke the lessR color palette generating function getColors(). Each 30
degrees of the color wheel is named, such as "greens", "rusts", etc.,
and implements a sequential color palette.
Use the color parameter to set the border color, here turned off.
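A sketch of such a call, assuming lessR is installed. The palette name "greens" is one of those named above. Setting color to "transparent", a standard R color specification, is one assumed way to turn the border off.

```r
library(lessR)

# Employee data in the default data frame d
d <- Read("Employee")

# Fill the bars from the sequential "greens" palette and
# suppress the bar borders with a transparent border color
Histogram(Salary, fill="greens", color="transparent")
```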
## >>> Suggestions
## bin_width: set the width of each bin
## bin_start: set the start of the first bin
## bin_end: set the end of the last bin
## Histogram(Salary, density=TRUE) # smoothed curve + histogram
## Plot(Salary) # Violin/Box/Scatterplot (VBS) plot
##
## --- Salary ---
##
## n miss mean sd min mdn max
## 37 0 73795.557 21799.533 46124.970 69547.600 134419.230
##
##
## --- Outliers --- from the box plot: 1
##
## Small Large
## ----- -----
## 134419.2
##
##
## Bin Width: 10000
## Number of Bins: 10
##
## Bin Midpnt Count Prop Cumul.c Cumul.p
## ---------------------------------------------------------
## 40000 > 50000 45000 4 0.11 4 0.11
## 50000 > 60000 55000 8 0.22 12 0.32
## 60000 > 70000 65000 8 0.22 20 0.54
## 70000 > 80000 75000 5 0.14 25 0.68
## 80000 > 90000 85000 3 0.08 28 0.76
## 90000 > 100000 95000 5 0.14 33 0.89
## 100000 > 110000 105000 1 0.03 34 0.92
## 110000 > 120000 115000 1 0.03 35 0.95
## 120000 > 130000 125000 1 0.03 36 0.97
## 130000 > 140000 135000 1 0.03 37 1.00
The histogram portrays a continuous distribution with discrete bins. More modern visualizations directly display the estimated underlying smooth curve.
Density plot: A smooth curve that estimates the underlying continuous distribution.
To create a density plot, set the density parameter to TRUE. The result
is the filled density curve superimposed on the histogram.
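A minimal sketch of the density call, assuming lessR is installed. It matches the form that Histogram() itself lists under its suggestions.

```r
library(lessR)

# Employee data in the default data frame d
d <- Read("Employee")

# Smoothed density curve superimposed on the histogram of Salary
Histogram(Salary, density=TRUE)
```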
## >>> Suggestions
## bin_width: set the width of each bin
## Histogram(Salary) # histogram only
## Plot(Salary) # Violin/Box/Scatterplot (VBS) plot
##
## --- Bandwidth --- for general curve: 9529.0447
##
##
## --- Salary ---
##
## n miss mean sd min mdn max
## 37 0 73795.557 21799.533 46124.970 69547.600 134419.230
##
##
## --- Outliers --- from the box plot: 1
##
## Small Large
## ----- -----
## 134419.2
The type parameter indicates the type of density curve. The default is
"general". Options are "normal" for a normal density curve and "both"
for both.
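A sketch of the two non-default settings, assuming lessR is installed and that type is combined with density=TRUE as in the previous example.

```r
library(lessR)

# Employee data in the default data frame d
d <- Read("Employee")

# Normal density curve only
Histogram(Salary, density=TRUE, type="normal")

# General and normal density curves together
Histogram(Salary, density=TRUE, type="both")
```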
A more modern version of the density plot combines the violin plot, box plot, and scatter plot into a single visualization, here called the VBS plot.
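A minimal sketch of the VBS plot call, assuming lessR is installed. It matches the form that Histogram() lists under its suggestions.

```r
library(lessR)

# Employee data in the default data frame d
d <- Read("Employee")

# Violin/Box/Scatterplot (VBS) plot of the single continuous variable Salary
Plot(Salary)
```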
## [Violin/Box/Scatterplot graphics from Deepayan Sarkar's lattice package]
##
## >>> Suggestions
## Plot(Salary, out_cut=2, fences=TRUE, vbs_mean=TRUE) # Label two outliers ...
## Plot(Salary, box_adj=TRUE) # Adjust boxplot whiskers for asymmetry
## --- Salary ---
## Present: 37
## Missing: 0
## Total : 37
##
## Mean : 73795.557
## Stnd Dev : 21799.533
## IQR : 31012.560
## Skew : 0.190 [medcouple, -1 to 1]
##
## Minimum : 46124.970
## Lower Whisker: 46124.970
## 1st Quartile : 56772.950
## Median : 69547.600
## 3rd Quartile : 87785.510
## Upper Whisker: 122563.380
## Maximum : 134419.230
##
##
## --- Outliers --- from the box plot: 1
##
## Small Large
## ----- -----
## 134419.23
##
## Number of duplicated values: 0
##
## Parameter values (can be manually set)
## -------------------------------------------------------
## size: 0.61 size of plotted points
## out_size: 0.82 size of plotted outlier points
## jitter_y: 0.45 random vertical movement of points
## jitter_x: 0.00 random horizontal movement of points
## bw: 9529.04 set bandwidth higher for smoother edges
An interactive visualization lets the user change parameter values in
real time to change characteristics of the visualization. To create an
interactive histogram of the variable Salary that displays the
corresponding parameters, run the function interact() with "Histogram"
specified.
interact("Histogram")
The interact() function is not run here because interactivity requires
running the function directly from the R console.
Use the base R help() function to view the full manual for Histogram().
Simply enter a question mark followed by the name of the function.
?Histogram
More on histograms and other visualizations from lessR and other packages such as ggplot2:
Gerbing, D., R Visualizations: Derive Meaning from Data, CRC Press, May, 2020, ISBN 978-1138599635.