R: Visualize Continuous Variables

Author

David Gerbing

Published

Apr 17, 2024, 08:24 am

Here, we focus on lessR visualizations because they are easier to implement and more comprehensive than those provided by any other available R visualization system (and Python as well). The lessR visualization functions are more comprehensive because they also provide statistical output without having to invoke yet another function.

Histogram

No surprise here. If you have already done lessR bar charts then you already know how to do lessR histograms. Just replace the word BarChart with the word Histogram in the function call and specify a continuous variable for the analysis. Figure 1 presents the result.

Histogram(Salary)
Figure 1: lessR default histogram for Salary from the Employee data set.

As with the bar graph, a frequency distribution of a continuous variable can also be presented as a visualization or as a table. For this analysis, bin width is $10,000. The following frequency table displayed by Histogram() lists each bin with the corresponding Count, Proportion, Cumulative Count, and Cumulative Proportion.

             Bin  Midpnt  Count    Prop  Cumul.c  Cumul.p 
--------------------------------------------------------- 
  40000 >  50000   45000      4    0.11        4     0.11 
  50000 >  60000   55000      8    0.22       12     0.32 
  60000 >  70000   65000      8    0.22       20     0.54 
  70000 >  80000   75000      5    0.14       25     0.68 
  80000 >  90000   85000      3    0.08       28     0.76 
  90000 > 100000   95000      5    0.14       33     0.89 
 100000 > 110000  105000      1    0.03       34     0.92 
 110000 > 120000  115000      1    0.03       35     0.95 
 120000 > 130000  125000      1    0.03       36     0.97 
 130000 > 140000  135000      1    0.03       37     1.00
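
To see where the columns of such a table come from, the following is a minimal base R sketch, not the lessR implementation. The salary values are hypothetical, made up for illustration, and are not the Employee data. The choice of which bin receives a value that falls exactly on a boundary is also an assumption of this sketch.

```r
# Hypothetical salaries for illustration only, not the Employee data
salary <- c(46125, 52000, 58500, 61000, 67250, 69500,
            73000, 88000, 95500, 134419)

# Bin boundaries: width of 10,000, starting at 40,000
breaks <- seq(40000, 140000, by = 10000)

# Assign each value to a bin; right = FALSE means a boundary value
# goes into the higher bin, an assumption made for this sketch
bins <- cut(salary, breaks = breaks, right = FALSE, dig.lab = 6)

# Count per bin, then derive the remaining columns from the counts
count <- as.vector(table(bins))
freq <- data.frame(
  bin     = levels(bins),
  midpnt  = head(breaks, -1) + 5000,
  count   = count,
  prop    = round(count / sum(count), 2),
  cumul_c = cumsum(count),
  cumul_p = round(cumsum(count) / sum(count), 2)
)
print(freq)
```

Every column after the count is derived from the count, which is why a change of bin width changes the entire table.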

Also provided are the summary statistics of the distribution.

--- Salary ---  
     n   miss         mean           sd          min          mdn          max 
     37      0    73795.557    21799.533    46124.970    69547.600   134419.230

An outlier analysis should always be done for each variable in an analysis. Since it should always be done, my lessR Histogram() function does that analysis by default. It is the same outlier analysis that the box plot provides, reported here as text without the box plot visualization.

  
--- Outliers ---     from the box plot: 1 
 
Small      Large 
-----      ----- 
            134419.2
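
As a rough illustration of a box plot outlier analysis, the sketch below uses the base R function boxplot.stats(), which flags values that lie more than 1.5 times the length of the box beyond either end of the box. The salary values are hypothetical, and the sketch assumes, without confirming, that this rule matches the one lessR applies.

```r
# Hypothetical salaries for illustration only, not the Employee data
salary <- c(46125, 52000, 58500, 61000, 67250, 69500,
            73000, 88000, 95500, 134419)

# boxplot.stats() returns the box plot statistics without drawing a plot
bp <- boxplot.stats(salary)

# Values beyond the whiskers, that is, the potential outliers
out <- bp$out
print(out)

# Separate the small outliers from the large ones, as in the output above
small <- out[out < median(salary)]
large <- out[out > median(salary)]
```

For these made-up values, the only flagged value is the largest salary, which appears in large, and small is empty.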

Bin Width and Start

Any data visualization system will provide a default histogram with default values for the bin width and bin starting value that it computes for you. To override these defaults, explicitly specify the bin width and bin starting value with the corresponding parameters.

bin_width and bin_start

Histogram() parameters for specifying the width of the bins and the starting point of the bins.

After you create a histogram, such as with the default bin_width and bin_start values computed for you, experiment with at least one or two other bin widths.

Histogram(Salary, bin_width=13000)
Figure 2: lessR histogram for Salary from the Employee data set with a customized bin width of $13,000.
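
To show how the bin width and bin start together determine the bins, here is a minimal base R sketch with hypothetical salary values. The helper bin_counts() is a name invented for this illustration. It is not a lessR function, and it does not reproduce how Histogram() computes its bins internally.

```r
# Hypothetical salaries for illustration only, not the Employee data
salary <- c(46125, 52000, 58500, 61000, 67250, 69500,
            73000, 88000, 95500, 134419)

# Hypothetical helper: bin counts for a given bin width and bin start
bin_counts <- function(x, width, start) {
  # Enough bins of the given width to reach the largest value
  n_bins <- ceiling((max(x) - start) / width)
  breaks <- seq(start, by = width, length.out = n_bins + 1)
  # plot = FALSE returns the computed bins without drawing the histogram
  h <- hist(x, breaks = breaks, plot = FALSE)
  data.frame(lower = head(breaks, -1), upper = breaks[-1], count = h$counts)
}

# Same data, same start, two different bin widths
narrow <- bin_counts(salary, width = 10000, start = 40000)
wide   <- bin_counts(salary, width = 13000, start = 40000)
print(narrow)
print(wide)
```

The data do not change, yet the two sets of counts can suggest somewhat different shapes, which is the reason to try more than one bin width.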

Histogram Scale

To re-scale a variable, that is, to transform its scale, you can transform the data in Excel if the data are stored there. However, the R implementation is straightforward: enter the expression that defines the transformation into the R console. The primary “gotcha” is that the reference to the variable must include the name of its containing data frame.

R Reference for a Variable

data_frame_name$variable_name

An example follows for the variable Salary in the d data frame.

You can have as many active data frames as your computer’s memory can accommodate. Each data frame can contain a variable of the same name. So, in many functions, specify the data frame that contains each variable with the $ notation. However, some analysis functions, such as my lessR functions, use the data parameter to specify the data frame that contains the relevant variables.

d$Salary <- d$Salary / 1000

The resulting histogram in Figure 3 is expressed in units of thousands of dollars. The xlab parameter specifies the label on the x-axis.

Histogram(Salary, xlab="Salary (USD in thousands)")
Figure 3: Histogram with re-scaled units for improved readability.
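
The following base R sketch illustrates the $ notation on a small data frame with hypothetical values. It stores the re-scaled values in a new variable, here given the made-up name Salary_k, instead of overwriting Salary. That is one possible design choice, not the approach shown above.

```r
# Hypothetical data frame for illustration only, not the Employee data
d <- data.frame(Salary = c(46125, 52000, 134419))

# Reference the variable by data frame name, $, and variable name
# Salary_k is a hypothetical name for the re-scaled variable
d$Salary_k <- d$Salary / 1000

print(d)
```

Overwriting d$Salary, as in the example above, is shorter, but running that line a second time divides by 1000 again. Whichever approach you choose, any setting expressed in the units of the variable, such as bin_width, must then be given in the new units.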

Density Curve

To obtain the density curve superimposed upon the histogram, set the density parameter to TRUE, as shown in Figure 4.

Histogram(Salary, density=TRUE)
Figure 4: Density plot of Salary.

The text output to the R console includes the calculated setting of bandwidth used to generate the density plot. To customize bandwidth, set the bandwidth parameter. Set bandwidth higher for a smoother plot and lower for a wavier, more irregular plot. Figure 5 illustrates.

Histogram(Salary, density=TRUE, bandwidth=15000)
Figure 5: Smoothness of the density curve customized with the bandwidth parameter.
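
The effect of the bandwidth can also be explored with the base R density() function, as in the minimal sketch below. The salary values are hypothetical, and the default bandwidth that density() computes is not necessarily the same value that lessR reports.

```r
# Hypothetical salaries for illustration only, not the Employee data
salary <- c(46125, 52000, 58500, 61000, 67250, 69500,
            73000, 88000, 95500, 134419)

# Density estimate with the bandwidth that density() computes by default
default_fit <- density(salary)

# Density estimate with a larger, explicitly specified bandwidth
smooth_fit <- density(salary, bw = 15000)

# The bandwidth used is stored in the bw component of each result
print(default_fit$bw)
print(smooth_fit$bw)

# To compare the two curves visually, uncomment the next two lines:
# plot(default_fit, main = "Default vs. larger bandwidth")
# lines(smooth_fit, lty = 2)
```

For these made-up values, the default bandwidth is smaller than 15000, so the second curve is the smoother of the two.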