Histogram
Data Values into Bins
In contrast to the relatively few unique values of a categorical variable, a continuous variable has many potential values. How many potential values? Generally, there are too many unique data values to plot individually on a single graph. Consider annual salary, where every single value to the nearest penny must be considered from about $20,000 to $500,000 or so. Because there are so many potential data values, most of them never occur in any given data set. For example, a specific annual salary of $83,924.79 would likely not occur unless the sample size was extremely large, and even then not likely to the nearest penny.
What to do with all the unique data values that cannot be individually plotted?
Bins are a sequence of adjacent intervals, each generally the same size, that group the values of a continuous variable.
Partition the range of values into bins, sometimes called classes. Figure 1 presents an example of a bin that contains values from $50,000 to $60,000 for annual salaries.
Each bin contains similar data values, defined by a lower and upper boundary, which specifies the width of the bin. In Figure 1, bin width is $10,000, and the midpoint is $55,000.
To evaluate the distribution of a continuous variable, first define the bins and then place each data value into its respective bin, as illustrated in Figure 2. For example, assign an annual salary of $57,358 to the bin from $50,000 to $60,000.
Consistently assign values precisely equal to a bin boundary to either the adjacent lower bin or the adjacent higher bin. Each computer application that generates histograms defaults to one of those choices.
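The assignment rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names bin_index and bin_bounds, the right_closed flag, and the starting point of $20,000 for the first bin are hypothetical choices, not taken from any particular histogram application.

```python
import math

def bin_index(value, start, width, right_closed=False):
    """Return the 0-based index of the bin that contains value.

    By default each bin includes its lower boundary, so a value exactly
    equal to a boundary goes to the adjacent higher bin. With
    right_closed=True each bin includes its upper boundary instead, so a
    boundary value goes to the adjacent lower bin.
    """
    position = (value - start) / width
    if right_closed:
        return math.ceil(position) - 1
    return math.floor(position)

def bin_bounds(index, start, width):
    """Return the (lower, upper) boundaries of the bin with this index."""
    lower = start + index * width
    return lower, lower + width

# Assumed layout: bins of width $10,000, first bin starting at $20,000.
k = bin_index(57358, start=20000, width=10000)
print(bin_bounds(k, start=20000, width=10000))      # (50000, 60000)

# A salary of exactly $60,000 sits on a boundary; the convention decides.
print(bin_index(60000, 20000, 10000))                     # 4: higher bin
print(bin_index(60000, 20000, 10000, right_closed=True))  # 3: lower bin
```

Whichever convention is chosen, the point is to apply it to every boundary value in the same way.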
Example
The most typical graphical display of the variation of a continuous variable is the histogram.
A histogram is the visualization of a continuous variable with values distributed across a set of adjacent bins, usually of the same width, plotted on the horizontal x-axis, with the count of the data values in each bin plotted on the vertical y-axis.
Like the bar graph, the histogram consists of a set of bars. However, instead of associating a single data value with a number, the histogram associates each bin with a number, the corresponding frequency or count. Another distinction is that the adjacent bars of a histogram share a common side to indicate the underlying continuity of a continuous variable.
For example, consider the histogram shown in Figure 3 of the annual salaries of 37 different employees from the Employee data table included with lessR.
The width of the bins for this histogram is $8,000. Here we see, for example, that there is only one person who receives a salary between $40,000 and $48,000. Because the data values are organized by bins, from the histogram alone we do not know the precise salary of that person, just that it is somewhere between $40,000 and $48,000.
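To make the counting step concrete, here is a minimal Python sketch that tallies values into bins of width $8,000 and prints a text version of the bars. The nine salaries are made up for illustration and are not the Employee data from lessR; the function name histogram_counts and the starting point of $40,000 are likewise assumptions.

```python
import math
from collections import Counter

def histogram_counts(values, start, width):
    """Return (lower, upper, count) for each bin, from the first bin at
    start through the bin that holds the largest value.

    Assumes every value is at least start. A value equal to a bin
    boundary is counted in the adjacent higher bin.
    """
    counts = Counter(math.floor((v - start) / width) for v in values)
    n_bins = max(counts) + 1  # largest occupied bin index, plus one
    return [(start + k * width, start + (k + 1) * width, counts.get(k, 0))
            for k in range(n_bins)]

# Hypothetical salaries, for illustration only.
salaries = [41000, 52000, 55500, 58000, 63000, 66000, 67500, 71000, 88000]

for lower, upper, count in histogram_counts(salaries, start=40000, width=8000):
    print(f"{lower:>6} to {upper:<6} {'#' * count}")
```

As in the text, the output reports only how many values fall in each bin. The single value in the first bin could be anywhere from $40,000 up to $48,000.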
Artifacts and Issues
Optimal Bin Width
The choice of bins is critical for discerning the overall shape of a distribution. Yet, this choice often relies upon potentially subjective decisions made by the analyst.
Unlike many statistical analyses, there is no one correct, best histogram for a given data set, so we run several histograms with different bin widths.
Although there is not just one correct answer for choosing the bin width for a given data set, some choices provide more clarity than others to discern the basic shape of the distribution. The default histogram generated by analysis systems such as R and Tableau presumes a default bin width. However, these initial histograms may need adjusting because the underlying algorithms do not always identify an optimal bin width. Through a process of trial and error, select a bin width that displays as much detail as possible for the available sample size without excessive random noise.
What guidance do we have for choosing a bin width? Although virtually any distribution shape is possible, most distributions of a continuous variable encountered in practice tend to be continuously increasing or decreasing in value, or like the normal or “bell” shaped curve, with increasing frequencies up to a value and then decreasing frequencies after that value (or vice versa). In particular, most real-life distributions, as opposed to mathematical oddities dreamed up by mathematicians, are not characterized by a pattern of up and down zigzags.
As shown in the following examples, random ups and downs of a histogram’s bars usually indicate sampling error resulting from a bin width that is too small. Yes, that problem can be corrected by increasing bin width, but only to a certain extent. Bins that are too large obscure information, not leveraging all the information in the data.
Bin Width
One artifact involves bin width.
Changing the width of the bins not only changes their width, it can also change the overall shape of the histogram.
Choosing the optimal bin width depends on several factors with no one precise best answer. Still, some choices are better than others. One problem is bins that are too small.
A bin width can be too small relative to the amount of available data, which yields an undersmoothed histogram.
Too many bins result in too much detail that displays random noise.
The result is that the undersmoothed histogram indicates random ups and downs that would not reproduce if a new set of comparable data were collected and analyzed. The previously displayed histogram in Figure 3 is somewhat undersmoothed. To clarify that characteristic, consider the frequency polygon of the same data.
A frequency polygon is similar to a histogram but, instead of bars, plots a point at what would be the midpoint of the top of each bar and connects each pair of adjacent points with a line segment.
Figure 4 shows the frequency polygon that corresponds to the histogram in Figure 3.
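The construction of a frequency polygon from bin counts can be sketched as follows. The counts, the starting point of $40,000, and the function name frequency_polygon are hypothetical and serve only to show the arithmetic.

```python
def frequency_polygon(counts, start, width):
    """Return one (midpoint, count) point per bin.

    The midpoint of bin k is start + (k + 0.5) * width, which is where
    the center of the top of the corresponding bar would be.
    """
    return [(start + (k + 0.5) * width, c) for k, c in enumerate(counts)]

# Hypothetical bin counts for bins of width 8000 that start at 40000.
points = frequency_polygon([1, 2, 2, 3, 0, 0, 1], start=40000, width=8000)

# Each pair of adjacent points defines one line segment of the polygon.
segments = list(zip(points, points[1:]))

print(points[0])    # (44000.0, 1)
print(segments[0])  # ((44000.0, 1), (52000.0, 2))
```

Passing the points to any line-plotting routine then draws the polygon.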
Could those irregularities characteristic of that histogram or its associated frequency polygon be real? Could those irregularities consistently repeat in other samples of data from the same population? The answer is yes, those zigzags could be an actual characteristic of the underlying population. However, particularly for such a small sample of 37 employees, repeatability of those irregular fluctuations is very unlikely. Instead, the underlying true distributional shape is likely much smoother.
To be more dramatic, we can decrease the bin width even more, as in Figure 5.
These undersmoothed histograms reflect too much random sampling variability – too many random ups and downs – relative to the likely much smoother true shape of the underlying distribution. The too narrow bin width obscures the actual shape of the underlying distribution. Imagine trying to interpret this histogram, having to continually repeat statements such as: Salaries just under $80,000 tend to be not encountered so much, but then over $80,000 there is a rise in the number of salaries followed by a quick decline in the number with salaries a little more than $85,000. Such statements are true enough for this particular sample but irrelevant to the population as a whole, and so to any other sample.
In practice, when encountering such undersmoothed histograms, increase bin width and experiment. The following histogram in Figure 6, with a bin width of $13,000, is an excellent candidate for the final histogram.
To further illustrate the more smoothed estimation of the underlying distribution, consider the corresponding frequency polygon in Figure 7.
The histogram in Figure 6 indicates a distribution with a relatively small number of employees with salaries that begin around $40,000, followed by increasing frequencies up to about $80,000, and then rapidly diminishing frequencies for larger salaries. This histogram shows the same general form as the histogram in Figure 3, with a bin width of $8,000. The improvement is that the revised histogram with a bin width of $13,000 reveals the shape of the distribution without the irregular fluctuations.
Just as with an undersmoothed histogram that obscures the underlying distribution, a bin width that is too wide also does not properly display the shape of the underlying distribution.
Bins that are too large obscure details of the underlying distribution.
Oversmoothed histograms do not utilize the data efficiently. They bury some of the information inherent in the data.
The following oversmoothed histogram in Figure 8 illustrates the loss of information.
For this histogram, the bin width is $25,000. If we only had this histogram by which to describe the underlying distribution, we would conclude that the largest number of salaries are between $40,000 and $65,000 and that the number of people with higher salaries steadily declines. True as far as it goes, but from the more optimal histogram in Figure 6, we know that the shape of the distribution is more nuanced: the number of people first increases as salary rises from the lowest end of the salary spectrum, and then decreases.
To summarize, the default bin width given by a computer algorithm for a specific histogram function is usually reasonable. However, this initial bin width can often be improved by experimenting with different bin widths. The goal is to display as much information as possible given the size of the sample from which the histogram is estimated.
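The trial-and-error process can be sketched in Python by binning the same data with several widths and comparing the results. The twelve salaries are hypothetical, and the direction_changes function is an informal heuristic introduced here only to count zigzags in the bar heights. It is not a standard statistic and not a substitute for looking at the histogram.

```python
import math

def counts_for_width(values, start, width):
    """Count the values in each bin of the given width, first bin at start.

    Assumes every value is at least start. A value equal to a bin
    boundary is counted in the adjacent higher bin.
    """
    n_bins = math.floor((max(values) - start) / width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[math.floor((v - start) / width)] += 1
    return counts

def direction_changes(counts):
    """Count how often the bar heights switch between rising and falling.

    Adjacent bars of equal height are ignored.
    """
    signs = [1 if b > a else -1 for a, b in zip(counts, counts[1:]) if a != b]
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)

# Hypothetical salaries, for illustration only.
salaries = [42000, 51000, 53000, 61000, 62000, 68000,
            71000, 74000, 79000, 83000, 95000, 112000]

for width in (5000, 15000, 40000):
    counts = counts_for_width(salaries, start=40000, width=width)
    print(width, counts, direction_changes(counts))
```

For these made-up values, the narrowest width produces many zigzags, the widest collapses everything into two bars, and the middle width shows a rise followed by a decline. This mirrors the undersmoothed, oversmoothed, and more balanced histograms discussed above.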
Bin Shift
Another example of the potential arbitrariness of a histogram is the bin shift artifact.
Change the starting point of a histogram, and the shape of the histogram likely changes.
Particularly in small samples, the shape of the histogram may be overly dependent on the starting position of the first bin. A histogram of the same data with the same bin width can substantively change shape simply by changing the starting point of the first bin.
To illustrate the impact of bin shift, consider a sample of 15 shipment times, assessed as the number of days from the time of the order until the shipment is received, shown in Figure 9.
The default starting point of the histogram in Figure 9 is 5. The histogram indicates steadily increasing frequencies for longer ship times. To revise the histogram, retain the bin width of 1 but instead start the first bin at 4.5, as shown in Figure 10.
Compare the histograms in Figure 9 and Figure 10 of ship time, which have the same bin width but with start values that differ by 0.5. The resulting histograms differ to the extent that, at first glance, they appear to describe different data sets. The reason for this change is that for this small sample size there are relatively large gaps between some of the adjacent data values. The continuity of the possible underlying values is not well represented by the relatively few values in the data set. The one-dimensional scatterplot, explained later, is usually a better alternative to visualize the distribution of data values in a small data set.
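The bin shift artifact is easy to reproduce with a short Python sketch. The 15 ship times are invented for illustration and are not the data behind Figures 9 and 10; the function name bin_counts is also an assumption.

```python
import math

def bin_counts(values, start, width):
    """Count the values in each bin of the given width, first bin at start.

    Assumes every value is at least start. A value equal to a bin
    boundary is counted in the adjacent higher bin.
    """
    n_bins = math.floor((max(values) - start) / width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[math.floor((v - start) / width)] += 1
    return counts

# Hypothetical ship times in days, for illustration only.
ship_times = [5.4, 6.6, 6.8, 7.6, 7.7, 7.9, 8.1, 8.2, 8.3, 8.4,
              9.6, 9.6, 9.7, 9.8, 9.9]

# Same data, same bin width of 1, two different starting points.
print(bin_counts(ship_times, start=5.0, width=1))  # [1, 2, 3, 4, 5]
print(bin_counts(ship_times, start=4.5, width=1))  # [1, 0, 2, 7, 0, 5]
```

With the first bin starting at 5, the counts rise steadily. Shifting the start to 4.5 produces a tall central bar flanked by two empty bins, even though neither the data nor the bin width changed.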
Histogram Scale
For any type of visualization plotted with axes, readability suffers when the values displayed on an axis include too many zeros. To illustrate, consider a histogram of the data values of the variable Income. The histogram is shown in Figure 11.
After dividing the values of the variable Income by 1000, its values now range between 0 and 600 instead of 0 and 600,000. The rescaling of the variable should be noted, such as with a new variable label on the horizontal or x-axis for the histogram, shown in Figure 12.
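The rescaling itself is a single step, sketched here in Python with hypothetical income values. The variable names and the wording of the axis label are assumptions.

```python
# Hypothetical incomes in dollars, for illustration only.
incomes = [0, 28500, 61000, 145000, 320000, 600000]

# Divide by 1000 so the axis shows 0 to 600 instead of 0 to 600,000.
incomes_k = [x / 1000 for x in incomes]

# Record the rescaling in the axis label so readers know the units.
x_label = "Income ($1000s)"

print(x_label, incomes_k)
```

The rescaled values and the new label are then passed to the plotting function in place of the originals.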
Density Curve
The histogram is a 19th-century innovation. Its notable limitation is that the underlying distribution of a continuous variable is, well, continuous but the histogram plots the distribution in discrete chunks, the bins. Ideally, a continuous distribution would be represented by a smooth curve. Modern computer technology delivers that representation, directly estimating that smooth curve.
A density curve is the smooth curve that directly visualizes continuity.
As with the histogram, the density curve indicates where the values of the variable tend to occur more or less frequently than the other values, as shown in Figure 13. Here, return to the data from the Employee data set.
The density curve in Figure 13 is a bit wobbly. Its smoothness can be adjusted.
The bandwidth determines the smoothness of the estimated density curve.
The value of each point of the density curve is computed by averaging the values of nearby data values. A larger bandwidth results in a smoother curve because it averages over a broader range of data points, leading to less sensitivity to fluctuations in individual data points. A smaller bandwidth produces a more jagged curve that closely follows the individual data points.
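One common way to carry out this kind of local averaging is a kernel density estimate. The Python sketch below assumes a Gaussian kernel, which is an illustrative choice and may differ from what any particular software uses. The salaries are hypothetical, and count_peaks is an informal helper introduced here to summarize how wobbly a curve is.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function that estimates the density at any point x.

    Each data value contributes a Gaussian bump centered on it with
    standard deviation equal to the bandwidth. The estimate at x is the
    average of those bumps, so nearby data values count the most.
    """
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2)
                          for xi in data)

    return density

def count_peaks(density, lo, hi, steps=400):
    """Count the local maxima of the curve on an evenly spaced grid."""
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    ys = [density(x) for x in xs]
    return sum(1 for a, b, c in zip(ys, ys[1:], ys[2:]) if b > a and b > c)

# Hypothetical salaries, for illustration only.
salaries = [42000, 51000, 53000, 61000, 62000, 68000,
            71000, 74000, 79000, 83000, 95000, 112000]

# Compare a very small bandwidth with two larger ones.
for bandwidth in (1000, 9000, 15000):
    curve = gaussian_kde(salaries, bandwidth)
    print(bandwidth, count_peaks(curve, lo=0, hi=160000))
```

With a Gaussian kernel, a larger bandwidth never increases the number of peaks in the true curve, which is one way to see why larger bandwidths give smoother estimates.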
The lack of smoothness in Figure 13 is likely due to random sampling fluctuations characteristic of such small samples. The bandwidth applied to that density curve is 9000. To derive a smoother curve more impervious to this likely random error, the curve in Figure 14 was computed with a larger bandwidth than the density curve in Figure 13, a bandwidth value of 15000.
As with the underlying histogram, there is no one best density curve. The goal is to find the optimal value of bandwidth that balances too much smoothness against over-emphasizing random fluctuations of the data with lots of usually random ups and downs in the density curve.