Doing data analysis involves analyzing variables. We use the term “variables” because their data values vary. Co-variability, or relatedness, of two or more variables is the topic addressed here.
The analysis of variability applies to the values of a single variable. The analysis of co-variability addresses the extent to which the values of two or more variables vary together.
The methods for analyzing variability and co-variability differ for categorical and continuous variables. Here we analyze continuous variables.
Continuous Variables
The Scatterplot
Do the values of two variables tend to change together or separately?
As the values of one variable increase, the values of the other tend to either increase or decrease.
Two continuous variables are related if, as the values of one variable increase, the values of the other variable tend to increase or decrease systematically. Relationships can be positive or negative.
The values of both variables tend to increase together.
For a positive relationship, the values of both variables tend to increase together. The more Years worked at the company, the higher, on average, is a person’s salary.
As the values of one variable increase, the values of the other variable tend to decrease.
In a negative relationship, the values of the variables tend to move in opposite directions. For example, the more a student is absent from class, on average, the lower the student’s grade.
Positive Relationship Example
To illustrate a positive relationship, consider the Employee data set once again. What is the relationship, if any, between the number of years employed at the company and salary?
A scatterplot displays the values of two selected variables for each row of data. A two-variable scatterplot plots one variable on each axis. The scatterplot in Figure 1 shows that more years employed tends to be associated with a higher salary. Each plotted point represents one employee’s data values for Years and Salary. There are 36 employees with data values for both variables, so the scatterplot consists of 36 points.
XY(Years, Salary)
Note: Interact with colors and other parameters by entering interact("scatterplot").
The scatterplot in Figure 1 indicates a strong, linear relationship. As the number of Years employed increases, the annual USD Salary also tends to increase. This notion of a relationship leads to one of the essential concepts in all of statistics and data analysis, a concept that also underlies the modern pursuit of machine learning.
Two variables are related to the extent that knowledge of the value of one variable provides information regarding the value of the second variable.
Enhance the Scatterplot
The stronger the relationship, the more information the value of variable \(x\) provides regarding the value of variable \(y\). In this example, knowing how many years an employee has worked at the company provides a more accurate estimate of that employee’s salary than not knowing. To enhance our visual interpretation of the scatterplot, consider the 95% data ellipse and the best-fit line.
Contains, on average, across multiple samples, 95% of the points in a sample scatterplot of two normally distributed variables.
XY(Years, Salary, ellipse=0.95, fit="lm")
ellipse: Specify a proportion, usually close to 1, that sets the extent of the data ellipse superimposed on the scatterplot.
fit: Specify the type of line fit through the scatterplot, such as "lm" for linear model, that is, the straight (least-squares) linear regression line. Other values include "exp" for the best exponential curve fit, "log" for the best logarithmic fit line, "quad" for a quadratic fit, and "loess" for the best, general non-linear fit line.
Figure 2 illustrates the same scatterplot from Figure 1, but here with the 95% data ellipse included. A data ellipse can be specified for any percentage, though 95% is the typical value.
Building on the information provided by the ellipse, we can illustrate the strength of the relationship between Years and Salary by how predictable Salary is given Years. Figure 3 highlights the section of the scatterplot that applies to the value of 10 years.
Overall, Salary ranges from a low of $46,124.97 to a high of $134,419.20, a range of $88,294.23. However, for employees who have worked 10 years at the company, the 95% expected range of salaries, read directly from Figure 3, is about $49,000 to $102,000, a considerably reduced range of $53,000.
Knowing the value of variable \(X\), here Years employed, provides information about the value of variable \(Y\), here Salary. Although knowing \(X\) does not determine \(Y\) exactly, it does narrow the range of likely values of \(Y\). As is almost always the case, aside from trivial transformations such as converting inches to centimeters, that information is not perfect. The stronger the relationship, the narrower the enclosing ellipse and, therefore, the more information available about the value of \(Y\).
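As a rough empirical check of this narrowing, the overall and conditional salary ranges can be computed directly in base R. This sketch assumes the Employee data have been read into a data frame named d with variables Years and Salary; note that the ellipse-based interval in Figure 3 is read from the plot, while the code below reports the raw sample ranges.

```r
# Assumes a data frame d with variables Years and Salary
range(d$Salary, na.rm=TRUE)                  # overall salary range
diff(range(d$Salary, na.rm=TRUE))            # width of the overall range

# Empirical salaries for employees with exactly 10 Years employed
range(d$Salary[d$Years == 10], na.rm=TRUE)
```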
Contour Curves
Density
When we view a scatterplot, what are we trying to understand? We want to know how often different combinations of values of two paired variables, \(x\) and \(y\), occur together. This is the same question we asked regarding two categorical variables, where the answer was their joint frequency table and its associated visualizations. The same principle applies to two continuous variables, except that continuity changes how we answer the same question.
Consider the scatterplot in Figure 1, which displays years employed and annual salary for the sampled employees. The measured values are necessarily discrete. The values of years are rounded to the nearest integer. Yet time flows continuously. Time of employment could in principle be recorded as 5 years, or 5.4 years, or 5.37 years. We could round years to the nearest year and salary to the nearest $10,000 and construct a joint frequency table, but any such table would be an artifact of the rounding chosen, not a faithful representation of the underlying relationship. Between any two discrete levels we record, infinitely many values exist. The challenge, then, is to express the analog of joint frequency in the context of continuity.
To answer the question of how often different combinations of paired values occur for sampled values of categorical variables, we turned to a third variable, the joint frequencies, presented in a cross tabulation table. For continuous variables, the third variable analogous to joint frequency is their bivariate density, \(z\), which can be represented as a smooth surface over the continuous values of \(x\) and \(y\). The plot extends into three dimensions, with \(z\) represented as height.
The likelihood of observing a random data value close to a given coordinate, a paired value of \(x\) and \(y\).
The key realization is that the two-dimensional scatterplot can be understood as a view of an underlying three-dimensional density structure. Plot the usual \(x\)-\(y\) coordinate system, and then, over each point, plot its density. The result is typically a mound or mountain shaped three-dimensional object that more fully shows the relationship between the \(x\) and \(y\) variables.
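One way to sketch this three-dimensional density mound in base R uses two-dimensional kernel density estimation with kde2d() from the MASS package, which ships with R. The sketch again assumes the d data frame with variables Years and Salary.

```r
library(MASS)    # kde2d(): two-dimensional kernel density estimation

# kde2d() requires complete pairs, so drop rows with a missing value
ok <- complete.cases(d$Years, d$Salary)
dens <- kde2d(d$Years[ok], d$Salary[ok], n=50)

# Perspective plot of the estimated density surface over the x-y floor
persp(dens$x, dens$y, dens$z, theta=35, phi=30,
      xlab="Years", ylab="Salary", zlab="Density")
```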
Contour Curves of Density
How can we visualize a three dimensional structure on a two-dimensional sheet of paper or a computer screen? Ask a backpacker. Backpackers rely on topographic maps to find their way in the back country. A hypothetical example of a topographic map based on the corresponding 3-D mountain appears in Figure 5. Each curve in the topographic map is a curve of constant elevation.
A two-dimensional curve that connects points of equal value, such as equal density.
A contour plot is a set of contour curves that helps visualize a three-dimensional surface in two dimensions. The lines on a topographic map are contour curves. Visually, a contour curve is a slice through a three-dimensional surface parallel to the “floor”, the region defined by the \(x\)- and \(y\)-axes. Each contour curve connects points of equal density, much like a topographic map shows lines of constant elevation. However, there is a crucial distinction between the contour curves displayed here in Figure 6 and those from the idealized bivariate normal distribution shown in Figure 2. The contour curves displayed here follow the shape of the data without imposing bivariate normality.
XY(Years, Salary, type="contour")
type: Parameter to indicate the type of plot for the relationship of two continuous variables.
The contour plot generalizes beyond the points of a specific scatterplot obtained from a single sample to a more general configuration. The employees in the scatterplot may constitute the entire workforce of a particular company at a particular time, but, as with all data, they are better understood as a sample from a larger population. Another sample of employees could have been hired, or could be hired. We want to estimate not just how many of the sampled employees fall near any given combination of values, but how likely such a combination would be to occur in the underlying population. A different sample of employees from the same population would produce a different scatterplot, but the population remains the same, and, short of extreme sampling error, so does the displayed relationship.
We can also visualize the specific points from which the contour plot was estimated, providing both the estimated contour curves, indicated by regions of higher to lower density, and the points from which those contours were estimated.
Including the scatterplot superimposed on the contour curves further explains the relationship between the specific sample data and the estimated population relationship. Moreover, the dual plot clarifies the meaning of the contour curves by illustrating the generalization from the specific to the general. For example, when a data analyst presents such a plot in a business meeting to explain the relationship between years employed and salary, the audience likely has little experience with contour curves in this context, but likely does understand scatterplots. Presenting both together provides an intuition for the meaning of the contour curves.
XY(Years, Salary, type="contour", contour_points=TRUE)
contour_points: When set to TRUE, plot the scatterplot points as well as the contour curves.
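A hedged base-R analogue of these two displays, assuming the d data frame and using kde2d() from the MASS package for the density estimate:

```r
library(MASS)    # kde2d(): two-dimensional kernel density estimation

ok <- complete.cases(d$Years, d$Salary)
dens <- kde2d(d$Years[ok], d$Salary[ok], n=50)

# Contour curves of the estimated bivariate density
contour(dens$x, dens$y, dens$z, xlab="Years", ylab="Salary")

# Superimpose the sample points from which the contours were estimated
points(d$Years[ok], d$Salary[ok], pch=21, bg="gray")
```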
For such a small sample, the contour plots in Figure 6 and Figure 7 are surprisingly effective at visualizing the relationship between Years employed and Salary without plotting the individual points and without assuming the bivariate normality of the data ellipse. The plot generalizes the relationship to all intermediate values from only 36 points on which to base the estimation.
Ellipse as a Contour Curve
The data ellipse already presented is another example of a contour curve, but a special one based on an idealized bivariate normal distribution. For a bivariate normal distribution, regions of equal Mahalanobis distance from the center form ellipses. These ellipses are also contours of equal density for that distribution.
The 95% data ellipse is a descriptive summary of the bivariate data cloud, shown in Figure 2. Under bivariate normality, it represents a region expected to contain about 95% of the data values sampled from that distribution. For a particular sample, the actual percentage of values inside the ellipse will usually be close to, but not exactly, 95%.
The size of the ellipse depends on the chosen probability level. Larger probability levels produce larger ellipses. Accordingly, the 0.95 data ellipse encompasses more of the data cloud than the corresponding 0.68 ellipse. The usual choice is 0.95. For a population correlation of \(\rho=0.00\), the ellipse is a circle if the variables are expressed on the same scale and in the same physical units.
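The link between the probability level and the ellipse size follows from the chi-square distribution with 2 degrees of freedom: under bivariate normality, the p-level ellipse extends to a Mahalanobis distance of \(\sqrt{\chi^2_{2,p}}\) from the center. A quick check in R:

```r
# Mahalanobis radius of the p-level ellipse for a bivariate normal
sqrt(qchisq(c(0.68, 0.95), df=2))   # about 1.51 and 2.45
```

The 0.95 ellipse is thus more than half again as wide as the 0.68 ellipse, consistent with its larger coverage of the data cloud.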
Correlation Coefficient
The most widely encountered correlation coefficient is the Pearson product-moment correlation coefficient, or, more simply, the Pearson correlation, denoted \(r_{xy}\). One feature of the Pearson correlation is that it is invariant to a change in units. Measure height in inches or measure height in centimeters. The relationship between height and weight is the same regardless of the arbitrary measurement units. Accordingly, the Pearson correlation of height with weight is the same in either case.
A correlation of +1 denotes a perfect positive association, with all points falling on a straight line. “Perfect” means that if the value of \(X\) is known, the relationship provides the precise value of \(Y\). A correlation of -1 indicates a perfect negative relationship. A correlation of 0 indicates no linear relationship between the two variables.
The size of the correlation indicates its magnitude. The closer the value is to +1 or -1, the stronger the linear relationship. The direction of the relationship is indicated by the sign of the coefficient, + or -.
Strength and direction are two independent concepts for evaluating a linear relationship with a correlation coefficient. For example, a correlation of -0.7 indicates a stronger linear relationship than a correlation of 0.5.
Linear, or straight-line, relationships are perhaps the most common, but not the only type of relationship.
Variables can be strongly non-linearly related, such as a U pattern, and yet correlate near 0.0, so always examine the scatterplot for linearity before interpreting a correlation coefficient.
There are several types of correlation coefficients, not covered here, that generalize beyond linear relationships.
XY(Years, Salary) also yields text output, which includes the sample correlation coefficient. (Also provided are a hypothesis test and a confidence interval that generalize to the population.)
--- Pearson's product-moment correlation ---
Number of paired values with neither missing, n = 36
Sample Correlation of Years and Salary: r = 0.852
Hypothesis Test of 0 Correlation: t = 9.501, df = 34, p-value = 0.000
95% Confidence Interval for Correlation: 0.727 to 0.923
The correlation between Years employed and annual Salary in USD from the scatterplot in Figure 1 is high (r=0.85), indicating a strong linear relationship.
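The same statistics can be verified with base R’s cor() and cor.test(), again assuming the d data frame:

```r
# Sample Pearson correlation; use only pairs with neither value missing
cor(d$Years, d$Salary, use="pairwise.complete.obs")

# t-test of zero population correlation plus 95% confidence interval
cor.test(d$Years, d$Salary)
```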
Add a Third Variable
By including a third variable, the visualization of the relationship between two continuous variables can provide more information. This additional variable may be categorical or continuous. Both possibilities are discussed next.
Stratification
Including one or more categorical variables in the visualization can enhance the analysis of the relationship between two continuous variables. By examining the relationship at different levels of a categorical variable, stratification facilitates comparison across groups.
On the Same Panel
Points in a scatterplot can be plotted with different colors and/or plotting symbols according to the values of a third variable, here a categorical variable. Figure 8 shows the same scatterplot as Figure 1, except that the plotted points for the strata of men and women are differentiated by color.
XY(Years, Salary, by=Gender, fit="lm")
by: Specify a categorical variable by which to stratify the scatterplot according to its levels, plotting the points of the same level in the same, unique style on the same panel.
A plotted point displayed as a small, colored circle is only one possibility. Any character, letter of the alphabet, or digit can be plotted as the shape of each point. Most visualization systems also offer special plotting symbols with interiors that can be filled with color. The default symbol is usually a small circle. Figure 9 illustrates other possibilities, in which points from different levels of the categorical variable are represented by different colors and shapes.
Plot(Years, Salary, by=Gender, shape=c("triup", "tridown"))
c(): The combine function that groups multiple values together into a single unit, a vector.
shape: The plotting symbol. If stratifying on a categorical variable, list one symbol for each level, in the order in which the levels of the categorical variable are defined.
Note: Besides the circle, other symbols that can be filled with color are the square, diamond, bullet, triup, and tridown. To view more available symbols for plotting points, access the points manual, enter: ?points.
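In base R graphics, the equivalent stratified plot maps each level of Gender to a pch plotting symbol, here the fillable triangles (pch 24 and 25). The sketch assumes the d data frame with a two-level Gender variable; the colors are illustrative choices.

```r
g <- factor(d$Gender)

# Index the symbol and fill vectors by the factor level of each point
# Fillable symbols: 21 circle, 22 square, 23 diamond, 24 triup, 25 tridown
plot(d$Years, d$Salary,
     pch=c(24, 25)[g], bg=c("steelblue", "darkorange")[g],
     xlab="Years", ylab="Salary")
legend("topleft", legend=levels(g), pch=c(24, 25),
       pt.bg=c("steelblue", "darkorange"))
```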
Interpretation
At this company, three of the top four salaries are held by men, and three of the bottom four salaries are held by women. The scatterplot also reveals that women tend to be concentrated at the lower end of the number of Years employed. The eight employees with the least Years of employment are all women.
On Different Panels
The previous example stratified on a categorical variable by plotting all the points on the same panel, that is, the same set of coordinate axes. The alternative plots the points for each level on a separate panel, the Trellis plot.
For categorical variables with four or more levels, a Trellis plot is more readable than plotting points and fit lines for all levels on the same panel.
Figure 10 illustrates the Trellis scatterplot of Years and Salary stratified by the categorical variable Gender.
XY(Years, Salary, facet=Gender, fit="lm")
facet: Specify the categorical variable by which to stratify the scatterplot according to its levels, with each group (strata) plotted on a different panel, a Trellis plot.
Note: By default, the panels are displayed in a single column. To create a Trellis plot with the panels displayed in a single row, set the parameter n_row to 1. Specify the desired number of rows or columns with n_row or n_col, respectively.
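The Trellis idea originated with the lattice package, which also ships with R. A hedged lattice equivalent of the faceted scatterplot, assuming the d data frame:

```r
library(lattice)   # Trellis graphics for R

# One panel per level of Gender: points ("p") plus least-squares line ("r"),
# arranged in a single column (1 column, 2 rows)
xyplot(Salary ~ Years | Gender, data=d, type=c("p", "r"), layout=c(1, 2))
```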
Combine Same and Different Panels
We can also combine both forms of stratification for traditional scatterplots.
XY(MidPrice, HP, facet=Cylinders, by=Source)
Continuous
The previous examples in this section mapped a categorical third variable to one or more visual aesthetics. Another possibility is to introduce a continuous third variable. Figure 12 illustrates a scatterplot of car prices versus horsepower. Bubbles replace the standard small filled circles. The size of each bubble indicates the car’s fuel mileage, expressed as miles per gallon.
XY(MidPrice, HP, pt_size=MPGcity)
pt_size: Specify the size of the plotted points, with a default value of 1. If a variable, then the points are represented as bubbles, their sizes determined by the value of the specified variable.
Note: Additional parameters.
radius: Specify the size of the largest bubble.
power: Specify the relative sizes of the bubbles. Larger values provide greater separation between bubble sizes.
transparency: To at least partially overcome the problem of over-plotting, specify the transparency level as a proportion that varies from 0, no transparency, to 1, complete transparency.
The bubbles in the Figure 12 scatterplot have power set at 1.5 and transparency set at 0.5.
XY(MidPrice, HP, pt_size=MPGcity, power=1.5, transparency=.5)
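Base R draws bubble plots with symbols(). Scaling the circle radius by the square root of the third variable makes bubble area, rather than radius, proportional to MPGcity, which reads more accurately. A sketch, assuming a data frame d with the MidPrice, HP, and MPGcity variables:

```r
# Bubble area proportional to city fuel mileage; inches= sets the largest bubble
symbols(d$MidPrice, d$HP, circles=sqrt(d$MPGcity), inches=0.2,
        fg="gray30", bg=adjustcolor("steelblue", alpha.f=0.5),
        xlab="MidPrice", ylab="HP")
```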
Interpretation
Inexpensive cars tend to have less horsepower but better fuel mileage. Relatively expensive cars offer more horsepower but considerably worse fuel mileage. Moreover, obtaining more than the minimal horsepower of about 100 leads to a rapid drop-off in fuel mileage. The car buyer must choose: economy with modest power, or quick acceleration and higher speeds, at the cost of both a higher price and lower fuel mileage.
Big(er) Data
What happens when there are many data values to plot in a scatterplot, the situation described somewhat colloquially as big data? With big data, there may be thousands, hundreds of thousands, or millions of data values to plot. A problem encountered in scatterplots with many data points is overlap, as illustrated in Figure 13 with 5000 pairs of simulated data points.
If the data values are not in a data frame, then add data=NULL to the XY() or related function call.
XY(x, y, fit="lm", transparency=.95)
transparency: Specify the amount of transparency of the interior of plotted points as a proportion from 0, no transparency, to 1, complete transparency.
One reasonable solution is plotting smaller-sized points, shown in Figure 15.
XY(x, y, fit="lm", pt_size=.25)
pt_size: Specify the size of a plotted point, any value greater than 0. The setting is 0.25 in Figure 15.
A solution that works well is smoothing.
Transform a scatterplot of plotted points into a two-dimensional smoothed surface.
Analogous to transforming a histogram into a smoothed density curve, a two-dimensional scatterplot can be transformed into a smoothed two-dimensional surface. Outliers, however, are still plotted as individual points because they provide useful information.
XY(x, y, fit="lm", type="smooth", data=NULL)
type: Set to “smooth” to smooth the scatterplot.
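Base R provides smoothScatter() for exactly this purpose: it replaces the point cloud with a smoothed two-dimensional density while still plotting the most outlying points individually. A self-contained sketch with simulated data:

```r
set.seed(1)
x <- rnorm(5000)
y <- 0.7*x + rnorm(5000)

# Smoothed density representation; nrpoints= outlying points remain visible
smoothScatter(x, y, nrpoints=100)
abline(lm(y ~ x), col="gray40")
```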
Patterns of Correlations
Scatterplot Matrix
Supervised machine learning constructs predictive models from information contained in multiple variables. For example, how do the choice of advertising media and the number of advertisements affect sales revenue? There are thousands of additional examples of these predictive models applied daily across the business spectrum.
The key to building these models is to leverage the relationships between variables. Of particular interest are variables related to the target variable, the value to be predicted from a set of other variables called predictor variables or features. Model building, then, begins with understanding the relationships among all the variables in a data set.
A square matrix with a scatterplot for each pair of variables in the lower-triangular part of the matrix and, here, the corresponding correlations in the upper-triangular part of the matrix.
Figure 17 shows the scatterplot matrix of the four continuous variables in the d data frame, the Employee data set. Each scatterplot in the matrix corresponds to a specific correlation in the upper triangle of the matrix. For example, in the first row, the correlation of Salary and Years is 0.85, corresponding to the first scatterplot in the first column, with the strong least-squares line of best fit, the regression line.
XY(c(Salary, Years, Pre, Post), fit="lm")
x: To obtain the scatterplot matrix, specify only one expression, in the first position, the x parameter. The expression is a vector of variables built with the R combine function, c(). List the relevant variables within the combine function, separated by commas.
In this 4x4 matrix, Salary and Years correlate strongly with each other, r=0.85. Variables Pre and Post also correlate strongly with each other, r=0.91. The variables in each paired set do not correlate with the variables in the other paired set. For example, Salary only correlates r=.03 with Pre.
Interpretation. If we were to build a predictive model of an employee’s salary from the other three variables, only the number of years employed would be a useful predictor. Scores on the pre-test before instruction on some topic and the post-test after the instruction are not related to salary and so would not be effective predictors of that variable.
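Base R offers the same two views: pairs() for the scatterplot matrix and cor() for the correlation matrix, again assuming the d data frame:

```r
vars <- c("Salary", "Years", "Pre", "Post")

# Scatterplot for every pair of the selected variables
pairs(d[, vars])

# Corresponding correlation matrix, rounded for readability
round(cor(d[, vars], use="pairwise.complete.obs"), 2)
```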
Heat Map
The heat map replaces individual correlation coefficients with colored squares. An example appears in Figure 18.
d <- d[, c("Salary", "Years", "Pre", "Post")]
mycor <- cor(d, use="pairwise.complete.obs")
heatmap(mycor, symm=TRUE, Rowv=NA, Colv=NA)
To obtain the heat map of the correlation matrix, first compute the correlation matrix, here stored in the object named mycor. The matrix is computed with the R function cor(). The function does not automatically filter out the non-numeric variables in the input data frame, here the d data frame of the Employee data set. Select the relevant numeric variables manually with the code between the square brackets [ ], then pass the smaller data frame to the cor() function. The code is a bit technical and not obvious without a background in subsetting data frames. However, to apply it to another data table, use the same form, changing only the data frame name, if needed, and the selected variable names.
use: How to deal with missing data when computing the correlations. The value "pairwise.complete.obs" is a good choice unless there is considerable missing data for a variable, in which case that variable should not be selected for computing the corresponding correlations.
Note: The R heatmap() function then processes the stored correlation matrix mycor. The symm parameter set to TRUE informs the function that the input matrix is symmetric, which is true of a correlation matrix. Setting the Rowv and Colv parameters to NA suppresses the row and column dendrograms and the accompanying reordering of the variables.
The heat map presents the same information as the numerical correlation matrix, but with colored squares whose intensity indicates the corresponding correlation. Using the color scheme from Figure 18, the deeper the red color, the stronger the correlation. The paler the yellow color, the weaker the correlation. Salary and Years correlate strongly with each other, as do Pre and Post, so these respective correlations are indicated by dark red in the heat map. Other correlations are weak, marked by yellow colors.