This document is the R interpretation of the more general, conceptual discussion regarding continuous vs categorical variables.
Data Storage Types
Continuous Variables
Within R, continuous variables are stored as numbers, with or without decimal digits.
double. The term double refers to the amount of memory allocated to store the numeric value: double precision, which uses 64 bits per data value.
integer. An integer value is stored as an exact value. Computers do not store numbers as decimal digits but instead as binary digits. Often, the binary representation of a double precision number is not precisely the same value as its decimal digit representation.
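The difference between the two numeric storage types can be checked directly in base R. The following is a minimal sketch; the variable names are illustrative only and do not come from the Employee data.

```r
# Numbers typed without a suffix are stored as double;
# the L suffix requests integer storage.
x_double  <- 7
x_integer <- 7L
typeof(x_double)    # "double"
typeof(x_integer)   # "integer"

# 0.1, 0.2, and 0.3 have no exact binary representation,
# so the computed sum is not exactly equal to 0.3.
sum_equal  <- (0.1 + 0.2) == 0.3                    # FALSE
# all.equal() compares within a small numerical tolerance.
near_equal <- isTRUE(all.equal(0.1 + 0.2, 0.3))     # TRUE

# Show more digits than the default to reveal the discrepancy.
print(0.1 + 0.2, digits = 17)
```

For this reason, compare double values with all.equal() rather than with ==.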
Categorical Variables
character. The term character refers to a variable with alphanumeric data values, that is, text. If numeric, the numbers serve only as labels, not as quantitative information.
integer. The same storage type as for continuous variables, but with a different meaning: the numeric values serve only as labels.
Implementation
To explore these data storage types and their relation to continuous or categorical variables, let’s look at some lessR analysis. Access lessR with library("lessR"). Read the Employee data set, an internal data set downloaded with lessR.
d <- Read("Employee")
Data Types
------------------------------------------------------------
character: Non-numeric data values
integer: Numeric data values, integers only
double: Numeric data values with decimal digits
------------------------------------------------------------
   Variable                   Missing  Unique
       Name       Type  Values  Values  Values   First and last values
------------------------------------------------------------------------------------------
 1    Years    integer      36       1      16   7 NA 7 ... 1 2 10
 2   Gender  character      37       0       2   M M W ... W W M
 3     Dept  character      36       1       5   ADMN SALE FINC ... MKTG SALE FINC
 4   Salary     double      37       0      37   53788.26 94494.58 ... 56508.32 57562.36
 5   JobSat  character      35       2       3   med low high ... high low high
 6     Plan    integer      37       0       3   1 1 2 ... 2 2 1
 7      Pre    integer      37       0      27   82 62 90 ... 83 59 80
 8     Post    integer      37       0      22   92 74 86 ... 90 71 87
------------------------------------------------------------------------------------------
Describe the data values for the variables in a data frame at any time during an R analysis with the lessR function details_brief(), abbreviated db(), passing the name of the data frame as the first parameter value. Read() automatically calls db() after the data are read. For more information, invoke the full version, details().
Given this information, we can interpret the output that the lessR function Read() displays for the data frame. When categorical variables are read into R, their data values are stored either as integers, the R storage type integer, or as text, the R storage type character.
Interpretation of the output of lessR function details().
R does not attempt to classify variables as continuous or categorical, presumably because that task cannot be fully automated. Instead, manually identify the categorical variables by examining the type of variable as read into the system. For integer-valued variables, compare the number of unique values to the total number of values and generally characterize as categorical when the ratio of unique values to the total number is relatively small.
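The comparison of unique values to total values can be computed in base R. The following is a minimal sketch; the small data frame is a hypothetical stand-in for d with illustrative values, and what counts as a "relatively small" ratio remains a judgment call.

```r
# Stand-in for the Employee data frame d (illustrative values only);
# omit this assignment when the Employee data are already in d.
d <- data.frame(
  Plan = c(1L, 1L, 2L, 3L, 2L, 1L, 3L, 2L, 1L, 2L),
  Pre  = c(82L, 62L, 90L, 71L, 55L, 83L, 59L, 80L, 77L, 68L)
)

# Identify the integer-valued variables.
is_int <- vapply(d, is.integer, logical(1))

# For each one, divide the number of unique non-missing values
# by the number of non-missing values.
unique_ratio <- vapply(
  d[is_int],
  function(x) length(unique(na.omit(x))) / sum(!is.na(x)),
  numeric(1)
)
unique_ratio
```

In this stand-in data, Plan has only 3 unique values among 10, which suggests a categorical variable, whereas every value of Pre is distinct, which suggests a continuous variable.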
Analyze Categorical Variables
R Factors
To address these issues, R has a special variable type designed to represent categorical variables.
Before an R analysis begins, convert categorical variables, usually read initially as type integer or character, to formally defined categorical variables of variable type factor.
Data analysis, in general, requires more information for categorical variables. For example, in Python data analysis with Pandas, convert categorical variables to type category. Similar manual transformations must also be accomplished with Tableau.
Express categorical variables as factors with the Base R function factor(). Distinguish between the levels of the categorical variable as they are represented or potentially represented in the data and the labels used to describe those levels in the output of analyses. The two corresponding and appropriately named parameters of the function:
levels parameter: specify the existing and potential data values that define the levels
labels parameter: value labels to attach to the data values for clarifying the output
As explained in the following material, factor() can be called with neither of these parameters, one of them, or both of them.
Accept Existing Levels and Order
Here, use the base R function class() to show the variable type of Gender.
class(d$Gender)
[1] "character"
Although the levels of Gender could be further clarified with the labels Man and Woman, in this data set the levels M and W can be considered sufficiently descriptive. There is no necessary ordering of the levels, so the arbitrary alphabetical ordering of the levels is appropriate. Although Gender could be analyzed as type character without transformation, it is better to pursue consistency and have all character variables defined as factors. In this situation, invoke the factor() function without specifying any parameter values.
Identify the corresponding data frame when referencing a variable, such as the Gender variable, so that R can locate the variable. It is possible, for example, to have multiple current data frames, each with a variable called Gender. Identify the containing data frame by preceding the variable name with the name of the data frame followed by a dollar sign, $.
d$Gender <- factor(d$Gender)
class(d$Gender)
[1] "factor"
After the transformation, the variable Gender in the d data frame is now of variable type factor instead of type character.
Order Character String Levels
To properly order the character string levels of a categorical variable, convert the variable from type character, as initially read into R, to type factor with the factor() function. Use the levels parameter of factor() to list the levels of the categorical variable in the desired order, as they exist, or could exist, in the data.
Replace the current JobSat variable with its factor version. List the levels in the desired order, what can be called their presentation order in subsequent visualizations.
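A minimal sketch of such a transformation follows, assuming the JobSat levels low, med, and high shown in the Read() output. The small data frame is a hypothetical stand-in for d with illustrative values.

```r
# Stand-in for the Employee data frame d (illustrative values only);
# omit this assignment when the Employee data are already in d.
d <- data.frame(JobSat = c("med", "low", "high", NA, "high", "low"),
                stringsAsFactors = FALSE)

# Without the levels parameter, factor() orders the levels
# alphabetically: high, low, med. List them explicitly instead.
d$JobSat <- factor(d$JobSat, levels = c("low", "med", "high"))

levels(d$JobSat)   # "low" "med" "high"
```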
Once converted, there is no need for the original character version of JobSat, so the preceding transformation replaced the original with the factor version. Or, create a new variable in the d frame by entering a new variable name in the left-hand side of the specification before the assignment operator, <-.
The optional ordered parameter for factor() indicates that the levels progress in magnitude from “less” to “more”. Ordering the levels with the levels parameter specifies their intrinsic order. Setting the ordered parameter to TRUE goes further than specifying the presentation order to indicate that the factor variable is an ordinal variable. By default, the value of ordered is FALSE, which indicates a nominal variable. For subsequent data visualizations, ordered factors have different default color palettes than non-ordered factors that reflect the underlying ordering.
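The distinction between a nominal and an ordinal factor can be seen in a short base R sketch. The data values and variable names below are illustrative only.

```r
# Illustrative job satisfaction values.
job_sat <- c("med", "low", "high", "high", "low")

# Same level order, with and without the ordered parameter.
nominal <- factor(job_sat, levels = c("low", "med", "high"))
ordinal <- factor(job_sat, levels = c("low", "med", "high"), ordered = TRUE)

is.ordered(nominal)   # FALSE
is.ordered(ordinal)   # TRUE

# An ordered factor supports magnitude comparisons between levels.
at_least_med <- ordinal >= "med"
at_least_med          # TRUE FALSE TRUE TRUE FALSE
```

The same comparison applied to the nominal factor is not meaningful: R returns NA with a warning.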
Label Integer Values
To label integer values, follow the same general procedure as in the previous example, which transformed a variable of type character into a factor, and add the labels parameter to provide more meaningful value labels. The data values of Plan are the integers 1 through 3, so the levels parameter takes the corresponding integer vector 1:3, an abbreviation for c(1,2,3).
The labels parameter of factor() specifies the value labels to attach to the levels in the data frame for the corresponding categorical variable. In the following function call to factor(), the labels parameter is written underneath the levels parameter to help ensure that the labels match the levels in the desired one-to-one correspondence.
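A minimal sketch of such a call follows. The labels PlanA, PlanB, and PlanC are hypothetical placeholders, since the actual plan names are not given here, and the small data frame is a stand-in for d with illustrative values.

```r
# Stand-in for the Employee data frame d (illustrative values only);
# omit this assignment when the Employee data are already in d.
d <- data.frame(Plan = c(1L, 1L, 2L, 3L, 2L, 1L))

# Write labels directly underneath levels so that each label
# lines up with its level. The labels are hypothetical.
d$Plan <- factor(d$Plan,
                 levels = 1:3,
                 labels = c("PlanA", "PlanB", "PlanC"))

levels(d$Plan)   # "PlanA" "PlanB" "PlanC"
table(d$Plan)    # counts reported by label, not by integer
```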
Verify that the variable type of Plan has changed from integer to a factor.
class(d$Plan)
[1] "factor"
The resulting bar chart contains the more descriptive labels in place of the original integers.
This example ordered the levels in the sequence of 1, 2, and 3 because the levels were listed in that order defined by the vector 1:3. Other vectors could have been entered, such as c(3,1,2), to specify a different order.
Regardless of the specified order of the levels, the labels must be listed in exactly the same order so that each label matches its corresponding level.
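As an illustration of a different level order, the following sketch lists the levels as c(3,1,2) and lists the labels in the matching order. The data values and the labels are hypothetical.

```r
# Illustrative integer codes for a three-level categorical variable.
plan <- c(1L, 1L, 2L, 3L, 2L, 1L)

# Level 3 is listed first, so its label must also be listed first.
reordered <- factor(plan,
                    levels = c(3, 1, 2),
                    labels = c("PlanC", "PlanA", "PlanB"))

levels(reordered)         # "PlanC" "PlanA" "PlanB"
as.character(reordered)   # "PlanA" "PlanA" "PlanB" "PlanC" "PlanB" "PlanA"
```

Each data value keeps its correct label. Only the presentation order of the levels changes.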
The labels applied in this example are attached to integers. The labels parameter can also apply to variables of type character. In that situation, the output displays the original character-valued levels with another set of labels. For example, for the categorical variable Gender, display a data value of M on the output with the value label Man.
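A brief sketch of labeling character values, using the Man label mentioned above together with Woman. The data values are illustrative only.

```r
# Illustrative character data values.
gender <- c("M", "M", "W", "W", "W", "M")

# Map each existing level to a more descriptive value label.
gender_f <- factor(gender,
                   levels = c("M", "W"),
                   labels = c("Man", "Woman"))

table(gender_f)   # counts are reported under Man and Woman
```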
Add Levels Beyond the Data
We have modified Gender from the original data table, so re-read.
d <- Read("Employee", quiet=TRUE)
The following lessR function call to pivot() generates the frequency table for Gender of the data unmodified as read into the R d data frame.
pivot(d, table, Gender)
Gender n Prop
1 M 18 0.486
2 W 19 0.514
The possible values for Gender are M, W, and O for other. However, in this small data set, there were no responses for O, so the frequency table and corresponding bar chart cannot show a count of a value that does not exist in the data. Fortunately, factor() can include all possible data values, not just those that occurred. Define all possible levels of the categorical variable in their desired presentation order. In this example, also provide the optional labels for greater clarification of each level’s meaning.
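A minimal sketch of such a definition follows, using the levels M, W, and O with the labels Man, Woman, and Other. The small data frame is a hypothetical stand-in for d with illustrative values, and the base R function table() stands in for the frequency table.

```r
# Stand-in for the Employee data frame d (illustrative values only);
# omit this assignment when the Employee data are already in d.
d <- data.frame(Gender = c("M", "M", "W", "W", "W", "M", "W"),
                stringsAsFactors = FALSE)

# O never occurs in the data but is declared as a possible level.
d$Gender <- factor(d$Gender,
                   levels = c("M", "W", "O"),
                   labels = c("Man", "Woman", "Other"))

table(d$Gender)   # Other is listed with a count of 0
```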
Now that O is defined as a level with the corresponding label Other for the Gender variable, the value of Other is included in the analysis output even though it never occurs in the data.