Base R functions are accessed simply by opening the R app, whether by itself or withing RStudio.

Access lessR functions.

library(lessR)

## 
## lessR 4.2.6                         feedback: gerbing@pdx.edu 
## --------------------------------------------------------------
## > d <- Read("")   Read text, Excel, SPSS, SAS, or R data file
##   d is default data frame, data= in analysis routines optional
## 
## Learn about reading, writing, and manipulating data, graphics,
## testing means and proportions, regression, factor analysis,
## customization, and descriptive statistics from pivot tables.
##   Enter:  browseVignettes("lessR")
## 
## View changes in this and recent versions of lessR.
##   Enter: news(package="lessR")
## 
## **Newly Revised**: Interactive data analysis.
##   Enter: interact()

Access tidyverse functions, particularly those from the dplyr package, the primary source of tidyverse data manipulation functions. The tidyverse consists of many packages, usually all loaded at once.

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ dplyr::recode() masks lessR::recode()
## ✖ dplyr::rename() masks lessR::rename()

Read the data, here from a lessR built-in data table, so only need supply the name of the data table.

d <- Read("Employee")

## 
## >>> Suggestions
## Details about your data, Enter:  details()  for d, or  details(name)
## 
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
## 
##     Variable                  Missing  Unique 
##         Name     Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1     Years   integer     36       1      16   7  NA  7 ... 1  2  10
##  2    Gender character     37       0       2   M  M  W ... W  W  M
##  3      Dept character     36       1       5   ADMN  SALE  FINC ... MKTG  SALE  FINC
##  4    Salary    double     37       0      37   53788.26  94494.58 ... 56508.32  57562.36
##  5    JobSat character     35       2       3   med  low  high ... high  low  high
##  6      Plan   integer     37       0       3   1  1  2 ... 2  2  1
##  7       Pre   integer     37       0      27   82  62  90 ... 83  59  80
##  8      Post   integer     37       0      22   92  74  86 ... 90  71  87
## ------------------------------------------------------------------------------------------

head(d)

##                  Years Gender Dept    Salary JobSat Plan Pre Post
## Ritchie, Darnell     7      M ADMN  53788.26    med    1  82   92
## Wu, James           NA      M SALE  94494.58    low    1  62   74
## Downs, Deborah       7      W FINC  57139.90   high    2  90   86
## Hoang, Binh         15      M SALE 111074.86    low    3  96   97
## Jones, Alissa        5      W <NA>  53772.58   <NA>    1  65   62
## Afshari, Anbar       6      W ADMN  69441.93   high    2 100  100

The data will be re-read many times for the following analyses, each time to provide a clean copy of the data for another manipulation. The additional statements for reading are hidden so as to not clutter the output.

The presentation of data manipulation content begins with the simpler 1-dimensional data structure, the vector. Building on those concepts, the material is extended to the primary two-dimensional data structure, the data frame.

There are more than three types of data structures in data analysis systems, but the three described here are the most basic: scalars, vectors, and data frames. The following material shows how to create and do some operations on these data structures with R.

Scalars and Vectors

Scalar

Scalar: A single value.

Here, enter a scalar such as the number 5 into the R console. The [1] in the output indicates the first line of output. In these brief examples, there is only one line.

## [1] 5

Or, set the contents of a variable to the number 5, then display the contents by listing the variable name.

x <- 5
x

## [1] 5

Character values are set in quotes when setting a value.

MyGender <- "Male"
MyGender

## [1] "Male"

R serves as a high-level, free, calculator.

y <- x^2 + 100
y

## [1] 125

Vectors

When working with data, it is often needed to specify a collection of multiple values. The values could be data values or values passed to a function to influence its output, such as a list of various colors for the bars in a bar graph.

Vector: A variable that consists of multiple values.

The most general form for specifying an R vector is the c() function. For example, define a vector of the first five integers beginning with 1.

c(1,2,3,4,5)

## [1] 1 2 3 4 5

When a vector is created its values are assigned to a specific variable or parameter. Here, assign the values of the vector to any chosen variable, such as x, as in this example.

x <- c(1,2,3,4,5)
x

## [1] 1 2 3 4 5

For the special case of sequential values, two shortcuts are available. One shortcut is the : notation for sequential integers.

1:5

## [1] 1 2 3 4 5

The seq() function defines any regular sequence of values. To use seq(), specify three values, the beginning value, the ending value, and the interval between the values, the parameter by.

seq(1, 5, by=1)

## [1] 1 2 3 4 5

We also can have vectors with character values.

G <- c("Male", "Female", "Other")
G

## [1] "Male"   "Female" "Other"

One common use of vectors is passing multiple values to the parameter of a function. Here is an example bar chart with customized colors instead of relying upon the default colors. Each of the two bars is specified to display with a different color by setting the fill parameter to a vector. Each bar has a different border color, specified with the color parameter, again set to a vector, one color per bar.

d <- Read("Employee", quiet=TRUE)

BarChart(Gender, fill=c("darkseagreen3", "darkgoldenrod3"), 
                 color=c("darkseagreen4", "darkgoldenrod4"), quiet=TRUE)

Another example of an R vector at least implicitly present in every data analysis is the vector of variable names for a data frame, which contains the data for analysis. The data frame is a two-dimensional table, but the names of the variables in the data frame, obtained from the Base R names() function, is a vector. In this example we read the into the data frame named d.

names(d)

## [1] "Years"  "Gender" "Dept"   "Salary" "JobSat" "Plan"   "Pre"    "Post"

Sometimes it is useful to know if a specific value is located in a vector.

%in% operator: Identify if a value is contained within a vector.

For example, consider the predefined character vector called letters that contains the 26 lowercase letters of the Roman alphabet. List all 26 letters by entering the name of the vector.

letters

##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u"
## [22] "v" "w" "x" "y" "z"

Now test if a specific character value is in the vector.

"f" %in% letters

## [1] TRUE

"5" %in% letters

## [1] FALSE

Base R Subset Operator

Subsetting is an essential data preparation process. You have a collection of data values and you only want to analyze some of them, a subset of the original data.

To subset, we need to know the position of the elements in the data structure, such as data values, that we wish to extract. For a vector, identify the position by its ordinal position, 1 for the first value, ten for the tenth value, etc.

Index: A number that indicates the ordinal position of a data value within a data structure.

The concept of an index necessarily applies to any data structure, such as found in R and Python/Pandas. To subset a vector, specify the desired indices within the square brackets, [ ] immediately after the vector’s name.

[ ]: With the specified index or indices inside, refers to a subset of a data structure (in both R and Python/Pandas), the subsetting operator.

For example, refer to just the first 5 letters of the letters vector with the subsetting operator. Here indicate to extract just those values in the range from 1 to 5, that is, the first 5 lowercase letters of the alphabet.

letters[1:5]

## [1] "a" "b" "c" "d" "e"

To extract the first 5 and the last 5 letters, use the c function to define a single vector that includes both specifications.

letters[c(1:5, 22:26)]

##  [1] "a" "b" "c" "d" "e" "v" "w" "x" "y" "z"

Use the - sign to indicate not in reference to a list of indices. Here extract all the letters that do not have indices 1 through 5 or 22 through 26.

letters[-c(1:5, 22:26)]

##  [1] "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u"

Consider another vector, the names of the variables in the d data frame. Refer to the name of the second variable as the second element of the Base R names() function applied to the d data frame.

names(d)[2]

## [1] "Gender"

Base R which() Function

The Base R which() function forms the basis for many of the Base R data manipulations. Specify a logical condition and see which values of the data structure satisfy the condition. For example, which numbers in a vector are even? Which column of a data frame is named Gender?

which() function: Return the indices of a data structure that correspond to TRUE according to a specified logical condition.

Calling which() to process a vector of logical values results in index numbers as if the index numbers were entered directly.

To illustrate, specify a logical expression and then apply to one element. Which of the letters in the alphabet is the letter c? That is, where is c located in the letters vector? Of the 26 values of the letters vector, only one is equal to “c”.

The double equals sign, ==, specifies a logical comparison. It compares two entities and returns TRUE if equal and FALSE if not. With R we can evaluate all the elements of a vector at once simply by specifying the vector, in this example, the 26 lowercase letters of the alphabet. R will take the first element of the vector letters, which is “a” and evaluate if it is “c”, then the second element, and so on through all 26 letters.

The corresponding logical vector that evaluates each letter for this equality consists of 25 FALSE values and 1 TRUE value. In this example, the third element of the resulting logical vector has the value of TRUE.

letters == "c"

##  [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [15] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

To get the index of “c”, use which(). As indicated, the function evaluates all the values of the resulting vector of TRUE and FALSE values, and returns the indices of the TRUE values.

which(letters == "c")

## [1] 3

Then can use the indices output of which() to subset. Again, the purpose of the which() function is to get the index value. Because the expression which(letters == "c") reduces to the number 3, the following is the same as entering letters[3].

my_c <- letters[which(letters == "c")]
my_c

## [1] "c"

To reference more than a single value use the R operator %in% followed by a vector with multiple values.

which(letters %in% c("a", "c", "e"))

## [1] 1 3 5

Or, preface the entire expression with a - sign to indicate the corresponding negative values.

-which(letters %in% c("a", "c", "e"))

## [1] -1 -3 -5

Doing a subset with the negative values return all the values not specified. The following is equivalent to entering letters[c(-1, -3, -5)], which subsets all values of the letters vector that are not the first, third, and fifth values in the vector.

letters[-which(letters %in% c("a", "c", "e"))]

##  [1] "b" "d" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x"
## [22] "y" "z"

Data Frame Subsetting Operations

The fundamental data structure that stores the data table, which contains the variables for analysis, in both R and Python/Pandas is the data frame.

Data frame: Collection of vectors, one for each column (variable) in the data frame, all of the same length.

Every data analysis software contains functions to implement these data manipulations. Even within a single set of analysis software, such as R, there are multiple ways to accomplish each manipulation.

General Extract Function

The 2-dimensional data frame has both rows and columns. Accordingly, to identify a subset of a data frame we need to identify rows and columns to be included (or excluded) from the data frame. The general form of the subsetting expression for the data frame name d:

d[rows, columns]

The concept is to follow the same logic for vectors but apply to rows and/or columns. The rows specifications come before the comma, and the columns specifications come after the comma. An important convention for the subsetting of 2-dimensional tables such as data frames is that if no rows or no columns are specified then all rows or all columns are indicated.

Express the rows and columns of the Extract function d[row,cols] (for data frame d) with either indices or logical expressions.

An example of a set of indices is 1:3, the first three. An example of a logical expression is Gender=="Male", where the two equal signs indicate to check for equality of existing values.

Subset Data Frames by Index

For example, entering the name of an R object, such as the data frame d, implicitly invokes the R print() function, which displays the entire contents of the object, such as a data frame, at the console. Using the subset notation, selected pieces of the object instead can be displayed. In this example, display the first three rows of data for all the variables.

d[1:3,]

##                  Years Gender Dept   Salary JobSat Plan Pre Post
## Ritchie, Darnell     7      M ADMN 53788.26    med    1  82   92
## Wu, James           NA      M SALE 94494.58    low    1  62   74
## Downs, Deborah       7      W FINC 57139.90   high    2  90   86

Here display the first two rows of data for the first, second, third, and sixth variable.

d[1:2, c(1:3, 6)]

##                  Years Gender Dept Plan
## Ritchie, Darnell     7      M ADMN    1
## Wu, James           NA      M SALE    1

Here explicitly refer to the column numbers for which to delete the variables Pre and Post, where Pre and Post are in Columns 7 and 8. Retain all the rows of data.

d <- d[, -(7:8)]

With the Base R subsetting, directly specify the rows to retain or delete. The nrow() function specifies the number of rows of data, which, in this context, refers to the last row of data.

nrow(d)

## [1] 37

So (nrow(d)-5):nrow(d) gives the indices of the last five rows of the data frame.

(nrow(d)-5):nrow(d)

## [1] 32 33 34 35 36 37

Here extract the last five rows of data. Note that the space after the comma is blank, which means that all variables are referenced.

d <- d[nrow(d)-5:nrow(d), ]

Display the column indices with the lessR details_brief() function, abbreviated db(), the same function Read() calls to display information about the data frame just read. The variables are numbered from 1 to the last variable. As usual with lessR, the default data frame is d.

db()

## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
## 
##     Variable                  Missing  Unique 
##         Name     Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1     Years   integer     31       1      14   10  2  13 ... 7  NA  7
##  2    Gender character     32       0       2   W  W  M ... W  M  M
##  3      Dept character     31       1       5   MKTG  ACCT  SALE ... FINC  SALE  ADMN
##  4    Salary    double     32       0      32   61961.29  72502.5 ... 94494.58  53788.26
##  5    JobSat character     30       2       3   high  NA  low ... high  low  med
##  6      Plan   integer     32       0       3   2  2  1 ... 2  1  1
##  7       Pre   integer     32       0      26   80  74  72 ... 90  62  82
##  8      Post   integer     32       0      20   86  87  73 ... 86  74  92
## ------------------------------------------------------------------------------------------

lessR .() Subsetting

A more general solution to subsetting variables refers to the variables by their names. If the position of the variables in the data frame changes due to other modifications, the specified subsetting still works. Perhaps the most convenient method to subset a data frame is to use the Base R Extract function, [ ], in conjunction with lessR function .(), which, returns either row or column indices depending on its position.

Select Rows by Logical Criteria

Rows of data from the data table can be selection according to logical criteria. The retained rows of data are those that fulfill the criteria. Use ! for not. To relate two different criteria, use & for and | for or. For exclusive pairing, one of two criteria are true but not both, invoke the xor function, as in xor(x, y).

For Base R, when variables are directly included in the subsetting expression, their location must be identified within the specified data frame, even if the same data frame that is being subset.

For example, d$Gender and d$Salary when subsetting the d data frame by Gender and Salary. Here subset to only men with salaries above $90,000, retaining all the variables. That is, extract only those rows where the variable Gender is equal to “M” and the variable Salary is greater than $90,000.

d <- d[d$Gender=="M" & d$Salary>90000, ]

Applied to the Base R Extract function, the lessR function .() eliminates the need to pre-pend the data frame name and a $ to each variable name in the specified logical expression to select rows.

d[.(Gender=="M" & Salary>90000), ]

##                 Years Gender Dept    Salary JobSat Plan Pre Post
## Wu, James          NA      M SALE  94494.58    low    1  62   74
## Hoang, Binh        15      M SALE 111074.86    low    3  96   97
## Knox, Michael      18      M MKTG  99062.66    med    3  81   84
## Sheppard, Cory     14      M FINC  95027.55    low    3  66   73
## Correll, Trevon    21      M SALE 134419.23    low    1  97   94
## Capelle, Adam      24      M ADMN 108138.43    med    2  83   81

To select rows by row names, invoke the Base R function rownames(), referencing the data frame, here d. If selecting more than a single row, express the row names as a vector and use %in% in place of double equals, ==. Again, no variables specified after the comma, so all variables are retained in the subset.

d[rownames(d) %in% c("Hoang, Binh", "Downs, Deborah"), ]

##                Years Gender Dept   Salary JobSat Plan Pre Post
## Downs, Deborah     7      W FINC  57139.9   high    2  90   86
## Hoang, Binh       15      M SALE 111074.9    low    3  96   97

Note that most lessR analysis functions also have a rows parameter. When calling a function, such as Histogram(), set the rows parameter equal to a logical criterion in the function call to subset the rows of data in the data frame to be analyzed. That strategy avoids extra statements for subsetting the data frame.

In the following example, display the histogram for only women’s salaries. To save space, set the parameter quiet to TRUE to avoid the text output of the function.

Histogram(Salary, rows=(Gender=="W"), quiet=TRUE)

Select Columns by Variable Names

The Base R Extract function by itself presents two annoying features for variable selection. First, to select specific variables, all variable names must be between quotes. Second, only individual variables can be specified, more than one variable as a vector with the c() function. Variable ranges of contiguous variables are not permitted.

The lessR function .() addresses these deficiencies. No quotes are required around the variable names, and as many variable ranges and single variables as desired may be specified in a single function call.

Here specify all variables in the d data frame of the Employee data starting with Years through Salary, and also add the variable Post to the specification. The .() function returns the corresponding variable indices within the context of the Extract function. List the values of these variables for just the first three rows of data.

d[1:3, .(Years:Salary, Post)]

##                  Years Gender Dept   Salary Post
## Ritchie, Darnell     7      M ADMN 53788.26   92
## Wu, James           NA      M SALE 94494.58   74
## Downs, Deborah       7      W FINC 57139.90   86

To indicate that the specified variables should be excluded instead of included, add a minus sign, -, preceding the .() function call.

d[1:3, -.(Years:Salary, Post)]

##                  JobSat Plan Pre
## Ritchie, Darnell    med    1  82
## Wu, James           low    1  62
## Downs, Deborah     high    2  90

The Extract function has the annoying characteristic of returning a vector instead of a data frame if there is only one variable retained.

d[1:3, .(Salary)]

## [1] 53788.26 94494.58 57139.90

To keep the status of a data frame when retaining only a single variable, set the drop parameter to FALSE.

d[1:3, .(Salary), drop=FALSE]

##                    Salary
## Ritchie, Darnell 53788.26
## Wu, James        94494.58
## Downs, Deborah   57139.90

Base R Subsetting

Using Base R Extract, that is, [ ], in conjunction with lessR .() provides all the tools needed to conveniently extract a subset of information from a data frame. But then we have multiple specialized functions to accomplish the same thing, from both Base R and the tidyverse package dplyr.

Why these specialized functions from multiple sources? The R people provided some, likely because of their limitations of their own Extract function. But why not fix that function instead of introducing more? Regardless, these other functions exist and are used more frequently than Base R Extract, understandable given its limitations, though now addressed with lessR .().

The Base R function subset() provides the option for deleting variables from a data frame with the select parameter. Unmodified, the specified variables are those that are included in the resulting data frame. Add the - sign to indicate deletion.

This example uses the Base R subset() function with the select parameter, here deleting two variables from the data frame.

d <- subset(d, select=-c(Pre, Post))
head(d)

##                  Years Gender Dept    Salary JobSat Plan
## Ritchie, Darnell     7      M ADMN  53788.26    med    1
## Wu, James           NA      M SALE  94494.58    low    1
## Downs, Deborah       7      W FINC  57139.90   high    2
## Hoang, Binh         15      M SALE 111074.86    low    3
## Jones, Alissa        5      W <NA>  53772.58   <NA>    1
## Afshari, Anbar       6      W ADMN  69441.93   high    2

Base R also uses the subset() function to extract rows of data from a data frame, essentially by dropping the select parameter. The name of the data frame appears as the first parameter in the function call.

d <- subset(d, Gender=="M" & Salary>90000)

dplyr Subsetting

To subset columns of a data frame, the tidyverse dplyr package provides the function select().

d <- select(d, -c(Pre, Post))
head(d)

##                  Years Gender Dept    Salary JobSat Plan
## Ritchie, Darnell     7      M ADMN  53788.26    med    1
## Wu, James           NA      M SALE  94494.58    low    1
## Downs, Deborah       7      W FINC  57139.90   high    2
## Hoang, Binh         15      M SALE 111074.86    low    3
## Jones, Alissa        5      W <NA>  53772.58   <NA>    1
## Afshari, Anbar       6      W ADMN  69441.93   high    2

The select() function also permits the specification of each variable separately instead of specifying as a vector.

##                  Years Gender Dept    Salary JobSat Plan
## Ritchie, Darnell     7      M ADMN  53788.26    med    1
## Wu, James           NA      M SALE  94494.58    low    1
## Downs, Deborah       7      W FINC  57139.90   high    2
## Hoang, Binh         15      M SALE 111074.86    low    3
## Jones, Alissa        5      W <NA>  53772.58   <NA>    1
## Afshari, Anbar       6      W ADMN  69441.93   high    2

The tidyverse function in the dpylr package, filter(), does the same thing as Base R subset() with identical syntax: Subset rows of a data frame. (Yet this, and related functions, are heralded by many as an advance by those who apparently do not understand that the same functionality exists within Base R and has since at least Version 1.0 of R around Year 2000.)

d <- filter(d, Gender=="M" & Salary>90000)

The result of this application of filter() is the same as the other approaches, a data frame of only males with salaries above $90,000. The revised data frame is now much smaller.

##                 Years Gender Dept    Salary JobSat Plan
## Wu, James          NA      M SALE  94494.58    low    1
## Hoang, Binh        15      M SALE 111074.86    low    3
## Knox, Michael      18      M MKTG  99062.66    med    3
## Sheppard, Cory     14      M FINC  95027.55    low    3
## Correll, Trevon    21      M SALE 134419.23    low    1
## Capelle, Adam      24      M ADMN 108138.43    med    2

However, filter() only works with logical expressions. Another function, slice() selects rows based on the integer position of the rows in the data frame, the indices. The following selects the first three rows of the d data frame (actually tibble if using only dplyr).

slice(d, 1:3)

##               Years Gender Dept    Salary JobSat Plan
## Wu, James        NA      M SALE  94494.58    low    1
## Hoang, Binh      15      M SALE 111074.86    low    3
## Knox, Michael    18      M MKTG  99062.66    med    3

Note that to subset both rows and columns with dplyr, the result from the first function must be entered into the input for the second function. The best way to do this is with the pipe operator, explored later in this reading.

The tidyverse presents yet another function, the slice() function, to use indices to specify the rows of data to retain or to delete. (The tidyverse presents many new functions.) Use the dplyr function n() to indicate the last row of data, written as n(). Here retain the fifth row of data through the last row.

d <- slice(d, 5:n())

Equivalently, delete the first four rows of data.

d <- slice(d, -1:-4)

The subsetting result follows.

head(d)

##                  Years Gender Dept   Salary JobSat Plan Pre Post
## Jones, Alissa        5      W <NA> 53772.58   <NA>    1  65   62
## Afshari, Anbar       6      W ADMN 69441.93   high    2 100  100
## Knox, Michael       18      M MKTG 99062.66    med    3  81   84
## Campagna, Justin     8      M SALE 72321.36    low    1  76   84
## Kimball, Claire      8      W MKTG 61356.69   high    2  93   92
## Cooper, Lindsay      4      W MKTG 56772.95   high    1  78   91

The result for all of these row deletions provides the same result in terms of the selected rows. The only distinction is that the slice() function, being a tidyverse function, does not recognize row names, so starts the numbering of the retained rows at 1. The Base R functions retain the original row names, so start the numbering at 5.

Create and Transform Variables

Transform a variable with an arithmetic expression that defines a computation based on the values of existing variables. Before writing the transformation in a computer language, begin with its algebraic expression. For example, to transform a measurement of height from feet to inches, multiply each measurement in inches by 12. The following equation expresses this transformation apart from any computer function or data table. \[X_{inches} = 12 * X_{feet}\] This equation is a logical equality in the form of an algebraic equation. The equals sign, =, specifies the logical equality. For specific values of $X_{feet}$ and $X_{inches}$, the statement is either TRUE or FALSE. This equation, by itself, has nothing to do with computers or computer languages.

To express a transformation in the language of a computer function, shift the focus from an algebraic equation to writing an expression that instructs the computer to perform computations on the values of variables, usually in a data table. This transformation is a set of physical actions that computes new values of a variable and stores the result on the computer. Move from writing an algebraic equality to writing an assignment statement that instructs the computer to do computational work to assign values to a variable across rows of data.

Assignment statement: Instruct the computer to compute the values of a variable from a specified arithmetic expression of existing variables.

Every computer language or dialect defines its notation for an assignment statement. Many languages, such as R, explicitly do not use the equals sign, =, of an algebraic or logical equality for an assignment statement. For example, R uses <-. A statement of an equality is a separate concept from having the computer do computations on data values. That is why many computer languages have a different symbol for evaluating a logical equality as TRUE or FALSE. For example, R uses the double equal sign, == for this evaluation.

When defining a transformation, decide either to create an entirely new variable in the data table, or replace the values of the existing variable. If the expression leads to a new variable on the left of the assignment statement, a new variable has been created. Or, the expression can redefine an existing variable by specifying the same variable on the left of the assignment statement as on the right. To express this concept in terms of a computer function to enter into an R program depends on the specific function chosen to accomplish that transformation, with examples following.

Base R Direct Arithmetic Expression

To create variables with arithmetic expressions with Base R that transform one or more variables, simply write the arithmetic expression. The transformation can create a new variable or transform the values of an existing variable.

Because it is possible to have multiple data frames in a working R session, and because it is possible to have different variables of the same name in different data frames, identify each variable by including the data frame that contains the variable. To identify the location of a variable within the specified data frame, prep-end the data frame name that contains the variable, followed by a $, to each variable name.

As always, the R assignment operator <- indicates to take the expression on the right of the <- and assign its value to whatever is on the left of the operator. In this example, PP and PxP are new transformed variables, created from the values of existing variables in the data frame, Pre and Post. Because we want the newly created variables, such as PP or PxP, to reside in the d data frame with the other variables in our analysis, we also prefix the variable name, PP, with a d$.

d$PP <- d$Pre + d$Post
d$PxP <- d$Pre * d$Post

If working within RStudio, the easiest way to enter such equations into either the console directly or into a text file in the Source window is auto-completion. Enter the name of the data frame and the \$ sign, such as d$. The result is a little pop-up window that displays the list of variables in the data frame from which to select. The result is not only less typing, but also no typing errors from misspelling variable names.

A transformation involves simply writing the corresponding formula for the transformation. But, once again, we have specialized functions to accomplish the same task.

Specialized Transform Functions

The primary issue with these transformation equations is the need to repeat the name of the data frame, particularly burdensome if the data frame name is long. To address this situation, Base R offers the transform() function. The first argument of the function is the data frame, which then does not need to repeated for each of the enclosed variables.

In this example, the Base R transform() function creates the new variables PP and PxP. Because this example simultaneously creates two new variables, a comma is needed to separate the transformation equations.

d <- transform(d,
    PP = Pre + Post,
    PxP = Pre * Post
)

From the tidyverse dplyr package the mutate() function creates a new variable. This function has the identical syntax as the Base R transform() function, though mutate() will also lose any row names as the tidyverse even changes the properties of the basic data frame.

d <- mutate(d,
     PP = Pre + Post,
     PxP = Pre * Post
)

The result from any of these functions. The newly created variables in the revised data frame are appended to the right of the existing columns in the data frame.

class(d)

## [1] "data.frame"

head(d)

##                  Years Gender Dept    Salary JobSat Plan Pre Post  PP   PxP
## Ritchie, Darnell     7      M ADMN  53788.26    med    1  82   92 174  7544
## Wu, James           NA      M SALE  94494.58    low    1  62   74 136  4588
## Downs, Deborah       7      W FINC  57139.90   high    2  90   86 176  7740
## Hoang, Binh         15      M SALE 111074.86    low    3  96   97 193  9312
## Jones, Alissa        5      W <NA>  53772.58   <NA>    1  65   62 127  4030
## Afshari, Anbar       6      W ADMN  69441.93   high    2 100  100 200 10000

So mutate() performs the same as the standard Base R transform(), with the loss of any row names, and also yields a standard R data frame if passed a data frame as input.

Summarize with Aggregate Data

Note: The following material on lessR function pivot() is extracted from the vignette on the same topic. To view this and other vignettes, each of which provides explanation and examples on a specific topic, enter into R: browseVignettes("lessR").

Aggregation

What is the mean salary across various departments in a company? What is the mean salary across departments for men and for women? The answer to these questions is done by what is called aggregation.

Aggregation: Compute a statistic of a numerical variable over groups defined by one or more categorical variables.

It the above example, the numerical variable is Salary. The groups are defined by the different departments, and then by the men and women within each department. The mean of Salary is computed for each department in the first aggregation. In the second aggregation the mean salary is computed each combination of the levels of Gender and Department, which defines defines the groups. One group is women in sales. Another group is men in accounting.

The following table lists many of the available statistical functions by which to aggregate over the groups.

Statistic	Meaning
`sum`	sum
`mean`	arithmetic mean
`median`	median
`min`	minimum
`max`	maximum
`sd`	standard deviation
`var`	variance
`IQR`	inter-quartile range
`mad`	mean absolute deviation
`tabulate`	count of each cell only

Some statistical functions are available that return multiple values.

Statistic	Meaning
`range`	minimum, maximum
`quantile`	range + quartiles
`summary`	quantile + mean

These later three functions output their values as an R matrix, which then replaces the values of the variable that is aggregated in the resulting output data frame. Custom functions can also be written for computing the statistic over which to aggregate.

lessR pivot()

To aggregate data, lessR provides the function pivot(), following the Excel name pivot table to describe the summary table of aggregated data. After a lifetime of using R, Python, SAS, Excel, and others, I am not aware of any aggregation function as straightforward to use as pivot(), which is why I wrote the function.

Parameters

The following pivot() parameters specify the data, the statistic(s) for the aggregation, the variables(s) over which to aggregate, and the corresponding cells that contain the aggregated values. The parameter names are chosen so that describing an example, as follows, is a regular English sentence.

The first three listed parameter values are required to get one or more statistics computed over the entire data set. The fourth parameter, by, specifies the categorical variable(s) over which to aggregate, that is, compute the one or more specified statistics.

data: The data frame that includes the variables of interest.
compute: The function(s) for the statistic(s) for which to perform the aggregation.
variable: The variable(s) for which to summarize, i.e., aggregate.
by: The categorical variable(s) that define the sub-groups or cells for which to compute the aggregated values.
by_cols: The optional categorical variable(s) (not shown above) that define the sub-groups or cells for which to compute the aggregated values, listed as columns in a two-dimensional table for viewing.

Multiple values of parameters variable, by, and up to two by_cols variables may be specified. Express the multiple categorical variables over which to pivot as a vector, such as with the c() function. If by_cols is not specified, the result is a summary table as a long-form data frame that can be input into other data analysis procedures.

By default, missing values are eliminated from the aggregation. Missing values can occur for the value being aggregated as well as the groups over which the aggregation is conducted. If there are no values for a cell for which to perform the aggregation, and it is desired to have the aggregated value listed as missing, then specify na_value as TRUE. If any of the levels of the by variables are missing, to report those missing cells, specify na_by as TRUE. Set na_remove to TRUE to have any aggregated value defined as missing if any individual data cell for variable is missing.

Examples

To illustrate, use the Employee data set included in lessR. Here read into the d data frame. Begin with no missing data. Custom functions can also be written for computing the statistic over which to aggregate.

d <- Read("Employee")

## 
## >>> Suggestions
## Details about your data, Enter:  details()  for d, or  details(name)
## 
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
## 
##     Variable                  Missing  Unique 
##         Name     Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1     Years   integer     36       1      16   7  NA  7 ... 1  2  10
##  2    Gender character     37       0       2   M  M  W ... W  W  M
##  3      Dept character     36       1       5   ADMN  SALE  FINC ... MKTG  SALE  FINC
##  4    Salary    double     37       0      37   53788.26  94494.58 ... 56508.32  57562.36
##  5    JobSat character     35       2       3   med  low  high ... high  low  high
##  6      Plan   integer     37       0       3   1  1  2 ... 2  2  1
##  7       Pre   integer     37       0      27   82  62  90 ... 83  59  80
##  8      Post   integer     37       0      22   92  74  86 ... 90  71  87
## ------------------------------------------------------------------------------------------

Output as a Long-Form Data Frame

Two categorical variables in the d data frame are Dept and Salary. A continuous variable is Salary. Create the long-form pivot table as a data frame that expresses the mean of Salary for all combinations of Dept and Salary.

This example includes the parameter names for emphasis, but if the parameters are entered in this order, their names are not necessary. Because the aggregation is over two categorical variables, create a vector of the two variables for the by parameter.

pivot(data=d, compute=mean, variable=Salary, by=c(Dept, Gender))

##    Dept Gender  n na     mean
## 1  ACCT      M  2  0 59626.19
## 2  ADMN      M  2  0 80963.35
## 3  FINC      M  3  0 72967.60
## 4  MKTG      M  1  0 99062.66
## 5  SALE      M 10  0 86150.97
## 6  <NA>      M  0  0       NA
## 7  ACCT      W  3  0 63237.16
## 8  ADMN      W  4  0 81434.00
## 9  FINC      W  1  0 57139.90
## 10 MKTG      W  5  0 64496.02
## 11 SALE      W  5  0 64188.25
## 12 <NA>      W  1  0 53772.58

Or, the way I usually write it, just including the name of the by parameter for readability.

pivot(d, mean, Salary, by=c(Dept, Gender))

##    Dept Gender  n na     mean
## 1  ACCT      M  2  0 59626.19
## 2  ADMN      M  2  0 80963.35
## 3  FINC      M  3  0 72967.60
## 4  MKTG      M  1  0 99062.66
## 5  SALE      M 10  0 86150.97
## 6  <NA>      M  0  0       NA
## 7  ACCT      W  3  0 63237.16
## 8  ADMN      W  4  0 81434.00
## 9  FINC      W  1  0 57139.90
## 10 MKTG      W  5  0 64496.02
## 11 SALE      W  5  0 64188.25
## 12 <NA>      W  1  0 53772.58

The output of pivot() is a data frame of the aggregated variables. This can be saved for further analysis. Here, perform the same analysis, but list the required parameter values in order without the parameter names. Save the result into the data frame a and the display the contents of a.

a <- pivot(d, mean, Salary, by=c(Dept, Gender))
a

##    Dept Gender  n na     mean
## 1  ACCT      M  2  0 59626.19
## 2  ADMN      M  2  0 80963.35
## 3  FINC      M  3  0 72967.60
## 4  MKTG      M  1  0 99062.66
## 5  SALE      M 10  0 86150.97
## 6  <NA>      M  0  0       NA
## 7  ACCT      W  3  0 63237.16
## 8  ADMN      W  4  0 81434.00
## 9  FINC      W  1  0 57139.90
## 10 MKTG      W  5  0 64496.02
## 11 SALE      W  5  0 64188.25
## 12 <NA>      W  1  0 53772.58

Because the output of pivot() with no by_cols variables is a standard R data frame, typical operations such as sorting can be applied, here using the lessR function Sort().

Sort(a, by=Dept)

## 
## Sort Specification
##   Dept -->  ascending

##    Dept Gender  n na     mean
## 1  ACCT      M  2  0 59626.19
## 7  ACCT      W  3  0 63237.16
## 2  ADMN      M  2  0 80963.35
## 8  ADMN      W  4  0 81434.00
## 3  FINC      M  3  0 72967.60
## 9  FINC      W  1  0 57139.90
## 4  MKTG      M  1  0 99062.66
## 10 MKTG      W  5  0 64496.02
## 5  SALE      M 10  0 86150.97
## 11 SALE      W  5  0 64188.25
## 6  <NA>      M  0  0       NA
## 12 <NA>      W  1  0 53772.58

Multiple variables for which to aggregate over can also be specified. Round the numerical aggregated results to two decimal digits.

pivot(d, mean, c(Years, Salary), c(Dept, Gender), digits_d=2)

##    Dept Gender Years_n Years_na Years_mean Salary_n Salary_na Salary_mean
## 1  ACCT      M       2        0       7.00        2         0    59626.20
## 2  ADMN      M       2        0      15.50        2         0    80963.34
## 3  FINC      M       3        0      11.33        3         0    72967.60
## 4  MKTG      M       1        0      18.00        1         0    99062.66
## 5  SALE      M       9        1      12.33       10         0    86150.97
## 6  <NA>      M       0        0         NA        0         0          NA
## 7  ACCT      W       3        0       4.67        3         0    63237.16
## 8  ADMN      W       4        0       7.50        4         0    81434.00
## 9  FINC      W       1        0       7.00        1         0    57139.90
## 10 MKTG      W       5        0       8.20        5         0    64496.02
## 11 SALE      W       5        0       6.60        5         0    64188.25
## 12 <NA>      W       1        0       5.00        1         0    53772.58

Output as a 2-d Table

Specify up to two by_cols categorical variables, to create a two-dimensional table with the specified columns. First, one by_cols variable, Gender. Specifying one or two categorical variables as by_cols variables moves them from their default position in the rows to the columns.

Moving rows to columns changes the output structure from a long-form data frame that can be input into other analysis functions to a cross-classification table with categorical variables in the rows and columns. The primary purpose of the cross-classification table is to view the aggregated results, not to subject them to further analysis. For viewing, the tabular presentation is often easier to interpret.

Here the fifth value entered in the function call is the value of by_cols, Gender in this example.

pivot(d, mean, Salary, by=Dept, by_cols=Gender)

Table: mean of Salary 

 Gender  M          W               
 Dept                        
-------  ---------  ---------
   ACCT   59626.19   63237.16
   ADMN   80963.35   81434.00
   FINC   72967.60   57139.90
   MKTG   99062.66   64496.02
   SALE   86150.97   64188.25
     NA         NA   53772.58

Here are two by_cols variables, specified as a vector.

pivot(d, mean, Salary, by=Dept, by_cols=c(Gender, Plan))

Table: mean of Salary 

 Gender  M                                 W                                     
 Plan    1          2           3          1          2          3             
 Dept                                                                     
-------  ---------  ----------  ---------  ---------  ---------  ---------
   ACCT   69547.60          NA   49704.79         NA   63237.16         NA
   ADMN   53788.26   108138.43         NA         NA   67724.21   122563.4
   FINC   61937.62          NA   95027.55         NA   57139.90         NA
   MKTG         NA          NA   99062.66   56772.95   66426.79         NA
   SALE   89393.40    82442.74   80131.91   60941.54   66352.73         NA
     NA         NA          NA         NA   53772.58         NA         NA

Tabulation

Tabulation is counting. The number of data values in each cell are tabulated, that is, counted, their frequency of occurrence obtained. For the function that specifies what to compute, specify table.

pivot(d, table, Dept, by=Gender)

##   Gender  n na ACCT ADMN FINC MKTG SALE
## 1      M 18  0    2    2    3    1   10
## 2      W 18  1    3    4    1    5    5

Note you can get the same joint frequency table from the lessR function BarChart().

BarChart(Dept, by=Gender)

## >>> Suggestions
## Plot(Dept, Gender)  # bubble plot
## BarChart(Dept, by=Gender, horiz=TRUE)  # horizontal bar chart
## BarChart(Dept, fill="steelblue")  # steelblue bars 
## 
## Joint and Marginal Frequencies 
## ------------------------------ 
##  
##        Dept 
## Gender   ACCT ADMN FINC MKTG SALE Sum 
##   M         2    2    3    1   10  18 
##   W         3    4    1    5    5  18 
##   Sum       5    6    4    6   15  36 
## 
## Cramer's V: 0.415 
##  
## Chi-square Test of Independence:
##      Chisq = 6.200, df = 4, p-value = 0.185 
## >>> Low cell expected frequencies, chi-squared approximation may not be accurate

Of course, many other functions are available from which to obtain the joint frequencies of the two categorical variables.

Other Aggregation Functions

The Base R function, upon which pivot() relies, is aggregate(). Express this function in one of two modes. This example is of formula mode.

a <- aggregate(Salary ~ Dept + Gender, data=d, FUN=mean)
a

##    Dept Gender   Salary
## 1  ACCT      M 59626.20
## 2  ADMN      M 80963.35
## 3  FINC      M 72967.60
## 4  MKTG      M 99062.66
## 5  SALE      M 86150.97
## 6  ACCT      W 63237.16
## 7  ADMN      W 81434.00
## 8  FINC      W 57139.90
## 9  MKTG      W 64496.02
## 10 SALE      W 64188.25

For dplyr, aggregate with two functions, group_by() and summarize(). First define the grouping, then the function and then the variable by which to summarize. As with all tidyverse functions, begin the function call with the name of the relevant data frame.

Specify the name of the aggregated variable. Here the mean of Salary across the groups is named mean. Here write the aggregated data to the a data frame.

d = group_by(d, Dept, Gender)
a = summarize(d, mean = mean(Salary))

## `summarise()` has grouped output by 'Dept'. You can override using the `.groups`
## argument.

## # A tibble: 11 × 3
## # Groups:   Dept [6]
##    Dept  Gender   mean
##    <chr> <chr>   <dbl>
##  1 ACCT  M      59626.
##  2 ACCT  W      63237.
##  3 ADMN  M      80963.
##  4 ADMN  W      81434.
##  5 FINC  M      72968.
##  6 FINC  W      57140.
##  7 MKTG  M      99063.
##  8 MKTG  W      64496.
##  9 SALE  M      86151.
## 10 SALE  W      64188.
## 11 <NA>  W      53773.

As is true of the data output of any tidyverse function that outputs a data frame, the data frame is the tidyverse version called a tibble. We can always convert back to a regular R data frame with the Base R function data.frame().

a <- data.frame(a)
a

##    Dept Gender     mean
## 1  ACCT      M 59626.20
## 2  ACCT      W 63237.16
## 3  ADMN      M 80963.35
## 4  ADMN      W 81434.00
## 5  FINC      M 72967.60
## 6  FINC      W 57139.90
## 7  MKTG      M 99062.66
## 8  MKTG      W 64496.02
## 9  SALE      M 86150.97
## 10 SALE      W 64188.25
## 11 <NA>      W 53772.58

Usually, though, the version of the data frame is not an issue. For example, all lessR analysis functions accept data frames in either form.

Forward Pipe Operator

Stefan Milton’s magrittr 2014 package of the same name introduced the forward pipe (or chain) operator, %>%, pronounced then. The operator transformed the way many people code their data transformations by organizing them as a flow from start to end.

Forward pipe operator: Carries forward the output of one function on the left of the operator into the input of the subsequent function on the right.

By default the input goes into the first argument of the subsequent function, which applies to functions that begin their parameter list with the data frame.

After the magrittr package appeared, the tidyverse adopted the operator (to the extent that many R users think that the tidyverse people invented it and it only works with tidyverse functions). With Version 4 of R, the R people followed Milton’s lead and introduced the Base R pipe operator, |>, used here in these examples.

Obtain the pipe operator directly from the keyboard within RStudio as CNTRL/CMD-SHIFT-M. The default is %>%. However, this default can be changed with a trip to:

RStudio’s Tools –> Global Options…
Select Code followed by the Editing tab, then the relevant checkbox

Checking the relevant checkbox changes the default pipe operator to the Base R |>.

The forward pipe, either the magrittr or Base R version, shows up in much contemporary code, so good to be familiar with it. For basic operations, such as shown here, either version of the pipe works equally well.

First Example

The result is that a series of linked operations can be specified with a single entry into the R console. The function for listing the variable names and, by default, the first six rows of data, is the head() function. For the data frame d the traditional way to invoke the function follows.

head(d)

##                  Years Gender Dept    Salary JobSat Plan Pre Post
## Ritchie, Darnell     7      M ADMN  53788.26    med    1  82   92
## Wu, James           NA      M SALE  94494.58    low    1  62   74
## Downs, Deborah       7      W FINC  57139.90   high    2  90   86
## Hoang, Binh         15      M SALE 111074.86    low    3  96   97
## Jones, Alissa        5      W <NA>  53772.58   <NA>    1  65   62
## Afshari, Anbar       6      W ADMN  69441.93   high    2 100  100

Without modification, the forward pipe operator requires that the receiving function have the data frame as its first parameter. Then the data output by the function that precedes the chain operator becomes the data input into the function on the right side of the chain operator.

To access the head() function with the pipe operator, first list the data frame name, which is an implicit call to the print() function. The output of this function is then chained forward as the input to the first parameter in the head() function, the object for which to display the beginning lines. Read as, “take the data frame d and pass to the data (first) parameter of the head() function”.

d |> head()

##                  Years Gender Dept    Salary JobSat Plan Pre Post
## Ritchie, Darnell     7      M ADMN  53788.26    med    1  82   92
## Wu, James           NA      M SALE  94494.58    low    1  62   74
## Downs, Deborah       7      W FINC  57139.90   high    2  90   86
## Hoang, Binh         15      M SALE 111074.86    low    3  96   97
## Jones, Alissa        5      W <NA>  53772.58   <NA>    1  65   62
## Afshari, Anbar       6      W ADMN  69441.93   high    2 100  100

Other parameters could also be invoked in the call to head(). For example, to just list the first three lines of data, explicitly set the second parameter of the function, n. Because the data frame d is here piped into the first parameter value of the head() function, the parameter and its value do not appear in the list of parameter values for the function call to head().

d |> head(n=3)

##                  Years Gender Dept   Salary JobSat Plan Pre Post
## Ritchie, Darnell     7      M ADMN 53788.26    med    1  82   92
## Wu, James           NA      M SALE 94494.58    low    1  62   74
## Downs, Deborah       7      W FINC 57139.90   high    2  90   86

By default the pipe operator takes the output from the preceding function in the chain and inputs the previous output into the first parameter of the subsequent function. The underline notation makes this mechanism explicit. The _ serves as a placeholder for the piped-in incoming information and must be accompanied by the parameter name. For the magrittr pipe operator, %>%, the placeholder is a period, ..

The first parameter of the head() function is x, the R object for which the first lines are displayed. Here, instead of relying on the default, explicitly set this value to the output of the previous function in the chain with the underline placeholder.

d |> head(x=_)

##                  Years Gender Dept    Salary JobSat Plan Pre Post
## Ritchie, Darnell     7      M ADMN  53788.26    med    1  82   92
## Wu, James           NA      M SALE  94494.58    low    1  62   74
## Downs, Deborah       7      W FINC  57139.90   high    2  90   86
## Hoang, Binh         15      M SALE 111074.86    low    3  96   97
## Jones, Alissa        5      W <NA>  53772.58   <NA>    1  65   62
## Afshari, Anbar       6      W ADMN  69441.93   high    2 100  100

Another advantage of the _ placeholder is that the output of the previous function could be forwarded to any specified parameter in the subsequent function, not just the default first parameter. The following example is a bit more awkward than the traditional function call, but illustrates that the _ can be the receiver of information passed to any parameter, not just the first. Forward the constant 2 into the head() function as the value of n, the number of lines to display for the input data frame.

2 |> head(d, n=_)

##                  Years Gender Dept   Salary JobSat Plan Pre Post
## Ritchie, Darnell     7      M ADMN 53788.26    med    1  82   92
## Wu, James           NA      M SALE 94494.58    low    1  62   74

Or, for the magrittr pipe operator, use its placeholder, .. To use this operator, the magrittr package or the tidyverse must be accessed first with the library() function.

2 %>% head(d, n=.)

##                  Years Gender Dept   Salary JobSat Plan Pre Post
## Ritchie, Darnell     7      M ADMN 53788.26    med    1  82   92
## Wu, James           NA      M SALE 94494.58    low    1  62   74

Because the Base R operator is part of R, no external packages need be referenced for its pipe operator. That said, because the magrittr pipe operator was first its use is quite common by R users.

More Realistic Example

Here create a pivot table of Salary averaged over all combinations of Dept and Gender, just for those employees with health Plan number 2. If this data frame had many variables, for convenience perhaps select just a subset of the variables, those of interest for this and subsequent analyses.

lessR

Employ the lessR functions .() and pivot() in combination with the Base R Extract function [ ]. Again, the general form is the expression for selecting the rows first, and both the row and column expressions enclosed in .( ).

d[.(rows), .(columns)]

This first example does not use the pipe operator. Instead, create an intermediate data frame dd that stores the subsetted data frame, ready to pass to pivot(). Perhaps not as elegant as using the pipe operator, but works well.

Because we are aggregating over two categorical variables, Dept and Gender, need to specify the by variable as vector with the c() function, c(Dept, Gender).

dd <- d[.(Plan == 2), .(Gender:JobSat)]
a <- pivot(dd, mean, Salary, by=c(Dept, Gender))

More elegantly, use these functions with the pipe operator to avoid creating an intermediate data frame. This is my preferred code to accomplish the goal of generating data frame a. The first parameter value of pivot() is data for specifying the data frame, so the function is amenable to the default behavior of |>.

To keep the linear flow of the operations from left to right, turn the assignment statement around from <- to ->, meaning that the output of pivot() is assigned to the a data frame in this example. Setting the expression with a at the end using -> reflects the temporal flow of operations from left to right. However, the more traditional a <- could be used, setting that notation at the beginning of the entire expression.

Remember that inside the square brackets, [ ], the information before the comma selects the rows and the information after the coma selects the columns, the variables.

d[.(Plan == 2), .(Gender:JobSat)] |> 
   pivot(mean, Salary, c(Dept, Gender)) -> a
a

##    Dept Gender n na      mean
## 1  ACCT      M 0  0        NA
## 2  ADMN      M 1  0 108138.43
## 3  FINC      M 0  0        NA
## 4  MKTG      M 0  0        NA
## 5  SALE      M 2  0  82442.74
## 6  ACCT      W 3  0  63237.16
## 7  ADMN      W 3  0  67724.21
## 8  FINC      W 1  0  57139.90
## 9  MKTG      W 4  0  66426.79
## 10 SALE      W 3  0  66352.73

The above code would be created and run for the following management request:

“For the d data frame, let’s analyze the data for only those employees with insurance Plan 2, and only the variables in the data frame from Gender through JobSat. From this subset of the data, calculate the mean salary for each gender across all departments.”

tidyverse

With the tidyverse dplyr package, select rows with filter() and columns with select(). Then use group_by() and summarize() to aggregate. Store the result in data frame a. Note that the R assignment operator works in either direction. Here use -> to assign the result to a.

d |> filter(Plan == 2) |> select(Gender:JobSat) |> 
      group_by(Dept, Gender) |> summarize(mean=mean(Salary)) -> a

## `summarise()` has grouped output by 'Dept'. You can override using the `.groups`
## argument.

## # A tibble: 7 × 3
## # Groups:   Dept [5]
##   Dept  Gender    mean
##   <chr> <chr>    <dbl>
## 1 ACCT  W       63237.
## 2 ADMN  M      108138.
## 3 ADMN  W       67724.
## 4 FINC  W       57140.
## 5 MKTG  W       66427.
## 6 SALE  M       82443.
## 7 SALE  W       66353.

The tidyverse people at the RStudio company, Posit, even redefine the standard data frame. Their functions create their version of a data frame called a tibble, for table.

Compare

The tidyverse dplyr functions get the aggregation done with 121 characters and four invocations of the pipe operator. lessR accomplishes the same task with 79 characters and only one use of the pipe operator. lessR provides more concise code which, in my opinion, is at least as readable as the dplyr code, if not more so.

pivot() also provides for a table for viewing instead of a data frame ready for input into other analysis functions. To create a version as a table instead of a data frame, move the variable Gender to the optional by_cols parameter.

d[.(Plan == 2), .(Gender:JobSat)] |> 
  pivot(mean, Salary, by=Dept, by_cols=Gender)

Table: mean of Salary 

 Gender  M           W               
 Dept                         
-------  ----------  ---------
   ACCT          NA   63237.16
   ADMN   108138.43   67724.21
   FINC          NA   57139.90
   MKTG          NA   66426.79
   SALE    82442.74   66352.73

Showing the actual table also shows the missing data. Again, to not show the missing data,

Comment

The chain operator is useful, good to have in the R toolkit. The new Base R pipe operator, for example, indicates that the maintainers of Base R recognize its usefulness. The primary advantage is that its use follows the natural order in which the functions are called and so, once its meaning is realized, to many easier to write and read than the other technology that creates intermediate data structures for each data processing step.

Sort a Data Frame

lessR

The lessR function Sort() sorts the rows of a data frame according to the values of one or more specified variables. To limit the output, first redefine the data frame with only the first 10 rows.

d <- d[1:10, ]

Then sort, here by the values of the variable Dept. Just display the sorted data frame without saving the sort. The default is to sort in ascending order. The first value to pass to the function is the data frame, here d.

Sort(d, by=Dept)

## 
## Sort Specification
##   Dept -->  ascending

##                  Years Gender Dept    Salary JobSat Plan Pre Post
## Ritchie, Darnell     7      M ADMN  53788.26    med    1  82   92
## Afshari, Anbar       6      W ADMN  69441.93   high    2 100  100
## Downs, Deborah       7      W FINC  57139.90   high    2  90   86
## Knox, Michael       18      M MKTG  99062.66    med    3  81   84
## Kimball, Claire      8      W MKTG  61356.69   high    2  93   92
## Cooper, Lindsay      4      W MKTG  56772.95   high    1  78   91
## Wu, James           NA      M SALE  94494.58    low    1  62   74
## Hoang, Binh         15      M SALE 111074.86    low    3  96   97
## Campagna, Justin     8      M SALE  72321.36    low    1  76   84
## Jones, Alissa        5      W <NA>  53772.58   <NA>    1  65   62

Here sort in descending order, specified by the value - for the parameter direction.

Sort(d, by=Dept, direction="-")

## 
## Sort Specification
##   Dept -->  descending

##                  Years Gender Dept    Salary JobSat Plan Pre Post
## Wu, James           NA      M SALE  94494.58    low    1  62   74
## Hoang, Binh         15      M SALE 111074.86    low    3  96   97
## Campagna, Justin     8      M SALE  72321.36    low    1  76   84
## Knox, Michael       18      M MKTG  99062.66    med    3  81   84
## Kimball, Claire      8      W MKTG  61356.69   high    2  93   92
## Cooper, Lindsay      4      W MKTG  56772.95   high    1  78   91
## Downs, Deborah       7      W FINC  57139.90   high    2  90   86
## Ritchie, Darnell     7      M ADMN  53788.26    med    1  82   92
## Afshari, Anbar       6      W ADMN  69441.93   high    2 100  100
## Jones, Alissa        5      W <NA>  53772.58   <NA>    1  65   62

Sort by Dept in descending order and Gender in ascending order. Use vectors to define multiple variables to sort, and the define multiple directions from which to sort. There is a 1-to-1 correspondence between variables to sort and their corresponding directions.

Sort(d, by=c(Dept, Gender), direction=c("-", "+"))

## 
## Sort Specification
##   Dept -->  descending 
##   Gender -->  ascending

##                  Years Gender Dept    Salary JobSat Plan Pre Post
## Wu, James           NA      M SALE  94494.58    low    1  62   74
## Hoang, Binh         15      M SALE 111074.86    low    3  96   97
## Campagna, Justin     8      M SALE  72321.36    low    1  76   84
## Knox, Michael       18      M MKTG  99062.66    med    3  81   84
## Kimball, Claire      8      W MKTG  61356.69   high    2  93   92
## Cooper, Lindsay      4      W MKTG  56772.95   high    1  78   91
## Downs, Deborah       7      W FINC  57139.90   high    2  90   86
## Ritchie, Darnell     7      M ADMN  53788.26    med    1  82   92
## Afshari, Anbar       6      W ADMN  69441.93   high    2 100  100
## Jones, Alissa        5      W <NA>  53772.58   <NA>    1  65   62

Can also sort by row names in ascending order.

Sort(d, row.names)

## 
## Sort Specification
##   row.names -->  ascending

##                  Years Gender Dept    Salary JobSat Plan Pre Post
## Afshari, Anbar       6      W ADMN  69441.93   high    2 100  100
## Campagna, Justin     8      M SALE  72321.36    low    1  76   84
## Cooper, Lindsay      4      W MKTG  56772.95   high    1  78   91
## Downs, Deborah       7      W FINC  57139.90   high    2  90   86
## Hoang, Binh         15      M SALE 111074.86    low    3  96   97
## Jones, Alissa        5      W <NA>  53772.58   <NA>    1  65   62
## Kimball, Claire      8      W MKTG  61356.69   high    2  93   92
## Knox, Michael       18      M MKTG  99062.66    med    3  81   84
## Ritchie, Darnell     7      M ADMN  53788.26    med    1  82   92
## Wu, James           NA      M SALE  94494.58    low    1  62   74

Specify random for the variable to randomly re-shuffle the rows of data.

Sort(d, random)

## 
## Sort Specification
##   random

##                  Years Gender Dept    Salary JobSat Plan Pre Post
## Downs, Deborah       7      W FINC  57139.90   high    2  90   86
## Hoang, Binh         15      M SALE 111074.86    low    3  96   97
## Kimball, Claire      8      W MKTG  61356.69   high    2  93   92
## Ritchie, Darnell     7      M ADMN  53788.26    med    1  82   92
## Knox, Michael       18      M MKTG  99062.66    med    3  81   84
## Cooper, Lindsay      4      W MKTG  56772.95   high    1  78   91
## Afshari, Anbar       6      W ADMN  69441.93   high    2 100  100
## Wu, James           NA      M SALE  94494.58    low    1  62   74
## Campagna, Justin     8      M SALE  72321.36    low    1  76   84
## Jones, Alissa        5      W <NA>  53772.58   <NA>    1  65   62

tidyverse

The tidyverse provides the function arrange() to sort the rows of a data frame. The default is ascending order.

arrange(d, Dept)

##                  Years Gender Dept    Salary JobSat Plan Pre Post
## Ritchie, Darnell     7      M ADMN  53788.26    med    1  82   92
## Afshari, Anbar       6      W ADMN  69441.93   high    2 100  100
## Downs, Deborah       7      W FINC  57139.90   high    2  90   86
## Knox, Michael       18      M MKTG  99062.66    med    3  81   84
## Kimball, Claire      8      W MKTG  61356.69   high    2  93   92
## Cooper, Lindsay      4      W MKTG  56772.95   high    1  78   91
## Wu, James           NA      M SALE  94494.58    low    1  62   74
## Hoang, Binh         15      M SALE 111074.86    low    3  96   97
## Campagna, Justin     8      M SALE  72321.36    low    1  76   84
## Jones, Alissa        5      W <NA>  53772.58   <NA>    1  65   62

To sort in descending order, following the tidyverse way, introduce another function, here desc().

arrange(d, desc(Dept))

##                  Years Gender Dept    Salary JobSat Plan Pre Post
## Wu, James           NA      M SALE  94494.58    low    1  62   74
## Hoang, Binh         15      M SALE 111074.86    low    3  96   97
## Campagna, Justin     8      M SALE  72321.36    low    1  76   84
## Knox, Michael       18      M MKTG  99062.66    med    3  81   84
## Kimball, Claire      8      W MKTG  61356.69   high    2  93   92
## Cooper, Lindsay      4      W MKTG  56772.95   high    1  78   91
## Downs, Deborah       7      W FINC  57139.90   high    2  90   86
## Ritchie, Darnell     7      M ADMN  53788.26    med    1  82   92
## Afshari, Anbar       6      W ADMN  69441.93   high    2 100  100
## Jones, Alissa        5      W <NA>  53772.58   <NA>    1  65   62

Or, sort by two variables.

arrange(d, Dept, Gender)

##                  Years Gender Dept    Salary JobSat Plan Pre Post
## Ritchie, Darnell     7      M ADMN  53788.26    med    1  82   92
## Afshari, Anbar       6      W ADMN  69441.93   high    2 100  100
## Downs, Deborah       7      W FINC  57139.90   high    2  90   86
## Knox, Michael       18      M MKTG  99062.66    med    3  81   84
## Kimball, Claire      8      W MKTG  61356.69   high    2  93   92
## Cooper, Lindsay      4      W MKTG  56772.95   high    1  78   91
## Wu, James           NA      M SALE  94494.58    low    1  62   74
## Hoang, Binh         15      M SALE 111074.86    low    3  96   97
## Campagna, Justin     8      M SALE  72321.36    low    1  76   84
## Jones, Alissa        5      W <NA>  53772.58   <NA>    1  65   62

Missing Data

View Missing Data

d[!complete.cases(d), ]

##                    Years Gender Dept   Salary JobSat Plan Pre Post
## Wu, James             NA      M SALE 94494.58    low    1  62   74
## Jones, Alissa          5      W <NA> 53772.58   <NA>    1  65   62
## Korhalkar, Jessica     2      W ACCT 72502.50   <NA>    2  74   87

There is missing data.

##                     Years Gender Dept    Salary JobSat Plan Pre Post
## Ritchie, Darnell        7      M ADMN  53788.26    med    1  82   92
## Wu, James              NA      M SALE  94494.58    low    1  62   74
## Downs, Deborah          7      W FINC  57139.90   high    2  90   86
## Hoang, Binh            15      M SALE 111074.86    low    3  96   97
## Jones, Alissa           5      W <NA>  53772.58   <NA>    1  65   62
## Afshari, Anbar          6      W ADMN  69441.93   high    2 100  100
## Knox, Michael          18      M MKTG  99062.66    med    3  81   84
## Campagna, Justin        8      M SALE  72321.36    low    1  76   84
## Kimball, Claire         8      W MKTG  61356.69   high    2  93   92
## Cooper, Lindsay         4      W MKTG  56772.95   high    1  78   91
## Saechao, Suzanne        8      W SALE  55545.25    med    1  98  100
## Pham, Scott            13      M SALE  81871.05   high    2  90   94
## Tian, Fang              9      W ACCT  71084.02    med    2  60   61
## Bellingar, Samantha    10      W SALE  66337.83    med    1  67   72
## Sheppard, Cory         14      M FINC  95027.55    low    3  66   73
## Kralik, Laura          10      W SALE  92681.19    med    2  74   71
## Skrotzki, Sara         18      W MKTG  91352.33    med    2  63   61
## Correll, Trevon        21      M SALE 134419.23    low    1  97   94
## James, Leslie          18      W ADMN 122563.38    low    3  70   70
## Osterman, Pascal        5      M ACCT  49704.79   high    3  69   70
## Adib, Hassan           14      M SALE  83014.43    med    2  71   69
## Gvakharia, Kimberly     3      W SALE  49868.68    med    2  83   79
## Stanley, Grayson        9      M SALE  69624.87    low    1  74   73
## Link, Thomas           10      M FINC  66312.89    low    1  83   83
## Portlock, Ryan         13      M SALE  77714.85    low    1  72   73
## Langston, Matthew       5      M SALE  49188.96    low    3  94   93
## Stanley, Emma           3      W ACCT  46124.97   high    2  86   84
## Singh, Niral            2      W ADMN  61055.44   high    2  59   59
## Anderson, David         9      M ACCT  69547.60    low    1  94   91
## Fulton, Scott          13      M SALE  87785.51    low    1  72   73
## Korhalkar, Jessica      2      W ACCT  72502.50   <NA>    2  74   87
## LaRoe, Maria           10      W MKTG  61961.29   high    2  80   86
## Billing, Susan          4      W ADMN  72675.26    med    2  91   90
## Capelle, Adam          24      M ADMN 108138.43    med    2  83   81
## Hamide, Bita            1      W MKTG  51036.85   high    2  83   90
## Anastasiou, Crystal     2      W SALE  56508.32    low    2  59   71
## Cassinelli, Anastis    10      M FINC  57562.36   high    1  80   87

Remove Missing Data

To remove the all rows of data (cases) that contain at least one missing value, use the Base R na.omit() function.

d <- na.omit(d)

The result is a data frame with 9 rows instead of 10 rows.

##                     Years Gender Dept    Salary JobSat Plan Pre Post
## Ritchie, Darnell        7      M ADMN  53788.26    med    1  82   92
## Downs, Deborah          7      W FINC  57139.90   high    2  90   86
## Hoang, Binh            15      M SALE 111074.86    low    3  96   97
## Afshari, Anbar          6      W ADMN  69441.93   high    2 100  100
## Knox, Michael          18      M MKTG  99062.66    med    3  81   84
## Campagna, Justin        8      M SALE  72321.36    low    1  76   84
## Kimball, Claire         8      W MKTG  61356.69   high    2  93   92
## Cooper, Lindsay         4      W MKTG  56772.95   high    1  78   91
## Saechao, Suzanne        8      W SALE  55545.25    med    1  98  100
## Pham, Scott            13      M SALE  81871.05   high    2  90   94
## Tian, Fang              9      W ACCT  71084.02    med    2  60   61
## Bellingar, Samantha    10      W SALE  66337.83    med    1  67   72
## Sheppard, Cory         14      M FINC  95027.55    low    3  66   73
## Kralik, Laura          10      W SALE  92681.19    med    2  74   71
## Skrotzki, Sara         18      W MKTG  91352.33    med    2  63   61
## Correll, Trevon        21      M SALE 134419.23    low    1  97   94
## James, Leslie          18      W ADMN 122563.38    low    3  70   70
## Osterman, Pascal        5      M ACCT  49704.79   high    3  69   70
## Adib, Hassan           14      M SALE  83014.43    med    2  71   69
## Gvakharia, Kimberly     3      W SALE  49868.68    med    2  83   79
## Stanley, Grayson        9      M SALE  69624.87    low    1  74   73
## Link, Thomas           10      M FINC  66312.89    low    1  83   83
## Portlock, Ryan         13      M SALE  77714.85    low    1  72   73
## Langston, Matthew       5      M SALE  49188.96    low    3  94   93
## Stanley, Emma           3      W ACCT  46124.97   high    2  86   84
## Singh, Niral            2      W ADMN  61055.44   high    2  59   59
## Anderson, David         9      M ACCT  69547.60    low    1  94   91
## Fulton, Scott          13      M SALE  87785.51    low    1  72   73
## LaRoe, Maria           10      W MKTG  61961.29   high    2  80   86
## Billing, Susan          4      W ADMN  72675.26    med    2  91   90
## Capelle, Adam          24      M ADMN 108138.43    med    2  83   81
## Hamide, Bita            1      W MKTG  51036.85   high    2  83   90
## Anastasiou, Crystal     2      W SALE  56508.32    low    2  59   71
## Cassinelli, Anastis    10      M FINC  57562.36   high    1  80   87

The function complete.cases() returns a logical vector indicating which cases are complete, i.e., have no missing values.

Data Wrangling: Cleaning and Preparation

The Work Before Analysis Begins

David Gerbing The School of Business Portland State University

Sunday January 29, 2023 at 11:48

Scalars and Vectors

Scalar

Vectors

Base R Subset Operator

Base R which() Function

Data Frame Subsetting Operations

General Extract Function

Subset Data Frames by Index

lessR .() Subsetting

Select Rows by Logical Criteria

Select Columns by Variable Names

Base R Subsetting

dplyr Subsetting

Create and Transform Variables

Base R Direct Arithmetic Expression

Specialized Transform Functions

Summarize with Aggregate Data

Aggregation

lessR pivot()

Parameters

Examples

Output as a Long-Form Data Frame

Output as a 2-d Table

Tabulation

Other Aggregation Functions

Forward Pipe Operator

First Example

More Realistic Example

lessR

tidyverse

Compare

Comment

Sort a Data Frame

lessR

tidyverse

Missing Data

View Missing Data

Remove Missing Data

David Gerbing
The School of Business
Portland State University