Base R functions are accessed simply by opening the R app, whether by itself or withing RStudio.
Access lessR functions.
library(lessR)
##
## lessR 4.2.6 feedback: gerbing@pdx.edu
## --------------------------------------------------------------
## > d <- Read("") Read text, Excel, SPSS, SAS, or R data file
## d is default data frame, data= in analysis routines optional
##
## Learn about reading, writing, and manipulating data, graphics,
## testing means and proportions, regression, factor analysis,
## customization, and descriptive statistics from pivot tables.
## Enter: browseVignettes("lessR")
##
## View changes in this and recent versions of lessR.
## Enter: news(package="lessR")
##
## **Newly Revised**: Interactive data analysis.
## Enter: interact()
Access tidyverse functions, particularly those from the
dplyr package, the primary source of tidyverse data
manipulation functions. The tidyverse consists of many packages, usually
all loaded at once.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::recode() masks lessR::recode()
## ✖ dplyr::rename() masks lessR::rename()
Read the data, here from a lessR built-in data table, so
only need supply the name of the data table.
d <- Read("Employee")
##
## >>> Suggestions
## Details about your data, Enter: details() for d, or details(name)
##
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
##
## Variable Missing Unique
## Name Type Values Values Values First and last values
## ------------------------------------------------------------------------------------------
## 1 Years integer 36 1 16 7 NA 7 ... 1 2 10
## 2 Gender character 37 0 2 M M W ... W W M
## 3 Dept character 36 1 5 ADMN SALE FINC ... MKTG SALE FINC
## 4 Salary double 37 0 37 53788.26 94494.58 ... 56508.32 57562.36
## 5 JobSat character 35 2 3 med low high ... high low high
## 6 Plan integer 37 0 3 1 1 2 ... 2 2 1
## 7 Pre integer 37 0 27 82 62 90 ... 83 59 80
## 8 Post integer 37 0 22 92 74 86 ... 90 71 87
## ------------------------------------------------------------------------------------------
head(d)
## Years Gender Dept Salary JobSat Plan Pre Post
## Ritchie, Darnell 7 M ADMN 53788.26 med 1 82 92
## Wu, James NA M SALE 94494.58 low 1 62 74
## Downs, Deborah 7 W FINC 57139.90 high 2 90 86
## Hoang, Binh 15 M SALE 111074.86 low 3 96 97
## Jones, Alissa 5 W <NA> 53772.58 <NA> 1 65 62
## Afshari, Anbar 6 W ADMN 69441.93 high 2 100 100
The data will be re-read many times for the following analyses, each time to provide a clean copy of the data for another manipulation. The additional statements for reading are hidden so as to not clutter the output.
The presentation of data manipulation content begins with the simpler 1-dimensional data structure, the vector. Building on those concepts, the material is extended to the primary two-dimensional data structure, the data frame.
There are more than three types of data structures in data analysis systems, but the three described here are the most basic: scalars, vectors, and data frames. The following material shows how to create and do some operations on these data structures with R.
Scalar: A single value.
Here, enter a scalar such as the number 5 into the R console. The [1] in the output indicates the first line of output. In these brief examples, there is only one line.
5
## [1] 5
Or, set the contents of a variable to the number 5, then display the contents by listing the variable name.
x <- 5
x
## [1] 5
Character values are set in quotes when setting a value.
MyGender <- "Male"
MyGender
## [1] "Male"
R serves as a high-level, free, calculator.
y <- x^2 + 100
y
## [1] 125
When working with data, it is often needed to specify a collection of multiple values. The values could be data values or values passed to a function to influence its output, such as a list of various colors for the bars in a bar graph.
Vector: A variable that consists of multiple values.
The most general form for specifying an R vector is the
c() function. For example, define a vector of the first
five integers beginning with 1.
c(1,2,3,4,5)
## [1] 1 2 3 4 5
When a vector is created its values are assigned to a specific variable or parameter. Here, assign the values of the vector to any chosen variable, such as x, as in this example.
x <- c(1,2,3,4,5)
x
## [1] 1 2 3 4 5
For the special case of sequential values, two shortcuts are
available. One shortcut is the : notation for sequential
integers.
1:5
## [1] 1 2 3 4 5
The seq() function defines any regular sequence of
values. To use seq(), specify three values, the beginning
value, the ending value, and the interval between the values, the
parameter by.
seq(1, 5, by=1)
## [1] 1 2 3 4 5
We also can have vectors with character values.
G <- c("Male", "Female", "Other")
G
## [1] "Male" "Female" "Other"
One common use of vectors is passing multiple values to the parameter
of a function. Here is an example bar chart with customized colors
instead of relying upon the default colors. Each of the two bars is
specified to display with a different color by setting the
fill parameter to a vector. Each bar has a different border
color, specified with the color parameter, again set to a
vector, one color per bar.
d <- Read("Employee", quiet=TRUE)
BarChart(Gender, fill=c("darkseagreen3", "darkgoldenrod3"),
color=c("darkseagreen4", "darkgoldenrod4"), quiet=TRUE)
Another example of an R vector at least implicitly present in every
data analysis is the vector of variable names for a data frame, which
contains the data for analysis. The data frame is a two-dimensional
table, but the names of the variables in the data frame, obtained from
the Base R names() function, is a vector. In this example
we read the into the data frame named d.
names(d)
## [1] "Years" "Gender" "Dept" "Salary" "JobSat" "Plan" "Pre" "Post"
Sometimes it is useful to know if a specific value is located in a vector.
%in%operator: Identify if a value is contained within a vector.
For example, consider the predefined character vector called
letters that contains the 26 lowercase letters of the Roman
alphabet. List all 26 letters by entering the name of the vector.
letters
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u"
## [22] "v" "w" "x" "y" "z"
Now test if a specific character value is in the vector.
"f" %in% letters
## [1] TRUE
"5" %in% letters
## [1] FALSE
Subsetting is an essential data preparation process. You have a collection of data values and you only want to analyze some of them, a subset of the original data.
To subset, we need to know the position of the elements in the data structure, such as data values, that we wish to extract. For a vector, identify the position by its ordinal position, 1 for the first value, ten for the tenth value, etc.
Index: A number that indicates the ordinal position of a data value within a data structure.
The concept of an index necessarily applies to any data structure,
such as found in R and Python/Pandas. To subset a vector, specify the
desired indices within the square brackets, [ ] immediately
after the vector’s name.
[ ]: With the specified index or indices inside, refers to a subset of a data structure (in both R and Python/Pandas), the subsetting operator.
For example, refer to just the first 5 letters of the
letters vector with the subsetting operator. Here indicate
to extract just those values in the range from 1 to 5, that is, the
first 5 lowercase letters of the alphabet.
letters[1:5]
## [1] "a" "b" "c" "d" "e"
To extract the first 5 and the last 5 letters, use the c
function to define a single vector that includes both
specifications.
letters[c(1:5, 22:26)]
## [1] "a" "b" "c" "d" "e" "v" "w" "x" "y" "z"
Use the - sign to indicate not in reference to a list of indices. Here extract all the letters that do not have indices 1 through 5 or 22 through 26.
letters[-c(1:5, 22:26)]
## [1] "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u"
Consider another vector, the names of the variables in the d
data frame. Refer to the name of the second variable as the second
element of the Base R names() function applied to the
d data frame.
names(d)[2]
## [1] "Gender"
The Base R which() function forms the basis for many of
the Base R data manipulations. Specify a logical condition and see which
values of the data structure satisfy the condition. For example, which
numbers in a vector are even? Which column of a data frame is named
Gender?
which()function: Return the indices of a data structure that correspond to TRUE according to a specified logical condition.
Calling which() to process a vector of logical values
results in index numbers as if the index numbers were entered
directly.
To illustrate, specify a logical expression and then apply to one
element. Which of the letters in the alphabet is the letter c? That is,
where is c located in the letters vector? Of the 26 values
of the letters vector, only one is equal to “c”.
The double equals sign, ==, specifies a logical
comparison. It compares two entities and returns TRUE if equal and FALSE
if not. With R we can evaluate all the elements of a vector at once
simply by specifying the vector, in this example, the 26 lowercase
letters of the alphabet. R will take the first element of the vector
letters, which is “a” and evaluate if it is “c”, then the
second element, and so on through all 26 letters.
The corresponding logical vector that evaluates each letter for this
equality consists of 25 FALSE values and 1 TRUE value. In
this example, the third element of the resulting logical vector has the
value of TRUE.
letters == "c"
## [1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [15] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
To get the index of “c”, use which(). As indicated, the
function evaluates all the values of the resulting vector of TRUE and
FALSE values, and returns the indices of the TRUE values.
which(letters == "c")
## [1] 3
Then can use the indices output of which() to subset.
Again, the purpose of the which() function is to get the
index value. Because the expression which(letters == "c")
reduces to the number 3, the following is the same as entering
letters[3].
my_c <- letters[which(letters == "c")]
my_c
## [1] "c"
To reference more than a single value use the R operator
%in% followed by a vector with multiple values.
which(letters %in% c("a", "c", "e"))
## [1] 1 3 5
Or, preface the entire expression with a - sign to
indicate the corresponding negative values.
-which(letters %in% c("a", "c", "e"))
## [1] -1 -3 -5
Doing a subset with the negative values return all the values not
specified. The following is equivalent to entering
letters[c(-1, -3, -5)], which subsets all values of the
letters vector that are not the first, third, and
fifth values in the vector.
letters[-which(letters %in% c("a", "c", "e"))]
## [1] "b" "d" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x"
## [22] "y" "z"
The fundamental data structure that stores the data table, which contains the variables for analysis, in both R and Python/Pandas is the data frame.
Data frame: Collection of vectors, one for each column (variable) in the data frame, all of the same length.
Every data analysis software contains functions to implement these data manipulations. Even within a single set of analysis software, such as R, there are multiple ways to accomplish each manipulation.
The 2-dimensional data frame has both rows and columns. Accordingly, to identify a subset of a data frame we need to identify rows and columns to be included (or excluded) from the data frame. The general form of the subsetting expression for the data frame name d:
d[rows, columns]
The concept is to follow the same logic for vectors but apply to rows and/or columns. The rows specifications come before the comma, and the columns specifications come after the comma. An important convention for the subsetting of 2-dimensional tables such as data frames is that if no rows or no columns are specified then all rows or all columns are indicated.
Express the rows and columns of the Extract function
d[row,cols](for data frame d) with either indices or logical expressions.
An example of a set of indices is 1:3, the first three.
An example of a logical expression is Gender=="Male", where
the two equal signs indicate to check for equality of existing
values.
For example, entering the name of an R object, such as the data frame
d, implicitly invokes the R print() function, which
displays the entire contents of the object, such as a data frame, at the
console. Using the subset notation, selected pieces of the object
instead can be displayed. In this example, display the first three rows
of data for all the variables.
d[1:3,]
## Years Gender Dept Salary JobSat Plan Pre Post
## Ritchie, Darnell 7 M ADMN 53788.26 med 1 82 92
## Wu, James NA M SALE 94494.58 low 1 62 74
## Downs, Deborah 7 W FINC 57139.90 high 2 90 86
Here display the first two rows of data for the first, second, third, and sixth variable.
d[1:2, c(1:3, 6)]
## Years Gender Dept Plan
## Ritchie, Darnell 7 M ADMN 1
## Wu, James NA M SALE 1
Here explicitly refer to the column numbers for which to delete the variables Pre and Post, where Pre and Post are in Columns 7 and 8. Retain all the rows of data.
d <- d[, -(7:8)]
With the Base R subsetting, directly specify the rows to retain or
delete. The nrow() function specifies the number of rows of
data, which, in this context, refers to the last row of data.
nrow(d)
## [1] 37
So (nrow(d)-5):nrow(d) gives the indices of the last five rows of the data frame.
(nrow(d)-5):nrow(d)
## [1] 32 33 34 35 36 37
Here extract the last five rows of data. Note that the space after the comma is blank, which means that all variables are referenced.
d <- d[nrow(d)-5:nrow(d), ]
Display the column indices with the lessR
details_brief() function, abbreviated db(),
the same function Read() calls to display information about
the data frame just read. The variables are numbered from 1 to the last
variable. As usual with lessR, the default data frame is
d.
db()
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
##
## Variable Missing Unique
## Name Type Values Values Values First and last values
## ------------------------------------------------------------------------------------------
## 1 Years integer 31 1 14 10 2 13 ... 7 NA 7
## 2 Gender character 32 0 2 W W M ... W M M
## 3 Dept character 31 1 5 MKTG ACCT SALE ... FINC SALE ADMN
## 4 Salary double 32 0 32 61961.29 72502.5 ... 94494.58 53788.26
## 5 JobSat character 30 2 3 high NA low ... high low med
## 6 Plan integer 32 0 3 2 2 1 ... 2 1 1
## 7 Pre integer 32 0 26 80 74 72 ... 90 62 82
## 8 Post integer 32 0 20 86 87 73 ... 86 74 92
## ------------------------------------------------------------------------------------------
A more general solution to subsetting variables refers to the
variables by their names. If the position of the variables in the data
frame changes due to other modifications, the specified subsetting still
works. Perhaps the most convenient method to subset a data frame is to
use the Base R Extract function, [ ], in conjunction with
lessR function .(), which, returns either row
or column indices depending on its position.
Rows of data from the data table can be selection according to
logical criteria. The retained rows of data are those that fulfill the
criteria. Use ! for not. To relate two different criteria,
use & for and | for or. For exclusive
pairing, one of two criteria are true but not both, invoke the
xor function, as in xor(x, y).
For Base R, when variables are directly included in the subsetting expression, their location must be identified within the specified data frame, even if the same data frame that is being subset.
For example, d$Gender and d$Salary when
subsetting the d data frame by Gender and Salary. Here subset
to only men with salaries above $90,000, retaining all the variables.
That is, extract only those rows where the variable Gender is equal to
“M” and the variable Salary is greater than $90,000.
d <- d[d$Gender=="M" & d$Salary>90000, ]
Applied to the Base R Extract function, the lessR
function .() eliminates the need to pre-pend the data frame
name and a $ to each variable name in the specified logical expression
to select rows.
d[.(Gender=="M" & Salary>90000), ]
## Years Gender Dept Salary JobSat Plan Pre Post
## Wu, James NA M SALE 94494.58 low 1 62 74
## Hoang, Binh 15 M SALE 111074.86 low 3 96 97
## Knox, Michael 18 M MKTG 99062.66 med 3 81 84
## Sheppard, Cory 14 M FINC 95027.55 low 3 66 73
## Correll, Trevon 21 M SALE 134419.23 low 1 97 94
## Capelle, Adam 24 M ADMN 108138.43 med 2 83 81
To select rows by row names, invoke the Base R function
rownames(), referencing the data frame, here d. If
selecting more than a single row, express the row names as a vector and
use %in% in place of double equals, ==. Again,
no variables specified after the comma, so all variables are retained in
the subset.
d[rownames(d) %in% c("Hoang, Binh", "Downs, Deborah"), ]
## Years Gender Dept Salary JobSat Plan Pre Post
## Downs, Deborah 7 W FINC 57139.9 high 2 90 86
## Hoang, Binh 15 M SALE 111074.9 low 3 96 97
Note that most lessR analysis functions also have a
rows parameter. When calling a function, such as
Histogram(), set the rows parameter equal to a
logical criterion in the function call to subset the rows of data in the
data frame to be analyzed. That strategy avoids extra statements for
subsetting the data frame.
In the following example, display the histogram for only women’s
salaries. To save space, set the parameter quiet to TRUE to
avoid the text output of the function.
Histogram(Salary, rows=(Gender=="W"), quiet=TRUE)
The Base R Extract function by itself presents two
annoying features for variable selection. First, to select specific
variables, all variable names must be between quotes. Second, only
individual variables can be specified, more than one variable as a
vector with the c() function. Variable ranges of contiguous
variables are not permitted.
The lessR function .() addresses these
deficiencies. No quotes are required around the variable names, and as
many variable ranges and single variables as desired may be specified in
a single function call.
Here specify all variables in the d data frame of the
Employee data starting with Years through Salary, and also add the
variable Post to the specification. The .() function
returns the corresponding variable indices within the context of the
Extract function. List the values of these variables for just the first
three rows of data.
d[1:3, .(Years:Salary, Post)]
## Years Gender Dept Salary Post
## Ritchie, Darnell 7 M ADMN 53788.26 92
## Wu, James NA M SALE 94494.58 74
## Downs, Deborah 7 W FINC 57139.90 86
To indicate that the specified variables should be excluded instead
of included, add a minus sign, -, preceding the
.() function call.
d[1:3, -.(Years:Salary, Post)]
## JobSat Plan Pre
## Ritchie, Darnell med 1 82
## Wu, James low 1 62
## Downs, Deborah high 2 90
The Extract function has the annoying characteristic of
returning a vector instead of a data frame if there is only one variable
retained.
d[1:3, .(Salary)]
## [1] 53788.26 94494.58 57139.90
To keep the status of a data frame when retaining only a single
variable, set the drop parameter to FALSE.
d[1:3, .(Salary), drop=FALSE]
## Salary
## Ritchie, Darnell 53788.26
## Wu, James 94494.58
## Downs, Deborah 57139.90
Using Base R Extract, that is, [ ], in conjunction with
lessR .() provides all the tools needed to
conveniently extract a subset of information from a data frame. But then
we have multiple specialized functions to accomplish the same thing,
from both Base R and the tidyverse package dplyr.
Why these specialized functions from multiple sources? The R people
provided some, likely because of their limitations of their own Extract
function. But why not fix that function instead of introducing more?
Regardless, these other functions exist and are used more frequently
than Base R Extract, understandable given its limitations, though now
addressed with lessR .().
The Base R function subset() provides the option for
deleting variables from a data frame with the select parameter.
Unmodified, the specified variables are those that are included in the
resulting data frame. Add the - sign to indicate
deletion.
This example uses the Base R subset() function with the
select parameter, here deleting two variables from the data
frame.
d <- subset(d, select=-c(Pre, Post))
head(d)
## Years Gender Dept Salary JobSat Plan
## Ritchie, Darnell 7 M ADMN 53788.26 med 1
## Wu, James NA M SALE 94494.58 low 1
## Downs, Deborah 7 W FINC 57139.90 high 2
## Hoang, Binh 15 M SALE 111074.86 low 3
## Jones, Alissa 5 W <NA> 53772.58 <NA> 1
## Afshari, Anbar 6 W ADMN 69441.93 high 2
Base R also uses the subset() function to extract rows
of data from a data frame, essentially by dropping the
select parameter. The name of the data frame appears as the
first parameter in the function call.
d <- subset(d, Gender=="M" & Salary>90000)
To subset columns of a data frame, the tidyverse dplyr
package provides the function select().
d <- select(d, -c(Pre, Post))
head(d)
## Years Gender Dept Salary JobSat Plan
## Ritchie, Darnell 7 M ADMN 53788.26 med 1
## Wu, James NA M SALE 94494.58 low 1
## Downs, Deborah 7 W FINC 57139.90 high 2
## Hoang, Binh 15 M SALE 111074.86 low 3
## Jones, Alissa 5 W <NA> 53772.58 <NA> 1
## Afshari, Anbar 6 W ADMN 69441.93 high 2
The select() function also permits the specification of
each variable separately instead of specifying as a vector.
## Years Gender Dept Salary JobSat Plan
## Ritchie, Darnell 7 M ADMN 53788.26 med 1
## Wu, James NA M SALE 94494.58 low 1
## Downs, Deborah 7 W FINC 57139.90 high 2
## Hoang, Binh 15 M SALE 111074.86 low 3
## Jones, Alissa 5 W <NA> 53772.58 <NA> 1
## Afshari, Anbar 6 W ADMN 69441.93 high 2
The tidyverse function in the dpylr package,
filter(), does the same thing as Base R
subset() with identical syntax: Subset rows of a data
frame. (Yet this, and related functions, are heralded by many as an
advance by those who apparently do not understand that the same
functionality exists within Base R and has since at least Version 1.0 of
R around Year 2000.)
d <- filter(d, Gender=="M" & Salary>90000)
The result of this application of filter() is the same
as the other approaches, a data frame of only males with salaries above
$90,000. The revised data frame is now much smaller.
d
## Years Gender Dept Salary JobSat Plan
## Wu, James NA M SALE 94494.58 low 1
## Hoang, Binh 15 M SALE 111074.86 low 3
## Knox, Michael 18 M MKTG 99062.66 med 3
## Sheppard, Cory 14 M FINC 95027.55 low 3
## Correll, Trevon 21 M SALE 134419.23 low 1
## Capelle, Adam 24 M ADMN 108138.43 med 2
However, filter() only works with logical expressions.
Another function, slice() selects rows based on the integer
position of the rows in the data frame, the indices. The following
selects the first three rows of the d data frame (actually
tibble if using only dplyr).
slice(d, 1:3)
## Years Gender Dept Salary JobSat Plan
## Wu, James NA M SALE 94494.58 low 1
## Hoang, Binh 15 M SALE 111074.86 low 3
## Knox, Michael 18 M MKTG 99062.66 med 3
Note that to subset both rows and columns with dplyr,
the result from the first function must be entered into the input for
the second function. The best way to do this is with the pipe operator,
explored later in this reading.
The tidyverse presents yet another function, the slice()
function, to use indices to specify the rows of data to retain or to
delete. (The tidyverse presents many new functions.) Use the
dplyr function n() to indicate the last row of
data, written as n(). Here retain the fifth row of data
through the last row.
d <- slice(d, 5:n())
Equivalently, delete the first four rows of data.
d <- slice(d, -1:-4)
The subsetting result follows.
head(d)
## Years Gender Dept Salary JobSat Plan Pre Post
## Jones, Alissa 5 W <NA> 53772.58 <NA> 1 65 62
## Afshari, Anbar 6 W ADMN 69441.93 high 2 100 100
## Knox, Michael 18 M MKTG 99062.66 med 3 81 84
## Campagna, Justin 8 M SALE 72321.36 low 1 76 84
## Kimball, Claire 8 W MKTG 61356.69 high 2 93 92
## Cooper, Lindsay 4 W MKTG 56772.95 high 1 78 91
The result for all of these row deletions provides the same result in
terms of the selected rows. The only distinction is that the
slice() function, being a tidyverse function, does not
recognize row names, so starts the numbering of the retained rows at 1.
The Base R functions retain the original row names, so start the
numbering at 5.
Transform a variable with an arithmetic expression that defines a
computation based on the values of existing variables. Before writing
the transformation in a computer language, begin with its algebraic
expression. For example, to transform a measurement of height from feet
to inches, multiply each measurement in inches by 12. The following
equation expresses this transformation apart from any computer function
or data table. \[X_{inches} = 12 *
X_{feet}\] This equation is a logical equality in the form of an
algebraic equation. The equals sign, =, specifies the
logical equality. For specific values of \(X_{feet}\) and \(X_{inches}\), the statement is either TRUE
or FALSE. This equation, by itself, has nothing to do with computers or
computer languages.
To express a transformation in the language of a computer function, shift the focus from an algebraic equation to writing an expression that instructs the computer to perform computations on the values of variables, usually in a data table. This transformation is a set of physical actions that computes new values of a variable and stores the result on the computer. Move from writing an algebraic equality to writing an assignment statement that instructs the computer to do computational work to assign values to a variable across rows of data.
Assignment statement: Instruct the computer to compute the values of a variable from a specified arithmetic expression of existing variables.
Every computer language or dialect defines its notation for an
assignment statement. Many languages, such as R, explicitly do not use
the equals sign, =, of an algebraic or logical equality for
an assignment statement. For example, R uses <-. A
statement of an equality is a separate concept from having the computer
do computations on data values. That is why many computer languages have
a different symbol for evaluating a logical equality as TRUE or FALSE.
For example, R uses the double equal sign, == for this
evaluation.
When defining a transformation, decide either to create an entirely new variable in the data table, or replace the values of the existing variable. If the expression leads to a new variable on the left of the assignment statement, a new variable has been created. Or, the expression can redefine an existing variable by specifying the same variable on the left of the assignment statement as on the right. To express this concept in terms of a computer function to enter into an R program depends on the specific function chosen to accomplish that transformation, with examples following.
To create variables with arithmetic expressions with Base R that transform one or more variables, simply write the arithmetic expression. The transformation can create a new variable or transform the values of an existing variable.
Because it is possible to have multiple data frames in a working R session, and because it is possible to have different variables of the same name in different data frames, identify each variable by including the data frame that contains the variable. To identify the location of a variable within the specified data frame, prep-end the data frame name that contains the variable, followed by a $, to each variable name.
As always, the R assignment operator <- indicates to
take the expression on the right of the <- and assign
its value to whatever is on the left of the operator. In this example,
PP and PxP are new transformed variables, created from the values of
existing variables in the data frame, Pre and Post. Because we want the
newly created variables, such as PP or PxP, to reside in the d
data frame with the other variables in our analysis, we also prefix the
variable name, PP, with a d$.
d$PP <- d$Pre + d$Post
d$PxP <- d$Pre * d$Post
If working within RStudio, the easiest way to enter such equations
into either the console directly or into a text file in the Source
window is auto-completion. Enter the name of the data frame and the
\$ sign, such as d$. The result is a little pop-up window
that displays the list of variables in the data frame from which to
select. The result is not only less typing, but also no typing errors
from misspelling variable names.
A transformation involves simply writing the corresponding formula for the transformation. But, once again, we have specialized functions to accomplish the same task.
The primary issue with these transformation equations is the need to
repeat the name of the data frame, particularly burdensome if the data
frame name is long. To address this situation, Base R offers the
transform() function. The first argument of the function is
the data frame, which then does not need to repeated for each of the
enclosed variables.
In this example, the Base R transform() function creates
the new variables PP and PxP. Because this example simultaneously
creates two new variables, a comma is needed to separate the
transformation equations.
d <- transform(d,
PP = Pre + Post,
PxP = Pre * Post
)
From the tidyverse dplyr package the
mutate() function creates a new variable. This function has
the identical syntax as the Base R transform() function,
though mutate() will also lose any row names as the
tidyverse even changes the properties of the basic data frame.
d <- mutate(d,
PP = Pre + Post,
PxP = Pre * Post
)
The result from any of these functions. The newly created variables in the revised data frame are appended to the right of the existing columns in the data frame.
class(d)
## [1] "data.frame"
head(d)
## Years Gender Dept Salary JobSat Plan Pre Post PP PxP
## Ritchie, Darnell 7 M ADMN 53788.26 med 1 82 92 174 7544
## Wu, James NA M SALE 94494.58 low 1 62 74 136 4588
## Downs, Deborah 7 W FINC 57139.90 high 2 90 86 176 7740
## Hoang, Binh 15 M SALE 111074.86 low 3 96 97 193 9312
## Jones, Alissa 5 W <NA> 53772.58 <NA> 1 65 62 127 4030
## Afshari, Anbar 6 W ADMN 69441.93 high 2 100 100 200 10000
So mutate() performs the same as the standard Base R
transform(), with the loss of any row names, and also
yields a standard R data frame if passed a data frame as input.
Note: The following material on lessR function
pivot() is extracted from the vignette on the same topic.
To view this and other vignettes, each of which provides explanation and
examples on a specific topic, enter into R:
browseVignettes("lessR").
What is the mean salary across various departments in a company? What is the mean salary across departments for men and for women? The answer to these questions is done by what is called aggregation.
Aggregation: Compute a statistic of a numerical variable over groups defined by one or more categorical variables.
It the above example, the numerical variable is Salary. The groups are defined by the different departments, and then by the men and women within each department. The mean of Salary is computed for each department in the first aggregation. In the second aggregation the mean salary is computed each combination of the levels of Gender and Department, which defines defines the groups. One group is women in sales. Another group is men in accounting.
The following table lists many of the available statistical functions by which to aggregate over the groups.
| Statistic | Meaning |
|---|---|
sum |
sum |
mean |
arithmetic mean |
median |
median |
min |
minimum |
max |
maximum |
sd |
standard deviation |
var |
variance |
IQR |
inter-quartile range |
mad |
mean absolute deviation |
tabulate |
count of each cell only |
Some statistical functions are available that return multiple values.
| Statistic | Meaning |
|---|---|
range |
minimum, maximum |
quantile |
range + quartiles |
summary |
quantile + mean |
These later three functions output their values as an R matrix, which then replaces the values of the variable that is aggregated in the resulting output data frame. Custom functions can also be written for computing the statistic over which to aggregate.
To aggregate data, lessR provides the function
pivot(), following the Excel name pivot table to describe
the summary table of aggregated data. After a lifetime of using R,
Python, SAS, Excel, and others, I am not aware of any aggregation
function as straightforward to use as pivot(), which is why
I wrote the function.
The following pivot() parameters specify the data, the
statistic(s) for the aggregation, the variables(s) over which to
aggregate, and the corresponding cells that contain the aggregated
values. The parameter names are chosen so that describing an example, as
follows, is a regular English sentence.
The first three listed parameter values are required to get one or
more statistics computed over the entire data set. The fourth parameter,
by, specifies the categorical variable(s) over which to
aggregate, that is, compute the one or more specified statistics.
data: The data frame that includes the variables of
interest.compute: The function(s) for the statistic(s) for which
to perform the aggregation.variable: The variable(s) for which to summarize, i.e.,
aggregate.by: The categorical variable(s) that define the
sub-groups or cells for which to compute the aggregated values.by_cols: The optional categorical variable(s) (not
shown above) that define the sub-groups or cells for which to compute
the aggregated values, listed as columns in a two-dimensional table for
viewing.Multiple values of parameters variable, by,
and up to two by_cols variables may be specified. Express
the multiple categorical variables over which to pivot as a vector, such
as with the c() function. If by_cols is not
specified, the result is a summary table as a long-form data frame that
can be input into other data analysis procedures.
By default, missing values are eliminated from the aggregation.
Missing values can occur for the value being aggregated as well as the
groups over which the aggregation is conducted. If there are no values
for a cell for which to perform the aggregation, and it is desired to
have the aggregated value listed as missing, then specify
na_value as TRUE. If any of the levels of the
by variables are missing, to report those missing cells,
specify na_by as TRUE. Set
na_remove to TRUE to have any aggregated value
defined as missing if any individual data cell for variable
is missing.
To illustrate, use the Employee data set included in
lessR. Here read into the d data frame. Begin with
no missing data. Custom functions can also be written for computing the
statistic over which to aggregate.
d <- Read("Employee")
##
## >>> Suggestions
## Details about your data, Enter: details() for d, or details(name)
##
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
##
## Variable Missing Unique
## Name Type Values Values Values First and last values
## ------------------------------------------------------------------------------------------
## 1 Years integer 36 1 16 7 NA 7 ... 1 2 10
## 2 Gender character 37 0 2 M M W ... W W M
## 3 Dept character 36 1 5 ADMN SALE FINC ... MKTG SALE FINC
## 4 Salary double 37 0 37 53788.26 94494.58 ... 56508.32 57562.36
## 5 JobSat character 35 2 3 med low high ... high low high
## 6 Plan integer 37 0 3 1 1 2 ... 2 2 1
## 7 Pre integer 37 0 27 82 62 90 ... 83 59 80
## 8 Post integer 37 0 22 92 74 86 ... 90 71 87
## ------------------------------------------------------------------------------------------
Two categorical variables in the d data frame are Dept and Salary. A continuous variable is Salary. Create the long-form pivot table as a data frame that expresses the mean of Salary for all combinations of Dept and Salary.
This example includes the parameter names for emphasis, but if the
parameters are entered in this order, their names are not necessary.
Because the aggregation is over two categorical variables, create a
vector of the two variables for the by parameter.
pivot(data=d, compute=mean, variable=Salary, by=c(Dept, Gender))
## Dept Gender n na mean
## 1 ACCT M 2 0 59626.19
## 2 ADMN M 2 0 80963.35
## 3 FINC M 3 0 72967.60
## 4 MKTG M 1 0 99062.66
## 5 SALE M 10 0 86150.97
## 6 <NA> M 0 0 NA
## 7 ACCT W 3 0 63237.16
## 8 ADMN W 4 0 81434.00
## 9 FINC W 1 0 57139.90
## 10 MKTG W 5 0 64496.02
## 11 SALE W 5 0 64188.25
## 12 <NA> W 1 0 53772.58
Or, the way I usually write it, just including the name of the
by parameter for readability.
pivot(d, mean, Salary, by=c(Dept, Gender))
## Dept Gender n na mean
## 1 ACCT M 2 0 59626.19
## 2 ADMN M 2 0 80963.35
## 3 FINC M 3 0 72967.60
## 4 MKTG M 1 0 99062.66
## 5 SALE M 10 0 86150.97
## 6 <NA> M 0 0 NA
## 7 ACCT W 3 0 63237.16
## 8 ADMN W 4 0 81434.00
## 9 FINC W 1 0 57139.90
## 10 MKTG W 5 0 64496.02
## 11 SALE W 5 0 64188.25
## 12 <NA> W 1 0 53772.58
The output of pivot() is a data frame of the aggregated
variables. This can be saved for further analysis. Here, perform the
same analysis, but list the required parameter values in order without
the parameter names. Save the result into the data frame a and
the display the contents of a.
a <- pivot(d, mean, Salary, by=c(Dept, Gender))
a
## Dept Gender n na mean
## 1 ACCT M 2 0 59626.19
## 2 ADMN M 2 0 80963.35
## 3 FINC M 3 0 72967.60
## 4 MKTG M 1 0 99062.66
## 5 SALE M 10 0 86150.97
## 6 <NA> M 0 0 NA
## 7 ACCT W 3 0 63237.16
## 8 ADMN W 4 0 81434.00
## 9 FINC W 1 0 57139.90
## 10 MKTG W 5 0 64496.02
## 11 SALE W 5 0 64188.25
## 12 <NA> W 1 0 53772.58
Because the output of pivot() with no
by_cols variables is a standard R data frame, typical
operations such as sorting can be applied, here using the
lessR function Sort().
Sort(a, by=Dept)
##
## Sort Specification
## Dept --> ascending
## Dept Gender n na mean
## 1 ACCT M 2 0 59626.19
## 7 ACCT W 3 0 63237.16
## 2 ADMN M 2 0 80963.35
## 8 ADMN W 4 0 81434.00
## 3 FINC M 3 0 72967.60
## 9 FINC W 1 0 57139.90
## 4 MKTG M 1 0 99062.66
## 10 MKTG W 5 0 64496.02
## 5 SALE M 10 0 86150.97
## 11 SALE W 5 0 64188.25
## 6 <NA> M 0 0 NA
## 12 <NA> W 1 0 53772.58
Multiple variables for which to aggregate over can also
be specified. Round the numerical aggregated results to two decimal
digits.
pivot(d, mean, c(Years, Salary), c(Dept, Gender), digits_d=2)
## Dept Gender Years_n Years_na Years_mean Salary_n Salary_na Salary_mean
## 1 ACCT M 2 0 7.00 2 0 59626.20
## 2 ADMN M 2 0 15.50 2 0 80963.34
## 3 FINC M 3 0 11.33 3 0 72967.60
## 4 MKTG M 1 0 18.00 1 0 99062.66
## 5 SALE M 9 1 12.33 10 0 86150.97
## 6 <NA> M 0 0 NA 0 0 NA
## 7 ACCT W 3 0 4.67 3 0 63237.16
## 8 ADMN W 4 0 7.50 4 0 81434.00
## 9 FINC W 1 0 7.00 1 0 57139.90
## 10 MKTG W 5 0 8.20 5 0 64496.02
## 11 SALE W 5 0 6.60 5 0 64188.25
## 12 <NA> W 1 0 5.00 1 0 53772.58
Specify up to two by_cols categorical variables, to
create a two-dimensional table with the specified columns. First, one
by_cols variable, Gender. Specifying one or two categorical
variables as by_cols variables moves them from their
default position in the rows to the columns.
Moving rows to columns changes the output structure from a long-form data frame that can be input into other analysis functions to a cross-classification table with categorical variables in the rows and columns. The primary purpose of the cross-classification table is to view the aggregated results, not to subject them to further analysis. For viewing, the tabular presentation is often easier to interpret.
Here the fifth value entered in the function call is the value of
by_cols, Gender in this example.
pivot(d, mean, Salary, by=Dept, by_cols=Gender)
Table: mean of Salary
Gender M W
Dept
------- --------- ---------
ACCT 59626.19 63237.16
ADMN 80963.35 81434.00
FINC 72967.60 57139.90
MKTG 99062.66 64496.02
SALE 86150.97 64188.25
NA NA 53772.58
Here are two by_cols variables, specified as a
vector.
pivot(d, mean, Salary, by=Dept, by_cols=c(Gender, Plan))
Table: mean of Salary
Gender M W
Plan 1 2 3 1 2 3
Dept
------- --------- ---------- --------- --------- --------- ---------
ACCT 69547.60 NA 49704.79 NA 63237.16 NA
ADMN 53788.26 108138.43 NA NA 67724.21 122563.4
FINC 61937.62 NA 95027.55 NA 57139.90 NA
MKTG NA NA 99062.66 56772.95 66426.79 NA
SALE 89393.40 82442.74 80131.91 60941.54 66352.73 NA
NA NA NA NA 53772.58 NA NA
Tabulation is counting. The number of data values in each cell are
tabulated, that is, counted, their frequency of occurrence obtained. For
the function that specifies what to compute, specify
table.
pivot(d, table, Dept, by=Gender)
## Gender n na ACCT ADMN FINC MKTG SALE
## 1 M 18 0 2 2 3 1 10
## 2 W 18 1 3 4 1 5 5
Note you can get the same joint frequency table from the
lessR function BarChart().
BarChart(Dept, by=Gender)
## >>> Suggestions
## Plot(Dept, Gender) # bubble plot
## BarChart(Dept, by=Gender, horiz=TRUE) # horizontal bar chart
## BarChart(Dept, fill="steelblue") # steelblue bars
##
## Joint and Marginal Frequencies
## ------------------------------
##
## Dept
## Gender ACCT ADMN FINC MKTG SALE Sum
## M 2 2 3 1 10 18
## W 3 4 1 5 5 18
## Sum 5 6 4 6 15 36
##
## Cramer's V: 0.415
##
## Chi-square Test of Independence:
## Chisq = 6.200, df = 4, p-value = 0.185
## >>> Low cell expected frequencies, chi-squared approximation may not be accurate
Of course, many other functions are available from which to obtain the joint frequencies of the two categorical variables.
The Base R function, upon which pivot() relies, is
aggregate(). Express this function in one of two modes.
This example is of formula mode.
a <- aggregate(Salary ~ Dept + Gender, data=d, FUN=mean)
a
## Dept Gender Salary
## 1 ACCT M 59626.20
## 2 ADMN M 80963.35
## 3 FINC M 72967.60
## 4 MKTG M 99062.66
## 5 SALE M 86150.97
## 6 ACCT W 63237.16
## 7 ADMN W 81434.00
## 8 FINC W 57139.90
## 9 MKTG W 64496.02
## 10 SALE W 64188.25
For dplyr, aggregate with two functions,
group_by() and summarize(). First define the
grouping, then the function and then the variable by which to summarize.
As with all tidyverse functions, begin the function call with the name
of the relevant data frame.
Specify the name of the aggregated variable. Here the mean of Salary across the groups is named mean. Here write the aggregated data to the a data frame.
d = group_by(d, Dept, Gender)
a = summarize(d, mean = mean(Salary))
## `summarise()` has grouped output by 'Dept'. You can override using the `.groups`
## argument.
a
## # A tibble: 11 × 3
## # Groups: Dept [6]
## Dept Gender mean
## <chr> <chr> <dbl>
## 1 ACCT M 59626.
## 2 ACCT W 63237.
## 3 ADMN M 80963.
## 4 ADMN W 81434.
## 5 FINC M 72968.
## 6 FINC W 57140.
## 7 MKTG M 99063.
## 8 MKTG W 64496.
## 9 SALE M 86151.
## 10 SALE W 64188.
## 11 <NA> W 53773.
As is true of the data output of any tidyverse function that outputs
a data frame, the data frame is the tidyverse version called a
tibble. We can always convert back to a regular R data
frame with the Base R function data.frame().
a <- data.frame(a)
a
## Dept Gender mean
## 1 ACCT M 59626.20
## 2 ACCT W 63237.16
## 3 ADMN M 80963.35
## 4 ADMN W 81434.00
## 5 FINC M 72967.60
## 6 FINC W 57139.90
## 7 MKTG M 99062.66
## 8 MKTG W 64496.02
## 9 SALE M 86150.97
## 10 SALE W 64188.25
## 11 <NA> W 53772.58
Usually, though, the version of the data frame is not an issue. For
example, all lessR analysis functions accept data frames in
either form.
Stefan Milton’s magrittr 2014 package of the same name
introduced the forward pipe (or chain) operator, %>%,
pronounced then. The operator transformed the way many people
code their data transformations by organizing them as a flow from start
to end.
Forward pipe operator: Carries forward the output of one function on the left of the operator into the input of the subsequent function on the right.
By default the input goes into the first argument of the subsequent function, which applies to functions that begin their parameter list with the data frame.
After the magrittr package appeared, the tidyverse
adopted the operator (to the extent that many R users think that the
tidyverse people invented it and it only works with tidyverse
functions). With Version 4 of R, the R people followed Milton’s lead and
introduced the Base R pipe operator, |>, used here in
these examples.
Obtain the pipe operator directly from the keyboard within RStudio as
CNTRL/CMD-SHIFT-M. The default is %>%. However, this
default can be changed with a trip to:
Checking the relevant checkbox changes the default pipe operator to
the Base R |>.
The forward pipe, either the magrittr or
Base R version, shows up in much contemporary code, so good
to be familiar with it. For basic operations, such as shown here, either
version of the pipe works equally well.
The result is that a series of linked operations can be specified
with a single entry into the R console. The function for listing the
variable names and, by default, the first six rows of data, is the
head() function. For the data frame d the
traditional way to invoke the function follows.
head(d)
## Years Gender Dept Salary JobSat Plan Pre Post
## Ritchie, Darnell 7 M ADMN 53788.26 med 1 82 92
## Wu, James NA M SALE 94494.58 low 1 62 74
## Downs, Deborah 7 W FINC 57139.90 high 2 90 86
## Hoang, Binh 15 M SALE 111074.86 low 3 96 97
## Jones, Alissa 5 W <NA> 53772.58 <NA> 1 65 62
## Afshari, Anbar 6 W ADMN 69441.93 high 2 100 100
Without modification, the forward pipe operator requires that the receiving function have the data frame as its first parameter. Then the data output by the function that precedes the chain operator becomes the data input into the function on the right side of the chain operator.
To access the head() function with the pipe operator,
first list the data frame name, which is an implicit call to the
print() function. The output of this function is then
chained forward as the input to the first parameter in the
head() function, the object for which to display the
beginning lines. Read as, “take the data frame d and pass to
the data (first) parameter of the head() function”.
d |> head()
## Years Gender Dept Salary JobSat Plan Pre Post
## Ritchie, Darnell 7 M ADMN 53788.26 med 1 82 92
## Wu, James NA M SALE 94494.58 low 1 62 74
## Downs, Deborah 7 W FINC 57139.90 high 2 90 86
## Hoang, Binh 15 M SALE 111074.86 low 3 96 97
## Jones, Alissa 5 W <NA> 53772.58 <NA> 1 65 62
## Afshari, Anbar 6 W ADMN 69441.93 high 2 100 100
Other parameters could also be invoked in the call to
head(). For example, to just list the first three lines of
data, explicitly set the second parameter of the function, n.
Because the data frame d is here piped into the first parameter
value of the head() function, the parameter and its value
do not appear in the list of parameter values for the function call to
head().
d |> head(n=3)
## Years Gender Dept Salary JobSat Plan Pre Post
## Ritchie, Darnell 7 M ADMN 53788.26 med 1 82 92
## Wu, James NA M SALE 94494.58 low 1 62 74
## Downs, Deborah 7 W FINC 57139.90 high 2 90 86
By default the pipe operator takes the output from the preceding
function in the chain and inputs the previous output into the first
parameter of the subsequent function. The underline notation makes this
mechanism explicit. The _ serves as a placeholder for the
piped-in incoming information and must be accompanied by the parameter
name. For the magrittr pipe operator, %>%,
the placeholder is a period, ..
The first parameter of the head() function is
x, the R object for which the first lines are displayed. Here,
instead of relying on the default, explicitly set this value to the
output of the previous function in the chain with the underline
placeholder.
d |> head(x=_)
## Years Gender Dept Salary JobSat Plan Pre Post
## Ritchie, Darnell 7 M ADMN 53788.26 med 1 82 92
## Wu, James NA M SALE 94494.58 low 1 62 74
## Downs, Deborah 7 W FINC 57139.90 high 2 90 86
## Hoang, Binh 15 M SALE 111074.86 low 3 96 97
## Jones, Alissa 5 W <NA> 53772.58 <NA> 1 65 62
## Afshari, Anbar 6 W ADMN 69441.93 high 2 100 100
Another advantage of the _ placeholder is that the
output of the previous function could be forwarded to any
specified parameter in the subsequent function, not just the default
first parameter. The following example is a bit more awkward than the
traditional function call, but illustrates that the _ can
be the receiver of information passed to any parameter, not just the
first. Forward the constant 2 into the head()
function as the value of n, the number of lines to display
for the input data frame.
2 |> head(d, n=_)
## Years Gender Dept Salary JobSat Plan Pre Post
## Ritchie, Darnell 7 M ADMN 53788.26 med 1 82 92
## Wu, James NA M SALE 94494.58 low 1 62 74
Or, for the magrittr pipe operator, use its placeholder,
.. To use this operator, the magrittr package
or the tidyverse must be accessed first with the library()
function.
2 %>% head(d, n=.)
## Years Gender Dept Salary JobSat Plan Pre Post
## Ritchie, Darnell 7 M ADMN 53788.26 med 1 82 92
## Wu, James NA M SALE 94494.58 low 1 62 74
Because the Base R operator is part of R, no external packages need
be referenced for its pipe operator. That said, because the
magrittr pipe operator was first its use is quite common by
R users.
Here create a pivot table of Salary averaged over all combinations of Dept and Gender, just for those employees with health Plan number 2. If this data frame had many variables, for convenience perhaps select just a subset of the variables, those of interest for this and subsequent analyses.
Employ the lessR functions .() and
pivot() in combination with the Base R Extract function
[ ]. Again, the general form is the expression for
selecting the rows first, and both the row and column expressions
enclosed in .( ).
d[.(rows), .(columns)]
This first example does not use the pipe operator. Instead, create an
intermediate data frame dd that stores the subsetted data
frame, ready to pass to pivot(). Perhaps not as elegant as
using the pipe operator, but works well.
Because we are aggregating over two categorical variables, Dept and
Gender, need to specify the by variable as vector with the
c() function, c(Dept, Gender).
dd <- d[.(Plan == 2), .(Gender:JobSat)]
a <- pivot(dd, mean, Salary, by=c(Dept, Gender))
More elegantly, use these functions with the pipe operator to avoid
creating an intermediate data frame. This is my preferred code to
accomplish the goal of generating data frame a. The first
parameter value of pivot() is data for
specifying the data frame, so the function is amenable to the default
behavior of |>.
To keep the linear flow of the operations from left to right, turn
the assignment statement around from <- to
->, meaning that the output of pivot() is
assigned to the a data frame in this example. Setting the
expression with a at the end using -> reflects
the temporal flow of operations from left to right. However, the more
traditional a <- could be used, setting that notation at
the beginning of the entire expression.
Remember that inside the square brackets, [ ], the
information before the comma selects the rows and the information after
the coma selects the columns, the variables.
d[.(Plan == 2), .(Gender:JobSat)] |>
pivot(mean, Salary, c(Dept, Gender)) -> a
a
## Dept Gender n na mean
## 1 ACCT M 0 0 NA
## 2 ADMN M 1 0 108138.43
## 3 FINC M 0 0 NA
## 4 MKTG M 0 0 NA
## 5 SALE M 2 0 82442.74
## 6 ACCT W 3 0 63237.16
## 7 ADMN W 3 0 67724.21
## 8 FINC W 1 0 57139.90
## 9 MKTG W 4 0 66426.79
## 10 SALE W 3 0 66352.73
The above code would be created and run for the following management request:
“For the d data frame, let’s analyze the data for only those employees with insurance Plan 2, and only the variables in the data frame from Gender through JobSat. From this subset of the data, calculate the mean salary for each gender across all departments.”
With the tidyverse dplyr package, select rows with
filter() and columns with select(). Then use
group_by() and summarize() to aggregate. Store
the result in data frame a. Note that the R assignment operator
works in either direction. Here use -> to assign the
result to a.
d |> filter(Plan == 2) |> select(Gender:JobSat) |>
group_by(Dept, Gender) |> summarize(mean=mean(Salary)) -> a
## `summarise()` has grouped output by 'Dept'. You can override using the `.groups`
## argument.
a
## # A tibble: 7 × 3
## # Groups: Dept [5]
## Dept Gender mean
## <chr> <chr> <dbl>
## 1 ACCT W 63237.
## 2 ADMN M 108138.
## 3 ADMN W 67724.
## 4 FINC W 57140.
## 5 MKTG W 66427.
## 6 SALE M 82443.
## 7 SALE W 66353.
The tidyverse people at the RStudio company, Posit, even redefine the standard data frame. Their functions create their version of a data frame called a tibble, for table.
The tidyverse dplyr functions get the aggregation done
with 121 characters and four invocations of the pipe operator.
lessR accomplishes the same task with 79 characters and
only one use of the pipe operator. lessR provides more
concise code which, in my opinion, is at least as readable as the
dplyr code, if not more so.
pivot() also provides for a table for viewing instead of
a data frame ready for input into other analysis functions. To create a
version as a table instead of a data frame, move the variable Gender to
the optional by_cols parameter.
d[.(Plan == 2), .(Gender:JobSat)] |>
pivot(mean, Salary, by=Dept, by_cols=Gender)
Table: mean of Salary
Gender M W
Dept
------- ---------- ---------
ACCT NA 63237.16
ADMN 108138.43 67724.21
FINC NA 57139.90
MKTG NA 66426.79
SALE 82442.74 66352.73
Showing the actual table also shows the missing data. Again, to not show the missing data,
The chain operator is useful, good to have in the R toolkit. The new Base R pipe operator, for example, indicates that the maintainers of Base R recognize its usefulness. The primary advantage is that its use follows the natural order in which the functions are called and so, once its meaning is realized, to many easier to write and read than the other technology that creates intermediate data structures for each data processing step.
The lessR function Sort() sorts the rows of
a data frame according to the values of one or more specified variables.
To limit the output, first redefine the data frame with only the first
10 rows.
d <- d[1:10, ]
Then sort, here by the values of the variable Dept. Just display the sorted data frame without saving the sort. The default is to sort in ascending order. The first value to pass to the function is the data frame, here d.
Sort(d, by=Dept)
##
## Sort Specification
## Dept --> ascending
## Years Gender Dept Salary JobSat Plan Pre Post
## Ritchie, Darnell 7 M ADMN 53788.26 med 1 82 92
## Afshari, Anbar 6 W ADMN 69441.93 high 2 100 100
## Downs, Deborah 7 W FINC 57139.90 high 2 90 86
## Knox, Michael 18 M MKTG 99062.66 med 3 81 84
## Kimball, Claire 8 W MKTG 61356.69 high 2 93 92
## Cooper, Lindsay 4 W MKTG 56772.95 high 1 78 91
## Wu, James NA M SALE 94494.58 low 1 62 74
## Hoang, Binh 15 M SALE 111074.86 low 3 96 97
## Campagna, Justin 8 M SALE 72321.36 low 1 76 84
## Jones, Alissa 5 W <NA> 53772.58 <NA> 1 65 62
Here sort in descending order, specified by the value -
for the parameter direction.
Sort(d, by=Dept, direction="-")
##
## Sort Specification
## Dept --> descending
## Years Gender Dept Salary JobSat Plan Pre Post
## Wu, James NA M SALE 94494.58 low 1 62 74
## Hoang, Binh 15 M SALE 111074.86 low 3 96 97
## Campagna, Justin 8 M SALE 72321.36 low 1 76 84
## Knox, Michael 18 M MKTG 99062.66 med 3 81 84
## Kimball, Claire 8 W MKTG 61356.69 high 2 93 92
## Cooper, Lindsay 4 W MKTG 56772.95 high 1 78 91
## Downs, Deborah 7 W FINC 57139.90 high 2 90 86
## Ritchie, Darnell 7 M ADMN 53788.26 med 1 82 92
## Afshari, Anbar 6 W ADMN 69441.93 high 2 100 100
## Jones, Alissa 5 W <NA> 53772.58 <NA> 1 65 62
Sort by Dept in descending order and Gender in ascending order. Use vectors to define multiple variables to sort, and the define multiple directions from which to sort. There is a 1-to-1 correspondence between variables to sort and their corresponding directions.
Sort(d, by=c(Dept, Gender), direction=c("-", "+"))
##
## Sort Specification
## Dept --> descending
## Gender --> ascending
## Years Gender Dept Salary JobSat Plan Pre Post
## Wu, James NA M SALE 94494.58 low 1 62 74
## Hoang, Binh 15 M SALE 111074.86 low 3 96 97
## Campagna, Justin 8 M SALE 72321.36 low 1 76 84
## Knox, Michael 18 M MKTG 99062.66 med 3 81 84
## Kimball, Claire 8 W MKTG 61356.69 high 2 93 92
## Cooper, Lindsay 4 W MKTG 56772.95 high 1 78 91
## Downs, Deborah 7 W FINC 57139.90 high 2 90 86
## Ritchie, Darnell 7 M ADMN 53788.26 med 1 82 92
## Afshari, Anbar 6 W ADMN 69441.93 high 2 100 100
## Jones, Alissa 5 W <NA> 53772.58 <NA> 1 65 62
Can also sort by row names in ascending order.
Sort(d, row.names)
##
## Sort Specification
## row.names --> ascending
## Years Gender Dept Salary JobSat Plan Pre Post
## Afshari, Anbar 6 W ADMN 69441.93 high 2 100 100
## Campagna, Justin 8 M SALE 72321.36 low 1 76 84
## Cooper, Lindsay 4 W MKTG 56772.95 high 1 78 91
## Downs, Deborah 7 W FINC 57139.90 high 2 90 86
## Hoang, Binh 15 M SALE 111074.86 low 3 96 97
## Jones, Alissa 5 W <NA> 53772.58 <NA> 1 65 62
## Kimball, Claire 8 W MKTG 61356.69 high 2 93 92
## Knox, Michael 18 M MKTG 99062.66 med 3 81 84
## Ritchie, Darnell 7 M ADMN 53788.26 med 1 82 92
## Wu, James NA M SALE 94494.58 low 1 62 74
Specify random for the variable to randomly re-shuffle
the rows of data.
Sort(d, random)
##
## Sort Specification
## random
## Years Gender Dept Salary JobSat Plan Pre Post
## Downs, Deborah 7 W FINC 57139.90 high 2 90 86
## Hoang, Binh 15 M SALE 111074.86 low 3 96 97
## Kimball, Claire 8 W MKTG 61356.69 high 2 93 92
## Ritchie, Darnell 7 M ADMN 53788.26 med 1 82 92
## Knox, Michael 18 M MKTG 99062.66 med 3 81 84
## Cooper, Lindsay 4 W MKTG 56772.95 high 1 78 91
## Afshari, Anbar 6 W ADMN 69441.93 high 2 100 100
## Wu, James NA M SALE 94494.58 low 1 62 74
## Campagna, Justin 8 M SALE 72321.36 low 1 76 84
## Jones, Alissa 5 W <NA> 53772.58 <NA> 1 65 62
The tidyverse provides the function arrange() to sort
the rows of a data frame. The default is ascending order.
arrange(d, Dept)
## Years Gender Dept Salary JobSat Plan Pre Post
## Ritchie, Darnell 7 M ADMN 53788.26 med 1 82 92
## Afshari, Anbar 6 W ADMN 69441.93 high 2 100 100
## Downs, Deborah 7 W FINC 57139.90 high 2 90 86
## Knox, Michael 18 M MKTG 99062.66 med 3 81 84
## Kimball, Claire 8 W MKTG 61356.69 high 2 93 92
## Cooper, Lindsay 4 W MKTG 56772.95 high 1 78 91
## Wu, James NA M SALE 94494.58 low 1 62 74
## Hoang, Binh 15 M SALE 111074.86 low 3 96 97
## Campagna, Justin 8 M SALE 72321.36 low 1 76 84
## Jones, Alissa 5 W <NA> 53772.58 <NA> 1 65 62
To sort in descending order, following the tidyverse way, introduce
another function, here desc().
arrange(d, desc(Dept))
## Years Gender Dept Salary JobSat Plan Pre Post
## Wu, James NA M SALE 94494.58 low 1 62 74
## Hoang, Binh 15 M SALE 111074.86 low 3 96 97
## Campagna, Justin 8 M SALE 72321.36 low 1 76 84
## Knox, Michael 18 M MKTG 99062.66 med 3 81 84
## Kimball, Claire 8 W MKTG 61356.69 high 2 93 92
## Cooper, Lindsay 4 W MKTG 56772.95 high 1 78 91
## Downs, Deborah 7 W FINC 57139.90 high 2 90 86
## Ritchie, Darnell 7 M ADMN 53788.26 med 1 82 92
## Afshari, Anbar 6 W ADMN 69441.93 high 2 100 100
## Jones, Alissa 5 W <NA> 53772.58 <NA> 1 65 62
Or, sort by two variables.
arrange(d, Dept, Gender)
## Years Gender Dept Salary JobSat Plan Pre Post
## Ritchie, Darnell 7 M ADMN 53788.26 med 1 82 92
## Afshari, Anbar 6 W ADMN 69441.93 high 2 100 100
## Downs, Deborah 7 W FINC 57139.90 high 2 90 86
## Knox, Michael 18 M MKTG 99062.66 med 3 81 84
## Kimball, Claire 8 W MKTG 61356.69 high 2 93 92
## Cooper, Lindsay 4 W MKTG 56772.95 high 1 78 91
## Wu, James NA M SALE 94494.58 low 1 62 74
## Hoang, Binh 15 M SALE 111074.86 low 3 96 97
## Campagna, Justin 8 M SALE 72321.36 low 1 76 84
## Jones, Alissa 5 W <NA> 53772.58 <NA> 1 65 62
d[!complete.cases(d), ]
## Years Gender Dept Salary JobSat Plan Pre Post
## Wu, James NA M SALE 94494.58 low 1 62 74
## Jones, Alissa 5 W <NA> 53772.58 <NA> 1 65 62
## Korhalkar, Jessica 2 W ACCT 72502.50 <NA> 2 74 87
There is missing data.
d
## Years Gender Dept Salary JobSat Plan Pre Post
## Ritchie, Darnell 7 M ADMN 53788.26 med 1 82 92
## Wu, James NA M SALE 94494.58 low 1 62 74
## Downs, Deborah 7 W FINC 57139.90 high 2 90 86
## Hoang, Binh 15 M SALE 111074.86 low 3 96 97
## Jones, Alissa 5 W <NA> 53772.58 <NA> 1 65 62
## Afshari, Anbar 6 W ADMN 69441.93 high 2 100 100
## Knox, Michael 18 M MKTG 99062.66 med 3 81 84
## Campagna, Justin 8 M SALE 72321.36 low 1 76 84
## Kimball, Claire 8 W MKTG 61356.69 high 2 93 92
## Cooper, Lindsay 4 W MKTG 56772.95 high 1 78 91
## Saechao, Suzanne 8 W SALE 55545.25 med 1 98 100
## Pham, Scott 13 M SALE 81871.05 high 2 90 94
## Tian, Fang 9 W ACCT 71084.02 med 2 60 61
## Bellingar, Samantha 10 W SALE 66337.83 med 1 67 72
## Sheppard, Cory 14 M FINC 95027.55 low 3 66 73
## Kralik, Laura 10 W SALE 92681.19 med 2 74 71
## Skrotzki, Sara 18 W MKTG 91352.33 med 2 63 61
## Correll, Trevon 21 M SALE 134419.23 low 1 97 94
## James, Leslie 18 W ADMN 122563.38 low 3 70 70
## Osterman, Pascal 5 M ACCT 49704.79 high 3 69 70
## Adib, Hassan 14 M SALE 83014.43 med 2 71 69
## Gvakharia, Kimberly 3 W SALE 49868.68 med 2 83 79
## Stanley, Grayson 9 M SALE 69624.87 low 1 74 73
## Link, Thomas 10 M FINC 66312.89 low 1 83 83
## Portlock, Ryan 13 M SALE 77714.85 low 1 72 73
## Langston, Matthew 5 M SALE 49188.96 low 3 94 93
## Stanley, Emma 3 W ACCT 46124.97 high 2 86 84
## Singh, Niral 2 W ADMN 61055.44 high 2 59 59
## Anderson, David 9 M ACCT 69547.60 low 1 94 91
## Fulton, Scott 13 M SALE 87785.51 low 1 72 73
## Korhalkar, Jessica 2 W ACCT 72502.50 <NA> 2 74 87
## LaRoe, Maria 10 W MKTG 61961.29 high 2 80 86
## Billing, Susan 4 W ADMN 72675.26 med 2 91 90
## Capelle, Adam 24 M ADMN 108138.43 med 2 83 81
## Hamide, Bita 1 W MKTG 51036.85 high 2 83 90
## Anastasiou, Crystal 2 W SALE 56508.32 low 2 59 71
## Cassinelli, Anastis 10 M FINC 57562.36 high 1 80 87
To remove the all rows of data (cases) that contain at least one
missing value, use the Base R na.omit() function.
d <- na.omit(d)
The result is a data frame with 9 rows instead of 10 rows.
d
## Years Gender Dept Salary JobSat Plan Pre Post
## Ritchie, Darnell 7 M ADMN 53788.26 med 1 82 92
## Downs, Deborah 7 W FINC 57139.90 high 2 90 86
## Hoang, Binh 15 M SALE 111074.86 low 3 96 97
## Afshari, Anbar 6 W ADMN 69441.93 high 2 100 100
## Knox, Michael 18 M MKTG 99062.66 med 3 81 84
## Campagna, Justin 8 M SALE 72321.36 low 1 76 84
## Kimball, Claire 8 W MKTG 61356.69 high 2 93 92
## Cooper, Lindsay 4 W MKTG 56772.95 high 1 78 91
## Saechao, Suzanne 8 W SALE 55545.25 med 1 98 100
## Pham, Scott 13 M SALE 81871.05 high 2 90 94
## Tian, Fang 9 W ACCT 71084.02 med 2 60 61
## Bellingar, Samantha 10 W SALE 66337.83 med 1 67 72
## Sheppard, Cory 14 M FINC 95027.55 low 3 66 73
## Kralik, Laura 10 W SALE 92681.19 med 2 74 71
## Skrotzki, Sara 18 W MKTG 91352.33 med 2 63 61
## Correll, Trevon 21 M SALE 134419.23 low 1 97 94
## James, Leslie 18 W ADMN 122563.38 low 3 70 70
## Osterman, Pascal 5 M ACCT 49704.79 high 3 69 70
## Adib, Hassan 14 M SALE 83014.43 med 2 71 69
## Gvakharia, Kimberly 3 W SALE 49868.68 med 2 83 79
## Stanley, Grayson 9 M SALE 69624.87 low 1 74 73
## Link, Thomas 10 M FINC 66312.89 low 1 83 83
## Portlock, Ryan 13 M SALE 77714.85 low 1 72 73
## Langston, Matthew 5 M SALE 49188.96 low 3 94 93
## Stanley, Emma 3 W ACCT 46124.97 high 2 86 84
## Singh, Niral 2 W ADMN 61055.44 high 2 59 59
## Anderson, David 9 M ACCT 69547.60 low 1 94 91
## Fulton, Scott 13 M SALE 87785.51 low 1 72 73
## LaRoe, Maria 10 W MKTG 61961.29 high 2 80 86
## Billing, Susan 4 W ADMN 72675.26 med 2 91 90
## Capelle, Adam 24 M ADMN 108138.43 med 2 83 81
## Hamide, Bita 1 W MKTG 51036.85 high 2 83 90
## Anastasiou, Crystal 2 W SALE 56508.32 low 2 59 71
## Cassinelli, Anastis 10 M FINC 57562.36 high 1 80 87
The function complete.cases() returns a logical vector
indicating which cases are complete, i.e., have no missing values.