##
## lessR 3.9.4 feedback: gerbing@pdx.edu web: lessRstats.com/new
## -------------------------------------------------------------------
## > d <- Read("") Read text, Excel, SPSS, SAS or R data file
## d: default data frame, no need for data=
## > l <- Read("", var_labels=TRUE) Read variable labels into l,
## required name for data frame of labels
## > Help() Get help, and, e.g., Help(Read)
## > hs(), bc(), or ca() All histograms, all bar charts, or both
## > Plot(X) or Plot(X,Y) For continuous and categorical variables
## > by1= , by2= Trellis graphics, a plot for each by1, by2
## > reg(Y ~ X, Rmd="eg") Regression with full interpretative output
## > style("gray") Grayscale theme, + many others available
## > style(show=TRUE) all color/style options and current values
## > getColors() create many styles of color palettes
## > d[.(rows), .(cols)] subset with . more flexible than base R
Extract
The most general way to subset a data frame by rows and/or columns is the base R Extract
function, called by d[rows, columms]
, where d is the data frame. Ultimately, to use this function, for the rows parameter, pass the row names of the selected rows, the indices or actual names. For the cols parameter, pass the column indices of the selected columns.
Extract
In many uses, perhaps most uses, the indices are not directly provided, but obtained indirectly through some logical specification. Here Extract
has several annoying features.
rows
: Any reference to the variables in the data frame for this specification must contain the name of the data frame followed by a $. But this name has already been specified in the function call, and now is redundant, simply repeated for every variable reference.cols
: Usually specified with a vector of variable names.rows
and cols
. Instead enter the literal conditions, which can make the entire expression quite large.To address these deficiencies, lessR provides the function .()
for obtaining the indices of selected rows and of selected columns. This function is only callable within the base R Extract
function, what R refers to as non-standard evaluation. That basically means that the annoying restrictions are removed.
The general form of the subsetting follows.
d[.(rows), .(columns)]
That is, call the Extract
function as before, but now wrap the row and column expressions with .()
.
To illustrate, use the Employee data set contained in lessR, here read into the d data frame.
Subset the data frame by only listing observations with a Gender of “M” with scores on Post larger than 90. Only list columns for the variables in the range from Years to Salary, and Post. Referring back to the output of Read()
, the variable range includes Years, Gender, Dept, and Salary.
## Years Gender Dept Salary Post
## Ritchie, Darnell 7 M ADMN 53788.26 92
## Hoang, Binh 15 M SALE 111074.86 97
## Pham, Scott 13 M SALE 81871.05 94
## Correll, Trevon 21 M SALE 134419.23 94
## Langston, Matthew 5 M SALE 49188.96 93
## Anderson, David 9 M ACCT 69547.60 91
Following is the traditional R call for subsetting.
## Years Gender Dept Salary Post
## Ritchie, Darnell 7 M ADMN 53788.26 92
## Hoang, Binh 15 M SALE 111074.86 97
## Pham, Scott 13 M SALE 81871.05 94
## Correll, Trevon 21 M SALE 134419.23 94
## Langston, Matthew 5 M SALE 49188.96 93
## Anderson, David 9 M ACCT 69547.60 91
To negate a row selection, add a !
to the beginning of the logical condition passed to .()
, within the call to .()
. To exclude the specified variables, place a -
, in front of the call to .()
.
## Years Gender Post
## Wu, James NA M 74
## Jones, Alissa 5 F 62
## Downs, Deborah 7 F 86
## Afshari, Anbar 6 F 100
## Knox, Michael 18 M 84
## Campagna, Justin 8 M 84
## Kimball, Claire 8 F 92
## Cooper, Lindsay 4 F 91
## Saechao, Suzanne 8 F 100
## Tian, Fang 9 F 61
## Bellingar, Samantha 10 F 72
## Sheppard, Cory 14 M 73
## Kralik, Laura 10 F 71
## Skrotzki, Sara 18 F 61
## James, Leslie 18 F 70
## Osterman, Pascal 5 M 70
## Adib, Hassan 14 M 69
## Gvakharia, Kimberly 3 F 79
## Stanley, Grayson 9 M 73
## Link, Thomas 10 M 83
## Portlock, Ryan 13 M 73
## Stanley, Emma 3 F 84
## Singh, Niral 2 F 59
## Fulton, Scott 13 M 73
## Korhalkar, Jessica 2 F 87
## LaRoe, Maria 10 F 86
## Billing, Susan 4 F 90
## Capelle, Adam 24 M 81
## Hamide, Bita 1 F 90
## Anastasiou, Crystal 2 F 71
## Cassinelli, Anastis 10 M 87
Can still provide the indices directly for one or both of the expressions as the base R Extract
function is unmodified.
## Years Gender Dept Salary Post
## Ritchie, Darnell 7 M ADMN 53788.26 92
## Wu, James NA M SALE 94494.58 74
## Hoang, Binh 15 M SALE 111074.86 97
## Years Gender Dept
## Ritchie, Darnell 7 M ADMN
## Hoang, Binh 15 M SALE
## Pham, Scott 13 M SALE
## Correll, Trevon 21 M SALE
## Langston, Matthew 5 M SALE
## Anderson, David 9 M ACCT
To enhance readability, store the specified row or column conditions as character strings. Each string must be named rows
or cols
. Because the entire expression is a character string, differentiate between single and double quotes as needed. For example, use single quotes within the string and double quotes to define the entire string, as illustrated next.
To subset, pass the respective character strings, rows
and cols
, to .()
, respectively.
## Gender Dept Salary Post
## Hoang, Binh M SALE 111074.86 97
## Pham, Scott M SALE 81871.05 94
## Correll, Trevon M SALE 134419.23 94
To negate, as with the literal expressions, use !
for the rows and -
for the columns. Notice their placement.
## Years JobSat Plan Pre
## Ritchie, Darnell 7 med 1 82
## Wu, James NA low 1 62
## Jones, Alissa 5 <NA> 1 65
## Downs, Deborah 7 high 2 90
## Afshari, Anbar 6 high 2 100
## Knox, Michael 18 med 3 81
## Campagna, Justin 8 low 1 76
## Kimball, Claire 8 high 2 93
## Cooper, Lindsay 4 high 1 78
## Saechao, Suzanne 8 med 1 98
## Tian, Fang 9 med 2 60
## Bellingar, Samantha 10 med 1 67
## Sheppard, Cory 14 low 3 66
## Kralik, Laura 10 med 2 74
## Skrotzki, Sara 18 med 2 63
## James, Leslie 18 low 3 70
## Osterman, Pascal 5 high 3 69
## Adib, Hassan 14 med 2 71
## Gvakharia, Kimberly 3 med 2 83
## Stanley, Grayson 9 low 1 74
## Link, Thomas 10 low 1 83
## Portlock, Ryan 13 low 1 72
## Langston, Matthew 5 low 3 94
## Stanley, Emma 3 high 2 86
## Singh, Niral 2 high 2 59
## Anderson, David 9 low 1 94
## Fulton, Scott 13 low 1 72
## Korhalkar, Jessica 2 <NA> 2 74
## LaRoe, Maria 10 high 2 80
## Billing, Susan 4 med 2 91
## Capelle, Adam 24 med 2 83
## Hamide, Bita 1 high 2 83
## Anastasiou, Crystal 2 low 2 59
## Cassinelli, Anastis 10 high 1 80
Can also randomly select the specified number of rows from the data frame to subset. To do so, specify the random()
function for the logical criterion of the rows
. The value passed to random()
can either be the actual number of rows to select, as below, or, the proportion of rows to select.
## Years Gender Dept Salary
## James, Leslie 18 F ADMN 122563.38
## Kralik, Laura 10 F SALE 92681.19
## Fulton, Scott 13 M SALE 87785.51
## Langston, Matthew 5 M SALE 49188.96
## Pham, Scott 13 M SALE 81871.05
Here specify a proportion of rows to select.
## Years Gender Dept Salary
## Gvakharia, Kimberly 3 F SALE 49868.68
## Stanley, Grayson 9 M SALE 69624.87
## Knox, Michael 18 M MKTG 99062.66
## Capelle, Adam 24 M ADMN 108138.43