Subset a Data Frame

Subset a Data Frame with Base R `Extract`

The most general way to subset a data frame by rows and/or columns is the base R Extract function, called by d[rows, columms], where d is the data frame. Ultimately, to use this function, for the rows parameter, pass the row names of the selected rows, the indices or actual names. For the cols parameter, pass the column indices of the selected columns.

Annoying Features of Base R `Extract`

In many uses, perhaps most uses, the indices are not directly provided, but obtained indirectly through some logical specification. Here Extract has several annoying features.

rows: Any reference to the variables in the data frame for this specification must contain the name of the data frame followed by a $. But this name has already been specified in the function call, and now is redundant, simply repeated for every variable reference.
cols: Usually specified with a vector of variable names.

All variable names in in the provided vector of names must be quoted, and no variable ranges are permitted.
No variable ranges.

No character strings for values passed to rows and cols. Instead enter the literal conditions, which can make the entire expression quite large.

More Flexible lessR Enhancement

To address these deficiencies, lessR provides the function .() for obtaining the indices of selected rows and of selected columns. This function is only callable within the base R Extract function, what R refers to as non-standard evaluation. That basically means that the annoying restrictions are removed.

The general form of the subsetting follows.

d[.(rows), .(columns)]

That is, call the Extract function as before, but now wrap the row and column expressions with .().

To illustrate, use the Employee data set contained in lessR, here read into the d data frame.

d <- Read("Employee", quiet=TRUE)

Subset the data frame by only listing observations with a Gender of “M” with scores on Post larger than 90. Only list columns for the variables in the range from Years to Salary, and Post. Referring back to the output of Read(), the variable range includes Years, Gender, Dept, and Salary.

d[.(Gender=="M" & Post>90), .(Years:Salary, Post)]

##                   Years Gender Dept    Salary Post
## Ritchie, Darnell      7      M ADMN  53788.26   92
## Hoang, Binh          15      M SALE 111074.86   97
## Pham, Scott          13      M SALE  81871.05   94
## Correll, Trevon      21      M SALE 134419.23   94
## Langston, Matthew     5      M SALE  49188.96   93
## Anderson, David       9      M ACCT  69547.60   91

Following is the traditional R call for subsetting.

d[d$Gender=="M" & d$Post>90, c("Years", "Gender", "Dept", "Salary", "Post")]

##                   Years Gender Dept    Salary Post
## Ritchie, Darnell      7      M ADMN  53788.26   92
## Hoang, Binh          15      M SALE 111074.86   97
## Pham, Scott          13      M SALE  81871.05   94
## Correll, Trevon      21      M SALE 134419.23   94
## Langston, Matthew     5      M SALE  49188.96   93
## Anderson, David       9      M ACCT  69547.60   91

To negate a row selection, add a ! to the beginning of the logical condition passed to .(), within the call to .(). To exclude the specified variables, place a -, in front of the call to .().

d[.(!(Gender=="M" & Post>90)), -.(Dept:Plan, Pre)]

##                     Years Gender Post
## Wu, James              NA      M   74
## Jones, Alissa           5      F   62
## Downs, Deborah          7      F   86
## Afshari, Anbar          6      F  100
## Knox, Michael          18      M   84
## Campagna, Justin        8      M   84
## Kimball, Claire         8      F   92
## Cooper, Lindsay         4      F   91
## Saechao, Suzanne        8      F  100
## Tian, Fang              9      F   61
## Bellingar, Samantha    10      F   72
## Sheppard, Cory         14      M   73
## Kralik, Laura          10      F   71
## Skrotzki, Sara         18      F   61
## James, Leslie          18      F   70
## Osterman, Pascal        5      M   70
## Adib, Hassan           14      M   69
## Gvakharia, Kimberly     3      F   79
## Stanley, Grayson        9      M   73
## Link, Thomas           10      M   83
## Portlock, Ryan         13      M   73
## Stanley, Emma           3      F   84
## Singh, Niral            2      F   59
## Fulton, Scott          13      M   73
## Korhalkar, Jessica      2      F   87
## LaRoe, Maria           10      F   86
## Billing, Susan          4      F   90
## Capelle, Adam          24      M   81
## Hamide, Bita            1      F   90
## Anastasiou, Crystal     2      F   71
## Cassinelli, Anastis    10      M   87

Can still provide the indices directly for one or both of the expressions as the base R Extract function is unmodified.

d[1:3, .(Years:Salary, Post)]

##                  Years Gender Dept    Salary Post
## Ritchie, Darnell     7      M ADMN  53788.26   92
## Wu, James           NA      M SALE  94494.58   74
## Hoang, Binh         15      M SALE 111074.86   97

d[.(Gender=="M" & Post>90), 1:3]

##                   Years Gender Dept
## Ritchie, Darnell      7      M ADMN
## Hoang, Binh          15      M SALE
## Pham, Scott          13      M SALE
## Correll, Trevon      21      M SALE
## Langston, Matthew     5      M SALE
## Anderson, David       9      M ACCT

To enhance readability, store the specified row or column conditions as character strings. Each string must be named rows or cols. Because the entire expression is a character string, differentiate between single and double quotes as needed. For example, use single quotes within the string and double quotes to define the entire string, as illustrated next.

rows <- "Gender=='M' & Post>93"
cols <- "Gender:Salary, Post"

To subset, pass the respective character strings, rows and cols, to .(), respectively.

d[.(rows), .(cols)]

##                 Gender Dept    Salary Post
## Hoang, Binh          M SALE 111074.86   97
## Pham, Scott          M SALE  81871.05   94
## Correll, Trevon      M SALE 134419.23   94

To negate, as with the literal expressions, use ! for the rows and - for the columns. Notice their placement.

d[.(!rows), -.(cols)]

##                     Years JobSat Plan Pre
## Ritchie, Darnell        7    med    1  82
## Wu, James              NA    low    1  62
## Jones, Alissa           5   <NA>    1  65
## Downs, Deborah          7   high    2  90
## Afshari, Anbar          6   high    2 100
## Knox, Michael          18    med    3  81
## Campagna, Justin        8    low    1  76
## Kimball, Claire         8   high    2  93
## Cooper, Lindsay         4   high    1  78
## Saechao, Suzanne        8    med    1  98
## Tian, Fang              9    med    2  60
## Bellingar, Samantha    10    med    1  67
## Sheppard, Cory         14    low    3  66
## Kralik, Laura          10    med    2  74
## Skrotzki, Sara         18    med    2  63
## James, Leslie          18    low    3  70
## Osterman, Pascal        5   high    3  69
## Adib, Hassan           14    med    2  71
## Gvakharia, Kimberly     3    med    2  83
## Stanley, Grayson        9    low    1  74
## Link, Thomas           10    low    1  83
## Portlock, Ryan         13    low    1  72
## Langston, Matthew       5    low    3  94
## Stanley, Emma           3   high    2  86
## Singh, Niral            2   high    2  59
## Anderson, David         9    low    1  94
## Fulton, Scott          13    low    1  72
## Korhalkar, Jessica      2   <NA>    2  74
## LaRoe, Maria           10   high    2  80
## Billing, Susan          4    med    2  91
## Capelle, Adam          24    med    2  83
## Hamide, Bita            1   high    2  83
## Anastasiou, Crystal     2    low    2  59
## Cassinelli, Anastis    10   high    1  80

Can also randomly select the specified number of rows from the data frame to subset. To do so, specify the random() function for the logical criterion of the rows. The value passed to random() can either be the actual number of rows to select, as below, or, the proportion of rows to select.

d[.(random(5)), .(Years:Salary)]

##                   Years Gender Dept    Salary
## James, Leslie        18      F ADMN 122563.38
## Kralik, Laura        10      F SALE  92681.19
## Fulton, Scott        13      M SALE  87785.51
## Langston, Matthew     5      M SALE  49188.96
## Pham, Scott          13      M SALE  81871.05

Here specify a proportion of rows to select.

d[.(random(0.1)), .(Years:Salary)]

##                     Years Gender Dept    Salary
## Gvakharia, Kimberly     3      F SALE  49868.68
## Stanley, Grayson        9      M SALE  69624.87
## Knox, Michael          18      M MKTG  99062.66
## Capelle, Adam          24      M ADMN 108138.43

Subset a Data Frame

David Gerbing