1. Data: Subset a Data Frame

David Gerbing

Subset a Data Frame with Base R Extract[]

The most general way to subset a data frame by rows and/or columns is the base R Extract[] function, indicated by matched square brackets instead of the usual matched parentheses. For a data frame named d the general format is d[rows, columms].

For the rows parameter, pass either

To specify a logical expression for the rows parameter, use the standard R operators.

operator meaning
& and
| or
! not
== is equal to
!= is not equal to
%in% is in a vector

For example, to obtain a subset of the data frame that consists of only those who report a value of the Gender variable as Female, specify a logical condition such as:

Gender=="Female"

For the cols parameter, pass either

If subsetting is done by only rows or only columns, then leave the other value blank. For example, to subset the d data frame only by rows, the general form reduces to d[rows,]. Similarly, to subset only by columns, d[,cols].

Annoying Features of Base R Extract

When Extract[] evaluates the row or column specifications to obtain the indices, there are several annoying properties.

  1. rows: Any reference to the variables in the data frame for this specification must contain the name of the data frame followed by a $. But this name has already been specified in the call to Extract[] by listing the data frame name in front of the square brackets, so now is redundant, repeated for every variable reference.
  2. rows: When specifying a value of a variable for a row selection, any missing values for the variable are also provided even though the missing data values are not the requested value of the variable.
  3. cols: Usually specified with a vector of variable names but all variable names in in the provided vector of names must be quoted.
  4. cols: No variable ranges specified with a colon : such as m01:m10 to specify 10 variables: m01, m02, …, up to m10.
  5. No character strings that store the values passed to rows and cols. Instead directly enter the conditions for both rows and columns, which can make the entire expression quite large.

More Flexible Use of Extract[]

To address the first two deficiencies, one possibility is the base R subset() function. To address these deficiencies and still use Extract[] directly, lessR provides the function .() for obtaining the indices of selected rows and of selected columns. This function is only callable within the base R Extract[] function, with what R refers to as non-standard evaluation. That basically means that the annoying restrictions are removed, though in some advanced programming uses the .() may not apply.

The general form of the subsetting with the .() function follows.

d[.(rows), .(columns)]

That is, call the same Extract[] function with reference to rows and cols, but now wrap the row and column expressions with the lessR function call .().

To illustrate, use the Employee data set contained in lessR, here read into the d data frame.

d <- Read("Employee")

Subset the data frame by only listing observations with a Gender of “M” with scores on Post larger than 90. Only list columns for the variables in the range from Years to Salary, and Post. Referring back to the output of Read(), the variable range includes Years, Gender, Dept, and Salary.

d[.(Gender=="M" & Post>90), .(Years:Salary, Post)]
##                   Years Gender Dept    Salary Post
## Ritchie, Darnell      7      M ADMN  63788.26   92
## Hoang, Binh          15      M SALE 121074.86   97
## Pham, Scott          13      M SALE  91871.05   94
## Correll, Trevon      21      M SALE 144419.23   94
## Langston, Matthew     5      M SALE  59188.96   93
## Anderson, David       9      M ACCT  79547.60   91

Following is the traditional R call to Extract[] to obtain the same subsetting.

d[d$Gender=="M" & d$Post>90, c("Years", "Gender", "Dept", "Salary", "Post")]
##                   Years Gender Dept    Salary Post
## Ritchie, Darnell      7      M ADMN  63788.26   92
## Hoang, Binh          15      M SALE 121074.86   97
## Pham, Scott          13      M SALE  91871.05   94
## Correll, Trevon      21      M SALE 144419.23   94
## Langston, Matthew     5      M SALE  59188.96   93
## Anderson, David       9      M ACCT  79547.60   91

A row selection is a logical condition. To negate a row selection, add a ! to the beginning of the condition passed to .(), within the call to .(). To exclude the specified variables, place a -, in front of the call to .().

d[.(!(Gender=="M" & Post>90)), -.(Dept:Plan, Pre)]
##                     Years Gender Post
## Wu, James              NA      M   74
## Downs, Deborah          7      W   86
## Jones, Alissa           5      W   62
## Afshari, Anbar          6      W  100
## Knox, Michael          18      M   84
## Campagna, Justin        8      M   84
## Kimball, Claire         8      W   92
## Cooper, Lindsay         4      W   91
## Saechao, Suzanne        8      W  100
## Tian, Fang              9      W   61
## Bellingar, Samantha    10      W   72
## Sheppard, Cory         14      M   73
## Kralik, Laura          10      W   71
## Skrotzki, Sara         18      W   61
## James, Leslie          18      W   70
## Osterman, Pascal        5      M   70
## Adib, Hassan           14      M   69
## Gvakharia, Kimberly     3      W   79
## Stanley, Grayson        9      M   73
## Link, Thomas           10      M   83
## Portlock, Ryan         13      M   73
## Stanley, Emma           3      W   84
## Singh, Niral            2      W   59
## Fulton, Scott          13      M   73
## Korhalkar, Jessica      2      W   87
## LaRoe, Maria           10      W   86
## Billing, Susan          4      W   90
## Capelle, Adam          24      M   81
## Hamide, Bita            1      W   90
## Anastasiou, Crystal     2      W   71
## Cassinelli, Anastis    10      M   87

Can still provide the indices directly for one or both of the expressions as the base R Extract[] function is unmodified with the use of .(). The purpose of .() is simply to return the row or column row indices to identify specific rows or columns of the specified data frame. You can either specify the indices directory for the rows or columns, or let .() identify them for you.

d[1:3, .(Years:Salary, Post)]
##                  Years Gender Dept    Salary Post
## Ritchie, Darnell     7      M ADMN  63788.26   92
## Wu, James           NA      M SALE 104494.58   74
## Downs, Deborah       7      W FINC  67139.90   86
d[.(Gender=="M" & Post>90), 1:3]
##                   Years Gender Dept
## Ritchie, Darnell      7      M ADMN
## Hoang, Binh          15      M SALE
## Pham, Scott          13      M SALE
## Correll, Trevon      21      M SALE
## Langston, Matthew     5      M SALE
## Anderson, David       9      M ACCT

To enhance readability, store the specified row or column conditions as character strings. Each string must be named either rows or cols. Because the entire expression for rows or cols is a character string, differentiate between single and double quotes as needed. For example, use single quotes within the string and double quotes to define the entire string, illustrated here.

rows <- "Gender=='M' & Post>93"
cols <- "Gender:Salary, Post"

To subset, pass the respective character strings, rows and cols, to .(), respectively.

d[.(rows), .(cols)]
##                 Gender Dept    Salary Post
## Hoang, Binh          M SALE 121074.86   97
## Pham, Scott          M SALE  91871.05   94
## Correll, Trevon      M SALE 144419.23   94

To negate, as with the literal expressions, use ! for the logical expression that defines the rows and - for the columns. Notice their placement, where the ! is inside the call to .(), and the - is outside the call.

d[.(!rows), -.(cols)]
##                     Years JobSat Plan Pre
## Ritchie, Darnell        7    med    1  82
## Wu, James              NA    low    1  62
## Downs, Deborah          7   high    2  90
## Jones, Alissa           5   <NA>    1  65
## Afshari, Anbar          6   high    2 100
## Knox, Michael          18    med    3  81
## Campagna, Justin        8    low    1  76
## Kimball, Claire         8   high    2  93
## Cooper, Lindsay         4   high    1  78
## Saechao, Suzanne        8    med    1  98
## Tian, Fang              9    med    2  60
## Bellingar, Samantha    10    med    1  67
## Sheppard, Cory         14    low    3  66
## Kralik, Laura          10    med    2  74
## Skrotzki, Sara         18    med    2  63
## James, Leslie          18    low    3  70
## Osterman, Pascal        5   high    3  69
## Adib, Hassan           14    med    2  71
## Gvakharia, Kimberly     3    med    2  83
## Stanley, Grayson        9    low    1  74
## Link, Thomas           10    low    1  83
## Portlock, Ryan         13    low    1  72
## Langston, Matthew       5    low    3  94
## Stanley, Emma           3   high    2  86
## Singh, Niral            2   high    2  59
## Anderson, David         9    low    1  94
## Fulton, Scott          13    low    1  72
## Korhalkar, Jessica      2   <NA>    2  74
## LaRoe, Maria           10   high    2  80
## Billing, Susan          4    med    2  91
## Capelle, Adam          24    med    2  83
## Hamide, Bita            1   high    2  83
## Anastasiou, Crystal     2    low    2  59
## Cassinelli, Anastis    10   high    1  80

Missing Data

The variable Dept is missing for the fourth row of data.

d[1:5,]
##                  Years Gender Dept    Salary JobSat Plan Pre Post
## Ritchie, Darnell     7      M ADMN  63788.26    med    1  82   92
## Wu, James           NA      M SALE 104494.58    low    1  62   74
## Downs, Deborah       7      W FINC  67139.90   high    2  90   86
## Hoang, Binh         15      M SALE 121074.86    low    3  96   97
## Jones, Alissa        5      W <NA>  63772.58   <NA>    1  65   62

Here with the traditional use of Extract[], specify rows of data only when the value of Dept is ADMN.

d[d$Dept=="ADMN", c('Gender', 'Dept', 'Salary')]
##                  Gender Dept    Salary
## Ritchie, Darnell      M ADMN  63788.26
## NA                 <NA> <NA>        NA
## Afshari, Anbar        W ADMN  79441.93
## James, Leslie         W ADMN 132563.38
## Singh, Niral          W ADMN  71055.44
## Billing, Susan        W ADMN  82675.26
## Capelle, Adam         M ADMN 118138.43

The result provides what is requested, and also when Dept is <NA>, which is not requested. The requested value of ADMN is not the same as <NA>.

Use the .() function to obtain what is requested, rows of data in which the value of Dept is ADMN.

d[.(Dept=="ADMN"), .(Gender:Salary)]
##                  Gender Dept    Salary
## Ritchie, Darnell      M ADMN  63788.26
## Afshari, Anbar        W ADMN  79441.93
## James, Leslie         W ADMN 132563.38
## Singh, Niral          W ADMN  71055.44
## Billing, Susan        W ADMN  82675.26
## Capelle, Adam         M ADMN 118138.43

Random Selection of Rows

The function .() also provides for random selection of rows. To randomly select the specified number of rows from the data frame to subset, specify the random() function for the logical criterion of the rows. The value passed to random() can either be the actual number of rows to select, or the proportion of rows to select.

Here randomly select five rows of data from the d data frame.

d[.(random(5)), .(Years:Salary)]
##                 Years Gender Dept    Salary
## Jones, Alissa       5      W <NA>  63772.58
## Skrotzki, Sara     18      W MKTG 101352.33
## Correll, Trevon    21      M SALE 144419.23
## Capelle, Adam      24      M ADMN 118138.43
## Portlock, Ryan     13      M SALE  87714.85

Here specify a proportion of rows to select.

d[.(random(0.1)), .(Years:Salary)]
##                 Years Gender Dept   Salary
## Billing, Susan      4      W ADMN 82675.26
## Stanley, Emma       3      W ACCT 56124.97
## Kimball, Claire     8      W MKTG 71356.69
## Downs, Deborah      7      W FINC 67139.90