Data transformations for continuous variables are straightforward: enter the arithmetic expression for the transformation. For each variable, identify the data frame that contains it, if there is one. For example, the following creates a new variable xsq, the square of the values of the variable x in the d data frame.
d$xsq <- d$x^2
Or, use the base R function transform() to accomplish the same. Functions from other packages also accomplish the same result.
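As a minimal sketch of the transform() alternative, assuming a small hypothetical data frame d with one numeric variable x:

```r
# Hypothetical data frame with a single numeric variable x
d <- data.frame(x = c(1, 2, 3))

# transform() evaluates the expression within d and returns a new data frame,
# so assign the result back to d to keep the new variable
d <- transform(d, xsq = x^2)
d
```

Within transform(), the variable x is referenced directly, without the d$ prefix.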
For variables that define discrete categories, however, the transformation may not be so straightforward with base R functions, such as a nested string of ifelse() functions. An alternative is the lessR function recode().
To use recode(), specify the variable to be recoded with the old_vars parameter, the first parameter in the function call. Specify the values to be recoded with the required old parameter. Specify the corresponding recoded values with the required new parameter. There must be a 1-to-1 correspondence between the two sets of values, such as 0:5 recoded to 5:0, with six items in the old set and six items in the new set.
To illustrate, construct the following small data frame.
d <- read.table(text="Severity Description
1 Mild
4 Moderate
3 Moderate
2 Mild
1 Severe", header=TRUE, stringsAsFactors=FALSE)
d
## Severity Description
## 1 1 Mild
## 2 4 Moderate
## 3 3 Moderate
## 4 2 Mild
## 5 1 Severe
Now change the integer values of the variable Severity from 1 through 4 to 10 through 40. Because the parameter old_vars is the first parameter in the definition of recode(), and because it is listed first, the parameter name need not be specified. The default data frame is d; otherwise, specify the data frame with the data parameter.
## Severity Description
## 1 10 Mild
## 2 40 Moderate
## 3 30 Moderate
## 4 20 Mild
## 5 10 Severe
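The recode() call that produced this listing is not shown here. From the parameter description above it is presumably of the form recode(Severity, old=1:4, new=c(10,20,30,40)), but treat that as an assumption. The same 1-to-1 recoding can be sketched in base R with match(), which looks up the position of each value in the old set and returns the value at that position in the new set.

```r
d <- read.table(text="Severity Description
1 Mild
4 Moderate
3 Moderate
2 Mild
1 Severe", header=TRUE, stringsAsFactors=FALSE)

old <- 1:4
new <- c(10, 20, 30, 40)

# Overwrite Severity: each old value is replaced by its counterpart in new.
# In this sketch, a value not listed in old would become NA.
d$Severity <- new[match(d$Severity, old)]
d
```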
In the previous example, the values of the variable were overwritten with the new values. In the following example, instead write the recoded values to a new variable with the new_vars parameter, here SevereNew.
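A base R sketch of the same idea, assuming the small Severity data frame from above, assigns the recoded values to a new column so that the original variable is left intact. This is not the lessR call, only an equivalent illustration.

```r
d <- data.frame(
  Severity = c(1, 4, 3, 2, 1),
  Description = c("Mild", "Moderate", "Moderate", "Mild", "Severe")
)

old <- 1:4
new <- c(10, 20, 30, 40)

# Write the recoded values to the new variable SevereNew; Severity is unchanged
d$SevereNew <- new[match(d$Severity, old)]
d
```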
A convenient application of recode() is to Likert data, in which responses to survey items are scored, such as from 0 for Strongly Disagree to 5 for Strongly Agree. To encourage respondents to read the items carefully, some items are written in the opposite direction, so that disagreement indicates agreement with the overall attitude being assessed.
As an example, reverse score Items m01, m02, m03, and m10 from survey responses to the 20-item Mach IV scale. That is, score a 0 as a 5 and so forth. The responses are included as part of lessR and so can be directly read.
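Because the scores run from 0 to 5, reverse scoring is the arithmetic 5 minus the score. A minimal base R sketch follows, using a small hypothetical data frame with made-up responses rather than the actual Mach IV data; m04 is included as an item that is not reversed.

```r
# Hypothetical responses, scored 0 to 5, for three respondents
d <- data.frame(
  m01 = c(0, 5, 2),
  m02 = c(1, 4, 3),
  m03 = c(5, 5, 0),
  m04 = c(2, 2, 2),
  m10 = c(3, 0, 4)
)

# Items written in the opposite direction
reversed <- c("m01", "m02", "m03", "m10")

# Reverse score: 0 becomes 5, 1 becomes 4, and so forth
d[reversed] <- 5 - d[reversed]
d
```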
The function also addresses missing data. Existing data values can be converted to an R missing value. In this example, all values of 1 for the variable Plan are considered missing.
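Values of 1 for Plan become missing, with the value NA for not available, in the data frame that recode() returns. The listing below, from the base R function head(), shows the first six rows of the data frame d, in which Plan still displays its original values of 1, so the recoded result must be assigned back to d for the change to appear there.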
## Years Gender Dept Salary JobSat Plan Pre Post
## Ritchie, Darnell 7 M ADMN 53788.26 med 1 82 92
## Wu, James NA M SALE 94494.58 low 1 62 74
## Downs, Deborah 7 W FINC 57139.90 high 2 90 86
## Hoang, Binh 15 M SALE 111074.86 low 3 96 97
## Jones, Alissa 5 W <NA> 53772.58 <NA> 1 65 62
## Afshari, Anbar 6 W ADMN 69441.93 high 2 100 100
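For comparison, converting a data value to missing can be sketched in base R as follows. The Plan values are the six shown in the listing above, placed in a hypothetical one-variable data frame.

```r
# Plan values from the first six rows of the listing
d <- data.frame(Plan = c(1, 1, 2, 3, 1, 2))

# %in% returns FALSE, not NA, for values already missing,
# so the logical index is safe to use in the assignment
d$Plan[d$Plan %in% 1] <- NA
d
```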
The procedure can also be reversed, so that values that are missing according to the R code NA are converted to non-missing values. To illustrate with the Employee data set, examine the first six rows of data. The value of Years is missing in the second row of data.
## Years Gender Dept Salary JobSat Plan Pre Post
## Ritchie, Darnell 7 M ADMN 53788.26 med 1 82 92
## Wu, James NA M SALE 94494.58 low 1 62 74
## Downs, Deborah 7 W FINC 57139.90 high 2 90 86
## Hoang, Binh 15 M SALE 111074.86 low 3 96 97
## Jones, Alissa 5 W <NA> 53772.58 <NA> 1 65 62
## Afshari, Anbar 6 W ADMN 69441.93 high 2 100 100
Here convert all missing data values for the variables Years and Salary to the value of 99.
## Years Gender Dept Salary JobSat Plan Pre Post
## Ritchie, Darnell 7 M ADMN 53788.26 med 1 82 92
## Wu, James 99 M SALE 94494.58 low 1 62 74
## Downs, Deborah 7 W FINC 57139.90 high 2 90 86
## Hoang, Binh 15 M SALE 111074.86 low 3 96 97
## Jones, Alissa 5 W <NA> 53772.58 <NA> 1 65 62
## Afshari, Anbar 6 W ADMN 69441.93 high 2 100 100
Now the value of Years in the second row of data is 99.
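The same conversion of missing values can be sketched in base R with is.na(). The values of Years and Salary are taken from the six rows listed above; the loop is an illustration, not the lessR call.

```r
d <- data.frame(
  Years = c(7, NA, 7, 15, 5, 6),
  Salary = c(53788.26, 94494.58, 57139.90, 111074.86, 53772.58, 69441.93)
)

# Replace every NA in the named variables with 99
for (v in c("Years", "Salary")) {
  d[[v]][is.na(d[[v]])] <- 99
}
d
```

Salary has no missing values in these six rows, so only Years changes.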
Sorting arranges the rows of a data frame according to the values of one or more variables contained in the data frame, or according to the row names. Variable types include numeric and factor variables. Factors are sorted by the ordering of their levels, which, by default, is alphabetical.
To illustrate, use the lessR Employee data set, here just the first 12 rows of data to save space.
## Years Gender Dept Salary JobSat Plan Pre Post
## Ritchie, Darnell 7 M ADMN 53788.26 med 1 82 92
## Wu, James NA M SALE 94494.58 low 1 62 74
## Hoang, Binh 15 M SALE 111074.86 low 3 96 97
## Knox, Michael 18 M MKTG 99062.66 med 3 81 84
## Campagna, Justin 8 M SALE 72321.36 low 1 76 84
## Pham, Scott 13 M SALE 81871.05 high 2 90 94
## Downs, Deborah 7 W FINC 57139.90 high 2 90 86
## Jones, Alissa 5 W <NA> 53772.58 <NA> 1 65 62
## Afshari, Anbar 6 W ADMN 69441.93 high 2 100 100
## Kimball, Claire 8 W MKTG 61356.69 high 2 93 92
## Cooper, Lindsay 4 W MKTG 56772.95 high 1 78 91
## Saechao, Suzanne 8 W SALE 55545.25 med 1 98 100
## Years Gender Dept Salary JobSat Plan Pre Post
## Hoang, Binh 15 M SALE 111074.86 low 3 96 97
## Knox, Michael 18 M MKTG 99062.66 med 3 81 84
## Wu, James NA M SALE 94494.58 low 1 62 74
## Pham, Scott 13 M SALE 81871.05 high 2 90 94
## Campagna, Justin 8 M SALE 72321.36 low 1 76 84
## Ritchie, Darnell 7 M ADMN 53788.26 med 1 82 92
## Afshari, Anbar 6 W ADMN 69441.93 high 2 100 100
## Kimball, Claire 8 W MKTG 61356.69 high 2 93 92
## Downs, Deborah 7 W FINC 57139.90 high 2 90 86
## Cooper, Lindsay 4 W MKTG 56772.95 high 1 78 91
## Saechao, Suzanne 8 W SALE 55545.25 med 1 98 100
## Jones, Alissa 5 W <NA> 53772.58 <NA> 1 65 62
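The second listing above is ordered by Gender ascending and, within each Gender, by Salary descending. A base R sketch of that ordering uses order(), here applied to four of the listed rows. The lessR call that produced the listing is not shown, so this is an equivalent illustration only.

```r
d <- data.frame(
  Gender = c("M", "M", "W", "W"),
  Salary = c(53788.26, 94494.58, 57139.90, 69441.93),
  row.names = c("Ritchie, Darnell", "Wu, James",
                "Downs, Deborah", "Afshari, Anbar")
)

# First key ascending; negate the numeric second key to sort it descending
sorted <- d[order(d$Gender, -d$Salary), ]
sorted
```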
Sort by row names in ascending order.
## row.names --> ascending
## Years Gender Dept Salary JobSat Plan Pre Post
## Afshari, Anbar 6 W ADMN 69441.93 high 2 100 100
## Campagna, Justin 8 M SALE 72321.36 low 1 76 84
## Cooper, Lindsay 4 W MKTG 56772.95 high 1 78 91
## Downs, Deborah 7 W FINC 57139.90 high 2 90 86
## Hoang, Binh 15 M SALE 111074.86 low 3 96 97
## Jones, Alissa 5 W <NA> 53772.58 <NA> 1 65 62
## Kimball, Claire 8 W MKTG 61356.69 high 2 93 92
## Knox, Michael 18 M MKTG 99062.66 med 3 81 84
## Pham, Scott 13 M SALE 81871.05 high 2 90 94
## Ritchie, Darnell 7 M ADMN 53788.26 med 1 82 92
## Saechao, Suzanne 8 W SALE 55545.25 med 1 98 100
## Wu, James NA M SALE 94494.58 low 1 62 74
Randomize the order of the data values.
## random
## Years Gender Dept Salary JobSat Plan Pre Post
## Jones, Alissa 5 W <NA> 53772.58 <NA> 1 65 62
## Afshari, Anbar 6 W ADMN 69441.93 high 2 100 100
## Knox, Michael 18 M MKTG 99062.66 med 3 81 84
## Saechao, Suzanne 8 W SALE 55545.25 med 1 98 100
## Kimball, Claire 8 W MKTG 61356.69 high 2 93 92
## Hoang, Binh 15 M SALE 111074.86 low 3 96 97
## Pham, Scott 13 M SALE 81871.05 high 2 90 94
## Cooper, Lindsay 4 W MKTG 56772.95 high 1 78 91
## Campagna, Justin 8 M SALE 72321.36 low 1 76 84
## Wu, James NA M SALE 94494.58 low 1 62 74
## Downs, Deborah 7 W FINC 57139.90 high 2 90 86
## Ritchie, Darnell 7 M ADMN 53788.26 med 1 82 92
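Sorting by row names and randomizing the row order can both be sketched in base R, again with four of the listed rows. The seed value is an arbitrary assumption, included only to make the shuffle reproducible.

```r
d <- data.frame(
  Years = c(7, NA, 7, 6),
  Salary = c(53788.26, 94494.58, 57139.90, 69441.93),
  row.names = c("Ritchie, Darnell", "Wu, James",
                "Downs, Deborah", "Afshari, Anbar")
)

# Ascending order of the row names
by_name <- d[order(rownames(d)), ]

# Random permutation of the rows
set.seed(123)
shuffled <- d[sample(nrow(d)), ]
```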
## [1] -0.926 -0.139 1.347 -0.837 -0.545 1.950 0.484 -0.775 0.005 1.118 -0.757 -0.925
To rename a variable, list the name of the data frame, the existing variable name, and the new name, in that order.
## [1] "Years" "Gender" "Dept" "Salary" "JobSat" "Plan" "Pre" "Post"
## Change the following variable names for data frame d :
##
## Salary --> AnnualSalary
## [1] "Years" "Gender" "Dept" "AnnualSalary" "JobSat" "Plan" "Pre"
## [8] "Post"
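The same change of name, Salary to AnnualSalary, can be sketched in base R by assigning to names(). The two-row data frame is hypothetical and only stands in for the Employee data.

```r
d <- data.frame(Years = c(7, 15), Salary = c(53788.26, 111074.86))

# Replace the name only where it currently equals "Salary"
names(d)[names(d) == "Salary"] <- "AnnualSalary"
names(d)
```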
LikertCats <- c("Strongly Disagree", "Disagree", "Slightly Disagree",
"Slightly Agree", "Agree", "Strongly Agree")
Convert only the three variables in the given vector to factors. Leave the original variables unmodified and create new variables instead.
Now copy the variable labels from the original integer variables to the newly created factor variables.
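The factor conversion itself can be sketched with the base R function factor(). The three variable names, the responses, and the _f suffix for the new variables are all hypothetical choices for this illustration. Copying variable labels is specific to lessR and is not covered here.

```r
LikertCats <- c("Strongly Disagree", "Disagree", "Slightly Disagree",
                "Slightly Agree", "Agree", "Strongly Agree")

# Hypothetical integer responses scored 0 to 5
d <- data.frame(m06 = c(0, 5, 3), m07 = c(1, 1, 4), m20 = c(2, 5, 0))

# Create a new factor for each variable, leaving the integer original in place
for (v in c("m06", "m07", "m20")) {
  d[[paste0(v, "_f")]] <- factor(d[[v]], levels = 0:5, labels = LikertCats)
}
str(d)
```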
A wide-form data table has multiple measurements from the same unit of analysis (e.g., person) across the row of data, usually repeated over time. The conversion to long form creates three new columns from the wide-form input: the grouping variable, the response values, and the ID field.
Read the data.
## Person sup1 sup2 sup3 sup4
## 1 p1 2 4 4 3
## 2 p2 2 5 4 6
## 3 p3 8 6 7 9
## 4 p4 4 3 5 7
## 5 p5 2 1 2 3
## 6 p6 5 5 6 8
## 7 p7 2 3 2 4
Go with the default variable names in the long-form.
## ID Person Group Response
## 1 ID1 p1 sup1 2
## 2 ID2 p2 sup1 2
## 3 ID3 p3 sup1 8
## 4 ID4 p4 sup1 4
## 5 ID5 p5 sup1 2
## 6 ID6 p6 sup1 5
## 7 ID7 p7 sup1 2
## 8 ID1 p1 sup2 4
## 9 ID2 p2 sup2 5
## 10 ID3 p3 sup2 6
## 11 ID4 p4 sup2 3
## 12 ID5 p5 sup2 1
## 13 ID6 p6 sup2 5
## 14 ID7 p7 sup2 3
## 15 ID1 p1 sup3 4
## 16 ID2 p2 sup3 4
## 17 ID3 p3 sup3 7
## 18 ID4 p4 sup3 5
## 19 ID5 p5 sup3 2
## 20 ID6 p6 sup3 6
## 21 ID7 p7 sup3 2
## 22 ID1 p1 sup4 3
## 23 ID2 p2 sup4 6
## 24 ID3 p3 sup4 9
## 25 ID4 p4 sup4 7
## 26 ID5 p5 sup4 3
## 27 ID6 p6 sup4 8
## 28 ID7 p7 sup4 4
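A comparable conversion is available from the base R function reshape(). The sketch below rebuilds the wide-form data listed above and reuses the default names Group and Response from the listing. Unlike the listing, it does not generate a separate ID column, because Person already identifies each row.

```r
wide <- data.frame(
  Person = paste0("p", 1:7),
  sup1 = c(2, 2, 8, 4, 2, 5, 2),
  sup2 = c(4, 5, 6, 3, 1, 5, 3),
  sup3 = c(4, 4, 7, 5, 2, 6, 2),
  sup4 = c(3, 6, 9, 7, 3, 8, 4)
)

measures <- c("sup1", "sup2", "sup3", "sup4")

# Stack the four measurement columns into one Response column,
# with Group recording the column each value came from
long <- reshape(wide, direction = "long",
                varying = measures, v.names = "Response",
                timevar = "Group", times = measures,
                idvar = "Person")
rownames(long) <- NULL  # drop the generated row names such as p1.sup1
head(long)
```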
Specify custom variable names in the long form. Take advantage of the usual organization in which the columns to be transformed are all sequential in the data frame, and use the ordering sup1:sup4 to identify the variables. Only the first two parameter values are required: the data frame that contains the variables and the variables to be transformed.
## Person Supplement Reps
## 1 Pp1 sup1 2
## 2 Pp2 sup1 2
## 3 Pp3 sup1 8
## 4 Pp4 sup1 4
## 5 Pp5 sup1 2
## 6 Pp6 sup1 5
## 7 Pp7 sup1 2
## 8 Pp1 sup2 4
## 9 Pp2 sup2 5
## 10 Pp3 sup2 6
## 11 Pp4 sup2 3
## 12 Pp5 sup2 1
## 13 Pp6 sup2 5
## 14 Pp7 sup2 3
## 15 Pp1 sup3 4
## 16 Pp2 sup3 4
## 17 Pp3 sup3 7
## 18 Pp4 sup3 5
## 19 Pp5 sup3 2
## 20 Pp6 sup3 6
## 21 Pp7 sup3 2
## 22 Pp1 sup4 3
## 23 Pp2 sup4 6
## 24 Pp3 sup4 9
## 25 Pp4 sup4 7
## 26 Pp5 sup4 3
## 27 Pp6 sup4 8
## 28 Pp7 sup4 4
A long-form data frame can also be reshaped to wide form.
Here, begin with a wide-form data frame and convert it to long form.
## Person sup1 sup2 sup3 sup4
## 1 p1 2 4 4 3
## 2 p2 2 5 4 6
## 3 p3 8 6 7 9
## 4 p4 4 3 5 7
## 5 p5 2 1 2 3
## 6 p6 5 5 6 8
## 7 p7 2 3 2 4
Convert back to wide form.
## Person sup1 sup2 sup3 sup4
## 1 p1 2 4 4 3
## 2 p2 2 5 4 6
## 3 p3 8 6 7 9
## 4 p4 4 3 5 7
## 5 p5 2 1 2 3
## 6 p6 5 5 6 8
## 7 p7 2 3 2 4
Here convert with the name of the response prefixed to the column names.
## Person Response.sup1 Response.sup2 Response.sup3 Response.sup4
## 1 p1 2 4 4 3
## 2 p2 2 5 4 6
## 3 p3 8 6 7 9
## 4 p4 4 3 5 7
## 5 p5 2 1 2 3
## 6 p6 5 5 6 8
## 7 p7 2 3 2 4
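The base R function reshape() also converts long form to wide form, and by default it prefixes the response name to the new column names, as in the last listing. A minimal sketch, using the sup1 and sup2 values of the first three persons:

```r
long <- data.frame(
  Person = rep(c("p1", "p2", "p3"), times = 2),
  Group = rep(c("sup1", "sup2"), each = 3),
  Response = c(2, 2, 8, 4, 5, 6)
)

# One row per Person, one column per value of Group
wide <- reshape(long, direction = "wide",
                idvar = "Person", timevar = "Group", v.names = "Response")
names(wide)

# Optionally strip the prefix to recover the original column names
plain <- wide
names(plain) <- sub("^Response\\.", "", names(plain))
names(plain)
```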
Get the data, the Employee data set.
Create four component data frames: out$train_x, out$train_y, out$test_x, and out$test_y. Specify the response variable as Salary.
## [1] "train_x" "train_y" "test_x" "test_y"
Create two component data frames: out$train and out$test. All the variables in the original data frame are included in the component data frames.
## [1] "train" "test"
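The four-component split can be sketched in base R with sample(). The 75% training proportion and the seed are assumptions made for this illustration, and the eight rows are taken from the Employee listings earlier in this section.

```r
d <- data.frame(
  Years = c(7, 15, 18, 8, 13, 7, 6, 8),
  Pre = c(82, 96, 81, 76, 90, 90, 100, 93),
  Salary = c(53788.26, 111074.86, 99062.66, 72321.36,
             81871.05, 57139.90, 69441.93, 61356.69)
)

set.seed(1)
n <- nrow(d)
train_rows <- sample(n, size = round(0.75 * n))  # assumed 75% for training

is_y <- names(d) == "Salary"  # Salary is the response variable

# drop = FALSE keeps each component a data frame, even with one column
out <- list(
  train_x = d[train_rows, !is_y, drop = FALSE],
  train_y = d[train_rows, is_y, drop = FALSE],
  test_x = d[-train_rows, !is_y, drop = FALSE],
  test_y = d[-train_rows, is_y, drop = FALSE]
)
names(out)
```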