1. Data: Utilities

David Gerbing

library("lessR")
## 
## lessR 4.5.2                          feedback: gerbing@pdx.edu 
## --------------------------------------------------------------
## > d <- Read("")  Read data file, many formats available, e.g., Excel
##   d is the default data frame, data= in analysis routines optional
## 
## Find examples of reading, writing, and manipulating data, graphics,
## testing means and proportions, regression, factor analysis,
## customization, forecasting, and aggregation to pivot tables.
##   Enter: browseVignettes("lessR")
## 
## Although most previous function calls still work, most
## visualization functions are now reorganized to three functions:
##    Chart(): type = "bar", "pie", "radar", "bubble", "dot",
##                    "sunburst", "treemap", "icicle"
##    X(): type="histogram", "density", "vbs", and more
##    XY(): type="scatter" for a scatterplot, or "contour", "smooth"
## There is also Flows() for Sankey flow diagrams.
## 
## View lessR updates, now including modern time series forecasting.
##   Enter: news(package="lessR"), or ?Chart, ?X, or ?XY
## 
## Interactive data analysis for constructing visualizations.
##   Enter: interact()

Recode Data Values

Data transformations for continuous variables are straightforward: enter the arithmetic expression for the transformation. For each variable, identify the data frame that contains it, if there is one. For example, the following creates a new variable xsq, the square of the values of the variable x in the d data frame.

d$xsq <- d$x^2

Alternatively, use the base R transform() function, or functions from other packages, to accomplish the same result.
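As a minimal sketch of the transform() alternative, the following uses a small made-up data frame. The values of x are illustrative only and do not come from any lessR data set.

```r
# Hypothetical data frame, for illustration only
d <- data.frame(x = 1:5)

# transform() evaluates the expression inside d, so no d$ prefix is needed
d <- transform(d, xsq = x^2)
d$xsq
```

The result is the same new column that d$xsq <- d$x^2 would create.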

For variables that define discrete categories, however, the transformation is less straightforward with base R, where it often requires a nested string of ifelse() function calls. An alternative is the lessR function recode().

To use recode(), specify the variable to be recoded with the old_vars parameter, the first parameter in the function call. Specify values to be recoded with the required old parameter. Specify the corresponding recoded values with the required new parameter. There must be a 1-to-1 correspondence between the two sets of values, such as 0:5 recoded to 5:0, six items in the old set and six items in the new set.
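The 1-to-1 correspondence between the old and new values can be pictured as a lookup table. The following is a base R sketch of that idea, not the implementation of recode(). The vector names severity, old, new, and recoded are chosen here for illustration.

```r
# Five integer values to recode, with old and new sets of equal length
severity <- c(1, 4, 3, 2, 1)
old <- 1:4
new <- c(10, 20, 30, 40)
stopifnot(length(old) == length(new))  # enforce the 1-to-1 correspondence

# match() finds the position of each data value in old;
# that position then selects the corresponding value in new
recoded <- new[match(severity, old)]
recoded
```

In this sketch a data value that does not appear in old becomes NA, so the lookup only suits data whose values are all listed in old.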

Examples

To illustrate, construct the following small data frame.

d <- read.table(text="Severity Description
1 Mild
4 Moderate
3 Moderate
2 Mild
1 Severe", header=TRUE, stringsAsFactors=FALSE)

d
##   Severity Description
## 1        1        Mild
## 2        4    Moderate
## 3        3    Moderate
## 4        2        Mild
## 5        1      Severe

Now change the integer values of the variable Severity from 1 through 4 to 10 through 40. Because old_vars is the first parameter in the definition of recode(), and because its value is listed first in the function call, the parameter name need not be specified. The default data frame is d; otherwise, specify the data frame with the data parameter.

d <- recode(Severity, old=1:4, new=c(10,20,30,40))
## 
## --------------------------------------------------------
## First four rows of data to recode for data frame: d 
## --------------------------------------------------------
##   Severity
## 1        1
## 2        4
## 3        3
## 4        2
## 
## 
## Recoding Specification
## ----------------------
##    1 --> 10 
##    2 --> 20 
##    3 --> 30 
##    4 --> 40 
## 
## Number of cases (rows) to recode: 5 
## 
## Replace existing values of each specified variable, no value for option: new.var
## 
## ---  Recode: Severity ---------------------------------
## Number of unique values of Severity in the data: 4 
## Number of values of Severity to recode: 4 
## 
## 
## ------------------------------------------------
## First four rows of recoded data
## ------------------------------------------------
##   Severity
## 1       10
## 2       40
## 3       30
## 4       20
d
##   Severity Description
## 1       10        Mild
## 2       40    Moderate
## 3       30    Moderate
## 4       20        Mild
## 5       10      Severe

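In the previous example, the values of the variable were overwritten with the new values. In the following example, instead write the recoded values to a new variable with the new_vars parameter, here SevereNew. Because the previous call already replaced the values 1 through 4 of Severity with 10 through 40, none of the old values specified here remain in the data. The output notes each of them as not in the data, and SevereNew contains the same values as the current Severity.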

d <- recode(Severity, new_vars="SevereNew", old=1:4, new=c(10,20,30,40))
## 
## --------------------------------------------------------
## First four rows of data to recode for data frame: d 
## --------------------------------------------------------
##   Severity
## 1       10
## 2       40
## 3       30
## 4       20
## 
## 
## Recoding Specification
## ----------------------
##    1 --> 10 
##    2 --> 20 
##    3 --> 30 
##    4 --> 40 
## 
## Number of cases (rows) to recode: 5 
## 
## ---  Recode: Severity ---------------------------------
## Number of unique values of Severity in the data: 4 
## >>> Note: A value specified to recode, 1, is not in the data.
## 
## >>> Note: A value specified to recode, 2, is not in the data.
## 
## >>> Note: A value specified to recode, 3, is not in the data.
## 
## >>> Note: A value specified to recode, 4, is not in the data.
## 
## Number of values of Severity to recode: 4 
## Recode to variable: SevereNew 
## 
## 
## ------------------------------------------------
## First four rows of recoded data
## ------------------------------------------------
##   Severity SevereNew
## 1       10        10
## 2       40        40
## 3       30        30
## 4       20        20
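The notes in the output above report values listed in old that do not occur in the variable. A similar check can be sketched in base R before recoding. The names severity, old, and not_found are illustrative, and the check is an assumption about how one might verify the specification, not part of recode().

```r
# Severity as it stands after the first recode in this document
severity <- c(10, 40, 30, 20, 10)
old <- 1:4

# Which of the values requested for recoding are absent from the data?
not_found <- setdiff(old, unique(severity))
not_found
```

A non-empty result suggests that the old values should be revised, here to c(10, 20, 30, 40), or that the original data should be re-created before recoding.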

A convenient application of recode() is to Likert data, with responses scored to items on a survey such as from 0 for Strongly Disagree to 5 for Strongly Agree. To encourage responders to carefully read the items, some items are written in the opposite direction so that disagreement indicates agreement with the overall attitude being assessed.

As an example, reverse score Items m01, m02, m03, and m10 from survey responses to the 20-item Mach IV scale. That is, score a 0 as a 5 and so forth. The responses are included as part of lessR and so can be directly read.

d <- Read("Mach4")
## 
## >>> Suggestions
## Recommended binary format for data files: feather
##   Create with Write(d, "your_file", format="feather")
## More details about your data, Enter:  details()  for d, or  details(name)
## 
## Data Types
## ------------------------------------------------------------
## integer: Numeric data values, integers only
## ------------------------------------------------------------
## 
##     Variable                  Missing  Unique 
##         Name     Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1    Gender   integer    351       0       2   0  0  1 ... 0  0  1
##  2       m01   integer    351       0       6   0  0  2 ... 2  1  3
##  3       m02   integer    351       0       6   4  1  1 ... 3  4  3
##  4       m03   integer    351       0       6   1  4  0 ... 3  4  3
##  5       m04   integer    351       0       6   5  4  5 ... 3  4  4
##  6       m05   integer    351       0       6   0  0  4 ... 2  3  3
##  7       m06   integer    351       0       6   5  3  4 ... 4  4  2
##  8       m07   integer    351       0       6   4  3  0 ... 4  4  2
##  9       m08   integer    351       0       6   1  0  5 ... 3  2  3
## 10       m09   integer    351       0       6   5  4  3 ... 3  3  3
## 11       m10   integer    351       0       6   4  4  4 ... 3  4  3
## 12       m11   integer    351       0       6   0  0  1 ... 1  1  2
## 13       m12   integer    351       0       6   0  1  4 ... 2  1  3
## 14       m13   integer    351       0       6   0  1  0 ... 3  1  2
## 15       m14   integer    351       0       6   0  1  0 ... 2  2  2
## 16       m15   integer    351       0       6   4  2  2 ... 3  5  3
## 17       m16   integer    351       0       6   0  4  0 ... 0  2  5
## 18       m17   integer    351       0       6   1  4  2 ... 0  2  2
## 19       m18   integer    351       0       6   3  3  4 ... 4  4  3
## 20       m19   integer    351       0       6   2  1  0 ... 0  0  1
## 21       m20   integer    351       0       6   4  0  1 ... 1  0  3
## ------------------------------------------------------------------------------------------
d <- recode(c(m01:m03,m10), old=0:5, new=5:0)
## 
## --------------------------------------------------------
## First four rows of data to recode for data frame: d 
## --------------------------------------------------------
##   m01 m02 m03 m10
## 1   0   4   1   4
## 2   0   1   4   4
## 3   2   1   0   4
## 4   0   5   2   2
## 
## 
## Recoding Specification
## ----------------------
##    0 --> 5 
##    1 --> 4 
##    2 --> 3 
##    3 --> 2 
##    4 --> 1 
##    5 --> 0 
## 
## Number of cases (rows) to recode: 351 
## 
## Replace existing values of each specified variable, no value for option: new.var
## 
## ---  Recode: m01 ---------------------------------
## Number of unique values of m01 in the data: 6 
## Number of values of m01 to recode: 6 
## 
## ---  Recode: m02 ---------------------------------
## Number of unique values of m02 in the data: 6 
## Number of values of m02 to recode: 6 
## 
## ---  Recode: m03 ---------------------------------
## Number of unique values of m03 in the data: 6 
## Number of values of m03 to recode: 6 
## 
## ---  Recode: m10 ---------------------------------
## Number of unique values of m10 in the data: 6 
## Number of values of m10 to recode: 6 
## 
## 
## ------------------------------------------------
## First four rows of recoded data
## ------------------------------------------------
##   m01 m02 m03 m10
## 1   5   1   4   1
## 2   5   4   1   1
## 3   3   4   5   1
## 4   5   0   3   3
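For an evenly spaced integer scale, reverse scoring is also a simple arithmetic transformation: subtract each response from the sum of the lowest and highest scale values. The following base R sketch applies that rule to the first four m01 responses listed above. The names lo, hi, and m01_rev are illustrative.

```r
# First four m01 responses from the output above
m01 <- c(0, 0, 2, 0)

# For a scale that runs from lo to hi, the reversed score is (lo + hi) - x
lo <- 0
hi <- 5
m01_rev <- (lo + hi) - m01
m01_rev
```

The reversed values agree with the first four recoded rows of m01 shown above.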

Missing Data

The function also addresses missing data. Existing data values can be converted to an R missing value. In this example, all values of 1 for the variable Plan are considered missing.

d <- Read("Employee")
## 
## >>> Suggestions
## Recommended binary format for data files: feather
##   Create with Write(d, "your_file", format="feather")
## More details about your data, Enter:  details()  for d, or  details(name)
## 
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
## 
##     Variable                  Missing  Unique 
##         Name     Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1     Years   integer     36       1      16   7  NA  7 ... 1  2  10
##  2    Gender character     37       0       2   M  M  W ... W  W  M
##  3      Dept character     36       1       5   ADMN  SALE  FINC ... MKTG  SALE  FINC
##  4    Salary    double     37       0      37   63788.26  104494.58 ... 66508.32  67562.36
##  5    JobSat character     35       2       3   med  low  high ... high  low  high
##  6      Plan   integer     37       0       3   1  1  2 ... 2  2  1
##  7       Pre   integer     37       0      27   82  62  90 ... 83  59  80
##  8      Post   integer     37       0      22   92  74  86 ... 90  71  87
## ------------------------------------------------------------------------------------------
newdata <- recode(Plan, old=1, new="missing")
## 
## --------------------------------------------------------
## First four rows of data to recode for data frame: d 
## --------------------------------------------------------
##                  Plan
## Ritchie, Darnell    1
## Wu, James           1
## Downs, Deborah      2
## Hoang, Binh         3
## 
## 
## Recoding Specification
## ----------------------
##    1 --> missing 
## 
## 
## R represents missing data with a NA for 'not assigned'.
## 
## Number of cases (rows) to recode: 37 
## 
## Replace existing values of each specified variable, no value for option: new.var
## 
## ---  Recode: Plan ---------------------------------
## Number of unique values of Plan in the data: 3 
## Number of values of Plan to recode: 1 
## 
## 
## ------------------------------------------------
## First four rows of recoded data
## ------------------------------------------------
##                  Plan
## Ritchie, Darnell   NA
## Wu, James          NA
## Downs, Deborah      2
## Hoang, Binh         3

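The recoded data were assigned to newdata, so in that data frame the values of 1 for Plan are now missing, coded as NA for not available, as in the recoded rows listed above. The original data frame d is unchanged, as shown by listing its first six rows with the base R function head().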

head(d)
##                  Years Gender Dept    Salary JobSat Plan Pre Post
## Ritchie, Darnell     7      M ADMN  63788.26    med    1  82   92
## Wu, James           NA      M SALE 104494.58    low    1  62   74
## Downs, Deborah       7      W FINC  67139.90   high    2  90   86
## Hoang, Binh         15      M SALE 121074.86    low    3  96   97
## Jones, Alissa        5      W <NA>  63772.58   <NA>    1  65   62
## Afshari, Anbar       6      W ADMN  79441.93   high    2 100  100
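In base R, the conversion of a data value to NA can be sketched with logical indexing. The vector plan below holds the first six values of Plan from the listing above and is created here only for illustration.

```r
# First six values of Plan from the Employee data
plan <- c(1, 1, 2, 3, 1, 2)

# Logical indexing: assign NA wherever the value equals 1
plan[plan == 1] <- NA
plan
```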

The procedure can also be reversed: values that are missing, coded as NA in R, can be converted to non-missing values. To illustrate with the Employee data set, examine the first six rows of data. The value of Years is missing in the second row of data.

d <- Read("Employee")
## 
## >>> Suggestions
## Recommended binary format for data files: feather
##   Create with Write(d, "your_file", format="feather")
## More details about your data, Enter:  details()  for d, or  details(name)
## 
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
## 
##     Variable                  Missing  Unique 
##         Name     Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1     Years   integer     36       1      16   7  NA  7 ... 1  2  10
##  2    Gender character     37       0       2   M  M  W ... W  W  M
##  3      Dept character     36       1       5   ADMN  SALE  FINC ... MKTG  SALE  FINC
##  4    Salary    double     37       0      37   63788.26  104494.58 ... 66508.32  67562.36
##  5    JobSat character     35       2       3   med  low  high ... high  low  high
##  6      Plan   integer     37       0       3   1  1  2 ... 2  2  1
##  7       Pre   integer     37       0      27   82  62  90 ... 83  59  80
##  8      Post   integer     37       0      22   92  74  86 ... 90  71  87
## ------------------------------------------------------------------------------------------
head(d)
##                  Years Gender Dept    Salary JobSat Plan Pre Post
## Ritchie, Darnell     7      M ADMN  63788.26    med    1  82   92
## Wu, James           NA      M SALE 104494.58    low    1  62   74
## Downs, Deborah       7      W FINC  67139.90   high    2  90   86
## Hoang, Binh         15      M SALE 121074.86    low    3  96   97
## Jones, Alissa        5      W <NA>  63772.58   <NA>    1  65   62
## Afshari, Anbar       6      W ADMN  79441.93   high    2 100  100

Here convert all missing data values for the variables Years and Salary to the value of 99.

d <- recode(c(Years, Salary), old="missing", new=99)
## 
## --------------------------------------------------------
## First four rows of data to recode for data frame: d 
## --------------------------------------------------------
##                  Years    Salary
## Ritchie, Darnell     7  63788.26
## Wu, James           NA 104494.58
## Downs, Deborah       7  67139.90
## Hoang, Binh         15 121074.86
## 
## 
## Recoding Specification
## ----------------------
##    missing --> 99 
## 
## Number of cases (rows) to recode: 37 
## 
## Replace existing values of each specified variable, no value for option: new.var
## 
## ---  Recode: Years ---------------------------------
## Number of unique values of Years in the data: 16 
## >>> Note: A value specified to recode, missing, is not in the data.
## 
## Number of values of Years to recode: 1 
## 
## ---  Recode: Salary ---------------------------------
## Unique values of Salary in the data: 56124.97 59188.96 59704.79 59868.68 61036.85 63772.58 63788.26 65545.25 66508.32 66772.95 67139.9 67562.36 71055.44 71356.69 71961.29 76312.89 76337.83 79441.93 79547.6 79624.87 81084.02 82321.36 82502.5 82675.26 87714.85 91871.05 93014.43 97785.51 101352.3 102681.2 104494.6 105027.6 109062.7 118138.4 121074.9 132563.4 144419.2 
## Number of unique values of Salary in the data: 37 
## >>> Note: A value specified to recode, missing, is not in the data.
## 
## Number of values of Salary to recode: 1 
## 
## 
## ------------------------------------------------
## First four rows of recoded data
## ------------------------------------------------
##                  Years    Salary
## Ritchie, Darnell     7  63788.26
## Wu, James           99 104494.58
## Downs, Deborah       7  67139.90
## Hoang, Binh         15 121074.86
head(d)
##                  Years Gender Dept    Salary JobSat Plan Pre Post
## Ritchie, Darnell     7      M ADMN  63788.26    med    1  82   92
## Wu, James           99      M SALE 104494.58    low    1  62   74
## Downs, Deborah       7      W FINC  67139.90   high    2  90   86
## Hoang, Binh         15      M SALE 121074.86    low    3  96   97
## Jones, Alissa        5      W <NA>  63772.58   <NA>    1  65   62
## Afshari, Anbar       6      W ADMN  79441.93   high    2 100  100

Now the value of Years in the second row of data is 99.
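The reverse conversion can be sketched in base R with is.na(). The vector years holds the first six values of Years from the listing above and is created here only for illustration.

```r
# First six values of Years from the Employee data, one of them missing
years <- c(7, NA, 7, 15, 5, 6)

# is.na() flags the missing entries; assign 99 to those positions
years[is.na(years)] <- 99
years
```

A numeric code such as 99 is then treated as an ordinary data value by later analyses, so it should not be mistaken for a valid number of years.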

Sort Rows of Data

The lessR function order_by() sorts the rows of a data frame according to the values of one or more variables contained in the data frame, or according to the row names. Variable types include numeric and factor variables. Factors are sorted by the ordering of their levels, which by default is alphabetical.

To illustrate, use the lessR Employee data set, here just the first 12 rows of data to save space.

d <- Read("Employee")
## 
## >>> Suggestions
## Recommended binary format for data files: feather
##   Create with Write(d, "your_file", format="feather")
## More details about your data, Enter:  details()  for d, or  details(name)
## 
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
## 
##     Variable                  Missing  Unique 
##         Name     Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1     Years   integer     36       1      16   7  NA  7 ... 1  2  10
##  2    Gender character     37       0       2   M  M  W ... W  W  M
##  3      Dept character     36       1       5   ADMN  SALE  FINC ... MKTG  SALE  FINC
##  4    Salary    double     37       0      37   63788.26  104494.58 ... 66508.32  67562.36
##  5    JobSat character     35       2       3   med  low  high ... high  low  high
##  6      Plan   integer     37       0       3   1  1  2 ... 2  2  1
##  7       Pre   integer     37       0      27   82  62  90 ... 83  59  80
##  8      Post   integer     37       0      22   92  74  86 ... 90  71  87
## ------------------------------------------------------------------------------------------
d <- d[1:12,]
d <- order_by(d, Gender)
## 
## Sort Specification
##   Gender -->  ascending
d
##                  Years Gender Dept    Salary JobSat Plan Pre Post
## Ritchie, Darnell     7      M ADMN  63788.26    med    1  82   92
## Wu, James           NA      M SALE 104494.58    low    1  62   74
## Hoang, Binh         15      M SALE 121074.86    low    3  96   97
## Knox, Michael       18      M MKTG 109062.66    med    3  81   84
## Campagna, Justin     8      M SALE  82321.36    low    1  76   84
## Pham, Scott         13      M SALE  91871.05   high    2  90   94
## Downs, Deborah       7      W FINC  67139.90   high    2  90   86
## Jones, Alissa        5      W <NA>  63772.58   <NA>    1  65   62
## Afshari, Anbar       6      W ADMN  79441.93   high    2 100  100
## Kimball, Claire      8      W MKTG  71356.69   high    2  93   92
## Cooper, Lindsay      4      W MKTG  66772.95   high    1  78   91
## Saechao, Suzanne     8      W SALE  65545.25    med    1  98  100
d <- order_by(d, c(Gender, Salary), direction=c("+", "-"))
## 
## Sort Specification
##   Gender -->  ascending 
##   Salary -->  descending
d
##                  Years Gender Dept    Salary JobSat Plan Pre Post
## Hoang, Binh         15      M SALE 121074.86    low    3  96   97
## Knox, Michael       18      M MKTG 109062.66    med    3  81   84
## Wu, James           NA      M SALE 104494.58    low    1  62   74
## Pham, Scott         13      M SALE  91871.05   high    2  90   94
## Campagna, Justin     8      M SALE  82321.36    low    1  76   84
## Ritchie, Darnell     7      M ADMN  63788.26    med    1  82   92
## Afshari, Anbar       6      W ADMN  79441.93   high    2 100  100
## Kimball, Claire      8      W MKTG  71356.69   high    2  93   92
## Downs, Deborah       7      W FINC  67139.90   high    2  90   86
## Cooper, Lindsay      4      W MKTG  66772.95   high    1  78   91
## Saechao, Suzanne     8      W SALE  65545.25    med    1  98  100
## Jones, Alissa        5      W <NA>  63772.58   <NA>    1  65   62
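The two-key sort above, ascending on Gender and descending on Salary within each gender, can be sketched in base R with order(). The data frame d_small contains four rows copied from the listing above and is an illustrative subset, not the full data.

```r
# Four rows taken from the Employee data, for illustration
d_small <- data.frame(
  Gender = c("M", "W", "M", "W"),
  Salary = c(63788.26, 67139.90, 121074.86, 79441.93),
  row.names = c("Ritchie, Darnell", "Downs, Deborah",
                "Hoang, Binh", "Afshari, Anbar"),
  stringsAsFactors = FALSE
)

# order() sorts by its first argument and breaks ties with the second;
# negating a numeric variable gives descending order for that key
idx <- order(d_small$Gender, -d_small$Salary)
d_sorted <- d_small[idx, ]
rownames(d_sorted)
```

The resulting row order is Hoang, Ritchie, Afshari, Downs, which agrees with the relative order of those four rows in the sorted output above.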

Sort by row names in ascending order.

d <- order_by(d, row.names)
## 
## Sort Specification
##   row.names -->  ascending
d
##                  Years Gender Dept    Salary JobSat Plan Pre Post
## Afshari, Anbar       6      W ADMN  79441.93   high    2 100  100
## Campagna, Justin     8      M SALE  82321.36    low    1  76   84
## Cooper, Lindsay      4      W MKTG  66772.95   high    1  78   91
## Downs, Deborah       7      W FINC  67139.90   high    2  90   86
## Hoang, Binh         15      M SALE 121074.86    low    3  96   97
## Jones, Alissa        5      W <NA>  63772.58   <NA>    1  65   62
## Kimball, Claire      8      W MKTG  71356.69   high    2  93   92
## Knox, Michael       18      M MKTG 109062.66    med    3  81   84
## Pham, Scott         13      M SALE  91871.05   high    2  90   94
## Ritchie, Darnell     7      M ADMN  63788.26    med    1  82   92
## Saechao, Suzanne     8      W SALE  65545.25    med    1  98  100
## Wu, James           NA      M SALE 104494.58    low    1  62   74

Randomize the order of the rows of data.

d <- order_by(d, random)
## 
## Sort Specification
##   random
d
##                  Years Gender Dept    Salary JobSat Plan Pre Post
## Saechao, Suzanne     8      W SALE  65545.25    med    1  98  100
## Kimball, Claire      8      W MKTG  71356.69   high    2  93   92
## Hoang, Binh         15      M SALE 121074.86    low    3  96   97
## Pham, Scott         13      M SALE  91871.05   high    2  90   94
## Campagna, Justin     8      M SALE  82321.36    low    1  76   84
## Cooper, Lindsay      4      W MKTG  66772.95   high    1  78   91
## Knox, Michael       18      M MKTG 109062.66    med    3  81   84
## Ritchie, Darnell     7      M ADMN  63788.26    med    1  82   92
## Jones, Alissa        5      W <NA>  63772.58   <NA>    1  65   62
## Afshari, Anbar       6      W ADMN  79441.93   high    2 100  100
## Wu, James           NA      M SALE 104494.58    low    1  62   74
## Downs, Deborah       7      W FINC  67139.90   high    2  90   86
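Sorting by row names and shuffling rows can both be sketched in base R by indexing the data frame with a vector of row positions. The data frame d_small and the seed value are illustrative. The seed is set only so that the shuffle in this sketch is reproducible.

```r
# Three rows taken from the Employee data, for illustration
d_small <- data.frame(
  Years = c(7, 15, 6),
  row.names = c("Ritchie, Darnell", "Hoang, Binh", "Afshari, Anbar")
)

# Sort by row names: order() returns the positions in alphabetical order
by_name <- d_small[order(rownames(d_small)), , drop = FALSE]
rownames(by_name)

# Shuffle: sample() returns a random permutation of the row positions
set.seed(123)
shuffled <- d_small[sample(nrow(d_small)), , drop = FALSE]
```

The drop = FALSE argument keeps the result a data frame even though d_small has a single column.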

Rescale Data

rescale(Salary)
##  [1] -0.837 -0.545  1.950  0.484  0.005 -0.775  1.347 -0.925 -0.926 -0.139
## [11]  1.118 -0.757
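The call above specifies no parameters, and the printed values appear consistent with standardization: each Salary minus the mean, divided by the standard deviation. The following base R sketch applies that formula to the 12 Salary values in the order of the shuffled data frame above. That this matches the default computation of rescale() is an inference from the output, not a statement about the function's internals.

```r
# The 12 Salary values in the row order of the shuffled data frame above
salary <- c(65545.25, 71356.69, 121074.86, 91871.05, 82321.36, 66772.95,
            109062.66, 63788.26, 63772.58, 79441.93, 104494.58, 67139.90)

# z-score: distance from the mean in units of the sample standard deviation
z <- (salary - mean(salary)) / sd(salary)
round(z, 3)
```

Standardized values have a mean of 0 and a standard deviation of 1, which places variables measured in different units on a common scale.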

Rename a Variable in a Data Frame

List the name of the data frame, the existing variable name, and the new name, in that order.

names(d)
## [1] "Years"  "Gender" "Dept"   "Salary" "JobSat" "Plan"   "Pre"    "Post"
d <- rename(d, Salary, AnnualSalary)
## Change the following variable names for data frame d :
## 
## Salary --> AnnualSalary
names(d)
## [1] "Years"        "Gender"       "Dept"         "AnnualSalary" "JobSat"      
## [6] "Plan"         "Pre"          "Post"
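For comparison, a base R sketch of the same renaming assigns into names(). The data frame d_small is a two-row illustration, not the Employee data.

```r
# Small illustrative data frame with the variable to be renamed
d_small <- data.frame(Years = c(7, 15), Salary = c(63788.26, 121074.86))

# Replace the matching element of the names vector
names(d_small)[names(d_small) == "Salary"] <- "AnnualSalary"
names(d_small)
```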

Create Factor Variables

d <- rd("Mach4", quiet=TRUE)
l <- rd("Mach4_lbl")
## 
## >>> Suggestions
## Recommended binary format for data files: feather
##   Create with Write(d, "your_file", format="feather")
## More details about your data, Enter:  details()  for d, or  details(name)
## 
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## ------------------------------------------------------------
## 
##     Variable                  Missing  Unique 
##         Name     Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1     label character     20       0      20   Never tell anyone the real reason you did something unless it is useful to do so ... Most people forget more easily the death of a parent than the loss of their property
## ------------------------------------------------------------------------------------------
LikertCats <- c("Strongly Disagree", "Disagree", "Slightly Disagree",
                "Slightly Agree", "Agree", "Strongly Agree")
d <- factors(m01:m20, levels=0:5, labels=LikertCats)

Convert only the three variables given in the specified vector to factors. With new=TRUE, leave the original variables unmodified and create new variables for the factors.

d <- factors(c(m06, m07, m20), levels=0:5, labels=LikertCats, new=TRUE)

Now copy the variable labels from the original integer variables to the newly created factor variables.

l <- factors(c(m06, m07, m20), var_labels=TRUE)
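The pairing of integer levels with Likert labels can be sketched for a single variable with the base R function factor(). The vector m06 holds the first three m06 responses listed in the earlier Read() output, and m06_f is a name chosen here for illustration.

```r
LikertCats <- c("Strongly Disagree", "Disagree", "Slightly Disagree",
                "Slightly Agree", "Agree", "Strongly Agree")

# First three m06 responses from the Mach4 listing
m06 <- c(5, 3, 4)

# levels gives the values in the data, labels gives their names, in order
m06_f <- factor(m06, levels = 0:5, labels = LikertCats)
as.character(m06_f)
```

All six levels are retained in the factor even though only three of them occur in this small vector.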

Reshape Data

Reshape Data Wide to Long

A wide-form data table has multiple measurements from the same unit of analysis (e.g., person) across a row of data, usually repeated over time. Converting to long form creates three columns from the wide-form input: a grouping variable that holds the names of the original columns, a response variable that holds their values, and an ID field that identifies the unit of analysis.

Read the data.

d <- Read("Anova_rb")
## 
## >>> Suggestions
## Recommended binary format for data files: feather
##   Create with Write(d, "your_file", format="feather")
## More details about your data, Enter:  details()  for d, or  details(name)
## 
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## ------------------------------------------------------------
## 
##     Variable                  Missing  Unique 
##         Name     Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1    Person character      7       0       7   p1  p2  p3 ... p5  p6  p7
##  2      sup1   integer      7       0       4   2  2  8 ... 2  5  2
##  3      sup2   integer      7       0       5   4  5  6 ... 1  5  3
##  4      sup3   integer      7       0       5   4  4  7 ... 2  6  2
##  5      sup4   integer      7       0       6   3  6  9 ... 3  8  4
## ------------------------------------------------------------------------------------------
## 
## 
## For the column Person, each row of data is unique. Are these values
## a unique ID for each row? To define as a row name, re-read the data file
## with the following setting added to your Read() statement: row_names=1
d
##   Person sup1 sup2 sup3 sup4
## 1     p1    2    4    4    3
## 2     p2    2    5    4    6
## 3     p3    8    6    7    9
## 4     p4    4    3    5    7
## 5     p5    2    1    2    3
## 6     p6    5    5    6    8
## 7     p7    2    3    2    4

Use the default variable names in the long form.

reshape_long(d, c("sup1", "sup2", "sup3", "sup4"))
##     ID Person Group Response
## 1  ID1     p1  sup1        2
## 2  ID2     p2  sup1        2
## 3  ID3     p3  sup1        8
## 4  ID4     p4  sup1        4
## 5  ID5     p5  sup1        2
## 6  ID6     p6  sup1        5
## 7  ID7     p7  sup1        2
## 8  ID1     p1  sup2        4
## 9  ID2     p2  sup2        5
## 10 ID3     p3  sup2        6
## 11 ID4     p4  sup2        3
## 12 ID5     p5  sup2        1
## 13 ID6     p6  sup2        5
## 14 ID7     p7  sup2        3
## 15 ID1     p1  sup3        4
## 16 ID2     p2  sup3        4
## 17 ID3     p3  sup3        7
## 18 ID4     p4  sup3        5
## 19 ID5     p5  sup3        2
## 20 ID6     p6  sup3        6
## 21 ID7     p7  sup3        2
## 22 ID1     p1  sup4        3
## 23 ID2     p2  sup4        6
## 24 ID3     p3  sup4        9
## 25 ID4     p4  sup4        7
## 26 ID5     p5  sup4        3
## 27 ID6     p6  sup4        8
## 28 ID7     p7  sup4        4

Next, specify custom variable names for the long form. The columns to be transformed are usually adjacent in the data frame, so identify them with a range, here sup1:sup4. Only the first two parameter values are required: the data frame that contains the variables and the variables to be transformed.

reshape_long(d, sup1:sup4, 
             group="Supplement", response="Reps", ID="Person", prefix="P")
##    Person Supplement Reps
## 1     Pp1       sup1    2
## 2     Pp2       sup1    2
## 3     Pp3       sup1    8
## 4     Pp4       sup1    4
## 5     Pp5       sup1    2
## 6     Pp6       sup1    5
## 7     Pp7       sup1    2
## 8     Pp1       sup2    4
## 9     Pp2       sup2    5
## 10    Pp3       sup2    6
## 11    Pp4       sup2    3
## 12    Pp5       sup2    1
## 13    Pp6       sup2    5
## 14    Pp7       sup2    3
## 15    Pp1       sup3    4
## 16    Pp2       sup3    4
## 17    Pp3       sup3    7
## 18    Pp4       sup3    5
## 19    Pp5       sup3    2
## 20    Pp6       sup3    6
## 21    Pp7       sup3    2
## 22    Pp1       sup4    3
## 23    Pp2       sup4    6
## 24    Pp3       sup4    9
## 25    Pp4       sup4    7
## 26    Pp5       sup4    3
## 27    Pp6       sup4    8
## 28    Pp7       sup4    4
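The structure of the long form can be sketched in base R by stacking the columns manually. The following uses the first three persons and two of the supplement columns from the data above. The names wide, vars, and long are illustrative, and the sketch is not the implementation of reshape_long().

```r
# Subset of the wide-form data: three persons, two measurement columns
wide <- data.frame(Person = c("p1", "p2", "p3"),
                   sup1 = c(2, 2, 8),
                   sup2 = c(4, 5, 6),
                   stringsAsFactors = FALSE)
vars <- c("sup1", "sup2")

long <- data.frame(
  Person   = rep(wide$Person, times = length(vars)),  # IDs repeat per column
  Group    = rep(vars, each = nrow(wide)),            # source column name
  Response = unlist(wide[vars], use.names = FALSE),   # stacked values
  stringsAsFactors = FALSE
)
long
```

Each wide-form row contributes one long-form row per measurement column, so 3 persons and 2 columns yield 6 rows.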

Reshape Data Long to Wide

A long-form data frame can also be reshaped to wide form.

To illustrate, begin with a wide-form data frame and first convert it to long form.

d <- Read("Anova_rb")
## 
## >>> Suggestions
## Recommended binary format for data files: feather
##   Create with Write(d, "your_file", format="feather")
## More details about your data, Enter:  details()  for d, or  details(name)
## 
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## ------------------------------------------------------------
## 
##     Variable                  Missing  Unique 
##         Name     Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1    Person character      7       0       7   p1  p2  p3 ... p5  p6  p7
##  2      sup1   integer      7       0       4   2  2  8 ... 2  5  2
##  3      sup2   integer      7       0       5   4  5  6 ... 1  5  3
##  4      sup3   integer      7       0       5   4  4  7 ... 2  6  2
##  5      sup4   integer      7       0       6   3  6  9 ... 3  8  4
## ------------------------------------------------------------------------------------------
## 
## 
## For the column Person, each row of data is unique. Are these values
## a unique ID for each row? To define as a row name, re-read the data file
## with the following setting added to your Read() statement: row_names=1
d
##   Person sup1 sup2 sup3 sup4
## 1     p1    2    4    4    3
## 2     p2    2    5    4    6
## 3     p3    8    6    7    9
## 4     p4    4    3    5    7
## 5     p5    2    1    2    3
## 6     p6    5    5    6    8
## 7     p7    2    3    2    4
dl <- reshape_long(d, sup1:sup4)  # convert to long-form
head(dl)
##    ID Person Group Response
## 1 ID1     p1  sup1        2
## 2 ID2     p2  sup1        2
## 3 ID3     p3  sup1        8
## 4 ID4     p4  sup1        4
## 5 ID5     p5  sup1        2
## 6 ID6     p6  sup1        5

Convert back to wide form.

reshape_wide(dl, widen=Group, response=Response, ID=Person)
##   Person sup1 sup2 sup3 sup4
## 1     p1    2    4    4    3
## 2     p2    2    5    4    6
## 3     p3    8    6    7    9
## 4     p4    4    3    5    7
## 5     p5    2    1    2    3
## 6     p6    5    5    6    8
## 7     p7    2    3    2    4

Here convert with the name of the response prefixed to the column names.

reshape_wide(dl, widen=Group, response=Response, ID=Person,
             prefix=TRUE, sep=".")
##   Person Response.sup1 Response.sup2 Response.sup3 Response.sup4
## 1     p1             2             4             4             3
## 2     p2             2             5             4             6
## 3     p3             8             6             7             9
## 4     p4             4             3             5             7
## 5     p5             2             1             2             3
## 6     p6             5             5             6             8
## 7     p7             2             3             2             4
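For comparison, the base R function reshape() from the stats package performs a similar long-to-wide conversion and, by default, also prefixes the response name to the new column names. The data frame long below is an illustrative subset with three persons and two groups.

```r
# Long-form subset: one row per person per supplement
long <- data.frame(Person   = rep(c("p1", "p2", "p3"), times = 2),
                   Group    = rep(c("sup1", "sup2"), each = 3),
                   Response = c(2, 2, 8, 4, 5, 6),
                   stringsAsFactors = FALSE)

# idvar identifies the unit of analysis, timevar the column to spread
wide <- reshape(long, idvar = "Person", timevar = "Group",
                direction = "wide")
names(wide)
```

The resulting column names are Person, Response.sup1, and Response.sup2, the same naming pattern as the prefixed output above.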

Create Training and Testing Data

Get the data, the Employee data set.

d <- Read("Employee", quiet=TRUE)

Create four component data frames: out$train_x, out$train_y, out$test_x, and out$test_y. Specify the response variable as Salary.

out <- train_test(d, Salary)
names(out)
## [1] "train_x" "train_y" "test_x"  "test_y"

Create two component data frames: out$train and out$test. All the variables in the original data frame are included in the component data frames.

out <- train_test(d)
names(out)
## [1] "train" "test"
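The idea behind such a split can be sketched in base R by randomly sampling row positions. The data frame d_small, the seed, and the training proportion of 0.75 are assumptions made for this illustration and are not taken from the train_test() calls above.

```r
set.seed(42)  # for a reproducible split in this sketch

# Hypothetical data frame with 20 rows
d_small <- data.frame(x = 1:20, y = (1:20) * 2)

p_train <- 0.75  # assumed proportion of rows assigned to training
n <- nrow(d_small)

# Randomly choose the training rows; the remaining rows form the test set
train_rows <- sample(n, size = round(p_train * n))
train <- d_small[train_rows, ]
test  <- d_small[-train_rows, ]
c(nrow(train), nrow(test))
```

Every row lands in exactly one of the two data frames, so the model is evaluated on rows that it did not see during training.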