Descriptive summarize characteristics of the sample data values for
one or more variables. The lessR pivot()
function serves as a single source for a wide variety and types of
descriptive statistics for one or more variables communicated in the
form of a pivot table. Compute statistics either for the entire
data set at once or separately for different groups of data.
Aggregation: From the values of categorical variables, form groups of data and then compute one or more statistics for each group.
As an example of aggregation, compute the mean Years worked for each
combination of the levels of Gender and Dept
(department) of employment. To introduce the ease of use of
pivot()
, and to provide context for the following
discussion, the function call listed below includes the parameter names
(in red) for emphasis. However, if the parameters are entered in this
order, listing their names is not necessary. Enter multiple values, such
as for Gender and Dept as an R vector, defined such as
with the base R combine c()
function. The output of the
analysis is shown later.
The pivot()
function computes statistics for three
classes of variables:
Any function that processes a single vector of data, such as a column
of data values for a variable in a data frame, and outputs a single
computed value, the statistic, can be passed to pivot()
.
The phrase “any function” is quite general, including any function in
any active package as well as user-defined functions that satisfy these
criteria. The function accessed by pivot()
can be one of
the many available R statistical functions that summarize data,
including those in the following table.
Statistic | Meaning |
---|---|
sum |
sum |
mean |
arithmetic mean |
median |
median |
min |
minimum |
max |
maximum |
sd |
standard deviation |
var |
variance |
skew |
skew |
kurtosis |
kurtosis |
IQR |
inter-quartile range |
mad |
mean absolute deviation |
The statistics skew
and kurtosis
have no
counterparts in base R, so are provided by lessR.
Computations of all other statistics follow from base R functions.
The quantile
and table
computations return
multiple values. They are accessible to pivot()
only
because of internal programming that recognizes these functions and then
processes the resulting multiple output values.
Statistic | Meaning |
---|---|
quantile |
min, quantiles, max |
table |
cell counts or proportions |
The default quantiles for quantile
are quartiles.
Specify a custom number of quantiles with the q_num
parameter other than 4 for quartiles.
The table
computation applies to an aggregated variable
that consists of discrete categories, such as the numbers 1 through 5
for responses to a 5-pt Likert scale. The result is a table of
frequencies or proportions, referred to for two or more variables as
either a contingency table, a cross-tabulation table, or a joint
frequency distribution. Only the table
computation applies
to non-numeric as well as numeric variables, though only meaningful if
the aggregated variable consists of a relatively small set of discrete
values, character strings or numeric.
When calculating a statistic, the analyst should be aware of the
present and missing (not available or NA
) data that
underlies the computed statistic. Accordingly, pivot()
by
default provides the sample size information for each calculation. Turn
off the display of information by setting the show_n
parameter to FALSE
.
The following pivot()
parameters specify the data, one
or more statistics to compute for the aggregation, the variables over
which to aggregate, and the corresponding groups that contain the
aggregated values. The first three parameter values listed below are
required: the data frame, at least one statistic to compute, and at
least one variable for which to compute the statistic(s).
data
: The data frame that includes the variables of
interest.compute
: One or more functions that specify the
corresponding statistics to compute.variable
: One or more numerical variables to summarize,
either by aggregation over groups or the entire sample as a single
group.by
: Specify the optional aggregation according to the
categorical variable(s) that define the groups.by_cols
: The optional categorical variable(s) that
define the groups or cells for which to compute the aggregated values,
listed as columns in a two-dimensional table instead of a data
frame.For the given data
, compute
at least one
statistic for at least one variable
. If aggregating across
groups, compute the statistic(s) for each group specified by
by
and possibly by_cols
.
If no categorical (by
) variables are selected to define
groups, then one or more statistics are computed over multiple variables
over the entire data table defined as a single group. For categorical
(grouping) variables, defined with by
and optionally
by_cols
, choose to either compute
multiple
statistics or multiple variable
(s), but not both.
Select any two of the three possibilities for multiple parameter values: Multiple compute functions, multiple variables over which to compute, and multiple categorical variables by which to define groups for aggregation.
Specify multiple descriptive statistics to compute, multiple values
for which to do the computation, and multiple categorical variables to
define groups as vectors such as with the base R c()
function.
The default output of pivot()
is a data frame, a
two-dimensional table, rows and columns. The output table can have
multiple rows and multiple columns according to the choice of parameter
values. For each numerical variable in the analysis,
pivot()
displays both the corresponding sample size as
n_
and amount of missing or not available data as
na_
. Prevent displaying sample size information by setting
parameter show_n
to FALSE
.
The output is structured according to the following.
by_cols
parameter, the output is a table instead of a data
frame with the specified categorical variables in the columns, amenable
for viewing.by
variables,
the corresponding statistics are computed over the entire input data
frame defined as a single group.by
variables.To illustrate, use the 37-row Employee data set included with lessR, here read into the d data frame.
Two categorical variables in the d data frame are Dept and Gender, encoded as character strings. Continuous variables include Years employed at the company and annual Salary.
This example does the analysis previously illustrated. The name of
the computed mean of Years by default is Years_mean.
With no by_cols
variables, the output of
pivot()
is a data frame of the aggregated variables.
Optionally save this output for further analysis, as the data frame
a in this example.
## Dept Gender Years_n Years_na Years_mean
## 1 ACCT M 2 0 7.000
## 2 ADMN M 2 0 15.500
## 3 FINC M 3 0 11.333
## 4 MKTG M 1 0 18.000
## 5 SALE M 9 1 12.333
## 6 <NA> M 0 0 NA
## 7 ACCT W 3 0 4.667
## 8 ADMN W 4 0 7.500
## 9 FINC W 1 0 7.000
## 10 MKTG W 5 0 8.200
## 11 SALE W 5 0 6.600
## 12 <NA> W 1 0 5.000
Using the lessR function Read()
, can
read the original data table from which the pivot table was constructed,
such as in the form of an Excel worksheet. For many analyses, easier to
read the Excel data into R and do the analysis in R than remain in
Excel. The result can also be written back into an Excel file with the
lessR function Write()
.
In this example, create an Excel file called
MyPivotTable.xlsx from the pivot table stored in the a
data frame. To avoid creating this file, the function call is commented
out with the #
symbol in the first column.
The abbreviation wrt_x()
for the function name
simplifies the preceding expression for writing to an Excel file, with
the format
parameter dropped.
In this next example, specify multiple statistics for which to
aggregate for each group for the specified value variable
Salary. For each group, compare the mean to the median, and the
standard deviation to the interquartile range. By default, each column
of an aggregated statistic is the variable
name, here
Salary, followed by a “_”, then either the name of the
statistic or an abbreviation. The respective abbreviations for
mean
and median
are mn
and
md
.
## Dept Salary_n Salary_na Salary_mean Salary_mdn Salary_sd Salary_IQR
## 1 ACCT 5 0 61792.78 69547.60 12774.61 21379.23
## 2 ADMN 6 0 81277.12 71058.60 27585.15 36120.57
## 3 FINC 4 0 69010.68 61937.62 17852.50 16034.81
## 4 MKTG 6 0 70257.13 61658.99 19869.81 26085.69
## 5 SALE 15 0 78830.07 77714.85 23476.84 28810.28
## 6 <NA> 1 0 53772.58 53772.58 NA 0.00
Also have available two functions provided by lessR
that are not part of base R: skew()
and
kurtosis()
.
## Dept Salary_n Salary_na Salary_mean Salary_sd Salary_skew Salary_kurt
## 1 ACCT 5 0 61792.78 12774.61 -0.623 -3.032
## 2 ADMN 6 0 81277.12 27585.15 0.835 -1.185
## 3 FINC 4 0 69010.68 17852.50 1.689 2.752
## 4 MKTG 6 0 70257.13 19869.81 0.859 -1.458
## 5 SALE 15 0 78830.07 23476.84 0.863 0.856
## 6 <NA> 1 0 53772.58 NA NA NA
Can also customize the variable names of the aggregated statistics
with the out_names
parameter. Here calculate the mean and
median Salary for each group defined by each combination of
levels for Gender and Dept.
## Gender Dept Salary_n Salary_na MeanSalary MedianSalary
## 1 M ACCT 2 0 59626.19 59626.19
## 2 W ACCT 3 0 63237.16 71084.02
## 3 M ADMN 2 0 80963.35 80963.35
## 4 W ADMN 4 0 81434.00 71058.60
## 5 M FINC 3 0 72967.60 66312.89
## 6 W FINC 1 0 57139.90 57139.90
## 7 M MKTG 1 0 99062.66 99062.66
## 8 W MKTG 5 0 64496.02 61356.69
## 9 M SALE 10 0 86150.97 82442.74
## 10 W SALE 5 0 64188.25 56508.32
## 11 M <NA> 0 0 NA NA
## 12 W <NA> 1 0 53772.58 53772.58
The pivot()
function can aggregate over multiple
variables
. Here, aggregate Years and
Salary. Round the numerical aggregated results to the nearest
integer with the digits_d
parameter, which specifies the
number of decimal digits in the output. Different variables can have
different amounts of missing data, so the sample size, n, and
number missing, the number of values Not Available, na, are
listed separately for each aggregated variable.
## Dept Gender Years_n Years_na Years_mean Salary_n Salary_na Salary_mean
## 1 ACCT M 2 0 7 2 0 59626
## 2 ADMN M 2 0 16 2 0 80963
## 3 FINC M 3 0 11 3 0 72968
## 4 MKTG M 1 0 18 1 0 99063
## 5 SALE M 9 1 12 10 0 86151
## 6 <NA> M 0 0 NA 0 0 NA
## 7 ACCT W 3 0 5 3 0 63237
## 8 ADMN W 4 0 8 4 0 81434
## 9 FINC W 1 0 7 1 0 57140
## 10 MKTG W 5 0 8 5 0 64496
## 11 SALE W 5 0 7 5 0 64188
## 12 <NA> W 1 0 5 1 0 53773
Turn off the display of the sample size and number of missing values
for each group by setting the parameter show_n
to
FALSE
.
## Dept Years_mean Salary_mean
## 1 ACCT 5.60 61792.78
## 2 ADMN 10.17 81277.12
## 3 FINC 10.25 69010.68
## 4 MKTG 9.83 70257.13
## 5 SALE 10.29 78830.06
## 6 <NA> 5.00 53772.58
Aggregation computes one or more statistics for one or more variables
across different groups. Or, compute the variables for each statistic
for all the data. To compute over all the rows of data, do not
specify groups, that is, drop the by
parameter.
Get the grand mean of Years, that is, for all the data.
## n na mean
## 36 1 9.389
Get the grand mean of Years and Salary. Specify custom names for the results.
## Years_n Years_na MeanYear Salary_n Salary_na MeanSalary
## 36 1 9.39 37 0 73795.56
Consider an example with more variables. Analyze the 6-pt Likert
scale responses to the Mach IV scale that assesses Machiavellianism.
Items are scored from 0 to 5, Strongly Disagree to Strongly Agree. The
data are included with lessR as the Mach4 data
file. Suppress output when reading by setting quiet
to
TRUE
.
Calculate the mean, standard deviation, skew, and kurtosis for all of the data for each of the 20 items on the scale. With this specification, the form of a data frame is statistics in the columns and the variables in the rows. The result are the specified summary statistics for the specified variables over the entire data set.
## n na mean sd skew kurtosis
## m01 351 0 1.279 1.286 0.98 0.36
## m02 351 0 1.746 1.480 0.48 -0.78
## m03 351 0 2.900 1.450 -0.36 -0.83
## m04 351 0 3.339 1.174 -0.85 0.46
## m05 351 0 2.234 1.583 0.08 -1.09
## m06 351 0 3.074 1.478 -0.45 -0.92
## m07 351 0 2.775 1.473 -0.07 -1.15
## m08 351 0 2.100 1.456 0.13 -1.08
## m09 351 0 4.225 1.155 -1.74 2.74
## m10 351 0 3.991 1.138 -1.15 0.93
## m11 351 0 1.641 1.395 0.64 -0.39
## m12 351 0 1.801 1.625 0.42 -1.12
## m13 351 0 1.385 1.368 0.84 -0.13
## m14 351 0 1.954 1.304 0.20 -0.92
## m15 351 0 2.123 1.367 -0.09 -0.99
## m16 351 0 2.177 1.782 0.24 -1.36
## m17 351 0 2.407 1.604 0.09 -1.14
## m18 351 0 2.915 1.326 -0.56 -0.33
## m19 351 0 1.157 1.425 1.29 0.87
## m20 351 0 0.895 1.351 1.50 1.22
Aggregating a statistical computation of a continuous variable over
groups with pivot()
, such as computing the mean for each
combination of Dept and Gender, by default includes
the tabulation for each group (cell). A tabulation can be requested with
no analysis of a variable, instead only a counting of the available
levels of the specified categorical variables.
The table
value of the compute
parameter
specifies to compute the frequency table for a categorical aggregated
variable
across all combinations of the by
variables. Specify one categorical variable
with the
remaining categorical variables specified with the by
parameter. Missing values for each combination of the levels of the
grouping variables are displayed.
Begin with a one-way frequency table computed over the entire data set.
## m06 n Prop
## 1 0 18 0.05
## 2 1 47 0.13
## 3 2 63 0.18
## 4 3 44 0.13
## 5 4 121 0.34
## 6 5 58 0.17
In this example, compute a two-way cross-tabulation table with the
levels of variable
m06 as columns and the levels
of the by
categorical variable m07 as rows.
## m07 0 1 2 3 4 5 n _na
## 1 0 4 3 2 3 3 2 17 0
## 2 1 7 24 7 6 18 2 64 0
## 3 2 4 14 30 13 24 2 87 0
## 4 3 2 1 10 16 12 2 43 0
## 5 4 0 3 13 5 56 16 93 0
## 6 5 1 2 1 1 8 34 47 0
To output the data in long form, one tabulation per row, set the
table_long
parameter to TRUE.
If interested in the inferential analysis of the cross-tabulation
table, access the lessR function
Prop_test()
to obtain both the descriptive and inferential
results, though limited to a one- or two-way table.
The default data table is d, but included explicitly in the
following example to illustrate the data
parameter.
## variable: m06
## by: m07
##
## <<< Pearson's Chi-squared test
##
## --- Description
##
## m06
## m07 0 1 2 3 4 5 Sum
## 0 4 3 2 3 3 2 17
## 1 7 24 7 6 18 2 64
## 2 4 14 30 13 24 2 87
## 3 2 1 10 16 12 2 43
## 4 0 3 13 5 56 16 93
## 5 1 2 1 1 8 34 47
## Sum 18 47 63 44 121 58 351
##
## Cramer's V: 0.380
##
## Row Col Observed Expected Residual Stnd Res
## 1 1 4 0.872 3.128 3.526
## 1 2 3 2.276 0.724 0.528
## 1 3 2 3.051 -1.051 -0.681
## 1 4 3 2.131 0.869 0.652
## 1 5 3 5.860 -2.860 -1.496
## 1 6 2 2.809 -0.809 -0.542
## 2 1 7 3.282 3.718 2.330
## 2 2 24 8.570 15.430 6.263
## 2 3 7 11.487 -4.487 -1.616
## 2 4 6 8.023 -2.023 -0.844
## 2 5 18 22.063 -4.063 -1.182
## 2 6 2 10.575 -8.575 -3.192
## 3 1 4 4.462 -0.462 -0.259
## 3 2 14 11.650 2.350 0.853
## 3 3 30 15.615 14.385 4.634
## 3 4 13 10.906 2.094 0.782
## 3 5 24 29.991 -5.991 -1.558
## 3 6 2 14.376 -12.376 -4.119
## 4 1 2 2.205 -0.205 -0.151
## 4 2 1 5.758 -4.758 -2.274
## 4 3 10 7.718 2.282 0.968
## 4 4 16 5.390 10.610 5.216
## 4 5 12 14.823 -2.823 -0.967
## 4 6 2 7.105 -5.105 -2.238
## 5 1 0 4.769 -4.769 -2.615
## 5 2 3 12.453 -9.453 -3.357
## 5 3 13 16.692 -3.692 -1.164
## 5 4 5 11.658 -6.658 -2.432
## 5 5 56 32.060 23.940 6.092
## 5 6 16 15.368 0.632 0.206
## 6 1 1 2.410 -1.410 -1.002
## 6 2 2 6.293 -4.293 -1.976
## 6 3 1 8.436 -7.436 -3.037
## 6 4 1 5.892 -4.892 -2.315
## 6 5 8 16.202 -8.202 -2.705
## 6 6 34 7.766 26.234 11.071
##
## --- Inference
##
## Chi-square statistic: 253.103
## Degrees of freedom: 25
## Hypothesis test of equal population proportions: p-value = 0.000
Can also aggregate other statistics simultaneously in addition to the
frequency table, though, of course, only meaningful if the aggregated
variable is numerical. Here, create a 3-way cross-tabulation table with
responses to variable
m06 in the column and
responses to by
variables m07 and m10 in
the rows, plus the mean and standard deviation of each combination of
m07 and m10 across levels of m06.
## m07 m10 m06_n m06_na m06_mean m06_sd 2 3 4 5 n _na
## 1 0 0 1 0 0.000 NA 0 0 0 0 1 0
## 2 1 0 1 0 1.000 NA 0 0 0 0 1 0
## 3 2 0 NA 0 NA NA 0 0 0 0 0 0
## 4 3 0 1 0 2.000 NA 1 0 0 0 1 0
## 5 4 0 1 0 2.000 NA 1 0 0 0 1 0
## 6 5 0 NA 0 NA NA 0 0 0 0 0 0
## 7 0 1 NA 0 NA NA 0 0 0 0 0 0
## 8 1 1 4 0 0.750 0.500 0 0 0 0 4 0
## 9 2 1 2 0 1.500 0.707 1 0 0 0 2 0
## 10 3 1 1 0 2.000 NA 1 0 0 0 1 0
## 11 4 1 NA 0 NA NA 0 0 0 0 0 0
## 12 5 1 NA 0 NA NA 0 0 0 0 0 0
## 13 0 2 2 0 2.000 1.414 0 1 0 0 2 0
## 14 1 2 9 0 1.222 1.202 1 0 1 0 9 0
## 15 2 2 15 0 2.000 0.845 8 2 1 0 15 0
## 16 3 2 3 0 3.000 0.000 0 3 0 0 3 0
## 17 4 2 3 0 3.333 1.155 1 0 2 0 3 0
## 18 5 2 NA 0 NA NA 0 0 0 0 0 0
## 19 0 3 5 0 1.200 1.304 1 1 0 0 5 0
## 20 1 3 16 0 2.250 1.390 4 3 4 0 16 0
## 21 2 3 13 0 2.769 0.832 6 4 3 0 13 0
## 22 3 3 7 0 2.000 1.528 2 2 1 0 7 0
## 23 4 3 9 0 3.333 1.323 2 0 5 1 9 0
## 24 5 3 NA 0 NA NA 0 0 0 0 0 0
## 25 0 4 7 0 2.714 1.799 1 1 2 1 7 0
## 26 1 4 19 0 2.895 1.449 1 2 9 1 19 0
## 27 2 4 29 0 2.690 1.417 8 4 9 2 29 0
## 28 3 4 18 0 3.278 1.074 3 6 6 2 18 0
## 29 4 4 32 0 3.781 1.008 6 2 17 7 32 0
## 30 5 4 5 0 4.400 0.548 0 0 3 2 5 0
## 31 0 5 2 0 4.500 0.707 0 0 1 1 2 0
## 32 1 5 15 0 2.133 1.685 1 1 4 1 15 0
## 33 2 5 28 0 2.571 1.372 7 3 11 0 28 0
## 34 3 5 13 0 3.154 0.801 3 5 5 0 13 0
## 35 4 5 48 0 3.854 0.922 3 3 32 8 48 0
## 36 5 5 42 0 4.452 1.234 1 1 5 32 42 0
Can also express the frequencies as proportions. To convert the
frequencies into proportions, invoke the table_prop
parameter. The value of "all"
computes cell frequencies
across the entire table. The values of "row"
and
"col"
compute the proportions with either row sums or
column sums.
In this example of a two-way cross-tabulation table, convert the table counts to row proportions, that is, the proportion of each level of m06 for each combination of levels for m07. The sum of the proportions in each row is 1.0.
## m07 0 1 2 3 4 5 n _na
## 1 0 4 3 0.07 0.11 0.11 0.07 0.630 0
## 2 1 7 24 0.07 0.06 0.19 0.02 0.660 0
## 3 2 4 14 0.19 0.08 0.15 0.01 0.558 0
## 4 3 2 1 0.12 0.19 0.14 0.02 0.518 0
## 5 4 0 3 0.07 0.03 0.31 0.09 0.508 0
## 6 5 1 2 0.01 0.01 0.09 0.37 0.516 0
Return to the Employee data set for the remaining examples.
One way to understand the characteristics of a distribution of data values of a continuous variable is to sort the values and then split into equal-sized groups. The simplest example is the median, which splits a sorted distribution into two groups of equal size, the bottom lowest values and the top highest values. Quartiles divide the sorted distribution into four groups. The first quartile is the smallest 25% of the data values, etc.
Quantiles: Numeric values, not necessarily data values, that divide a distribution of sorted data values into n groups of equal size.
By default, calling the quantile function computes quartiles. Here calculate the quartiles for Years aggregated across levels of Dept and Gender.
## Dept Gender Years_n Years_na p0 p25 p50 p75 p100
## 1 ACCT M 2 0 5 6.00 7.0 8.00 9
## 2 ADMN M 2 0 7 11.25 15.5 19.75 24
## 3 FINC M 3 0 10 10.00 10.0 12.00 14
## 4 MKTG M 1 0 18 18.00 18.0 18.00 18
## 5 SALE M 9 1 5 9.00 13.0 14.00 21
## 6 <NA> M 0 0 NA NA NA NA NA
## 7 ACCT W 3 0 2 2.50 3.0 6.00 9
## 8 ADMN W 4 0 2 3.50 5.0 9.00 18
## 9 FINC W 1 0 7 7.00 7.0 7.00 7
## 10 MKTG W 5 0 1 4.00 8.0 10.00 18
## 11 SALE W 5 0 2 3.00 8.0 10.00 10
## 12 <NA> W 1 0 5 5.00 5.0 5.00 5
To compute quantiles other than quartiles, invoke the
q_num
parameter, the number of quantile intervals. The
default value is 4 for quartiles. In the following example, compute the
quintiles for Years and Salary, plus the mean and
standard deviation. No specification of by
, so these
descriptive statistics are computed over the entire data set for both
specified variables.
## n na mean sd p0 p20 p40 p60 p80 p100
## Years 36 1 9.39 5.72 1.00 4.00 8.00 10.00 14.00 24.0
## Salary 37 0 73795.56 21799.53 46124.97 55737.86 63701.93 72430.04 92415.42 134419.2
One data analysis strategy examines the values of a variable, such as Sales for a business or Mortality for an epidemiology study, at progressively finer levels of detail. Examine by Country or State or City or whatever level of granularity is appropriate.
Data drill down: Examine the values of a variable when holding the values of one or more categorical variables constant.
pivot()
indicates drilling down deeper into the data
with a display of all categorical variables with unique values, which
precedes the primary output. Initiate the drill-down by a previous
subset of the data frame, or indicate the subset with
pivot()
directly. As with other lessR
analysis functions, the rows
parameter specifies a logical
condition for which to subset rows of the data frame for analysis.
In this example, compute the mean of Salary for each level of Dept for just those rows of data with the value of Gender equal to “W”.
## Gender: W
## Dept Salary_n Salary_na Salary_mean
## 1 ACCT 3 0 63237.16
## 2 ADMN 4 0 81434.00
## 3 FINC 1 0 57139.90
## 4 MKTG 5 0 64496.02
## 5 SALE 5 0 64188.25
## 6 <NA> 1 0 53772.58
The parentheses for the rows
parameter are not
necessary, but do enhance readability.
Can also drill down by subsetting the data frame with a logical
condition directly in the data parameter in the call to
pivot()
. Here drill down with base R
Extract[ ]
in conjunction with the lessR
function .()
to simplify the syntax (explained in the
vignette Subset a Data Frame). The Extract[ ]
function
specifies the rows of the data frame to extract before the comma, and
the columns to extract after the comma. Here select only those rows of
data with Gender declared as Female. There is no information
after the comma, so no columns are specified, which means to retain all
columns, the variables in the data frame.
## Gender: W
## Dept Salary_n Salary_na Salary_mean
## 1 ACCT 3 0 63237.16
## 2 ADMN 4 0 81434.00
## 3 FINC 1 0 57139.90
## 4 MKTG 5 0 64496.02
## 5 SALE 5 0 64188.25
## 6 <NA> 1 0 53772.58
Specify the sort as part of the call to pivot()
with the
parameter sort
. This internal sort works for a single
value
variable, by default the last column in the output
data frame. Set to "-"
for a descending sort. Set to
"+"
for an ascending sort.
## Dept Gender Salary_n Salary_na Salary_mean
## 4 MKTG M 1 0 99062.66
## 5 SALE M 10 0 86150.97
## 8 ADMN W 4 0 81434.00
## 2 ADMN M 2 0 80963.35
## 3 FINC M 3 0 72967.60
## 10 MKTG W 5 0 64496.02
## 11 SALE W 5 0 64188.25
## 7 ACCT W 3 0 63237.16
## 1 ACCT M 2 0 59626.19
## 9 FINC W 1 0 57139.90
## 12 <NA> W 1 0 53772.58
## 6 <NA> M 0 0 NA
Specify the sort_var
parameter to specify the name or
the column number of the variable to sort.
pivot(d, c(mean, median), Salary, c(Gender,Dept), out_names=c("MeanSalary", "MedianSalary"),
sort="-", sort_var="MeanSalary")
## Gender Dept Salary_n Salary_na MeanSalary MedianSalary
## 7 M MKTG 1 0 99062.66 99062.66
## 9 M SALE 10 0 86150.97 82442.74
## 4 W ADMN 4 0 81434.00 71058.60
## 3 M ADMN 2 0 80963.35 80963.35
## 5 M FINC 3 0 72967.60 66312.89
## 8 W MKTG 5 0 64496.02 61356.69
## 10 W SALE 5 0 64188.25 56508.32
## 2 W ACCT 3 0 63237.16 71084.02
## 1 M ACCT 2 0 59626.19 59626.19
## 6 W FINC 1 0 57139.90 57139.90
## 12 W <NA> 1 0 53772.58 53772.58
## 11 M <NA> 0 0 NA NA
Because the output of pivot()
with no
by_cols
variables is a standard R data frame, the external
call to the lessR function sort_by()
is
available for custom sorting by one or more variables. Sort in the
specified direction with the direction
parameter.
## Dept Gender Salary_n Salary_na Salary_mean
## 4 MKTG M 1 0 99062.66
## 5 SALE M 10 0 86150.97
## 8 ADMN W 4 0 81434.00
## 2 ADMN M 2 0 80963.35
## 3 FINC M 3 0 72967.60
## 10 MKTG W 5 0 64496.02
## 11 SALE W 5 0 64188.25
## 7 ACCT W 3 0 63237.16
## 1 ACCT M 2 0 59626.19
## 9 FINC W 1 0 57139.90
## 12 <NA> W 1 0 53772.58
## 6 <NA> M 0 0 NA
Specify multiple variables to sort with a vector of variable names,
and a corresponding vector of "+"
and "-"
signs of the same length for the directions
parameter.
The following illustrates as of R 4.1.0 the base R pipe operator
|>
with pivot()
. The pipe operator by
default inserts the object on the left-hand side of an expression into
the first parameter value for the function on the right-hand side.
In this example, input the d data frame into the first
parameter of pivot()
, the data
parameter. Then
direct the output to the data frame a with the standard R
assignment statement, though written pointing to the right hand side of
the expression.
To avoid problems installing this version of lessR
from source with a previous version of R, the code is commented out with
a #
sign in the first column.
Specify up to two by_cols
categorical variables to
create a two-dimensional table with the specified columns. Specifying
one or two categorical variables as by_cols
variables moves
them from their default position in the rows to the columns, which
changes the output structure from a long-form data frame to a table with
categorical variables in the rows and columns.
In this example, specify Gender as the by_cols
variable.
Table: mean of Salary
Gender M W
Dept
------- --------- ---------
ACCT 59626.19 63237.16
ADMN 80963.35 81434.00
FINC 72967.60 57139.90
MKTG 99062.66 64496.02
SALE 86150.97 64188.25
NA NA 53772.58
Here two by_cols
variables, specified as a vector. There
is much missing data for this three-way classification as there is not
much data in each group, with many groups having no data.
Table: mean of Salary
Gender M W
Plan 1 2 3 1 2 3
Dept
------- --------- ---------- --------- --------- --------- ---------
ACCT 69547.60 NA 49704.79 NA 63237.16 NA
ADMN 53788.26 108138.43 NA NA 67724.21 122563.4
FINC 61937.62 NA 95027.55 NA 57139.90 NA
MKTG NA NA 99062.66 56772.95 66426.79 NA
SALE 89393.40 82442.74 80131.91 60941.54 66352.73 NA
NA NA NA NA 53772.58 NA NA
There are three different types of missing data in an aggregation:
some missing values for one or more aggregated
variable
s for which the statistic is computed, all
missing values for the variable
so that the group (cell)
defined by one or more by
variables has no data values, and
the data value for the by
variable itself is missing.
Accordingly, there are three different missing data parameters with
pivot()
.
na_remove
: Remove any missing data from a value of the
variable
, then perform the aggregation on the remaining
values, reporting how many values are missing. Otherwise report the
computed statistic as NA (missing). [default is TRUE
]na_by_show
: If all values of variable
are
missing for a group so that the entire level of the by
variables is missing, show those missing cells with a reported value of
computed variable n as 0. Otherwise delete the row from the
output. [default is TRUE
]na_group_show
: The by
data value is
missing so no data is available for any of the aggregated statistics,
all displayed as missing values. Otherwise the row with the missing
by
data value is deleted from the output. [default is
TRUE
]by
VariableBy default, the data frame output of pivot()
lists the
number of occurrences of missing data for each group of the variable
over which the statistic is computed. The na_remove
parameter specifies if NA (missing) values of the aggregated variable
should be removed before the computation proceeds. The default value
na_remove
is TRUE
, so the missing values are
removed, their number of occurrences reported, and then the computed
statistic displayed.
In this example, the variable Years has one missing value, which occurs in the Sales department. That value is dropped from the computation of the mean of Years, which is based on the 14 non-missing values.
## Dept Years_n Years_na Years_mean
## 1 ACCT 5 0 5.600
## 2 ADMN 6 0 10.167
## 3 FINC 4 0 10.250
## 4 MKTG 6 0 9.833
## 5 SALE 14 1 10.286
## 6 <NA> 1 0 5.000
Set na_remove
to FALSE
to not
remove any missing data in a cell with values to be aggregated.
According to the way in which R functions process the data with missing
values entered into the function, one missing value implies that the
computed statistic has a missing value.
## Dept Years_n Years_na Years_mean
## 1 ACCT 5 0 5.600
## 2 ADMN 6 0 10.167
## 3 FINC 4 0 10.250
## 4 MKTG 6 0 9.833
## 5 SALE 14 1 NA
## 6 <NA> 1 0 5.000
by
VariableAnother possibility accounts for missing values of the categorical
by
variables for which all values of the variable for which
to compute the statistic for that cell are missing. In that situation,
there is no data value for which to compute the statistic.
For this example, first create a cell with no values in the data
aggregation. Use base R Extract[]
with
lessR .()
to drop the one male in the
Sales department, leaving no data values for that group. Save the result
into the dd data frame.
Also, to focus only on this type of missing data, remove the row of data that has a missing value for Dept, which is the third row.
## Years Gender Dept Salary JobSat Plan Pre Post
## Ritchie, Darnell 7 M ADMN 53788.26 med 1 82 92
## Downs, Deborah 7 W FINC 57139.90 high 2 90 86
## Jones, Alissa 5 W <NA> 53772.58 <NA> 1 65 62
## Afshari, Anbar 6 W ADMN 69441.93 high 2 100 100
## Knox, Michael 18 M MKTG 99062.66 med 3 81 84
## Kimball, Claire 8 W MKTG 61356.69 high 2 93 92
Here, explicitly set na_by_show
to TRUE
,
the default value. The group for male sales employees is shown with the
values of the computed variables n_ and na_ set to 0
and the values of the computed statistics necessarily missing for this
row of the output data frame.
## Dept Gender Salary_n Salary_na Salary_mean Salary_mdn
## 1 ACCT M 2 0 59626.19 59626.19
## 2 ADMN M 2 0 80963.35 80963.35
## 3 FINC M 3 0 72967.60 66312.89
## 4 MKTG M 1 0 99062.66 99062.66
## 5 SALE M 0 0 NA NA
## 6 ACCT W 3 0 63237.16 71084.02
## 7 ADMN W 4 0 81434.00 71058.60
## 8 FINC W 1 0 57139.90 57139.90
## 9 MKTG W 5 0 64496.02 61356.69
## 10 SALE W 5 0 64188.25 56508.32
Drop the groups from the output with missing data for a
by
variable. To do so, set na_by_show
to
FALSE
. Now the group for the non-existent male sales
employee does not display.
## Dept Gender Salary_n Salary_na Salary_mean Salary_mdn
## 1 ACCT M 2 0 59626.19 59626.19
## 2 ADMN M 2 0 80963.35 80963.35
## 3 FINC M 3 0 72967.60 66312.89
## 4 MKTG M 1 0 99062.66 99062.66
## 6 ACCT W 3 0 63237.16 71084.02
## 7 ADMN W 4 0 81434.00 71058.60
## 8 FINC W 1 0 57139.90 57139.90
## 9 MKTG W 5 0 64496.02 61356.69
## 10 SALE W 5 0 64188.25 56508.32
by
Variable ItselfThe parameter na_group_show
specifies if a value of the
grouping variable itself should be displayed as level
<NA>
. By default, the <NA>
levels
are displayed but the na_group_show
is explicitly included
in the following.
## Dept Gender Salary_n Salary_na Salary_mean Salary_mdn
## 1 ACCT M 2 0 59626.19 59626.19
## 2 ADMN M 2 0 80963.35 80963.35
## 3 FINC M 3 0 72967.60 66312.89
## 4 MKTG M 1 0 99062.66 99062.66
## 5 SALE M 10 0 86150.97 82442.74
## 6 <NA> M 0 0 NA NA
## 7 ACCT W 3 0 63237.16 71084.02
## 8 ADMN W 4 0 81434.00 71058.60
## 9 FINC W 1 0 57139.90 57139.90
## 10 MKTG W 5 0 64496.02 61356.69
## 11 SALE W 5 0 64188.25 56508.32
## 12 <NA> W 1 0 53772.58 53772.58
Set na_group_show
to FALSE
to not display
as levels when a value of the by
variable is missing.
## Dept Gender Salary_n Salary_na Salary_mean Salary_mdn
## 1 ACCT M 2 0 59626.19 59626.19
## 2 ADMN M 2 0 80963.35 80963.35
## 3 FINC M 3 0 72967.60 66312.89
## 4 MKTG M 1 0 99062.66 99062.66
## 5 SALE M 10 0 86150.97 82442.74
## 6 ACCT W 3 0 63237.16 71084.02
## 7 ADMN W 4 0 81434.00 71058.60
## 8 FINC W 1 0 57139.90 57139.90
## 9 MKTG W 5 0 64496.02 61356.69
## 10 SALE W 5 0 64188.25 56508.32
Any function that processes a single column of data and returns a
single value, usually a statistic, can be accessed with
pivot()
. In this example, define a function named
mnmd() that computes the difference between the
mean
and median
of a distribution of data
values, here represented by x
. For technical reasons, need
to include the NA
remove parameter, na.rm
, in
the function definition if there is any missing data. If missing data
values are to be dropped in internal function calls, then set the base R
parameter na.rm
to TRUE
for each component
function as well.
Invoke pivot()
to compute the mean, the median, and
their difference for each group in the aggregation of the variable
Years.
## Dept Years_n Years_na Years_mean Years_mdn Years_mnmd
## 1 ACCT 5 0 5.600 5.0 0.600
## 2 ADMN 6 0 10.167 6.5 3.667
## 3 FINC 4 0 10.250 10.0 0.250
## 4 MKTG 6 0 9.833 9.0 0.833
## 5 SALE 14 1 10.286 10.0 0.286
## 6 <NA> 1 0 5.000 5.0 0.000
pivot()
vs Base
R aggregate()
The lessR pivot()
function relies upon
the base R function aggregate()
for
aggregation. By default, except for the table
computation,
pivot()
generates a long-form data frame pivot table (Excel
terminology), which can then be directly input into analysis and
visualization functions as a standard data frame. The levels across all
the by
grouping variables are listed in the rows. If there
are specified column grouping variables according to
by_cols
, pivot()
relies upon base R
reshape()
to form a 2-d table for direct viewing instead of
a data table to input into further analysis functions.
pivot()
provides additional features than
aggregate()
provides.
value
over which to aggregate, the sample size
and number of missing values for each group is provided.compute
the aggregated value for each group.compute
statistics over
the entire data set instead of groups of data.value
aggregated.compute
the specified
statistic(s) for each variable
across the entire data
set.by
variables of type Date
retain the same
variable type in the summary table instead of each converted to a
factor
(set factors=FALSE
to not convert
integer
and character
variables to
factors).data
parameter first,
which facilitates the use of the pipe operator, such as from base R as
of Version 4.1.0 or the magrittr package.data
are listed with their corresponding data value, which
identifies when drilling down into the data to study relevant rows.Although the pivot()
function considerably extends the
functionality of base R aggregate()
, pivot()
does rely upon this base R function for most of its computations.