The following material is excerpted from Chapter 1 of my book:
R Visualizations – Derive Meaning from Data, Chapman and Hall/CRC Press, May, 2020.
library(lessR)
##
## lessR 4.2.6 feedback: gerbing@pdx.edu
## --------------------------------------------------------------
## > d <- Read("") Read text, Excel, SPSS, SAS, or R data file
## d is default data frame, data= in analysis routines optional
##
## Learn about reading, writing, and manipulating data, graphics,
## testing means and proportions, regression, factor analysis,
## customization, and descriptive statistics from pivot tables.
## Enter: browseVignettes("lessR")
##
## View changes in this and recent versions of lessR.
## Enter: news(package="lessR")
##
## **Newly Revised**: Interactive data analysis.
## Enter: interact()
The easiest way to read text data files (csv or tab-delimited) and
Excel files is with the lessR function
Read(). The function identifies the file type, or lets the
user specify the file format, and then calls the appropriate
lower-level function to read the data. For example, when reading an
Excel data file, Read() relies on the Excel read
function from the openxlsx package (and credits
the package author when that function is invoked).
Read() keeps track of these details for you, freeing your
time for something more productive.
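To make the idea of choosing a reader from the file type concrete, here is a minimal base R sketch. The helper guess_format() and its extension-to-format mapping are hypothetical, written only for illustration. They are not part of lessR and do not describe how Read() is implemented internally.

```r
# Hypothetical helper (an assumption for illustration, not a lessR function):
# pick a format label from the file extension, the way a wrapper function
# might decide which lower-level reader to call.
guess_format <- function(path) {
  ext <- tolower(tools::file_ext(path))   # extension without the dot
  switch(ext,
    csv  = "csv",
    tsv  = ,                              # falls through to the next entry
    txt  = "tab-delimited",
    xlsx = "Excel",
    "unknown")                            # default when nothing matches
}

print(guess_format("scores.CSV"))
print(guess_format("survey.xlsx"))
```

A wrapper built this way would then call the matching reader, such as read.csv() for the "csv" case.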
Read() also displays output that describes the variables read
into the R data frame: their types, counts of values, and sample values.
This output can be invaluable for verifying that the data values were
read correctly. Moreover, the variable names are clearly listed,
ready for use in analysis functions.
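The same kind of check can be approximated in base R. The following sketch builds a small toy data frame (d_toy is a made-up stand-in, with one missing value inserted on purpose) and tabulates the type, the number of non-missing values, the number of missing values, and the number of unique values for each variable. It illustrates the idea only and is not the code Read() runs.

```r
# Toy data frame standing in for freshly read data (illustrative values,
# with one NA added deliberately to show the missing-value count)
d_toy <- data.frame(ID     = c(100L, 127L, 134L),
                    Gender = c(0L, 0L, 1L),
                    m01    = c(0L, NA, 2L))

# One row per variable: type, non-missing count, missing count, unique count
overview <- data.frame(
  Type    = sapply(d_toy, class),
  Values  = sapply(d_toy, function(x) sum(!is.na(x))),
  Missing = sapply(d_toy, function(x) sum(is.na(x))),
  Unique  = sapply(d_toy, function(x) length(unique(x[!is.na(x)])))
)
print(overview)
```

A variable with an unexpected type or an unexpected number of unique values is usually the first sign that a file was read incorrectly.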
There is another type of text data file, however, not covered by the previous examples.
Fixed width column data file: The data values in each column are not delimited from adjacent values, but instead occupy a fixed number of columns.
Consider the data file at http://lessrstats.com/data/Mach4.fwd. Here are the first five and last four of the 351 rows of data.
0100004150541540000401324
0127001440330440111244310
0134121054405341400202401
0222105240444520001115440
0282022332323141312223321
...
9543120241223351002325411
9677023332443331232300401
9721014443442341112522400
9840133343223332322352313
Each row of data consists of 25 digits. The first four columns define an ID field. The fifth column is Gender. The remaining 20 columns are the responses to the 20-item Mach IV scale, a measure of Machiavellian tendencies. Respondents answered each item on a 6-point Likert scale, that is, with six possible response categories for each item. The responses of each of the 351 persons to each of the 20 items – \(m01\), \(m02\), …, \(m20\) – were encoded according to the following one-digit integers.
0 - Strongly Disagree
1 - Disagree
2 - Slightly Disagree
3 - Slightly Agree
4 - Agree
5 - Strongly Agree
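As a quick illustration of this coding scheme, the base R sketch below attaches the six labels to the integer codes with factor(). The vector codes holds the responses to the first five items in the first data line listed above. The object names are my own choices for the example.

```r
# Response labels in the order of their integer codes, 0 through 5
labels <- c("Strongly Disagree", "Disagree", "Slightly Disagree",
            "Slightly Agree", "Agree", "Strongly Agree")

# Responses to items m01 to m05 from the first data line shown above
codes <- c(0, 4, 1, 5, 0)

# levels= gives the stored codes, labels= gives the text shown for each code
responses <- factor(codes, levels = 0:5, labels = labels)
print(as.character(responses))
```

Listing all six levels explicitly keeps the full response scale even when some categories do not occur in the data.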
d <- Read("http://lessRstats.com/data/Mach4.fwd",
          col.names=c("ID", "Gender", to("m",20)),
          widths=c(4,1,rep(1,20)), quiet=TRUE)
The lessR function to() is a simple way to avoid
writing out a long list of sequentially numbered variable names.
to("m",20)
## [1] "m01" "m02" "m03" "m04" "m05" "m06" "m07" "m08" "m09" "m10" "m11" "m12"
## [13] "m13" "m14" "m15" "m16" "m17" "m18" "m19" "m20"
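For readers without lessR loaded, the same 20 names can be generated in base R with sprintf(). This is an equivalent for this particular case, not a description of how to() is written. The names item_names and col_names are chosen only for this example.

```r
# "%02d" pads each number to two digits with a leading zero: m01, m02, ..., m20
item_names <- sprintf("m%02d", 1:20)
print(item_names)

# The full set of 22 column names for the Mach IV data file
col_names <- c("ID", "Gender", item_names)
print(length(col_names))
```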
To read a text file that does not have the variable names in the
first row, apply the base R col.names parameter to define
the variable names as part of the call to Read(). The base
R widths parameter defines the respective widths of the
columns, one width for each variable.
In this example, the col.names parameter indicates that
the respective variable names are ID, Gender, and, as
specified by the lessR function to(),
m01, m02, to m20. The base R
widths parameter specifies that the first variable occupies
four columns, the next variable occupies a single column, and the base R
repetition function rep() specifies that the next 20
variables each occupy only a single column, short-hand for writing 20
1’s, separated by commas.
rep(1,20)
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
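To see what the widths imply, the following base R sketch converts them to start and end character positions and cuts the first data line listed earlier into its 22 fields. It also checks that the widths sum to the 25 characters of the line. This is only an illustration of the arithmetic, with object names chosen for the example.

```r
widths <- c(4, 1, rep(1, 20))
ends   <- cumsum(widths)        # last character position of each field
starts <- ends - widths + 1     # first character position of each field

row1 <- "0100004150541540000401324"     # first data line listed earlier
stopifnot(sum(widths) == nchar(row1))   # the widths must cover all 25 characters

# substring() is vectorized over the start and end positions
fields <- substring(row1, starts, ends)
names(fields) <- c("ID", "Gender", sprintf("m%02d", 1:20))
print(fields[1:4])
```

A mismatch between the sum of the widths and the line length is a quick signal that a width was specified incorrectly.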
Here, repeat the Read() function call, this time without
quiet=TRUE, to show the variables that were read.
d <- Read("http://lessRstats.com/data/Mach4.fwd",
          col.names=c("ID", "Gender", to("m",20)),
          widths=c(4,1,rep(1,20)))
##
## >>> Suggestions
## To read a csv or Excel file of variable labels, var_labels=TRUE
## Each row of the file: Variable Name, Variable Label
## Read into a data frame named l (the letter el)
##
## Details about your data, Enter: details() for d, or details(name)
##
## Data Types
## ------------------------------------------------------------
## integer: Numeric data values, integers only
## ------------------------------------------------------------
##
##     Variable                  Missing  Unique
##         Name      Type Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1        ID   integer    351       0     351   100 127 134 ... 9677 9721 9840
##  2    Gender   integer    351       0       2   0 0 1 ... 0 0 1
##  3       m01   integer    351       0       6   0 0 2 ... 2 1 3
##  4       m02   integer    351       0       6   4 1 1 ... 3 4 3
##  5       m03   integer    351       0       6   1 4 0 ... 3 4 3
##  6       m04   integer    351       0       6   5 4 5 ... 3 4 4
##  7       m05   integer    351       0       6   0 0 4 ... 2 3 3
##  8       m06   integer    351       0       6   5 3 4 ... 4 4 2
##  9       m07   integer    351       0       6   4 3 0 ... 4 4 2
## 10       m08   integer    351       0       6   1 0 5 ... 3 2 3
## 11       m09   integer    351       0       6   5 4 3 ... 3 3 3
## 12       m10   integer    351       0       6   4 4 4 ... 3 4 3
## 13       m11   integer    351       0       6   0 0 1 ... 1 1 2
## 14       m12   integer    351       0       6   0 1 4 ... 2 1 3
## 15       m13   integer    351       0       6   0 1 0 ... 3 1 2
## 16       m14   integer    351       0       6   0 1 0 ... 2 2 2
## 17       m15   integer    351       0       6   4 2 2 ... 3 5 3
## 18       m16   integer    351       0       6   0 4 0 ... 0 2 5
## 19       m17   integer    351       0       6   1 4 2 ... 0 2 2
## 20       m18   integer    351       0       6   3 3 4 ... 4 4 3
## 21       m19   integer    351       0       6   2 1 0 ... 0 0 1
## 22       m20   integer    351       0       6   4 0 1 ... 1 0 3
## ------------------------------------------------------------------------------------------
Follow the same form to read any other fixed-width column
data file. Specify the column names with col.names, which
also applies to a csv or Excel file that lacks
variable names in its first row. Then indicate the
column widths with the widths parameter, one
width for each variable, listed in the same order as the names.
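Because col.names and widths are base R parameters, the same pattern can be tried with the base R function read.fwf() from the utils package. The sketch below reads the first three data lines listed earlier from an in-memory text connection instead of the web address, so it runs without a network connection. The object names lines3, con, and d_base are chosen only for this example, and the sketch does not claim to reproduce what Read() does internally.

```r
# First three data lines from the listing earlier in this section
lines3 <- c("0100004150541540000401324",
            "0127001440330440111244310",
            "0134121054405341400202401")

con <- textConnection(lines3)            # treat the strings as a file
d_base <- read.fwf(con,
                   widths = c(4, 1, rep(1, 20)),
                   col.names = c("ID", "Gender", sprintf("m%02d", 1:20)))
close(con)                               # the connection was opened here, so close it here

print(dim(d_base))                       # rows and columns read
print(d_base[, 1:6])                     # ID, Gender, and the first four items
```

The ID values are read as integers, so the leading zero of each ID is dropped, which matches the ID values displayed in the Read() output.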