BTA 522: Read Fixed-Width Column Data Files

The following material is excerpted from Chapter 1 of my book:

R Visualizations – Derive Meaning from Data, Chapman and Hall/CRC Press, May, 2020.

library(lessR)

## 
## lessR 4.2.6                         feedback: gerbing@pdx.edu 
## --------------------------------------------------------------
## > d <- Read("")   Read text, Excel, SPSS, SAS, or R data file
##   d is default data frame, data= in analysis routines optional
## 
## Learn about reading, writing, and manipulating data, graphics,
## testing means and proportions, regression, factor analysis,
## customization, and descriptive statistics from pivot tables.
##   Enter:  browseVignettes("lessR")
## 
## View changes in this and recent versions of lessR.
##   Enter: news(package="lessR")
## 
## **Newly Revised**: Interactive data analysis.
##   Enter: interact()

The easiest way to read text data files (csv or tab-delimited) and Excel files is with the lessR function Read(). The function identifies the file type, or lets the user specify the file format, and then calls the appropriate lower-level function to read the data. For example, for an Excel data file, Read() relies on the Excel read function from the openxlsx package (and credits the package author when that function is invoked). Read() keeps track of these details for you, freeing your time for something more productive.

Read() also reports the variables read into the R data frame, along with their characteristics and sample values. This output can be invaluable for verifying that the data values were read correctly. Moreover, the variable names are displayed exactly as stored, ready for use in analysis functions.

There is another type of text data file, however, not covered by the previous examples.

Fixed-width column data file: The data values in each column are not delimited from adjacent values, but instead occupy a fixed number of columns.

Consider the data file at http://lessrstats.com/data/Mach4.fwd. Here are the beginning and ending lines of its 351 rows of data.

0100004150541540000401324  
0127001440330440111244310  
0134121054405341400202401  
0222105240444520001115440  
0282022332323141312223321  
...  
9543120241223351002325411  
9677023332443331232300401  
9721014443442341112522400  
9840133343223332322352313

Each row of data consists of 25 digits. The first four columns define an ID field. The fifth column is Gender. The remaining 20 columns are the responses to the 20-item Mach IV scale, a measure of Machiavellian tendencies. Respondents answered each item on a 6-pt Likert scale, that is, with six possible response categories per item. The responses of each of the 351 persons to each of the 20 items – \(m01\), \(m02\), …, \(m20\) – were encoded according to the following one-digit integers.

0 - Strongly Disagree
1 - Disagree
2 - Slightly Disagree
3 - Slightly Agree
4 - Agree
5 - Strongly Agree
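To make the coding scheme concrete, the integer codes can be attached to their labels with a base R factor. This is only a sketch: the object names `mach_labels`, `responses`, and `labeled` are made up for illustration, and the four values are the responses to m01 through m04 in the first data row listed above.

```r
# Response labels for the codes 0 through 5, in the order listed above
mach_labels <- c("Strongly Disagree", "Disagree", "Slightly Disagree",
                 "Slightly Agree", "Agree", "Strongly Agree")

# Responses to m01, m02, m03, m04 from the first data row shown earlier
responses <- c(0, 4, 1, 5)

# levels= lists the stored codes, labels= the text attached to each code
labeled <- factor(responses, levels = 0:5, labels = mach_labels,
                  ordered = TRUE)
print(labeled)
```

Declaring all six levels, even those not present in a small sample, keeps the categories in their intended order in later tables and plots.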

d <- Read("http://lessRstats.com/data/Mach4.fwd", 
          col.names=c("ID", "Gender", to("m",20)),
          widths=c(4,1,rep(1,20)), quiet=TRUE)

The lessR function to() is just a simple way to avoid repetitiously writing a long list of variable names.

to("m",20)

##  [1] "m01" "m02" "m03" "m04" "m05" "m06" "m07" "m08" "m09" "m10" "m11" "m12"
## [13] "m13" "m14" "m15" "m16" "m17" "m18" "m19" "m20"
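For comparison, the same zero-padded names can be generated in base R with sprintf(). This is a minimal sketch that reproduces the output shown above; it is not a claim about how to() is implemented, and `item_names` is a made-up object name.

```r
# "%02d" pads each integer to two digits with a leading zero,
# so 1 becomes "01" and 20 stays "20"
item_names <- sprintf("m%02d", 1:20)
print(item_names)
```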

To read a text file that does not have the variable names in the first row, apply the base R col.names parameter to define the variable names as part of the call to Read(). The base R widths parameter defines the respective widths of the columns, one column for the responses to each variable.

In this example, the col.names parameter indicates that the respective variable names are ID, Gender, and, as specified by the lessR function to(), m01, m02, through m20. The base R widths parameter specifies that the first variable occupies four columns and the next variable occupies a single column. The base R repetition function rep() then specifies that each of the next 20 variables occupies a single column, shorthand for writing 20 1's separated by commas.

rep(1,20)

##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
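A quick way to see how the widths map onto a row, and to catch a mis-specified widths vector before reading the full file, is to cut a single row by hand. The sketch below uses only base R and the first data row listed earlier; the object names (`row1`, `widths`, `var_names`, `first`, `last`, `fields`) are made up for illustration, and the variable names are built with sprintf() so that the example does not depend on lessR.

```r
# First data row from the listing above
row1 <- "0100004150541540000401324"

widths <- c(4, 1, rep(1, 20))
var_names <- c("ID", "Gender", sprintf("m%02d", 1:20))

# One width per variable, and the widths must account for every character
stopifnot(length(widths) == length(var_names))
stopifnot(sum(widths) == nchar(row1))

# Last character position of each field, then its first position
last <- cumsum(widths)
first <- last - widths + 1

# Cut the row into its fields and convert each piece to an integer
fields <- as.integer(substring(row1, first, last))
names(fields) <- var_names
print(fields[1:6])
```

If the two stopifnot() checks pass, the widths vector and the names vector describe the same 22 variables, and together they span all 25 characters of the row.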

Here, repeat the Read() function call, this time without quiet=TRUE, to display the variables that were read.

d <- Read("http://lessRstats.com/data/Mach4.fwd", 
          col.names=c("ID", "Gender", to("m",20)),
          widths=c(4,1,rep(1,20)))

## 
## >>> Suggestions
## To read a csv or Excel file of variable labels, var_labels=TRUE
##   Each row of the file:  Variable Name, Variable Label
## Read into a data frame named l  (the letter el)
## 
## Details about your data, Enter:  details()  for d, or  details(name)
## 
## Data Types
## ------------------------------------------------------------
## integer: Numeric data values, integers only
## ------------------------------------------------------------
## 
##     Variable                  Missing  Unique 
##         Name     Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1        ID   integer    351       0     351   100  127  134 ... 9677  9721  9840
##  2    Gender   integer    351       0       2   0  0  1 ... 0  0  1
##  3       m01   integer    351       0       6   0  0  2 ... 2  1  3
##  4       m02   integer    351       0       6   4  1  1 ... 3  4  3
##  5       m03   integer    351       0       6   1  4  0 ... 3  4  3
##  6       m04   integer    351       0       6   5  4  5 ... 3  4  4
##  7       m05   integer    351       0       6   0  0  4 ... 2  3  3
##  8       m06   integer    351       0       6   5  3  4 ... 4  4  2
##  9       m07   integer    351       0       6   4  3  0 ... 4  4  2
## 10       m08   integer    351       0       6   1  0  5 ... 3  2  3
## 11       m09   integer    351       0       6   5  4  3 ... 3  3  3
## 12       m10   integer    351       0       6   4  4  4 ... 3  4  3
## 13       m11   integer    351       0       6   0  0  1 ... 1  1  2
## 14       m12   integer    351       0       6   0  1  4 ... 2  1  3
## 15       m13   integer    351       0       6   0  1  0 ... 3  1  2
## 16       m14   integer    351       0       6   0  1  0 ... 2  2  2
## 17       m15   integer    351       0       6   4  2  2 ... 3  5  3
## 18       m16   integer    351       0       6   0  4  0 ... 0  2  5
## 19       m17   integer    351       0       6   1  4  2 ... 0  2  2
## 20       m18   integer    351       0       6   3  3  4 ... 4  4  3
## 21       m19   integer    351       0       6   2  1  0 ... 0  0  1
## 22       m20   integer    351       0       6   4  0  1 ... 1  0  3
## ------------------------------------------------------------------------------------------

Follow the same form of the function call to read any other fixed-width column data file. Specify the column names with col.names, which also applies to a csv or Excel file that lacks variable names in its first row. Then indicate the column widths with the widths parameter, one width for each variable, listed in the same order as the names.
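The widths and col.names parameters can also be tried directly with the base R function read.fwf() from the utils package, which accepts both. The sketch below is not the Read() call from this section: it writes the first three data rows listed earlier to a temporary file so that it runs without a network connection, and the object names `rows`, `tmp`, and `d3` are made up for illustration.

```r
# First three data rows from the listing above
rows <- c("0100004150541540000401324",
          "0127001440330440111244310",
          "0134121054405341400202401")

# Write the rows to a temporary file to stand in for the data file
tmp <- tempfile(fileext = ".fwd")
writeLines(rows, tmp)

# widths: one entry per variable; col.names: names in the same order
d3 <- read.fwf(tmp,
               widths = c(4, 1, rep(1, 20)),
               col.names = c("ID", "Gender", sprintf("m%02d", 1:20)))
unlink(tmp)

# Inspect the first few variables
print(d3[, 1:5])
```

The leading zeros of the ID field are dropped because the column is read as an integer, consistent with the ID values 100, 127, and 134 in the Read() output above.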


Sunday January 08, 2023 at 14:19