Analyze the data values for at least one variable, such as the annual salaries of employees at a company. Typically have available the data for multiple variables. Organize the data values into a specific kind of structure from which analysis proceeds.
Data Table: Organize data values into a rectangular data table with the data values for each variable in a column, and the name of the variable at the top of the column.
Store the structured data values within a computer file, such as an Excel or OpenDocument Spreadsheet (ODS) formatted file or text file. This file can be stored on your computer, an accessible local network, or the world wide web. The data table in the following figure, formatted as an Excel file, contains four variables: Years, Gender, Dept, and Salary plus an ID field called Name for a total of five columns.
Structure of a data table.
Describe the data table by its columns, rows, and cell entries.
Data value: The contents of a single cell of a data table, a specific measurement.
For example, according to the data values for employee Darnell Ritchie, he has worked at the company for 7 years, identifies as a male, and works in administration with an annual salary of $43,788.26.
Missing data: A cell for which there is no recorded data value.
Two data values in this section of the data table are missing. The number of years James Wu has worked at the company is not recorded, nor is the department in which Alissa Jones works.
Variable name: A short, concise word or abbreviation that identifies a column of data values in a data table.
Instance or Example or Sample or Case: Each row of the data table, the data for a specific instance of a single person, organization, place, event, or whatever is being studied.
Encode the data table in one of a variety of computer file formats.
Common formats include Excel files (.xlsx), OpenDocument
Spreadsheet files (.ods), and text files the form of
comma-separated value files (.csv), tab-delimited text
files (.txt), or space-delimited text files
(.prn).
Analysis of data can only proceed with the data table identified and the relevant variables identified by their name.
R data analysis functions analyze the data values for one or more specified variables, identified by their names, such as Salary.
Analysis requires the correct spelling of each variable name, including the same pattern of capitalization.
Your data organized as a data table exists somewhere as a data file stored on a computer system. To analyze data in a data table stored in a computer file, first read the data table from the computer file into a corresponding data table within a running R session.
Data frame: A data table stored within an R session, referenced by its name.
Each variable in a data table has a name, and so does the data table
itself. Initially reference the data table stored on your computer
system by its file name and location. When read into R, name the data
table, the R data frame, with a name of your choice. Regardless of the
file name on your computer system, typically name the data table
within the active R session, the data frame, as d for
data. Not only is d easy to type, but it is also the
lessR default data frame name for the data processed by its
various analysis functions.
When analyzing data read into R, the same data exists in two different locations: a computer file on your computer system, and an R data frame within a running R app. Different locations, different names: same data. On your computer system, identify the data table by its file name and location. Within a running R app, identify the same data by the name of the data frame, such as d, within which the data from the computer file was read.
To read the data from a file into a data frame of a running R
application, as with every other task in R, accomplish the task with a
function. The R ecosystem, base R and its many packages, presents many
such functions. We use the lessR function
Read() for its simplicity and for its useful output that
helps understanding the data that was read.
By integrating many different read functions from R and contributed R
packages, lessR functions Read() and
Write() can seamlessly read and write data files in any one
of the many file formats shown in the following table. The same
Read() function syntax or Write() function
syntax is identical regardless of file format, though some formats may
present unique advanced options such as the Write()
parameter ExcelTable for writing to an Excel table Instead
of an ordinary Excel worksheet.
| Extension | Format | Package | Source |
|---|---|---|---|
|
.csv .tsv .prn .txt |
text, custom separator and decimal text, tab separated values text, space separated values text, comma or tab separated |
R utils |
read.csv( ) read.delim( ) read.table( ) read.table( ) |
|
.xls .xlsx |
Excel | openxlsx |
read.xlsx( ) |
| .ods | ODS | readODS | read_ODS( ) |
|
.feather .parquet |
feather parquet |
arrow |
read_feather( ) read_parquet( ) |
| .rda | R | R base | load( ) |
|
.sav, .zsav .dta .sas7bdat |
SPSS Stata SAS |
haven |
read_spss( ) read_sas( ) read_stata( ) |
Notes.
Data stored in all of these data file formats can be read and written by both R and Python functions, so the resulting data files are available to both systems.
If the data are in the form of a text file that follows the
European and South American convention of using a comma ,
for the decimal point, and if adjacent values are delimited with a
semicolon, ;, then either use Read2(), or
specify a format of txt and then manually add
the parameters dec="," for the decimal indicator and
sep=";" to Read() for the data value
separator.
Data file formats .feather and .parque
are modern, recently developed cross-application formats, particularly
applicable for data reading speed and large file size, respectively. The
corresponding write functions from the arrow package strip
away any row names. Write() does extra processing to save
the row names as an additional variable, which is then recognized by
Read() and converted back to row names.
Packages R utils and R base are two of
the six core R packages downloaded with R.
text files consist only of text and a few control
characters such as for a new line. They can be created with any text
editor or word processor such as MS Word or spreadsheet such as MS
Excel. These generally recognizable listed four file types are also
provided from MS Excel File -> Save As... .
.prn is a file extension for “print” file, with
columns aligned separated by blank spaces, such as saved by MS Excel. Of
course, in this format missing data must be designated with codes
instead of empty cells, such as part of a function call to
Read() with parameter and values
missing=c(-99, "xxxx") to define numeric and character
string missing data codes if these are present in the data.
SPSS files read by Read() undergo additional
processing. Each (usually) integer variable with value labels is
converted into two R variables: the original numeric code with the
original variable name, and also the corresponding factor with the
variable labels named with the original name plus the suffix
_f.
The Read() defaults are generally applicable but
more options are usually avaialble for each core read function. To learn
more, download the respective package and access the corresponding
function help manual with ?function_name, such as
?read_feather.
To read the data, direct R to the location of the data file. R cannot read the data file until it knows where the data is stored. One option has you locate the data file on your computer system by browsing for it, navigating your file system until you locate the file.
To locate your data file by browsing through your file system, call the
Read()function with an empty file reference(""), literally nothing between the quotes.
The following Read() statement reads the data stored as
a rectangular data table from an external file stored on your computer
system into an R data frame called d. The empty quotes indicate
to R to open your file browser for you to locate the data file that
already exists somewhere on your computer system).
d <- Read("")
As with all R (and Excel) functions, the call to invoke the function includes a matching set of parentheses. Any information within the parentheses specifies the information provided to the function for analysis.
The <- indicates to assign what is on the right of
the expression, here the data read from an external file, to the object
on the left, here the R data frame stored within R, named d in
this example. You can also use an ordinary equals sign, =,
to indicate the assignment, but the <- is more
descriptive, and more widely used by R practitioners.
Specify either the full path name of a file on your computer system,
or specify a web address that locates the data table on the web. Also
can explicitly select the location of the data file to be read within
the quotes and parentheses. Specify either the full path name of a file
on your computer system, or specify a web address that locates the data
table on the web. Again, read the data into the d data
frame.
d <- Read("path name" or "web address")
With Excel, R, or any other computer apps that processes data,
enclose values that are character strings, such as a file name, in
quotes. For example, to read the data stored on the web in the data file
called employee.xlsx into the data frame d, invoke the
following Read() function call.
d <- Read("http://web.pdx.edu/~gerbing/data/employee.xlsx")
This example reads a data file on the web. To specify a location on
your computer, provide the full path name of your data file, its name
and location. To obtain this path name, first browse for the file with
Read(""). The resulting output displays the path name of
the identified file. Copy this path name and insert between the quotes
of Read(""), save this and other R function calls in a text
file, and then run the code in the future to directly read the data file
for future analyses without needing to browse for its location.
With the Read() function, put nothing between the quotes
to browse for a data file, or specify the location of the data file on
your computer system or the web. Direct the data read from a file into
an R data frame, usually named d, but can choose any valid
name.
Read()The Read() function displays useful output. Because R
organizes analyses by variable name, it is crucial to know the exact
variable names, including the pattern of capitalization.
In addition to the variable names, Read() displays each
variable’s type as stored in the computer, as numbers with or without
decimal digits, or as character strings. Also listed are the number of
complete and missing values for each variable, the number of unique
values for each variable, and sample data values.
The following lists the output from reading from a data file
downloaded with lessR. All that is needed to read these
data files is the name. A file on the web cannot be specified here
because you may not have web access when this file is generated.
##
## >>> Suggestions
## Recommended binary format for data files: feather
## Create with Write(d, "your_file", format="feather")
## More details about your data, Enter: details() for d, or details(name)
##
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
##
## Variable Missing Unique
## Name Type Values Values Values First and last values
## ------------------------------------------------------------------------------------------
## 1 Years integer 36 1 16 7 NA 7 ... 1 2 10
## 2 Gender character 37 0 2 M M W ... W W M
## 3 Dept character 36 1 5 ADMN SALE FINC ... MKTG SALE FINC
## 4 Salary double 37 0 37 63788.26 104494.58 ... 66508.32 67562.36
## 5 JobSat character 35 2 3 med low high ... high low high
## 6 Plan integer 37 0 3 1 1 2 ... 2 2 1
## 7 Pre integer 37 0 27 82 62 90 ... 83 59 80
## 8 Post integer 37 0 22 92 74 86 ... 90 71 87
## ------------------------------------------------------------------------------------------
To allow for many variables, Read() lists the
information for each variable in a row. Note that the data file
organizes the variables by column.
Get more information from Read(), especially regarding
missing data, by setting brief to FALSE.
##
## >>> Suggestions
## Recommended binary format for data files: feather
## Create with Write(d, "your_file", format="feather")
##
##
## ----------------------------------------------------------
## Dimensions: 8 variables over 37 rows of data
##
## First two row names: Ritchie, Darnell Wu, James
## Last two row names: Anastasiou, Crystal Cassinelli, Anastis
## ----------------------------------------------------------
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
##
##
## Variable Missing Unique
## Name Type Values Values Values First and last values
## ------------------------------------------------------------------------------------------
## 1 Years integer 36 1 16 7 NA 7 ... 1 2 10
## 2 Gender character 37 0 2 M M W ... W W M
## 3 Dept character 36 1 5 ADMN SALE FINC ... MKTG SALE FINC
## 4 Salary double 37 0 37 63788.26 104494.58 ... 66508.32 67562.36
## 5 JobSat character 35 2 3 med low high ... high low high
## 6 Plan integer 37 0 3 1 1 2 ... 2 2 1
## 7 Pre integer 37 0 27 82 62 90 ... 83 59 80
## 8 Post integer 37 0 22 92 74 86 ... 90 71 87
## ------------------------------------------------------------------------------------------
##
##
## Missing Data Analysis
## ------------------------------------------------------------
## Number of cells in the data table: 296
## Number of missing data values: 4
## Proportion of missing data values: 0.014
## Number of rows of data with missing values: 3
## ------------------------------------------------------------
##
## Years Gender Dept Salary JobSat Plan Pre Post
## Wu, James NA M SALE 104494.58 low 1 62 74
## Jones, Alissa 5 W <NA> 63772.58 <NA> 1 65 62
## Korhalkar, Jessica 2 W ACCT 82502.50 <NA> 2 74 87
## ------------------------------------------------------------
Obtain this same information at any time during an R analysis by
calling the function details(). Read() calls
this same function after reading the data, with brief set
to TRUE by default.
Always distinguish continuous variables from categorical variables. This distinction between these two types of variables is fundamental in data analysis.
Continuous (quantitative) variable: A numerical variable with many possible values.
Categorical (qualitative) variable: A variable with relatively few unique labels as data values.
Examples of continuous variables are Salary or Time. Examples of categorical variables are Gender or State of Residence, each with just a relatively few number of possible values compared to numerical values. This distinction of continuous and categorical variables is common to all data analytics.
Sometimes that distinction gets a little confusing because variables with integer values, which are numeric, could be either quantitative or qualitative. For example, sometimes Male, Female, and Other are encoded as 0, 1, and 2, respectively, for three levels of the categorical variable Gender. However, these integer values are just labels for different non-numeric categories. Best to avoid this confusion. Instead, encode categorical variables with non-numeric values, such as Gender, for example, with M F, and O for Other.
A variable label is a longer description of the corresponding variable than that of the variable name. The variable label displays in conjunction with the variable name on the text and visualization output to further clarify the interpretability of the output.
The variable label file has two columns. The the first column lists
the variable names and the second column the corresponding labels. The
file can be of type .csv, or .xlsx, or
contained within lessR, as with the following example.
The variable labels must be read into the data frame l.
Specify the var_labels parameter as TRUE to
instruct the Read() function to read variable labels
instead of data.
##
## >>> Suggestions
## Recommended binary format for data files: feather
## Create with Write(d, "your_file", format="feather")
## More details about your data, Enter: details() for d, or details(name)
##
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## ------------------------------------------------------------
##
## Variable Missing Unique
## Name Type Values Values Values First and last values
## ------------------------------------------------------------------------------------------
## 1 label character 8 0 8 Time of Company Employment ... Test score on legal issues after instruction
## ------------------------------------------------------------------------------------------
Relying upon the have package, lessR
can write data with function Write() in formats: text,
Excel, ODS, and native R and native SPSS formats.
| Extension | Format | Package | Function |
|---|---|---|---|
|
.txt .csv .tsv .prn |
text, custom separator and decimal text, comma separated values text, tab separated values text, space separated values |
R utils |
write.table( ) write.csv( ) write.delim( ) |
| .xlsx | Excel | openxlsx |
write.xlsx( ) |
| .ods | ODS | readODS | write_ods( ) |
| .feather | feather | arrow | write_feather( ) |
| .parquet | parquet | arrow | write_parquet( ) |
| .rda | R | R base | load( ) |
| .sav | SPSS | haven | write_sav( ) |
The R format preserves a binary copy of the data frame as it is
stored within R, as does the SPSS format for native SPSS files. The
Excel and ODS formats yield worksheets. The csv format yields a comma
separated value text file. The feather and
parquet modern cross-analysis application formats,
optimized respectively for speed of input and output and for file
size.
One procedure begins the analysis of data in .csv,
.xlsx, or .ods format. Then proceed with data
cleaning and preparation, including needed transformations and
re-coding. When the data is ready for analysis, save the cleaned,
prepared data as a native R file of format .rda, or
feather or parquet. These formats are the most
efficient for size and for speed of reading back into R, with all data
preparations already completed. Can also write the data in a format such
as Excel so that those on the analysis team not using R (or Python) can
also access.
With the Write() function,
format parameter, with
the default of .csv, or indicate the full file name with
the file type and the format will be inferred.The following R statements that call Write() are not run
here as the intent is not to create additional files.
The following just indicates the data frame name, which results in
writing a .csv file to d.csv.
Write(d)
The following two calls to Write() are equivalent.
Write(d, "GoodData.xlsx")
Write(d, "GoodData", format="Excel")
Or, if you prefer to write the data file in Excel format, you can also choose to write the data as an Excel table.
Write(d, "GoodData", format="Excel", ExcelTable=TRUE)
You can also use the abbreviation for an Excel file,
wrt_x().
wrt_x(d, "GoodData")
Write the data as a R data table.
Write(d, "GoodData", format="R")
Can also use the abbreviation for an R file,
wrt_r().
wrt_r(d, "GoodData")
Use format="ODS" to specify to write to the OpenDocument
Spreadsheet format with file type .ods. Use
format="SPSS" to specify to write to the SPSS format with
file type .sav.
The output of Write() indicates the full path name of
the written file.
The haven package has an excellent function for
reading SPSS files, read_spss(). Read() adds
to the given functionality by preserving the SPSS variable labels and
value labels. Read() invokes this function to read SPSS
files, with the default file types of .sav and
zsav. The functionality of haven also
allows for reading SAS and Stata data files.
Within SPSS, an integer scored variable can have value labels. An
example is Likert scaled data with 1 representing a Strongly Disagree to
5 a Strongly Agree. The corresponding SPSS variable has integer values
but displays the more informative value labels such as in bar charts.
This type of categorical variable corresponds to a factor
in the R system.
The read_spss() function preserves the value labels with
a special variable type called haven_labels. These
variables can be converted to an R factor for processing in the R system
with the haven function as_factor().
Read() performs this conversion automatically for each
relevant variable.
One problem is that the factor conversion preserves the value labels
listed in the correct order, but looses the original integer scoring
information. To preserve both the labels as a standard R factor, and to
preserve the original scoring, Read() converts the original
read variable with the labels (type haven_labels) into two
variables. The first variable, an integer variable with the original
integer scoring, has the name of the read variable. The corresponding
factor variable has the same name as the read variable with the suffix
\_f.
For example, the use of read_spss() results in the
following, here showing just the first four lines of data. The variable
region contains both the integer scoring and the value label
that are part of the SPSS data file that was read.
# A tibble: 4 x 4
city region growth income
<chr> <dbl+lbl> <dbl> <dbl>
1 ALBANY-SCHNTADY-TROY,N.Y. 1 [NE] -71 3313
2 ATLANTA,GA. 2 [SE] 264 3153
3 BALTIMORE,MD. 1 [NE] 38 3540
4 BIRMINGHAM,ALA. 2 [SE] -178 2528
With Read(), obtain the following standard R data frame.
The variable region is now a standard R integer variable, and
region_f is the corrsponding factor.
city region region_f growth income
1 ALBANY-SCHNTADY-TROY,N.Y. 1 NE -71 3313
2 ATLANTA,GA. 2 SE 264 3153
3 BALTIMORE,MD. 1 NE 38 3540
4 BIRMINGHAM,ALA. 2 SE -178 2528
With these data, the analyst may accomplish a numerical analysis with the integer variable, and for analyses such as a bar chart, instead, display the corresponding value labels.
lessR automatically accesses variable labels stored in a
data frame named l, and then displays with the variable name in
text and visualization output. Usually, the variable labels are stored
in an Excel or .csv file with two columns, the variable
name, and the variable label. There are no column titles, just the names
and labels. Then read into the l data frame with the
var_labels parameter set to TRUE.
l <- Read(file_reference, var_labels=TRUE)
Read() also processes the variable labels of these
(usually) integer-scored variables with value labels in the SPSS data
file. The overseers of the R system do not permit package authors to
create stored data structures from their internal R code. Only the user
can create these structures. As such, Read() lists each
variable name, a comma, and then the corresponding label. This example
only has one such relevant variable, region and its factor
equivalent.
Variable and Variable Label --> See vignette("Read"), SPSS section
---------------------------
region, region of US
region_f, region of US
To access these labels, copy the names and labels, paste into a text
file, and then save as a file. Then read the file of names/labels into R
with the preceding Read() statement.
Use the base R help() function to view the full manual
for Read() or Write(). Simply enter a question
mark followed by the name of the function.
?Read
?Write