Read Data into R

Author

David Gerbing

Published

Apr 11, 2024, 08:21 am

Read the Data File

Getting the data from a file into a running R app is called reading the data.

The Data Frame

Data analysis within R proceeds from one or more data tables stored within an active R session.

Data frame

A data table stored within an active R session, referenced by its name.

Access the data within R by reading the data file into R. Multiple read functions are available from Base R as downloaded and from functions in different packages. We use the lessR function Read() for its simplicity and helpful output to better understand the data that R reads into a data frame.

Reading data

Read the data table into an R data frame (table) with the Read() function, then analyze specific variables in that data table, each referenced by its name.

Analogous to multiple Excel worksheets in a single Excel file, a running R session can contain multiple data frames. We usually work with only one, typically named d for data.

Browse for the Data File

To read the data, direct R to the location of the data file. R can only read the data file once it knows where it is stored. One option is to browse for the location of the data file on your computer system. You navigate your file system until you locate the file.

Browse to locate your data file to read

To locate your data file by browsing through your file system, call the Read() function with an empty file reference, (""), literally nothing between the quotes: Read("").

If you are running R/RStudio in the cloud, your local computer is your cloud account, not the computer from which you are accessing the cloud. That computer could be any computing device, such as a tablet or an iPhone, that does not even run R. First, upload your data file to your cloud account, as described in the linked cloud directions.

The following Read() statement reads the data stored as a rectangular data table from an external file stored on your computer system, such as an Excel file. The Read() statement reads the data from the file into an R data frame called d. The empty quotes indicate that R should open your file browser to locate the data file that already exists somewhere on your computer system.

Video: Read Data [3:35]

d <- Read("")

As with all functions in R (and in Excel, Python, and other systems), the call that invokes the function includes a matching set of parentheses. The values within the parentheses are the information provided to the function for analysis.

The <- in the Read() statement assigns what is on the right of the expression, here the data read from an external file, to the object on the left, here the R data frame named d stored within the R session. You can also use an ordinary equals sign, =, for the assignment, but <- shows the direction in which the information flows and is more widely used by R practitioners.
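The two forms of assignment can be compared in a brief sketch. The object names x and y and their values are illustrative only and do not come from the data file.

```r
# Assign with the arrow: the value on the right flows into the name on the left
x <- c(2, 4, 6)

# Assign with an equals sign: the same result at the top level of a script
y = c(2, 4, 6)

# Both names now refer to equal vectors
identical(x, y)
```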

Specify the Location of the Data Table File

One way to locate a data file is to explicitly specify its location within the quotes and parentheses of the Read() function. Specify either the full path name of a file on your computer system or a web address that locates the data table on the web. Again, read the data into the d data frame, remembering to include the quotes.

Read data from a specified location

d <- Read("path name" or "web address")

With Excel, R, or other computer apps that process data, enclose character string values in quotes, such as a file name or web address (URL). For example, to read the data from the web data file employee.xlsx into the data frame d, invoke the following Read() function call.

d <- Read("http://web.pdx.edu/~gerbing/data/employee.xlsx")

To specify the location of the data file on your computer, provide the full path name that locates and names your data file. To obtain this path name, first browse for the file with Read(""). The resulting output displays the path name of the identified file. Copy this path name and insert it between the quotes of Read(""). Save this and other R function calls in a text file so that future analyses can read the data without browsing for its location.
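As a minimal sketch of reading from an explicit path, the following code first writes a small, made-up table to a temporary CSV file so that the example is self-contained, then reads that file by its full path name. The temporary file, the variable names, and the values are assumptions for illustration, and the sketch assumes the lessR package is installed.

```r
library(lessR)

# Hypothetical data written to a temporary file, standing in for your own file
path <- tempfile(fileext = ".csv")
example <- data.frame(
  Name = c("A", "B", "C", "D"),
  Years = c(7, 15, 5, 6)
)
write.csv(example, path, row.names = FALSE)

# Confirm that a file exists at this location before reading it
file.exists(path)

# Read the data from the full path name into the data frame d
d <- Read(path)
```

In your own work, replace path with the quoted path name copied from the output of Read("").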

In summary, with the Read() function, either put nothing between the quotes to browse for a data file or specify the data file’s location on your computer system or the web. Direct the data read from a file into an R data frame, usually named d, though you can choose any valid name.

Multiple Excel Worksheets

If you read an Excel data file with multiple worksheets, the default is to read only the first worksheet as the data file. If you wish to read another worksheet as the data file, then specify that worksheet with the sheet parameter. Specify either the number of the worksheet or its name.
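The sheet parameter can be sketched as follows. To keep the example self-contained, it first builds a hypothetical two-worksheet Excel file with the openxlsx package, which is an assumption of this sketch and not part of the original text. The worksheet names Staff and Pay and all data values are made up.

```r
library(lessR)

# Build a hypothetical Excel file with two worksheets, Staff and Pay
path <- tempfile(fileext = ".xlsx")
staff <- data.frame(Name = c("A", "B", "C"), Dept = c("ADMN", "SALE", "FINC"))
pay <- data.frame(Name = c("A", "B", "C"), Salary = c(50000, 60000, 70000))
openxlsx::write.xlsx(list(Staff = staff, Pay = pay), file = path)

# Default: read the first worksheet
d1 <- Read(path)

# Read the second worksheet, specified by its number
d2 <- Read(path, sheet = 2)

# Read the same worksheet, specified by its name
d3 <- Read(path, sheet = "Pay")
```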

Output of Read()

R organizes analyses by variable name, so knowing the exact variable names is critical. This specification includes the pattern of capitalization. The Read() function automatically displays these names. The variables are in the columns, so to specify a variable is to select a column of data values.
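Why capitalization matters can be seen in a short Base R sketch. The small data frame below is made up for illustration and is not the employee data file.

```r
# Small illustrative data frame
d <- data.frame(
  Salary = c(53788.26, 94494.58),
  Dept = c("ADMN", "SALE"),
  stringsAsFactors = FALSE
)

# List the exact variable names, including capitalization
names(d)

# The correctly capitalized name selects the column
mean(d$Salary)

# A differently capitalized name matches no column, so R returns NULL
is.null(d$salary)
```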

Read() also displays how each variable is stored in the computer: as numbers with or without decimal digits or as character strings. Also listed are the number of complete and missing values for each variable, the number of unique values for each variable, and sample data values. Figure 1 lists the output from reading the employee.xlsx data file.

Figure 1: Annotated output of Read() function with the Variable Name column highlighted.

Always compare the output of Read() with the actual data file to ensure your data were read correctly. Never read data into R or any other system without verifying that the variables and data values in the R data frame correspond to those in the original data table.

To allow for the display of many variables, Read() lists the information for each variable in a row. Of course, the data file organizes the variables by column.
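Base R functions offer a complementary check on what was read. The sketch below is not the Read() output itself. It applies dim(), sapply() with class(), and str() to a small data frame built by hand from a few values in the employee data.

```r
# A few rows typed in by hand to stand in for a data frame that was read
d <- data.frame(
  Name = c("Ritchie, Darnell", "Wu, James", "Hoang, Binh"),
  Years = c(7, NA, 15),
  Dept = c("ADMN", "SALE", "SALE"),
  stringsAsFactors = FALSE
)

# Number of rows and columns
dim(d)

# How each variable is stored
sapply(d, class)

# Compact overview: names, storage types, and the first data values
str(d)
```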

Display the Data

To analyze data, first understand the data. You should know what the data values look like for each variable and the variable names. The output of the lessR function Read() assists this understanding, but you often want to view the data directly.

After reading the data into R, you can view all or some of the contents of the newly created data frame. The rule for viewing the contents of any R object, of which there are many types, is to enter the object’s name at the console in response to the command prompt, >.

Video: Display the Data [1:49]

d

Or, use the R head() function to list the variable names and, by default, the first six rows of data, here for the data frame d.

head(d)

Separating data from the instructions to process that data is a welcome benefit of R over Excel. You should, however, view your data on a regular basis in order to understand what you are analyzing.

View your data

When doing data analysis with R, call the head() function frequently to view the beginning lines of the data table you are analyzing.

You should routinely view your data as you analyze it. When something does not work the way you expected it to work, look at your data!

Often, the problem arises because the computer stores your data differently than you expect. Instead of trying to fix a problem by guessing, first look at your data.

              Name Years Gender Dept    Salary JobSat Plan Pre Post
1 Ritchie, Darnell     7      M ADMN  53788.26    med    1  82   92
2        Wu, James    NA      M SALE  94494.58    low    1  62   74
3      Hoang, Binh    15      M SALE 111074.86    low    3  96   97
4    Jones, Alissa     5      W <NA>  53772.58   <NA>    1  65   62
5   Downs, Deborah     7      W FINC  57139.90   high    2  90   86
6   Afshari, Anbar     6      W ADMN  69441.93   high    2 100  100

Compare this output, the representation of the data within R, to the data table in Figure 1 as an Excel file. Same data, different locations. Note the representation of missing data.

R missing data code

NA and <NA>, for not available, indicate missing data for numerical and non-numerical variables, respectively.

The blank cells in an Excel file are replaced with either NA for numerical variables or <NA> for the non-numerical variables.
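Missing values can also be located and counted with Base R functions, as in the following sketch. The small data frame is typed in by hand for illustration, with one missing value in each variable.

```r
# Illustrative data frame with one missing value per variable
d <- data.frame(
  Years = c(7, NA, 15, 5),
  Dept = c("ADMN", "SALE", "SALE", NA),
  stringsAsFactors = FALSE
)

# Display the data: the numeric column shows NA, the character column <NA>
d

# Count the missing values for each variable
colSums(is.na(d))

# Flag the rows that have no missing values
complete.cases(d)
```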

R also provides a corresponding function, tail(), that lists the data values at the end of the data frame.
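Both head() and tail() accept the parameter n to set the number of rows displayed. The following sketch uses a made-up data frame of ten rows. The variable names id and score are illustrative.

```r
# Illustrative data frame with ten rows
d <- data.frame(
  id = 1:10,
  score = c(82, 62, 96, 65, 90, 100, 71, 88, 93, 79)
)

# First six rows by default
head(d)

# First three rows
head(d, n = 3)

# Last two rows
tail(d, n = 2)
```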