The Data Table
Data analysis begins with, well, data. Analyze the data values for at least one variable, such as the annual salaries of a company’s employees. Organize the data values into a specific kind of structure from which analysis proceeds. To use any data analysis system, such as R, organize the data values into a table.
Video: Data Table [3:24]
Organize data values into a rectangular data table with the name of each variable at the top of a column, followed by its data values in the remainder of the column.
Store the structured data values in a file on your computer, an accessible local network, or the World Wide Web. Encode the data table into one of several computer file formats. The formats we encounter are Excel files (.xlsx) and comma-separated values (.csv) files, which are text files. A text file can carry one of several file types, such as .txt, but usually .csv.
The Excel data table in Figure 1 contains four variables: Years, Gender, Dept, and Salary, plus an ID field called Name, for a total of five columns. Figure 1 displays their data values for the first six employees.
Describe the data table by its columns, rows, and cell entries.
Variable name: A short, concise word or abbreviation that identifies a column of data values in a data table.
Analysis of data can proceed only after the data table and the relevant variables within it are identified.
All R functions analyze the data values within a data table for one or more specified variables, identified by their names, such as Salary.
Analysis requires the correct spelling of each variable name, including the same pattern of capitalization.
Data value: The contents of a single cell of a data table, a specific measurement. The exception is the first row, which (usually) contains the variable names.
The name "variable" was chosen because the data values for a variable vary. Data analysis is the analysis of that variability.
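Because R treats variable names as case-sensitive, a quick check of the exact spelling can prevent a failed analysis. The following is a minimal base R sketch, assuming a small illustrative data frame with a single Salary column whose two values are taken from the data listing later in this section.

```r
# Illustrative data frame with one variable, Salary (assumed for this sketch)
d <- data.frame(Salary = c(63788.26, 104494.58))

# Variable names are case-sensitive: only the exact spelling matches
has_Salary <- "Salary" %in% names(d)   # correct capitalization
has_salary <- "salary" %in% names(d)   # wrong capitalization

print(has_Salary)
print(has_salary)

# Referencing the misspelled name returns NULL instead of the data values
print(is.null(d$salary))
```

Listing names(d) before an analysis is a simple way to confirm the spelling and capitalization of each variable name.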
Variables define the columns of a data table. What about the rows?
Observation: A row of the data table that contains the data for a specific instance of a single person, organization, place, event, or whatever is the object of analysis.
Unfortunately, the terminology for the rows of a data table is not standardized. Observations are also referred to as cases, examples, samples, and instances.
Consider employee Darnell Ritchie. He has worked at the company for seven years, identifies as a man, and works in administration with an annual salary of $63,788.26. Two data values in this section of the data table are missing. The number of years James Wu has worked at the company is not recorded, nor is the department in which Alissa Jones works.
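The structure of columns, rows, and cells can be illustrated with a small data frame built by hand. This is only a sketch in base R: the values for the six employees are copied from the R listing later in this section, and, as in that listing, the Name field is stored as row names rather than as a fifth column.

```r
# Hand-built data table for the first six employees (values from the R listing)
d <- data.frame(
  Years  = c(7, NA, 7, 15, 5, 6),
  Gender = c("M", "M", "W", "M", "W", "W"),
  Dept   = c("ADMN", "SALE", "FINC", "SALE", NA, "ADMN"),
  Salary = c(63788.26, 104494.58, 67139.90, 121074.86, 63772.58, 79441.93),
  row.names = c("Ritchie, Darnell", "Wu, James", "Downs, Deborah",
                "Hoang, Binh", "Jones, Alissa", "Afshari, Anbar"),
  stringsAsFactors = FALSE
)

print(names(d))                        # variable names: the columns
print(dim(d))                          # number of rows and columns
print(d["Ritchie, Darnell", ])         # one observation: a row
print(d["Ritchie, Darnell", "Salary"]) # one data value: a cell
print(d["Wu, James", "Years"])         # a missing numerical value
print(d["Jones, Alissa", "Dept"])      # a missing character value
```

Each variable is a column, each employee is a row, and each data value sits at the intersection of the two.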
Read the Data File
Getting the data from a file into a running R app is called reading the data.
Read the data table into an R data frame (table) with the Read() function, then analyze specific variables in that data table, each referenced by its name.
When read into R, the data table is called a data frame.
Browse for the Data File
To read the data, direct R to the location of the data file. R can only read the data file once it knows where it is stored. One option is to browse for the location of the data file on your computer system. You navigate your file system until you locate the file.
To locate your data file by browsing through your file system, call the Read() function with an empty file reference, (""), literally nothing between the quotes: Read("").
The following Read() statement reads the data stored as a rectangular data table from an external file stored on your computer system, such as an Excel file. The Read() statement reads the data from the file into an R data frame called d. The empty quotes indicate that R should open your file browser to locate the data file that already exists somewhere on your computer system.
Video: Read Data [3:35]
d <- Read("")
As with all R (and Excel, Python, and everything else) functions, the call to invoke the function includes a matching set of parentheses. The information within the parentheses is what the function receives to do its work.
The <- in the Read() statement indicates to assign what is on the right of the expression, here the data read from an external file, to the object on the left, here the R data frame stored within the R session, named d in this example. You can also use an ordinary equals sign, =, to indicate the assignment, but the <- shows the flow of information in the assignment, and is more widely used by R practitioners.
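The equivalence of the two assignment operators can be checked directly. The short sketch below uses made-up object names x and y, which are not part of the original example.

```r
# Assign the same values with each operator (x and y are illustrative names)
x <- c(7, 15, 5)   # arrow: values flow from the right into x
y = c(7, 15, 5)    # equals sign: same assignment

# Both objects hold identical contents
same <- identical(x, y)
print(same)
```

Either form works for assignment, though this document uses <- throughout.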
Specify the Location of the Data Table File
Another way to locate a data file is to specify its location explicitly within the quotes and parentheses of the Read() function. Specify either the full path name of a file on your computer system or a web address that locates the data table on the web. Again, read the data into the d data frame, remembering to include the quotes.
d <- Read("path name" or "web address")
With Excel, R, or other computer apps that process data, enclose character string values in quotes, such as a file name or web address (URL). For example, to read the data from the web data file employee.xlsx into the data frame d, invoke the following Read() function call.
d <- Read("http://web.pdx.edu/~gerbing/data/employee.xlsx")
To specify the location of the data file on your computer, provide the full path name that locates and names your data file. To obtain this path name, first browse for the file with Read(""). The resulting output displays the path name of the identified file. Copy this path name and insert it between the quotes of Read(""). Save this and other R function calls in a text file so that future analyses do not require browsing for the file’s location.
In summary, with the Read() function, either put nothing between the quotes to browse for a data file or specify the data file’s location on your computer system or the web. Direct the data read from a file into an R data frame, usually named d, though you can choose any valid name.
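Reading from an explicit path can be tried without any existing data file. The sketch below first writes a tiny csv data table to a temporary file, then reads it back from that path. It uses the base R function read.csv() as a stand-in so that it runs without lessR; with lessR loaded, the corresponding call would pass the same path to Read(). The two rows of data are taken from the listing later in this section.

```r
# Write a small csv data table to a temporary file
# (a stand-in for a data file that already exists on your computer)
path <- tempfile(fileext = ".csv")
writeLines(c("Name,Years,Salary",
             "\"Ritchie, Darnell\",7,63788.26",
             "\"Hoang, Binh\",15,121074.86"),
           path)

# The full path name is an ordinary character string, enclosed in quotes
print(path)

# Read the data table from the explicit path into the data frame d.
# Base R read.csv() is used here; with lessR the call would be d <- Read(path)
d <- read.csv(path, row.names = 1)
print(d)
```

The first line of the file supplies the variable names, and the remaining lines supply the data values.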
Multiple Excel Worksheets
If you read an Excel data file with multiple worksheets, the default is to read only the first worksheet as the data file. If you wish to read another worksheet as the data file, then specify that worksheet with the sheet parameter. Specify either the number of the worksheet or its name.
Output of Read()
R organizes analyses by variable name, so knowing the exact variable names is critical. This specification includes the pattern of capitalization. The Read() function automatically displays these names. The variables are in the columns, so to specify a variable is to select a column of data values.
Read() also displays how each variable is stored in the computer, that is, its type: as numbers with or without decimal digits or as character strings. Also listed are the number of complete and missing values for each variable, the number of unique values for each variable, and sample data values. Figure 2 lists the output from reading the employee.xlsx data file.
Always compare the output of Read() with the actual data file to ensure your data was correctly read. Never read data into R or any other system without ensuring that the data values in the data table stored on some computer system correspond to the variables and data values read into an R data frame.
To allow for the display of many variables, Read() lists the information for each variable in a row. Of course, the data file organizes the variables by column.
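The kind of information described above can be approximated in base R. The following sketch is not the actual output of Read(), only a rough imitation under stated assumptions: a hand-built data frame of the first six employees (values from the listing later in this section) and a made-up summary table named summary_tbl with one row per variable.

```r
# Hand-built data frame for the first six employees
d <- data.frame(
  Years  = c(7, NA, 7, 15, 5, 6),
  Gender = c("M", "M", "W", "M", "W", "W"),
  Dept   = c("ADMN", "SALE", "FINC", "SALE", NA, "ADMN"),
  Salary = c(63788.26, 104494.58, 67139.90, 121074.86, 63772.58, 79441.93),
  stringsAsFactors = FALSE
)

# One row of information per variable: storage type, complete values,
# missing values, and number of unique non-missing values
summary_tbl <- data.frame(
  Variable = names(d),
  Type     = sapply(d, class),
  Values   = sapply(d, function(x) sum(!is.na(x))),
  Missing  = sapply(d, function(x) sum(is.na(x))),
  Unique   = sapply(d, function(x) length(unique(na.omit(x)))),
  row.names = NULL
)
print(summary_tbl)
```

As with the Read() output, each variable occupies a row of this summary even though it occupies a column of the data table.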
Display the Data
To analyze data, first understand the data. You should know what the data values look like for each variable and the variable names. The output of the lessR function Read() assists this understanding, but you often want to view the data directly.
After reading the data into R, you can view all or some of the contents of the newly created data frame. The general rule for viewing the contents of any R object, of which there are many types, is to enter the object’s name at the console in response to the command prompt, >.
Video: Display the Data [1:49]
d
Years Gender Dept Salary JobSat Plan Pre Post
Ritchie, Darnell 7 M ADMN 63788.26 med 1 82 92
Wu, James NA M SALE 104494.58 low 1 62 74
Downs, Deborah 7 W FINC 67139.90 high 2 90 86
Hoang, Binh 15 M SALE 121074.86 low 3 96 97
Jones, Alissa 5 W <NA> 63772.58 <NA> 1 65 62
Afshari, Anbar 6 W ADMN 79441.93 high 2 100 100
Knox, Michael 18 M MKTG 109062.66 med 3 81 84
Campagna, Justin 8 M SALE 82321.36 low 1 76 84
Kimball, Claire 8 W MKTG 71356.69 high 2 93 92
Cooper, Lindsay 4 W MKTG 66772.95 high 1 78 91
Saechao, Suzanne 8 W SALE 65545.25 med 1 98 100
Pham, Scott 13 M SALE 91871.05 high 2 90 94
Tian, Fang 9 W ACCT 81084.02 med 2 60 61
Bellingar, Samantha 10 W SALE 76337.83 med 1 67 72
Sheppard, Cory 14 M FINC 105027.55 low 3 66 73
Kralik, Laura 10 W SALE 102681.19 med 2 74 71
Skrotzki, Sara 18 W MKTG 101352.33 med 2 63 61
Correll, Trevon 21 M SALE 144419.23 low 1 97 94
James, Leslie 18 W ADMN 132563.38 low 3 70 70
Osterman, Pascal 5 M ACCT 59704.79 high 3 69 70
Adib, Hassan 14 M SALE 93014.43 med 2 71 69
Gvakharia, Kimberly 3 W SALE 59868.68 med 2 83 79
Stanley, Grayson 9 M SALE 79624.87 low 1 74 73
Link, Thomas 10 M FINC 76312.89 low 1 83 83
Portlock, Ryan 13 M SALE 87714.85 low 1 72 73
Langston, Matthew 5 M SALE 59188.96 low 3 94 93
Stanley, Emma 3 W ACCT 56124.97 high 2 86 84
Singh, Niral 2 W ADMN 71055.44 high 2 59 59
Anderson, David 9 M ACCT 79547.60 low 1 94 91
Fulton, Scott 13 M SALE 97785.51 low 1 72 73
Korhalkar, Jessica 2 W ACCT 82502.50 <NA> 2 74 87
LaRoe, Maria 10 W MKTG 71961.29 high 2 80 86
Billing, Susan 4 W ADMN 82675.26 med 2 91 90
Capelle, Adam 24 M ADMN 118138.43 med 2 83 81
Hamide, Bita 1 W MKTG 61036.85 high 2 83 90
Anastasiou, Crystal 2 W SALE 66508.32 low 2 59 71
Cassinelli, Anastis 10 M FINC 67562.36 high 1 80 87
Or, use the R head() function to list the variable names and, by default, the first six rows of data, here for the data frame d.
head(d)
Years Gender Dept Salary JobSat Plan Pre Post
Ritchie, Darnell 7 M ADMN 63788.26 med 1 82 92
Wu, James NA M SALE 104494.58 low 1 62 74
Downs, Deborah 7 W FINC 67139.90 high 2 90 86
Hoang, Binh 15 M SALE 121074.86 low 3 96 97
Jones, Alissa 5 W <NA> 63772.58 <NA> 1 65 62
Afshari, Anbar 6 W ADMN 79441.93 high 2 100 100
Compare this output, the representation of the data within R, to the data table in Figure 1 as an Excel file. Same data, different locations.
Another option to view the data read into R invokes the Base R View() function, which works directly from within RStudio.
View(d)
One advantage of this form of viewing the data is that you can view the data just by scrolling. Figure 3 shows the display of data within RStudio with View(d), with the scroll bar at the right side of the window pane. Or, in RStudio, select the name of the data frame from the Environment tab at the top right window pane.
View() data display from within RStudio.
Note the representation of missing data.
NA and <NA>, for not available, indicate missing data for numerical and non-numerical variables, respectively.
The blank cells in an Excel file are replaced with either NA for numerical variables or <NA> for the non-numerical variables.
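Missing data values can also be located and counted directly. The sketch below assumes a hand-built data frame with two of the variables for the first six employees; the object names missing_per_variable, rows_missing_years, and rows_missing_dept are made up for illustration.

```r
# Two variables for the first six employees, each with one missing value
d <- data.frame(
  Years = c(7, NA, 7, 15, 5, 6),
  Dept  = c("ADMN", "SALE", "FINC", "SALE", NA, "ADMN"),
  row.names = c("Ritchie, Darnell", "Wu, James", "Downs, Deborah",
                "Hoang, Binh", "Jones, Alissa", "Afshari, Anbar"),
  stringsAsFactors = FALSE
)

# The display shows NA for the numerical variable, <NA> for the character variable
print(d)

# Count the missing values for each variable
missing_per_variable <- colSums(is.na(d))
print(missing_per_variable)

# Identify which rows have the missing values
rows_missing_years <- rownames(d)[is.na(d$Years)]
rows_missing_dept  <- rownames(d)[is.na(d$Dept)]
print(rows_missing_years)
print(rows_missing_dept)
```

The is.na() function recognizes both forms of the missing data code, so the same check applies to numerical and non-numerical variables.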
Separating data from the instructions to process that data is a welcome benefit of R over Excel. You should, however, view your data on a regular basis in order to understand what you are analyzing.
When doing data analysis with R, frequently access the head() function to view the beginning lines of the data table you are analyzing.
You should routinely view your data as you analyze it. When something does not work the way you expected it to work, look at your data! Often, the problem arises because the computer stores your data differently than you expected. Instead of trying to fix a problem by guessing, first look at your data, such as with the head() function. R also provides a corresponding function, tail(), that lists the data values at the end of the data frame.
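Both head() and tail() accept an optional n argument to control how many rows are listed. The sketch below assumes a hand-built data frame of the first eight employees, with values taken from the listing earlier in this section; the object names first_rows, first_three, and last_two are illustrative.

```r
# First eight employees, two variables each
d <- data.frame(
  Years  = c(7, NA, 7, 15, 5, 6, 18, 8),
  Salary = c(63788.26, 104494.58, 67139.90, 121074.86,
             63772.58, 79441.93, 109062.66, 82321.36),
  row.names = c("Ritchie, Darnell", "Wu, James", "Downs, Deborah",
                "Hoang, Binh", "Jones, Alissa", "Afshari, Anbar",
                "Knox, Michael", "Campagna, Justin")
)

first_rows  <- head(d)          # default: the first six rows
first_three <- head(d, n = 3)   # only the first three rows
last_two    <- tail(d, n = 2)   # the last two rows

print(first_rows)
print(first_three)
print(last_two)
```

Viewing a few rows from each end of the data frame is a quick check that the whole table was read as expected.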