3  Data

Author

David Gerbing

Published

Dec 17, 2023 08:03 pm

3.1 Data Table

Data analysis begins with, well, data. Analyze the data values for at least one variable, such as the annual salaries of employees at a company. Organize the data values into a specific kind of structure from which analysis proceeds. To know how to use any data analysis system such as Python, understand the organization of data values into a data table [video 5:14].

Data Table: Organize data values into a rectangular data table with the data values for each variable in a column, and the name of the variable at the top of the column.

Store the structured data values within a computer file, such as an Excel formatted file or text file. This file can be stored on your computer, an accessible local network, or the world wide web. The data table in Figure 3.1, formatted as an Excel file, contains four variables: Years, Gender, Dept, and Salary plus an ID field called Name for a total of five columns.

Figure 3.1: Structure of a data table.

Describe the data table by its columns, rows, and cell entries.

Data value: The contents of a single cell of a data table, a specific measurement.

Variable name: A short, concise word or abbreviation that identifies a column of data values in a data table.

Example: Each row of the data table, the data for a specific instance of a single person, organization, place, event, or whatever is studied.

Unfortunately, data analysis language is not standardized. Other names for example are data row, case, sample, and instance. Even more confusingly, in some contexts, the word sample refers to all the rows of data values, and in other contexts, it refers to a single row of data.

Figure 3.1 shows the data values in the Excel file for these variables for just the first 6 employees. For example, according to the data values for employee Darnell Ritchie, he has worked at the company for 7 years, identifies as a male, and works in administration with an annual salary of $53,788.26. Two data values in this section of the data table are missing. The number of years James Wu has worked at the company is not recorded, nor is the department in which Alissa Jones works.

Encode the data table in one of a variety of computer file formats. The formats we encounter are Excel files, indicated by a file type of .xlsx, and text files in the form of comma-separated value files (csv). Identify a text file with one of several potential file types, such as .txt, but usually .csv.

csv file: A data file that consists only of text, with adjacent data values for each row of data separated by a comma.

The first five rows of the employee data file in csv format appear in Figure 3.2.

Figure 3.2: csv formatted text data file with fields that contain a comma enclosed in quotes.

An advantage of csv files is that they can be read by any computer app that reads text. They are the default text file type for Excel.

Analysis of data can only proceed with the data table identified and the relevant variables identified by their name.

All Python data analysis functions analyze the data values for one or more specified variables, identified by their names, such as Salary.

Analysis requires the correct spelling of each variable name, including the same pattern of capitalization.

3.2 Read Data

3.2.1 Data Frame

To begin an analysis, read the data from a computer file into Python. Your data organized as a data table exists somewhere as a data file stored on a computer system. Figure 3.3 shows a data table as an Excel file named employee.xlsx stored on a (Macintosh) computer. Figure 3.1 shows the first several lines of this data table in detail.

Figure 3.3: Data table, named employee.xlsx, stored as an Excel file.

The data table stored within a running Python program has its own name.

Data frame: A data table stored within a Python session as a pandas data structure.

Each variable in a data table has a name, and so does the data table itself. Reference the data table stored on your computer system by its file name and location. When reading into a Python data frame, name the data frame with a name of your choice. Regardless of the file name on your computer system, typically name the data table within the active Python session, the data frame, as simply d for data.

When analyzing data read into Python, the same data exists in two different locations: as a computer file on some computer system, and as a Python data frame within a running Python app. Different locations, different names: same data. On your computer system, identify the data table by its file name and location. Within a running Python app, identify the same data by the its data frame name.

To read the data from a file into a data frame of a running Python application, as with every other task in Python, invoke a function. The pandas package provides two functions for reading data into a pandas data frame.

read_csv(): Read a data table in a file formatted as csv (comma separated values) text.

read_excel(): Read a data table in a file formatted as an Excel file.

Explicitly specify the location of the data file to be read within quotes as the first value passed to each function, that is within the parentheses. Specify either the path name of a file on your computer system, or specify a web address that locates the data table on the web. Chapter 8, Read Data, of the video, illustrate reading data into the d data frame.

Because the read functions are from the pandas package, inform Python of this location by beginning the file reference with a pd..

d = pd.read_csv("path name" or "web address")

d = pd.read_excel("path name" or "web address")

With Excel, Python, or any other computer apps that process data, enclose values that are character strings, such as a file name, in quotes. For example, to read the data stored on the web in the data file called employee.xlsx into the data frame d, invoke the following function call.

Because Python organizes analyses by variable name, it is crucial to know the exact variable names, including the pattern of capitalization.

Specify a Python analysis for one or more variables (columns) in the data frame (table).

Rows and columns organize a data table. The variables are in the columns, so to specify a variable is to specify a column of data values.

3.2.2 From the Web

Figure 3.4 illustrates reading the data from an Excel file employee.xlsx on the web.

Figure 3.4: Read the Excel data from the web into a data frame named d.

If using Jupyter Lab as the development environment, enter the function call to read the data into a notebook cell.

3.2.3 Anaconda

Read data files from your own computer file system or directly from the web. However, as previously indicated, when in Anaconda cloud, data must be read from your cloud’s file system, not from the web. So being able to read data files from the cloud file system is necessary for using Anaconda cloud.

If in Anaconda cloud, if given a web data file the first step is to download the data file to your computer. Downloading a data file from the web is straightforward and has nothing to do with Anaconda. For example, with Safari on my Mac, I copy a web address (URL) such as from a homework assignment, and then paste this address into Safari. The data file opens in my broswer, Then File --> Save and the data file is downloaded. In Jupyter Lab on your cloud account, upload a data file from your ccomputer by clicking the Upload button at the top-left of the window. After you do this process once it will take somwehre around 15 seconds each time in the fututre.

To specify a location of a data file on your computer, provide the path name of the file, its location and name. One possibility specifies a full path name that begins from the top-level of your computer’s file system. The top-level for Windows is C:, or ~/ for the Unix based Macintosh file system. A usually more efficient and less error-prone method is to specify a path within your previously created project folder where, in our case, the corresponding Jupyter Notebook file is located.

Current working directory: The default folder on your computer from which the running application identifies files from which to read, and where files are written.

The current working directory is the location of your Jupyter Notebook file. That is one reason why, when opening the Jupyter Notebook app, first navigate to your personal Python folder that you created before analysis begins. An example of the folders and files from the home directory of the Jupyter Notebook appears in the previously shown Figure 3.5.

Figure 3.5: List of folders from the home directory when Jupyter Lab first opens.

In this example, click on Documents, then click on your previously created Python folder. Then, create a new Jupyter Notebook from that location.

For example, Jupyter Notebook files have the file type .ipynb. Once you begin creating these files, in your Python project directory (folder), there will be a list of .ipynb files. Also in this directory you could create a sub-directory (sub-folder) called data. Presume that employee.xlsx is located in that directory (folder) data. Then the read statement appears as in Figure 3.6.

Figure 3.6: Read the Excel data from the data folder into a data frame named d.

The forward slash, /, in data/FILENAME indicates a sub-directory, here relative to the current working directory. Chapter 6 of the video, Data Folder, illustrates creating a data folder in the Anaconda Jupyter Lab environment.

3.2.4 From Colab

Access Colab at colab.research.google.com. Colab uses your Google Drive as its file storage. If you have a data file to read that is not on the web, create a data folder on your Google Drive and move the data file to that location. To access your Google Drive on the web: drive.google.com. Better, however, to download the Drive app locally on your device to be able to access directly.

To access your Google Drive files requires to connect Colab with the drive.mount() function in the drive package. Mount the drive directory (folder) in the content directory as follows. When you request to mount the drive you will be provided a link to obtain an authorization code.

Figure 3.7: Connecting to Google Drive.

The full file reference also contains the string MyDrive, which than accesses your Google Drive. This example presumes that there is a folder you created on your Google Drive called data that contains your data files.

'/content/drive/MyDrive/data/employee.xlsx'

Note the path name is included with quotes.

3.3 Display Data

To analyze data, you need to understand the data. You should know what the data values look like for each variable, and you should know the variable names. After reading the data into Python, view the contents of the newly created data frame. The rule is that to view the contents of any Python object, of which there are many types, enter the name of the object. Here enter the name d to view all the data.

Of course, viewing all the data is usually not practical. Instead, display [video 4:28] part of the data, including the size of the data table. The shape method shows the number of rows and then columns in the data table. The output shows in Figure 3.8.

Figure 3.8: Get the number of rows and columns for a data frame named d.

Or, use the Python head() function to list the variable names and, by default, just the first five rows of data, here for the data frame d. Figure 3.9 illustrates the head() function.

Figure 3.9: Display variable names and first five rows of data of a data frame named d.

The following output from the Python data frame d results.

Compare this output, the representation of the data within Python, to the data table in Figure 3.1 as an Excel file. Same data, different locations. Note the pandas representation of missing data.

Python missing data code: NaN for not a number.

The blank cells in the Excel file, Figure @ref(fig:dataTable), are replaced with either NaN.

Python pandas also provides a corresponding function tail() that lists the data values at the end of the file.

3.4 Two Types of Variables

Always distinguish continuous variables from categorical variables. This distinction between these two types of variables is fundamental in data analysis.

Continuous (quantitative) variable: A numerical variable with many possible values.

Categorical (qualitative) variable: A variable with relatively few unique labels as data values.

Examples of continuous variables are Salary or Time, all defined on a numerical scale with many unique values. Examples of categorical variables are Gender or State of Residence. Each categorical variable has just a relatively few number of possible values compared to a continuous value. This distinction of continuous and categorical variables is common to virtually every data analysis project.

Sometimes that distinction gets a little confusing because variables with integer values, which are numeric, could be quantitative or qualitative. For example, sometimes Man, Woman, and Other are encoded as 0, 1, and 2, respectively, for three levels of the categorical variable Gender. However, these integer values are just labels for different non-numeric categories. Best to avoid this confusion. Instead, encode categorical variables with non-numeric values, such as Gender, for example, with M, W, and O for Other.

To distinguish between continuous and categorical variables, determine if the values of a variable are on a numerical scale, presumably with a relatively large number of unique values that can be ordered from smallest to largest value. A categorical variable such as Gender, however, coded numerically, does not imply the values are on a numerical scale. For example, Woman coded a 1, is not more than Man, coded as 0, or vice versa.

Python, like all computer applications, stores numerical and categorical data differently. After reading the data into a data frame, pandas provides a method to show the storage type of the data values for each variable.

dtypes: Method to display the name of each variable in the specified data frame and its associated data type.

When calling the method, precede with the data frame name and a period.

The Python variable types we work with follow.

  • categorical variables: object or category
  • integers: int64
  • numbers with decimal digits: float64

Figure 3.10 illustrates the input and output of dtypes.

Figure 3.10: Some Python variable types

The distinction between the variable types object and category is discussed later. For now, object is the default variable type for a non-numeric variable, but eventually all categorical variables, integer or character strings, should be converted to variable type category.