1  Introduction

Author

David Gerbing

Published

Dec 17, 2023 08:03 pm

The hottest topic in the world of business is data analytics, a true revolution within the last ten or so years. Here we introduce one of the primary data science languages: Python [video 1:13].

All data analysis is done on the computer, as in 100%. We use Python for data analytics, the same analysis system many data scientists use for real data science, particularly machine learning (R is also a popular data science choice). You initially spend some time with the Python system, and then once you see how it works, the rest becomes straightforward. From a small investment in learning how the system works, you gain access to much data analytic prowess, including machine learning.

The following content introduces Python leading to data analysis.

1.1 Python vs. Excel

Both Excel and Python analyze data with function calls. If you have used Excel functions, you already know more Python programming than you thought you did. Excel works with many built-in functions, plus user-defined functions (which Excel calls macros), and so does Python.

Function: Performs the computations that transform data values into the results, the output.

The input into a data analysis function is data. The output of a function can appear in several forms:

  • text in the form of writing and tables
  • data visualizations in a graphics format such as pdf or png
  • data transformed from the original input data.

To run a Python program is to process the code, which consists of a sequence of function calls to accomplish a corresponding sequence of analyses.
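As a minimal sketch of such a sequence, the following uses only Python's standard library. The data values and the names values, n, avg, mid, and ordered are invented for this illustration and do not come from the text. Each line calls one function: data goes in and a result comes out, whether as a number, as transformed data, or as printed text.

```python
from statistics import mean, median

# Hypothetical data values, invented for this illustration
values = [52, 47, 61, 39, 58, 43]

# Each line below is one function call: data in, result out
n = len(values)            # number of data values
avg = mean(values)         # arithmetic mean
mid = median(values)       # middle value of the sorted data
ordered = sorted(values)   # transformed data: the same values, sorted

# Text output of the results
print(f"n = {n}, mean = {avg}, median = {mid}")
print("sorted:", ordered)
```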

1.1.1 Function Calls

Python and Excel differ in how to instruct a function to do its work. To illustrate, consider the following data, six data values for the variable Salary. Find the variable name Salary in the first row of the corresponding Excel worksheet, with the data values listed under the variable name. The standard organization of data for analysis lists the variable name in the first row. List data values for that variable in the corresponding column, beginning in the second row.

What is the average salary? Compute the average, more technically called the (arithmetic) mean, either with the Excel function average() or with the Python function mean(). Both functions provide the same result but the respective languages call their functions differently.

Function call: Process the computations of a function in a running computer app.

With Excel, enter the function call into a cell in the worksheet, sharing the same type of storage area, a worksheet cell, as the data itself. See Figure 1.1.

Figure 1.1: Excerpt from an Excel worksheet.

In this example, enter the function call beneath the column of data, into the eighth cell of Column A. Specify the data for analysis with a cell range, such as the relative cell range A2:A7. This cell range refers to cells in the same column, relative to the cell in which you enter the function call. This cell range extends from the second row of Column A to the seventh row.

Python works differently.

A running Python application presents a space, separate from the data, in which to enter sequential Python instructions.

Read the data into a running Python application as a data table with a name you choose, say d. Then enter the Python instruction for computing the average Salary in this d data table with the mean() function, shown in Figure 1.2.

Figure 1.2: Python function call to compute the mean of variable Salary, a variable found in the data table read into the Python program with the name of d.

The data table within Python, here named d, has its own general name.

Data frame: Data table stored in a running Python app.
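The figure itself is not reproduced in this text, so the following is only a sketch of what such a function call can look like, assuming the widely used pandas package provides the data frame. The six salary values are hypothetical stand-ins for the worksheet data, and the data frame is built directly in the code rather than read from a file.

```python
import pandas as pd

# Hypothetical salaries standing in for the six worksheet values
d = pd.DataFrame({"Salary": [52000, 47000, 61000, 39000, 58000, 43000]})

# Python counterpart of entering =AVERAGE(A2:A7) in an Excel cell:
# select the Salary column of d, then call its mean() function
salary_mean = d["Salary"].mean()
print(salary_mean)
```

The instruction names the data frame and the variable, d and Salary, instead of a cell range, so it does not depend on where the data values happen to sit in a worksheet.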

Instructions for a Python data analysis typically consist of several lines of code, entered one after another.

To do data analysis with either Excel or Python in this example, enter a simple function call to analyze the data. For data analysis, both Excel and Python analyze the data values for a variable organized within a column, though with different user interfaces. Python, however, presents several advantages.

1.1.2 Advantages of Python

Data scientists use Python (or R) for their analyses instead of Excel. The following are some reasons for the preference for Python.

  • Excel is great for data entry and viewing data as a spreadsheet app, but provides only a smattering of basic statistics computations.
  • Python offers a wide variety of statistical analyses, from beginner to advanced level.
  • Python does Big Data, efficiently handling data sets with millions of observations.
  • Python does machine learning analysis, including the most advanced, recent developments.
  • Once you understand how to work with Python, an analysis such as constructing a histogram takes less time than clicking and mousing around.
  • Python separates the instructions for the analysis of data from the data. This separation makes debugging errors much more straightforward than for complicated Excel files that can span multiple worksheets.
  • Obtain each Python analysis with one or more instructions, function calls, that can be saved for future use instead of irrecoverable mouse clicks. The results of Python analyses are reproducible, an important enough concept to merit its own discussion.

The multiple instructions to perform an analysis document exactly how to conduct the analysis. Save these instructions in a file for later use so that the analysis can be repeated, that is, reproduced.

Reproducibility: Analyses can be re-run in the future to reproduce previously obtained results.

The saved Python code is an instruction manual for accomplishing the analysis, a set of instructions anyone with access can repeat. The logic underlying the computations becomes readily apparent. The instructions for analyses done by one person become accessible to all applicable members of the organization at any subsequent point in time, including the original author. In contrast, the Excel mouse clicks typically disappear into digital dust.

When using Excel, you always see your data. That part is good, but there is a huge downside to mixing data with code, which is one reason why data scientists use Python or R instead of Excel for their analyses.

As I wrote in my 2021 publication in the Journal of Statistics and Data Science Education (Gerbing 2021):

“From the perspective of data science, Excel worksheets exhibit a fundamental flaw, the confounding of the data with the instructions to process that data. Both data and data processing instructions are entered into adjacent cells stored within the same worksheet. On the contrary, R and Python separately store data and data processing instructions into different files” (p. 251).

Countless overly complex Excel worksheets for business processes are horrendous to debug and understand in their complexity. Much better to separate your data from the code. Let the (small to moderate size) data table reside within Excel, but use Python or R to write your code that manipulates and analyzes your data. Python writes data to Excel files as easily as it reads from Excel worksheets. If needed, export the results of your specified computations back to Excel.
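The following sketch illustrates that separation, assuming the pandas package. The data sits in one place and the instructions in another, and the results are exported at the end. The column names Name, Dept, and Salary, the data values, and the function name summarize are all hypothetical. CSV text stands in for the Excel worksheet so that the sketch runs without any file on disk.

```python
import io
import pandas as pd

# Hypothetical data, kept apart from the code. A real project would keep
# these rows in an Excel or CSV file; CSV text is used here so that the
# sketch runs without any file on disk.
salary_csv = """Name,Dept,Salary
Ana,Sales,52000
Ben,Sales,47000
Cho,IT,61000
Dev,IT,58000
"""

def summarize(data):
    """Return the mean Salary for each Dept as a new data frame."""
    return data.groupby("Dept", as_index=False)["Salary"].mean()

# The analysis: read the data, compute, export the results
d = pd.read_csv(io.StringIO(salary_csv))
results = summarize(d)
exported = results.to_csv(index=False)  # text that Excel can open
print(exported)
```

With an actual workbook, the pandas functions read_excel() and to_excel() take the place of read_csv() and to_csv(); both need an Excel engine such as openpyxl to be installed. Because the instructions are saved as code, re-running them on the same data reproduces the same results.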

Separate your data from your code to manipulate that data. Data analysis programming languages such as Python and R provide that separation. Excel is vastly overused and a detriment to many business operations. Welcome, instead, to the world of real data science.

1.1.3 Machine Learning

Python provides a general framework for machine learning. All different machine learning procedures follow the same overall structure of Python code. Once you have completed one Python machine learning analysis, running others follows the same syntax. However, before we can use Python for machine learning, we need some basic concepts of Python programming.
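As a preview only, the following sketch shows what that shared structure can look like. It assumes the scikit-learn package, which this section does not name, so that choice is an assumption. The training data are hypothetical. Two different procedures, a linear regression and a decision tree, run through exactly the same two steps.

```python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Hypothetical training data that follow y = 2x + 1
X = [[1], [2], [3], [4]]
y = [3, 5, 7, 9]

predictions = {}
for model in (LinearRegression(), DecisionTreeRegressor(random_state=0)):
    model.fit(X, y)                    # step 1: learn from the data
    y_hat = model.predict([[4]])[0]    # step 2: predict for x = 4
    predictions[type(model).__name__] = float(y_hat)

print(predictions)
```

Only the name of the procedure changes from one analysis to the next; the calls to fit() and predict() stay the same.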

1.2 Getting Started

Download and install Python and related software on your computer, or run it in the cloud. The choice is yours. Python runs the same regardless. The download is free, and cloud access is also free except for fairly extensive analyses beyond the basics. From the cloud, run Python from any device with a web browser, such as any Windows or Mac machine, as well as devices such as a Chromebook or an iPad. Of course, running in the cloud requires an active Internet connection, but it does not require a Windows or Mac (or Linux/Unix) computer or the installation and configuration such a computer needs. The cloud and your own computer are also not mutually exclusive choices. The same Python code runs in either environment, with the Python code files easily uploaded and downloaded across environments.

1.2.1 Your Computer

1.2.1.1 Download

Getting Python onto your computer is simple. The most straightforward method to access Python and related software on your computer is from a company called Anaconda. The Anaconda Individual Edition download provides all the needed free, open-source tools. Download the installer for Python and its related ecosystem directly from anaconda.com/download. Scroll down the web page a bit for the Download button.

Download Python and related content. Currently, Anaconda offers Version 3.11.5 of the core Python language. If downloading on a Mac, from the drop-down menu on the Download button choose the version appropriate for your computer, shown in Figure 1.3.

Figure 1.3: To download the current Anaconda distribution of Python, click the Download link.

“Python” for data analysis is actually a set of related software apps, an entire ecosystem of software. Downloading Python from Anaconda downloads the following:

  • Python core language, Version 3.x
  • many packages that enhance the core language for data analysis
  • conda package manager to manage all the packages
  • several development environments from which to write and run Python code

Clicking on the download links first downloads the installer. Accept the given defaults for each step of the process.
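Once the installation finishes, a short check such as the following sketch can confirm which Python is running and whether a few packages are available. It uses only the standard library. The three package names are examples chosen for illustration, not a list taken from the Anaconda documentation.

```python
import importlib.util
import platform
import sys

# Version of the core Python language that is running this code
version = platform.python_version()
print("Python version:", version)

# Location of the running Python program, which shows which installation is in use
print("Python location:", sys.executable)

# Example package names, chosen for illustration
available = {}
for name in ("numpy", "pandas", "matplotlib"):
    available[name] = importlib.util.find_spec(name) is not None
    print(f"{name}: {'installed' if available[name] else 'not found'}")
```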

Another downloaded app is the package manager, discussed next.

1.2.1.2 Update

The Anaconda download material is typically frozen at a specific time and only updated several times a year. Many updates may have happened in the Python world since the last download was posted on the Anaconda web site, so it is usually, though not always, worth updating the material after downloading it.

To manually maintain all of the many packages included in the Anaconda distribution is not a trivial task. Many different people and organizations develop the various external packages needed to achieve the functionality of the Python system that we need. Each package usually depends on one or more other packages. Updating one package may violate dependencies in another package so that the software no longer works. To safely update the software, we need a way to manage these interdependencies, locate the various packages and then do the download.

Conda package manager: Provides the needed bookkeeping for updating packages and their many inter-dependencies.

To access conda, however, we need the command line, which can be accessed via Anaconda Navigator, as shown in Figure 1.4. On the Mac, Navigator brings you to the command line via the Terminal app, which you can also engage directly if you wish.

Figure 1.4: Access the command line.

When at the command line, a prompt indicates to enter a command. Enter:
conda update --all

Figure 1.5 illustrates the update instruction for the Mac.

Figure 1.5: Update the Anaconda distribution from the Mac Terminal app.

Figure 1.6 illustrates the update instruction for Windows.

Figure 1.6: Update the Anaconda distribution from the Windows Anaconda Prompt app.

The process takes some minutes to gather and install all the updates from the initial Anaconda download to the current state of the software.

1.2.1.3 Python Project Folder

Before beginning Python program development on your computer, consider creating a folder (directory) for storing your Python projects. While not required, it is good practice to organize the contents of any project in a distinct directory that you can easily navigate. For example, create a folder called Python in your Documents directory.
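The folder can be created by hand or, as a sketch, with Python's standard pathlib module. The function name create_project_folder and its parameters are hypothetical, and the default location assumes that a Documents folder sits directly under your home directory, which is not the case on every system.

```python
from pathlib import Path

def create_project_folder(base=None, name="Python"):
    """Create the folder base/name if it is missing and return its path."""
    if base is None:
        # Assumed default location: Documents under the home directory
        base = Path.home() / "Documents"
    folder = Path(base) / name
    folder.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return folder
```

Calling create_project_folder() with no arguments creates the Python folder under Documents and returns its path. Calling it again leaves the existing folder unchanged.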

1.2.2 The Cloud

Computing in the cloud, that is, on remote machines accessed over the Internet via a web browser, is a dynamic industry. Cloud computing pairs cheaper personal computing devices with ever more cloud computing power. Running Python in the cloud is no exception.

One cloud option comes directly from Anaconda. With Anaconda Cloud (click on Notebooks at the top of the window) you can maintain a directory that includes both your Python code files and your data files. With Anaconda Cloud there is no need to mount a separate drive. However, there is a limitation.

Anaconda in the cloud cannot read data files on the web directly. Instead, download a web data file to your computer and then upload the data file to the cloud.

The process is straightforward and is explained in Section 3. Be aware that access has daily time limits, with the lowest limit applying to the free access tier. That said, analyses can generally be done within the time limit for each assignment.

Another cloud option is Google’s generally free Colab environment. Working with Colab is similar to working with the Anaconda distribution of Python, though your files are stored on Google Drive. The same directions for running Python scripts apply to both environments with minor distinctions. Colab automatically creates a project folder called Colab Notebooks on your Google Drive to store your Python notebook files. These notebooks are stored on your Google Drive, and can also be downloaded to your own computing device, and even opened as a Jupyter Notebook, such as with Anaconda.