Data Preparation for Machine Learning

Overview

On the job, you cannot expect that data arrives ready for analysis. Preparing the data, a step accomplished with its own set of analytic tools, can devour much more time than the analysis itself, typically around 70% to 80% of the work. We consider two forms of data management: cleaning and tidying the data, and transforming variables.

Too frequently, structural problems with the data’s organization, miscoding, a variety of inconsistencies, missing data, and other issues stand between the raw data and analysis. Even if the data is clean and organized correctly, additional transformations of some of the variables may be needed. Managing data from its initial version to a cleaned, tidied version amenable to analysis typically requires the analyst to wrangle with the data.

Wrangle: To herd, to round up, to brawl.

Data Wrangling: The process of cleaning, tidying, and otherwise preparing data for analysis.
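To make the idea concrete, here is a minimal wrangling sketch in Python with pandas. The data, column names, and cleaning choices are hypothetical, invented for illustration; the notebook templates show the course's own approach.

```python
# A small wrangling example: recode inconsistent labels, fill a missing value.
import numpy as np
import pandas as pd

# Raw data with typical problems: miscoded categories and missing data.
raw = pd.DataFrame({
    "gender": ["M", "male", "F", "F", "M"],         # inconsistent coding
    "salary": [52000, 61000, np.nan, 58000, 49500]  # one missing value
})

# Recode the inconsistent category label to a single convention.
raw["gender"] = raw["gender"].replace({"male": "M"})

# One simple treatment of missing data: fill with the column median.
raw["salary"] = raw["salary"].fillna(raw["salary"].median())

print(raw)
```

Recoding labels and treating missing values are only two of many possible wrangling steps, and filling with the median is one reasonable choice among several.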

As with the rest of this course, becoming an expert in these topics would require weeks of focus. However, having at least some skill provides a reasonable background for understanding the basic principles of data preparation, so that you can do many operations yourself, and a foundation for learning more in the future as needed. Remember, you never need to memorize the Python syntax, just be able to read and understand what the steps are accomplishing.

Within a running Python or R analysis, data is organized into a data table called a data frame. For example, a data frame can be subsetted by extracting some subset of its rows and columns. Using the corresponding Jupyter notebook template as a reference, you should be able to implement a specified subsetting by reading that template and then, with relatively minor modifications, creating your own subsetting of a data frame.
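As a sketch of what subsetting looks like, assuming pandas and a small hypothetical data frame (the notebook template remains the authoritative reference):

```python
# Subset a data frame by extracting some of its rows and columns.
import pandas as pd

df = pd.DataFrame({
    "name":  ["Ann", "Bob", "Cal", "Dee"],
    "dept":  ["sales", "tech", "tech", "sales"],
    "years": [3, 7, 2, 9]
})

cols = df[["name", "years"]]                        # subset of columns
tech = df[df["dept"] == "tech"]                     # subset of rows, by condition
tech_years = df.loc[df["dept"] == "tech", "years"]  # rows and columns together

print(tech_years)
```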

From the template notebooks, be able to work with data using that code and understand the underlying concepts. The goal is that you can adapt the notebooks to different data, such as the homework assignments or data you encounter on the job. The worked problems for the homework are designed so that you do exactly the analyses I do in the notebooks.

Stat Background

In general, each week presents a conceptual overview of the statistical principles introduced that week, followed by a Python template that implements the concepts in a working Jupyter Notebook. This, however, is the only week in which new material is introduced without a supplemental online reading; the two Python templates for the week contain the needed explanation. Additional stat material, such as standardization (z-scores) and box plots, is included with the Python Notebook template.

If additional review is desired, optional review material for basic stats is referenced below. To understand machine learning you need to understand the concepts of the mean, the standard deviation, and normal curve probabilities. The following slides and associated videos explain these concepts beyond the explanations provided in the templates.

Ignore the R computer code in this supplementary material; it is not relevant for us, and the concepts remain the same. These pdfs/videos are not about coding, but about understanding some basic descriptive statistics.

Mean and Standard Deviation

Statistical Practice: The analysis of variability.

The analysis of variability is usually the analysis of variability about the mean. The concepts of the mean, and of the standard deviation that indicates variability about the mean, are crucial to understanding and doing data analysis. Section 2.1 presents these concepts. Without necessarily knowing all the details, you should be able to explain the meaning of the standard deviation and understand how that statistic assesses variability about the mean.

The standard deviation is the key statistic for assessing the variability of numerical data. Sec 2.1c of the posted slides and videos presents a concise definition and explanation of the standard deviation. You do not need to memorize formulas (Sec 2.1, #14) for this class. You should, however, understand that squared deviation scores are the basis for the standard deviation (Sec 2.1, #16). The most useful outcome in this course is understanding the meaning of a statistical result, as illustrated with Sec 2.1, Slide #26. On that slide, view the two distributions of test scores side-by-side. Identify the distribution with the larger deviation scores and, it follows, the larger standard deviation. Understanding what the concept of "standard deviation" means is a different skill, and a more important one, than memorizing a formula.
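The same side-by-side comparison can be run in code. Here is a minimal NumPy sketch, with hypothetical test scores standing in for the slide's distributions:

```python
# Compare the variability of two hypothetical test-score distributions.
import numpy as np

class_a = np.array([78, 80, 81, 79, 82])   # scores bunched near the mean
class_b = np.array([60, 95, 70, 90, 85])   # scores spread widely

for name, scores in [("Class A", class_a), ("Class B", class_b)]:
    deviations = scores - scores.mean()     # deviation scores about the mean
    sd = np.sqrt((deviations ** 2).mean())  # SD built from squared deviations
    print(name, "mean:", scores.mean(), "SD:", round(sd, 2))
```

Both classes have the same mean of 80, but Class B's larger deviation scores yield a much larger standard deviation, the kind of contrast the slide illustrates.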

2.1 Mean, Standard Deviation pdf [27]
2.1a [5:26]
2.1b [8:26]
2.1c [14:35]

Uncover Patterns Blurred by Sampling Instability

We cannot observe data without the influence of the random fluctuations that help shape the data values. A primary fact to understand in doing data analysis is that randomness is inherent in our data.

Randomness in Data

We only observe samples from populations, and the two concepts are related. Data analysis becomes the search to understand the patterns that underlie the observed data. These patterns become the basis for generalizing the results of our analyses beyond the one sample of data drawn from a population to the entire population.

Sampling variation is one of the most fundamental concepts in data analysis. Little in data analysis makes sense, beyond simple description of a sample, until the concept of sampling variation is understood. The material linked here explains sampling variation.

The basic concepts of samples from populations, sample statistics versus population parameters, and the relation between the standard deviation and the normal curve are crucial statistical concepts for understanding data analytics. You should have at least some understanding of these concepts. You need to understand the distinction between a sample statistic and the corresponding population value, such as between the sample mean, m, and the corresponding population mean, μ. Also, understand why we care much more about the population mean than the sample mean, to the extent that our analysis focuses on the population mean. Understand how the distinction between sample and population relates to collecting data and forming conclusions.

The basic concept that underlies statistical inference, that is, making conclusions about population values from data sampled from that population, is repeatedly taking samples from the population, all of the same size (the same number of observations). The abstract part of that realization is that when we draw conclusions about population values, we act as if we had taken many, many samples when, in fact, we have taken only one sample, the largest sample we could get. The mathematicians have given us formulas that show what would happen if we did take many, many samples. "What would happen" refers to the distribution of the statistic of interest, here the sample mean, over the many samples. Crucial to understand is that each sample would yield a different sample mean.
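A short simulation makes the idea concrete. This is only a sketch, assuming NumPy and an invented normal population; it draws many same-size samples and shows that each yields a different sample mean:

```python
# Simulate sampling variation: many samples from one population,
# each yielding a different sample mean (hypothetical population).
import numpy as np

rng = np.random.default_rng(42)
population = rng.normal(loc=100, scale=15, size=100_000)  # population with mu = 100

# Draw 1000 samples, each with the same n = 50 observations.
sample_means = [
    rng.choice(population, size=50, replace=False).mean()
    for _ in range(1000)
]

# Each sample gives a somewhat different mean; that spread is sampling variation.
print("first five sample means:", np.round(sample_means[:5], 2))
print("SD of the 1000 sample means:", round(float(np.std(sample_means)), 2))
```

The spread of those sample means is what the mathematicians' formulas describe, without our having to take the repeated samples.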

You need to understand the preceding paragraphs before you can understand data analysis. If you do not, what follows in this course is all gibberish. If you do, it all begins to make sense.

The Normal Curve

One of the fundamental concepts in statistics and data analysis is that 95% of normally distributed data values, for any normal distribution, lie within 1.96 standard deviations of the mean. Much of the understanding of the application of statistics to data analysis depends on knowing that basic fact. We will apply it several times throughout this course.
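You can verify that fact directly. A minimal sketch, assuming SciPy is available (the course templates may present this differently):

```python
# Verify: about 95% of a normal distribution lies within 1.96 SDs of the mean.
from scipy.stats import norm

# Area under the standard normal curve between z = -1.96 and z = +1.96.
p = norm.cdf(1.96) - norm.cdf(-1.96)
print(round(p, 4))  # 0.95

# A standard score (z-score) expresses a value in SD units from the mean.
# Hypothetical example: a score of 130 with mean 100 and SD 15.
z = (130 - 100) / 15
print(z)  # 2.0, beyond 1.96, so outside the middle 95%
```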

3.2 Normal Curve, Standard Scores, Probabilities pdf [41]
3.2a [11:32]
3.2b [11:59]
3.2c [13:31]