David Gerbing
The School of Business
Portland State University
2023-01-07
Much data analysis is directed toward finding the underlying signal, the structure, obscured by random error. This concept is further developed in the following content.
Video: Samples vs Populations [12:59]
One primary problem we encounter in doing data analysis is that not all the data of interest are available for our analysis. We are limited to analysis of only some of the observations of interest. For a study of the number of employees who call in sick on a Friday before a holiday Monday, the data may not exist for many years in the past and, regardless, do not exist at all for future such Fridays. A study of the blood chemistry of those who have Type II Diabetes cannot examine all people who have had, who do have, and who will have Type II Diabetes. The length of time to complete a specific procedure is of interest, but the times of past procedures may not have been recorded, and cannot have been recorded for future instances of the procedure.
The distinction here is the population of data values compared to a sample of data taken randomly from the full population. We want the entire population, but typically only get one sample.
Population: Set of all existing or potential observations.
Sample: A subset of the population, the data for a specific analysis.
To collect data, randomly gather a subset of the population, and generalize results to the entire population.
A population defined by a process, and a sample from the process, where each dot represents a single observation.
The population is of primary interest. So we must generalize beyond the usual one sample of data. The population contains the desired information. The population may consist of a usually large number of fixed elements in a given location at a given time, such as all registered voters. More generally, the population consists of outcomes of a process ongoing over time, in which case the population and its associated values such as its mean are hypothetical.
Population value: True value based on the entire population, such as the population mean.
A population value is not known directly because not all the values of the population are known. A population value is an abstraction, considered real, but not observed, with an unknown value. A primary reason to analyze data is to estimate a population value from sample data.
One type of data analysis refers only to the sample which is analyzed.
Descriptive statistics: Summarize and display aspects of the sample data drawn from the larger population.
Descriptive statistics are also referred to as summary statistics. Calculate descriptive statistics directly from the data. Examples of descriptive statistics include the mean and standard deviation. Another class of descriptive statistics includes the median, range, IQR, quartiles, and quantiles.
Compare estimation with calculation. The estimation of unknown population values follows from the calculation of the relevant descriptive statistics.
How are the elements in the sample selected from the larger population?
Random sample: A sample in which every value of the population has an equal probability of selection.
To select a random sample requires access to randomly generated numbers, today usually accomplished with a computer application such as R or Excel. A completely random sample is difficult to implement in practice, but randomness in the sampling process is essential to properly generalizing results to the population.
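For example, a minimal sketch in base R, assuming a hypothetical population of 1,000 numbered elements, illustrates computer-generated random selection.

```r
# hypothetical population: 1000 numbered elements
population <- 1:1000

# draw a simple random sample of 25 elements,
# each with an equal probability of selection
set.seed(123)   # set the seed only to make this illustration reproducible
my_sample <- sample(population, size = 25)
my_sample
```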
When drawing a sample from the larger population, distinguish between what is wanted vs what is obtained.
Sampling frame: The actual population from which the sample is drawn, distinguished from the desired population.
The results of an analysis can only be properly generalized to the sampling frame. The sampling frame determines the scope of generalization of results, not the desired population of interest per se. The sampling frame should be the population of interest, but sometimes the population actually sampled is not the population that was desired.
Consider an example of a limitation of generalizing sample results. Suppose a researcher conducts a market research survey. Define the population of interest as all city residents. Then draw a random sample of people listed in the phone book and collect data by calling those people from 9am to 5pm. The results of this analysis can only be properly generalized to people listed in the phone book who are available to answer a call between 9am and 5pm.
The sampling frame is not the population of interest, all city residents, so these survey results cannot be properly generalized to all city residents. Inappropriate generalization of the results from a sampling frame to the wrong population, usually the intended population, is one of the most egregious errors of data analysis. Unfortunately, this error is not too uncommon, nor can the usual statistical analysis of the data correct this problem. As the saying goes: Garbage in, garbage out.
Video: Sampling Fluctuations [6:16]
The sample of data values is only the starting point of statistical analysis. Statistics such as the sample mean, \(m\), provide a summary of this distribution of data values. The specific values in a sample differ from sample to sample. Accordingly, the value of a sample statistic of interest, such as \(m\), arbitrarily varies from sample to sample. Each sample outcome, such as \(m\), is an arbitrary result, which only partially reflects the true, underlying population value. Typically, only one sample is taken and so only one \(m\) is observed, but the following reality is the basic motivating concern addressed by statistical inference.
IF many samples were taken, then a different value of \(m\) WOULD BE observed for each sample.
Consider the simple scenario of flipping a fair coin. The randomness is apparent, but is there some stable aspect of reality that underlies this variation? The answer is yes. Sometimes obtain fewer Heads and sometimes more Heads, but the random variability of the number of Heads follows a specific pattern: the proportion of Heads over 10 flips tends to be around 50%. In general, you will get around 5 Heads on 10 flips of a fair coin.
To illustrate, flip a fair coin 10 times, a sample size of \(n=10\). Encode the outcome of each flip as a value of the variable Y. Score each Head as Y=1 and each Tail as Y=0.
Sample 1: Now get the first sample. Do the 10 flips.
Ten flips of the coin yield 5 Heads.
Calculate a sample mean, \(m\), for a given sample with the standard formula: \[m = \dfrac{\sum Y_i}{n}\]
Result: \(m_1 = \dfrac{0+0+1+1+0+1+1+0+0+1}{10} = .5\)
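The same calculation can be verified in R by entering the ten scored outcomes directly; the vector below simply repeats the values listed above.

```r
# Sample 1: ten coin flips scored as Head = 1, Tail = 0
Y <- c(0, 0, 1, 1, 0, 1, 1, 0, 0, 1)

# sample mean m = (sum of the Y values) / n
sum(Y) / length(Y)   # 0.5
mean(Y)              # same result with the built-in function
```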
Sample 2: Now gather another sample of size \(n=10\). Flip the same fair coin another 10 times. Again, score Y=1 for a Head and Y=0 for a Tail. Again, calculate \(m\) for \(n=10\).
Ten flips of the coin yield 6 Heads.
Result: \(m_2 = \dfrac{1+1+1+0+0+1+1+0+1+0}{10} = .6\)
We see that one sample of ten flips yields \(m_1=.5\). However, a second sample yields a different result, \(m_2=.6\).
As managers we are not concerned with coin flips per se, but we are all gamblers in the presence of uncertainty. The context of our decision making is this random variation about an underlying structure. The application of statistics to data analytics cannot provide certainty, but it can provide a tool to more effectively address uncertainty. We wish to construct a potentially useful form of knowledge, a model that delineates the underlying pattern from the observed random variation.
For example, move away from coin flips to a real-world application: the mean length of time required to complete a procedure. Assess the average length of time, across 10 people, required to complete an assembly.
Sample 1: \(m_1\) = 14.53 minutes.
Then, with the same tools, the same source of materials, and the same employees, repeat the measurement on a second assembly.
Sample 2: \(m_2\) = 13.68 minutes.
Therein lies one of the most important, central, and essential concepts of doing and understanding applications of statistics via data analysis. The variation of results across mundane samples of coin flips applies to random sampling in general, and all of our measurements are obtained from what are supposed to be random samples.
Sampling variability: The value of a statistic, such as the sample mean, \(m\), randomly varies from sample to sample.
Why? Sampling variation of a statistic results from random variation of the data values from random sample to random sample. A descriptive statistic only describes a sample, but is limited in its generality to describe the corresponding population value. The focus of statistical analysis of data is the underlying, stable population values, not statistics calculated from an arbitrary sample.
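A brief simulation sketch in base R illustrates sampling variability: rbinom() stands in for the fair coin, and each of the 1,000 replications yields its own value of \(m\) from a sample of \(n=10\) flips. The particular seed and number of replications are arbitrary choices for illustration.

```r
# simulate 1000 samples, each of n = 10 fair-coin flips (1 = Head, 0 = Tail),
# and compute the sample mean m for each sample
set.seed(123)
m_values <- replicate(1000, mean(rbinom(n = 10, size = 1, prob = 0.5)))

# the values of m vary from sample to sample around mu = 0.5
head(m_values)
summary(m_values)
```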
Video: Law of Large Numbers [16:52]
The population mean, \(\mu\), is an abstraction. The distinction between sample and population values is apparent from the way in which information about these values is obtained. Calculate sample statistics directly from the sample data. If the calculation is done correctly, the sample value is known exactly. In contrast, estimate the underlying reality, the population values.
For example, to understand reality, management wants to know the value of the population mean, \(\mu\). The difficulty is that \(\mu\) is an abstraction, a hypothetical, never directly calculated, and so must be estimated without knowing its precise value. To estimate the unknown requires information from which to provide the basis for the estimate.
Only one kind of information is considered in traditional statistical analysis, which employs what is called the classical or frequentist model of statistics,¹ a model that defines probability in terms of long-run averages. The classical model specifies that information regarding a population value, such as \(\mu\), is obtained only from data, preferably lots of data.
More information, more data randomly sampled from the same process, generally provides better estimation of underlying population values.
The data used to estimate a population value of a process or population, must, reasonably enough, all be generated from the same process. Further, the sampling process must be random.
Consider a coin flip in which the truth is that the coin is fair. How do we express the underlying reality, the true structure? The coin is fair, so unlike the typical data analysis, here directly compute the reality, the value of \(\mu\), from the scoring protocol of Y=1 or Y=0 that the coin is fair. The probability of a Head (1) on a given toss is \(\frac{1}{2}\), as is the probability of a Tail (0). \[\mu=\frac{1}{2}(1) + \frac{1}{2}(0)=0.5\] The question of interest here is if data analysis can reveal this reality of a fair coin.
The underlying reality results in the long-term outcome of 50% Heads, an expression of a simple model. Random variability, however, obscures knowledge of the underlying reality. To maximize the chances of winning the most favorable outcome on a single coin flip, a gambler needs to understand the underlying reality of what happens over the long term of many coin flips. Any one specific sample result does not, by itself, reveal the underlying reality. For example, flipping a fair coin 10 times yields exactly 7 Heads in about 11.7% of such samples, a result far from the true reality of a fair coin but not so rare either.
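The 11.7% figure follows from the binomial distribution, which can be verified with a one-line calculation in R.

```r
# probability of exactly 7 Heads in 10 flips of a fair coin
dbinom(7, size = 10, prob = 0.5)   # approximately 0.117
```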
Betting on a 50% chance of getting a Head with a fair coin will generally provide better outcomes than betting with a false understanding of reality, such as believing that there is a 70% chance of getting a Head just because one sample of 10 flips resulted in 7 Heads.

Now we explore the relation between the underlying pattern, here \(\mu=0.5\), and the size of the sample used to estimate that pattern. To illustrate the usefulness of larger samples, successively plot the mean over many sample sizes. Generate the data by flipping a coin a specified number of times and recording a Head or a Tail after each flip. Record the data as values of a variable Y: score each Head as Y=1 and each Tail as Y=0. Over the entire population, one-half of all flips result in a 1 and one-half result in a 0.
Running Mean: For a sequence of data values, re-calculate the sample mean, \(m\), after each new data value is collected.
The resulting plot of the running mean shows the value of the sample mean, \(m\), as the sample size increases.
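As a sketch of the idea in base R, a running mean can be computed with cumsum(); this is only an illustration of the definition, not the lessR function introduced below.

```r
# simulate 250 fair-coin flips (1 = Head, 0 = Tail)
set.seed(123)
n <- 250
Y <- rbinom(n, size = 1, prob = 0.5)

# running mean: m recalculated after each new flip
running_mean <- cumsum(Y) / seq_len(n)

# plot the running mean against the number of flips
plot(running_mean, type = "l",
     xlab = "Number of flips", ylab = "Running mean, m")
abline(h = 0.5, lty = "dashed")   # the population value, mu = 0.5
```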
Here we simulate repeated coin flips. Anyone with a coin can conduct the analysis and then plot the running mean over as many flips as desired. Of course, such a task is exceedingly tedious. Fortunately, the computer provides the data as if we had many coin flips.
Question of interest: How close is \(m\), based on data, to the truth of \(\mu=0.5\), as the sample size increases?
Computer simulation allows us to easily explore what happens for different sample sizes. Here the simulation is of different numbers of coin flips, from 1 to the specified maximum number of flips. A different result follows each time the simulation is run.
The lessR function simFlips() provides the simulation, plotting the running mean as the sample size increases, as if repeatedly flipping the same coin. Key parameters and options:

- n: the maximum number of flips
- prob=.5: the probability of a Head, that is, is the coin fair?
- ?simFlips lists all options, e.g., pause=TRUE to pause after each flip

Mirroring the reality of actually flipping a real coin, every time the simulation is performed, even for the same sample size, generally obtain a different result.
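For example, a minimal call with the parameters listed above, assuming the lessR package is installed:

```r
library(lessR)

# simulate 10 flips of a fair coin and plot the running mean
simFlips(n = 10, prob = .5)
```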
Begin with a simulated sequence of only 10 flips, with the result in Figure @ref(fig:n10sim).
Sequence of Heads (1) and Tails (0) for 10 coin flips.
The first (simulated) coin flip was a Tail, and the second flip was a Head. These 10 flips resulted in 4 Heads, so the estimate of \(\mu = 0.5\) is the value of the sample mean, \(m = 0.4\). After 10 coin flips:
\[{m} = \dfrac{\sum Y_i}{n} = \dfrac{\textrm{Number of Heads}}{\textrm{Number of Flips}}= \dfrac{4}{10}=0.4\]
Now consider 250 flips (and a certain appreciation that the computer will do the work of all those “flips” of the coin for us). The simulation result is in Figure @ref(fig:n250sim).
Sequence of Heads (1) and Tails (0) for 250 coin flips.
The value of \(m\) is relatively unstable for the first 50 or so coin flips, and then after 200 or so flips, the estimate of \(\mu=0.5\) nicely settles down. After 250 coin flips:
\[{m} = \dfrac{\sum Y_i}{n} = \dfrac{\textrm{Number of Heads}}{\textrm{Number of Flips}} = \dfrac{117}{250}=0.468\]
Note that the estimate of \(\mu\), which is the sample mean, \(m\), is closer to \(\mu=0.5\) at the end of 250 flips than at the end of a sample of 10 flips.
Now consider a much larger sequence of simulated coin flips, \(n=10,000\). The simulation result is in Figure @ref(fig:n10000sim).
Sequence of Heads (1) and Tails (0) for 10,000 coin flips.
Again, there is much fluctuation in the value of \(m\) for small sample sizes, but much stability in the estimate as the sample size continues to grow. After 10,000 coin flips:
\[{m} = \dfrac{\sum Y_i}{n} = \dfrac{\textrm{Number of Heads}}{\textrm{Number of Flips}}= \dfrac{5112}{10,000}=0.511\]
Even after 10,000 coin flips the true value of \(\mu\) is not attained, but a closer estimate resulted.
How close is the estimate to the actual population value? Not a big surprise here: We conclude that the more information, the better the estimation.
Generalize information obtained from a sample, such as the sample mean, to the population as a whole.
Sample the data values from the same population so that a common \(\mu\) and \(\sigma\) underlie each data value \(Y_i\).
Law of Large Numbers: The larger the random sample of data values all from the same population, the closer the estimate tends to be to the underlying population value.
This result of large samples is only a tendency. There is no guarantee that a statistic from any one larger sample yields a closer estimate to the population value, but generally, more data provides better estimation.
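A short base R sketch illustrates this tendency across increasing sample sizes; the seed and the particular sample sizes are arbitrary, and any single run may still show a larger sample landing farther from \(\mu\).

```r
# estimate mu = 0.5 from increasingly large random samples of coin flips
set.seed(123)
for (n in c(10, 250, 10000)) {
  m <- mean(rbinom(n, size = 1, prob = 0.5))
  cat("n =", format(n, width = 5), "  m =", m, "  |m - 0.5| =", abs(m - 0.5), "\n")
}
```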
¹ The primary competing model is the Bayesian model, in which other types of information are also considered in the estimation of a population value.