Our primary goal this first week is to become comfortable using the computer for data analysis. We will apply these computer skills to simple data analyses: reading data and counting the data values with bar charts and histograms. When you collect data, such as from a survey, the most basic question is how people responded to each item, for example, how many people agree and how many disagree? The answer is to count the number of responses for each potential answer. The statistical display of these counts is a bar chart for a non-numeric variable defined by categories, such as Gender, and a histogram for a numeric variable, such as Income.
Because counting the data values, such as the responses to questions on a survey, is the first and most basic form of analysis in every data analysis project, both Track ABC and Track C students do all of the homework for this week.
The R instructions for the first homework follow.
R is available for free and runs on any Windows, Macintosh or Linux/Unix computer.
You can also find R on any of the computers in the SBA computer labs.
Run R as you would any app on your computer [Text, Sec 1.2.2, p4; video (3:26 to 4:28)]. You are offered both 32-bit and 64-bit versions. Most people run the 64-bit version unless their computer is many years old, but the choice of version is irrelevant for this class.
All analyses in R are accomplished with functions. To do an analysis, invoke (that is, call) a function.
Function: A procedure for transforming input data values into output, an analysis of the data. Each function has a name, such as Histogram for the computation of a histogram. To call a function, that is, to do the analysis, in response to the R command prompt, >, enter the function name, a (, an argument such as the variable name if analyzing data values directly, and then a ).
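As a minimal illustration of this calling pattern, the sketch below calls two functions that ship with R itself, sqrt and mean, chosen here only as examples with made-up values. The lessR functions are called in the same way.

```r
# Call a function: name, open parenthesis, argument, close parenthesis
root <- sqrt(16)      # the argument is the single value 16
root

# The argument can also be the name of a variable that holds data values
Y <- c(2, 4, 6, 8)    # hypothetical data values
avg <- mean(Y)
avg
```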
The first task when you run R is to obtain and then load my lessR functions into memory. The current version as of Oct 29, 2016 is lessR 3.5.1. To access lessR, when running R enter the following two function calls: [Text, Sec 1.2.3, 1.3.1, p5-6; video (4:38 to 7:22)].
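A minimal sketch of the two calls, assuming lessR is installed from the standard CRAN repository with the usual base R functions. The first call is needed only once per computer; the second is needed at the start of each R session.

```r
# Download and install lessR from CRAN (needed only once per computer)
install.packages("lessR")

# Load the lessR functions into memory (needed each time you start R)
library(lessR)
```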
Adding the lessR functions to the R environment simply adds about 43 functions to the many hundreds of functions that already exist in R. You work in the same R environment and have access to the same R functions when using lessR. The only distinction is that with the lessR functions available, R becomes much simpler to use because each lessR function replaces several to many R functions.
Data analysis begins with, well, data. The data for our applications are always organized into the form of a rectangular data table with the data values for each variable in a column and the name of the variable at the top of the column in the first row. To know how to use any data analysis system such as R, first you must understand how the data are organized [Text, Sec 1.6.1, p21; video (6:56)].
The data table may be stored in a variety of formats. The formats we will encounter in this course are Excel files, indicated by a file type of .xlsx, and text files in the form of csv or comma separated value files, indicated by a file type of .csv [Text, Sec 1.6.4, 1.6.5, p24,25].
When reading the data with the Read function, you can specify the location of the data file to be read within quotes and parentheses. Or, to instruct R to let you browse for the data file on your computer system, leave the location blank, as in the empty parentheses (), [Text, Sec 2.2.1, p33, now also applies to Excel files; video (12:03)].
Both of these Read statements read the data stored as a rectangular data table from an external file into an R data table called mydata.
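The two forms of the call can be sketched as follows. The lessR calls appear only as comments because they require lessR to be loaded and a data file on your computer; the file name shown is hypothetical. The part that runs uses the base R function read.csv as a stand-in, with a small made-up table written to a temporary file, only to show the rectangular layout with the variable names in the first row.

```r
# With lessR loaded, the two forms described above (not run here):
#   mydata <- Read("Employee.csv")   # hypothetical file name, in quotes
#   mydata <- Read()                 # empty parentheses: browse for the file

# Base R stand-in: write a tiny made-up data table to a temporary csv file
csv_file <- tempfile(fileext = ".csv")
writeLines(c("Name,Gender,Salary",
             "Ann,Female,52000",
             "Bob,Male,48000",
             "Cho,Female,61000"), csv_file)

# Read the file into an R data table (data frame) named mydata
mydata <- read.csv(csv_file)

names(mydata)   # variable names, taken from the first row of the file
nrow(mydata)    # number of rows of data values
```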
From the output of Read, identify the list of variable names, initially specified in the first row of the data file, as well as the type of variable, continuous or categorical. All data analysis in R follows from the names of the relevant variables, illustrated in the following function calls to the generic variable Y. For a specific analysis, replace Y in the function call with the actual variable name to accomplish the analysis, such as Salary, a continuous variable, or Gender, a categorical variable. [Text, Sec 2.2.2, 2.2.3, p33-35]
Distribution of a categorical variable, that is, a variable with only a relatively few, non-numeric categories.
Distribution of a continuous variable, that is, a variable with many possible numeric values.
Get Help.
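A sketch of the counting that underlies both displays, using made-up data values and base R functions so that it runs without lessR. With lessR loaded, the corresponding charts would be requested with the lessR functions BarChart and Histogram, shown here only as comments. The bin boundaries are an assumption chosen for this example.

```r
# Made-up data values for one categorical and one continuous variable
Gender <- c("F", "M", "F", "F", "M", "F")
Salary <- c(42, 55, 61, 48, 73, 58)   # in thousands of dollars

# Categorical variable: count the responses in each category
gender_counts <- table(Gender)
gender_counts

# Continuous variable: count the values that fall into each bin
# Hypothetical bins: 40 to 50, 50 to 60, 60 to 70, 70 to 80
salary_hist <- hist(Salary, breaks = c(40, 50, 60, 70, 80), plot = FALSE)
salary_hist$counts

# With lessR loaded, the charts themselves would be drawn with
#   BarChart(Gender)
#   Histogram(Salary)
```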
The full set of R Instruction Videos [1:00:57] is available, but the relevant links to this week's material are referenced above. If you would like some practice, follow along with the video doing the analyses. Do so with the following data. Also available is a summary of all R function calls used in this course, including the download instructions, Summary of R instructions.
Note: The slides sometimes present more detail than is needed for this class; that detail is there for those with a deeper interest in data analysis. Assess your own understanding against the criterion of doing the homework. If you can do the homework, including being able to define the listed concepts, then you understand what is required in this course.
| 1.1 | Statistics, Variables, Data | pdf [37] | 1.1b [14:04], 1.1c [11:11] |
| 1.2 | Using the Computer (and/or view videos of actual R sessions linked below) | pdf [19] | 1.2a [13:40], 1.2b [9:11] |
| 1.3 | Bar Chart, Histogram | pdf [32] | 1.3a [10:35], 1.3b [7:41], 1.3c [12:20], 1.3d [1:59] |
Gerbing, R Data Analysis without Programming for more examples and more in-depth understanding.
The homework assignments are the criterion by which you are evaluated on the tests and your project. These assignments are the study guide for the tests and project. If you know how to answer these questions, then you will do well in this course.
As you read the posted pdf's and watch the videos, you may choose to proceed through them sequentially, as when reading a book or watching a movie, or use them as a reference tool, looking up the information you need to answer the questions as you go. The key is to adopt a strategy that allows you to successfully answer the questions asked in the homework, which are the same questions asked in the tests and project, just applied to different data.
As stated in the syllabus, regardless of Track ABC or Track C, these homework assignments are required for you to proceed in the course. If you cannot make the due date for any assignment, please contact me in advance so we can discuss how to address the situation. Only for this week Track ABC and Track C students do the same homework assignment. This first week presents the content for one of the primary data analyses needed by Track C students for their much simplified analysis of their Project data. Track ABC students will also do this basic analysis of each question (variable) on their survey, and much more as in the real world of marketing research.
Homework #1
Due in the D2L Dropbox Sunday, Jan 15 at 11:59pm, at the end of Week 1. As specified in the syllabus, the homework is not graded, but is checked for reasonable completeness, with full solutions to be provided.
Make sure to read the note regarding Video Chats posted on the home page of this site and on the home page of D2L.
In the solutions I show you an improved way to read data into R with RStudio.
Before the concept of a confidence interval can be understood, some basic concepts must first be understood. This is not a course in statistics, and you will not be tested on or need to know all of the posted material as you would in a statistics course. What you need to focus on, as always in this class, are the skills you need to be able to do your project. For this material, that means the general concepts. And, all of the material for this week reviews material you have already covered in your statistics prerequisite.
For example, statistics is the analysis of variability, and the standard deviation is the key statistic we use to assess variability. Sec 2.1c of the posted slides and videos presents a concise definition and explanation of the standard deviation. The concept cannot be explained with fewer words. However, success in this class does not even require complete knowledge of this material. For example, you do not need to memorize formulas (Sec 2.1, #14) for this class. At the same time, you should understand that the standard deviation is based on squared deviation scores (Sec 2.1, #16). The most important concept in this course is the meaning of a statistical outcome, such as illustrated with Slide #26. If I showed you the two distributions of test scores on that slide, you should be able to tell which distribution had the largest deviation scores and the largest standard deviation. That is, you understand what the concept of "standard deviation" means, which is different from memorizing a formula.
A large part of data analysis is the analysis of variability about the mean. These concepts are presented in Section 2.1.
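To make the idea of squared deviation scores concrete, here is a short worked sketch in base R with hypothetical test scores. It builds the standard deviation step by step and then checks the result against R's built-in sd function.

```r
# Hypothetical test scores
Y <- c(70, 75, 80, 85, 90)

m <- mean(Y)            # the mean
dev <- Y - m            # deviation scores: distance of each value from the mean
sq_dev <- dev^2         # squared deviation scores

# Variance: sum of squared deviations divided by n - 1
variance <- sum(sq_dev) / (length(Y) - 1)

# Standard deviation: square root of the variance
s <- sqrt(variance)
s

sd(Y)                   # R's built-in function gives the same value
```

Scores spread farther from the mean yield larger deviation scores and so a larger standard deviation.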
| 2.1 | Mean, Standard Deviation | pdf [27] | 2.1a [5:26], 2.1b [10:02], 2.1c [16:37] |
Section 3.1 here provides the basic concepts regarding sampling. The topic is so important because every marketing research study that involves data is based on a sample of data from a larger population. Section 3.2 goes into more detail than needed, and the pdf contains Subsection 3.2d on probability distributions, which we do not have time to cover in this course in even a cursory manner. It is included in the pdf file for those interested.
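Sampling fluctuation can be seen directly with a small simulation. The sketch below is only an illustration in base R; the population mean and standard deviation, the sample size, and the number of samples are all made-up values.

```r
set.seed(1)   # fix the random numbers so the sketch is repeatable

# Hypothetical population: mean 100, standard deviation 15
# Draw 1000 samples of size 25 and record the mean of each sample
sample_means <- replicate(1000, mean(rnorm(25, mean = 100, sd = 15)))

range(sample_means)   # the sample mean differs from sample to sample
sd(sample_means)      # yet the means vary much less than individual values
```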
| 3.1 | Populations, Sampling Fluctuations, Inference [a-c only, not d] | pdf [36] | 3.1a [13:05], 3.1b [6:13], 3.1c [16:52] |
| 3.2 | Normal Curve, Standard Scores, Probabilities | pdf [41] | 3.2a [11:45], 3.2b [12:11], 3.2c [13:31] |
The application of the confidence interval of the mean to provide guidance for a management decision is one of the most important concepts in data analysis. The key skill emphasized here is to apply statistical inference as an aid to management decision making. As such, this section is only for Track ABC students.
To obtain the confidence interval, use the lessR function ttest, abbreviated tt. Here we only need the brief form, tt.brief, which provides much less output than the full function. Just specify the variable name if the analysis directly analyzes the data, such as tt.brief(Y) for a variable named Y, or specify n, m, and s if the analysis proceeds from the three summary statistics: the sample size, sample mean and sample standard deviation.
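For those who want to see the arithmetic behind the interval, here is a minimal base R sketch that computes a 95% confidence interval of the mean from the three summary statistics. The values of n, m and s are hypothetical, and the lessR call in the final comment is a sketch of the summary-statistics form described above, not output from an actual session.

```r
# Hypothetical summary statistics
n <- 25     # sample size
m <- 100    # sample mean
s <- 15     # sample standard deviation

se <- s / sqrt(n)                 # standard error of the mean
t_cut <- qt(0.975, df = n - 1)    # t cutoff for 95% confidence

lower <- m - t_cut * se
upper <- m + t_cut * se
c(lower, upper)

# With lessR loaded, the corresponding call would take the form
#   tt.brief(n=25, m=100, s=15)
```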
Textbook Chapter: This chapter represents the next developmental step of this material from the slides/videos and so can either completely replace, or complement, all of the slides for this week in either their pdf or video format.
The primary material here for this course is Sec 4.1 of the book chapter, the basic introduction with a brief example, and Sec 4.4, a more in-depth example. The intermediary material, Secs 4.2 and 4.3, is mostly there for those interested.
These slides and videos are retained here for an alternative presentation of the material, but all that is needed are the readings from the provided book chapter.
| 4.1 | Assess Sampling Variability | pdf [18] | 4.1a [2:11], 4.1b [9:44] |
| 4.2 | Range of Estimation Error | pdf [29] | 4.2a [12:07], 4.2b [12:09], 4.2appendix [10:42], CLT [simulation] |
| 4.3 | Confidence Interval of the Mean | pdf [24] | 4.3a [18:25], 4.3b [11:43], 4.3c [20:06] |
Gerbing, R Data Analysis without Programming for more examples and more in-depth understanding.
Solutions #2
In this section hypothesis testing is introduced as the second form of statistical inference, and then related to the confidence interval, the first form. The worked homework problems provide a general template for statistical inference of the mean that integrates both hypothesis testing and the confidence interval.
Again use the lessR function ttest, or its simpler form, tt.brief. Now add the option mu0, which specifies a hypothesized value. For example, for variable Y with a hypothesized mean of 10, tt.brief(Y, mu0=10).
To plot the data values in the order they appear in the data file, use the lessR function LineChart, or just add line.chart=TRUE to the tt.brief function call, such as
tt.brief(Y, mu0=10, line.chart=TRUE).
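The logic of the test can be sketched in base R with made-up data values. The code below uses the base R function t.test as a stand-in for the lessR call, so its output format differs from what tt.brief displays; it also shows how the test and the confidence interval agree.

```r
# Hypothetical data values for variable Y
Y <- c(12, 9, 14, 11, 13, 10, 15, 12)
mu0 <- 10    # hypothesized population mean

# t statistic: how many standard errors the sample mean lies from mu0
se <- sd(Y) / sqrt(length(Y))
t_value <- (mean(Y) - mu0) / se
t_value

# Base R's t.test reports the same statistic, the p-value,
# and the 95% confidence interval
result <- t.test(Y, mu = mu0)
result$statistic
result$p.value
result$conf.int

# The two forms of inference agree: mu0 lies inside the 95% interval
# exactly when the p-value exceeds 0.05
inside <- mu0 >= result$conf.int[1] && mu0 <= result$conf.int[2]
inside
```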
| 5.1 | Hypothesis Test of the Mean | pdf [21] | 5.1a [12:27], 5.1b [5:57], 5.1c [7:45] |
| 5.2 | Conduct the Hypothesis Test | pdf [31] | 5.2a [8:04], 5.2b [16:26], 5.2c [18:24] |
| 5.3 | Hypothesis Test, Confidence Interval | pdf [6] | 5.3 [7:27] |
Gerbing, R Data Analysis without Programming for more examples and more in-depth understanding.
This homework introduces a real survey data set, similar to the kind of data you will be analyzing for your project. Again, all the course content and homework questions are designed to prepare you for the tests and, ultimately, your project (at either the Track ABC or Track C level). All of the homework assignments from here through the end of the course involve the analysis of survey data. When you do your project, you just need to apply the skills needed to do your homework assignments. That is one reason why most people get over 90% on the project, a grade of A or A-.
Solutions #3
One way to compare groups is to compare their respective means on a variable. At a given firm, how does the average Salary for Men compare to the average Salary for Women? The analysis of the mean difference is the comparison of the means of a numerical variable of interest, called the response variable, across two different groups. The two groups define the data values for a variable, called the grouping variable. For example, the data values Male and Female define a grouping variable called Gender. The purpose of the analysis is to investigate how the value of the response variable Y relates to the level of the grouping variable X. On average, do the women at the firm make less, the same, or more than the men?
A second method to compare groups is possible if each data value in one group matches a data value in a second group, such as a husband's score and the wife's score on a survey of marital satisfaction. Instead of analyzing the mean difference, directly analyze the differences between the matched data values. For the population analyzed, on average, are the husbands more, the same or less satisfied than their wives?
One issue is to detect a difference in the means of the response variable between groups. An additional issue is whether the differences in the level of the grouping variable at least partially caused the resulting differences in the response variable [e.g., Sec 6.6a, #3, Sec 6.6b, #9]. If a difference between the means of Salary across Males and Females is detected, what is the reason for this difference? In other words, correlation is not causation. Is the detected difference due to Gender, that is, discrimination, or are there other explanations for the observed difference? Is the difference due to direct causality, or is it a spurious relationship due to a confounding variable [e.g., Sec 6.6a, #4]? To observe a difference does not imply that the level of the grouping variable caused the difference.
The experiment [e.g., Sec 6.6c, #12] is the premier method for obtaining sufficient control to eliminate the potential impact of confounding variables, by manipulation and by randomizing respondents (people) to obtain equivalent groups [e.g., Sec 6.6b, #10]. Only when alternative explanations due to confounding variables are eliminated, or at least minimized, can the researcher conclude causality, that the value of the grouping variable directly influences the response variable. Unfortunately an experiment cannot always be implemented, such as in studies regarding gender differences. In the absence of randomization to obtain equivalent groups, attempt the next-best alternative, the quasi-experiment [e.g., Sec 6.6d, #29]. The issue here is methodological: it depends on the methods used to collect the data. Without some kind of experimental (or, as we see later, statistical) control, differences can be detected, but not explained.
We will use the brief output of the analysis of group means provided by the lessR function tt.brief. The format, for a model specified by continuous (numerical) response variable Y, and grouping variable X with two categories (groups), is tt.brief(Y ~ X) [e.g., Sec 6.1, #31]. The tilde, "~", means "explained by", that is, it specifies a model that explains variation in Y in terms of one or more other variables called predictor variables. In this situation, there is just one predictor variable, a grouping variable X with exactly two unique values. For example, explain variation in Salary in terms of two Genders, Male or Female.
To directly compare the differences between matched data values for two groups on a variable of interest, separate the variables by a comma and add the paired=TRUE option, such as tt.brief(X1, X2, paired=TRUE) for matched variables X1 and X2 [e.g., Sec 6.4, #5]. Use the comma here instead of a tilde because neither variable is used to explain the values of the other variable. Operationally, this means that the variables X1 and X2 can be listed in any order.
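Both forms of the call can be sketched in base R with made-up data. The base R function t.test accepts the same tilde notation for independent groups and the same comma-separated form for paired data; it is used here only as a stand-in, so its output and default settings may differ from those of the lessR function.

```r
# Hypothetical data: Salary (in thousands) for two Gender groups
d <- data.frame(
  Salary = c(52, 61, 58, 49, 66, 70, 63, 75),
  Gender = c("F", "F", "F", "F", "M", "M", "M", "M")
)

# Independent groups: response variable ~ grouping variable
ind <- t.test(Salary ~ Gender, data = d)
ind$estimate     # the two group means

# Matched data values: hypothetical satisfaction scores for four couples
Husband <- c(7, 5, 8, 6)
Wife    <- c(6, 6, 9, 8)

# Paired analysis: the two variables are separated by a comma
pair1 <- t.test(Husband, Wife, paired = TRUE)
pair2 <- t.test(Wife, Husband, paired = TRUE)   # same variables, reversed

# Reversing the order changes only the sign of the t statistic
pair1$statistic
pair2$statistic

# The paired test is a one-sample test of the matched differences
diffs <- t.test(Husband - Wife, mu = 0)
diffs$statistic
```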
Excel Template for Inference of the Mean Difference
| 6.1 | The Mean Difference | pdf [45] | 6.1a [15:35], 6.1b [9:21], 6.1c [6:40], 6.1d [15:53] |
| 6.4 | Paired Analysis | pdf [15] | 6.4a [6:29], 6.4b [10:00] |
| 6.6 | Causality and Experiments | pdf [37] | 6.6a [11:50], 6.6b [7:26], 6.6c [15:59], 6.6d [14:15], 6.6e [10:43] |
Gerbing, R Data Analysis without Programming, Routledge Publishing, 2013: Sec 6.3, 6.4
Homework #4
Solutions #4
A downloadable Word document in which you provide your answers directly on the test. Due at the end of Week 5, Tuesday, Feb 9 at 11:59pm. Track C students only do the basic descriptive statistics, including bar charts and histograms.
Timed Multiple-Choice and/or Short-Answer questions from D2L are administered anytime on Fri, Feb 5 and Sat, Feb 6. The questions are very closely based on the concepts presented at the beginning of each homework.
Each person takes a different test as the items for each test are randomly selected from a larger pool. All questions are based on the concepts provided at the beginning of each homework, which serve as the study guide. The questions appear in two sections. Track C students only do the questions on descriptive statistics and general marketing research. That section is labeled All Students. A second section of the test, labeled Track ABC students, is, well, just for Track ABC students.