MKTG 460

Week 1: Basic Stats, Data Analysis with R

Overview

Our primary goal this first week is to become comfortable using the computer for data analysis. We will apply these computer skills to simple data analyses: reading data and counting the data values with bar charts and histograms. When you collect data, such as from a survey, the most basic question is how people responded to each question. For example, how many people agree, and how many disagree? To answer that question, count the number of responses for each potential answer. The statistical analysis of counts is a bar chart for a non-numeric variable defined by categories, such as Gender, and a histogram for a numeric variable, such as Income.
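As a minimal sketch of what counting the data values looks like, the example below uses base R functions (table, barplot, hist) rather than the lessR functions covered in the homework instructions. The variable names and data values are made up for illustration only.

```r
# Hypothetical survey responses, invented for this illustration
Gender <- c("Female", "Male", "Female", "Female", "Male", "Female")
Income <- c(42, 55, 61, 38, 72, 49)  # in thousands of dollars

# Count the number of responses in each category
gender_counts <- table(Gender)
print(gender_counts)

# Bar chart: one bar per category, bar height is the count
barplot(gender_counts, main = "Gender", ylab = "Count")

# Histogram: counts of a numeric variable within bins of width 10
income_hist <- hist(Income, breaks = seq(30, 80, by = 10), main = "Income")
print(income_hist$counts)
```

The bar chart counts values of a categorical variable directly, whereas the histogram must first group the numeric values into bins before counting.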

Because counting the data values, such as the responses to questions on a survey, is the first and most basic form of data analysis in every data analysis project, both Track ABC and Track C students do all of the homework for this week.

R for Data Analysis

The R instructions for the first homework follow.

Download and Run

R is available for free and runs on any Windows, Macintosh or Linux/Unix computer.

Week 2a: Prerequisite Material for Confidence Intervals

Overview

Before the concept of a confidence interval can be understood, some basic concepts must first be understood. This is not a course in statistics, and you will not be tested on, or need to know, all of the posted material as you would in a statistics course. What you need to focus on, as always in this class, are the skills you need to be able to do your project. For this material, that means the general concepts. Also, all of the material for this week reviews material you have already covered in your statistics prerequisite.

For example, statistics is the analysis of variability, and the standard deviation is the key statistic we use to assess variability. Sec 2.1c of the posted slides and videos presents a concise definition and explanation of the standard deviation. The concept cannot be explained with fewer words. However, success in this class does not even require complete knowledge of this material. For example, you do not need to memorize formulas (Sec 2.1, #14) for this class. At the same time, you should understand that the standard deviation is based on squared deviation scores (Sec 2.1, #16). The most important concept in this course is the meaning of a statistical outcome, such as illustrated with Slide #26. If I showed you the two distributions of test scores on that slide, you should be able to tell which distribution had the larger deviation scores and the larger standard deviation. That is, you understand what the concept of "standard deviation" means, which is different from memorizing a formula.
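To make the idea of squared deviation scores concrete, here is a small base R sketch. The two sets of test scores are invented for illustration and are not the distributions shown on the slide.

```r
# Two hypothetical sets of test scores with the same mean
narrow <- c(70, 75, 80, 85, 90)
wide   <- c(60, 70, 80, 90, 100)

# Standard deviation built step by step from deviation scores
sd_by_hand <- function(y) {
  deviations <- y - mean(y)          # distance of each score from the mean
  squared <- deviations^2            # squared deviation scores
  variance <- sum(squared) / (length(y) - 1)
  sqrt(variance)
}

s_narrow <- sd_by_hand(narrow)
s_wide   <- sd_by_hand(wide)

# The step-by-step result agrees with R's built-in sd function
print(all.equal(s_narrow, sd(narrow)))
print(c(narrow = s_narrow, wide = s_wide))
```

The scores in the second set lie farther from their mean, so their deviation scores, and therefore their standard deviation, are larger.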

Standard Deviation

A large part of data analysis is the analysis of variability about the mean. These concepts are presented in Section 2.1.

2.1 Mean, Standard Deviation pdf
[27]
2.1a
[5:26]
2.1b
[10:02]
2.1c
[16:37]
 

Uncover Pattern Blurred by Sampling Instability

Section 3.1 here provides the basic concepts regarding sampling. The topic is so important because every marketing research study that involves data is based on a sample of data from a larger population. Section 3.2 goes into more detail than needed, and the pdf contains Subsection 3.2d on probability distributions, which we do not have time to cover in this course in even a cursory manner. It is included in the pdf file for those interested.

3.1 Populations, Sampling Fluctuations, Inference [a-c only, not d] pdf
[36]
3.1a
[13:05]
3.1b
[6:13]
3.1c
[16:52]
3.2 Normal Curve, Standard Scores, Probabilities pdf
[41]
3.2a
[11:45]
3.2b
[12:11]
3.2c
[13:31]

Week 2b: Confidence Interval of the Mean

Overview

The application of the confidence interval of the mean to provide guidance for a management decision is one of the most important concepts in data analysis. The key skill emphasized here is to apply statistical inference as an aid to management decision making. As such, this section is only for Track ABC students.

R Functions

To obtain the confidence interval, use the lessR function ttest, or its abbreviation tt. Here we only need the brief form, tt.brief, which provides much less output than the full function. Just specify the variable name if the analysis directly analyzes the data, such as tt.brief(Y) for a variable named Y, or specify n, m, and s if the analysis proceeds from the three summary statistics: the sample size, sample mean, and sample standard deviation.
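For reference, the following sketch shows the arithmetic behind a 95% confidence interval of the mean, computed from the three summary statistics and then directly from data. It uses base R (qt and t.test) instead of lessR, and the numbers are hypothetical.

```r
# From summary statistics (hypothetical values)
n <- 25    # sample size
m <- 50    # sample mean
s <- 10    # sample standard deviation

se <- s / sqrt(n)                 # standard error of the mean
t_cut <- qt(0.975, df = n - 1)    # t cutoff for 95% confidence
lower <- m - t_cut * se
upper <- m + t_cut * se
print(c(lower = lower, upper = upper))

# Directly from the data values (hypothetical data)
Y <- c(48, 52, 55, 47, 50, 53, 49, 51)
ci_data <- t.test(Y)$conf.int
print(ci_data)
```

The interval is the sample mean plus and minus a margin of error, where the margin of error is the t cutoff multiplied by the standard error.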

Content

Textbook Chapter: This chapter represents the next developmental step of this material from the slides/videos and so can either completely replace, or complement, all of the slides for this week in either their pdf or video format.

Gerbing Textbook Chapter 4 for Week 2  [Feedback welcome and encouraged]

The primary material here for this course is Sec 4.1 of the book chapter, the basic introduction with a brief example, and Sec 4.4, a more in-depth example. The intermediary material, Secs 4.2 and 4.3, is mostly there for those interested.

These slides and videos are retained here for an alternative presentation of the material, but all that is needed are the readings from the provided book chapter.

4.1 Assess Sampling Variability pdf
[18]
4.1a
[2:11]
4.1b
[9:44]
   
4.2 Range of Estimation Error pdf
[29]
4.2a
[12:07]
4.2b
[12:09]
4.2appendix
[10:42]
CLT
[simulation]
4.3 Confidence Interval of the Mean pdf
[24]
4.3a
[18:25]
4.3b
[11:43]
4.3c
[20:06]
 

Readings from Textbook

Optional Reading

Gerbing, R Data Analysis without Programming for more examples and more in-depth understanding.

Homework

Homework #2

Solutions #2

Week 3: Hypothesis Testing of the Mean

Overview

In this section hypothesis testing is introduced as the second form of statistical inference, and then related to the confidence interval, the first form. The worked homework problems provide a general template for statistical inference of the mean that integrates both hypothesis testing and the confidence interval.

R Functions

Again use the lessR function ttest, or its simpler form, tt.brief. Now add the option mu0, which specifies a hypothesized value. For example, for variable Y with a hypothesized mean of 10, use tt.brief(Y, mu0=10).
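The sketch below illustrates the same kind of test with the base R function t.test, in which the hypothesized value is given by the argument mu rather than the lessR option mu0. The data values are hypothetical. It also checks the link between the two forms of inference: the hypothesized value is rejected at the 5% level exactly when it falls outside the 95% confidence interval.

```r
# Hypothetical data values for variable Y
Y <- c(12, 9, 11, 14, 10, 13, 12, 11)

# Test the hypothesized population mean of 10
result <- t.test(Y, mu = 10)

# t statistic by hand: distance of the sample mean from the
# hypothesized value, in standard error units
t_manual <- (mean(Y) - 10) / (sd(Y) / sqrt(length(Y)))
print(c(t_manual = t_manual, p_value = result$p.value))

# Relate the hypothesis test to the confidence interval
ci <- result$conf.int
reject <- result$p.value < 0.05
outside <- 10 < ci[1] || 10 > ci[2]
print(c(reject = reject, outside_interval = outside))
```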

To plot the data values in the order they appear in the data file, use the lessR function LineChart, or just add line.chart=TRUE to the tt.brief function call, such as
   tt.brief(Y, mu0=10, line.chart=TRUE).

Content

5.1 Hypothesis Test of the Mean pdf
[21]
5.1a
[12:27]
5.1b
[5:57]
5.1c
[7:45]
5.2 Conduct the Hypothesis Test pdf
[31]
5.2a
[8:04]
5.2b
[16:26]
5.2c
[18:24]
5.3 Hypothesis Test, Confidence Interval pdf
[6]
5.3
[7:27]
   

Optional Reading

Gerbing, R Data Analysis without Programming for more examples and more in-depth understanding.

Homework

This homework introduces a real survey data set, similar to the kind of data you will be analyzing for your project. Again, all the course content and homework questions are designed to prepare you for the tests and, ultimately, your project (at either the Track ABC or Track C level). All of the homework assignments from here through the end of the course involve the analysis of survey data. When you do your project, you just need to apply the skills needed to do your homework assignments. That is one reason why most people get over 90% on the project, a grade of A or A-.

Homework #3

Solutions #3

Week 4: Compare Two Groups on a Variable of Interest

Overview

Compare Groups by their Means for a Variable

One way to compare groups is to compare their respective means on a variable. At a given firm, how does the average Salary for Men compare to the average Salary for Women? The analysis of the mean difference is the comparison of the means of a numerical variable of interest, called the response variable, across two different groups. The two groups define the data values for a variable, called the grouping variable. For example, the data values Male and Female define a grouping variable called Gender. The purpose of the analysis is to investigate how the value of the response variable Y relates to the level of the grouping variable X. On average, do the women at the firm make less, the same, or more than the men?

Compare Differences between Two Sets of Matched Data Values Directly

A second method to compare groups is possible if each data value in one group matches a data value in a second group, such as a husband's score and the wife's score on a survey of marital satisfaction. Instead of analyzing the mean difference, directly analyze the differences between the matched data values. For the population analyzed, on average, are the husbands more, the same, or less satisfied than their wives?

The Experiment

One issue is to detect a difference in the means of the response variable between groups. An additional issue is whether the differences in the level of the grouping variable at least partially caused the resulting differences in the response variable [e.g., Sec 6.6a, #3, Sec 6.6b, #9]. If a difference between the means of Salary across Males and Females is detected, what is the reason for this difference? In other words, correlation is not causation. Is the detected difference due to Gender, that is, discrimination, or are there other explanations for the observed difference? Is the difference due to direct causality, or is it a spurious relationship due to a confounding variable [e.g., Sec 6.6a, #4]? To observe a difference does not imply that the level of the grouping variable caused the difference.

The experiment [e.g., Sec 6.6c, #12] is the premier method for obtaining sufficient control to eliminate the potential impact of confounding variables through manipulation and randomization, assigning respondents (people) at random to obtain equivalent groups [e.g., Sec 6.6b, #10]. Only when alternative explanations due to confounding variables are eliminated, or at least minimized, can the researcher conclude causality, that the value of the grouping variable directly influences the response variable. Unfortunately, an experiment cannot always be implemented, such as in studies regarding gender differences. In the absence of randomization to obtain equivalent groups, attempt the next-best alternative, the quasi-experiment [e.g., Sec 6.6d, #29]. The issue here is methodological: it depends on the methods used to collect the data. Without some kind of experimental (or, as we see later, statistical) control, differences can be detected, but not explained.
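As a small illustration of randomization, the base R sketch below assigns hypothetical respondents to two groups by chance alone. The respondent IDs, group labels, group sizes, and seed are all assumptions made for the example.

```r
set.seed(460)  # arbitrary seed so the illustration is repeatable

# Hypothetical respondent IDs
respondents <- paste0("R", 1:20)

# Shuffle ten Treatment and ten Control labels so that chance,
# not the researcher or the respondent, decides each assignment
group <- sample(rep(c("Treatment", "Control"), each = 10))

assignment <- data.frame(respondents, group)
group_sizes <- table(assignment$group)
print(group_sizes)
```

Because chance decides who lands in each group, the groups tend to be equivalent on other characteristics, which is what guards against confounding variables.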

R Functions

Compare Groups by their Means for a Variable

We will use the brief output of the analysis of group means provided by the lessR function tt.brief. The format, for a model with a continuous (numerical) response variable Y and a grouping variable X with two categories (groups), is tt.brief(Y ~ X) [e.g., Sec 6.1, #31]. The tilde, "~", means "explained by", that is, it specifies a model that explains variation in Y in terms of one or more other variables called predictor variables. In this situation, there is just one predictor variable, a grouping variable X with exactly two unique values. For example, explain variation in Salary in terms of two Genders, Male or Female.
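The same formula notation works with the base R function t.test, which the sketch below uses in place of tt.brief. The salary data are invented for illustration, and var.equal = TRUE (the classic equal-variance test) is an assumption of this example.

```r
# Hypothetical salaries (in thousands) for two groups
d <- data.frame(
  Salary = c(52, 61, 58, 49, 66, 57, 63, 70),
  Gender = c("Female", "Female", "Female", "Female",
             "Male", "Male", "Male", "Male")
)

# Mean of the response variable within each group
group_means <- tapply(d$Salary, d$Gender, mean)
print(group_means)

# Response variable on the left of the tilde, grouping variable on the right
result <- t.test(Salary ~ Gender, data = d, var.equal = TRUE)
print(result$conf.int)   # confidence interval of the mean difference
print(result$p.value)
```

Note that the grouping variable must have exactly two unique values for this form of the analysis.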

Compare Differences between Two Sets of Matched Data Values Directly

To directly compare the differences between matched data values for two groups on a variable of interest, separate the variables by a comma and add the paired=TRUE option, such as tt.brief(X1, X2, paired=TRUE) for matched variables X1 and X2 [e.g., Sec 6.4, #5]. Use the comma here instead of a tilde because neither variable is used to explain the values of the other variable. Operationally, this means that the variables X1 and X2 can be listed in any order.
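The following base R sketch, using t.test rather than tt.brief, shows that a paired analysis is an analysis of the differences between the matched values, and that the order of the two variables only changes the sign of the result. The satisfaction scores are hypothetical.

```r
# Hypothetical matched satisfaction scores, one pair per couple
husband <- c(7, 5, 8, 6, 9, 4)
wife    <- c(6, 6, 9, 5, 7, 5)

# Paired analysis of the two matched variables
paired <- t.test(husband, wife, paired = TRUE)

# Equivalent analysis: test whether the mean of the differences is zero
diffs <- husband - wife
one_sample <- t.test(diffs, mu = 0)
print(all.equal(paired$p.value, one_sample$p.value))

# Listing the variables in the other order flips the sign of the
# t statistic but leaves the p-value unchanged
swapped <- t.test(wife, husband, paired = TRUE)
print(c(paired = unname(paired$statistic), swapped = unname(swapped$statistic)))
```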

Excel (optional)

Excel Template for Inference of the Mean Difference

Content

6.1 The Mean Difference pdf
[45]
6.1a
[15:35]
6.1b
[9:21]
6.1c
[6:40]
6.1d
[15:53]
 
6.4 Paired Analysis pdf
[15]
6.4a
[6:29]
6.4b
[10:00]
     
6.6 Causality and Experiments pdf
[37]
6.6a
[11:50]
6.6b
[7:26]
6.6c
[15:59]
6.6d
[14:15]
6.6e
[10:43]

Readings from Textbook

Optional Reading

Gerbing, R Data Analysis without Programming, Routledge Publishing, 2013: Sec 6.3, 6.4

Homework

Homework #4

Solutions #4

Week 5

Midterm

Take-Home Section

A downloadable Word document in which you provide your answers directly on the test. Due at the end of Week 5, Tuesday, Feb 9 at 11:59pm. Track C students only do the basic descriptive statistics, including bar charts and histograms.

Take-Home Midterm

Multiple-Choice Section

Timed Multiple-Choice and/or Short-Answer questions from D2L are administered anytime on Fri, Feb 5 and Sat, Feb 6. The questions are very closely based on the concepts presented at the beginning of each homework.

Each person takes a different test, as the items for each test are randomly selected from a larger pool. All questions are based on the concepts provided at the beginning of each homework, which serve as the study guide. The questions appear in two sections. Track C students only do the questions on descriptive statistics and general marketing research. That section is labeled All Students. A second section of the test, labeled Track ABC students, is just for Track ABC students.