3 The Sampling Distribution

As explained in lecture 1, a sample is an observed subset of a larger population. We typically calculate statistics based on sample data, and use these as best guesses of the values of population parameters. This process is called statistical inference. A crucial insight is that sample statistics are not perfect estimates of population parameters. The discrepancy between the sample statistic and population parameter is known as sampling error.

We have some theoretical insight into theoretical behavior of sample statistics. For example, we can imagine constructing a probability distribution of the values we might see for a sample statistic, such as the mean, if we were to draw very many random samples from an identical population. This theoretical distribution of means is called the sampling distribution. The central limit theorem tells us that, regardless of the shape of the distribution of the data in the population, as the sample size increases, the sampling distribution of the mean approaches a normal distribution. This is an important realization, because it means that we can use probability calculus using the normal distribution to draw inferences about population parameters based on sample statistics.

The standard deviation of the sampling distribution plays a central role in inferential statistics. It is so important that we give it a unique name: we call this particular standard deviation the standard error (SE). The standard error quantifies the average, or expected, amount of sampling error when we use a sample statistic to estimate the population parameter. If the standard error is small, our estimates based on the sample are likely to be accurate, whereas a large standard error indicates greater uncertainty.

With the help of the normal distribution, and given a particular (hypothesized or known) population mean and standard error, we can calculate how likely it is to observe specific sample means. For example, if we want to determine the probability that the mean of a random sample exceeds a certain value, we can standardize the sample mean using the formula \(Z = \frac{M - \mu}{SE_M}\), where M is the sample mean, \(\mu\) is the known or hypothesized population mean, and SE is the standard error. By looking up the corresponding probability on the standard normal distribution table or using statistical software, we can assess the likelihood of observing a specific sample mean (or greater, or smaller).

Confidence intervals are a way to express our uncertainty about the sample statistic as estimator of the population parameter. A confidence interval is a range of values - a window - within which we expect the true population parameter to fall with a certain level of confidence. Typically, we select a 95% confidence interval, which means that if we could repeat the sampling process many times and calculated confidence intervals each time, 95% of those intervals would contain the true population parameter. The width of the confidence interval is determined by the standard error and is proportional to the level of confidence desired. The formula for a confidence interval is often written as: \(M \pm Z_{95\%} * SE_M\). In practice, this comes down to approximately: \(M \pm 2 * SE_M\).

4 Lecture

VIDEO ERRATA: from 19:40 - 19:50 I incorrectly report the probability of P(Z > 1) as .025, but it is .16.

6 Tutorial

6.1 Sampling Distribution

Complete the following sentences:

IQ scores in the population of potential students are normally distributed with mean 120 and an SD of 10. If a cohort contains 75 students, 95% of cohorts will have an average IQ in between and .

After graduating, a cohort of 75 LAS students can expect to earn a starter salary of 2650 Euros, with an SD of 300 euros. What percentage of cohorts will have a mean starter salary greater than 2750 euros? .

In a sample of 5000 babies, the average birthweight is 3.213 kg, with an SD of 254 grams. What is the mean birthweight of the sampling distribution?

Consider a continuous variable X, which is normally distributed with \(X \sim(\mu = 30, \sigma = 4)\). We draw a sample of 15 participants. What is the probability that the sample mean will be smaller than 32?

The proportion of male babies is .51. Assume babies born in each hospital in a given month constitute a random sample of size 100. The standard error of a proportion is given by \(\sqrt(p*(1-p) / n) = \sqrt(.51*.49 / 100) = 0.05\). What proportion of hospitals will have more than 60% male babies?

6.2 In SPSS

6.2.1 SE for Means

Open the file called student_questionnaire.sav.

These are data from a previous cohort of students. Note that we have data about biometric differences (e.g., age, height, shoesize), as well as school-related questions (which program they are enrolled in), variables about their love for statistics, and about moral preferences (based on the “Morality As Cooperation” questionnaire that I helped develop).

Go to Analyze -> Descriptives and ask for descriptive statistics on height and shoesize. Click Options, and notice that there’s an option to request the standard error of the mean. Select this option, then paste and run your syntax. Check if it corresponds to the syntax below.

DESCRIPTIVES VARIABLES=height
  /STATISTICS=MEAN STDDEV MIN MAX SEMEAN.

Note the SEMEAN option was added by clicking that option!

The mean length in the population of Dutch people is 177.434. With this in mind, calculate the probability that a random sample of the same size as this sample would have the mean length you calculated for this sample or smaller.

The question asks for the lower-tail probability below a value of 174.62 in a distribution with mean 177.434 and SD .772 (the SE you obtained from SPSS).

\(\frac{174.62-177.34}{.772} = -3.65\)

A Z-score of nearly -4, so this probability is going to be extremely small, < .001.

6.2.2 SE for Proportions

Go to Analyze -> Compare Means -> One Sample Proportions.

This procedure allows you to estimate proportions and their standard errors. It’s not very common, in fact I learned about it by Googling “standard error for proportion spss”! Any time you need to know how to do something in SPSS, you can find advice on the internet.

Calculate the proportion for the variable sex, and paste and run your syntax.


PROPORTIONS
  /ONESAMPLE sex TESTVAL=0.5 TESTTYPES=MIDP SCORE  CITYPES=AGRESTI_COULL JEFFREYS WILSON_SCORE 
  /SUCCESS VALUE=LAST
  /CRITERIA CILEVEL=95
  /MISSING SCOPE=ANALYSIS USERMISSING=EXCLUDE.

Note the table labelled “One-Sample Proportions Confidence Intervals”. This table contains confidence intervals for the proportion, calculated according to three different procedures. In the lecture, you also learned a procedure to calculate confidence intervals.

Using the procedure from the lecture, calculate a 95% confidence interval for the proportion. You can round the Z-score for this confidence interval to 2.

The 95% CI for the proportion of male students is [, ].

Note that the differences between this procedure and the three procedures in the table only differ in the third decimal.

How do you interpret a confidence interval?

A range of values that contains the population parameter with 95% certainty.A range of values that, 95% of the time, contains the population parameter.The population value, with a 95% margin of error.The range of likely population values.

6.2.3 SE for Correlation

Recall from the first lecture that the correlation coefficient is a measure of linear association between two variables, or: a descriptive statistic that describes how strongly two continuous variables are associated.

Go to Analyze -> Correlate -> Bivariate. Add the variables work_hours and study_hours, paste and run the syntax.

The value of the correlation coefficient is labelled “Pearson Correlation”. What value do you observe?

The correlation coefficient ranges from 0-1 (or minus 1). With this in mind, answer the following question:

True or false: This correlation coefficient is near zero.

The calculation of a standard error is a bit more complicated, but there’s an “approximation” (a quick approach that gives reasonable results in some cases, but could be wrong in other cases). It is calculated as:

\[ SE_r = \sqrt{\frac{1-r^2}{n-2}} \]

Calculate the SE this way. What is its value?

Assume for a moment that the true population correlation is zero (r = 0). Using the SE you calculated, what would then be the probability of observing a correlation between 0 and the correlation you actually observed?

The question asks for the probability between the mean (0) and a value of .057 in a distribution with mean 0 and SD .075 (the SE you calculated).

So we first calculate the right-tailed probability for the value of .057.

\(Z = \frac{.057-0}{.075} = 0.76\)

A Z-score of 0.76, so the right-tailed probability is 0.22.

Then, take .5 (the probability to the right of 0), and subtraCT .22: .28