3 The Sampling Distribution

As explained in lecture 1, a sample is an observed subset of a larger population. We typically calculate statistics based on sample data, and use these as best guesses of the values of population parameters. This process is called statistical inference. A crucial insight is that sample statistics are not perfect estimates of population parameters. The discrepancy between the sample statistic and population parameter is known as sampling error.

Theory tells us a good deal about how sample statistics behave. For example, we can imagine constructing a probability distribution of the values we might see for a sample statistic, such as the mean, if we were to draw very many random samples of the same size from the same population. This theoretical distribution of means is called the sampling distribution. The central limit theorem tells us that, regardless of the shape of the distribution of the data in the population, as the sample size increases, the sampling distribution of the mean approaches a normal distribution. This is an important realization, because it means that we can use probability calculations based on the normal distribution to draw inferences about population parameters from sample statistics.
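
The idea of a sampling distribution can be made concrete with a small simulation. The sketch below is an illustration only and is not taken from the lecture: it assumes a skewed (exponential) population with mean 1 and SD 1, and the function name `simulate_sample_means`, the sample size, the number of samples, and the seed are all hypothetical choices.

```python
import random
import statistics


def simulate_sample_means(n, n_samples=5000, seed=1):
    """Draw n_samples random samples of size n and return each sample's mean.

    Assumption for illustration: the population is exponential with mean 1
    (and therefore SD 1), which is clearly not normal.
    """
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(n_samples)
    ]


means = simulate_sample_means(n=30)
# The centre of the simulated sampling distribution should be close to the
# population mean (1), and its SD close to 1 / sqrt(30), roughly 0.18.
print(round(statistics.fmean(means), 2))
print(round(statistics.stdev(means), 2))
```

Plotting a histogram of `means` for increasing values of `n` should show the shape moving toward a normal curve, even though the population itself is skewed.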

The standard deviation of the sampling distribution plays a central role in inferential statistics. It is so important that we give it a unique name: we call this particular standard deviation the standard error (SE). The standard error quantifies the average, or expected, amount of sampling error when we use a sample statistic to estimate the population parameter. If the standard error is small, our estimates based on the sample are likely to be accurate, whereas a large standard error indicates greater uncertainty.

With the help of the normal distribution, and given a particular (hypothesized or known) population mean and standard error, we can calculate how likely it is to observe specific sample means. For example, if we want to determine the probability that the mean of a random sample exceeds a certain value, we can standardize the sample mean using the formula \(Z = \frac{M - \mu}{SE_M}\), where \(M\) is the sample mean, \(\mu\) is the known or hypothesized population mean, and \(SE_M\) is the standard error of the mean. By looking up the corresponding probability in a standard normal distribution table or using statistical software, we can assess the likelihood of observing a specific sample mean (or greater, or smaller).
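
The standardization step can be sketched in a few lines of Python. This is a minimal illustration, assuming the sampling distribution of the mean is normal and that the standard error equals the population SD divided by the square root of the sample size; the function name `prob_sample_mean_above` and the example numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def prob_sample_mean_above(m, mu, sigma, n):
    """P(sample mean > m) for random samples of size n.

    Assumes a normal sampling distribution with mean mu and
    standard error sigma / sqrt(n).
    """
    se = sigma / sqrt(n)
    z = (m - mu) / se
    # Right-tailed probability on the standard normal distribution.
    return 1 - NormalDist().cdf(z)


# Hypothetical numbers: mu = 100, sigma = 15, n = 25, so SE = 3 and Z = 1.
p = prob_sample_mean_above(103, mu=100, sigma=15, n=25)
print(round(p, 2))  # 0.16
```

For a left-tailed probability, use `NormalDist().cdf(z)` directly instead of subtracting it from 1.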

Confidence intervals are a way to express our uncertainty about the sample statistic as an estimator of the population parameter. A confidence interval is a range of values - a window - within which we expect the true population parameter to fall with a certain level of confidence. Typically, we select a 95% confidence interval, which means that if we could repeat the sampling process many times and calculate a confidence interval each time, 95% of those intervals would contain the true population parameter. The width of the confidence interval is proportional to the standard error and increases with the desired level of confidence. The formula for a confidence interval is often written as: \(M \pm Z_{95\%} \times SE_M\), where \(Z_{95\%} \approx 1.96\). In practice, this comes down to approximately: \(M \pm 2 \times SE_M\).
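
As a rough sketch, the confidence interval formula can be written as a small Python function. The function name `confidence_interval` and the example numbers are hypothetical; the critical value is taken from the standard normal distribution, as in the formula above.

```python
from math import sqrt
from statistics import NormalDist


def confidence_interval(m, se, level=0.95):
    """Return (lower, upper) for M +/- Z * SE_M, using the normal distribution."""
    # Two-sided critical value, e.g. about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return m - z * se, m + z * se


# Hypothetical numbers: sample mean 50, SD 10, n = 100, so SE_M = 1.
lower, upper = confidence_interval(50, 10 / sqrt(100))
print(round(lower, 2), round(upper, 2))  # 48.04 51.96
```

Raising `level` to 0.99 gives a larger critical value and therefore a wider interval.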

4 Lecture

VIDEO ERRATUM: from 19:40 to 19:50 I incorrectly report P(Z > 1) as .025, but it is .16.

5 Formative Test

A formative test helps you assess your progress in the course, and helps you address any blind spots in your understanding of the material. If you get a question wrong, you will receive a hint on how to improve your understanding of the material.

Ideally, complete the formative test after you’ve seen the lecture, but before the lecture meeting, in which we can discuss any topics that need more attention.

Question 1

Introversion is normally distributed with a mean of 50 and a standard deviation of 10. What is the probability that the mean introversion level of a randomly selected group of 16 people is smaller than 52? Round the answer to 3 decimal places.

Question 2

Variable X is not normally distributed in the population. Variable X has a population mean of 30 and a population standard deviation of 6. A random sample of n = 36 scores is drawn from the population for variable X. The sample mean is equal to 32. Which of the following statements about the sampling distribution of the sample mean for samples of this size (n = 36) is incorrect?

Question 3

What does the sampling distribution represent?

Question 4

Which of the following statements about the sampling distribution is true?

Question 5

What is the standard deviation of the sampling distribution called?

Question 6

How does sample size affect the shape of the sampling distribution?

Question 7

What is the probability that a sample mean falls within +/- 1 standard deviation of the population mean, assuming a normal distribution of sample means?

Question 8

If the standard deviation of the population is 10 and the sample size is 25, what is the standard error of the sample mean?

Question 9

What is the probability that the sample proportion falls within +/- 2 standard deviations of the population proportion, assuming a large sample size?

Question 10

If the standard deviation of the population is 5 and the sample size is 50, what is the standard error of the sample mean?

Question 1

Calculate the standard error as 10/sqrt(16). Then calculate the Z-score as (52-50)/SE. Find the right-tailed probability of that Z-score, then calculate 1 minus that probability.

Question 2

The sampling distribution will be approximately normal because n >= 30. The SE is indeed 1, because sigma/sqrt(n) = 6/sqrt(36) = 1. For any n > 1, the SE is smaller than sigma, because it is calculated as sigma divided by the square root of n.

Question 3

The sampling distribution represents the distribution of sample statistics, such as sample means or proportions, derived from multiple samples drawn from the same population. It provides insights into the variability and characteristics of these sample statistics.

Question 4

The sampling distribution is centered around the population parameter.

Question 5

The standard deviation of the sampling distribution is known as the standard error. It measures the average variability or spread of sample statistics around the population parameter, reflecting the precision of the estimation.

Question 6

Larger sample sizes result in smaller standard errors. As the sample size increases, the sampling distribution becomes more concentrated around the population parameter, leading to a decrease in the standard error. This implies that larger samples provide more precise estimates of the population parameter.

Question 7

The probability that a sample mean falls within +/- 1 standard deviation of the population mean, assuming a normal distribution of sample means, is 68%. This is based on ‘the empirical rule’.

Question 8

The standard error of the sample mean is 2. The standard error can be calculated by dividing the standard deviation of the population by the square root of the sample size. In this case, it would be 10 / √25 = 2.

Question 9

The key lesson here is that everything you learned about the sampling distribution also applies to statistics other than the mean. So, according to the empirical rule, 95% of sample proportions will fall within +/- 2 standard deviations of the population proportion.

Question 10

The standard error of the sample mean is SD/sqrt(n), so 5/sqrt(50) = .707.
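
The standard error calculations for Questions 8 and 10 can be checked with a short Python sketch. The function name `standard_error_mean` is a hypothetical helper; the numbers are the ones given in the questions.

```python
from math import sqrt


def standard_error_mean(sd, n):
    """Standard error of the sample mean: SD divided by the square root of n."""
    return sd / sqrt(n)


print(standard_error_mean(10, 25))           # Question 8: 2.0
print(round(standard_error_mean(5, 50), 3))  # Question 10: 0.707
```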

6 Tutorial

6.1 Sampling Distribution

Complete the following sentences:

IQ scores in the population of potential students are normally distributed with a mean of 120 and an SD of 10. If a cohort contains 75 students, 95% of cohorts will have an average IQ between ____ and ____.

After graduating, a cohort of 75 LAS students can expect to earn a starter salary of 2650 euros, with an SD of 300 euros. What percentage of cohorts will have a mean starter salary greater than 2750 euros? ____

In a sample of 5000 babies, the average birthweight is 3.213 kg, with an SD of 254 grams. What is the mean birthweight of the sampling distribution?

Consider a continuous variable X, which is normally distributed with \(X \sim N(\mu = 30, \sigma = 4)\). We draw a sample of 15 participants. What is the probability that the sample mean will be smaller than 32?

The proportion of male babies is .51. Assume babies born in each hospital in a given month constitute a random sample of size 100. The standard error of a proportion is given by \(\sqrt{p(1-p)/n} = \sqrt{.51 \times .49 / 100} \approx 0.05\). What proportion of hospitals will have more than 60% male babies?
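
The standard error in the last item can be verified numerically. The sketch below only reproduces the SE given in the text; the function name `standard_error_proportion` is a hypothetical helper, and the tail probability itself remains the exercise.

```python
from math import sqrt


def standard_error_proportion(p, n):
    """Standard error of a sample proportion: sqrt(p * (1 - p) / n)."""
    return sqrt(p * (1 - p) / n)


se = standard_error_proportion(0.51, 100)
print(round(se, 2))  # 0.05
```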

6.2 In SPSS

6.2.1 SE for Means

Open the file called student_questionnaire.sav.

These are data from a previous cohort of students. Note that we have data about biometric differences (e.g., age, height, shoesize), as well as school-related questions (which program they are enrolled in), variables about their love for statistics, and about moral preferences (based on the “Morality As Cooperation” questionnaire that I helped develop).

Go to Analyze -> Descriptive Statistics -> Descriptives and ask for descriptive statistics on height and shoesize. Click Options, and notice that there’s an option to request the standard error of the mean. Select this option, then paste and run your syntax. Check if it corresponds to the syntax below.

DESCRIPTIVES VARIABLES=height
  /STATISTICS=MEAN STDDEV MIN MAX SEMEAN.

Note the SEMEAN option was added by clicking that option!

The mean height in the population of Dutch people is 177.434. With this in mind, calculate the probability that a random sample of the same size as this sample would have the mean height you calculated for this sample or smaller.

The question asks for the lower-tail probability below a value of 174.62 in a distribution with mean 177.434 and SD .772 (the SE you obtained from SPSS).

\(Z = \frac{174.62-177.434}{.772} = -3.65\)

A Z-score of -3.65 lies far in the lower tail of the standard normal distribution, so this probability is extremely small, < .001.
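
The same lower-tail probability can be checked outside SPSS. This is a minimal sketch using Python's standard library, with the sample mean, population mean, and SE taken from the text above; the variable names are hypothetical.

```python
from statistics import NormalDist

sample_mean = 174.62       # sample mean reported above
population_mean = 177.434  # population mean given in the exercise
se = 0.772                 # SE of the mean from the SPSS output

z = (sample_mean - population_mean) / se
p_lower = NormalDist().cdf(z)  # lower-tail probability P(Z < z)

print(round(z, 2))      # -3.65
print(p_lower < 0.001)  # True
```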

6.2.2 SE for Proportions

Go to Analyze -> Compare Means -> One Sample Proportions.

This procedure allows you to estimate proportions and their standard errors. It’s not very common; in fact, I learned about it by Googling “standard error for proportion spss”! Any time you need to know how to do something in SPSS, you can find advice on the internet.

Calculate the proportion for the variable sex, and paste and run your syntax.


PROPORTIONS
  /ONESAMPLE sex TESTVAL=0.5 TESTTYPES=MIDP SCORE  CITYPES=AGRESTI_COULL JEFFREYS WILSON_SCORE 
  /SUCCESS VALUE=LAST
  /CRITERIA CILEVEL=95
  /MISSING SCOPE=ANALYSIS USERMISSING=EXCLUDE.

Note the table labelled “One-Sample Proportions Confidence Intervals”. This table contains confidence intervals for the proportion, calculated according to three different procedures. In the lecture, you also learned a procedure to calculate confidence intervals.

Using the procedure from the lecture, calculate a 95% confidence interval for the proportion. You can round the Z-score for this confidence interval to 2.

The 95% CI for the proportion of male students is [, ].

Note that the results of this procedure and of the three procedures in the table differ only in the third decimal.

How do you interpret a confidence interval?

6.2.3 SE for Correlation

Recall from the first lecture that the correlation coefficient is a measure of linear association between two variables, or: a descriptive statistic that describes how strongly two continuous variables are associated.

Go to Analyze -> Correlate -> Bivariate. Add the variables work_hours and study_hours, paste and run the syntax.

The value of the correlation coefficient is labelled “Pearson Correlation”. What value do you observe?

The correlation coefficient ranges from -1 to 1, where 0 indicates no linear association. With this in mind, answer the following question:

True or false: This correlation coefficient is near zero.

The calculation of a standard error is a bit more complicated, but there’s an “approximation” (a quick approach that gives reasonable results in some cases, but could be wrong in other cases). It is calculated as:

\[ SE_r = \sqrt{\frac{1-r^2}{n-2}} \]
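
As a sketch, this approximation can be written as a small Python function. The function name `se_correlation` and the example values are hypothetical and are not the tutorial data.

```python
from math import sqrt


def se_correlation(r, n):
    """Approximate standard error of a correlation: sqrt((1 - r^2) / (n - 2))."""
    if n <= 2:
        raise ValueError("n must be greater than 2")
    return sqrt((1 - r ** 2) / (n - 2))


# Hypothetical values, not the tutorial data: r = 0.30 in a sample of n = 102.
print(round(se_correlation(0.30, 102), 3))  # 0.095
```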

Calculate the SE this way. What is its value?

Assume for a moment that the true population correlation is zero (\(\rho = 0\)). Using the SE you calculated, what would then be the probability of observing a correlation between 0 and the correlation you actually observed?

The question asks for the probability between the mean (0) and a value of .057 in a distribution with mean 0 and SD .075 (the SE you calculated).

So we first calculate the right-tailed probability for the value of .057.

\(Z = \frac{.057-0}{.075} = 0.76\)

A Z-score of 0.76, so the right-tailed probability is 0.22.

Then, take .5 (the probability to the right of 0), and subtract .22: .5 - .22 = .28.
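
This result can be checked with a short Python sketch. It assumes, as in the exercise, a normal sampling distribution with mean 0 and SD equal to the SE of .075; the variable names are hypothetical.

```python
from statistics import NormalDist

r_observed = 0.057  # observed correlation from the text
se_r = 0.075        # approximate SE calculated above

# Sampling distribution of r under the assumption that the true correlation is 0.
null_dist = NormalDist(mu=0, sigma=se_r)
p_between = null_dist.cdf(r_observed) - null_dist.cdf(0)

print(round(p_between, 2))  # 0.28
```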