Posted 7/4/2016 11:29:58 PM

Tags:
R Statistics

In teaching sampling distributions, I find it helpful to have as many tangible examples as possible, so I like to use demonstrations. I wrote a Shiny app that I use for this purpose, which you can download from here. To run the app, you will need R and the Shiny package. Shiny is integrated into RStudio, so I would recommend that as well.

The main display of the app can be seen below. It shows what is displayed once several samples of data have been taken.

Three things are plotted:

- All of the individual values that have been sampled so far,
- The sampling distribution of the statistic that is calculated, and
- The 4 most recent samples of data.

The histogram of all sampled values asymptotically approaches the population distribution in shape, so that students can see what the population looks like. You can also check the "Show population" box to add lines showing the exact population distribution. Showing students the population distribution allows them to contrast it with the sampling distribution of the statistic. This allows them to clearly see things like the narrowness of the sampling distribution of the mean relative to the population distribution.

I show the 4 most recent samples because I find that students struggle with the idea that there are individual samples of data, each of which has its own value for the statistic. Plotting the previous samples along with the value of the statistic for each one helps to make this idea concrete. To reinforce the point, I make use of color: values from the most recent sample are colored in the population distribution plot, and the value of the statistic from the most recent sample is colored in the sampling distribution plot.

There are a number of things that can be controlled about the main sampling procedure, including the population distribution that is sampled from, the statistic to calculate for each sample, the sample size, and the number of samples to take each time the "Sample" button is pressed. The "Clear" button clears all samples.

You can choose whether the x-axis scales of the two main plots are matched or not. When the scales are matched, it is easy to see how the sampling distribution of the mean is less variable than the population distribution. Leaving the scales unmatched essentially zooms in on the sampling distribution, which makes it easy to see its shape.

You can also choose whether the statistics printed below each of the two main plots are shown. I find that they create visual clutter, so I like to hide them unless I am talking about them.

There are three built-in distributions:

- Normal with mean 30 and standard deviation 5.
- Exponential with rate 1/10.
- Uniform from 10 to 20.
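For readers who want to experiment outside the app, here is a minimal sketch of these three populations in base R. The list name `populations` and the check at the end are my own illustration, not the app's actual code, which may be organized differently.

```r
# Hypothetical helper list: one random-draw function per population.
# The parameters match the three distributions listed above.
populations <- list(
  normal      = function(n) rnorm(n, mean = 30, sd = 5),
  exponential = function(n) rexp(n, rate = 1 / 10),
  uniform     = function(n) runif(n, min = 10, max = 20)
)

set.seed(1)

# Draw many values from each population and compare the observed means
# with the theoretical means (30, 10, and 15, respectively).
observed_means <- sapply(populations, function(draw) mean(draw(1e5)))
print(round(observed_means, 2))
```

With this many draws, each observed mean should land very close to its theoretical value.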

I use the normal distribution for a lot of things. One example is to show that the standard deviation of the mean is equal to the population standard deviation divided by the square root of the sample size. The normal distribution has an SD of 5, so if you pick a sample size of 25, the standard deviation of the mean will be 1.
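The same relationship can be checked with a short simulation. This is a sketch in base R rather than the app's code; the number of samples (10,000) and the variable names are assumptions chosen for illustration.

```r
set.seed(42)

n_samples   <- 10000  # assumed number of repeated samples
sample_size <- 25

# Take many samples of size 25 from Normal(30, 5) and keep each sample's mean.
sample_means <- replicate(n_samples, mean(rnorm(sample_size, mean = 30, sd = 5)))

# Population SD divided by the square root of the sample size: 5 / sqrt(25) = 1.
theoretical_se <- 5 / sqrt(sample_size)

# Standard deviation of the simulated sampling distribution of the mean.
simulated_se <- sd(sample_means)

print(c(theoretical = theoretical_se, simulated = simulated_se))
```

The simulated value should come out close to 1, and changing `sample_size` shows how the spread shrinks as samples get larger.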

I like to use the exponential distribution to demonstrate how the sampling distribution of the mean is more normal than the population distribution (i.e. the central limit theorem). I vary the sample size to show that how closely the sampling distribution resembles a normal distribution depends on the sample size. For example, sample sizes 5 and 200 have very differently shaped sampling distributions of the mean.
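One rough way to put a number on this outside the app is to compare the skewness of the simulated sample means at the two sample sizes. The sketch below is base R only; `skewness()` and `sample_means()` are hypothetical helpers written for this example, and 5,000 samples per size is an arbitrary choice.

```r
set.seed(123)

# Simple moment-based skewness; 0 for a perfectly symmetric distribution.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Means of n_samples samples drawn from Exponential(rate = 1/10).
sample_means <- function(sample_size, n_samples = 5000) {
  draws <- matrix(rexp(n_samples * sample_size, rate = 1 / 10),
                  nrow = n_samples)
  rowMeans(draws)  # one mean per row, i.e. per sample
}

skew_5   <- skewness(sample_means(5))
skew_200 <- skewness(sample_means(200))

print(c(n_5 = skew_5, n_200 = skew_200))
```

The sampling distribution for samples of size 5 should still be clearly right-skewed, while the one for size 200 should be much closer to symmetric.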

I use the uniform distribution to show how a very non-normal but still symmetrical distribution leads to a symmetrical sampling distribution of the mean.

There are five built-in statistics: mean and median for central tendency, and variance, standard deviation, and range for variability.

I like to use the sample standard deviation as an example of how statistics can be biased (standard deviation is biased low). I contrast variance, which is unbiased, with standard deviation.
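This contrast is easy to reproduce with a few lines of base R. The sketch below assumes the normal population (SD 5, variance 25), a small sample size of 5, and 20,000 repeated samples; all of those choices and the variable names are mine, not settings taken from the app.

```r
set.seed(2016)

n_samples   <- 20000  # assumed number of repeated samples
sample_size <- 5      # small samples make the bias easy to see

# Each column holds one sample from Normal(30, 5).
samples <- matrix(rnorm(n_samples * sample_size, mean = 30, sd = 5),
                  nrow = sample_size)

# var() and sd() use the usual n - 1 denominator.
sample_vars <- apply(samples, 2, var)
sample_sds  <- apply(samples, 2, sd)

mean_var <- mean(sample_vars)  # should be close to the population variance, 25
mean_sd  <- mean(sample_sds)   # should fall below the population SD, 5

print(c(mean_variance = mean_var, mean_sd = mean_sd))
```

The average sample variance should sit near 25, while the average sample standard deviation should come out noticeably below 5.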

I sometimes show how, for the symmetric distributions, the mean and median are both unbiased but have different amounts of variability. This gets complicated, because there is no simple rule as to which is less variable, so I often leave it out.

I include the range statistic as an example of a statistic that is a poor estimator of the population parameter. The normal and exponential distributions both have unbounded ranges, which is kind of unfair, but even for the bounded uniform distribution, the sample range is biased fairly low. Also, the bias depends a lot on sample size, which is another bad characteristic.
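For the uniform population, the size of this bias can be worked out exactly: for a uniform distribution of width 10, the expected sample range is 10 * (n - 1) / (n + 1). The sketch below compares that formula with a simulation in base R; the helper `mean_range()`, the sample sizes 5 and 50, and the 10,000 repetitions are assumptions made for illustration.

```r
set.seed(7)

# Average sample range over many samples from Uniform(10, 20).
mean_range <- function(sample_size, n_samples = 10000) {
  ranges <- replicate(n_samples, {
    x <- runif(sample_size, min = 10, max = 20)
    max(x) - min(x)
  })
  mean(ranges)
}

sizes     <- c(5, 50)
simulated <- sapply(sizes, mean_range)

# Theoretical expected range for a uniform population of width 10.
expected <- 10 * (sizes - 1) / (sizes + 1)

print(data.frame(sample_size = sizes, simulated = simulated, expected = expected))
```

Both averages fall short of the true population range of 10, and the shortfall is much larger for the smaller sample size.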

I have used this app, with progressive improvements, to teach sampling distributions three times now. I've ironed out a lot of the warts and have added basically all of the features that I want. The code is designed so as to be reasonably extensible. In addition, it's a fairly short program, so it should be easy to modify if you have different ideas about how to do things.

Because many computers in classrooms will not have R on them, you should know that you can get R Portable, which runs off of a thumb drive. I have used this to run the app on a variety of computers with great success.