The code and a notebook for this post are available on github.
Before we conduct an experiment, we decide on a number of trials per condition, and a number of participants, and then we hope that whatever we measure comes close to the population values we’re trying to estimate. In this post I’ll use a large dataset to illustrate how close we can get to the truth – at least in a lexical decision task. The data are from the French lexicon project:
Ferrand, L., New, B., Brysbaert, M., Keuleers, E., Bonin, P., Meot, A., Augustinova, M. & Pallier, C. (2010) The French Lexicon Project: lexical decision data for 38,840 French words and 38,840 pseudowords. Behav Res Methods, 42, 488-496.
After discarding participants who clearly did not pay attention, we have 967 participants who performed a word/non-word discrimination task. Each participant has about 1000 trials per condition. I only consider the reaction time (RT) data, but correct/incorrect data are also available. Here are RT distributions for 3 random participants (PXXX refers to their ID number in the dataset):
The distributions are positively skewed, as expected for RT data, and participants tend to be slower in the non-word condition compared to the word condition. Usually, a single number is used to summarise each individual RT distribution. From 1000 values to 1, that’s some serious data compression. (For alternative strategies to compare RT distributions, see for instance this paper). In psychology, the mean is often used, but here the median gives a better indication of the location of the typical observation. Let’s use both. Here is the distribution across participants of median RT for the word and non-word conditions:
And the distribution of median differences:
Interestingly, the differences between the medians of the two conditions is also skewed: that’s because the two distributions tend to differ in skewness.
We can do the same for the mean:
With this large dataset, we can play a very useful game. Let’s pretend that the full dataset is our population that we’re trying to estimate. Across all trials and all participants, the population medians are:
Word = 679.5 ms
Non-word = 764 ms
Difference = 78.5
the population means are:
Word = 767.6 ms
Non-word = 853.2 ms
Difference = 85.5
Now, we can perform experiments by sampling with replacement from our large population. I think we can all agree that a typical experiment does not have 1,000 trials per condition, and certainly does not have almost 1,000 participants! So what happens if we perform experiments in which we collect smaller sample sizes? How close can we get to the truth?
10,000 experiments: random samples of participants only
In the first simulation, we use all the trials (1,000) available for each participant but vary how many participants we sample from 10 to 100, in steps of 10. For each sample size, we draw 10,000 random samples of participants from the population, and we compute the mean and the median. For consistency, we compute group means of individual means, and group medians of individual medians. We have 6 sets of results to consider: for the word condition, the non-word condition, their difference; for the mean and for the median. Here are the results for the group medians in the word condition. The vertical red line marks the population median for the word condition.
With increasing sample size, we gain in precision: on average the result of each experiment is closer to the population value. With a small sample size, the result of a single experiment can be quite far from the population. This is not surprising, and also explains why estimates from small n experiments tend to disagree, especially between an original study and subsequent replications.
To get a clearer sense of how close the estimates are to the population value, it is useful to consider highest density intervals (HDI). A HDI is the shorted interval that contains a certain proportion of observations: it shows the location of the bulk of the observations. Here we consider 50% HDI:
In keeping with the previous figure, the intervals get smaller with increasing sample size. The intervals are also asymmetric: the right side is always closer to the population value than the left side. That’s because the sampling distributions are asymmetric. This means that the typical experiment will tend to under-estimate the population value.
We observe a similar pattern for the non-word condition, with the addition of two peaks in the distribution from n = 40. I’m not sure what’s causing it, or if it is a common shape. One could check by analysing the lexicon project data from other countries. Any volunteers? One thing for sure: the two peaks are caused by the median, because they are absent from the mean distributions (see notebook on github).
Finally, we can consider the distributions of group medians of median differences:
Not surprisingly, it has a similar shape to the previous distributions. It shows a landscape of expected effects. It would be interesting to see where the results of other, smaller, experiments fall in that distribution. The distribution could also be used to plan experiments to achieve a certain level of precision. This is rather unusual given the common obsession for statistical power. But experiments should be predominantly about quantifying effects, so a legitimate concern is to determine how far we’re likely to be from the truth in a given experiment. Using our simulated distribution of experiments, we can determine the probability of the absolute difference between an experiment and the population value to be larger than some cut-offs. In the figure below, we look at cut-offs from 5 ms to 50 ms, in steps of 5.
The probability of being wrong by at least 5 ms is more than 50% for all sample sizes. But 5 ms is probably not a difference folks studying lexical decision would worry about. It might be an important difference in other fields. 10 ms might be more relevant: I certainly remember a conference in which group differences of 10 ms between attention conditions were seriously discussed, and their implications for theories of attention considered. If we care about a resolution of at least 10 ms, n=30 still gives us 44% chance of being wrong…
(If someone is looking for a cool exercise: you could determine the probability of two random experiments to differ by at least certain amounts.)
We can also look at the 50% HDI of the median differences:
As previously noted, the intervals shrink with increasing sample size, but are asymmetric, suggesting that the typical experiment will under-estimate the population difference. This asymmetry disappears for larger samples of group means:
Anyway, these last figures really make me think that we shouldn’t make such a big deal out of any single experiment, given the uncertainty in our measurements.
10,000 experiments: random samples of trials and participants
In the previous simulation, we sampled participants with replacement, using all their 1,000 trials. Typical experiments use fewer trials. Let say I plan to collect data from 20 participants, with 100 trials per condition. What does the sampling distribution look like? Using our large dataset, we can sample trials and participants to find out. Again, for consistency, group means are computed from individual means, and group medians are computed from individual medians. Here are the sampling distributions of the mean and the median in the word condition:
The vertical lines mark the population values. The means are larger than the medians because of the positive skewness of RT distributions.
In the non-word condition:
And the difference distributions:
Both distributions are slightly positively skewed, and their medians fall to the left of the population values. The spread of expected values is quite large.
Similarly to simulation 1, we can determine the probability of being wrong by at least 5 ms, which is 39%. For 10 ms, the probability is 28%, and for 20 ms 14%. But that’s assuming 20 participants and 100 trials per condition. A larger simulation could investigate many combinations of sample sizes…
Finally, we consider the 50% HDI, which are both asymmetric with respect to the population values and with similar lengths:
To answer the question from the title, I think the main things we can learn from the simulated sampling distributions above are:
– to embrace uncertainty
– modesty in our interpretations
– healthy skepticism about published research
One way to highlight uncertainty in published research would be to run simulations using descriptive statistics from the articles, to illustrate the range of potential outcomes that are compatible with the results. That would make for interesting stories I’m sure.