
Hierarchical shift function: a powerful alternative to the t-test

In this post I introduce a simple yet powerful method to compare two dependent groups: the hierarchical shift function. The code is on GitHub. More details are in Rousselet & Wilcox (2019), with a reproducibility package on figshare.

Let’s consider different situations in a hierarchical setting: we’ve got trials from 2 conditions in several participants. Imagine we collected data from one participant and the results look like this:

unnamed-chunk-3-1

These fake reaction time data were created by sampling from ex-Gaussian distributions. Here the two populations are shifted by a constant, so we expect a uniform shift between the two samples. Later we'll look at examples in which differences are most pronounced in early responses, in late responses, and in spread.

To better understand how the distributions differ, let's look at a shift function, in which the differences between the deciles of the two conditions are plotted as a function of the deciles in condition 1 – see details in Rousselet et al. (2017). The decile differences are all negative, showing stochastic dominance of condition 2 over condition 1. The function is not flat because of random sampling and limited sample size.

unnamed-chunk-4-1

Now, let's say we collected 100 trials per condition from 30 participants. How do we proceed? There are a variety of approaches available to quantify distribution differences. Ideally, such data would be analysed using a multi-level model, including for instance ex-Gaussian fits, random slopes and intercepts for participants, item analyses… This can be done using the lme4 or brms R packages. However, in my experience, in neuroscience and psychology articles, the most common approach is to collapse the variability across trials into a single number per participant and condition in order to perform a paired t-test: typically, the mean is computed across trials for each condition and participant, the means are subtracted, and the distribution of mean differences is entered into a one-sample t-test. Obviously, this strategy throws away a huge amount of information! And the results of such second-tier t-tests are difficult to interpret: a positive test leaves us wondering exactly how the distributions differ; a negative test is ambiguous – besides avoiding the classic 'absence of evidence is not evidence of absence' error, we also need to check whether the distributions differ in aspects other than the mean. So what can we do?

Depending on how conditions differ, looking at aspects of the data other than the mean can be more informative. For instance, in Rousselet & Wilcox (2019), we consider group comparisons of individual medians. Considering that the median is the second quartile, looking at the other quartiles can be of theoretical interest to investigate effects in early or later parts of distributions. This could be done in several ways, for instance by making inferences on the first quartile (Q1) or the third quartile (Q3). If the goal is to detect differences anywhere in the distributions, a more systematic approach consists of quantifying differences at multiple quantiles. Here we consider the case of the deciles, but other quantiles could be used. First, for each participant and each condition, the sample deciles are computed over trials. Second, for each participant, condition 2 deciles are subtracted from condition 1 deciles – we're dealing with a within-subject (repeated-measures) design. Third, for each decile, the distribution of differences is subjected to a one-sample test. Fourth, a correction for multiple comparisons is applied across the 9 one-sample tests. I call this procedure a hierarchical shift function. There are many options available to implement this procedure and the example used here is not the definitive answer: the goal is simply to demonstrate that a relatively simple procedure can be much more powerful and informative than standard approaches.

In creating a hierarchical shift function we need to make three choices: a quantile estimator, a statistical test to assess quantile differences across participants, and a technique to correct for multiple comparisons. The deciles were estimated using type 8 of the base R quantile() function (see justification in Rousselet & Wilcox, 2019). The group comparisons were performed using a one-sample t-test on 20% trimmed means, which performs well in many situations, including in the presence of outliers. The correction for multiple comparisons employed Hochberg's strategy (Hochberg, 1988), which guarantees that the probability of at least one false positive will not exceed the nominal level, as long as the nominal level is not exceeded for each quantile.
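To make the four steps concrete, here is a minimal sketch in R; it is not the implementation from the GitHub code or the rogme package. It assumes trial-level data in a long data frame df with columns participant, condition (coded 1 and 2) and rt; the helper names decile_diff() and tm_test() are illustrative.

```r
# Minimal sketch of the hierarchical shift function (illustrative, not the rogme code).
# Assumes a long data frame `df` with columns participant, condition (1 or 2), rt.

probs <- seq(0.1, 0.9, 0.1)  # deciles

# Steps 1-2: per participant, type 8 deciles in each condition, then their difference
decile_diff <- function(d) {
  quantile(d$rt[d$condition == 1], probs, type = 8) -
    quantile(d$rt[d$condition == 2], probs, type = 8)
}
diffs <- t(sapply(split(df, df$participant), decile_diff))  # participants x 9 matrix

# Step 3: one-sample t-test on the 20% trimmed mean (Tukey-McLaughlin), for each decile
tm_test <- function(x, tr = 0.2) {
  n <- length(x)
  g <- floor(tr * n)                        # observations trimmed from each side
  xs <- sort(x)
  xw <- c(rep(xs[g + 1], g), xs[(g + 1):(n - g)], rep(xs[n - g], g))  # Winsorized sample
  se <- sd(xw) / ((1 - 2 * tr) * sqrt(n))   # standard error of the trimmed mean
  tval <- mean(x, trim = tr) / se
  2 * pt(-abs(tval), df = n - 2 * g - 1)    # two-sided p value for H0: trimmed mean = 0
}
pvals <- apply(diffs, 2, tm_test)

# Step 4: Hochberg's correction across the 9 deciles
p_adj <- p.adjust(pvals, method = "hochberg")
```

The test on the trimmed mean uses the Winsorized standard deviation to form the standard error, which is why the degrees of freedom are reduced to n - 2g - 1.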

In Rousselet & Wilcox (2019), we consider power curves for the hierarchical shift function (HSF) and contrast them to other approaches: by design, HSF is sensitive to more types of differences than any standard approach using the mean or a single quantile. Another advantage of HSF is that the location of the distribution difference can be interrogated, which is impossible if inferences are limited to a single value.

Here is what the hierarchical shift function looks like for our uniform shift example:

unnamed-chunk-7-1

The decile differences between conditions are plotted for each participant (colour coded) and the group 20% trimmed means are superimposed in black. Differences are pretty constant across deciles, suggesting a uniform shift. Most participants have shift functions that are entirely negative – a case of stochastic dominance of one condition over the other. There is growing uncertainty as we consider higher deciles, which is expected from measurements of right-skewed distributions.

We can add confidence intervals:

unnamed-chunk-9-1

P values are available in the GitHub code.

Instead of standard parametric confidence intervals, we can also consider percentile bootstrap confidence intervals (or highest density intervals), as done here:

unnamed-chunk-14-1

Distributions of bootstrap estimates can be considered cheap Bayesian posterior distributions. They also contain useful information not captured by simply reporting confidence intervals.
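To give an idea of how these bootstrap distributions can be obtained, here is a minimal sketch that resamples participants with replacement from the diffs matrix of the earlier sketch; the number of bootstrap samples and the percentile interval are illustrative choices, not necessarily those used in the GitHub code.

```r
# Percentile bootstrap of the group 20% trimmed means, one distribution per decile.
# Assumes the participants x deciles matrix `diffs` from the sketch above.
set.seed(21)
nboot <- 2000
np <- nrow(diffs)
boot_tm <- matrix(NA, nrow = nboot, ncol = ncol(diffs))
for (b in seq_len(nboot)) {
  id <- sample(np, np, replace = TRUE)                   # resample participants
  boot_tm[b, ] <- apply(diffs[id, , drop = FALSE], 2, mean, trim = 0.2)
}
# 95% percentile bootstrap confidence interval for each decile
ci <- apply(boot_tm, 2, quantile, probs = c(0.025, 0.975))
```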

Here we plot them using geom_halfeyeh() from tidybayes. 

unnamed-chunk-15-1

The distributions of bootstrap estimates of the group 20% trimmed means are shown in orange, one for each decile. Along the base of each distribution, the black dot marks the mode and the vertical lines mark the 50% and 90% highest density intervals.

Nice hey?! Reporting a figure like that is dramatically more informative than reporting a P value and a confidence interval from a t-test!

A bootstrap approach can also be used to perform a cluster correction for multiple comparisons – see details on GitHub. Preliminary simulations suggest that this approach can provide a substantial increase in power over Hochberg's correction – more on that in another post.

Let’s look at 3 more examples, just for fun…

Example 2: early difference

Example participant:

unnamed-chunk-17-1

Shift function:

unnamed-chunk-18-1

Hierarchical shift function with confidence intervals:

unnamed-chunk-22-1

Percentile bootstrap estimate densities:

unnamed-chunk-28-1

Example 3: difference in spread

Example participant:

unnamed-chunk-29-1

Shift function:

unnamed-chunk-30-1

Hierarchical shift function with confidence intervals:

unnamed-chunk-34-1

Percentile bootstrap estimate densities:

unnamed-chunk-40-1

Example 4: late difference

Example participant:

unnamed-chunk-41-1

Shift function:

unnamed-chunk-42-1

Hierarchical shift function with confidence intervals:

unnamed-chunk-46-1

Percentile bootstrap estimate densities:

unnamed-chunk-52-1

Conclusion

The hierarchical shift function can be used to achieve two goals: 

  • to screen data for potential distribution differences using p values, without limiting the exploration to a single statistic such as the mean;
  • to illustrate and quantify how distributions differ.

I think of the hierarchical shift function as the missing link between t-tests and multi-level models. I hope it will help a few people make sense of their data and maybe nudge them towards proper hierarchical modelling.

R functions for the parametric hierarchical shift function are available in the rogme package. I also plan to add bootstrap functions. Then I'll tackle the case of 2 independent groups, which requires a third level to quantify differences of differences.

 

Reaction times and other skewed distributions: problems with the mean and the median (part 2/4)

As we saw in the previous post, the sample median is biased when sampling from skewed distributions. The bias increases with decreasing sample size. According to Miller (1988), because of this bias, group comparison can be affected if the two groups differ in skewness or sample size, or both. As a result, real differences can be lowered or increased, and non-existent differences suggested. In Miller’s own words:

“An important practical consequence of the bias in median reaction time is that sample medians must not be used to compare reaction times across experimental conditions when there are unequal numbers of trials in the conditions.”

Let’s evaluate this advice.

We assess the problem using a simulation in which we draw samples of the same or of different sizes from populations with the same skewness, using the same 12 distributions as Miller (1988), as described previously.

Group 2 has size 200. Group 1 has size 10 to 200, in increments of 10.
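Here is a minimal sketch of this simulation for the most skewed of Miller's 12 ex-Gaussian distributions (parameters 300, 20, 300); the sampler rexgauss() and the other names are illustrative, and the full simulation loops over all 12 distributions.

```r
# Bias of the difference between group medians, with unequal sample sizes.
# Both groups come from the same ex-Gaussian population, so the true difference is 0.
rexgauss <- function(n, mu, sigma, tau) rnorm(n, mu, sigma) + rexp(n, rate = 1 / tau)

set.seed(44)
nsim <- 10000
n1_seq <- seq(10, 200, 10)   # group 1 sizes
n2 <- 200                    # group 2 size
bias_md <- numeric(length(n1_seq))

for (i in seq_along(n1_seq)) {
  d <- replicate(nsim, {
    median(rexgauss(n1_seq[i], 300, 20, 300)) - median(rexgauss(n2, 300, 20, 300))
  })
  bias_md[i] <- mean(d)   # long-run bias; median(d) gives the median bias discussed later
}
```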

After 10,000 iterations, here are the results for the mean:

figure_bias_diff_m

 

All the bias values are near zero, as expected.

Here are the results for the median:

figure_bias_diff_md

 

Bias increases with skewness and with the difference in sample size, and is particularly large for n = 10. At least 90-100 trials in Group 1 are required to bring the bias down to values similar to those observed for the mean.

Next, let’s find out if we can correct the bias. Bias correction is performed in 2 ways:

  • using the bootstrap

  • using subsamples, following Miller’s suggestion.

Miller (1988) suggested:

“Although it is computationally quite tedious, there is a way to use medians to reduce the effects of outliers without introducing a bias dependent on sample size. One uses the regular median from Condition F and compares it with a special “average median” (Am) from Condition M. To compute Am, one would take from Condition M all the possible subsamples of Size f where f is the number of trials in Condition F. For each subsample one computes the subsample median. Then, Am is the average, across all possible subsamples, of the subsample medians. This procedure does not introduce bias, because all medians are computed on the basis of the same sample (subsample) size.”

Using all possible subsamples would take far too long. For instance, if one group has 5 observations and the other group has 20 observations, there are 15,504 (choose(20,5)) subsamples to consider. Slightly larger sample sizes would force us to consider millions of subsamples. So instead we compute K random subsamples; I arbitrarily set K to 1,000. Although this is not exactly what Miller (1988) suggested, this K-loop shortcut should reduce bias to some extent if it is due to sample size differences. A sketch of the shortcut is shown below, followed by the simulation results.
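Here is a minimal sketch of that shortcut, reusing the rexgauss() sampler from the previous sketch; the function name subsample_median() is illustrative.

```r
# "Average median" from K random subsamples of the larger group,
# each subsample matching the size of the smaller group.
subsample_median <- function(large_group, n_small, K = 1000) {
  mean(replicate(K, median(sample(large_group, n_small))))   # subsampling without replacement
}

# One simulated comparison: 10 trials in group 1, 200 trials in group 2
g1 <- rexgauss(10, 300, 20, 300)
g2 <- rexgauss(200, 300, 20, 300)
median(g1) - subsample_median(g2, length(g1))   # bias-reduced difference estimate
```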

figure_bias_diff_md_sub

 

The K loop approach works very well! Bias can also be handled by the bootstrap. Here is what we get using 200 bootstrap resamples for each simulation iteration:

figure_bias_diff_md_bbc

 

The bootstrap bias correction works very well too! So, in the long run, the bias in the estimation of differences between medians can be eliminated using the subsampling or the percentile bootstrap approach. Because the sampling distributions are themselves skewed, we also consider the median bias, computed as the median of the estimates across simulations instead of their mean: it reflects the bias observed in a typical experiment. Judged this way, the difference between group means tends to underestimate the population difference:

figure_bias_diff_m_mdbias

For the median, the median bias is much lower than the standard (mean) bias, and near zero from n = 20.

figure_bias_diff_md_mdbias

Thus, for a typical experiment, the difference between group medians actually suffers less from bias than the difference between group means.

Conclusion

Miller’s (1988) advice was inappropriate because, when comparing two groups, bias in a typical experiment is actually negligible. To be cautious, when sample size is relatively small, it could be useful to report median effects with and without bootstrap bias correction. It would be even better to run simulations to determine the sample sizes required to achieve an acceptable measurement precision, irrespective of the estimator used.

Finally, data & code are available on github.

[GO TO POST 3/4]

Reaction times and other skewed distributions: problems with the mean and the median (part 1/4)


UPDATE: this series of posts, along with much more material, is now part of this article:

Reaction times and other skewed distributions: problems with the mean and the median

[Preprint] [Reproducibility package]



In this series of 4 posts, I replicate, expand and discuss the results from

Miller, J. (1988) A warning about median reaction time. J Exp Psychol Hum Percept Perform, 14, 539-543.

Part 1 = replicate Miller’s simulations + apply bootstrap bias correction

Part 2 = expand Miller’s simulations to group comparison

Part 3 = problems with the mean

Part 4 = application to a large dataset

Data & code are available on github. The content of the 4 posts is also described in this article.


Reaction times (RT) and many other quantities in neuroscience & psychology are skewed. This asymmetry tends to differ among experimental conditions, such that a measure of central tendency and a measure of spread are insufficient to capture how conditions differ. Instead, to understand the potentially rich differences among distributions, it is advised to consider multiple quantiles of the distributions (Doksum 1974; Pratte et al. 2010; Rousselet et al. 2017), or to model the shapes of the distributions (Heathcote et al. 1991; Rouder et al. 2005; Palmer et al. 2011; Matzke et al. 2013). Yet, it is still common practice to summarise reaction time distributions using a single number, most often the mean: that one value for each participant and each condition can then be entered into a group ANOVA to make statistical inferences. However, because of the skewness of reaction times, the mean is a poor measure of central tendency: skewness shifts the mean away from the bulk of the distribution, an effect that can be amplified by the presence of outliers or a thick right tail. For instance, in the figure below, the median better represents the typical observation than the mean because it is closer to the bulk of the distribution.

figure_skew92_m_md

Mean and median for a very right-skewed distribution. The distribution is bounded to the left and has a long right tail. This is an ex-Gaussian distribution, which is a popular model for the shape of reaction time distributions, even though the mapping between its parameters and mental processes can be debated (Sternberg, 2014).

So the median appears to be a better choice than the mean if the goal is to have a single value to tell us about the location of most observations in a skewed distribution. The choice between the mean and the median is however more complicated. It could be argued that because the mean is sensitive to skewness, outliers and the thickness of the right tail, it is better able to capture changes in the shapes of the distributions among conditions. But the use of a single value to capture shape differences will lead to intractable analyses because the same mean could correspond to various shapes. Instead, a multiple quantile approach or explicit shape modelling should be used.

The mean and the median differ in another important aspect: for small sample sizes, the sample mean is unbiased, whereas the sample median is biased (see illustrations in the previous post). Concretely, if we perform the same RT experiment many times, and for each experiment we compute the mean and the median, the average of the sample means will be very close to the population mean. As the number of experiments increases, the average sample mean will converge to the exact population mean. This is not the case for the median when sample size is small.

The reason for this bias is explained by Miller (1988):

“Like all sample statistics, sample medians vary randomly (from sample to sample) around the true population median, with more random variation when the sample size is smaller. Because medians are determined by ranks rather than magnitudes of scores, the population percentiles of sample medians vary symmetrically around the desired value of 50%. For example, a sample median is just as likely to be the score at the 40th percentile in the population as the score at the 60th percentile. If the original distribution is positively skewed, this symmetry implies that the distribution of sample medians will also be positively skewed. Specifically, unusually large sample medians (e.g., 60th percentile) will be farther above the population median than unusually small sample medians (e.g., 40th percentile) will be below it. The average of all possible sample medians, then, will be larger than the true median, because sample medians less than the true value will not be small enough to balance out the sample medians greater than the true value. Naturally, the more the distribution is skewed, the greater will be the bias in the sample median.”

To illustrate the sample median’s bias, Miller (1988) employed 12 ex-Gaussian distributions that differ in skewness. The distributions are illustrated in the next figure, and colour-coded using the difference between the mean and the median as a non-parametric measure of skewness.

figure_miller_distributions

Here are the parameters of the 12 distributions, along with the associated population values.

 

table_pop_param

The first figure in this post used the most skewed distribution of the 12, with parameters (300, 20, 300).

To estimate bias, we run a simulation in which we sample with replacement 10,000 times from each of the 12 distributions. We take random samples of sizes 4, 6, 8, 10, 15, 20, 25, 35, 50 and 100, as did Miller. For each random sample, we compute the mean and the median. For each sample size and ex-Gaussian parameter, the bias is then defined as the difference between the mean of the 10,000 sample estimates and the population value.
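Here is a minimal sketch of this simulation for one distribution (the most skewed one, with parameters 300, 20, 300) and one sample size; the population median of an ex-Gaussian has no simple closed form, so it is approximated from a very large sample, and all names are illustrative.

```r
# Bias of the sample mean and the sample median for one ex-Gaussian distribution
# (mu = 300, sigma = 20, tau = 300) and one sample size (n = 10).
rexgauss <- function(n, mu, sigma, tau) rnorm(n, mu, sigma) + rexp(n, rate = 1 / tau)

set.seed(21)
pop_m  <- 300 + 300                              # exact population mean = mu + tau
pop_md <- median(rexgauss(1e6, 300, 20, 300))    # population median, approximated numerically

nsim <- 10000
n <- 10
est <- replicate(nsim, {
  x <- rexgauss(n, 300, 20, 300)
  c(mean(x), median(x))
})
bias_m  <- mean(est[1, ]) - pop_m     # close to zero: the sample mean is unbiased
bias_md <- mean(est[2, ]) - pop_md    # positive: the sample median over-estimates the population median
```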

First, we check that the mean is not biased:

figure_miller_bias_m

Each line shows the results for one type of ex-Gaussian distribution: the mean of 10,000 simulations for different sample sizes. The grey area marks the 50% highest-density interval (HDI) of the 10,000 simulations for the least skewed distribution (the same interval is shown in the next two figures for comparison). The interval shows the variability across simulations and highlights an important aspect of the results: bias is a long-run property of an estimator; there is no guarantee that one value from a single experiment will be close to the population value. Also, the variability among samples increases with decreasing sample size, which is why results across small n experiments can differ substantially.

Here are the median bias estimates in table format:

 

table_md_bias

Columns = sample sizes; rows = skewness

 

The values are very close to the values reported in Miller (1988):

miller1988_table1

The same results are easier to grasp as a figure:

figure_miller_bias_md

As reported by Miller (1988), bias can be quite large and it gets worse with decreasing sample sizes and increasing skewness.

Based on these results, Miller made this recommendation:

“An important practical consequence of the bias in median reaction time is that sample medians must not be used to compare reaction times across experimental conditions when there are unequal numbers of trials in the conditions.”

According to Google Scholar, Miller (1988) has been cited 172 times. A look at some of the oldest and most recent citations reveals that his advice has been followed.

For instance, Lavie (1995) noted:

“In the following experiments I used the mean RTs for each participant rather than the medians, as the increase in number of go/no-go errors under the high-load conditions resulted in a different number of responses in between conditions (see Miller, 1988).”

Tipper et al. (1992):

Analysis by Miller (1988) shows that for large numbers of trials, differences in the numbers between conditions […] has no impact on the medians obtained.

More recently, Robinson et al. (2018):

“[…] comparing medians among conditions with an unequal number of trials may lead to false positives (Miller, 1988)”

Du et al. (2017):

“We chose the mean rather than median of RT as […] the sample median may provide a biased estimation of RT (Miller, 1988)”

The list goes on. Also, in a review paper, Whelan (2008), cited 324 times, reiterates the advice:

“The median should never be used on RT data to compare conditions with different numbers of trials.”

However, there are several problems with Miller’s advice. In particular, using the mean leads to many issues with estimation and statistical inferences, as we will see in the 3rd post. In this post, we tackle one key omission from Miller’s assessment: the bias of the sample median can be corrected, using a percentile bootstrap bias correction, as described in this previous post.
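As a reminder of how the correction works, here is a minimal sketch for a single sample; the function name bc_median() is illustrative.

```r
# Bootstrap bias correction of the sample median:
# estimated bias = mean of bootstrap medians - sample median,
# corrected estimate = sample median - estimated bias.
bc_median <- function(x, nboot = 200) {
  boot_md <- replicate(nboot, median(sample(x, length(x), replace = TRUE)))
  2 * median(x) - mean(boot_md)
}
```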

For each iteration in the simulation, bias correction was performed using 200 bootstrap samples. Here are the bias corrected results:

figure_miller_bias_md_bc

The bias correction works very well on average, except for the smallest sample sizes. The failure of the bias correction for very small n is not surprising, because the shape of the sampling distribution cannot be properly estimated by the bootstrap from so few observations. From n = 10, the bias values are very close to those observed for the mean. So it seems that in the long-run, we can eliminate the bias of the sample median by using a simple bootstrap procedure. As we will see in the next post, the bootstrap bias correction is also effective when comparing two groups.


Update: 06 Feb 2018

Finally, let's have a look at how well bias correction works as a function of sample size and skewness. The smaller the sample size, the less accurately the bootstrap can estimate the sampling distribution and its bias. Ideally, after bias correction, we would expect the sample medians to be centred around zero, with limited variability. This ideal situation could look something like this:

figure_miller_bc_ideal_P1_N4

Here we’re considering the most skewed distribution with the smallest sample size. The x-axis shows the median bias before bias correction, whereas the y-axis shows the median bias after bias correction. In this ideal case, the correction works extremely well irrespective of the original bias (average bias is the red vertical line). As a result, the average bias after bias correction is very near zero (horizontal green line).

The reality is very different. The bias correction is only partially effective and is inhomogeneous.

figure_miller_bc_check_P1_N4

Let’s add some markup to help understand what’s going on.

figure_miller_bc_check_P1_N4_markup

If the original bias is negative, after correction the median tends to be even more negative, that is, over corrected in the wrong direction (lower left triangle).

If the original bias is positive, after correction, the median is either:
– over corrected in the right direction (lower right triangle)
– under corrected in the right direction (middle right triangle)
– over corrected in the wrong direction (upper right triangle)

This pattern remains, although attenuated, if we consider the largest sample size.

figure_miller_bc_check_P1_N100

Or if we consider the least skewed distribution.

figure_miller_bc_check_P12_N4

We can look at the different patterns as a function of sample size and skewness.

The figure below shows the probability of over correcting in the wrong direction given that the original bias is negative (lower left triangle of the marked up figure).

figure_miller_bc_check_neg_over_wrong

In the ideal situation illustrated previously, the expected proportion of over correction in the wrong direction, given an original negative bias, is 6.7%. So here we clearly have an overrepresentation of these cases. When the original bias is negative, in most cases the bootstrap is unable to correct in the right direction. The situation gets worse with increasing skewness and smaller sample sizes.

The figure below shows the probability of under correcting in the right direction given that the original bias is positive (middle right triangle of the marked up figure).

figure_miller_bc_check_pos_under_right

In the ideal situation illustrated previously, the expected proportion of under correction in the right direction, given an original positive bias, is 44.7 %. So here, we have an overrepresentation of these cases. When the original bias is positive, in too many cases the bootstrap corrects in the right direction, but it under-corrects. The situation gets worse with increasing skewness and smaller sample sizes.

Ok, so bias correction is imperfect and it varies a lot, in part depending on whether the sample median fell above or below the unknown population median. Think using the mean is safer? There are several strong arguments to the contrary – more in another post.

[GO TO POST 2/4]

References

Doksum, K. (1974) Empirical Probability Plots and Statistical Inference for Nonlinear Models in the two-Sample Case. Annals of Statistics, 2, 267-277.

Du, Y., Valentini, N.C., Kim, M.J., Whitall, J. & Clark, J.E. (2017) Children and Adults Both Learn Motor Sequences Quickly, But Do So Differently. Frontiers in Psychology, 8.

Heathcote, A., Popiel, S.J. & Mewhort, D.J.K. (1991) Analysis of Response-Time Distributions – an Example Using the Stroop Task. Psychol Bull, 109, 340-347.

Lavie, N. (1995) Perceptual Load as a Necessary Condition for Selective Attention. J Exp Psychol Human, 21, 451-468.

Matzke, D., Love, J., Wiecki, T.V., Brown, S.D., Logan, G.D. & Wagenmakers, E.J. (2013) Release the BEESTS: Bayesian Estimation of Ex-Gaussian STop Signal reaction time distributions. Front Psychol, 4.

Miller, J. (1988) A warning about median reaction time. J Exp Psychol Hum Percept Perform, 14, 539-543.

Palmer, E.M., Horowitz, T.S., Torralba, A. & Wolfe, J.M. (2011) What Are the Shapes of Response Time Distributions in Visual Search? J Exp Psychol Human, 37, 58-71.

Pratte, M.S., Rouder, J.N., Morey, R.D. & Feng, C.N. (2010) Exploring the differences in distributional properties between Stroop and Simon effects using delta plots. Atten Percept Psycho, 72, 2013-2025.

Robinson, M.M., Clevenger, J. & Irwin, D.E. (2018) The action is in the task set, not in the action. Cognitive Psychology, 100, 17-42.

Rouder, J.N., Lu, J., Speckman, P., Sun, D.H. & Jiang, Y. (2005) A hierarchical model for estimating response time distributions. Psychon B Rev, 12, 195-223.

Sternberg, S. (2014) Reaction times and the ex-Gaussian distribution: When is it appropriate? Retrieved from http://www.psych.upenn.edu//~saul/

Tipper, S.P., Lortie, C. & Baylis, G.C. (1992) Selective Reaching – Evidence for Action-Centered Attention. J Exp Psychol Human, 18, 891-905.

Whelan, R. (2008) Effective analysis of reaction time data. Psychol Rec, 58, 475-482.

 

What can we learn from 10,000 experiments?


The code and a notebook for this post are available on github.

Before we conduct an experiment, we decide on a number of trials per condition and a number of participants, and then we hope that whatever we measure comes close to the population values we're trying to estimate. In this post I'll use a large dataset to illustrate how close we can get to the truth – at least in a lexical decision task. The data are from the French Lexicon Project:

Ferrand, L., New, B., Brysbaert, M., Keuleers, E., Bonin, P., Meot, A., Augustinova, M. & Pallier, C. (2010) The French Lexicon Project: lexical decision data for 38,840 French words and 38,840 pseudowords. Behav Res Methods, 42, 488-496.

After discarding participants who clearly did not pay attention, we have 967 participants who performed a word/non-word discrimination task. Each participant has about 1000 trials per condition. I only consider the reaction time (RT) data, but correct/incorrect data are also available. Here are RT distributions for 3 random participants (PXXX refers to their ID number in the dataset):

figure_flp_p113

figure_flp_p388

figure_flp_p965

The distributions are positively skewed, as expected for RT data, and participants tend to be slower in the non-word condition compared to the word condition. Usually, a single number is used to summarise each individual RT distribution. From 1000 values to 1, that’s some serious data compression. (For alternative strategies to compare RT distributions, see for instance this paper). In psychology, the mean is often used, but here the median gives a better indication of the location of the typical observation. Let’s use both. Here is the distribution across participants of median RT for the word and non-word conditions:

figure_all_p_medians

And the distribution of median differences:

figure_all_p_median_diff

Interestingly, the distribution of differences between the medians of the two conditions is also skewed: that's because the two distributions tend to differ in skewness.

We can do the same for the mean:

figure_all_p_means

Mean differences:

figure_all_p_mean_diff

With this large dataset, we can play a very useful game. Let’s pretend that the full dataset is our population that we’re trying to estimate. Across all trials and all participants, the population medians are:

Word = 679.5 ms
Non-word = 764 ms
Difference = 78.5 ms

the population means are:

Word = 767.6 ms
Non-word = 853.2 ms
Difference = 85.5 ms

Now, we can perform experiments by sampling with replacement from our large population. I think we can all agree that a typical experiment does not have 1,000 trials per condition, and certainly does not have almost 1,000 participants! So what happens if we perform experiments in which we collect smaller sample sizes? How close can we get to the truth?

10,000 experiments: random samples of participants only

In the first simulation, we use all the trials (1,000) available for each participant but vary how many participants we sample, from 10 to 100 in steps of 10. For each sample size, we draw 10,000 random samples of participants from the population and compute the group mean and the group median. For consistency, group means are computed from individual means, and group medians from individual medians. This gives 6 sets of results: the word condition, the non-word condition and their difference, for both the mean and the median. A minimal sketch of the procedure is shown below, followed by the results for the group medians in the word condition, in which the vertical red line marks the population median.
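This sketch covers the group medians in the word condition; it assumes a vector md_word holding the 967 individual medians computed from the full dataset, and the names are illustrative (the actual code is in the notebook on github).

```r
# Simulation 1: resample participants only, using each participant's median over all trials.
# `md_word` = vector of individual medians for the word condition (one value per participant).
set.seed(7)
nsim <- 10000
n_seq <- seq(10, 100, 10)
group_md <- matrix(NA, nrow = nsim, ncol = length(n_seq))
for (i in seq_along(n_seq)) {
  group_md[, i] <- replicate(nsim, median(sample(md_word, n_seq[i], replace = TRUE)))
}
# each column of group_md is the sampling distribution for one sample size
```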

figure_md_w_size

With increasing sample size, we gain in precision: on average the result of each experiment is closer to the population value. With a small sample size, the result of a single experiment can be quite far from the population. This is not surprising, and also explains why estimates from small n experiments tend to disagree, especially between an original study and subsequent replications.

To get a clearer sense of how close the estimates are to the population value, it is useful to consider highest density intervals (HDI). A HDI is the shortest interval that contains a certain proportion of the observations: it shows where the bulk of the observations lies. Here we consider 50% HDIs; a minimal sketch of how such an interval can be computed is shown below, followed by the results.
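The sketch sorts the values, scans all intervals containing the requested proportion of observations, and keeps the shortest one; the function name hdi() is illustrative (packages such as HDInterval provide similar functions).

```r
# Shortest interval containing a given proportion of the observations.
hdi <- function(x, prop = 0.5) {
  x <- sort(x)
  n <- length(x)
  k <- ceiling(prop * n)                  # number of observations inside the interval
  widths <- x[k:n] - x[1:(n - k + 1)]     # width of every candidate interval
  lo <- which.min(widths)
  c(x[lo], x[lo + k - 1])
}

# 50% HDI of the group medians for each sample size (columns of group_md from above)
apply(group_md, 2, hdi, prop = 0.5)
```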

figure_md_w_size_hdi

In keeping with the previous figure, the intervals get smaller with increasing sample size. The intervals are also asymmetric: the right side is always closer to the population value than the left side. That’s because the sampling distributions are asymmetric. This means that the typical experiment will tend to under-estimate the population value.

We observe a similar pattern for the non-word condition, with the addition of two peaks in the distribution from n = 40. I'm not sure what's causing them, or whether this is a common shape. One could check by analysing the lexicon project data from other countries. Any volunteers? One thing is for sure: the two peaks are caused by the median, because they are absent from the mean distributions (see notebook on github).

figure_md_nw_size

Finally, we can consider the distributions of group medians of median differences:

figure_md_diff_size

Not surprisingly, it has a similar shape to the previous distributions. It shows a landscape of expected effects. It would be interesting to see where the results of other, smaller, experiments fall in that distribution. The distribution could also be used to plan experiments to achieve a certain level of precision. This is rather unusual given the common obsession with statistical power. But experiments should be predominantly about quantifying effects, so a legitimate concern is to determine how far we're likely to be from the truth in a given experiment. Using our simulated distribution of experiments, we can determine the probability that the absolute difference between an experiment's estimate and the population value is larger than some cut-off. In the figure below, we look at cut-offs from 5 ms to 50 ms, in steps of 5 ms.

figure_md_diff_size_prob

The probability of being wrong by at least 5 ms is more than 50% for all sample sizes. But 5 ms is probably not a difference folks studying lexical decision would worry about. It might be an important difference in other fields. 10 ms might be more relevant: I certainly remember a conference in which group differences of 10 ms between attention conditions were seriously discussed, and their implications for theories of attention considered. If we care about a resolution of at least 10 ms, n = 30 still gives us a 44% chance of being wrong…

(If someone is looking for a cool exercise: you could determine the probability of two random experiments to differ by at least certain amounts.)

We can also look at the 50% HDI of the median differences:

figure_md_diff_size_hdi

As previously noted, the intervals shrink with increasing sample size, but are asymmetric, suggesting that the typical experiment will under-estimate the population difference. This asymmetry disappears for larger sample sizes when we consider group means:

figure_m_diff_size_hdi

Anyway, these last figures really make me think that we shouldn’t make such a big deal out of any single experiment, given the uncertainty in our measurements.

10,000 experiments: random samples of trials and participants

In the previous simulation, we sampled participants with replacement, using all their 1,000 trials. Typical experiments use fewer trials. Let's say I plan to collect data from 20 participants, with 100 trials per condition. What does the sampling distribution look like? Using our large dataset, we can sample trials and participants to find out. Again, for consistency, group means are computed from individual means, and group medians from individual medians. A minimal sketch of this second simulation is shown below, followed by the sampling distributions of the mean and the median in the word condition.
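The sketch covers the group medians in the word condition; it assumes a list rt_word with one vector of reaction times per participant, and trials are resampled here with replacement, which is an illustrative choice.

```r
# Simulation 2: resample participants and trials.
# `rt_word` = list of per-participant RT vectors for the word condition.
set.seed(8)
nsim <- 10000
np <- 20    # participants per simulated experiment
nt <- 100   # trials per participant and condition
sim_md <- replicate(nsim, {
  id <- sample(length(rt_word), np, replace = TRUE)      # sample participants
  ind_md <- sapply(id, function(p) median(sample(rt_word[[p]], nt, replace = TRUE)))
  median(ind_md)                                         # group median of individual medians
})
```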

figure_sim_word

The vertical lines mark the population values. The means are larger than the medians because of the positive skewness of RT distributions.

In the non-word condition:

figure_sim_nonword

And the difference distributions:

figure_sim_difference

Both distributions are slightly positively skewed, and their medians fall to the left of the population values. The spread of expected values is quite large.

Similarly to simulation 1, we can determine the probability of being wrong by at least 5 ms, which is 39%. For 10 ms, the probability is 28%, and for 20 ms 14%. But that’s assuming 20 participants and 100 trials per condition. A larger simulation could investigate many combinations of sample sizes…

Finally, we consider the 50% HDIs, which are both asymmetric with respect to the population values and of similar lengths:

figure_sim_difference_hdi

Conclusion

To answer the question from the title, I think the main things we can learn from the simulated sampling distributions above are:
– to embrace uncertainty
– to be modest in our interpretations
– to maintain healthy skepticism about published research

One way to highlight uncertainty in published research would be to run simulations using descriptive statistics from the articles, to illustrate the range of potential outcomes that are compatible with the results. That would make for interesting stories I'm sure.