Mean or median reaction time? An open review of Miller (2020)

Below is a review of Miller (2020), which is a reply to an article in which Rand Wilcox & I reproduced and built upon Miller (1988). Here are the articles in order:

[1] A Warning About Median Reaction Time

[2] Reaction times and other skewed distributions: problems with the mean and the median

[3] Another Warning about Median Reaction Time

I’ve added a link to [3] and this review to article [2], so readers are aware of the discussions.

The review will only make sense if you read at least [3] first, but note that [2] contains a lot of simulations, descriptions and references not covered in [3].

Review

To start, I’ve got to say I’ve learnt so much about various sources of bias from your work on reaction time analyses, including Miller (1988) and many subsequent papers. I discovered Miller (1988) by chance a few years ago, while researching a review article on robust measures of effect sizes. Actually, I was so startled by the 1988 results that I dropped the article on effect sizes to replicate Miller’s simulations and explore their consequences. This extensive work has taught me a lot about reaction time data, the mean, the median, their sampling distributions and associated inferential statistics. So I’m really thankful for that.

Overall, I enjoyed your reply to our paper, which provides interesting new simulations and a good summary of the issues. The main apparent discrepancies are much smaller than they seem and can easily be addressed by wording key statements and conclusions more carefully, for instance by highlighting boundary conditions. If anything I wrote below is unclear, feel free to contact me directly. In particular, if needed, we could discuss the g&h simulation code, which I realise I probably could have explained better in the R&W2020 article.

Your article is well written. The illustrations are fine but could be improved by adding colours or grey/black contrasts. In Figure 4 (presenting the most original and interesting results), the symbols could be different for the 3 estimators to improve clarity.

The simulation results are convincing and mostly concur with our own assessment. However, what is missing is a consideration of the effects of outliers and the skewness of the distribution of differences (after pooling across trials using the mean, the median or some other estimator). As we explained in R&W2020, outliers and skewness can essentially destroy the power of inferential tests using the mean, whereas tests on the median are hardly affected.

“Another unusual feature of R&W’s simulations is that they used g&h distributions (Hoaglin, 1985) as models of RT. These distributions are quite different from ex-Gaussians and are not normally considered in RT modelling (e.g., Luce, 1986). This distributional choice may also have contributed to the advantage for medians in their simulations.”

I don’t understand how our simulations using g&h distributions could be the source of the discrepancy, because we didn’t use them to simulate distributions of reaction times. Instead, we used them to systematically manipulate the distribution of differences that is fed to a one-sample test, from normal to very skewed. We also varied the probability that outliers are observed. The shape of the marginal RT distributions should be irrelevant to these simulations: when the RT distributions are identical in every condition and participant, irrespective of their skewness, the distribution of differences is normal (that is, for each participant, compute the mean for each condition then subtract the means, leading to a distribution of pairwise differences). When the distributions differ in skewness, the distribution of differences has skewness equal to the difference in skewness between the original distributions.

Thus, our simulations are only concerned with the group-level analyses, and only with the shapes of the distributions of differences; whether these distributions resulted from individual means or medians was not considered. We also used different one-sample tests for the different estimators of central tendency of the distributions of differences. This is necessary because the mean, the median and trimmed means have different standard errors. So our assessments of the median are not directly comparable to those in your simulations, in which medians are fed to mean-based group tests.

To be clear: as I understand it, in your simulations, different RT distributions from 2 conditions are summarised using the mean and the median, one for each condition and each participant (stage 1); pairwise differences are computed, resulting in distributions of differences (stage 2); one-sample tests on the mean of the differences are performed (stage 3).

In R&W2020’s g&h simulations, we ignored stage 1. We only considered the shapes of the distributions of differences, and then computed one-sample tests for 3 different estimators of central tendency. The skewness of these distributions depends on both within- and between-participant variability, but we did not model these sources of variability explicitly, only their end-product. Our simulations demonstrate that, given the same distribution of differences, in some situations the median and the 20% trimmed mean have dramatically more power than the mean.
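To make this concrete, below is a minimal R sketch of this type of simulation. It is an illustration under my own naming assumptions, not the actual R&W2020 code: ghdist() follows the standard g&h construction (Hoaglin, 1985), and tmtest() implements the Tukey & McLaughlin (1963) one-sample test for a trimmed mean.

```r
# Illustration only, not the R&W2020 code. ghdist() follows the standard
# g&h construction (Hoaglin, 1985): g controls skewness, h tail thickness.
ghdist <- function(n, g = 0, h = 0) {
  z <- rnorm(n)
  x <- if (g > 0) (exp(g * z) - 1) / g else z
  x * exp(h * z^2 / 2)
}

# One-sample test on a 20% trimmed mean (Tukey & McLaughlin, 1963).
tmtest <- function(x, null = 0, tr = 0.2) {
  n <- length(x); gn <- floor(tr * n); xs <- sort(x)
  xw <- c(rep(xs[gn + 1], gn), xs[(gn + 1):(n - gn)], rep(xs[n - gn], gn))
  se <- sd(xw) / ((1 - 2 * tr) * sqrt(n))       # winsorised standard error
  tval <- (mean(x, trim = tr) - null) / se
  2 * pt(-abs(tval), df = n - 2 * gn - 1)       # two-sided p value
}

set.seed(1)
g <- 1; h <- 0.2; n <- 30; es <- 0.3            # skewed + outliers, true shift es
pop <- ghdist(1e6, g, h)                        # approximate the population
pop.m <- mean(pop); pop.tm <- mean(pop, trim = 0.2)
nsim <- 2000
p.m <- p.tm <- numeric(nsim)
for (S in seq_len(nsim)) {
  diffs <- ghdist(n, g, h) + es                 # one distribution of differences
  p.m[S]  <- t.test(diffs, mu = pop.m)$p.value  # test on the mean
  p.tm[S] <- tmtest(diffs, null = pop.tm)       # test on the 20% trimmed mean
}
mean(p.m < 0.05); mean(p.tm < 0.05)             # power: mean vs trimmed mean
```

Each test is assessed against its own population parameter, so the true shift is the same for both methods.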

It might well be that in some (many?) situations, RT distributions do not differ enough in skewness to affect the power of the one-sample t-test. But it remains a fundamental statistical result that one-sample t-tests are strongly affected by skewness, and even more so by outliers. This is also covered in detail in Wilcox & Rousselet (2018).

To address this discrepancy in simulation results, I don’t think new simulations necessarily need to be added, but the problem should be presented more carefully.

Other apparent disagreements can be addressed by more careful phrasing of the situations under which the problems occur, especially at the start of the conclusion section, which I find somewhat misleading.

“R&W concluded that “there seems to be no rationale for preferring the mean over the median as a measure of central tendency for skewed distributions” (p. 37). On the contrary, when performing hypothesis tests to compare the central tendencies of RTs between experimental conditions, the present simulations show that there may be an extremely clear rationale in terms of both Type I error rate and power.”

As we explain in our paper, it is precisely because the mean is a poor measure of central tendency that in some situations it is better at detecting distribution differences (particularly when conditions differ in skewness, and more specifically in their right tails when dealing with RT distributions). But higher power or a nominal Type I error rate doesn’t make the mean a better measure of central tendency.

What is needed in this section is a clear distinction between 2 different but complementary goals:
[1] to detect differences between distributions;
[2] to understand how distributions differ.

As we argue in our paper, if the goal is [2], then it makes no sense to use the mean or the median alone; much better tools are available, starting with using the mean and the median together, to get a richer perspective. The distinction is clearly made in footnote 1 of your article, but should be reiterated in the conclusion and the abstract, so there is no confusion.

“When comparing conditions with unequal numbers of trials, the sample-size-dependent bias of regular medians can lead to clear inflation of the Type I error rate (Fig. 2), so these medians definitely should not be used.”

This statement is only valid when the number of trials is low in the condition with the fewest trials. So, to be clear, the problem emerges only for a combination of absolute and relative numbers of trials. The problem also occurs when group statistics involve means. In contrast, performing, for instance, a median one-sample test on group differences between medians does not lead to inflated false positives. I realise that most users who collapse RT distributions are likely to perform group statistics using the mean, but this assumption needs to be stated explicitly. The choice of estimator applies at two levels of analysis: within-participants and between-participants.

For this reason, it would be worth describing in the text what inferential tests were used for the analyses (I presume t-tests). Doing so would bring some nuance to statements such as this one, given that power depends on choices at two levels of analysis:

“The choice of central tendency measure would then be determined primarily by comparing the power of these three measures (i.e., means, medians, bias-corrected medians).”

In R&W2020 we also considered tests on medians, which solve the bias issue. We also looked at trimmed means, and other options are available such as the family of M-estimators.

“In view of the fact that means have demonstrably greater power than bias-corrected medians for experiments with unequal trial frequencies (Fig. 3), it is also sensible to compare power levels in experiments with equal trial frequencies.”

This statement is inaccurate because it is too general: as we demonstrate in Figure 15 of R&W2020, at least in some situations the median can have higher power than the mean. Perhaps more importantly, in the presence of skewness and outliers, methods using quantiles (with the median as a special case) or trimmed means can have dramatically more power than mean-based methods.

To go back to the distinction between goals [1] and [2], as we argue in R&W2020, ideally we should focus on richer descriptions of distributions using multiple quantiles to understand how they differ (a point you also make in other articles). Limiting analyses to the mean or the median is really unsatisfactory. Also, standard t-tests and ANOVAs do not account for measurement error, including item variability, which should really be handled using hierarchical models. Whether we go the quantile way or the hierarchical modelling way (or both), the number of trials required to make sense of the data would make bias issues naturally go away. So personally I see using the mean RT as a valid recommendation, but only in very specific situations.

Your last point in the discussion, about choosing a measure of central tendency before seeing the data, is an important one. Perhaps more importantly, given the large number of options available to make sense of rich reaction time data, the prime recommendation should probably be that authors make their data publicly available, so that others can try alternative and, in most cases, better techniques.

Other points:

“There is unfortunately no consensus about the psychological meanings of changes in these different parameters (e.g., Rieger & Miller, 2019), but the ex-Gaussian distribution nevertheless remains useful as a way of describing changes in the shapes as well as the means of RT distributions.”

This is an excellent point. It would also be worth citing Matzke & Wagenmakers (2009) in addition to Rieger & Miller (2019).

I would also mention that other distributions provide better interpretability in terms of shift and scale effects (shifted lognormal) — see Lindelov (2019).

Typos:

“the very between-condition difference” — remove very?

The bootstrap-t technique

There are many types of bootstrap methods, but for most applications, two methods are most common: the percentile bootstrap, presented in an earlier post, and the bootstrap-t technique, also known as the percentile-t bootstrap or the studentized bootstrap (Efron & Tibshirani, 1994; Wilcox, 2017). For inferences on the population mean, the standard T-test and the percentile bootstrap can give unsatisfactory results when sampling from skewed distributions, especially when sample size is small. To illustrate the problem with the T-test, imagine that we sample from populations of increasing skewness.

Probability density functions for g&h distributions. Parameter g varies from 0 to 1. Parameter h=0.

Here we use g&h distributions, in which parameter g controls the skewness and parameter h controls the thickness of the tails; a normal distribution is obtained by setting g=h=0 (Hoaglin, 1985; Yan & Genton, 2019). If we take many samples of size n=30 from these distributions, and for each sample we compute a T value, using the population mean as the null value, we obtain progressively more negatively skewed sampling distributions of T values.

Sampling distributions of T values for different g values. Results are based on a simulation with 50,000 iterations and samples of size n=30.
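Here is a minimal R sketch of that demonstration; ghdist() below is my own helper implementing the standard g&h construction (Hoaglin, 1985), and the population mean is approximated by Monte Carlo.

```r
# Sampling distribution of T values for increasingly skewed g&h data.
ghdist <- function(n, g = 0, h = 0) {
  z <- rnorm(n)
  x <- if (g > 0) (exp(g * z) - 1) / g else z
  x * exp(h * z^2 / 2)
}

set.seed(21)
n <- 30
for (g in c(0, 0.5, 1)) {
  pop.m <- mean(ghdist(1e6, g, 0))          # Monte Carlo population mean
  tvals <- replicate(50000, {
    x <- ghdist(n, g, 0)
    (mean(x) - pop.m) / (sd(x) / sqrt(n))   # T with the population mean as null
  })
  skew <- mean((tvals - mean(tvals))^3) / sd(tvals)^3
  cat("g =", g, "-> skewness of T values =", round(skew, 2), "\n")
}
```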

However, when we perform a T-test, the sampling distribution of the T values is assumed to be symmetric, irrespective of sample size. This assumption leads to incorrect confidence intervals (CIs). The idea behind the bootstrap-t technique is to use the bootstrap (sampling with replacement) to compute a data-driven T distribution. In the presence of skewness, this T distribution could be skewed, as suggested by the data. Then, the appropriate quantile of the bootstrap T distribution is plugged into the standard CI equation to obtain a parametric bootstrap CI.

Bootstrap-t procedure

Let’s illustrate the procedure for a CI for the population mean. We start with a sample of 30 observations from a g&h distribution with g=1 and h=0.

Sample of size n=30 from a g&h distribution with g=1 and h=0. The vertical line indicates the sample mean.

In the first step, we centre the distribution: for inferences on the mean, we subtract the mean from each observation in the sample, so that the mean of the centred distribution is now zero. This is a way to create a data-driven null distribution, in which there is no effect (the mean is zero), but the shape of the distribution and the distances among observations are unaffected, as shown in the next figure. For inferences on a trimmed mean, we subtract the trimmed mean from each observation, so that the centred distribution has a trimmed mean of zero.

Same distribution as in the previous figure, but the distribution has been mean centred, so that the sample mean is now zero.
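In code the centring step is a one-liner; a small sketch, reusing the illustrative ghdist() defined above:

```r
set.seed(21)
x <- ghdist(30, g = 1, h = 0)      # ghdist() from the sketch above
xc <- x - mean(x)                  # centred for inferences on the mean
round(mean(xc), 10)                # sample mean is now 0
xt <- x - mean(x, trim = 0.2)      # centred for a 20% trimmed mean instead
round(mean(xt, trim = 0.2), 10)    # trimmed mean is now 0
```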

In the next step, we sample with replacement from the centred distribution many times, and for each random sample we compute a T value. That way, we obtain a bootstrap distribution of T values expected by random sampling, under the hypothesis that the population has a mean (or trimmed mean) of zero, given the distribution of the data. Then, we use some quantile of the bootstrap T distribution in the standard CI equation. (Note that for trimmed means, the T-test equation is adjusted; see Tukey & McLaughlin, 1963.)

5,000 bootstrap T values obtained by sampling with replacement from the mean centred data. In the asymmetric bootstrap-t technique, the quantiles (red vertical lines) of that distribution of T values are used to define the CI bounds. The insets contain the formulas for the lower (CIlo) and upper (CIup) bounds of the CI. Note that the lower T quantile is used to compute the upper bound (this is not an error). In the symmetric bootstrap-t technique, one quantile of the distribution of absolute T values is used to define the CI bounds.

Because the bootstrap distribution is potentially asymmetric, we have two choices of quantiles: for a 95% CI, either we use the 0.025 and 0.975 quantiles of the signed T values to obtain a potentially asymmetric CI, also called an equal-tailed CI, or we use the 0.95 quantile of the absolute T values, leading to a symmetric CI.
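Putting the steps together, here is a minimal R sketch of the bootstrap-t CI for the mean, returning both the equal-tailed (asymmetric) and the symmetric versions. The function name boott.ci() is illustrative, not taken from a particular package.

```r
boott.ci <- function(x, nboot = 5000, alpha = 0.05) {
  n <- length(x)
  xbar <- mean(x)
  se <- sd(x) / sqrt(n)
  xc <- x - xbar                           # step 1: centre the sample
  tboot <- replicate(nboot, {              # step 2: bootstrap T values
    xb <- sample(xc, n, replace = TRUE)
    mean(xb) / (sd(xb) / sqrt(n))
  })
  # Equal-tailed (asymmetric) CI. Note the quantile swap: the lower T
  # quantile sets the upper bound, and vice versa.
  q <- quantile(tboot, c(alpha / 2, 1 - alpha / 2))
  asym <- c(xbar - q[2] * se, xbar - q[1] * se)
  # Symmetric CI: one quantile of the absolute T values.
  qa <- quantile(abs(tboot), 1 - alpha)
  sym <- c(xbar - qa * se, xbar + qa * se)
  list(asymmetric = unname(asym), symmetric = unname(sym))
}

set.seed(4)
boott.ci(ghdist(30, g = 1, h = 0))  # ghdist() from the sketch above
```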

In our example, for the mean the symmetric CI is [-0.4, 1.62] and the asymmetric CI is [0.08, 1.87]. If instead we use the 20% trimmed mean, the symmetric CI is [-0.36, 0.59] and the asymmetric CI is [-0.3, 0.67] (see Rousselet, Pernet & Wilcox, 2019). So clearly, confidence intervals can differ a lot depending on the estimator and method we use. In other words, a 20% trimmed mean is not a substitute for the mean; it asks a different question about the data.

Bootstrap samples

Why does the bootstrap-t approach work better than the standard T-test CI? Imagine we take multiple samples of size n=30 from a g&h distribution with g=1 and h=0.

Comparison of T distributions for g=1 and h=0: the theoretical T distribution used in the T-test is in red; the empirical T distribution, obtained by sampling with replacement many times from the g&h distribution, is in black. The red and black vertical lines indicate the T quantiles for a 95% CI. The grey lines show examples of 20 bootstrap sampling distributions, each based on a sample of size n=30 and 5,000 bootstrap samples.

In the figure above, the standard T-test assumes the sampling distribution in red, symmetric around zero. As we saw above, the sampling distribution is actually asymmetric, with negative skewness, as shown in black. However, the black empirical distribution is unobservable, unless we can perform thousands of experiments. So, with the bootstrap, we try to estimate this correct, yet unobservable, sampling distribution. The grey curves show examples of 20 simulated experiments: in each experiment, a sample of 30 observations is drawn, and then 5,000 bootstrap T values are computed. The resulting bootstrap sampling distributions are negatively skewed and much closer to the empirical distribution in black than to the theoretical symmetric distribution in red. Thus, using data-driven T distributions could help achieve better CIs than assuming symmetry.

How do these different methods perform? To find out, we carry out simulations in which we draw samples from g&h distributions with the g parameter varying from 0 to 1, keeping h=0. For each sample, we compute a one-sample CI using the standard T-test, the two bootstrap-t methods just described (asymmetric and symmetric), and the percentile bootstrap. When estimating the population mean, coverage goes down with skewness for all four methods.

Confidence interval coverage for the 4 methods applied to the mean. Results of a simulation with 20,000 iterations, sample sizes of n=30, and 599 bootstrap samples. You can see what happens for the 10% trimmed mean and the 20% trimmed mean in Rousselet, Pernet & Wilcox (2019).
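If you want to check coverage numbers like these yourself, here is a compact sketch, reusing ghdist() and boott.ci() from the sketches above (with fewer simulation iterations than in the figure, so it runs in reasonable time):

```r
set.seed(44)
nsim <- 2000; n <- 30
for (g in c(0, 0.5, 1)) {
  pop.m <- mean(ghdist(1e6, g, 0))            # Monte Carlo population mean
  cov.t <- cov.bt <- logical(nsim)
  for (S in seq_len(nsim)) {
    x <- ghdist(n, g, 0)
    ci.t <- t.test(x)$conf.int                # standard T-test CI
    ci.bt <- boott.ci(x, nboot = 599)$asymmetric
    cov.t[S]  <- ci.t[1] <= pop.m && pop.m <= ci.t[2]
    cov.bt[S] <- ci.bt[1] <= pop.m && pop.m <= ci.bt[2]
  }
  cat("g =", g, "| T-test:", mean(cov.t), "| bootstrap-t:", mean(cov.bt), "\n")
}
```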

Among the parametric methods, the standard T-test is the most affected by skewness, with coverage less than 90% for the most skewed condition. The asymmetric bootstrap-t CI seems to perform the best. The percentile bootstrap performs the worst in all situations, with coverage systematically below 95%, including for normal distributions.

In addition to coverage, it is useful to consider the width of the CIs from the different techniques.

Confidence interval median width, based on the same simulation reported in the previous figure.

The width of a CI is its upper bound minus its lower bound. For each combination of parameters, the results are summarised by the median width across simulations. At low levels of asymmetry, for which the three parametric methods have roughly 95% coverage, the CIs also tend to be of similar widths. As asymmetry increases, all methods tend to produce larger CIs, but the T-test produces CIs that are too short, a problem that stems from the symmetric theoretical T distribution, which assumes T values that are too small in magnitude. Compared to the parametric approaches, the percentile bootstrap produces the shortest CIs for all g values.

Confidence intervals: a closer look

We now have a closer look at the confidence intervals in the different situations considered above. We use a simulation with 20,000 iterations, sample size n=30, and 599 bootstrap samples.

Under normality

As we saw above, under normality the coverage is close to nominal (95%) for every method, although coverage for the percentile bootstrap is slightly too low, at 93.5%. Out of 20,000 simulated experiments, about 1,000 CIs (roughly 5%) did not include the population value. For all methods, about the same number of CIs were shifted to the left and to the right of the population value, and the CIs were of similar sizes.

We observed the same behaviour for several parametric methods in a previous post. Now, what happens when we sample from a skewed population?

In the presence of skewness (g=1, h=0)

Coverage is lower than the expected 95% for all methods. Coverage is about 88% for the standard and percentile bootstrap CIs, 92.3% for the asymmetric bootstrap-t CIs, and 91% for the symmetric bootstrap-t CIs. As we saw above, CIs are larger for the bootstrap-t CIs relative to the standard and percentile bootstrap CIs. CIs that did not include the population value tended to be shifted to the left of the population value, and more so for the standard CIs and the bootstrap-t symmetric CIs.

So when making inferences about the mean using the standard T-test, our CI coverage is lower than expected, and we are likely to underestimate the population value (the sample mean is median biased—Rousselet & Wilcox, 2019).

Relative to the other methods, the asymmetric bootstrap-t CIs are more evenly distributed on either side of the population value, and the right-shifted CIs tend to be much larger and more variable. The difference with the symmetric CIs is particularly striking and suggests that the asymmetric CIs could be misleading in certain situations. This intuition is confirmed by a simulation in which outliers are likely (h=0.2).

In the presence of skewness and outliers (g=1, h=0.2)

In the presence of outliers, the patterns observed in the previous figure are exacerbated. Some of the percentile bootstrap and asymmetric bootstrap-t intervals are ridiculously wide (x axis is truncated).

In such situations, inferences on trimmed means would greatly improve performance over the mean.

Conclusion

As we saw in a previous post, a good way to handle skewness and outliers is to make inferences about the population trimmed means. For instance, trimming 20% is efficient in many situations, even when using parametric methods that do not rely on the bootstrap. So what’s the point of the bootstrap-t? From the examples above, the bootstrap-t can perform much better than the standard Student’s approach and the percentile bootstrap when making inferences about the mean. So, when distributions are skewed and the population mean is of interest, the bootstrap-t is highly recommended. Whether to use the symmetric or the asymmetric approach is not completely clear based on the literature (Wilcox, 2017). Intuition suggests that the asymmetric approach is preferable, but our last example suggests that it could be a bad idea when making inferences about the mean.

Symmetric or not, the bootstrap-t confidence intervals combined with the mean do not necessarily deal with skewness as well as other methods combined with trimmed means. But the bootstrap-t can be used to make inferences about trimmed means too! So which combination of approaches should we use? For instance, we could make inferences about the mean, the 10% trimmed mean or the 20% trimmed mean, in conjunction with a non-bootstrap parametric method, the percentile bootstrap or the bootstrap-t. We saw that for the mean, the bootstrap-t method is preferable in the presence of skewness. For inferences about trimmed means, the percentile bootstrap works well when trimming 20%. If we trim less, then the other methods should be considered, but a blanket recommendation cannot be provided. The choice of combination can also depend on the application. For instance, to correct for multiple comparisons in brain imaging analyses, cluster-based statistics are strongly recommended, in which case a bootstrap-t approach is very convenient. And the bootstrap-t is easily extended to factorial ANOVAs (Wilcox, 2017; Field & Wilcox, 2017).

What about the median? The bootstrap-t should not be used to make inferences about the median (50% trimming), because the standard error is not estimated correctly. Special parametric techniques have been developed for the median (Wilcox, 2017). The percentile bootstrap also works well for the median and other quantiles in some situations, provided sample sizes are sufficiently large (Rousselet, Pernet & Wilcox, 2019).
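For completeness, a percentile bootstrap CI for the median takes only a few lines; a minimal sketch, again reusing the illustrative ghdist():

```r
pbmed.ci <- function(x, nboot = 2000, alpha = 0.05) {
  boot.md <- replicate(nboot, median(sample(x, replace = TRUE)))
  quantile(boot.md, c(alpha / 2, 1 - alpha / 2))  # percentile bootstrap CI
}

set.seed(4)
pbmed.ci(ghdist(30, g = 1, h = 0))
```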

References

Efron, Bradley, and Robert Tibshirani. An Introduction to the Bootstrap. Chapman and Hall/CRC, 1994.

Field, Andy P., and Rand R. Wilcox. ‘Robust Statistical Methods: A Primer for Clinical Psychology and Experimental Psychopathology Researchers’. Behaviour Research and Therapy 98 (November 2017): 19–38. https://doi.org/10.1016/j.brat.2017.05.013.

Hesterberg, Tim C. ‘What Teachers Should Know About the Bootstrap: Resampling in the Undergraduate Statistics Curriculum’. The American Statistician 69, no. 4 (2 October 2015): 371–86. https://doi.org/10.1080/00031305.2015.1089789.

Hoaglin, David C. ‘Summarizing Shape Numerically: The g-and-h Distributions’. In Exploring Data Tables, Trends, and Shapes, 461–513. John Wiley & Sons, Ltd, 1985. https://doi.org/10.1002/9781118150702.ch11.

Rousselet, Guillaume A., Cyril R. Pernet, and Rand R. Wilcox. ‘A Practical Introduction to the Bootstrap: A Versatile Method to Make Inferences by Using Data-Driven Simulations’. Preprint. PsyArXiv, 27 May 2019. https://doi.org/10.31234/osf.io/h8ft7.

Rousselet, Guillaume A., and Rand R. Wilcox. ‘Reaction Times and Other Skewed Distributions: Problems with the Mean and the Median’. Preprint. PsyArXiv, 17 January 2019. https://doi.org/10.31234/osf.io/3y54r.

Tukey, John W., and Donald H. McLaughlin. ‘Less Vulnerable Confidence and Significance Procedures for Location Based on a Single Sample: Trimming/Winsorization 1’. Sankhyā: The Indian Journal of Statistics, Series A (1961-2002) 25, no. 3 (1963): 331–52.

Wilcox, Rand R. Introduction to Robust Estimation and Hypothesis Testing. 4th edition. Academic Press, 2017.

Wilcox, Rand R., and Guillaume A. Rousselet. ‘A Guide to Robust Statistical Methods in Neuroscience’. Current Protocols in Neuroscience 82, no. 1 (2018): 8.42.1-8.42.30. https://doi.org/10.1002/cpns.41.

Yan, Yuan, and Marc G. Genton. ‘The Tukey G-and-h Distribution’. Significance 16, no. 3 (2019): 12–13. https://doi.org/10.1111/j.1740-9713.2019.01273.x.