You get the symptoms of a replication crisis even when there isn’t one: considering power

Many methods have been proposed to assess the success of a replication (Costigan et al. 2024; Cumming & Maillardet 2006; Errington et al. 2021; LeBel et al. 2019; Ly et al. 2019; Mathur & VanderWeele 2020; Muradchanian et al. 2021; Patil et al. 2016; Spence & Stanley 2024; Verhagen & Wagenmakers 2014). The most common method, also used to determine if results from similar experimental designs are consistent across studies, is consistency in statistical significance: do the two studies report a p value less than some (usually) arbitrary threshold? This approach can be misleading for many reasons, for instance when two studies report the same group difference but different confidence intervals: one including the null value, the other excluding it. Even though the group differences are the same, sampling error combined with statistical significance would lead us to conclude that the two studies disagree. There is a very nice illustration of the issue in Figure 1 of Amrhein, Greenland & McShane (2019).

More generally:

“if the alternative is correct and the actual power of two studies is 80%, the chance that the studies will both show P ≤ 0.05 will at best be only 0.80(0.80) = 64%; furthermore, the chance that one study shows P ≤ 0.05 and the other does not (and thus will be misinterpreted as showing conflicting results) is 2(0.80)0.20 = 32% or about 1 chance in 3.”
Greenland et al. 2016 (see also Amrhein, Trafimow & Greenland 2019)

So, in the long run, even if two studies always sample from the same population (even assuming all unmeasured sources of variability are the same across labs; Gelman et al. 2023), the literature would look like there is a replication crisis when none exists.

Let’s expand on the single values from the example in Greenland et al. (2016) and plot the probability of observing consistent and inconsistent results as a function of power:
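
As a minimal sketch of the calculation behind this figure: if both studies have the same true power, the probability that both reach p ≤ 0.05 is power², and the probability that exactly one does is 2 × power × (1 − power). The R code below plots these two curves; it is not the exact code used for the figure.

```r
power <- seq(0.05, 1, 0.05)          # true power of each study
both.sig <- power^2                  # both studies show p <= alpha
one.sig  <- 2 * power * (1 - power)  # only one study shows p <= alpha

plot(power, both.sig, type = "b", pch = 16, ylim = c(0, 1),
     xlab = "True power", ylab = "Probability")
lines(power, one.sig, type = "b", pch = 17)
legend("topleft", legend = c("both studies significant", "only one significant"),
       pch = c(16, 17), bty = "n")
```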

When deciding about consistency between experiments using the statistical significance criterion, the probability of reaching the correct decision depends on power, and unless power is very high, we will often be wrong.

In the previous figure, why consider power as low as 5%? If that seems unrealistic, a search for n=3 or n=4 in Nature and Science will reveal recent experiments in the biological sciences carried out with very small sample sizes. Also, in psychology, detecting interactions requires much larger sample sizes than those typically used, for instance when comparing correlation coefficients (Rousselet, Pernet & Wilcox, 2023). So very low power is still a real concern.

In practice, the situation is probably worse, because power analyses are typically performed assuming that parametric assumptions hold, so the real power of a line of research will be lower than expected — see simulations in Rousselet & Wilcox (2020); Wilcox & Rousselet (2023); Rousselet, Pernet & Wilcox (2023).

To provide an illustration of the effect of skewness on power, and in turn, on replication success based on statistical significance, let’s use g-and-h distributions — see details in Rousselet & Wilcox (2020) and Yan & Genton (2019). Here we consider h=0 and vary g from 0 (normal distribution) to 1 (shifted lognormal distribution):
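
For readers who want to play with these distributions, here is a minimal sketch of how to sample from a g-and-h distribution in R, using the standard transformation of a standard normal sample (see Yan & Genton, 2019); it mirrors the ghdist function in Wilcox's Rallfun code but is not the exact code used here.

```r
ghdist <- function(n, g = 0, h = 0){
  # sample n observations from a g-and-h distribution:
  # start from a standard normal sample, then apply the g (skewness)
  # and h (tail thickness) transformations
  z <- rnorm(n)
  if (g > 0) x <- (exp(g * z) - 1) / g * exp(h * z^2 / 2)
  else       x <- z * exp(h * z^2 / 2)
  x
}

set.seed(21)
plot(density(ghdist(100000, g = 0)), xlim = c(-4, 6),
     main = "g-and-h distributions (h = 0)", xlab = "Values")
lines(density(ghdist(100000, g = 0.5)), lty = 2)
lines(density(ghdist(100000, g = 1)), lty = 3)
legend("topright", legend = c("g = 0", "g = 0.5", "g = 1"), lty = 1:3, bty = "n")
```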

Now, let’s run a simulation in which we vary g, take samples of n=20, and perform a one-sample t-test on means, 10% trimmed means and 20% trimmed means. The code is on GitHub. To assess power, a constant is added to each sample, chosen to give 80% power when sampling from a standard normal population (g=h=0). Alpha is set to the arbitrary value of 0.05. The simulation includes 100,000 iterations.
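
Here is a minimal sketch of one cell of such a simulation, contrasting a one-sample t-test on the mean with a Tukey-McLaughlin one-sample test on the 20% trimmed mean. It assumes the ghdist() helper from the previous sketch; the effect size constant, number of iterations and other details are illustrative and differ from the GitHub code.

```r
trimci.onesample <- function(x, tr = 0.2, null.value = 0){
  # Tukey-McLaughlin one-sample test on a trimmed mean
  n <- length(x)
  g <- floor(tr * n)
  xs <- sort(x)
  xw <- c(rep(xs[g + 1], g), xs[(g + 1):(n - g)], rep(xs[n - g], g)) # winsorized sample
  se <- sqrt(var(xw)) / ((1 - 2 * tr) * sqrt(n))                     # SE of the trimmed mean
  tval <- (mean(x, trim = tr) - null.value) / se
  2 * pt(-abs(tval), df = n - 2 * g - 1)                             # p value
}

set.seed(44)
nsim <- 2000; n <- 20; g <- 1
es <- 0.66  # shift giving roughly 80% power under normality (see power.t.test)
p.m <- p.tm <- numeric(nsim)
for (S in seq_len(nsim)){
  x <- ghdist(n, g = g) + es
  p.m[S]  <- t.test(x)$p.value               # one-sample t-test on the mean
  p.tm[S] <- trimci.onesample(x, tr = 0.2)   # test on the 20% trimmed mean
}
c(mean = mean(p.m <= 0.05), trimmed = mean(p.tm <= 0.05))  # empirical power
```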

Here are the results for false positives, showing a non-linear increase as a function of g, with the one-sample t-test much more affected when using means than trimmed means:

And the true positive results, showing lower power for trimmed means under normality, but much greater resilience to increasing skewness than for the mean.

These results are well known (see for instance Rousselet & Wilcox, 2020). The novelty here is to consider the impact of skewness on the probability of obtaining a positive outcome in both experiments.

If we assume normality and determine our sample size to achieve 80% power in the long run, skewness can considerably lower the probability of observing two studies both showing p<0.05 if we employ a one-sample t-test on means. Trimmed means are much less affected by skewness. Other robust methods will perform even better (Wilcox & Rousselet, 2023).

In the same setting, here is the probability of a positive outcome in one experiment and a negative outcome in the other one:

Let’s consider h = 0.1, so that outliers are more likely than in the previous simulation:

In the presence of outliers, false positives increase even more with g for the mean:

And power is overall reduced for all methods:

This reduction in power leads to even lower probability of consistent results than in the previous simulation:

And here are the results on the probability of observing inconsistent results:

So in the presence of skewness and outliers, the situation is overall even worse than suggested by Greenland et al. (2016). For this and other reasons, consistency in statistical significance should not be used to infer the success of a replication.

References

Amrhein, V., Greenland, S., & McShane, B. (2019). Scientists rise up against statistical significance. Nature, 567(7748), 305. https://doi.org/10.1038/d41586-019-00857-9

Amrhein, V., Trafimow, D., & Greenland, S. (2019). Inferential Statistics as Descriptive Statistics: There Is No Replication Crisis if We Don’t Expect Replication. The American Statistician, 73(sup1), 262–270. https://doi.org/10.1080/00031305.2018.1543137

Costigan, S., Ruscio, J., & Crawford, J. T. (2024). Performing Small-Telescopes Analysis by Resampling: Empirically Constructing Confidence Intervals and Estimating Statistical Power for Measures of Effect Size. Advances in Methods and Practices in Psychological Science, 7(1), 25152459241227865. https://doi.org/10.1177/25152459241227865

Cumming, G., & Maillardet, R. (2006). Confidence intervals and replication: Where will the next mean fall? Psychological Methods, 11(3), 217–227. https://doi.org/10.1037/1082-989X.11.3.217

Errington, T. M., Mathur, M., Soderberg, C. K., Denis, A., Perfito, N., Iorns, E., & Nosek, B. A. (2021). Investigating the replicability of preclinical cancer biology. eLife, 10, e71601. https://doi.org/10.7554/eLife.71601

Gelman, A., Hullman, J., & Kennedy, L. (2023). Causal Quartets: Different Ways to Attain the Same Average Treatment Effect. The American Statistician. https://www.tandfonline.com/doi/full/10.1080/00031305.2023.2267597

Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. European Journal of Epidemiology, 31(4), 337–350. https://doi.org/10.1007/s10654-016-0149-3

LeBel, E. P., Vanpaemel, W., Cheung, I., & Campbell, L. (2019). A Brief Guide to Evaluate Replications. Meta-Psychology, 3. https://doi.org/10.15626/MP.2018.843

Ly, A., Etz, A., Marsman, M., & Wagenmakers, E.-J. (2019). Replication Bayes factors from evidence updating. Behavior Research Methods, 51(6), 2498–2508. https://doi.org/10.3758/s13428-018-1092-x

Mathur, M. B., & VanderWeele, T. J. (2020). New Statistical Metrics for Multisite Replication Projects. Journal of the Royal Statistical Society Series A: Statistics in Society, 183(3), 1145–1166. https://doi.org/10.1111/rssa.12572

Muradchanian, J., Hoekstra, R., Kiers, H., & van Ravenzwaaij, D. (2021). How best to quantify replication success? A simulation study on the comparison of replication success metrics. Royal Society Open Science, 8(5), 201697. https://doi.org/10.1098/rsos.201697

Patil, P., Peng, R. D., & Leek, J. T. (2016). What should we expect when we replicate? A statistical view of replicability in psychological science. Perspectives on Psychological Science : A Journal of the Association for Psychological Science, 11(4), 539–544. https://doi.org/10.1177/1745691616646366

Rousselet, G., Pernet, C. R., & Wilcox, R. R. (2023). An introduction to the bootstrap: A versatile method to make inferences by using data-driven simulations. Meta-Psychology, 7. https://doi.org/10.15626/MP.2019.2058

Rousselet, G. A., & Wilcox, R. R. (2020). Reaction Times and other Skewed Distributions: Problems with the Mean and the Median. Meta-Psychology, 4. https://doi.org/10.15626/MP.2019.1630

Spence, J. R., & Stanley, D. J. (2024). Tempered Expectations: A Tutorial for Calculating and Interpreting Prediction Intervals in the Context of Replications. Advances in Methods and Practices in Psychological Science, 7(1), 25152459231217932. https://doi.org/10.1177/25152459231217932

Verhagen, J., & Wagenmakers, E.-J. (2014). Bayesian tests to quantify the result of a replication attempt. Journal of Experimental Psychology. General, 143(4), 1457–1475. https://doi.org/10.1037/a0036731

Wilcox, R. R., & Rousselet, G. A. (2023). An Updated Guide to Robust Statistical Methods in Neuroscience. Current Protocols, 3(3), e719. https://doi.org/10.1002/cpz1.719

Yan, Y., & Genton, M. G. (2019). The Tukey g-and-h distribution. Significance, 16(3), 12–13. https://doi.org/10.1111/j.1740-9713.2019.01273.x

Can we use cluster-based permutation tests to estimate MEG/EEG onsets?

[An updated version of the work in this post appears in this preprint]

Cluster-based statistics are a family of methods that aim to correct for multiple comparisons while preserving power when effects are distributed over time, space and other dimensions (Maris & Oostenveld, 2007; Smith & Nichols, 2009; Pernet et al., 2015; Rosenblatt et al., 2018; Fields & Kuperberg, 2020; Frossard & Renaud, 2022). These methods capitalise on the correlation of effects across measurements to control the family-wise error rate (FWER, the probability of at least one false positive): neighbouring effects that pass a threshold are lumped together and compared to a reference distribution, typically a null distribution established using a permutation or bootstrap approach (Pernet et al., 2015). These cheap multivariate techniques (cheap because, even though they take into account the distribution of effects over at least one dimension, they require no explicit modelling) are particularly popular in fMRI, EEG and MEG research (for instance, according to Google Scholar, Smith & Nichols (2009) has been cited 4945 times, and Maris & Oostenveld (2007) 7053 times). Cluster-based methods have the potential to be applied to a wide array of problems, including simpler ones involving only a few comparisons, as I’ve argued in another post (see for instance the application to TMS latencies in Dugué et al., 2011). I’ve also used cluster-based statistics extensively in my EEG and behavioural research (Rousselet et al., 2014; Bieniek et al., 2016; Jaworska et al., 2020).

However, cluster-based methods are not without issues. Most notably, Sassenhagen & Draschkow (2019) reminded the community about problems with the interpretation of cluster-based inferences and suggested guidelines on how to report such results. This useful reminder has been heard, as demonstrated by the number of citations: 196 according to the publisher’s page and 320 according to Google Scholar. Essentially, the main problem with cluster-based methods is that they provide p values at the cluster level only, or even for collections of clusters for hierarchical/meta-analysis methods such as TFCE (Smith & Nichols, 2009). There are no p values at individual testing points, and no error control for onsets and offsets. Consequently, cluster-based methods do not provide direct inferences about the localisation of effects. But does that mean we cannot use cluster-based methods to localise, for instance to estimate the onsets of effects?

Sassenhagen & Draschkow (2019) go on to report results from a simulation, which suggest that onset estimation based on cluster-based inferences is very noisy. They used a truncated Gaussian as a template, to define a real onset to be estimated. Then, they compared 50 trials of noise only to 50 trials of noise + template, using cluster-sum F statistics – see the cluster-sum explanation here. These cluster-sums were compared to a permutation distribution to determine statistical significance. For each simulation iteration, the first statistically significant point was declared the onset of the effect. Here is their figure of results, showing the distribution of estimated onsets across 10,000 simulation iterations:

The distribution of onsets reveals two important patterns:

  • Estimated onsets appear positively biased, with the mode of the distribution later than the real onset.
  • The distribution is asymmetric, with a thicker left tail than right tail.

Sassenhagen & Draschkow (2019) reported that “on >20% of runs, the effect onset was estimated too early; divergence of 40 ms or more were found at >10% of runs.”
No bias estimate was reported.

So, how bad are the results? We don’t know because no other method was used to estimate onsets, so we cannot put the results in perspective. Imagine a study that aimed to test the prediction of larger variance in EEG signals in population X relative to population Y, but only tested participants from population X. Having observed large absolute variance in X, what can we conclude without measurements in Y?

Less importantly, Sassenhagen & Draschkow (2019) used EEG data from one human participant as noise to add to their template in the simulations. The data illustrated in the supplementary information look heavily contaminated by alpha noise – I could be wrong, but having recorded from hundreds of participants, it looks like the data are from a very sleepy participant. That’s a problem, because the claim is that there is a general issue with cluster-based inferences, not one specific to a particular participant.

Let’s address the comparison problem by contrasting cluster-based inferences with inferences from two other popular methods, the false discovery rate correction for multiple comparisons (FDR, Benjamini & Hochberg, 1995; Nichols & Hayasaka, 2003) and the maximum statistics correction for multiple comparisons (MAX, Holmes et al., 1996; Nichols & Holmes, 2002). FDR methods provide weak control of the FWER by controlling the expected proportion of false positives among the positive tests; when there is no effect, an FDR correction is equivalent to controlling the FWER; when effects are present, it is more powerful than controlling the FWER (Benjamini & Hochberg, 1995; Fields & Kuperberg, 2020). The MAX correction achieves strong control of the FWER by comparing each statistic, say a t or F value, to a distribution of maximum statistics obtained under the null. Like cluster-based methods, these methods do not explicitly estimate onsets (they were not designed for that purpose) and do not provide error control for onsets, but they are routinely used to localise effects.
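
As a rough sketch of how these two corrections can be implemented for a one-dimensional time-course, assuming two hypothetical trials × time matrices cond1 and cond2 (the names and permutation details are illustrative, not the code used for the simulations below):

```r
# Assumed inputs (illustrative names): cond1 and cond2, two trials x time matrices
ntimes <- ncol(cond1)
res <- lapply(seq_len(ntimes), function(F) t.test(cond1[, F], cond2[, F]))  # Welch t-tests
tvals2 <- sapply(res, function(x) unname(x$statistic)^2)
pvals  <- sapply(res, function(x) x$p.value)

# FDR: adjust the point-wise p values (Benjamini-Hochberg)
sig.fdr <- p.adjust(pvals, method = "BH") <= 0.05

# MAX: compare each observed t^2 to the permutation distribution of the maximum t^2
nperm <- 500  # reduced for speed
alltrials <- rbind(cond1, cond2)
n1 <- nrow(cond1)
maxdist <- replicate(nperm, {
  perm <- sample(nrow(alltrials))
  p1 <- alltrials[perm[1:n1], , drop = FALSE]
  p2 <- alltrials[perm[-(1:n1)], , drop = FALSE]
  max(sapply(seq_len(ntimes), function(F) t.test(p1[, F], p2[, F])$statistic^2))
})
sig.max <- tvals2 >= quantile(maxdist, 0.95)
```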

Other methods have been proposed, for instance defining the onset as the time at which a certain proportion of the peak amplitude has been reached (Liesefeld, 2018; Fahrenfort, 2020). However, such methods depend on potentially subjective peak identification (especially when applied to individual participants), and they are guaranteed, by definition, to produce positively biased onset estimates. These methods are not considered here, but they may have some merit.

Another recent method, cluster depth tests, combines cluster inferences with p values at every testing point, thus solving the main issue of cluster-based inferences as implemented so far (Frossard & Renaud, 2022). But this approach, like previous cluster-based methods and alternative corrections for multiple comparisons such as FDR or MAX, does not solve a fundamental issue that has been ignored so far: whether a p value is available at each testing point is irrelevant, because none of the methods routinely used to localise MEG and EEG effects explicitly assess onsets. That’s because onsets are second-order properties of the data. As such, they require establishing the lack of effects at some testing points, the presence of effects at others, and a test of the interaction. However, none of the cluster-based methods, or any other standard methods used to localise effects in EEG/MEG/fMRI and other fields, explicitly test for that interaction. The interaction is an inference performed pretty much by eyeballing maps of thresholded statistics. So localisation attempts typically rely on an interaction fallacy, a problem well documented in neuroscience (Nieuwenhuis et al., 2011) and other fields (Gelman & Stern, 2006).

More explicit estimation of onsets could be achieved by using methods that go beyond mass-univariate tests and consider entire time-courses. For instance, there is a vast family of algorithms aimed at detecting change points in time-series that could be used to estimate onsets. Many packages in the R programming language (R Core Team, 2021) offer such algorithms, for instance RBeast (Zhao et al., 2019), mcp (Lindeløv, 2020) and changepoint (Killick & Eckley, 2014). There is a very useful survey of such packages here. The Beast algorithm (Zhao et al., 2013) is also available in Python and Matlab, and I’m sure there are many others. Matlab offers several native solutions, including the findchangepts function. For comparison, here I used a basic change point detection method that relies on differences in mean and variance, as implemented in the changepoint package in R and the built-in findchangepts function in Matlab.
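
For instance, here is a minimal sketch using the changepoint package, applied to a hypothetical vector t2 of squared t values (one per time point) with a matching vector xtime of time points in ms; both names are illustrative and the settings are not necessarily those used in the simulations below.

```r
# install.packages("changepoint")
library(changepoint)
# t2: hypothetical vector of squared t values, one per time point
# xtime: hypothetical vector of matching time points in ms
res <- cpt.meanvar(t2, method = "BinSeg", Q = 2)  # look for 2 changes in mean and variance
cpts(res)                                         # indices of the estimated change points
onset <- xtime[cpts(res)[1]]                      # first change point taken as the onset
```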

So, let’s revisit the simulation results from Sassenhagen & Draschkow (2019) by performing our own simulations. Following their approach, I compared a noise only condition to a noise + signal condition. The signal was a truncated Gaussian defining an objective onset at 160 ms.

This approach is very similar to that of Sassenhagen & Draschkow (2019), except that I simulated only one electrode, with cluster inferences performed only in the time domain. There is no need to simulate extra dimensions (electrodes, time-frequencies), because the argument against cluster-based inferences is not specific to one particular dimension. Also, I used synthetic noise instead of data from one participant. The noise was generated by summing sinusoids at different frequencies (Yeung et al., 2004). Using 1/f noise instead gave very similar results.
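
Here is a minimal sketch of that type of synthetic noise, summing sinusoids with random phases; the frequency range and the 1/f amplitude scaling are assumptions for illustration, not the exact parameters used in the simulations.

```r
eeg.noise <- function(npts = 176, srate = 500, nsines = 50, maxfreq = 80){
  xtime  <- (0:(npts - 1)) / srate
  freqs  <- runif(nsines, min = 0.5, max = maxfreq)  # random frequencies
  amps   <- 1 / freqs                                # roughly 1/f amplitude spectrum
  phases <- runif(nsines, 0, 2 * pi)                 # random phases
  rowSums(sapply(seq_len(nsines), function(S){
    amps[S] * sin(2 * pi * freqs[S] * xtime + phases[S])
  }))
}

set.seed(666)
plot((0:175) / 500 * 1000, eeg.noise(), type = "l",
     xlab = "Time (ms)", ylab = "Amplitude (a.u.)")
```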

The simulations involved 50 trials per condition and 10,000 iterations per simulation; the null distributions were estimated using 2,000 random permutations. The simulations were performed twice, in R and in Matlab, with the code available on GitHub.

Here is an example of noise trials:

And 50 trials of signal + noise (grey), with the mean superimposed (black):

And an example of averages for the noise only and noise + signal conditions:

The vertical black line marks the 160 ms true onset. The two conditions were compared using two-sample t-tests with unequal variances (Welch’s t-tests). Here is the time-course of squared t values matching the previous figure:

We consider four methods to correct for multiple comparisons:

  • [FDR] The standard Benjamini-Hochberg (1995) FDR correction.
  • [MAX] The maximum statistics correction.
  • [CS] Cluster-based permutation inference, using a cluster-sum statistic of squared t values (a minimal sketch follows this list).
  • [CP] Change point detection based on differences in mean and variance, applied to the time-course of squared t values. The number of change points was set to 2.
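
Here is a minimal sketch of the CS procedure, assuming an observed vector of squared t values (t2.obs) and a matrix of squared t values obtained under permutation (perm.t2, one row per permutation); the variable names and the choice of cluster-forming threshold are illustrative, not a copy of the GitHub code.

```r
cluster.sum <- function(t2, threshold){
  above <- t2 >= threshold                           # supra-threshold time points
  runs <- rle(above)                                 # contiguous runs = clusters
  id <- rep(seq_along(runs$lengths), runs$lengths)   # cluster labels
  sums <- tapply(t2 * above, id, sum)                # cluster sums (0 for sub-threshold runs)
  list(id = id, above = above, sums = sums)
}

cs.test <- function(t2.obs, perm.t2, alpha = 0.05){
  # point-wise cluster-forming threshold from the permutation distribution
  threshold <- apply(perm.t2, 2, quantile, probs = 1 - alpha)
  # null distribution of the maximum cluster sum across permutations
  perm.max <- apply(perm.t2, 1, function(row){
    max(c(0, cluster.sum(row, threshold)$sums))
  })
  obs <- cluster.sum(t2.obs, threshold)
  keep <- obs$sums > 0 & obs$sums >= quantile(perm.max, 1 - alpha)
  sig.clusters <- names(obs$sums)[keep]
  sig <- obs$above & (as.character(obs$id) %in% sig.clusters)
  which(sig)[1]  # first significant time point = onset estimate (NA if no cluster survives)
}
```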

Here are the onsets estimated by each method, using the data from the previous example:

In grey is the permutation distribution of squared t values. The solid vertical line indicates the true onset. The dotted vertical line marks the estimated onset. The horizontal dashed lines are permutation thresholds.

After 10,000 iterations, we get these distributions of onsets for our four methods, with the vertical line marking the true onset:

Each distribution has a shape similar to that reported by Sassenhagen & Draschkow (2019): overall a positive bias and some amount of under-estimation. Let’s quantify the results.

[1] Bias was quantified by subtracting the true onset from the median of each distribution. Because the distributions are asymmetric, the median provides a better reflection of the typical onset than the mean (Rousselet & Wilcox, 2020).

Bias:

  • FDR = 30 ms
  • MAX = 40 ms
  • CS = 30 ms
  • CP = 10 ms

Confirming our visual inspection of the distributions, all methods are biased, the least biased being CP, the most biased being MAX. This is not surprising, as MAX is by definition conservative, the price to pay for strong FWER control.

[2] The mean absolute error (MAE) indicates how close to the true onset each method gets.

MAE:

  • FDR = 42 ms
  • MAX = 44 ms
  • CS = 30 ms
  • CP = 19 ms

Again, CS doesn’t perform worse than the other methods: CP was best, MAX was worst.

[3] As Sassenhagen & Draschkow (2019) did, let’s look at the proportion of onsets earlier than the true onset.

% early onsets:

  • FDR = 17.5%
  • MAX = 1.6%
  • CS = 0.5%
  • CP = 13.4%

[4] And the proportion of onsets at least 40 ms earlier than the true onset.

% onsets at least 40 ms early:

  • FDR = 14.7%
  • MAX = 1.3%
  • CS = 0.1%
  • CP = 3.7%

Again, there is nothing wrong with the cluster-sum approach: the other methods, which do provide p values at individual testing points, performed worse.

Using 1/f noise gave similar results, but the absolute results changed substantially depending on whether white, pink or red noise was used – see the full simulations on GitHub.

So far we have considered results from one simulated participant. A typical experiment involves more than one participant. What happens if we try to estimate onsets from 20 participants instead? To find out, let’s assume that each participant has a random onset of 150, 160 or 170 ms (uniform distribution), so that the population onset remains 160 ms. There are 50 trials per condition. In each simulation iteration, after estimating the onset in each of the 20 participants, we use the median onset as our group estimate. Here are the results:

Again, there is nothing wrong with the cluster-sum results: they are very similar to the FDR results, better than MAX, worse than change point.

In numbers:

Bias:

  • FDR = 30 ms
  • MAX = 40 ms
  • CS = 30 ms
  • CP = 10 ms

Mean absolute error:

  • FDR = 27 ms
  • MAX = 43 ms
  • CS = 29 ms
  • CP = 14 ms

Proportion too early:

  • FDR = 0.04%
  • MAX = 0%
  • CS = 0%
  • CP = 0.08%

Note the very low proportion of under-estimation once results are combined across participants.

Of course, another way to reduce bias is to increase sample size. Here are the results from one participant and sample sizes ranging from 20 to 150 trials.

Bias can be reduced considerably by using larger sample sizes, but the MAX method remains overly conservative. Here is what happens for the mean absolute error:

And the proportion of early onsets:

Overall, the cluster-sum method appears to perform well, and the change point approach looks very promising.

Discussion

Onsets are inherently noisy because they correspond to the testing points with the lowest signal-to-noise ratio. But estimation variability can be reduced by increasing the number of trials and pooling estimates across participants. And based on the present simulation results, there is no reason to doubt localisation results obtained using cluster-based inferences any more than those obtained using other methods. Certainly, I’m not going to ditch the estimates of face ERP onsets I’ve published so far. They are estimates after all, obtained in large numbers of participants, and with promising test-retest reliability across independent EEG sessions (Bieniek et al., 2016).

From the simulations, it seems that one should be more worried about onsets estimated using FDR and MAX corrections. The change point approach seems very promising. Certainly, I’ve used a relatively simple algorithm here, and it remains to be determined what can be achieved using modern methods, for instance Bayesian approaches that allow hierarchical pooling of estimates (Lindeløv, 2020).

Currently, no method routinely used in MEG/EEG explicitly estimates onsets, whether a p value is available at every testing point or not. Whether correction for multiple comparisons is achieved using cluster-based statistics, FDR, MAX or any related method, even the new cluster depth tests (Frossard & Renaud, 2022), onset estimation relies on a statistical fallacy, because the implied interaction is never explicitly tested. Does that mean nonsensical results are obtained? The simulations presented here clearly indicate that meaningful results can be obtained, and that cluster-based inferences can provide more accurate results than alternative methods.

Obviously, one experiment provides no guarantee, and that’s why it is essential to consider sampling distributions, like the ones presented here, to get an idea of how much variability to expect in the long run (Rousselet, 2019). But do not fall victim to statistical orthodoxy: plenty of methods are used to estimate aspects of the data that they were not developed to handle. When in doubt, simulations are essential to understand how good a job any of your tools can do – never trust the labels.

In addition to the onset estimation approach, it is worth considering other methodological aspects that can affect onsets. The most important factor is probably filtering, with high-pass filters introducing potentially severe signal distortions (Rousselet, 2012; Widmann & Schröger, 2012; Widmann et al., 2015; van Driel et al., 2021). If the goal is to estimate onsets, I recommend using causal filters to remove low frequencies, to avoid smearing effects backward in time. On the stimulus front, screen luminance can strongly influence onsets (Bieniek et al., 2013).

In conclusion, can we use cluster-based permutation tests to estimate MEG/EEG onsets? Absolutely, yes, but:

  • Instead of estimating onsets at the group level, measure onsets in each participant and report group distributions.
  • Assess the test-retest reliability of the onsets, either using the same dataset (cross-validation) or a different one.
  • Report results from more than one method, to assess computational consistency.

More advanced steps:

  • Use large datasets to compare onset estimation methods, including so far ignored techniques that explicitly model time-series.
  • Use large datasets to provide benchmarks in your field.
  • Use simulations to compare methods. To go beyond the simple approach presented here, one could use algorithms that generate more realistic time-series (Barzegaran et al., 2019).

References

Barzegaran, E., Bosse, S., Kohler, P.J., & Norcia, A.M. (2019) EEGSourceSim: A framework for realistic simulation of EEG scalp data using MRI-based forward models and biologically plausible signals and noise. J. Neurosci. Methods, 328, 108377.

Benjamini, Y. & Hochberg, Y. (1995) Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J. R. Stat. Soc. Ser. B Methodol., 57, 289–300.

Bieniek, M., Frei, L., & Rousselet, G. (2013) Early ERPs to faces: aging, luminance, and individual differences. Front. Psychol., 4.

Bieniek, M.M., Bennett, P.J., Sekuler, A.B., & Rousselet, G.A. (2016) A robust and representative lower bound on object processing speed in humans. Eur. J. Neurosci., 44, 1804–1814.

Dugué, L., Marque, P., & VanRullen, R. (2011) Transcranial Magnetic Stimulation Reveals Attentional Feedback to Area V1 during Serial Visual Search. PLOS ONE, 6, e19712.

Fahrenfort, J.J. (2020) Multivariate Methods to Track the Spatiotemporal Profile of Feature-Based Attentional Selection Using EEG. In Pollmann, S. (ed), Spatial Learning and Attention Guidance, Neuromethods. Springer US, New York, NY, pp. 129–156.

Fields, E.C. & Kuperberg, G.R. (2020) Having your cake and eating it too: Flexibility and power with mass univariate statistics for ERP data. Psychophysiology, 57, e13468.

Frossard, J. & Renaud, O. (2022) The cluster depth tests: Toward point-wise strong control of the family-wise error rate in massively univariate tests with application to M/EEG. NeuroImage, 247, 118824.

Gelman, A. & Stern, H. (2006) The Difference Between “Significant” and “Not Significant” is not Itself Statistically Significant. Am. Stat., 60, 328–331.

Holmes, A.P., Blair, R.C., Watson, J.D., & Ford, I. (1996) Nonparametric analysis of statistic images from functional mapping experiments. J. Cereb. Blood Flow Metab. Off. J. Int. Soc. Cereb. Blood Flow Metab., 16, 7–22.

Jaworska, K., Yi, F., Ince, R.A.A., van Rijsbergen, N.J., Schyns, P.G., & Rousselet, G.A. (2020) Healthy aging delays the neural processing of face features relevant for behavior by 40 ms. Hum. Brain Mapp., 41, 1212–1225.

Killick, R. & Eckley, I.A. (2014) changepoint: An R Package for Changepoint Analysis. J. Stat. Softw., 58, 1–19.

Liesefeld, H.R. (2018) Estimating the Timing of Cognitive Operations With MEG/EEG Latency Measures: A Primer, a Brief Tutorial, and an Implementation of Various Methods. Front. Neurosci., 12.

Lindeløv, J.K. (2020) mcp: An R Package for Regression With Multiple Change Points. OSF Prepr.,.

Maris, E. & Oostenveld, R. (2007) Nonparametric statistical testing of EEG- and MEG-data. J. Neurosci. Methods, 164, 177–190.

Nichols, T.E. & Holmes, A.P. (2002) Nonparametric permutation tests for functional neuroimaging: A primer with examples. Hum. Brain Mapp., 15, 1–25.

Nichols, T. & Hayasaka, S. (2003) Controlling the familywise error rate in functional neuroimaging: a comparative review. Stat Methods Med Res, 12, 419–446.

Nieuwenhuis, S., Forstmann, B.U., & Wagenmakers, E.-J. (2011) Erroneous analyses of interactions in neuroscience: a problem of significance. Nat. Neurosci., 14, 1105–1107.

Pernet, C.R., Latinus, M., Nichols, T.E., & Rousselet, G.A. (2015) Cluster-based computational methods for mass univariate analyses of event-related brain potentials/fields: A simulation study. J. Neurosci. Methods, Cutting-edge EEG Methods, 250, 85–93.

R Core Team (2021) R: A Language and Environment for Statistical Computing.

Rosenblatt, J.D., Finos, L., Weeda, W.D., Solari, A., & Goeman, J.J. (2018) All-Resolutions Inference for brain imaging. NeuroImage, 181, 786–796.

Rousselet, G. (2012) Does Filtering Preclude Us from Studying ERP Time-Courses? Front. Psychol., 3.

Rousselet, G. (2019) Using simulations to explore sampling distributions: an antidote to hasty and extravagant inferences.

Rousselet, G.A., Ince, R.A.A., van Rijsbergen, N.J., & Schyns, P.G. (2014) Eye coding mechanisms in early human face event-related potentials. J. Vis., 14, 7.

Rousselet, G.A. & Wilcox, R.R. (2020) Reaction Times and other Skewed Distributions: Problems with the Mean and the Median. Meta-Psychol., 4.

Sassenhagen, J. & Draschkow, D. (2019) Cluster-based permutation tests of MEG/EEG data do not establish significance of effect latency or location. Psychophysiology, 56, e13335.

Smith, S.M. & Nichols, T.E. (2009) Threshold-free cluster enhancement: Addressing problems of smoothing, threshold dependence and localisation in cluster inference. NeuroImage, 44, 83–98.

Stevens, M.H.H. (2009) A Primer of Ecology with R, 2nd Printing. Springer, New York.

van Driel, J., Olivers, C.N.L., & Fahrenfort, J.J. (2021) High-pass filtering artifacts in multivariate classification of neural time series data. J. Neurosci. Methods, 352, 109080.

Widmann, A. & Schröger, E. (2012) Filter Effects and Filter Artifacts in the Analysis of Electrophysiological Data. Front. Psychol., 3.

Widmann, A., Schröger, E., & Maess, B. (2015) Digital filter design for electrophysiological data – a practical approach. J. Neurosci. Methods, Cutting-edge EEG Methods, 250, 34–46.

Yeung, N., Bogacz, R., Holroyd, C.B., & Cohen, J.D. (2004) Detection of synchronized oscillations in the electroencephalogram: An evaluation of methods. Psychophysiology, 41, 822–832.

Zhao, K., Valle, D., Popescu, S., Zhang, X., & Mallick, B. (2013) Hyperspectral remote sensing of plant biochemistry using Bayesian model averaging with variable and band selection. Remote Sens. Environ., 132, 102–119.

Zhao, K., Wulder, M.A., Hu, T., Bright, R., Wu, Q., Qin, H., Li, Y., Toman, E., Mallick, B., Zhang, X., & Brown, M. (2019) Detecting change-point, trend, and seasonality in satellite time series data to track abrupt changes and nonlinear dynamics: A Bayesian ensemble algorithm. Remote Sens. Environ., 232, 111181.

How to test for variance homogeneity

It’s simple: don’t do it! (Or do it but for other reasons — see below)

“A Levene’s test was used to test if the distributions have equal variances.”

“To establish normality, we used a Shapiro-Wilk test with p > 0.05; and for equal variances we used Levene’s test.”

This approach of checking assumptions and then deciding which test to perform suffers from the limitations described in my previous post. In that post, I described why you should avoid normality tests in research, and how they could instead be used in a statistics class to cover a lot of core topics. The same goes for tests of variance homogeneity, which are often used in conjunction with normality tests. In particular, methods to detect differences in variance between groups can have low power and inaccurate type I error rates (false positives), which are the topics of this post. But how to capture variance differences is worth looking at, because in some situations a legitimate question is whether distributions differ in spread, or more specifically in variance. There is a wide range of applications of tests of variance differences in fields such as neuroscience, economics, genetics and quality control (Conover et al. 1981; Li et al. 2015; Ritchie et al. 2018; Patil & Kulkarni, 2022). Thus, it is useful to consider the false positive rate and power of these methods in different situations.

Before exploring error rates for methods aimed at detecting variance differences, it’s worth pointing out a paper by Zimmerman (2004), which looked at the error rates of a combo approach, in which a test of variance homogeneity is performed first, followed by a standard t-test or a Welch t-test depending on the outcome. This approach of conducting a preliminary check is not recommended: the false positives (type I errors) and power of the combo depend on the relative and absolute sample sizes, as well as the magnitude of the variance difference between groups, leading to poor performance in realistic situations. What works best is to use methods for unequal variances by default (heteroscedastic methods), a recommendation echoed in more recent work (Delacre et al. 2017). In the presence of skewness or outliers, or both, power can be boosted by using trimmed means in conjunction with parametric or bootstrap methods (Wilcox & Rousselet, 2023). In general, it is a bad idea to rely on methods that make strong assumptions about the sampled populations and to hope for the best.
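
As a minimal illustration of that default (not code from the papers cited above): in R, t.test() already applies the Welch correction unless equal variances are explicitly requested, and the yuen() function from the WRS2 package compares trimmed means while allowing unequal variances.

```r
set.seed(1)
x <- rnorm(30, mean = 0, sd = 1)
y <- rnorm(40, mean = 0.5, sd = 2)
dat <- data.frame(values = c(x, y), group = rep(c("x", "y"), c(30, 40)))

t.test(x, y)            # Welch t-test: var.equal = FALSE is the default
# install.packages("WRS2")
library(WRS2)
yuen(values ~ group, data = dat, tr = 0.2)   # 20% trimmed means, heteroscedastic
```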

Now let’s look at power and type I error rates for a few methods aimed at comparing variances. There are so many methods available it is hard to decide on a few candidate approaches. For instance, Conover et al. (1981) compared 56 tests of variance homogeneity. And many more tests have been proposed since then. Conover et al. (1981) recommended several tests: the Brown-Forsythe test, which is the same as Levene’s test, but using the median instead of the mean, and the Fligner-Killeen test. So we’ll use these three tests here. Zimmerman (2004) used Levene’s test. The three tests belong to a family of tests in which absolute or squared distances between observations and a measure of central tendency are compared using parametric (t-tests, ANOVAs) or rank-based methods (Conover et al. 1981; 2018). Levene’s test uses the mean to centre the distributions, whereas the Brown-Forsythe and Fligner-Killeen tests use the median. In addition, we’ll look at Bartlett’s test, which is known to be very sensitive to departures from normality, and a percentile bootstrap method to compare variances (Wilcox 2002). As we will see, all these tests perform poorly in the presence of skewness, for reasons explained in Conover et al. (2018), who go on to suggest better ones.
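 
Before the simulations, here is a minimal sketch of how these tests can be run in R on two hypothetical samples; the percentile bootstrap is only roughly sketched (Wilcox, 2002, and his Rallfun code provide the reference implementation).

```r
set.seed(7)
x <- rnorm(50, sd = 1)
y <- rnorm(50, sd = 1.5)
values <- c(x, y)
groups <- factor(rep(c("x", "y"), each = 50))

bartlett.test(values, groups)                       # Bartlett
fligner.test(values, groups)                        # Fligner-Killeen
# install.packages("car")
car::leveneTest(values, groups, center = mean)      # Levene (mean-centred)
car::leveneTest(values, groups, center = median)    # Brown-Forsythe (median-centred)

# rough percentile bootstrap on the difference of variances
nboot <- 1000
boot.diff <- replicate(nboot, var(sample(x, replace = TRUE)) - var(sample(y, replace = TRUE)))
quantile(boot.diff, c(0.025, 0.975))  # CI excluding 0 -> conclude the variances differ
```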

The code to reproduce the simulations and the figures is available on GitHub. The simulations involved 10,000 iterations, with 1,000 samples for the bootstrap method.

Simulation 1: normality

Let’s first look at what happens under normality, an unlikely situation in many fields, but a useful benchmark. We consider only 2 groups. One population has a standard deviation of 1; the other population has a standard deviation that varies from 1 to 2 in steps of 0.1. Sample size varies from 10 to 100, in steps of 10. Here are the populations we sample from:

The results for the 5 methods are summarised in the next figure:

The results for SD=1 correspond to the type I error rate (false positives), which should be around 5%, because that is the arbitrary threshold I chose here. That is the case for Bartlett, but notice how Levene overshoots, and how BF and FK undershoot at the lowest sample sizes. The percentile bootstrap method is systematically under 0.05 at all sample sizes. This is a reminder that your alpha is not necessarily what you think it is. The differences among methods for SD=1 (false positives) are easier to see in this figure:

As the standard deviation in the second population increases, power increases for all methods, as expected. Notice the large sample sizes required to detect variance differences. How do the methods compare for the largest population difference? Bartlett performed best, FK came last:

Simulation 2: skewness

Now, what happens when the populations are skewed? Here we consider g-and-h distributions with parameters g=1 and h=0, which give the same shape as a lognormal distribution, but with a median of zero. The g-and-h distributions are nicely described and illustrated in Yan & Genton (2019). We vary the standard deviation from 1 to 2, leading to these populations:

This is not to suggest that such distributions will often be encountered in applied work. Rather, methods that can maintain false positive rates near nominal level and high power when dealing with these distributions should be able to handle a large variety of situations and can therefore be recommended. Also, fun fact, the distributions above have the same median (0) and skewness, but differ in mean and variance.

Here is what we get when we sample from skewed populations:

Compared to results obtained under normality, the maximum power is lower (more illustrations of that later), but look at what happens to the false positives (SD=1): they skyrocket for Bartlett and Levene, and increase less dramatically for FK. The BF test handles skewness very well, whereas the percentile bootstrap is a bit liberal. Let’s compare the false positives of the different methods in one figure:

The same for maximum power (SD=2):

Obviously, the higher power of Bartlett and FK cannot be trusted given their huge false positive rates.

Simulation 3: false positives as a function of skewness

Here we sample from g-and-h distributions that vary from g=0 to g=1. So we explore the space between the two previous simulations parametrically. First, we consider false positives (no difference in variance). Here are the populations we sample from:

And the results for the 5 methods:

Comparison of the 5 methods for g=1:

BF and the percentile bootstrap tests are clearly more robust to skewness than the other methods. Bartlett is useless, but why is this test available in stat packages? In R, bartlett.test doesn’t even come with a warning.

Simulation 4: true positives as a function of skewness

We proceed as in simulation 3, but now the populations always differ by one standard deviation:

All the methods are strongly affected by skewness, with power dropping more for methods that are better at preserving the type I error rate at the nominal level (BF and percentile bootstrap):

A reminder that if you rely on power calculators that assume normality, your power is probably lower than you think.

Conclusion

None of the methods considered here were satisfactory. Only the Brown-Forsythe test and Wilcox’s bootstrap method controlled the type I error rate under non-normality, but their power was strongly affected by skewness. Conover et al. (2018) have proposed alternative methods to maintain high power in the presence of skewness. They recommend methods in which distributions are centred using the global mean or median (across groups), a simple step that improves performance considerably over the subtraction of the mean or median separately in each group (as used here and by default in R). See their discussion for an explanation of the lack of robustness to skewness of standard tests when individual group means or medians are used instead of the global ones. Conover et al. (2018) also considered the lognormal distribution, which corresponds to the g-and-h distribution with g=1 studied here, so their proposed methods should perform much better than the ones we considered. And there are plenty more tests on the market. For instance, Patil & Kulkarni (2022) have proposed a new method that promises high power in a range of situations. Please post a comment if you know of R packages that implement modern robust methods.

Finally, comparing variances is a very specific question. More broadly, one might be interested in differences in spread between distributions. For this more general question, other tools are available, relying on robust measures of scale (Wilcox, 2017, chapter 5) and quantile approaches (Rousselet et al. 2017). The distinction between variance and spread is important, because differences in variance could be driven by or masked by outliers or skewness, which might not affect a robust estimator of scale.

References

Conover, W.J., Johnson, M.E., & Johnson, M.M. (1981) A Comparative Study of Tests for Homogeneity of Variances, with Applications to the Outer Continental Shelf Bidding Data. Technometrics, 23, 351–361.

Conover, W.J., Guerrero-Serrano, A.J., & Tercero-Gómez, V.G. (2018) An update on ‘a comparative study of tests for homogeneity of variance.’ Journal of Statistical Computation and Simulation, 88, 1454–1469.

Delacre, M., Lakens, D., & Leys, C. (2017) Why Psychologists Should by Default Use Welch’s t-test Instead of Student’s t-test. International Review of Social Psychology, 30(1).

Li, X., Qiu, W., Morrow, J., DeMeo, D.L., Weiss, S.T., Fu, Y., & Wang, X. (2015) A Comparative Study of Tests for Homogeneity of Variances with Application to DNA Methylation Data. PLoS One, 10, e0145295.

Patil, K.P. & Kulkarni, H.V. (2022) An uniformly superior exact multi-sample test procedure for homogeneity of variances under location-scale family of distributions. Journal of Statistical Computation and Simulation, 92, 3931–3957.

Ritchie, S.J., Cox, S.R., Shen, X., Lombardo, M.V., Reus, L.M., Alloza, C., Harris, M.A., Alderson, H.L., Hunter, S., Neilson, E., Liewald, D.C.M., Auyeung, B., Whalley, H.C., Lawrie, S.M., Gale, C.R., Bastin, M.E., McIntosh, A.M., & Deary, I.J. (2018) Sex Differences in the Adult Human Brain: Evidence from 5216 UK Biobank Participants. Cereb Cortex, 28, 2959–2975.

Rousselet, G.A., Pernet, C.R., & Wilcox, R.R. (2017) Beyond differences in means: robust graphical methods to compare two groups in neuroscience. European Journal of Neuroscience, 46, 1738–1748.

Wilcox, R.R. (2002) Comparing the variances of two independent groups. Br J Math Stat Psychol, 55, 169–175.

Wilcox, R.R. (2017) Introduction to Robust Estimation and Hypothesis Testing, 4th edition. edn. Academic Press.

Wilcox, R.R. & Rousselet, G.A. (2023) An Updated Guide to Robust Statistical Methods in Neuroscience. Current Protocols, 3, e719.

Yan, Y. & Genton, M.G. (2019) The Tukey g-and-h distribution. Significance, 16, 12–13. https://doi.org/10.1111/j.1740-9713.2019.01273.x

Zimmerman, D.W. (2004) A note on preliminary tests of equality of variances. Br J Math Stat Psychol, 57, 173–181.

Why normality tests are great…

…as a teaching example and should be avoided in research.

These statements are common in the psychology and neuroscience literature:

“In order to assess the normal distribution of the population in terms of age, BV% and CSF%, the Lilliefors-corrected Kolmogorov–Smirnov test was performed” (Porcu et al. 2019)

“The Kolmogorov–Smirnov-Test revealed a normal distribution (p = 0.82).” (Knolle et al. 2019)

“The distribution was not normal (P < 0.01 with the Shapiro–Wilk test).” (Beaudu-Lange et al. 2001)

“Assumptions of the one-way anova for normality were also confirmed with the Shapiro–Wilk test.” (Holloway et al. 2015)

“The Shapiro-Wilk-W-test (P < 0.05) revealed that all distributions could be assumed to be Gaussian as a prerequisite for the application of a t-test.” (Dicke et al. 2008)

“Given the non-normal distribution of such data (Shapiro–Wilk’s p < .05), we applied a nonparametric one-sample t test (the one-sample Wilcoxon signed rank test).” (Zapparoli et al. 2019)

A common recipe goes like this:

  • apply a normality test;
  • if p>0.05, conclude that the data are normally distributed and proceed with a parametric test;
  • if p<0.05, conclude that the data are not normally distributed and proceed with a non-parametric test (or transform the data to try to achieve normality).

It is a useful exercise or class activity to consider the statements above with the goal of identifying all the underlying issues. It could take several hours of teaching to do justice to the rich topics we need to cover to properly understand these issues.

Here is a succinct and non-exhaustive list of issues, with references for follow-up readings:

[1] In the general context of linear regression, the normality assumption applies to the residuals, not the marginal distributions. The main solution involves graphical checks of the residuals (Ernst & Albers, 2017; Vanhove, 2018).
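
As a minimal sketch with a made-up dataset (names are illustrative): fit the model, then inspect the residuals graphically.

```r
set.seed(3)
dat <- data.frame(group = rep(c("A", "B"), each = 30),
                  y = c(rnorm(30, 10, 2), rnorm(30, 12, 2)))
mod <- lm(y ~ group, data = dat)       # the model whose assumptions we care about

par(mfrow = c(1, 2))
hist(resid(mod), main = "Residuals", xlab = "Residuals")
qqnorm(resid(mod)); qqline(resid(mod)) # quantile-quantile plot of the residuals
```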

Resources for graphical checks:

Visualization of Regression Models Using visreg

Visualizing regression model predictions

Extracting and visualizing tidy residuals from Bayesian models

Other solutions involve model comparison, to contrast models making different assumptions, and using models robust to assumption violations (Bürkner, 2017; Kruschke, 2013; Wilcox & Rousselet, 2018).

[2] The p value from standard frequentist tests, such as normality tests, cannot be used to accept the null (Rouder et al., 2016; Kruschke, 2018). Because the p value is computed assuming that the null is true, it cannot in turn be used to support the null — that would be circular. To find support for the null, we need an alternative hypothesis (to compute a Bayes factor; Rouder et al., 2016; Wagenmakers et al., 2020) or a Region of Practical Equivalence (ROPE, to compute an equivalence test; Freedman et al., 1984; Kruschke, 2018; Lakens, 2017; Campbell & Gustafson, 2022). Setting an alternative hypothesis is also crucial to get a consistent test (Rouder et al., 2016; Kruschke & Liddell, 2018). Tests of normality, like all Point Null Hypothesis Significance Tests (PNHST), are inconsistent: given alpha = 0.05, even if normality holds, 5% of tests will be positive no matter how large the sample size.

[3] Failure to reject (p>0.05) doesn’t mean data were sampled from a normal distribution. Another function could fit the data equally well (for instance a shifted lognormal distribution). This point follows directly from [2]. Since our alternative hypothesis is extremely vague, the possibility of another distribution being a plausible data generation process is ignored: the typical test considers only a point null hypothesis versus “anything else”. So when we ask a very vague question, we can only get a very vague answer (there is no free lunch in inference – Rouder et al., 2016).

[4] Failure to reject (p>0.05) could be due to low power. This is well known but usually ignored. Here are the results of simulations to illustrate this point. The code is available on GitHub. We sample from g-and-h distributions (Yan & Genton, 2019), which let us vary asymmetry (parameter g) and tail-thickness (parameter h, which also affects how peaky the distribution is). We start by varying g, keeping a constant h=0.
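
Here is a minimal sketch of such a simulation, with a coarser grid and fewer iterations than the GitHub code; the g-and-h sampler uses the standard transformation of a normal sample.

```r
ghdist <- function(n, g = 0, h = 0){  # sample from a g-and-h distribution
  z <- rnorm(n)
  if (g > 0) (exp(g * z) - 1) / g * exp(h * z^2 / 2) else z * exp(h * z^2 / 2)
}

gseq <- seq(0, 1, 0.2)     # skewness parameter
nseq <- c(20, 50, 100)     # sample sizes
nsim <- 2000               # iterations, reduced for speed
res <- array(NA, dim = c(length(gseq), length(nseq)), dimnames = list(g = gseq, n = nseq))
set.seed(21)
for (G in seq_along(gseq)){
  for (N in seq_along(nseq)){
    pvals <- replicate(nsim, shapiro.test(ghdist(nseq[N], g = gseq[G]))$p.value)
    res[G, N] <- mean(pvals < 0.05)   # proportion of positive tests
  }
}
round(res, 2)  # the row g = 0 gives the false positive rate
```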

g-and-h populations used in the simulation in which we vary parameter g

Here are results for the Shapiro-Wilk test, based on a simulation with 10,000 iterations.

The Shapiro-Wilk test has low power unless the departure from normality is pronounced or sample sizes are large. With small departures from normality (say g=0.1 or g=0.2), achieving high power won’t be possible with typical sample sizes in psychology and neuroscience. For g=0 (normal distribution), the proportion of positive tests is at the expected 5% level, the false positive rate.

The Kolmogorov-Smirnov test is dramatically less powerful than the Shapiro-Wilk test (Yap & Sim, 2011).

What happens if we sample from symmetric distributions that are more prone to outliers than the normal distribution? By varying the h parameter, keeping a constant g=0, we can consider distributions that are progressively more kurtotic than the normal distribution.

g-and-h populations used in the simulation in which we vary parameter h

Are the tests considered previously able to detect such deviations from normality? Here is how the Shapiro-Wilk test behaves.

And here are the catastrophic results for the Kolmogorov-Smirnov test.

[5] As the sample size increases, progressively smaller deviations from normality can be detected, eventually reaching absurd levels of precision, such that tiny differences of no practical relevance will be flagged. This point applies to all PNHST and again follows from [2]: because no alternative is considered in PNHST, tests are biased against the null (Rouder et al., 2016; Wagenmakers et al., 2020). Even when p<0.05, contrasting two hypotheses could reveal that a normal distribution and a non-normal distribution are equally plausible given our data. Also, because PNHST is not consistent, even when the null is true, 5% of tests will be positive.

[6] Choosing a model conditional on the outcome of a preliminary check affects sampling distributions and thus p values and confidence intervals. The same problem arises when doing balance tests. If a t-test is conditional on a normality test, the p value of the t-test will be different (but unknown) from the one obtained if the t-test is performed without a preliminary check. That’s because p values depend on sampling distributions of imaginary experiments, which in turn depend on sampling and testing intentions (Wagenmakers, 2007; Kruschke & Liddell, 2018). This dependence can make p values difficult to interpret, because unless we simulate the world of possibilities that led to our p value, the sampling distribution of our statistic (say, a t statistic) is unknown.

[7] When non-normality is detected or suspected, a classic alternative to the two-sample t-test is the Wilcoxon-Mann-Whitney test. However, in general, different tests or models address different hypotheses — they are not interchangeable. For instance, the WMW U statistic is related to the distribution of all pairwise differences between two independent groups; unlike the t-test, it does not involve a comparison of the marginal means. Similarly, if instead of the mean we use a trimmed mean, a robust measure of central tendency, our inferences are about the population trimmed mean, not the population mean.

[8] In most cases, researchers know the answer to the normality question before conducting the experiment. For instance, we know that reaction times, accuracy and questionnaire data are not normally distributed. Testing for normality when we already know the answer is unnecessary and falls into the category of tautological tests. Since we know the answer in most situations, it is better practice to use appropriate models and drop the checks altogether. For instance, accuracy data follow beta-binomial distributions (Jaeger, 2008; Kruschke, 2014); questionnaire data can be modelled using ordinal regression (Liddell & Kruschke, 2018; Bürkner & Vuorre, 2019; Taylor et al., 2022); reaction time data can be modelled using several families of skewed distributions (Lindeløv, 2019).
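
As a minimal sketch of that strategy with simulated data (a binomial GLM is shown as a starting point for accuracy; a beta-binomial model, for instance via brms or glmmTMB, would add overdispersion):

```r
set.seed(9)
# accuracy: number of correct responses out of 40 trials, per participant and condition
acc <- data.frame(condition = rep(c("easy", "hard"), each = 20),
                  ncorrect  = c(rbinom(20, 40, 0.9), rbinom(20, 40, 0.75)),
                  ntrials   = 40)
glm(cbind(ncorrect, ntrials - ncorrect) ~ condition, family = binomial, data = acc)

# ratings on a 1-7 scale: cumulative logit (ordinal) regression
ratings <- data.frame(condition = rep(c("A", "B"), each = 50),
                      rating = factor(sample(1:7, 100, replace = TRUE), ordered = TRUE))
MASS::polr(rating ~ condition, data = ratings)
```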

References

Bürkner, P.-C. (2017). brms: An R Package for Bayesian Multilevel Models Using Stan. Journal of Statistical Software, 80(1), 1–28. https://doi.org/10.18637/jss.v080.i01

Bürkner, P.-C. & Vuorre, M. (2019) Ordinal Regression Models in Psychology: A Tutorial. Advances in Methods and Practices in Psychological Science, 2, 77–101. https://journals.sagepub.com/doi/full/10.1177/2515245918823199

Campbell, H. & Gustafson, P. (2021) re:Linde et al. (2021): The Bayes factor, HDI-ROPE and frequentist equivalence tests can all be reverse engineered – almost exactly – from one another. https://arxiv.org/abs/2104.07834

Ernst AF, Albers CJ. 2017. Regression assumptions in clinical psychology research practice—a systematic review of common misconceptions. PeerJ 5:e3323 https://doi.org/10.7717/peerj.3323

Freedman, L.S., Lowe, D., & Macaskill, P. (1984) Stopping rules for clinical trials incorporating clinical opinion. Biometrics, 40, 575–586.

Jaeger, T.F. (2008) Categorical Data Analysis: Away from ANOVAs (transformation or not) and towards Logit Mixed Models. J Mem Lang, 59, 434–446.

Kruschke, J.K. (2013) Bayesian estimation supersedes the t test. J Exp Psychol Gen, 142, 573–603.

Kruschke, J.K. (2014) Doing Bayesian Data Analysis, 2nd Edition. edn. Academic Press.

Kruschke, J.K. (2018) Rejecting or Accepting Parameter Values in Bayesian Estimation. Advances in Methods and Practices in Psychological Science, 1, 270–280.

Kruschke, J.K. & Liddell, T.M. (2018) The Bayesian New Statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychon Bull Rev, 25, 178–206.

Lakens, D. (2017). Equivalence Tests: A Practical Primer for t Tests, Correlations, and Meta-Analyses. Social Psychological and Personality Science, 8(4), 355–362. https://doi.org/10.1177/1948550617697177

Liddell, T.M. & Kruschke, J.K. (2018) Analyzing ordinal data with metric models: What could possibly go wrong? Journal of Experimental Social Psychology, 79, 328–348.

Lindeløv, J.K. (2019) Reaction time distributions: an interactive overview
https://lindeloev.github.io/shiny-rt/

Rouder, J.N., Morey, R.D., Verhagen, J., Province, J.M. and Wagenmakers, E.-J. (2016), Is There a Free Lunch in Inference?. Top Cogn Sci, 8: 520-547. https://doi.org/10.1111/tops.12214

Taylor, J.E., Rousselet, G.A., Scheepers, C. et al. Rating norms should be calculated from cumulative link mixed effects models. Behav Res (2022). https://doi.org/10.3758/s13428-022-01814-7

Vanhove (2018) Checking model assumptions without getting paranoid. https://janhove.github.io/analysis/2018/04/25/graphical-model-checking

Wagenmakers, E.-J. (2007) A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14, 779–804.

Wagenmakers, E.-J., Lee, M.D., Rouder, J.N., & Morey, R.D. (2020) The Principle of Predictive Irrelevance or Why Intervals Should Not be Used for Model Comparison Featuring a Point Null Hypothesis. In Gruber, C.W. (ed), The Theory of Statistics in Psychology: Applications, Use, and Misunderstandings. Springer International Publishing, Cham, pp. 111–129.

Wilcox RR, Rousselet GA. A Guide to Robust Statistical Methods in Neuroscience. Curr Protoc Neurosci. 2018 Jan 22;82:8.42.1-8.42.30. doi: 10.1002/cpns.41. PMID: 29357109.

Yan, Y. & Genton, M.G. (2019) The Tukey g-and-h distribution. Significance, 16, 12–13. https://doi.org/10.1111/j.1740-9713.2019.01273.x

Yap, B.W. & Sim, C.H. (2011) Comparisons of various types of normality tests. Journal of Statistical Computation and Simulation, 81, 2141–2155. DOI: 10.1080/00949655.2010.520163

Mean or median reaction time? An open review of Miller (2020)

Below is a review of Miller (2020), which is a reply to an article in which Rand Wilcox & I reproduced and built upon Miller (1988). Here are the articles in order:

[1] A Warning About Median Reaction Time

[2] Reaction times and other skewed distributions: problems with the mean and the median

[3] Another Warning about Median Reaction Time

I’ve added a link to [3] and this review to article [2], so readers are aware of the discussions.

The review will only make sense if you at least read [3] first, but [2] contains a lot of simulations, descriptions and references not covered in [3].

Review

To start, I’ve got to say I’ve learnt so much about various sources of bias from your work on reaction time analyses, including Miller (1988) and many subsequent papers. I discovered Miller (1988) by chance a few years ago, while researching a review article on robust measures of effect sizes. Actually, I was so startled by the 1988 results that I dropped the article on effect sizes to replicate Miller’s simulations and explore their consequences. This extensive work has taught me a lot about reaction time data, the mean, the median, their sampling distributions and associated inferential statistics. So I’m really thankful for that.

Overall, I enjoyed your reply to our paper, which provides interesting new simulations and a good summary of the issues. The main apparent discrepancies are much smaller than they seem and can easily be addressed by wording key statements and conclusions more carefully, for instance by highlighting boundary conditions. If anything I wrote below is unclear, feel free to contact me directly. In particular, if needed, we could discuss the g&h simulation code, which I realise I probably could have explained better in the R&W2020 article.

Your article is well written. The illustrations are fine but could be improved by adding colours or grey/black contrasts. In Figure 4 (presenting the most original and interesting results), the symbols could be different for the 3 estimators to improve clarity.

The simulation results are convincing and mostly concur with our own assessment. However, what is missing is a consideration of the effects of outliers and the skewness of the distribution of differences (after pooling across trials using the mean, the median or some other estimator). As we explained in R&W2020, outliers and skewness can essentially destroy the power of inferential tests using the mean, whereas tests on the median are hardly affected.

“Another unusual feature of R&W’s simulations is that they used g&h distributions (Hoaglin, 1985) as models of RT. These distributions are quite different from ex-Gaussians and are not normally considered in RT modelling (e.g., Luce, 1986). This distributional choice may also have contributed to the advantage for medians in their simulations.”

I don’t understand how our simulations using g&h distributions could be the source of the discrepancy, because we didn’t use them to simulate distributions of reaction times. In fact, we used them to systematically manipulate the distribution of differences that is fed to a one-sample test, from normal to very skewed. We also varied the probability that outliers are observed. The shape of the marginal RT distributions should be irrelevant to these simulations: when the RT distributions are identical in every condition and participant, irrespective of their skewness, the distribution of differences is normal (that is, for each participant, compute the mean for each condition then subtract the means, leading to a distribution of pairwise differences). When the distributions differ in skewness, the distribution of differences has skewness equal to the difference in skewness between the original distributions. Thus, our simulations are only concerned with the group level analyses, and only with the shapes of the distributions of differences — whether these distributions resulted from individual means or medians was not considered. We also used different one-sample tests for the different estimators of central tendency of the distributions of differences. This is necessary because the mean, the median and trimmed means have different standard errors. So, our assessments of the median are not comparable between simulations.

To be clear: as I understand it, in your simulations, different RT distributions from 2 conditions are summarised using the mean or the median, one value for each condition and each participant (stage 1); pairwise differences are computed, resulting in distributions of differences (stage 2); one-sample tests on the mean of the differences are performed (stage 3).

In R&W2020’s g&h simulations, we ignored stage 1. We only considered the shapes of the distributions of differences, and then computed one-sample tests for 3 different estimators of central tendency. The skewness of these distributions depends on both within- and between-participant variability, but we did not model these sources of variability explicitly, only their end-product. Our simulations demonstrate that, given the same distribution of differences, in some situations the median and the 20% trimmed mean have dramatically more power than the mean.

It might well be that in some (many?) situations, RT distributions do not differ enough in skewness to affect the power of the one-sample t-test. But it remains a fundamental statistical result that one-sample t-tests are strongly affected by skewness, and even much more so by outliers. This is also covered in detail in Wilcox & Rousselet (2018).

To address this discrepancy in simulation results, I don’t think new simulations necessarily need to be added, but the problem should be presented more carefully.

Other apparent disagreements can be addressed by more careful phrasing of the situations under which the problems occur, especially at the start of the conclusion section, which I find somewhat misleading.

“R&W concluded that “there seems to be no rationale for preferring the mean over the median as a measure of central tendency for skewed distributions” (p. 37). On the contrary, when performing hypothesis tests to compare the central tendencies of RTs between experimental conditions, the present simulations show that there may be an extremely clear rationale in terms of both Type I error rate and power.”

As we explain in our paper, it is precisely because the mean is a poor measure of central tendency that in some situations it is better at detecting distribution differences (particularly when conditions differ in skewness, and more specifically in their right tails when dealing with RT distributions). But higher power or nominal type I error rate doesn’t make the mean a better measure of central tendency.

What is needed in this section is a clear distinction between 2 different but complementary goals:
[1] to detect differences between distributions;
[2] to understand how distributions differ.

As we argue in our paper, if the goal is [2], then it makes no sense to rely on the mean or the median alone; much better tools are available, starting with reporting both the mean and the median to get a richer perspective. The distinction is clearly made in footnote 1 of your article, but should be reiterated in the conclusion and the abstract, so there is no confusion.

“When comparing conditions with unequal numbers of trials, the sample-size-dependent bias of regular medians can lead to clear inflation of the Type I error rate (Fig. 2), so these medians definitely should not be used.”

This statement is only valid when the condition with the fewest trials has a low number of trials. So, to be clear, the problem emerges only for a combination of absolute and relative numbers of trials. The problem also depends on the group-level statistics involving means. In contrast, performing, for instance, a median one-sample test on the group differences between medians does not lead to inflated false positives. I realise most users who collapse RT distributions are most likely to perform group statistics using the mean, but this assumption needs to be stated explicitly. The choice of estimator applies to both levels of analysis: within-participants and between-participants.

For this reason, it would be worth describing in the text which inferential tests were used for the analyses (I presume t-tests). This would also bring some nuance to statements such as the following, given that power depends on choices at 2 levels of analysis:

“The choice of central tendency measure would then be determined primarily by comparing the power of these three measures (i.e., means, medians, bias-corrected medians).”

In R&W2020 we also considered tests on medians, which solve the bias issue. We also looked at trimmed means, and other options are available such as the family of M-estimators.

“In view of the fact that means have demonstrably greater power than bias-corrected medians for experiments with unequal trial frequencies (Fig. 3), it is also sensible to compare power levels in experiments with equal trial frequencies.”

This statement is inaccurate because it is too general: as we demonstrate in Figure 15 of R&W2020, at least in some situations, the median can have higher power than the mean. Maybe more importantly, in the presence of skewness and outliers, methods using quantiles (with the median as a special case) or trimmed means can have dramatically more power than mean-based methods.

To go back to the distinction between goals [1] and [2], as we argue in R&W2020, ideally we should focus on richer descriptions of distributions using multiple quantiles to understand how they differ (a point you also make in other articles). Limiting analyses to the mean or the median is really unsatisfactory. Also, standard t-tests and ANOVAs do not account for measurement error, including item variability, which should really be handled using hierarchical models. Whether going the quantile way or the hierarchical modelling way (or both), the number of trials required to make sense of the data would make bias issues naturally go away. So personally I see using the mean RT as a valid recommendation, but only in very specific situations.

Your last point in the discussion about choosing a measure of central tendency before seeing the data is an important one. Perhaps more importantly, given the large number of options available to make sense of rich reaction time data, the prime recommendation should probably be that authors make their data publicly available, so that others can try alternative, and in most cases better, techniques.

Other points:

“There is unfortunately no consensus about the psychological meanings of changes in these different parameters (e.g., Rieger & Miller, 2019), but the ex-Gaussian distribution nevertheless remains useful as a way of describing changes in the shapes as well as the means of RT distributions.”

This is an excellent point. It would also be worth citing Matzke & Wagenmakers (2009) in addition to Rieger & Miller (2019).

I would also mention that other distributions provide better interpretability in terms of shift and scale effects (shifted lognormal) — see Lindelov (2019).

Typos:

“the very between-condition difference” — remove very?

Analog teaching activities about sampling and resampling

This year I’m teaching a new undergraduate course on the bootstrap for 4th year psychology students. Class examples, take-home exercises and the exam use R. I will also use a few analog activities in class. Here I’d like to share some of these activities. (This is also the opportunity to start a new category of posts on teaching.) The course is short, with only 5 sessions of 2 hours, but I think it is important to spend some of that time trying to get key concepts across through engaging activities. I’ll report back on how it goes.

The 3 main activities involve dice, poker chips and wooden fish, to explore different types of sampling, sampling distributions, the distinction between sample and population, resampling…

Activity 1: dice (hierarchical sampling)

We use dice to simulate sampling with replacement from an infinite population of participants and trials.

This exercise provides an opportunity to learn about:

  • the distinction between population and sample;
  • sampling with replacement;
  • hierarchical sampling;
  • running simulations;
  • estimation;
  • the distinction between finite and infinite populations.

Material:

  • 3 bags of dice
  • 3 trays (optional)

Each bag contains a selection of dice with 4 to 20 facets, forming 3 independent populations. I used a lot of dice in each bag but that’s not necessary; it just makes it harder to guess the content of the bags. I got the dice from TheDiceShopOnline.

Many exercises can be proposed, involving different sampling strategies, with the aim of making some sort of inference about the populations. Here is the setup we will use in class:

  • 3 participants or groups of participants are involved, each working independently with their own bag/population;
  • a dice is randomly picked from a bag (without looking inside the bag!) — this is similar to randomly sampling a participant from the population;
  • the dice is rolled 5 times, and the results written down — this is similar to randomly sampling trials from the participant;
  • perform the two previous steps 10 times, for a total of 10 participants x 5 trials = 50 trials.

These values are then entered into a text file and shared with the rest of the class. The text files are opened in R, and the main question is: is there evidence that our 3 samples of 10 participants x 5 trials were drawn from different populations? To simplify the problem, a first step could involve averaging over trials, so we are left with 10 values from each group. The second step is to produce some graphical representation of the data. Then we can try various inferential statistics.
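
To give a flavour of the R part, here is a minimal sketch of a digital version of the dice activity; the bag content below is made up, since the real bags are only revealed during the last class:

# Digital version of the dice activity (hypothetical bag content)
bag <- c(4, 6, 8, 10, 12, 20)   # dice types in one bag = number of facets
n.participants <- 10
n.trials <- 5
group.means <- replicate(n.participants, {
  facets <- sample(bag, 1)                                    # pick a die = sample a participant
  rolls <- sample(seq_len(facets), n.trials, replace = TRUE)  # roll it 5 times = sample trials
  mean(rolls)                                                 # average over trials
})
group.means   # 10 values for this group; repeat for the 3 bags and compare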

The exercise can be repeated on different days, to see how much variability we get between simulated experiments. During the last class, the populations and the sampling distributions are revealed.

Also, in this exercise, because the dice are sampled with replacement, the population has an infinite size. The content of each bag defines the probability of sampling each type of dice, but it is not the entire population, unlike in the faux fish activity (see below).

Here is an example of samples after averaging the 5 trials per dice/participant:

Activity 2: poker chips (bootstrap sampling with replacement)

We use poker chips to demonstrate sampling with replacement, as done in the bootstrap.

This exercise provides an opportunity to learn about:

  • sampling with replacement;
  • bootstrap sampling;
  • running simulations.

A bag contains 8 poker chips, representing the outcome of an experiment. Each chip is like an observation.

First, we demonstrate sampling with replacement by getting a random chip from the bag, writing down its value, and replacing the chip in the bag. Second, we demonstrate bootstrap sampling by performing sampling with replacement 8 times, each time writing down the value of the random chip selected from the bag. This should help make bootstrap sampling intuitive.

After this analog exercise, we switch to R to demonstrate the implementation of sampling with replacement using the sample function.
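
For instance, something along these lines, with made-up chip values:

chips <- c(1, 1, 2, 2, 5, 5, 10, 10)   # hypothetical values of the 8 chips
sample(chips, 1)                       # sampling with replacement: one chip, then put it back
boot.sample <- sample(chips, length(chips), replace = TRUE)   # one bootstrap sample of 8 chips
boot.sample
mean(boot.sample)                      # bootstrap estimate for that sample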

Activity 3: faux fish (sampling distributions)

We sample with replacement from a finite population of faux fish to demonstrate the effect of sample size on the shape of sampling distributions.

The faux fish activity is mentioned in Steel, Liermann & Guttorp (2019), with pictures of class results. The activity is described in detail in Kelsey & Steel (2001).

This exercise provides an opportunity to learn about:

  • the distinction between population and sample;
  • sampling with replacement;
  • running simulations;
  • estimation;
  • sampling distributions.

Material:

  • two sets of 97 faux fish = fish-shaped bits of paper or other material
  • two containers = ponds
  • two large blank sheets of paper
  • x axis = ‘Mean weight (g)’
  • y axis = ‘Number of experiments’
  • titles = ‘n=3 replicates’ / ‘n=10 replicates’

I got faux fish made of wood from Wood Craft Shapes.

Each faux fish has a weight in grams written on it.

The frequencies of the weights are given in Kelsey & Steel (2001).

The fish population is stored in a box. I made 2 identical populations, so that two groups can work in parallel.

The first goal of the exercise is to produce sampling distributions by sampling with replacement from a population. The second goal is to evaluate the effect of the sample size on the shape of the sampling distribution. The third goal is to experiment with a digital version of the analog task, to gain familiarity with simulations.

Unlike the dice activity, this activity involves a finite size population: each box contains the full population under study.

Setup:

  • two groups of participants;
  • each group is assigned a box;
  • participants from each group take turns sampling n=3 or n=10 faux fish from the box (depending on the group), without looking inside the box;
  • each participant averages the numbers, writes down the answer and marks it on the large sheet of paper assigned to each group;
  • this is repeated until a sufficient number of simulated experiments have been carried out to assess the shape of the resulting sampling distribution.

To speed up the exercise, a participant picks n fish, writes down the weights, puts all the fish back in the box, and passes the box to the next participant. While the next participant is sampling fish from the box, the previous participant computes the mean and marks the result on the group graph.

Once done, the class discusses the results:

  • the sampling distributions are compared;
  • the population mean is revealed;
  • the population is revealed by showing the handout from the book and opening the boxes.

Then we do the same in R, but much quicker!
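
Here is a rough sketch of that digital version; the weights below are made up, as the real frequencies come from the handout in Kelsey & Steel (2001):

# Hypothetical population of 97 faux fish weights, in grams
population <- sample(seq(20, 120, 10), 97, replace = TRUE)
n.exp <- 1000   # number of simulated experiments
# each simulated participant grabs 3 (or 10) fish, averages them, and puts them back
means.n3  <- replicate(n.exp, mean(sample(population, 3)))
means.n10 <- replicate(n.exp, mean(sample(population, 10)))
hist(means.n3, breaks = 20, main = "n=3 replicates", xlab = "Mean weight (g)")
abline(v = mean(population), lwd = 2)   # population mean, revealed at the end
hist(means.n10, breaks = 20, main = "n=10 replicates", xlab = "Mean weight (g)")
abline(v = mean(population), lwd = 2)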

Here is an example of simulated results for n=3 (the vertical line marks the population mean):

References

Kelsey, Kathryn, and Ashley Steel. The Truth about Science: A Curriculum for Developing Young Scientists. NSTA Press, 2001.

Steel, E. Ashley, Martin Liermann, and Peter Guttorp. Beyond Calculations: A Course in Statistical Thinking. The American Statistician 73, no. sup1 (29 March 2019): 392–401. https://doi.org/10.1080/00031305.2018.1505657.

The bootstrap-t technique

There are many types of bootstrap methods, but for most applications, two methods are most common: the percentile bootstrap, presented in an earlier post, and the bootstrap-t technique—also known as the percentile-t bootstrap or the studentized bootstrap (Efron & Tibshirani, 1994; Wilcox, 2017). For inferences on the population mean, the standard T-test and the percentile bootstrap can give unsatisfactory results when sampling from skewed distributions, especially when sample size is small. To illustrate the problem with the T-test, imagine that we sample from populations of increasing skewness.

Probability density functions for g&h distributions. Parameter g varies from 0 to 1. Parameter h=0.

Here we use g&h distributions, in which parameter g controls the skewness, and parameter h controls the thickness of the tails—a normal distribution is obtained by setting g=h=0 (Hoaglin, 1985; Yan & Genton, 2019). If we take many samples of size n=30 from these distributions, and for each sample we compute a T value, using the population mean as the null value, we obtain progressively more negatively skewed T value sampling distributions.
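
For readers who want to reproduce the gist of this demonstration, here is a minimal sketch in R; the ghdist helper is a bare-bones implementation of the g-and-h transformation of standard normal deviates (Hoaglin, 1985), not the exact code used for the figures:

# generate n observations from a g-and-h distribution
ghdist <- function(n, g = 0, h = 0){
  z <- rnorm(n)
  if(g == 0) x <- z else x <- (exp(g * z) - 1) / g
  x * exp(h * z^2 / 2)
}
# sampling distribution of T values for g = 1, h = 0 and samples of size n = 30
nsim <- 10000
n <- 30
pop.mean <- mean(ghdist(1e6, g = 1, h = 0))   # Monte Carlo approximation of the population mean
tvals <- replicate(nsim, {
  x <- ghdist(n, g = 1, h = 0)
  (mean(x) - pop.mean) / (sd(x) / sqrt(n))    # T value, with the population mean as the null value
})
hist(tvals, breaks = 50)   # negatively skewed, unlike the symmetric theoretical T distribution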

Sampling distributions of T values for different g values. Results are based on a simulation with 50,000 iterations and samples of size n=30.

However, when we perform a T-test, the sampling distribution of the T values is assumed to be symmetric, irrespective of sample size. This assumption leads to incorrect confidence intervals (CIs). The idea behind the bootstrap-t technique is to use the bootstrap (sampling with replacement) to compute a data-driven T distribution. In the presence of skewness, this T distribution could be skewed, as suggested by the data. Then, the appropriate quantile of the bootstrap T distribution is plugged into the standard CI equation to obtain a parametric bootstrap CI.

Bootstrap-t procedure

Let’s illustrate the procedure for a CI for the population mean. We start with a sample of 30 observations from a g&h distribution with g=1 and h=0.

Sample of size n=30 from a g&h distribution with g=1 and h=0. The vertical line indicates the sample mean.

In a first step, we centre the distribution: for inferences on the mean, we subtract the mean from each observation in the sample, so that the mean of the centred distribution is now zero. This is a way to create a data-driven null distribution, in which there is no effect (the mean is zero), but the shape of the distribution and the absolute distance among observations are unaffected, as shown in the next figure. For inferences on a trimmed mean, we subtract the trimmed mean from each observation, so that the centred distribution now has a trimmed mean of zero.

Same distribution as in the previous figure, but the distribution has been mean centred, so that the sample mean is now zero.

In the next step, we sample with replacement from the centred distribution many times, and for each random sample we compute a T value. That way, we obtain a bootstrap distribution of T values expected by random sampling, under the hypothesis that the population has a mean (or trimmed mean) of zero, given the distribution of the data. Then, we use some quantile of the bootstrap T distribution in the standard CI equation. (Note that for trimmed means, the T-test equation is adjusted—see Tukey & McLaughlin, 1963).

5,000 bootstrap T values obtained by sampling with replacement from the mean centred data. In the asymmetric bootstrap-t technique, the quantiles (red vertical lines) of that distribution of T values are used to define the CI bounds. The insets contain the formulas for the lower (CI_lo) and upper (CI_up) bounds of the CI. Note that the lower T quantile is used to compute the upper bound (this is not an error). In the symmetric bootstrap-t technique, one quantile of the distribution of absolute T values is used to define the CI bounds.

Because the bootstrap distribution is potentially asymmetric, we have two choices of quantiles: for a 95% CI, either we use the 0.025 and the 0.975 quantiles of the signed T values to obtain a potentially asymmetric CI, also called an equal-tailed CI, or we use the 0.95 quantile of the absolute T values, thus leading to a symmetric CI.
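
Here is a bare-bones sketch of the whole procedure for the mean (a simplified illustration, not the exact code behind the figures):

# bootstrap-t confidence interval for the mean: a minimal sketch
bootstrap.t.ci <- function(x, nboot = 5000, alpha = 0.05){
  n <- length(x)
  xbar <- mean(x)
  se <- sd(x) / sqrt(n)
  xc <- x - xbar                          # centre the data: null distribution with a mean of zero
  tboot <- replicate(nboot, {
    b <- sample(xc, n, replace = TRUE)    # sample with replacement from the centred data
    mean(b) / (sd(b) / sqrt(n))           # bootstrap T value under the null
  })
  # equal-tailed (asymmetric) CI: the lower T quantile defines the upper bound
  asym <- c(xbar - quantile(tboot, 1 - alpha/2) * se,
            xbar - quantile(tboot, alpha/2) * se)
  # symmetric CI: one quantile of the absolute T values
  sym <- xbar + c(-1, 1) * quantile(abs(tboot), 1 - alpha) * se
  list(asymmetric = unname(asym), symmetric = unname(sym))
}
# example: bootstrap.t.ci(rexp(30))   # rexp() gives a skewed sample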

In our example, for the mean the symmetric CI is [-0.4, 1.62] and the asymmetric CI is [0.08, 1.87]. If instead we use the 20% trimmed mean, the symmetric CI is [-0.36, 0.59] and the asymmetric CI is [-0.3, 0.67] (see Rousselet, Pernet & Wilcox, 2019). So clearly, confidence intervals can differ a lot depending on the estimator and method we use. In other words, a 20% trimmed mean is not a substitute for the mean: it asks a different question about the data.

Bootstrap samples

Why does the bootstrap-t approach work better than the standard T-test CI? Imagine we take multiple samples of size n=30 from a g&h distribution with g=1 and h=0.

Comparison of T distributions for g=1 & h=0: the theoretical T distribution in red is the one used in the T-test, the empirical T distribution in black was obtained by sampling with replacement multiple times from the g&h distribution. The red and black vertical lines indicate the T quantiles for a 95% CI. The grey lines show examples of 20 bootstrap sampling distributions, based on samples of size n=30 and 5,000 bootstrap samples.

In the figure above, the standard T-test assumes the sampling distribution in red, symmetric around zero. As we considered above, the sampling distribution is actually asymmetric, with negative skewness, as shown in black. However, the black empirical distribution is unobservable, unless we can perform thousands of experiments. So, with the bootstrap, we try to estimate this correct, yet unobservable, sampling distribution. The grey curves show examples of 20 simulated experiments: in each experiment, a sample of 30 observations is drawn, and then 5,000 bootstrap T values are computed. The resulting bootstrap sampling distributions are negatively skewed and are much closer to the empirical distribution in black than the theoretical symmetric distribution in red. Thus, it seems that using data-driven T distributions could help achieve better CIs than if we assumed symmetry.

How do these different methods perform? To find out we carry out simulations in which we draw samples from g&h distributions with the g parameter varying from 0 to 1, keeping h=0. For each sample, we compute a one-sample CI using the standard T-test, the two bootstrap-t methods just described (asymmetric and symmetric), and the percentile bootstrap. When estimating the population mean, for all four methods, coverage goes down with skewness.

Confidence interval coverage for the 4 methods applied to the mean. Results of a simulation with 20,000 iterations, sample sizes of n=30, and 599 bootstrap samples. You can see what happens for the 10% trimmed mean and the 20% trimmed mean in Rousselet, Pernet & Wilcox, 2019.

Among the parametric methods, the standard T-test is the most affected by skewness, with coverage less than 90% for the most skewed condition. The asymmetric bootstrap-t CI seems to perform the best. The percentile bootstrap performs the worst in all situations, and has coverage systematically below 95%, including for normal distributions.

In addition to coverage, it is useful to consider the width of the CIs from the different techniques.

Confidence interval median width, based on the same simulation reported in the previous figure.

The width of a CI is its upper bound minus its lower bound. For each combination of parameters, the results are summarised by the median width across simulations. At low levels of asymmetry, for which the three parametric methods have roughly 95% coverage, the CIs also tend to be of similar widths. As asymmetry increases, all methods tend to produce larger CIs, but the T-test produces CIs that are too short, a problem that stems from the symmetric theoretical T distribution, which assumes T values that are too small. Compared to the parametric approaches, the percentile bootstrap produces the shortest CIs for all g values.

Confidence intervals: a closer look

We now have a closer look at the confidence intervals in the different situations considered above. We use a simulation with 20,000 iterations, sample size n=30, and 599 bootstrap samples.

Under normality

As we saw above, under normality the coverage is close to nominal (95%) for every method, although coverage for the percentile bootstrap is slightly too low, at 93.5%. Out of 20,000 simulated experiments, about 1,000 CIs (roughly 5%) did not include the population value. About the same number of CIs were shifted to the left and to the right of the population value for all methods, and the CIs were of similar sizes:

We observed the same behaviour for several parametric methods in a previous post. Now, what happens when we sample from a skewed population?

In the presence of skewness (g=1, h=0)

Coverage is lower than the expected 95% for all methods. Coverage is about 88% for the standard and percentile bootstrap CIs, 92.3% for the asymmetric bootstrap-t CIs, and 91% for the symmetric bootstrap-t CIs. As we saw above, CIs are larger for the bootstrap-t CIs relative to the standard and percentile bootstrap CIs. CIs that did not include the population value tended to be shifted to the left of the population value, and more so for the standard CIs and the bootstrap-t symmetric CIs.

So when making inferences about the mean using the standard T-test, our CI coverage is lower than expected, and we are likely to underestimate the population value (the sample mean is median biased—Rousselet & Wilcox, 2019).

Relative to the other methods, the asymmetric bootstrap-t CIs are more evenly distributed on either side of the population value, and the right shifted CIs tend to be much larger and more variable. The difference with the symmetric CIs is particularly striking and suggests that the asymmetric CIs could be misleading in certain situations. This intuition is confirmed by a simulation in which outliers are likely (h=0.2).

In the presence of skewness and outliers (g=1, h=0.2)

In the presence of outliers, the patterns observed in the previous figure are exacerbated. Some of the percentile bootstrap and asymmetric bootstrap-t intervals are ridiculously wide (x axis is truncated).

In such situations, inferences on trimmed means would greatly improve performance over the mean.

Conclusion

As we saw in a previous post, a good way to handle skewness and outliers is to make inferences about the population trimmed means. For instance, trimming 20% is efficient in many situations, even when using parametric methods that do not rely on the bootstrap. So what’s the point of the bootstrap-t? From the examples above, the bootstrap-t can perform much better than the standard Student’s approach and the percentile bootstrap when making inferences about the mean. So, when the data are skewed and the population mean is of interest, the bootstrap-t is highly recommended. Whether to use the symmetric or asymmetric approach is not completely clear based on the literature (Wilcox, 2017). Intuition suggests that the asymmetric approach is preferable, but our last example suggests it could be a bad idea when making inferences about the mean.

Symmetric or not, the bootstrap-t confidence intervals combined with the mean do not necessarily deal with skewness as well as other methods combined with trimmed means. But the bootstrap-t can be used to make inferences about trimmed means too! So which combination of approaches should we use? For instance, we could make inferences about the mean, the 10% trimmed mean or the 20% trimmed mean, in conjunction with a non-bootstrap parametric method, the percentile bootstrap or the bootstrap-t. We saw that for the mean, the bootstrap-t method is preferable in the presence of skewness. For inferences about trimmed means, the percentile bootstrap works well when trimming 20%. If we trim less, then the other methods should be considered, but a blanket recommendation cannot be provided. The choice of combination can also depend on the application. For instance, to correct for multiple comparisons in brain imaging analyses, cluster-based statistics are strongly recommended, in which case a bootstrap-t approach is very convenient. And the bootstrap-t is easily extended to factorial ANOVAs (Wilcox, 2017; Field & Wilcox, 2017).

What about the median? The bootstrap-t should not be used to make inferences about the median (50% trimming), because the standard error is not estimated correctly. Special parametric techniques have been developed for the median (Wilcox, 2017). The percentile bootstrap also works well for the median and other quantiles in some situations, providing sample sizes are sufficiently large (Rousselet, Pernet & Wilcox, 2019).

References

Efron, Bradley, and Robert Tibshirani. An Introduction to the Bootstrap. Chapman and Hall/CRC, 1994.

Field, Andy P., and Rand R. Wilcox. ‘Robust Statistical Methods: A Primer for Clinical Psychology and Experimental Psychopathology Researchers’. Behaviour Research and Therapy 98 (November 2017): 19–38. https://doi.org/10.1016/j.brat.2017.05.013.

Hesterberg, Tim C. ‘What Teachers Should Know About the Bootstrap: Resampling in the Undergraduate Statistics Curriculum’. The American Statistician 69, no. 4 (2 October 2015): 371–86. https://doi.org/10.1080/00031305.2015.1089789.

Hoaglin, David C. ‘Summarizing Shape Numerically: The g-and-h Distributions’. In Exploring Data Tables, Trends, and Shapes, 461–513. John Wiley & Sons, Ltd, 1985. https://doi.org/10.1002/9781118150702.ch11.

Rousselet, Guillaume A., Cyril R. Pernet, and Rand R. Wilcox. ‘A Practical Introduction to the Bootstrap: A Versatile Method to Make Inferences by Using Data-Driven Simulations’. Preprint. PsyArXiv, 27 May 2019. https://doi.org/10.31234/osf.io/h8ft7.

Rousselet, Guillaume A., and Rand R. Wilcox. ‘Reaction Times and Other Skewed Distributions: Problems with the Mean and the Median’. Preprint. PsyArXiv, 17 January 2019. https://doi.org/10.31234/osf.io/3y54r.

Tukey, John W., and Donald H. McLaughlin. ‘Less Vulnerable Confidence and Significance Procedures for Location Based on a Single Sample: Trimming/Winsorization 1’. Sankhyā: The Indian Journal of Statistics, Series A (1961-2002) 25, no. 3 (1963): 331–52.

Wilcox, Rand R. Introduction to Robust Estimation and Hypothesis Testing. 4th edition. Academic Press, 2017.

Wilcox, Rand R., and Guillaume A. Rousselet. ‘A Guide to Robust Statistical Methods in Neuroscience’. Current Protocols in Neuroscience 82, no. 1 (2018): 8.42.1-8.42.30. https://doi.org/10.1002/cpns.41.

Yan, Yuan, and Marc G. Genton. ‘The Tukey G-and-h Distribution’. Significance 16, no. 3 (2019): 12–13. https://doi.org/10.1111/j.1740-9713.2019.01273.x.

Comparing two independent Pearson’s correlations: confidence interval coverage

This post looks at the coverage of confidence intervals for the difference between two independent correlation coefficients. Previously, we saw how the standard Fisher’s r-to-z transform can lead to inflated false positive rates when sampling from non-normal bivariate distributions and the population correlation differs from zero. In this post, we look at a complementary perspective: using simulations, we’re going to determine how often confidence intervals include the population difference. As we saw in our previous post, computing, say, 95% confidence intervals does not mean that, over the long run, 95% of such confidence intervals will include the population value we’re trying to estimate. In some situations, the coverage is much lower than expected, which means we might fool ourselves more often than we thought (although in practice, in most discussions I’ve ever read, authors behave as if their 95% confidence intervals were very narrow 100% confidence intervals — but that’s another story).

We look at confidence interval coverage for the difference between Pearson’s correlations using Zou’s method (2007) and a modified percentile bootstrap method (Wilcox, 2009). We do the same for the comparison of Spearman’s correlations using the standard percentile bootstrap. We use simulations with 4,000 iterations. Sampling is from bivariate g & h distributions (see illustrations here).
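
To make the bootstrap part concrete, here is a rough sketch of the standard percentile bootstrap for the difference between two independent Spearman’s correlations (this is not the modified method of Wilcox, 2009, used here for Pearson’s correlations):

# percentile bootstrap CI for the difference between two independent Spearman's correlations
spearman.diff.ci <- function(x1, y1, x2, y2, nboot = 2000, alpha = 0.05){
  boot.diff <- replicate(nboot, {
    id1 <- sample(length(x1), replace = TRUE)   # resample pairs within group 1
    id2 <- sample(length(x2), replace = TRUE)   # resample pairs within group 2
    cor(x1[id1], y1[id1], method = "spearman") -
      cor(x2[id2], y2[id2], method = "spearman")
  })
  quantile(boot.diff, c(alpha/2, 1 - alpha/2))  # percentile bounds
}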

We consider 4 cases:

  • g = h = 0, difference = 0.1, vary rho
  • g = 1, h = 0, difference = 0.1, vary rho
  • rho = 0.3, difference = 0.2, vary g, h = 0
  • rho = 0.3, difference = 0.2, vary g, h = 0.2

g = h = 0, difference = 0.1, vary rho

That’s the standard normal bivariate distribution. Group 1 has values of rho1 = 0 to 0.8, in steps of 0.1. Group 2 has values of rho2 = rho1 + 0.1.

For normal bivariate distributions, coverage is at the nominal level for all methods, sample sizes and population correlations. (Here I only considered sample sizes of 50+ because otherwise power is far too low, so there is no point.)

The width of the CIs (upper bound minus lower bound) decreases with rho and with sample size. That’s expected from the sampling distributions of correlation coefficients.

When CIs do not include the population value, are they located to the left or the right of the population? In the figure below, negative values indicate a preponderance of left shifts, positive values a preponderance of right shifts. A value of 1 = 100% right shifts, -1 = 100% left shifts. For Pearson, CIs not including the population value tend to be located evenly to the left and right of the population. For Spearman, there is a preponderance of left shifted CIs for rho1 = 0.8. This left shift implies a tendency to over-estimate the difference (the difference group 1 minus group 2 is negative).

g = 1, h = 0, difference = 0.1, vary rho

What happens when we sample from a skewed distribution?

The coverage is lower than the expected 95% for Zou’s method and the discrepancy worsens with increasing rho1 and with increasing sample size. The percentile bootstrap does a much better job. Spearman’s combined with the percentile bootstrap is spot on.

For CIs that did not include the population value, the pattern of shifts varies as a function of rho. For Pearson, CIs are more likely to be located to the right of the population (under-estimation of the population value or wrong sign) for rho = 0, whereas for rho = 0.8, CIs are more likely to be located to the left. Spearman + bootstrap produces much more balanced results.

To investigate the asymmetry, we look at CIs for g=1, a sample size of n = 200, and the extremes of the rho1 range, rho1 = 0 and rho1 = 0.8. The figure below shows the preponderance of right shifted CIs for the two Pearson methods. The vertical line marks the population difference of -0.1.

For rho1 = 0.8, the pattern changes to a preponderance of left shifts for all methods, which means that the CIs tended to over-estimate the population difference. CIs for differences between Spearman’s correlations were quite a bit smaller than Pearson’s though, thus showing less bias and less uncertainty.

rho = 0.3, difference = 0.2, vary g, h = 0

For another perspective on the three methods, we now look at a case with:

  • group 1: rho1 = 0.3
  • group 2: rho2 = 0.5
  • we vary g from 0 to 1.

For Pearson + Zou, coverage progressively decreases with increasing g, and to a much more limited extent with increasing sample size. Pearson + bootstrap is much more resilient to changes in g. And Spearman + bootstrap just doesn’t care about asymmetry!

The better coverage of Pearson + bootstrap seems to be achieved by producing wider CIs.

Matters only get worse for Pearson + Zou when outliers are likely (see notebook on GitHub).

Conclusion

Based on this new comparison of the 3 methods, I’d argue again that Spearman + bootstrap should be preferred over the two Pearson methods. But if the goal is to assess linear relationships, then Pearson + bootstrap is preferable to Zou’s method. I’ll report on other methods in another post.

References

Comparison of correlation coefficients

Zou, Guang Yong. Toward Using Confidence Intervals to Compare Correlations. Psychological Methods 12, no. 4 (2007): 399–413. https://doi.org/10.1037/1082-989X.12.4.399.

Wilcox, Rand R. Comparing Pearson Correlations: Dealing with Heteroscedasticity and Nonnormality. Communications in Statistics – Simulation and Computation 38, no. 10 (1 November 2009): 2220–34. https://doi.org/10.1080/03610910903289151.

Baguley, Thom. Comparing correlations: independent and dependent (overlapping or non-overlapping) https://seriousstats.wordpress.com/2012/02/05/comparing-correlations/

Diedenhofen, Birk, and Jochen Musch. Cocor: A Comprehensive Solution for the Statistical Comparison of Correlations. PLoS ONE 10, no. 4 (2 April 2015). https://doi.org/10.1371/journal.pone.0121945.

g & h distributions

Hoaglin, David C. Summarizing Shape Numerically: The g-and-h Distributions. In Exploring Data Tables, Trends, and Shapes, 461–513. John Wiley & Sons, Ltd, 1985. https://doi.org/10.1002/9781118150702.ch11.

Yan, Yuan, and Marc G. Genton. The Tukey G-and-h Distribution. Significance 16, no. 3 (2019): 12–13. https://doi.org/10.1111/j.1740-9713.2019.01273.x.

When is a 95% confidence interval not a 95% confidence interval?


In previous posts, we saw how skewness and outliers can affect false positives (type I errors) and true positives (power) in one-sample tests. In particular, when making inferences about the population mean, skewness tends to inflate false positives, and skewness and outliers can destroy power. Here we investigate a complementary perspective, looking at how confidence intervals are affected by skewness and outliers.

Spoiler alert: 95% confidence intervals most likely do not have a coverage of 95%. In fact, I’ll show you an example in which a 95% CI for the mean has an 80% coverage…

The R code for this post is on GitHub.


Back to the title of the post. Seems like a weird question? Not if we consider the definition of a confidence interval (CI). Let’s say we conduct an experiment to estimate quantity x from a sample, where x could be the median or the mean for instance. Then a 95% CI for the population value of x refers to a procedure whose behaviour is defined in the long-run: CIs computed in the same way should contain the population value in 95% of exact replications of the experiment. For a single experiment, the particular CI either does or does not contain the population value; there is no probability associated with it. A CI can also be described as the interval compatible with the data given our model — see definitions and common misinterpretations in Greenland et al. (2016).

So 95% refers to the (long-term) coverage of the CI; the exact values of the CI bounds vary across experiments. The CI procedure is associated with a certain coverage probability, in the long-run, given the model. Here the model refers to how we collected data, data cleaning procedures (e.g. outlier removal), assumptions about data distribution, and the methods used to compute the CI. Coverage can differ from the expected one if model assumptions are violated or the model is just plain wrong.

Wrong models are extremely common, for instance when applying a standard t-test CI to percent correct data (Kruschke, 2014; Jaeger, 2008) or Likert scale data (Bürkner & Vuorre, 2019; Liddell & Kruschke, 2018).

For continuous data, CI coverage can also fall short of the expected, nominal level, for instance when the model expects symmetric distributions and we’re actually sampling from skewed populations (which is the norm, not the exception, when we measure sizes, durations, latencies, etc.). Here we explore this issue using g & h distributions, which let us manipulate asymmetry.

Illustrate g & h distributions

All g & h distributions have a median of zero. The parameter g controls the asymmetry of the distribution, while the parameter h controls the thickness of the tails (Hoaglin, 1985; Yan & Genton, 2019). Let’s look at some illustrations to make things clear.

Examples in which we vary g from 0 to 1.

As g increases, the asymmetry of the distributions increases. Using negative g values would produce distributions with negative skewness.

Examples in which we vary h from 0 to 0.2.

As h increases, the tails are getting thicker, which means that outliers are more likely. 

Test with normal (g=h=0) distribution

Let’s run simulations to look at coverage probability in different situations and for different estimators. First, we sample with replacement from a normal population (g=h=0) 20,000 times (that’s 20,000 simulated experiments). Each sample has size n=30. Confidence intervals are computed for the mean, the 10% trimmed mean (tm), the 20% trimmed mean and the median using standard parametric methods (see details in the code on GitHub, and references for equations in Wilcox & Rousselet, 2018). The trimmed mean and the median are robust measures of central tendency. To compute a 10% trimmed mean, observations are sorted, the 10% lowest and 10% largest values are discarded (20% in total), and the remaining values are averaged. In this context, the mean is a 0% trimmed mean and the median is a 50% trimmed mean. Trimming the data attenuates the influence of the tails of the distributions and thus the effects of asymmetry and outliers on confidence intervals.
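
For reference, here is a minimal sketch of the kind of parametric CI used for a trimmed mean, based on the winsorized standard deviation (the Tukey & McLaughlin approach); see Wilcox & Rousselet (2018) for the exact equations and dedicated R functions:

# parametric CI for a trimmed mean, using the winsorized SD: a rough sketch
trimmed.mean.ci <- function(x, tr = 0.2, alpha = 0.05){
  n <- length(x)
  g <- floor(tr * n)                 # number of observations trimmed from each end
  xs <- sort(x)
  xw <- xs                           # winsorize: pull the trimmed values to their nearest neighbours
  if(g > 0){
    xw[1:g] <- xs[g + 1]
    xw[(n - g + 1):n] <- xs[n - g]
  }
  tm <- mean(x, trim = tr)           # trimmed mean
  se <- sd(xw) / ((1 - 2 * tr) * sqrt(n))   # standard error based on the winsorized SD
  df <- n - 2 * g - 1
  tm + c(-1, 1) * qt(1 - alpha/2, df) * se
}
# example: trimmed.mean.ci(rexp(30), tr = 0.2)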

First we look at coverage for the 4 estimators: we look at the proportion of simulated experiments in which the CIs included the population value for each estimator. As expected for the special case of a normal distribution, the coverage is close to nominal (95%) for every method:

Measure Mean 10% tm 20% tm Median
Coverage 0.949 0.948 0.943 0.947

In addition to coverage, we also look at the width of the CIs (upper bound minus lower bound). Across simulations, we summarise the results using the median width. CIs tend to be larger for trimmed means and the median relative to the mean, which implies lower power under normality for these methods (Wilcox & Rousselet, 2018).

Measure Mean 10% tm 20% tm Median
Width 0.737 0.761 0.793 0.889

For CIs that did not include the population value, the distribution is fairly balanced between the left and the right of the population value. To see this, I computed a shift index: if a CI was located to the left of the population value, it received a score of -1; if it was located to the right, it received a score of 1. The shift index was then computed by averaging the scores only for those CIs excluding the population value.

Measure Mean 10% tm 20% tm Median
Shift 0.046 0.043 0.009 0.013
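
A minimal sketch of how such a shift index can be computed, using made-up CI bounds:

pop <- 0                          # population value
ci.lo <- rnorm(1000, mean = -1)   # hypothetical lower bounds across simulated experiments
ci.hi <- ci.lo + 2                # hypothetical upper bounds
miss <- ci.lo > pop | ci.hi < pop            # CIs that exclude the population value
shift <- ifelse(ci.hi[miss] < pop, -1, 1)    # -1 = CI to the left of pop, +1 = to the right
mean(shift)                                  # the shift index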

Illustrate CIs that did not include the population

Out of 20,000 simulated experiments, about 1,000 CIs (roughly 5%) did not include the population value for each estimator. About the same number of CIs were shifted to the left and to the right of the population value, which is illustrated in the next figure. In each panel, the vertical line marks the population value (here it’s zero in all conditions because the population is symmetric). The CIs are plotted in the order of occurrence in the simulation. So the figure shows that if we miss the population value, we’re as likely to overshoot as to undershoot our estimate.

Across panels, the figure also shows that the more we trim (10%, 20%, median), the larger the CIs get. So for a strictly normal population, the mean is estimated more precisely than trimmed means and the median.

Test with g=1 & h=0 distribution

What happens for a skewed population? Three things happen for the mean:

  • coverage goes down
  • width increases
  • CIs not including the population value tend to be shifted to the left (negative average shift values)

The same effects are observed for the trimmed means, but less so the more we trim, because trimming alleviates the effects of the tails.

Measure Mean 10% tm 20% tm Median
Coverage 0.880 0.936 0.935 0.947
Width 1.253 0.956 0.879 0.918
Shift -0.962 -0.708 -0.661 0.017
# left 2350 1101 1084 521
# right 45 188 221 539

Illustrate CIs that did not include the population

The figure illustrates the strong imbalance between left and right CI shifts. If we try to estimate the mean of a skewed population, our CIs are likely to miss it more than 5% of the time, and when that happens, the CIs are most likely to be shifted towards the bulky part of the distribution (here the left for a right skewed distribution). Also, the right shifted CIs vary a lot in width and can be very large.

As we trim, the imbalance is progressively resolved. With 20% trimming, when CIs do not contain the population value, the distribution of left and right shifts is more balanced, although still with far more left shifts. With the median we have roughly 50% left / 50% right shifts and CIs are narrower than for the mean.

Test with g=1 & h=0.2 distribution

What happens if we sample from a skewed distribution (g=1) in which outliers are likely (h=0.2)?

Measure Mean 10% tm 20% tm Median
Coverage 0.801 0.934 0.936 0.947
Width 1.729 1.080 0.934 0.944
Shift -0.995 -0.797 -0.709 0.018
# left 3967 1194 1086 521
# right 9 135 185 540

The results are similar to those observed for h=0, only exacerbated. Coverage for the mean is even lower, CIs are larger, and the shift imbalance even more severe. I have no idea how often such a situation occurs, but I suspect it might be rather common if you study clinical populations. Anyway, the point is that it is a very bad idea to assume the distributions we study are normal, apply standard tools, and hope for the best. Reporting CIs as 95% or some other value, without checking, can be very misleading.

Simulations in which we vary g

We now explore CI properties as a function of g, which we vary from 0 to 1, in steps of 0.1. The parameter h is set to 0 (left column of next figure) or 0.2 (right column). Let’s look at column A first (h=0). For the median, coverage is unaffected by g. For the other estimators, there is a monotonic decrease in coverage with increasing g. The effect is much stronger for the mean than the trimmed means.

For all estimators, increasing g leads to monotonic increases in CI width. The effect is very subtle for the median and more pronounced the less we trim. Under normality (g=0), CIs are shortest for the mean, explaining the larger power of mean-based methods relative to trimmed means in this unusual situation.

In the third panel, the zero line represents an equal proportion of left and right shifts, relative to the population, for CIs that did not include the population value. The values are consistently above zero for the median, with a few more right shifts than left shifts for all values of g. For the other estimators, the preponderance of left shifts increases markedly with g.

Now we look at results in panel B (h=0.2). When outliers are likely, coverage drops faster with g for the mean. Other estimators are resistant to outliers.

When outliers are common, CIs for the population mean are larger than for all other estimators, irrespective of g.

Again, there is a constant over-representation of right shifted CIs for the median. For the other estimators, the left shifted CIs dominate more and more with increasing g. The trend is more pronounced for the mean relative to the h=0 situation, with a sharper monotonic downward trajectory.

Conclusion

The answer to the question in the title is: most of the time! Simply because our models are wrong most of the time. So I would take all published confidence intervals with a pinch of salt. [Some would actually go further and say that if the sampling and analysis plans for an experiment were not clearly stipulated before running the experiment, then confidence intervals, like P values, are not even defined (Wagenmakers, 2007). That is, we can compute a CI, but the coverage is meaningless, because exact repeated sampling might be impossible or contingent on external factors that would need to be simulated.] The best way forward is probably not to advocate for the use of trimmed means or the median over the mean in all cases, because different estimators address different questions about the data. And there are more estimators of central tendency than means, trimmed means and medians. There are also more interesting questions to ask about the data than their central tendencies (Rousselet, Pernet & Wilcox, 2017). For these reasons, we need data sharing to be the default, so that other users can ask different questions using different tools. The idea that the one approach used in a paper is the best to address the problem at hand is just silly.

To see what happens when we use the percentile bootstrap or the bootstrap-t to build confidence intervals for the mean, see this more recent post.

References

Bürkner, Paul-Christian, and Matti Vuorre. ‘Ordinal Regression Models in Psychology: A Tutorial’. Advances in Methods and Practices in Psychological Science 2, no. 1 (1 March 2019): 77–101. https://doi.org/10.1177/2515245918823199.

Greenland, Sander, Stephen J. Senn, Kenneth J. Rothman, John B. Carlin, Charles Poole, Steven N. Goodman, and Douglas G. Altman. ‘Statistical Tests, P Values, Confidence Intervals, and Power: A Guide to Misinterpretations’. European Journal of Epidemiology 31, no. 4 (1 April 2016): 337–50. https://doi.org/10.1007/s10654-016-0149-3.

Hoaglin, David C. ‘Summarizing Shape Numerically: The g-and-h Distributions’. In Exploring Data Tables, Trends, and Shapes, 461–513. John Wiley & Sons, Ltd, 1985. https://doi.org/10.1002/9781118150702.ch11.

Jaeger, T. Florian. ‘Categorical Data Analysis: Away from ANOVAs (Transformation or Not) and towards Logit Mixed Models’. Journal of Memory and Language 59, no. 4 (November 2008): 434–46. https://doi.org/10.1016/j.jml.2007.11.007.

Kruschke, John K. Doing Bayesian Data Analysis. 2nd Edition. Academic Press, 2014.

Liddell, Torrin M., and John K. Kruschke. ‘Analyzing Ordinal Data with Metric Models: What Could Possibly Go Wrong?’ Journal of Experimental Social Psychology 79 (1 November 2018): 328–48. https://doi.org/10.1016/j.jesp.2018.08.009.

Rousselet, Guillaume A., Cyril R. Pernet, and Rand R. Wilcox. ‘Beyond Differences in Means: Robust Graphical Methods to Compare Two Groups in Neuroscience’. European Journal of Neuroscience 46, no. 2 (1 July 2017): 1738–48. https://doi.org/10.1111/ejn.13610.

Rousselet, Guillaume A., and Rand R. Wilcox. ‘Reaction Times and Other Skewed Distributions: Problems with the Mean and the Median’. Preprint. PsyArXiv, 17 January 2019. https://doi.org/10.31234/osf.io/3y54r.

Wagenmakers, Eric-Jan. ‘A Practical Solution to the Pervasive Problems of p Values’. Psychonomic Bulletin & Review 14, no. 5 (1 October 2007): 779–804. https://doi.org/10.3758/BF03194105.

Wilcox, Rand R., and Guillaume A. Rousselet. ‘A Guide to Robust Statistical Methods in Neuroscience’. Current Protocols in Neuroscience 82, no. 1 (2018): 8.42.1-8.42.30. https://doi.org/10.1002/cpns.41.

Yan, Yuan, and Marc G. Genton. ‘The Tukey G-and-h Distribution’. Significance 16, no. 3 (2019): 12–13. https://doi.org/10.1111/j.1740-9713.2019.01273.x.

“Amends might be made in the interest of the new generation of students by printing in leaded type in future editions of existing text-books and in all new text-books:

Normality is a myth; there never was, and never will be, a normal distribution.

This is an over-statement from the practical point of view, but it represents a safer initial mental attitude than any in fashion during the past two decades.”

Geary, Biometrika, 1947