Tag Archives: bootstrap

Cluster correction for multiple dependent comparisons

In this post I explain the benefits of applying cluster-based statistics, developed for brain imaging applications, to other experimental designs in which tests are correlated. Here are some examples of such designs:

  • different groups of rats are tested with increasing concentrations of a molecule;

  • different groups of humans or the same humans are tested with stimulations of different intensities or durations (e.g. in neuro/psych it could be TMS, contrast, luminance, priming, masking, SOA);

  • pain thresholds are measured on contiguous patches of skin;

  • insects are sampled from neighbouring fields;

  • participants undergo a series of training sessions. 

In these examples, whatever is measured leads to statistical tests that are correlated in one or a combination of factors: time, space, stimulus parameters. In the frequentist framework, if the outcome of the family of tests is corrected for multiple comparisons using standard procedures (Bonferroni, Hochberg etc.), power will decrease with the number of tests. Cluster-based correction methods can keep false positives at the nominal level (say 0.05) without compromising power.

These types of dependencies can also be explicitly modelled using Gaussian processes (for a Bayesian example, see McElreath, 2018, chapter 13). Cluster-based statistics are much simpler to use, but they do not provide the important shrinkage afforded by hierarchical methods…  

Cluster-based statistics

To get started, let’s consider an example involving a huge number of correlated tests. In this example, measurements are made at contiguous points in space (y axis) and time (x axis). The meaning of the axes is actually irrelevant – what matters is that the measurements are contiguous. In the figure below, left panel, we define our signal, which is composed of 2 clusters of high intensities among a sea of points with no effect (dark blue = 0). Fake measurements are then simulated by adding white noise to each point. By doing that 100 times, we obtain 100 noisy maps. The mean of these noisy maps is shown in the right panel.

fig1

We also create 100 maps composed entirely of noise. Then we perform a t-test for independent groups at each point in the map (n=100 per group). 

fig2

What do we get? If we use a threshold of 0.05, we get two nice clusters of statistically significant tests where they are supposed to be. But we also get many false positives. If we try to get rid of the false positives by lowering the threshold, it works to some extent, but at the cost of removing true effects. Even with a threshold of 0.0005, there are still many false positives, and the clusters of true signal have been seriously truncated.

fig3

The problem is that lowering alpha is a brute-force technique that ignores information we have about the data: measurement points are correlated. There is a family of techniques that can correct for multiple comparisons by taking these dependencies into account: cluster-based statistics (for an introduction, see Maris & Oostenveld, 2007). These techniques control the family-wise error rate while maintaining high power. The family-wise error rate (FWER) is the probability of obtaining at least one significant test among a family of tests when the null hypothesis is true.

When we use a frequentist approach and perform a family of tests, we increase the probability of reporting false positives. The multiple comparison problem is difficult to tackle in many situations because of the need to balance false positives and false negatives. Probably the best known and most widely applied correction technique is Bonferroni’s, in which the alpha threshold is divided by the number of comparisons. However, this procedure is notoriously conservative: the control of false positives comes at the cost of lower power. Many other techniques have been proposed (I don’t know of a good review paper on this topic – please add a comment if you do).

In the example below, two time-courses are compared point-by-point. Panel a shows the mean time-courses across participants. Panel b shows the time-course of the t-test for 2 dependent groups (the same participants were tested in each condition). Panel c shows time-points at which significant t-tests were observed. Without correction, a large cluster of significant points is observed, but also a collection of smaller clusters. We know from physiology that some of these clusters are too small to be true, so they are very likely false positives.

fig4_maris2007

Figure 1 from Maris & Oostenveld, 2007.

If we change the significance threshold using the Bonferroni correction for multiple comparisons, in these examples we remove all significant clusters but the largest one. Good job?! The problem is that our large cluster has been truncated: it now looks like the effect starts later and ends sooner. The cluster-based inferences do not suffer from this problem.

Applied to our 2D example with two clusters embedded in noise, the clustering technique identifies 17,044 clusters of significant t-tests. After correction, only 2 clusters are significant!

fig6

So how do we compute cluster-based statistics? The next figure illustrates the different steps. At the top, we start with a time-course of F values, from a series of point-by-point ANOVAs. Based on some threshold, say the critical F value for alpha = 0.05, we identify 3 clusters. The clusters are formed based on contiguity. For each cluster we then compute a summary statistic: it could be its duration (length), its height (maximum), or its sum. Here we use the sum. Now we ask a different question: for each cluster, how likely is it to obtain a cluster sum that large by chance? To answer this question, we use non-parametric statistics to estimate the distribution expected by chance.

fig5

There are several ways to achieve this goal using permutation, percentile bootstrap or bootstrap-t methods (Pernet et al., 2015). Whatever technique we use, we simulate time-courses of F values expected by chance, given the data. For each of these simulated time-courses, we apply a threshold, identify clusters, take the sum of each cluster and save the maximum sum across clusters. If we do that 1,000 times, we end up with a collection of 1,000 cluster sums (shown in the top right corner of the figure). We then sort these values and identify a quantile of interest, say the 0.95 quantile. Finally, we use this quantile as our cluster-based threshold: each original cluster sum is then compared to that threshold. In our example, out of the 3 original clusters, the largest 2 are significant after cluster-based correction for multiple comparisons, whereas the smallest one is not. 
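To make the recipe concrete, here is a minimal R sketch of the clustering step for a 1D sequence of test statistics. The F values, degrees of freedom and threshold are made up for illustration; the cluster statistic is the sum, as in the figure above.

# group contiguous significant tests into clusters and sum the statistic within each
cluster_sums <- function(stat, sig){
  r <- rle(as.vector(sig))            # runs of significant / non-significant tests
  ends <- cumsum(r$lengths)
  starts <- ends - r$lengths + 1
  keep <- which(r$values)             # runs of TRUE = clusters
  sapply(keep, function(k) sum(stat[starts[k]:ends[k]]))
}

set.seed(1)
fvals <- c(rnorm(20, 1), rnorm(10, 6), rnorm(20, 1), rnorm(5, 4), rnorm(20, 1))^2
fcrit <- qf(0.95, 1, 58)              # hypothetical cluster-forming threshold
cluster_sums(fvals, fvals > fcrit)    # one sum per cluster

The estimation of the null distribution of the maximum cluster sum is sketched in the dependent-groups example further down.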

Simulations

From the description above, it is clear that using cluster-based statistics requires a few choices:

  • a method to estimate the null distribution;
  • a method to form clusters;

  • a choice of cluster statistics;

  • a choice of statistic to form the null distribution (max across clusters for instance);

  • a number of resamples…

Given a set of choices, we need to check that our method does what it’s supposed to do. So let’s run a few simulations…

5 dependent groups

First we consider the case of 5 dependent groups. The 5 measurements are correlated in time or space or some other factor, such that clusters can be formed by simple proximity: 2 significant tests are grouped in a cluster if they are next to each other. Data are normally distributed, the population SD is 1, and the correlation between repeated measures is 0.75. Here is the FWER after 10,000 simulations, in which we perform 5 one-sample t-tests on means.

fig7_fwer

Without correction for multiple comparisons, the probability of getting at least one false positive is well above the nominal level (here 0.05). The grey area marks Bradley’s (1978) satisfactory range of false positives (between 0.025 and 0.075). Bonferroni’s and Hochberg’s corrections dramatically reduce the FWER, as expected. For n = 10, the FWER remains quite high, but drops within the acceptable range for higher sample sizes. But these corrections tend to be conservative, leading to FWER systematically under 0.05 from n = 30. Using a cluster-based correction, the FWER is near the nominal level at all sample sizes.

The cluster correction was done using a bootstrap-t procedure, in which the original data are first mean-centred, so that the null hypothesis is true, and the t distributions expected by chance are estimated by sampling the centred data with replacement 1,000 times, each time computing a series of t-tests. For each bootstrap sample, the maximum cluster sum statistic was saved, and the 95th quantile of this distribution was used to threshold the original clusters.
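Here is a minimal R sketch of that procedure, with hypothetical data (a participants x conditions matrix) and the cluster_sums() helper from the earlier sketch repeated so the code is self-contained. Summing squared t values within clusters is one of several reasonable choices of cluster statistic.

cluster_sums <- function(stat, sig){
  r <- rle(as.vector(sig)); ends <- cumsum(r$lengths); starts <- ends - r$lengths + 1
  keep <- which(r$values)
  sapply(keep, function(k) sum(stat[starts[k]:ends[k]]))
}

set.seed(44)
n <- 30; J <- 5
x <- matrix(rnorm(n * J), n, J) + matrix(rep(c(0, 0.5, 1, 0.5, 0), each = n), n, J)

tvals <- apply(x, 2, function(v) t.test(v)$statistic)  # original one-sample t-tests
tcrit <- qt(0.975, n - 1)                              # cluster-forming threshold

xc <- sweep(x, 2, colMeans(x))                         # mean-centre: null hypothesis true
nboot <- 1000
max_sums <- numeric(nboot)
for(b in 1:nboot){
  xb <- xc[sample(n, replace = TRUE), ]                # resample participants
  tb <- apply(xb, 2, function(v) t.test(v)$statistic)
  cs <- cluster_sums(tb^2, abs(tb) > tcrit)            # sum of squared t per cluster
  max_sums[b] <- if(length(cs)) max(cs) else 0
}
boot_th <- quantile(max_sums, 0.95)                    # bootstrap cluster-sum threshold
cluster_sums(tvals^2, abs(tvals) > tcrit) > boot_th    # which original clusters survive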

Next we consider power. We sample from a population with 5 dependent conditions: there is no effect in conditions 1 and 5 (mean = 0), the mean is 1 for condition 3, and the mean is 0.5 for conditions 2 and 4. We could imagine a TMS experiment where participants first receive a sham stimulation, then stimulation of half intensity, full, half, and sham again… Below is an illustration of a random sample of data from 30 participants.

fig8_5group_example

If we define power as the probability of observing a significant t-test simultaneously in conditions 2, 3 and 4, we get these results:

fig9_power_all

Maximum power is always obtained in the condition without correction, by definition. The cluster correction always reaches maximum possible power, except for n = 10. In contrast, Bonferroni and Hochberg lead to lower power, with Bonferroni being the most conservative. For a desired long-run power value, we can use interpolation to find the matching sample size (see the sketch after this list). To achieve at least 80% power, the minimum sample size is:

  • 39 observations for the cluster test;
  • 50 observations for Hochberg;

  • 57 observations for Bonferroni.
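As announced above, here is a minimal sketch of the interpolation step, with made-up power values standing in for the simulation results:

nseq <- seq(10, 100, 10)                  # sample sizes used in the simulation
pow <- c(0.30, 0.52, 0.68, 0.81, 0.89, 0.94, 0.97, 0.98, 0.99, 1.00)  # hypothetical values
approx(x = pow, y = nseq, xout = 0.80)$y  # smallest n giving roughly 80% power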

7 dependent groups

If we run the same simulation but with 7 dependent groups instead of 5, the pattern of results does not change, but the FWER increases if we do not apply any correction for multiple comparisons.

fig10_7_fwer

As for power, if we keep a cluster of effects with means 0.5, 1, 0.5 for conditions 3, 4 and 5, and zero effect for conditions 1, 2, 6 and 7, the power advantage of the cluster test increases. Now, to achieve at least 80% power, the minimum sample size is:

  • 39 observations for the cluster test;
  • 56 observations for Hochberg;

  • 59 observations for Bonferroni.

fig11_power_7_all

7 independent groups

Finally, we consider a situation with 7 independent groups. For instance, measurements were made in 7 contiguous fields. So the measurements are independent (done at different times), but there is spatial dependence between fields, so that we would expect that if a measurement is high in one field, it is likely to be high in the next field too. Here are the FWER results, showing a pattern similar to that in the previous examples:

fig12_7ind_fwer

The cluster correction does the best job at minimising false positives, whereas Bonferroni and Hochberg are too liberal for sample sizes 10 and 20.

To look at power, I created a simulation with a linear pattern: there is no effect in positions 1 and 2, then a linear increase up to a maximum effect size of 2 at position 7. Here is the sequence of effect sizes:

c(0, 0, 0.4, 0.8, 1.2, 1.6, 2)

And here is an example of a random sample with n = 70 measurements per group:

fig13_7ind_group_example

In this situation, again the cluster correction dominates the other methods in terms of power. For instance, to achieve at least 80% power, the minimum sample size is:

  • 50 observations for the cluster test;
  • 67 observations for Hochberg;

  • 81 observations for Bonferroni.

fig14_power_7ind_all

Conclusion

I hope the examples above have convinced you that cluster-based statistics could dramatically increase your statistical power relative to standard techniques used to correct for multiple comparisons. Let me know if you use a different correction method and would like to see how they compare. Or you could re-use the simulation code and give it a go yourself. 

Limitations: cluster-based methods make inferences about clusters, not about individual tests. Also, these methods require a threshold to form clusters, which is arbitrary and not convenient if you use non-parametric tests that do not come with p values. An alternative technique eliminates this requirement, instead forming a statistic that integrates across many potential cluster thresholds (TFCE, Smith & Nichols, 2009; Pernet et al. 2015). TFCE also affords inferences for each test, not the cluster of tests. But it is computationally much more demanding than the standard cluster test demonstrated in this post. 

Code

Matlab code for ERP analyses is available on figshare and as part of the LIMO EEG toolbox. The code can be used for other purposes – just pretend you’re dealing with one EEG electrode and Bob’s your uncle.

R code to reproduce the simulations is available on github. I’m planning to develop an R package to cover different experimental designs, using t-tests on means and trimmed means. In the meantime, if you’d like to apply the method but can’t make sense of my code, don’t hesitate to get in touch and I’ll try to help.

References

Bradley, J. V. (1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31, 144–152. doi: 10.1111/j.2044-8317.1978.tb00581.x. 

Maris, E. & Oostenveld, R. (2007) Nonparametric statistical testing of EEG- and MEG-data. Journal of neuroscience methods, 164, 177-190.

McElreath, R. (2018) Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press.

Oostenveld, R., Fries, P., Maris, E. & Schoffelen, J.M. (2011) FieldTrip: Open source software for advanced analysis of MEG, EEG, and invasive electrophysiological data. Comput Intell Neurosci, 2011, 156869.

Pernet, C.R., Chauveau, N., Gaspar, C. & Rousselet, G.A. (2011) LIMO EEG: a toolbox for hierarchical LInear MOdeling of ElectroEncephaloGraphic data. Comput Intell Neurosci, 2011, 831409.

Pernet, C.R., Latinus, M., Nichols, T.E. & Rousselet, G.A. (2015) Cluster-based computational methods for mass univariate analyses of event-related brain potentials/fields: A simulation study. Journal of neuroscience methods, 250, 85-93.

Rousselet, Guillaume (2016): Introduction to robust estimation of ERP data. figshare. Fileset. 

https://doi.org/10.6084/m9.figshare.3501728.v1

Smith, S.M. & Nichols, T.E. (2009) Threshold-free cluster enhancement: addressing problems of smoothing, threshold dependence and localisation in cluster inference. Neuroimage, 44, 83-98.


Power estimation for correlation analyses

Following the previous posts on small n correlations [post 1][post 2][post 3], in this post we’re going to consider power estimation (if you do not care about power, but you’d rather focus on estimation, this post is for you). 

To get started, let’s look at examples of n=1000 samples from bivariate populations with known correlations (rho), with rho increasing from 0.1 to 0.9 in steps of 0.1. For each rho, we draw a random sample and plot Y as a function of X. The variance of Y does not depend on X – there is homoscedasticity. Later we will look at heteroscedasticity, in which the variance of Y varies with X.

demo_homo_dist

For the same distributions illustrated in the previous figure, we compute the proportion of positive Pearson’s correlation tests for different sample sizes. This gives us power curves (here based on simulations with 50,000 samples). We also include rho = 0 to determine the proportion of false positives.

figure_power_homo

Power increases with sample size and with rho. When rho = 0, the proportion of positive tests is the proportion of false positives. It should be around 0.05 for a test with alpha = 0.05. This is the case here, as Pearson’s correlation is well behaved for bivariate normal data.
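Here is a minimal sketch of how such a power simulation could be run for one value of rho and one sample size, assuming bivariate normal (homoscedastic) data; the number of simulations is reduced relative to the post so the example runs quickly.

library(MASS)
set.seed(21)
nsim <- 2000                      # 50,000 in the post
rho <- 0.4; n <- 46
sigma <- matrix(c(1, rho, rho, 1), 2, 2)
pos <- replicate(nsim, {
  xy <- mvrnorm(n, mu = c(0, 0), Sigma = sigma)
  cor.test(xy[, 1], xy[, 2])$p.value < 0.05
})
mean(pos)                         # estimated power, expected to be near 0.8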

For a given expected population correlation and a desired long run power value, we can use interpolation to find out the matching sample size.

To achieve at least 80% power given an expected population rho of 0.4, the minimum sample size is 46 observations.

To achieve at least 90% power given an expected population rho of 0.3, the minimum sample size is 118 observations.

figure_power_homo_arrows

Alternatively, for a given sample size and a desired power, we can determine the minimum effect size we can hope to detect. For instance, given n = 40 and a desired power of at least 90%, the minimum effect size we can detect is 0.49.

So far, we have only considered situations where we sample from bivariate normal distributions. However, Wilcox (2012 p. 444-445) describes 6 aspects of data that affect Pearson’s r:

  • outliers

  • the magnitude of the slope around which points are clustered

  • curvature

  • the magnitude of the residuals

  • restriction of range

  • heteroscedasticity

The effect of outliers on Pearson’s and Spearman’s correlations is described in detail in Pernet et al. (2012) and Rousselet et al. (2012).

Next we focus on heteroscedasticity. Let’s look at Wilcox’s heteroscedasticity example (2012, p. 445). If we correlate variable X with variable Y, heteroscedasticity means that the variance of Y depends on X. Wilcox considers this example:

“X and Y have normal distributions with both means equal to zero. […] X and Y have variance 1 unless |X|>0.5, in which case Y has standard deviation |X|.”

Here is an example of such data:

demo_wilcox_dist
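For readers who want to play with this recipe, here is a minimal sketch of how such data could be generated; the sample size and seed are arbitrary.

set.seed(7)
n <- 1000
x <- rnorm(n)
sd_y <- ifelse(abs(x) > 0.5, abs(x), 1)   # SD of Y depends on X
y <- rnorm(n, mean = 0, sd = sd_y)
plot(x, y)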

Next, Wilcox (2012) considers the effect of this heteroscedastic situation on false positives. We superimpose results for the homoscedastic case for comparison. In the homoscedastic case, as expected for a test with alpha = 0.05, the proportion of false positives is very close to 0.05 at every sample size. In the heteroscedastic case, instead of 5%, the proportion of false positives is between 12% and 19%. The proportion of false positives actually increases with sample size! That’s because the standard T statistic associated with Pearson’s correlation assumes homoscedasticity, so the formula is incorrect when there is heteroscedasticity.

figure_power_hetero_wilcox

As a consequence, when Pearson’s test is positive, it doesn’t always imply the existence of a correlation. There could be dependence due to heteroscedasticity, in the absence of a correlation.

Let’s consider another heteroscedastic situation, in which the variance of Y increases linearly with X. This could correspond for instance to situations in which cognitive performance or income are correlated with age – we might expect the variance amongst participants to increase with age.

We keep rho constant at 0.4 and increase the maximum variance from 1 (homoscedastic case) to 9. That is, the variance of Y increases linearly from 1 to the maximum variance as a function of X.

demo_hetero_dist

For rho = 0, we can compute the proportion of false positives as a function of both sample size and heteroscedasticity. In the next figure, variance refers to the maximum variance. 

figure_power_hetero_rho0

From 0.05 for the homoscedastic case (max variance = 1), the proportion of false positives increases to 0.07-0.08 for a max variance of 9. This relatively small increase could have important consequences if hundreds of labs are engaged in fishing expeditions and publish everything with p<0.05. However, it seems we shouldn’t worry much about linear heteroscedasticity as long as sample sizes are sufficiently large and we report estimates with appropriate confidence intervals. An easy way to build confidence intervals when there is heteroscedasticity is to use the percentile bootstrap (see Pernet et al. 2012 for illustrations and Matlab code).
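As a quick illustration, here is a minimal sketch of a percentile bootstrap confidence interval for Pearson’s r; because pairs of observations are resampled together, the interval does not rely on the homoscedasticity assumption. The data are hypothetical.

set.seed(2)
n <- 50
x <- rnorm(n)
y <- 0.4 * x + rnorm(n, sd = seq(1, 3, length.out = n)[rank(x)])  # noise SD grows with x
nboot <- 2000
boot_r <- replicate(nboot, {
  id <- sample(n, replace = TRUE)                                 # resample pairs
  cor(x[id], y[id])
})
quantile(boot_r, c(0.025, 0.975))                                 # percentile bootstrap CI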

Finally, we can run the same simulation for rho = 0.4. Power progressively decreases with increasing heteroscedasticity. Put another way, with larger heteroscedasticity, larger sample sizes are needed to achieve the same power.

figure_power_hetero_rho04

We can zoom in:

figure_power_hetero_rho04_zoom

The vertical bars mark an increase of roughly 13 observations needed to keep power at 0.8 between a max variance of 1 and a max variance of 9. This decrease in power can be avoided by using the percentile bootstrap or robust correlation techniques, or both (Wilcox, 2012).

Conclusion

The results presented in this post are based on simulations. You could also use a sample size calculator for correlation analyses – for instance this one.

But running simulations has huge advantages. For instance, you can compare multiple estimators of association in various situations. In a simulation, you can also include as much information as you have about your target populations. For instance, if you want to correlate brain measurements with response times, there might be large datasets you could use to perform data-driven simulations (e.g. UK biobank), or you could estimate the shape of the sampling distributions to draw samples from appropriate theoretical distributions (maybe a gamma distribution for brain measurements and an exGaussian distribution for response times).

Simulations also put you in charge, instead of relying on a black box, which most likely will only cover Pearson’s correlation in ideal conditions, and not robust alternatives when there are outliers or heteroscedasticity or other potential issues.

The R code to reproduce the simulations and the figures is on GitHub.

References

Pernet, C.R., Wilcox, R. & Rousselet, G.A. (2012) Robust correlation analyses: false positive and power validation using a new open source matlab toolbox. Front Psychol, 3, 606.

Rousselet, G.A. & Pernet, C.R. (2012) Improving standards in brain-behavior correlation analyses. Frontiers in human neuroscience, 6, 119.

Wilcox, R.R. (2012) Introduction to robust estimation and hypothesis testing. Academic Press, San Diego, CA.

Correlations in neuroscience: are small n, interaction fallacies, lack of illustrations and confidence intervals the norm?

As reviewer, editor and reader of research articles, I’m regularly annoyed by the low standards in correlation analyses. In my experience with such articles, typically:

  • Pearson’s correlation, a non-robust measure of association, is used;
  • R and p values are reported, but not confidence intervals;
  • sample sizes tend to be small, leading to large estimation bias and inflated effect sizes in the literature;
  • R values and confidence intervals are not considered when interpreting the results;
  • instead, most analyses are reported as significant or non-significant (p<0.05), leading to the conclusion that an association exists or not (frequentist fallacy);
  • often figures illustrating the correlations are absent;
  • the explicit or implicit comparison of two correlations is done without a formal test (interaction fallacy).

To find out if my experience was in fact representative of the typical paper, I had a look at all papers published in 2017 in the European Journal of Neuroscience, where I’m a section editor. I care about the quality of the research published in EJN, so this is not an attempt at blaming a journal in particular, rather it’s a starting point to address a general problem. I really hope the results presented below will serve as a wake-up call for all involved and will lead to improvements in correlation analyses. Also, I bet if you look systematically at articles published in other neuroscience journals you’ll find the same problems. If you’re not convinced, go ahead, prove me wrong 😉 

I proceeded like this: for all 2017 articles (volumes 45 and 46), I searched for “correl” and I scanned for figures of scatterplots. If both searches were negative, the article was categorised as not containing a correlation analysis, so I might have missed a few. When at least one correlation was present, I looked for these details:

  • n
  • estimator
  • confidence interval
  • R
  • p value
  • consideration of effect sizes
  • figure illustrating positive result
  • figure illustrating negative result
  • interaction test.

164 articles reported no correlation.

7 articles used regression analyses, with sample sizes as low as n=6, n=10, n=12 in 3 articles.

48 articles reported correlations.

Sample size

The norm was to not report degrees of freedom or sample size along with the correlation analyses or their illustrations. In 7 articles, the sample sizes were very difficult or impossible to guess. In the others, sample sizes varied a lot, both within and between articles. To confirm sample sizes, I counted the observations in scatterplots when they were available and not too crowded – this was a tedious job and I probably got some estimations and checks wrong. Anyway, I shouldn’t have to do all these checks, so something went wrong during the reviewing process. 

To simplify the presentation of the results, I collapsed the sample size estimates across articles. Here is the distribution: 

figure_ejn_sample_sizes

The figure omits 3 outliers with n = 836, 1397, 1407, all from the same article.

The median sample size is 18, which is far too low to provide sufficiently precise estimation.

Estimator

The issue with low sample sizes is made worse by the predominant use of Pearson’s correlation or the lack of consideration for the type of estimator. Indeed, 21 articles did not mention the estimator used at all, but presumably they used Pearson’s correlation.

Among the 27 articles that did mention which estimator was used:

  • 11 used only Pearson’s correlation;
  • 11 used only Spearman’s correlation;
  • 4 used Pearson’s and Spearman’s correlations;
  • 1 used Spearman’s and Kendall’s correlations.

So the majority of studies used an estimator that is well-known for its lack of robustness and its inaccurate confidence intervals and p values (Pernet, Wilcox & Rousselet, 2012).

R & p values

Most articles reported R and p values. Only 2 articles did not report R values. The same 2 articles also omitted p values, simply mentioning that the correlations were not significant. Another 3 articles did not report p values along with the R values.

Confidence interval

Only 3 articles reported confidence intervals, without mentioning how they were computed. 1 article reported percentile bootstrap confidence intervals for Pearson’s correlations, which is the recommended procedure for this estimator (Pernet, Wilcox & Rousselet, 2012).

Consideration for effect sizes

Given the lack of interest in measurement uncertainty, demonstrated by the absence of confidence intervals in most articles, it is not surprising that only 5 articles mentioned the size of the correlation when presenting the results. All other articles simply reported the correlations as significant or not.

Illustrations

In contrast with the absence of confidence intervals and consideration for effect sizes, 23 articles reported illustrations for positive results. 4 articles reported only negative results, which leaves us with 21 articles that failed to illustrate the correlation results. 

Among the 40 articles that reported negative results, only 13 illustrated them, which suggests a strong bias towards positive results.

Interaction test

Finally, I looked for interaction fallacies (Nieuwenhuis, Forstmann & Wagenmakers 2011). In the context of correlation analyses, you commit an interaction fallacy when you present two correlations, one significant, the other not, implying that the 2 differ, but without explicitly testing the interaction. In other versions of the interaction fallacy, two significant correlations with the same sign are presented together, implying either that the 2 are similar, or that one is stronger than the other, without providing a confidence interval for the correlation difference. You can easily guess the other flavours… 

10 articles presented only one correlation, so there was no scope for the interaction fallacy. Among the 38 articles that presented more than one correlation, only one provided an explicit test for the comparison of 2 correlations. However, the authors omitted the explicit test for their next comparison!
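For reference, an explicit comparison of two independent correlations can be as simple as a percentile bootstrap of the difference. The sketch below uses hypothetical data and is only meant to illustrate the idea; dedicated robust implementations exist for both independent and dependent cases.

library(MASS)
set.seed(3)
n1 <- 40; n2 <- 40
g1 <- mvrnorm(n1, c(0, 0), matrix(c(1, 0.5, 0.5, 1), 2))   # group 1: rho = 0.5
g2 <- mvrnorm(n2, c(0, 0), matrix(c(1, 0.1, 0.1, 1), 2))   # group 2: rho = 0.1
nboot <- 2000
diff_boot <- replicate(nboot, {
  b1 <- g1[sample(n1, replace = TRUE), ]
  b2 <- g2[sample(n2, replace = TRUE), ]
  cor(b1[, 1], b1[, 2]) - cor(b2[, 1], b2[, 2])
})
quantile(diff_boot, c(0.025, 0.975))   # CI for the difference between the two correlations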

Recommendations

In conclusion, at least in 2017 EJN articles, the norm is to estimate associations using small sample sizes and a non-robust estimator, to not provide confidence intervals and to not consider effect sizes and measurement uncertainty when presenting the results. Also, positive results are more likely to be illustrated than negative ones. Finally, interaction fallacies are mainstream.

How can we do a better job?

If you want to do a correlation analysis, consider your sample size carefully to assess statistical power and even better, your long-term estimation precision. If you have a small n, I wouldn’t even look at the correlation. 

Do not use Pearson’s correlation unless you have well-behaved and large samples, and you are only interested in linear relationships; otherwise explore robust measures of associations and techniques that provide valid confidence intervals (Pernet, Wilcox & Rousselet, 2012; Wilcox & Rousselet, 2018).

Reporting

These details are essential in articles reporting correlation analyses:

  • sample size for each correlation;
  • estimator of association;
  • R value;
  • confidence interval;
  • scatterplot illustration of every correlation, irrespective of the p value;
  • explicit comparison test of all correlations explicitly or implicitly compared;
  • consideration of effect sizes (R values) and their uncertainty (confidence intervals) in the interpretation of the results.

Report p values if you want, but they are not essential and should not be given a special status (McShane et al. 2018).

Finally, are you sure you really want to compute a correlation?

“Why then are correlation coefficients so attractive? Only bad reasons seem to come to mind. Worst of all, probably, is the absence of any need to think about units for either variable. Given two perfectly meaningless variables, one is reminded of their meaninglessness when a regression coefficient is given, since one wonders how to interpret its value. A correlation coefficient is less likely to bring up the unpleasant truth—we think we know what r = -.7 means. Do we? How often? Sweeping things under the rug is the enemy of good data analysis. Often, using the correlation coefficient is “sweeping under the rug” with a vengeance. Being so disinterested in our variables that we do not care about their units can hardly be desirable.”
Analyzing data: Sanctification or detective work?

John W. Tukey.
 American Psychologist, Vol 24(2), Feb 1969, 83-91. http://dx.doi.org/10.1037/h0027108

 

References

McShane, B.B., Gal, D., Gelman, A., Robert, C. & Tackett, J.L. (2018) Abandon Statistical Significance. arxiv.

Nieuwenhuis, S., Forstmann, B.U. & Wagenmakers, E.J. (2011) Erroneous analyses of interactions in neuroscience: a problem of significance. Nat Neurosci, 14, 1105-1107.

Pernet, C.R., Wilcox, R. & Rousselet, G.A. (2012) Robust correlation analyses: false positive and power validation using a new open source matlab toolbox. Front Psychol, 3, 606.

Rousselet, G.A. & Pernet, C.R. (2012) Improving standards in brain-behavior correlation analyses. Frontiers in human neuroscience, 6, 119.

Wilcox, R.R. & Rousselet, G.A. (2018) A Guide to Robust Statistical Methods in Neuroscience. Curr Protoc Neurosci, 82, 8.42.1-8.42.30.

[preprint]

A new shift function for dependent groups?

UPDATE (2018-05-17): the method suggested here is completely bogus. I’ve edited the post to explain why. To make inferences about difference scores, use the difference asymmetry function or make inferences about the quantiles of the differences (Rousselet, Pernet & Wilcox, 2017).


The shift function is a graphical and inferential method that allows users to quantify how two distributions differ. It is a frequentist tool that also comes in several Bayesian flavours, and can be applied to independent and dependent groups. The version for dependent groups uses differences between the quantiles of each group. However, for paired observations, it would also be useful to assess the quantiles of the pairwise differences. This is what this new shift function was supposed to do.

Let’s consider the fictive reaction time data below, generated using exGaussian distributions (n = 100 participants).

figure_kde

The kernel density estimates suggest interesting differences: condition 1 is overall more spread out than condition 2; as a result, the two distributions differ in both the left (fast participants) and right (slow participants) tails. However, this plot does not reveal the pairwise nature of the observations. This is better illustrated using a scatterplot.

figure_scatter

The scatterplot reveals more clearly the relationship between conditions:
– fast participants, shown in dark blue on the left, tended to be a bit faster in condition 1 than in condition 2;
– slower participants, shown in yellow on the right, tended to be slower in condition 1 than in condition 2;
– this effect seems to be more prominent for participants with responses larger than about 500 ms, with a trend for larger differences with increasing average response latencies.

A shift function can help assess and quantify this pattern. In the shift function below, the x axis shows the deciles in condition 1. The y axis shows the differences between deciles from the two conditions. The difference is reported in the coloured label. The vertical lines show the 95% percentile bootstrap confidence intervals. As we travel from left to right along the x axis, we consider progressively slower participants in condition 1. These slower responses in condition 1 are associated with progressively faster responses in condition 2 (the difference condition 1 – condition 2 increases).

figure_sf_dhd

So here the inferences are made on differences between quantiles of the marginal distributions: for each distribution, we compute quantiles, and then subtract the quantiles.

What if we want to make inferences on the pairwise differences instead? This can be done by computing the quantiles of the differences, and plotting them as a function of the quantiles in one group. A small change in the code gives us a new shift function for dependent groups.
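Here is a minimal sketch of the two quantities, using plain quantile() instead of the Harrell-Davis estimator used in the rogme package, with hypothetical paired data:

set.seed(1)
n <- 100
cond1 <- rnorm(n, 500, 100)
cond2 <- cond1 - rnorm(n, 20, 40)        # paired observations
probs <- seq(0.1, 0.9, 0.1)              # deciles
sf_marginal <- quantile(cond1, probs) - quantile(cond2, probs)  # original shift function
sf_pairwise <- quantile(cond1 - cond2, probs)                   # quantiles of the pairwise differences
rbind(sf_marginal, sf_pairwise)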

figure_sf_pdhd

The two versions look very similar, which is re-assuring, but does not demonstrate anything (except confirmation bias and wishful thinking on my part). But there might be situations where the two versions differ. Also, the second version makes explicit inferences about the pairwise differences, not about the differences between marginal distributions: so despite the similarities, they afford different conclusions.

Let’s look at the critical example that I should have considered before getting all excited and blogging about the “new method”. A simple negative control demonstrates what is wrong with the approach. Here are two dependent distributions, with a clear shift between the marginals.

figure_kde2

The pairwise relationships are better illustrated using a scatterplot, which shows a seemingly uniform shift between conditions.

figure_scatter2_1

Plotting the pairwise differences as a function of observations in condition 1 confirms the pattern: the differences don’t seem to vary much with the results in condition 1. In other words, differences don’t seem to be particularly larger or smaller for low results in condition 1 relative to high results.

figure_scatter2_2

The shift function on marginals does a great job at capturing the differences, showing a pattern characteristic of stochastic dominance (Speckman, Rouder, Morey & Pratte, 2008): one condition (condition 2) dominates the other at every decile. The differences also appear to be a bit larger for higher than lower deciles in condition 1.

figure_sf_dhd2

The modified shift function, shown next, makes no sense. That’s because the deciles of condition 1 and the deciles of the difference scores necessarily increase from 1 to 9, so plotting one as a function of the other ALWAYS gives a positive slope. The same positive slope I thought was capturing a pattern of regression to the mean! So I fooled myself because I was so eager to find a technique to quantify regression to the mean, and I only used examples that confirmed my expectations (confirmation bias)! This totally blinded me to what in retrospect is a very silly mistake.

figure_sf_pdhd2

Finally, let’s go back to the pattern observed in the previous shift function, where it seemed that the difference scores were increasing from low to high quantiles of condition 1. The presence of this pattern can better be tested using a technique that makes inferences about pairwise differences. One such technique is the difference asymmetry function. The idea from Wilcox (2012, Wilcox & Erceg-Hurn, 2012) goes like this: if two distributions are identical, then the difference scores should be symmetrically distributed around zero. To test for asymmetry, we can estimate sums of lower and higher quantiles; for instance, the sum of quantile 0.05 and quantile 0.95, 0.10 + 0.90, 0.15 + 0.85… For symmetric distributions with a median of zero, these sums should be close to zero, leading to a flat function centred at zero. If for instance the negative differences tend to be larger than the positive differences, the function will start with negative sums and will increase progressively towards zero (see example in Rousselet, Pernet & Wilcox, 2017). In our example, the difference asymmetry function is negative and flat, which is characteristic of a uniform shift, without much evidence for an asymmetry. Which is good, because that’s how the fake data were generated! So using graphical representations such as scatterplots, in conjunction with the shift function and the difference asymmetry function, can provide a very detailed and informative account of how two distributions differ.

figure_daf2
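Here is a minimal sketch of the quantile-sum idea behind the difference asymmetry function, using plain quantile() instead of the Harrell-Davis estimator used in rogme, and hypothetical difference scores corresponding to a uniform shift:

set.seed(8)
diff_scores <- rnorm(100, mean = -20, sd = 30)   # symmetric around -20
probs <- seq(0.05, 0.40, 0.05)
qsum <- quantile(diff_scores, probs) + quantile(diff_scores, 1 - probs)
round(qsum, 1)   # roughly constant and negative: uniform shift, no marked asymmetry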

Conclusion

I got very excited by the new approach because after spending several days thinking about test-retest reliability assessment from a graphical perspective, I thought I had found the perfect tool, as explained in the next post. So the ingredients of my mistake are clear: statistical sloppiness and confirmation bias.

The code for the figures in this post and for the new bogus shift function is available on github. I will not update the rogme package, which implements the otherwise perfectly valid shift functions and difference asymmetry functions.

References

Speckman, P.L., Rouder, J.N., Morey, R.D. & Pratte, M.S. (2008) Delta plots and coherent distribution ordering. Am Stat, 62, 262-266.

Rousselet, G.A., Pernet, C.R. & Wilcox, R.R. (2017) Beyond differences in means: robust graphical methods to compare two groups in neuroscience. The European journal of neuroscience, 46, 1738-1748. [preprint] [reproducibility package]

Wilcox, R.R. (2012) Comparing Two Independent Groups Via a Quantile Generalization of the Wilcoxon-Mann-Whitney Test. Journal of Modern Applied Statistical Methods, 11, 296-302.

Wilcox, R.R. & Erceg-Hurn, D.M. (2012) Comparing two dependent groups via quantiles. J Appl Stat, 39, 2655-2664.

Reaction times and other skewed distributions: problems with the mean and the median (part 4/4)

This is part 4 of a 4 part series. Part 1 is here.

In this post, I look at median bias in a large dataset of reaction times from participants engaged in a lexical decision task. The dataset was described in a previous post.

After removing a few participants who didn’t pay attention to the task (low accuracy or too many very late responses), we’re left with 959 participants to play with. Each participant had between 996 and 1001 trials for each of two conditions, Word and Non-Word.

Here is an illustration of reaction time distributions from 100 randomly sampled participants in the Word condition:

figure_flp_w_100_kde

Same in the Non-Word condition:

figure_flp_nw_100_kde

Skewness tended to be larger in the Word than the Non-Word condition. Based on the standard parametric definition of skewness, that was the case in 80% of participants. If we use a non-parametric estimate instead (mean – median), it was the case in 70% of participants.

If we save the median of every individual distribution, we get the two following group distributions, which display positive skewness:

figure_flp_all_p_median

The same applies to distributions of means:

figure_flp_all_p_mean

So we have to worry about skewness at 2 levels:

  • individual distributions

  • group distributions

Here I’m only going to explore estimation bias as a result of skewness and sample size in individual distributions. From what we learnt in previous posts, we can already make predictions: because skewness tended to be stronger in the Word than in the Non-Word condition, the bias of the median will be stronger in the former than in the latter for small sample sizes. That is, the median in the Word condition will tend to be more over-estimated than the median in the Non-Word condition. As a consequence, the difference between the median of the Non-Word condition (larger RT) and the median of the Word condition (smaller RT) will tend to be under-estimated. To check this prediction, I estimated bias in every participant using a simulation with 2,000 iterations. I assumed that the full sample was the population, from which we can compute population means and population medians. Because the Non-Word condition is the least skewed, I used it as the reference condition, which always had 200 trials. The Word condition had 10 to 200 trials, in increments of 10. In the simulation, single RTs were sampled with replacement among the roughly 1,000 trials available per condition and participant, so that each iteration is equivalent to a fake experiment.
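Here is a sketch of that simulation for a single participant, with hypothetical ex-Gaussian reaction times standing in for the real Word and Non-Word trials (parameters chosen so that the Word condition is the more skewed one):

set.seed(5)
rexgauss <- function(n, mu, sigma, tau) rnorm(n, mu, sigma) + rexp(n, 1/tau)
word    <- rexgauss(1000, 300, 20, 300)    # more skewed condition
nonword <- rexgauss(1000, 500, 20, 150)    # less skewed, slower condition
pop_diff <- median(nonword) - median(word) # full sample treated as the population

nsim <- 2000
nt <- 10                                   # number of Word trials per fake experiment
sim_diff <- replicate(nsim, {
  median(sample(nonword, 200, replace = TRUE)) - median(sample(word, nt, replace = TRUE))
})
mean(sim_diff) - pop_diff                  # bias of the median difference, negative for small nt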

Let’s look at the results for the median. The figure below shows the bias in the long run estimation of the difference between medians (Non-Word – Word), as a function of sample size in the Word condition. The Non-Word condition always had 200 trials. All participants are superimposed and shown as coloured traces. The average across participants is shown as a thicker black line. 

figure_flp_bias_diff_md

As expected, bias tended to be negative with small sample sizes. For the smallest sample size, the average bias was -11 ms. That’s probably substantial enough to seriously distort estimation in some experiments. Also, variability is high, with an 80% highest density interval of [-17.1, -2.6] ms. Bias decreases rapidly with increasing sample size. For n=60, it is only 1 ms.

But inter-participant variability remains high, so we should be cautious interpreting results with large numbers of trials but few participants. To quantify the group uncertainty, we could measure the probability of being wrong, given a level of desired precision, as demonstrated here for instance.

After bootstrap bias correction (with 200 bootstrap resamples), the average bias drops to roughly zero for all sample sizes:

figure_flp_bias_diff_md_bc

Bias correction also reduced inter-participant variability. 

As we saw in the previous post, the sampling distribution of the median is skewed, so the standard measure of bias (taking the mean across simulation iterations) does not provide a good indication of the bias we can expect in a typical experiment. If instead of the mean, we compute the median bias, we get the following results:

figure_flp_mdbias_diff_md

Now, at the smallest sample size, the median bias (averaged across participants) is only -2 ms, and it drops to near zero for n=20. This result is consistent with the simulations reported in the previous post and confirms that in the typical experiment, the bias associated with the median is negligible.

What happens with the mean?

figure_flp_bias_diff_m

The average bias of the mean is near zero for all sample sizes. Individual bias values are also much less variable than median values. This difference in bias variability does not reflect a difference in variability among participants for the two estimators of central tendency. In fact, the distributions of differences between Non-Word and Word conditions are very similar for the mean and the median. 

figure_flp_all_p_diff

Estimates of spread are also similar between distributions:

IQR: mean RT = 78; median RT = 79

MAD: mean RT = 57; median RT = 54

VAR: mean RT = 4507; median RT = 4785

This suggests that the inter-participant bias differences are due to the shape differences observed in the first two figures of this post. 

Finally, let’s consider the median bias of the mean.

figure_flp_mdbias_diff_m

For the smallest sample size, the average bias across participants is 7 ms. This positive bias can be explained easily from the simulation results of post 3: because of the larger skewness in the Word condition, the sampling distribution of the mean was more positively skewed for small samples in that condition compared to the Non-Word condition, with the bulk of the bias estimates being negative. As a result, the mean tended to be more under-estimated in the Word condition, leading to larger Non-Word – Word differences in the typical experiment. 

I have done a lot more simulations and was planning even more, using other datasets, but it’s time to move on! Of particular note, it appears that in difficult visual search tasks, skewness can differ dramatically among set size conditions – see for instance data posted here.

Concluding remarks

The data-driven simulations presented here confirm results from our previous simulations:

  • if we use the standard definition of bias, for small sample sizes, mean estimates are not biased, median estimates are biased;
  • however, in the typical experiment (median bias), mean estimates can be more biased than median estimates;

  • bootstrap bias correction can be an effective tool to reduce bias.

Given the large differences in inter-participant variability between the mean and the median, an important question is how to spend your money: more trials or more participants (Rouder & Haaf 2018)? An answer can be obtained by running simulations, either data-driven or assuming generative distributions (for instance exGaussian distributions for RT data). Simulations that take skewness into account are important to estimate bias and power. Assuming normality can have disastrous consequences.

Despite the potential larger bias and bias variability of the median compared to the mean, for skewed distributions I would still use the median as a measure of central tendency, because it provides a more informative description of the typical observations. Large sample sizes will reduce both bias and estimation variability, such that high-precision single-participant estimation should be easy to obtain in many situations involving non-clinical samples. For group estimations, much larger samples than commonly used are probably required to improve the precision of our inferences.

Although the bootstrap bias correction seems to work very well in the long run, for a single experiment there is no guarantee it will get you closer to the truth. One possibility is to report results with and without bias correction. 

For group inferences on the median, traditional techniques use incorrect estimations of the standard error, so consider modern parametric or non-parametric techniques instead (Wilcox & Rousselet, 2018). 

References

Miller, J. (1988) A warning about median reaction time. J Exp Psychol Hum Percept Perform, 14, 539-543.

Rouder, J.N. & Haaf, J.M. (2018) Power, Dominance, and Constraint: A Note on the Appeal of Different Design Traditions. Advances in Methods and Practices in Psychological Science, 1, 19-26.

Wilcox, R.R. & Rousselet, G.A. (2018) A Guide to Robust Statistical Methods in Neuroscience. Curr Protoc Neurosci, 82, 8.42.1-8.42.30.

Reaction times and other skewed distributions: problems with the mean and the median (part 2/4)

As we saw in the previous post, the sample median is biased when sampling from skewed distributions. The bias increases with decreasing sample size. According to Miller (1988), because of this bias, group comparison can be affected if the two groups differ in skewness or sample size, or both. As a result, real differences can be lowered or increased, and non-existent differences suggested. In Miller’s own words:

“An important practical consequence of the bias in median reaction time is that sample medians must not be used to compare reaction times across experimental conditions when there are unequal numbers of trials in the conditions.”

Here we will see that this advice is wrong for two reasons.

We assess the problem using a simulation in which we draw samples of same or different sizes from populations that differ in skewness or not. We use the same 12 distributions used by Miller (1988), as described previously.

Group 2 has size 200 and is sampled from the distribution with the least skewness.

Group 1 has size 10 to 200, in increments of 10, and is sampled from the 12 distributions.

After 10,000 iterations, here are the results for the mean:

figure_bias_diff_m

All the bias values are near zero, as expected. The shaded area shows the upper part of the mean’s bias 50% HDI (highest density interval), when group 1 and group 2 have the same, least, skewness (6). This interval shows the location of the bulk of the 10,000 simulations. The same area is shown in the next figures. Again, this is a reminder that bias is defined in the long run: a single experiment (one of our simulation iterations) can be far off the population value, especially with small sample sizes.

Here are the results for the median:

figure_bias_diff_md

As in the previous figure, the shaded area shows the upper part of the mean’s bias 50% HDI, when group 1 and group 2 have the same skewness. Bias increases with skewness and sample size difference and is particularly large for n = 10. However, if the two groups have the same skewness (skewness 6), there is almost no bias even when group 2 has 200 observations and group 1 only has 10. So it seems Miller’s (1988) warning about differences in sample sizes was inappropriate, because the main factor causing bias is skewness.

Next, let’s find out if we can correct the bias. Bias correction is performed in 2 ways:

  • using the bootstrap

  • using subsamples, following Miller’s suggestion.

Miller (1988) suggested:

“Although it is computationally quite tedious, there is a way to use medians to reduce the effects of outliers without introducing a bias dependent on sample size. One uses the regular median from Condition F and compares it with a special “average median” (Am) from Condition M. To compute Am, one would take from Condition M all the possible subsamples of Size f where f is the number of trials in Condition F. For each subsample one computes the subsample median. Then, Am is the average, across all possible subsamples, of the subsample medians. This procedure does not introduce bias, because all medians are computed on the basis of the same sample (subsample) size.”

Using all possible subsamples would take far too long. For instance, if one group has 5 observations and the other group has 20 observations, there are 15504 (choose(20,5)) subsamples to consider. Slightly larger sample sizes would force us to consider millions of subsamples. So instead we compute K random subsamples. I arbitrarily set K to 1,000. Although this is not what Miller (1988) suggested, the K loop shortcut should reduce bias to some extent if it is due to sample size differences. Here are the results:

figure_bias_diff_md_sub

The K loop approach has only a small effect on bias. The reason is simple: the main cause of the bias is not a difference in sample size, it is a difference in skewness. This skewness difference can be handled by the bootstrap. Here is what we get using 200 bootstrap resamples for each simulation iteration:

figure_bias_diff_md_bbc

The bootstrap bias correction works very well. The median’s bias is still a bit larger than the mean’s bias for n = 10 in some conditions. For instance, for the most skewed distribution and n = 10, the mean’s bias is -0.51, whereas the median’s bias after bias correction is 0.73. None of them are exactly zero and the absolute values are very similar. At most, for n = 10, the median’s maximum bias across distributions is 1.85 ms, whereas the mean’s is 0.7 ms.
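For reference, here is a minimal sketch of the bootstrap bias correction applied to the median of one small, skewed sample: the estimated bias (mean of the bootstrap medians minus the sample median) is subtracted from the sample median. The ex-Gaussian parameters are those of the most skewed of the 12 distributions, assuming they are given as (mu, sigma, tau).

set.seed(9)
rexgauss <- function(n, mu, sigma, tau) rnorm(n, mu, sigma) + rexp(n, 1/tau)
x <- rexgauss(10, 300, 20, 300)          # n = 10 from a very skewed distribution
md <- median(x)
nboot <- 200
boot_md <- replicate(nboot, median(sample(x, replace = TRUE)))
md_bc <- 2 * md - mean(boot_md)          # bias-corrected median
c(raw = md, corrected = md_bc)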

Conclusion

Miller’s (1988) advice was inappropriate because, when comparing two groups, bias originates from a conjunction of differences in skewness and in sample size, not from differences in sample size alone. That said, in practice it might well be that skewed distributions from different conditions/groups tend to differ in skewness, in which case unequal sample sizes will exacerbate the bias. To be cautious, when sample size is relatively small, it would be useful to report median effects with and without bootstrap bias correction.

Finally, data & code are available on github.

[GO TO POST 3/4]

Reaction times and other skewed distributions: problems with the mean and the median (part 1/4)


In this series of 4 posts, I replicate, expand and discuss the results from

Miller, J. (1988) A warning about median reaction time. J Exp Psychol Hum Percept Perform, 14, 539-543.

Part 1 = replicate Miller’s simulations + apply bootstrap bias correction

Part 2 = expand Miller’s simulations to group comparison

Part 3 = problems with the mean

Part 4 = application to a large dataset

Data & code are available on github. The content of the 4 posts is also described in this article.


Reaction times (RT) and many other quantities in neuroscience & psychology are skewed. This asymmetry tends to differ among experimental conditions, such that a measure of central tendency and a measure of spread are insufficient to capture how conditions differ. Instead, to understand the potentially rich differences among distributions, it is advised to consider multiple quantiles of the distributions (Doksum 1974; Pratte et al. 2010; Rousselet et al. 2017), or to model the shapes of the distributions (Heathcote et al. 1991; Rouder et al. 2005; Palmer et al. 2011; Matzke et al. 2013). Yet, it is still common practice to summarise reaction time distributions using a single number, most often the mean: that one value for each participant and each condition can then be entered into a group ANOVA to make statistical inferences. Because of the skewness of reaction times, the mean is however a poor measure of central tendency: skewness shifts the mean away from the bulk of the distribution, an effect that can be amplified by the presence of outliers or a thick right tail. For instance, in the figure below, the median better represents the typical observation than the mean because it is closer to the bulky part of the distribution.

figure_skew92_m_md

Mean and median for a very right skewed distribution. The distribution is bounded to the left and has a long right tail. This is an ex-Gaussian distribution, which is popular to model the shape of reaction time distributions, even though the matching between its parameters and mental processes can be debated (Sternberg, 2014).

So the median appears to be a better choice than the mean if the goal is to have a single value to tell us about the location of most observations in a skewed distribution. The choice between the mean and the median is however more complicated. It could be argued that because the mean is sensitive to skewness, outliers and the thickness of the right tail, it is better able to capture changes in the shapes of the distributions among conditions. But the use of a single value to capture shape differences will lead to intractable analyses because the same mean could correspond to various shapes. Instead, a multiple quantile approach or explicit shape modelling should be used.

The mean and the median differ in another important aspect: for small sample sizes, the sample mean is unbiased, whereas the sample median is biased (see illustrations in previous post). Concretely, if we perform many times the same RT experiment, and for each experiment we compute the mean and the median, the average mean will be very close to the population mean. As the number of experiments increases, the average sample mean will converge to the exact population mean. This is not the case for the median when sample size is small.

The reason for this bias is explained by Miller (1988):

“Like all sample statistics, sample medians vary randomly (from sample to sample) around the true population median, with more random variation when the sample size is smaller. Because medians are determined by ranks rather than magnitudes of scores, the population percentiles of sample medians vary symmetrically around the desired value of 50%. For example, a sample median is just as likely to be the score at the 40th percentile in the population as the score at the 60th percentile. If the original distribution is positively skewed, this symmetry implies that the distribution of sample medians will also be positively skewed. Specifically, unusually large sample medians (e.g., 60th percentile) will be farther above the population median than unusually small sample medians (e.g., 40th percentile) will be below it. The average of all possible sample medians, then, will be larger than the true median, because sample medians less than the true value will not be small enough to balance out the sample medians greater than the true value. Naturally, the more the distribution is skewed, the greater will be the bias in the sample median.”

To illustrate the sample median’s bias, Miller (1988) employed 12 ex-Gaussian distributions that differ in skewness. The distributions are illustrated in the next figure, and colour-coded using the difference between the mean and the median as a non-parametric measure of skewness.

figure_miller_distributions

Here are the parameters of the 12 ex-Gaussian distributions, along with the associated population values.


table_pop_param

The first figure in this post used the most skewed distribution of the 12, with parameters (300, 20, 300).

To estimate bias, we run a simulation with 10,000 iterations. In each iteration, we draw random samples of sizes 4, 6, 8, 10, 15, 20, 25, 35, 50 and 100 from each of the 12 distributions, as did Miller. For each random sample, we compute the mean and the median. For each sample size and set of ex-Gaussian parameters, the bias is then defined as the difference between the mean of the 10,000 sample estimates and the population value.
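For readers who want to reproduce the gist of this simulation, here is a minimal sketch in Python (assuming NumPy). It covers only one set of ex-Gaussian parameters, the most skewed distribution; the full simulation loops over all 12 parameter sets, and the seed is arbitrary.

```python
# Sketch of the bias simulation for one ex-Gaussian distribution.
import numpy as np

rng = np.random.default_rng(21)

def exgauss(n, mu, sigma, tau):
    return rng.normal(mu, sigma, n) + rng.exponential(tau, n)

mu, sigma, tau = 300, 20, 300               # most skewed distribution
n_sim = 10_000                              # simulation iterations
sample_sizes = [4, 6, 8, 10, 15, 20, 25, 35, 50, 100]

# Population values: the ex-Gaussian mean is mu + tau; the median has no
# simple closed form, so we approximate it from a very large sample.
pop_mean = mu + tau
pop_median = np.median(exgauss(5_000_000, mu, sigma, tau))

for n in sample_sizes:
    samples = exgauss(n_sim * n, mu, sigma, tau).reshape(n_sim, n)
    bias_mean = np.mean(samples.mean(axis=1)) - pop_mean
    bias_median = np.mean(np.median(samples, axis=1)) - pop_median
    print(f"n = {n:3d}: mean bias = {bias_mean:6.2f}, median bias = {bias_median:6.2f}")
```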

First, we check that the mean is not biased:

figure_miller_bias_m

Each line shows the results for one type of ex-Gaussian distribution: the mean of 10,000 simulations for different sample sizes. The grey area marks the 50% highest-density interval (HDI) of the 10,000 simulations for the least skewed distribution (the same interval is shown in the next two figures for comparison). The interval shows the variability across simulations and highlights an important aspect of the results: bias is a long-run property of an estimator; there is no guarantee that one value from a single experiment will be close to the population value. Also, the variability among samples increases with decreasing sample size, which is why results across small n experiments can differ substantially.
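In case the 50% highest-density interval is unfamiliar: one common definition is the shortest interval containing a given proportion of the values. Here is a hypothetical helper (assuming NumPy) along those lines; it is a sketch of the idea, not necessarily the exact computation used for the figure.

```python
# Hypothetical helper: shortest interval containing `prob` of the estimates
# (one common definition of a highest-density interval for unimodal data).
import numpy as np

def hdi(values, prob=0.5):
    values = np.sort(np.asarray(values))
    n_in = int(np.ceil(prob * values.size))           # points inside the interval
    widths = values[n_in - 1:] - values[:values.size - n_in + 1]
    start = int(np.argmin(widths))                     # start of the narrowest window
    return values[start], values[start + n_in - 1]
```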

Here are the median bias estimates in table format:


table_md_bias

Columns = sample sizes; rows = skewness


The values are very close to the values reported in Miller (1988):

miller1988_table1

An illustration is easier to grasp:

figure_miller_bias_md

As reported by Miller (1988), bias can be quite large and it gets worse with decreasing sample sizes and increasing skewness.

Based on these results, Miller made this recommendation:

“An important practical consequence of the bias in median reaction time is that sample medians must not be used to compare reaction times across experimental conditions when there are unequal numbers of trials in the conditions.”

According to Google Scholar, Miller (1988) has been cited 172 times. A look at some of the oldest and most recent citations reveals that his advice has been followed.

For instance, Lavie (1995) noted:

“In the following experiments I used the mean RTs for each participant rather than the medians, as the increase in number of go/no-go errors under the high-load conditions resulted in a different number of responses in between conditions (see Miller, 1988).”

Tipper et al. (1992):

“Analysis by Miller (1988) shows that for large numbers of trials, differences in the numbers between conditions […] has no impact on the medians obtained.”

More recently, Robinson et al. (2018):

“[…] comparing medians among conditions with an unequal number of trials may lead to false positives (Miller, 1988)”

Du et al. (2017):

“We chose the mean rather than median of RT as […] the sample median may provide a biased estimation of RT (Miller, 1988)”

The list goes on. Also, in a review paper, Whelan (2008), cited 324 times, reiterates the advice:

“The median should never be used on RT data to compare conditions with different numbers of trials.”

However, there are several problems with Miller’s advice. In particular, using the mean leads to many issues with estimation and statistical inferences, as we will see in the 3rd post. In this post, we tackle one key omission from Miller’s assessment: the bias of the sample median can be corrected, using a percentile bootstrap bias correction, as described in this previous post.
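To give a sense of what such a correction can look like, here is a minimal sketch (in Python, assuming NumPy) of a standard bootstrap bias correction of the median: the bias is estimated as the mean of the bootstrap medians minus the sample median, and then subtracted from the sample median. See the previous post linked above for the exact procedure used here.

```python
# Minimal sketch of a bootstrap bias correction of the sample median.
# Estimated bias = mean(bootstrap medians) - sample median, so the corrected
# estimate is 2 * median(x) - mean(bootstrap medians).
import numpy as np

rng = np.random.default_rng(7)

def median_bias_corrected(x, n_boot=200):
    """Bias-corrected sample median using n_boot bootstrap samples."""
    x = np.asarray(x)
    boot = rng.choice(x, size=(n_boot, x.size), replace=True)
    return 2 * np.median(x) - np.mean(np.median(boot, axis=1))

# Example: a small sample from a very skewed ex-Gaussian distribution.
sample = rng.normal(300, 20, 10) + rng.exponential(300, 10)
print(np.median(sample), median_bias_corrected(sample))
```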

For each iteration in the simulation, bias correction was performed using 200 bootstrap samples. Here are the bias corrected results:

figure_miller_bias_md_bc

The bias correction works very well on average, except for the smallest sample sizes. The failure of the bias correction for very small n is not surprising, because the shape of the sampling distribution cannot be properly estimated by the bootstrap from so few observations. From n = 10, the bias values are very close to those observed for the mean. So it seems that, in the long run, we can eliminate the bias of the sample median by using a simple bootstrap procedure. As we will see in the next post, the bootstrap bias correction is also effective when comparing two groups.


Update: 06 Feb 2018

Finally, let’s have a look at how well the bias correction works as a function of sample size and skewness. The smaller the sample size, the less accurately the bootstrap can estimate the sampling distribution and its bias. Ideally, after bias correction, we would expect the sample medians to be centred around zero, with limited variability. This ideal situation could look something like this:

figure_miller_bc_ideal_P1_N4

Here we’re considering the most skewed distribution with the smallest sample size. The x-axis shows the median bias before bias correction, whereas the y-axis shows the median bias after bias correction. In this ideal case, the correction works extremely well irrespective of the original bias (the red vertical line marks the average bias before correction). As a result, the average bias after correction is very near zero (green horizontal line).

The reality is very different. The bias correction is only partially effective and is inhomogeneous.

figure_miller_bc_check_P1_N4

Let’s add some markup to help understand what’s going on.

figure_miller_bc_check_P1_N4_markup

If the original bias is negative, after correction the median tends to be even more negative; that is, it is over-corrected in the wrong direction (lower left triangle).

If the original bias is positive, after correction the median is either (see the classification sketch after this list):
– over-corrected in the right direction (lower right triangle)
– under-corrected in the right direction (middle right triangle)
– over-corrected in the wrong direction (upper right triangle)
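To make these categories explicit, here is a hypothetical classifier in Python; `before` and `after` are the signed distances between the sample median (before or after correction) and the population median, and the category names simply encode the triangles of the marked-up figure.

```python
# Hypothetical classifier for the outcome of one bias-correction attempt.
# `before` / `after`: sample median minus population median, before / after
# the bootstrap bias correction.
def classify_correction(before, after):
    if before < 0:
        if after < before:
            return "negative bias: over-corrected in the wrong direction"   # lower left triangle
        return "negative bias: other outcome"
    if after < 0:
        return "positive bias: over-corrected in the right direction"       # lower right triangle
    if after < before:
        return "positive bias: under-corrected in the right direction"      # middle right triangle
    return "positive bias: over-corrected in the wrong direction"           # upper right triangle

print(classify_correction(before=-20, after=-35))
print(classify_correction(before=30, after=10))
```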

This pattern remains, although attenuated, if we consider the largest sample size.

figure_miller_bc_check_P1_N100

Or if we consider the least skewed distribution.

figure_miller_bc_check_P12_N4

We can look at the different patterns as a function of sample size and skewness.

The figure below shows the probability of over-correcting in the wrong direction given that the original bias is negative (lower left triangle of the marked-up figure).

figure_miller_bc_check_neg_over_wrong

In the ideal situation illustrated previously, the expected proportion of over-corrections in the wrong direction, given an original negative bias, is 6.7%. So here we clearly have an overrepresentation of these cases. When the original bias is negative, in most cases the bootstrap is unable to correct in the right direction. The situation gets worse with increasing skewness and smaller sample sizes.

The figure below shows the probability of under-correcting in the right direction given that the original bias is positive (middle right triangle of the marked-up figure).

figure_miller_bc_check_pos_under_right

In the ideal situation illustrated previously, the expected proportion of under-corrections in the right direction, given an original positive bias, is 44.7%. So here we have an overrepresentation of these cases: when the original bias is positive, in too many cases the bootstrap corrects in the right direction but under-corrects. The situation gets worse with increasing skewness and smaller sample sizes.

Ok, so bias correction is imperfect and its effectiveness varies a lot, in part depending on whether the sample median falls above or below the unknown population median. Think using the mean is safer? There are several strong arguments to the contrary – more in another post.

[GO TO POST 2/4]

References

Doksum, K. (1974) Empirical Probability Plots and Statistical Inference for Nonlinear Models in the two-Sample Case. Annals of Statistics, 2, 267-277.

Du, Y., Valentini, N.C., Kim, M.J., Whitall, J. & Clark, J.E. (2017) Children and Adults Both Learn Motor Sequences Quickly, But Do So Differently. Frontiers in Psychology, 8.

Heathcote, A., Popiel, S.J. & Mewhort, D.J.K. (1991) Analysis of Response-Time Distributions – an Example Using the Stroop Task. Psychol Bull, 109, 340-347.

Lavie, N. (1995) Perceptual Load as a Necessary Condition for Selective Attention. J Exp Psychol Human, 21, 451-468.

Matzke, D., Love, J., Wiecki, T.V., Brown, S.D., Logan, G.D. & Wagenmakers, E.J. (2013) Release the BEESTS: Bayesian Estimation of Ex-Gaussian STop Signal reaction time distributions. Front Psychol, 4.

Miller, J. (1988) A warning about median reaction time. J Exp Psychol Hum Percept Perform, 14, 539-543.

Palmer, E.M., Horowitz, T.S., Torralba, A. & Wolfe, J.M. (2011) What Are the Shapes of Response Time Distributions in Visual Search? J Exp Psychol Human, 37, 58-71.

Pratte, M.S., Rouder, J.N., Morey, R.D. & Feng, C.N. (2010) Exploring the differences in distributional properties between Stroop and Simon effects using delta plots. Atten Percept Psycho, 72, 2013-2025.

Robinson, M.M., Clevenger, J. & Irwin, D.E. (2018) The action is in the task set, not in the action. Cognitive Psychology, 100, 17-42.

Rouder, J.N., Lu, J., Speckman, P., Sun, D.H. & Jiang, Y. (2005) A hierarchical model for estimating response time distributions. Psychon B Rev, 12, 195-223.

Sternberg, S. (2014) Reaction times and the ex-Gaussian distribution: When is it appropriate? Retrieved from http://www.psych.upenn.edu//~saul/

Tipper, S.P., Lortie, C. & Baylis, G.C. (1992) Selective Reaching – Evidence for Action-Centered Attention. J Exp Psychol Human, 18, 891-905.

Whelan, R. (2008) Effective analysis of reaction time data. Psychol Rec, 58, 475-482.