Tag Archives: median

When is a 95% confidence interval not a 95% confidence interval?


In previous posts, we saw how skewness and outliers can affect false positives (type I errors) and true positives (power) in one-sample tests. In particular, when making inferences about the population mean, skewness tends to inflate false positives, and skewness and outliers can destroy power. Here we investigate a complementary perspective, looking at how confidence intervals are affected by skewness and outliers.

Spoiler alert: 95% confidence intervals most likely do not have a coverage of 95%. In fact, I’ll show you an example in which a 95% CI for the mean has an 80% coverage…

The R code for this post is on GitHub.


Back to the title of the post. Seems like a weird question? Not if we consider the definition of a confidence interval (CI). Let say we conduct an experiment to estimate quantity x from a sample, where x could be the median or the mean for instance. Then a 95% CI for the population value of x refers to a procedure whose behaviour is defined in the long-run: CIs computed in the same way should contain the population value in 95% of exact replications of the experiment. For a single experiment, the particular CI does or does not contain the population value, there is no probability associated with it. A CI can also be described as the interval compatible with the data given our model — see definitions and common misinterpretations in Greenland et al. (2016).

So 95% refers to the (long-term) coverage of the CI; the exact values of the CI bounds vary across experiments. The CI procedure is associated with a certain coverage probability, in the long-run, given the model. Here the model refers to how we collected data, data cleaning procedures (e.g. outlier removal), assumptions about data distribution, and the methods used to compute the CI. Coverage can differ from the expected one if model assumptions are violated or the model is just plain wrong.

Wrong models are extremely common, for instance when applying a standard t-test CI to percent correct data (Kruschke, 2014; Jaeger, 2008) or Likert scale data (Bürkner & Vuorre, 2019; Liddell & Kruschke, 2019). 

For continuous data, CI coverage is not at the expected, nominal level, for instance when the model expects symmetric distributions and we’re actually sampling from skewed populations (which is the norm, not the exception, when we measure sizes, durations, latencies etc.). Here we explore this issue using g & h distributions that let us manipulate asymmetry.

Illustrate g & h distributions

All g & h distributions have a median of zero. The parameter g controls the asymmetry of the distribution, while the parameter h controls the thickness of the tails (Hoaglin, 1985; Yan & Genton, 2019). Let’s look at some illustrations to make things clear.

Examples in which we vary g from 0 to 1.

As g increases, the asymmetry of the distributions increases. Using negative g values would produce distributions with negative skewness.

Examples in which we vary h from 0 to 0.2.

As h increases, the tails are getting thicker, which means that outliers are more likely. 

Test with normal (g=h=0) distribution

Let’s run simulations to look at coverage probability in different situations and for different estimators. First, we sample with replacement from a normal population (g=h=0) 20,000 times (that’s 20,000 simulated experiments). Each sample has size n=30. Confidence intervals are computed for the mean, the 10% trimmed mean ™, the 20% trimmed mean and the median using standard parametric methods (see details in the code on GitHub, and references for equations in Wilcox & Rousselet, 2018). The trimmed mean and the median are robust measures of central tendency. To compute a 10% trimmed mean, observations are sorted, the 10% lowest and 10% largest values are discarded (20% in total), and the remaining values are averaged. In this context, the mean is a 0% trimmed mean and the median is a 50% trimmed mean. Trimming the data attenuates the influence of the tails of the distributions and thus the effects of asymmetry and outliers on confidence intervals.

First we look at coverage for the 4 estimators: we look at the proportion of simulated experiments in which the CIs included the population value for each estimator. As expected for the special case of a normal distribution, the coverage is close to nominal (95%) for every method:

Mean 10% tm 20% tm Median
0.949 0.948 0.943 0.947

In addition to coverage, we also look at the width of the CIs (upper bound minus lower bound). Across simulations, we summarise the results using the median width. CIs tends to be larger for trimmed means and median relative to the mean, which implies lower power under normality for these methods (Wilcox & Rousselet, 2018). 

Mean 10% tm 20% tm Median
0.737 0.761 0.793 0.889

For CIs that did not include the population, the distribution is fairly balanced between the left and the right of the population. To see this, I computed a shift index: if the CI was located to the left of the population value, it receives a score of -1, when it was located to the right, it receives a score of 1. The shift index was then computed by averaging the scores only for those CI excluding the population.

Mean 10% tm 20% tm Median
0.046 0.043 0.009 0.013

Illustrate CIs that did not include the population

Out of 20,000 simulated experiments, about 1,000 CI (roughly 5%) did not include the population value for each estimator. About the same number of CIs were shifted to the left and to the right of the population value, which is illustrated in the next figure. In each panel, the vertical line marks the population value (here it’s zero in all conditions because the population is symmetric). The CIs are plotted in the order of occurrence in the simulation. So the figure shows that if we miss the population value, we’re as likely to overshoot than undershoot our estimation.

Across panels, the figure also shows that the more we trim (10%, 20%, median) the larger the CIs get. So for a strictly normal population, we more precisely estimate the mean than trimmed means and the median.

Test with g=1 & h=0 distribution

What happens for a skewed population? Three things happen for the mean:

  • coverage goes down
  • width increases
  • CIs not including the population value tend to be shifted to the left (negative average shift values)

The same effects are observed for the trimmed means, but less so the more we trim, because trimming alleviates the effects of the tails.

Measure Mean 10% tm 20% tm Median
Coverage 0.880 0.936 0.935 0.947
Width 1.253 0.956 0.879 0.918
Shift -0.962 -0.708 -0.661 0.017
# left 2350 1101 1084 521
# right 45 188 221 539

Illustrate CIs that did not include the population

The figure illustrates the strong imbalance between left and right CI shifts. If we try to estimate the mean of a skewed population, our CIs are likely to miss it more than 5% of the time, and when that happens, the CIs are most likely to be shifted towards the bulky part of the distribution (here the left for a right skewed distribution). Also, the right shifted CIs vary a lot in width and can be very large.

As we trim, the imbalance is progressively resolved. With 20% trimming, when CIs do not contain the population value, the distribution of left and right shifts is more balanced, although with still far more left shifts. With the median we have roughly 50% left / 50% right shifts and CIs are narrower than for the mean.

Test with g=1 & h=0.2 distribution

What happens if we sample from a skewed distribution (g=1) in which outliers are likely (h=0.2)?

Measure Mean 10% tm 20% tm Median
Coverage 0.801 0.934 0.936 0.947
Width 1.729 1.080 0.934 0.944
Shift -0.995 -0.797 -0.709 0.018
# left 3967 1194 1086 521
# right 9 135 185 540

The results are similar to those observed for h=0, only exacerbated. Coverage for the mean is even lower, CIs are larger, and the shift imbalance even more severe. I have no idea how often such a situation occur, but I suspect if you study clinical populations that might be rather common. Anyway, the point is that it is a very bad idea to assume the distributions we study are normal, apply standard tools, and hope for the best. Reporting CIs as 95% or some other value, without checking, can be very misleading.

Simulations in which we vary g

We now explore CI properties as a function of g, which we vary from 0 to 1, in steps of 0.1. The parameter h is set to 0 (left column of next figure) or 0.2 (right column). Let’s look at column A first (h=0). For the median, coverage is unaffected by g. For the other estimators, there is a monotonic decrease in coverage with increasing g. The effect is much stronger for the mean than the trimmed means.

For all estimators, increasing g leads to monotonic increases in CI width. The effect is very subtle for the median and more pronounced the less we trim. Under normality, g=0, CIs are the shortest for the mean, explaining the larger power of mean based methods relative to trimmed means in this unusual situation.

In the third panel, the zero line represents an equal proportion of left and right shifts, relative to the population, for CIs that did not include the population value. The values are consistently above zero for the median, with a few more right shifts than left shifts for all values of g. For the other estimators, the preponderance of left shifts increases markedly with g.

Now we look at results in panel B (h=0.2). When outliers are likely, coverage drops faster with g for the mean. Other estimators are resistant to outliers.

When outliers are common, CIs for the population mean are larger than for all other estimators, irrespective of g.

Again, there is a constant over-representation of right shifted CIS for the median. For the other estimators, the left shifted CIs dominate more and more with increasing g. The trend is more pronounced for the mean relative to the h=0 situation, with a sharper monotonic downward trajectory.

Conclusion

The answer to the question in the title is: most of the time! Simply because our models are wrong most of the time. So I would take all published confidence intervals with a pinch of salt. [Some would actually go further and say that if the sampling and analysis plans for an experiment were not clearly stipulated before running the experiment, then confidence interval, like P values, are not even defined (Wagenmakers, 2007). That is, we can compute a CI, but the coverage is meaningless, because exact repeated sampling might be impossible or contingent on external factors that would need to be simulated.] The best way forward is probably not to advocate for the use of trimmed means or the median over the mean in all cases, because different estimators address different questions about the data. And there are more estimators of central tendency than means, trimmed means and medians. There are also more interesting questions to ask about the data than their central tendencies (Rousselet, Pernet & Wilcox, 2017). For these reasons, we need data sharing to be the default, so that other users can ask different questions using different tools. The idea that the one approach used in a paper is the best to address the problem at hand is just silly.

To see what happens when we use the percentile bootstrap or the bootstrap-t to build confidence intervals for the mean, see this more recent post.

References

Bürkner, Paul-Christian, and Matti Vuorre. ‘Ordinal Regression Models in Psychology: A Tutorial’. Advances in Methods and Practices in Psychological Science 2, no. 1 (1 March 2019): 77–101. https://doi.org/10.1177/2515245918823199.

Greenland, Sander, Stephen J. Senn, Kenneth J. Rothman, John B. Carlin, Charles Poole, Steven N. Goodman, and Douglas G. Altman. ‘Statistical Tests, P Values, Confidence Intervals, and Power: A Guide to Misinterpretations’. European Journal of Epidemiology 31, no. 4 (1 April 2016): 337–50. https://doi.org/10.1007/s10654-016-0149-3.

Hoaglin, David C. ‘Summarizing Shape Numerically: The g-and-h Distributions’. In Exploring Data Tables, Trends, and Shapes, 461–513. John Wiley & Sons, Ltd, 1985. https://doi.org/10.1002/9781118150702.ch11.

Jaeger, T. Florian. ‘Categorical Data Analysis: Away from ANOVAs (Transformation or Not) and towards Logit Mixed Models’. Journal of Memory and Language 59, no. 4 (November 2008): 434–46. https://doi.org/10.1016/j.jml.2007.11.007.

Kruschke, John K. Doing Bayesian Data Analysis. 2nd Edition. Academic Press, 2014.

Liddell, Torrin M., and John K. Kruschke. ‘Analyzing Ordinal Data with Metric Models: What Could Possibly Go Wrong?’ Journal of Experimental Social Psychology 79 (1 November 2018): 328–48. https://doi.org/10.1016/j.jesp.2018.08.009.

Rousselet, Guillaume A., Cyril R. Pernet, and Rand R. Wilcox. ‘Beyond Differences in Means: Robust Graphical Methods to Compare Two Groups in Neuroscience’. European Journal of Neuroscience 46, no. 2 (1 July 2017): 1738–48. https://doi.org/10.1111/ejn.13610.

Rousselet, Guillaume A., and Rand R. Wilcox. ‘Reaction Times and Other Skewed Distributions: Problems with the Mean and the Median’. Preprint. PsyArXiv, 17 January 2019. https://doi.org/10.31234/osf.io/3y54r.

Wagenmakers, Eric-Jan. ‘A Practical Solution to the Pervasive Problems of p Values’. Psychonomic Bulletin & Review 14, no. 5 (1 October 2007): 779–804. https://doi.org/10.3758/BF03194105.

Wilcox, Rand R., and Guillaume A. Rousselet. ‘A Guide to Robust Statistical Methods in Neuroscience’. Current Protocols in Neuroscience 82, no. 1 (2018): 8.42.1-8.42.30. https://doi.org/10.1002/cpns.41.

Yan, Yuan, and Marc G. Genton. ‘The Tukey G-and-h Distribution’. Significance 16, no. 3 (2019): 12–13. https://doi.org/10.1111/j.1740-9713.2019.01273.x.

Reaction times and other skewed distributions: problems with the mean and the median (part 4/4)

This is part 4 of a 4 part series. Part 1 is here.

In this post, I look at median bias in a large dataset of reaction times from participants engaged in a lexical decision task. The dataset was described in a previous post.

After removing a few participants who didn’t pay attention to the task (low accuracy or too many very late responses), we’re left with 959 participants to play with. Each participant had between 996 and 1001 trials for each of two conditions, Word and Non-Word.

Here is an illustration of reaction time distributions from 100 randomly sampled participants in the Word condition:

figure_flp_w_100_kde

Same in the Non-Word condition:

figure_flp_nw_100_kde

Skewness tended to be larger in the Word than the Non-Word condition. Based on the standard parametric definition of skewness, that was the case in 80% of participants. If we use a non-parametric estimate instead (mean – median), it was the case in 70% of participants.

If we save the median of every individual distribution, we get the two following group distributions, which display positive skewness:

figure_flp_all_p_median

The same applies to distributions of means:

figure_flp_all_p_mean

So we have to worry about skewness at 2 levels:

  • individual distributions

  • group distributions

Here I’m only going to explore estimation bias as a result of skewness and sample size in individual distributions. From what we learnt in previous posts, we can already make predictions: because skewness tended to be stronger in the Word than in the Non-Word condition, the bias of the median will be stronger in the former than the later for small sample sizes. That is, the median in the Word condition will tend to be more over-estimated than the median in the Non-Word condition. As a consequence, the difference between the median of the Non-Word condition (larger RT) and the median of the Word condition (smaller RT) will tend to be under-estimated. To check this prediction, I estimated bias in every participant using a simulation with 2,000 iterations. I assumed that the full sample was the population, from which we can compute population means and population medians. Because the Non-Word condition is the least skewed, I used it as the reference condition, which always had 200 trials. The Word condition had 10 to 200 trials, with 10 trial increments. In the simulation, single RT were sampled with replacements among the roughly 1,000 trials available per condition and participant, so that each iteration is equivalent to a fake experiment. 

Let’s look at the results for the median. The figure below shows the bias in the long run estimation of the difference between medians (Non-Word – Word), as a function of sample size in the Word condition. The Non-Word condition always had 200 trials. All participants are superimposed and shown as coloured traces. The average across participants is shown as a thicker black line. 

figure_flp_bias_diff_md

As expected, bias tended to be negative with small sample sizes. For the smallest sample size, the average bias was -11 ms. That’s probably substantial enough to seriously distort estimation in some experiments. Also, variability is high, with a 80% highest density interval of [-17.1, -2.6] ms. Bias decreases rapidly with increasing sample size. For n=60, it is only 1 ms.

But inter-participant variability remains high, so we should be cautious interpreting results with large numbers of trials but few participants. To quantify the group uncertainty, we could measure the probability of being wrong, given a level of desired precision, as demonstrated here for instance.

After bootstrap bias correction (with 200 bootstrap resamples), the average bias drops to roughly zero for all sample sizes:

figure_flp_bias_diff_md_bc

Bias correction also reduced inter-participant variability. 

As we saw in the previous post, the sampling distribution of the median is skewed, so the standard measure of bias (taking the mean across simulation iterations) does not provide a good indication of the bias we can expect in a typical experiment. If instead of the mean, we compute the median bias, we get the following results:

figure_flp_mdbias_diff_md

Now, at the smallest sample size, the average bias is only -2 ms, and it drops to near zero for n=20. This result is consistent with the simulations reported in the previous post and confirms that in the typical experiment, the average bias associated with the median is negligible.

What happens with the mean?

figure_flp_bias_diff_m

The average bias of the mean is near zero for all sample sizes. Individual bias values are also much less variable than median values. This difference in bias variability does not reflect a difference in variability among participants for the two estimators of central tendency. In fact, the distributions of differences between Non-Word and Word conditions are very similar for the mean and the median. 

figure_flp_all_p_diff

Estimates of spread are also similar between distributions:

IQR: mean RT = 78; median RT = 79

MAD: mean RT = 57; median RT = 54

VAR: mean RT = 4507; median RT = 4785

This suggests that the inter-participant bias differences are due to the shape differences observed in the first two figures of this post. 

Finally, let’s consider the median bias of the mean.

figure_flp_mdbias_diff_m

For the smallest sample size, the average bias across participants is 7 ms. This positive bias can be explained easily from the simulation results of post 3: because of the larger skewness in the Word condition, the sampling distribution of the mean was more positively skewed for small samples in that condition compared to the Non-Word condition, with the bulk of the bias estimates being negative. As a result, the mean tended to be more under-estimated in the Word condition, leading to larger Non-Word – Word differences in the typical experiment. 

I have done a lot more simulations and was planning even more, using other datasets, but it’s time to move on! Of particular note, it appears that in difficult visual search tasks, skewness can differ dramatically among set size conditions – see for instance data posted here.

Concluding remarks

The data-driven simulations presented here confirm results from our previous simulations:

  • if we use the standard definition of bias, for small sample sizes, mean estimates are not biased, median estimates are biased;
  • however, in the typical experiment (median bias), mean estimates can be more biased than median estimates;

  • bootstrap bias correction can be an effective tool to reduce bias.

Given the large differences in inter-participant variability between the mean and the median, an important question is how to spend your money: more trials or more participants (Rouder & Haaf 2018)? An answer can be obtained by running simulations, either data-driven or assuming generative distributions (for instance exGaussian distributions for RT data). Simulations that take skewness into account are important to estimate bias and power. Assuming normality can have disastrous consequences.

Despite the potential larger bias and bias variability of the median compared to the mean, for skewed distributions I would still use the median as a measure of central tendency, because it provides a more informative description of the typical observations. Large sample sizes will reduce both bias and estimation variability, such that high-precision single-participant estimation should be easy to obtain in many situations involving non-clinical samples. For group estimations, much larger samples than commonly used are probably required to improve the precision of our inferences.

Although the bootstrap bias correction seems to work very well in the long run, for a single experiment there is no guarantee it will get you closer to the truth. One possibility is to report results with and without bias correction. 

For group inferences on the median, traditional techniques use incorrect estimations of the standard error, so consider modern parametric or non-parametric techniques instead (Wilcox & Rousselet, 2018). 

References

Miller, J. (1988) A warning about median reaction time. J Exp Psychol Hum Percept Perform, 14, 539-543.

Rouder, J.N. & Haaf, J.M. (2018) Power, Dominance, and Constraint: A Note on the Appeal of Different Design Traditions. Advances in Methods and Practices in Psychological Science, 1, 19-26.

Wilcox, R.R. & Rousselet, G.A. (2018) A Guide to Robust Statistical Methods in Neuroscience. Curr Protoc Neurosci, 82, 8 42 41-48 42 30.

Cohen’s d is biased

The R notebook associated with this post is available on github.

Cohen’s d is a popular measure of effect size. In the one-sample case, d is simply computed as the mean divided by the standard deviation (SD). For repeated measures, the same formula is applied to difference scores (see detailed presentation and explanation of variants in Lakens, 2013). 

Because d relies on a non-robust measure of central tendency (the mean), and a non-robust measure of dispersion (SD), it is a non-robust measure of effect size, meaning that a single observation can have a dramatic effect on its value, as explained here. Cohen’s d also makes very strict assumptions about the data, so it is only appropriate in certain contexts. As a consequence, it should not be used as the default measure of effect size, and more powerful and informative alternatives should be considered – see a few examples here. For comparisons across studies and meta-analyses, nothing will beat data-sharing though.

Here we look at another limitation of Cohen’s d: it is biased when we draw small samples. Bias is covered in detail in another post. In short, in the one-sample case, when Cohen’s d is estimated from a small sample, in the long run it tends to be larger than the population value. This over-estimation is due to a bias of SD, which tends to be lower than the population’s SD. Because the mean is not biased, when divided by an under-estimated SD, it leads to an over-estimated measure of effect size. The bias of SD is explained in intro stat books, in the section describing Student’s t. Not surprisingly it is never mentioned in the discussions of small n studies, as a limitation of effect size estimation…

In this demonstration, we sample with replacement 10,000 times from the ex-Gaussian distributions below, for various sample sizes, as explained here:

figure_miller_distributions

The table below shows the population values for each distribution. For comparison, we also consider a robust equivalent to Cohen’s d, in which the mean is replaced by the median, and SD is replaced by the percentage bend mid-variance (pbvar, Wilcox, 2017). As we will see, this robust alternative is also biased – there is no magic solution I’m afraid.

m:          600  600  600  600  600  600  600  600  600  600  600  600

md:        509  512  524  528  540  544  555  562  572  579  588  594

m-md:    92    88     76    72    60    55     45    38    29    21    12     6

m.den:   301  304  251  255  201  206  151  158  102  112  54     71

md.den: 216  224  180  190  145  157  110  126  76    95    44     68

m.es:       2.0   2.0    2.4   2.4   3.0   2.9   4.0   3.8   5.9   5.4   11.1  8.5

md.es:     2.4   2.3    2.9   2.8   3.7   3.5   5.0   4.5   7.5   6.1   13.3  8.8

m = mean

md = median

den = denominator

es = effect size

m.es = Cohen’s d

md.es = md / pbvar

 

Let’s look at the behaviour of d as a function of skewness and sample size.

figure_es_m_es

Effect size d tends to decrease with increasing skewness, because SD tends to increase with skewness. Effect size also increases with decreasing sample size. This bias is stronger for samples from the least skewed distributions. This is counterintuitive, because one would think estimation tends to get worse with increased skewness. Let’s find out what’s going on.

Computing the bias normalises the effect sizes across skewness levels, revealing large bias differences as a function of skewness. Even with 100 observations, the bias (mean of 10,000 simulation iterations) is still slightly larger than zero for the least skewed distributions. This bias is not due to the mean, because we the sample mean is an unbiased estimator of the population mean.

figure_es_m_es_bias

Let’s check to be sure:

figure_es_m_num

So the problem must be with the denominator:

figure_es_m_den

Unlike the mean, the denominator of Cohen’s d, SD, is biased. Let’s look at bias directly.

figure_es_m_den_bias

SD is most strongly biased for small sample sizes and bias increases with skewness. Negative values indicate that sample SD tends to under-estimate the population values. This is because the sampling distribution of SD is increasingly skewed with increasing skewness and decreasing sample sizes. This can be seen in this plot of the 80% highest density intervals (HDI) for instance:

figure_m_den_hdi80

The sampling distribution of SD is increasingly skewed and variable with increasing skewness and decreasing sample sizes. As a result, the sampling distribution of Cohen’s d is also skewed. The bias is strongest in absolute term for the least skewed distributions because the sample SD is overall smaller for these distributions, resulting in overall larger effect sizes. Although SD is most biased for the most skewed distributions, SD is also overall much larger for them, resulting in much smaller effect sizes than those obtained for less skewed distributions. This strong attenuation of effect sizes with increasing skewness swamps the absolute differences in SD bias. This explains the counter-intuitive lower d bias for more skewed distributions.

As we saw previously, bias can be corrected using a bootstrap approach. Applied, to Cohen’s d, this technique does reduce bias, but it still remains a concern:

figure_es_m_es_bias_after_bc

Finally, let’s look at the behaviour of a robust equivalent to Cohen’s d, the median normalised by the percentage bend mid-variance.

figure_es_md_es

The median effect size shows a similar profile to the mean effect size. It is overall larger than the mean effect size because it uses a robust measure of spread, which is less sensitive to the long right tails of the skewed distributions we sample from.

figure_es_md_bias

The bias disappears quickly with increasing sample sizes, and quicker than for the mean effect size.

However, unlike what we observed for d, in this case the bias correction does not work for small samples, because the repetition of the same observations in some bootstrap samples leads to very large values of the denominator. It’s ok for n>=15, for which bias is relatively small anyway, so at least based on these simulations, I wouldn’t use bias correction for this robust effect size.

figure_es_md_bias_after_bc

Conclusion

Beware of small sample sizes: they are associated with increased variability (see discussion in a clinical context here) and can accentuate the bias of some effect size estimates. If effect sizes tend to be reported more often if they pass some arbitrary threshold, for instance p < 0.05, then the literature will tend to over-estimate them (see demonstration here), a phenomenon exacerbated by small sample sizes (Button et al. 2013). 

Can’t say it enough: small n is bad for science if the goal is to provide accurate estimates of effect sizes.

To determine how the precision and accuracy of your results depend on sample size, the best approach is to perform simulations, providing some assumptions about the shape of the population distributions.

References

Button, K.S., Ioannidis, J.P., Mokrysz, C., Nosek, B.A., Flint, J., Robinson, E.S. & Munafo, M.R. (2013) Power failure: why small sample size undermines the reliability of neuroscience. Nature reviews. Neuroscience, 14, 365-376.

Lakens, D. (2013) Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and ANOVAs. Front Psychol, 4, 863.

Wilcox, R.R. (2017) Introduction to Robust Estimation and Hypothesis Testing. Academic Press, 4th edition., San Diego, CA.

Reaction times and other skewed distributions: problems with the mean and the median (part 3/4)

Bias is defined as the distance between the mean of the sampling distribution (here estimated using Monte-Carlo simulations) and the population value. In part 1 and part 2, we saw that for small sample sizes, the sample median provides a biased estimation of the population median, which can significantly affect group comparisons. However, this bias disappears with large sample sizes, and it can be corrected using a bootstrap bias correction. In part 3, we look in more detail at the shape of the sampling distributions, which was ignored by Miller (1988).

Sampling distributions

Let’s consider the sampling distributions of the mean and the median for different sample sizes and ex-Gaussian distributions with skewness ranging from 6 to 92 (Figure 1). When skewness is limited (6, top row), the sampling distributions are symmetric and centred on the population values: there is no bias. As we saw previously, with increasing sample size, variability decreases, which is why studies with larger samples provide more accurate estimations. The flip side is that studies with small samples are much noisier, which is why their results tend not to replicate…

figure_samp_dist_summary

Figure 1

When skewness is large (92, middle row), sampling distributions get more positively skewed with decreasing sample sizes. To better understand how the sampling distributions change with sample size, we turn to the last row of Figure 1, which shows 50% highest-density intervals (HDI). Each horizontal line is a HDI for a particular sample size. The labels contain the values of the interval boundaries. The coloured vertical tick inside the interval marks the median of the distribution. The red vertical line spanning the entire plot is the population value.

Means For small sample sizes, the 50% HDI is offset to the left of the population mean, and so is the median of the sampling distribution. This demonstrates that the typical sample mean tends to under-estimate the population mean – that is to say, the mean sampling distribution is median biased. This offset reduces with increasing sample size, but is still present even for n=100.

Medians With small sample sizes, there is a discrepancy between the 50% HDI, which is shifted to the left of the population median, and the median of the sampling distribution, which is shifted to the right of the population median. This contrasts with the results for the mean, and can be explained by differences in the shapes of the sampling distributions, in particular the larger skewness and kurtosis of the median sampling distribution compared to that of the mean (see code on github for extra figures). The offset between 50% HDI and the population reduces quickly with increasing sample size. For n=10, the median bias is already very small. From n=15, the median sample distribution is not median bias, which means that the typical sample median is not biased.

Another representation of the sampling distributions is provided in Figure 2: 50% HDI are shown as a function of sample size. For both the mean and the median, bias increase with increasing skewness and decreasing sample size. Skewness also increases the asymmetry of the sampling distributions, but more for the mean than median.

figure_samp_dist_hdi_summary

Figure 2

So what’s going on here? Is the mean also biased?  According to the standard definition of bias, which is based on the distance between the population mean and the average of the sampling distribution of the mean, the mean is not biased. But this definition applies to the long run, after we replicate the same experiment many times. In practice, we never do that. So what happens in practice, when we perform only one experiment? In that case, the median of the sampling distribution provides a better description of the typical experiment than the mean of the distribution. And the median of the sample distribution of the mean is inferior to the population mean when sample size is small. So if you conduct one small n experiment and compute the mean of a skewed distribution, you’re likely to under-estimate the true value.

Is the median biased after all? The median is indeed biased according to the standard definition. However, with small n, the typical median (represented by the median of the sampling distribution of the median) is close to the population median, and the difference disappears for even relatively small sample sizes. 

Is it ok to use the median then?

If the goal is to accurately estimate the central tendency of a RT distribution, while protecting against the influence of outliers, the median is far more efficient than the mean (Wilcox & Rousselet, 2018). Providing sample sizes are large enough, bias is actually not a problem, and the typical bias is actually very small, as we’ve just seen. So, if you have to choose between the mean and the median, I would go for the median without hesitation.

It’s more complicated though. In an extensive series of simulations, Ratcliff (1993) demonstrated that when performing standard group ANOVAs, the median can lack power compared to other estimators. Ratcliff’s simulations involved ANOVAs on group means, in which for each participant, very few trials (7 to 12) are available for each condition. Based on the simulations, Ratcliff recommended data transformations or computing the mean after applying specific cut-offs to maximise power. However, these recommendations should be considered with caution because the results could be very different with more realistic sample sizes. Also, standard ANOVA on group means are not robust, and alternative techniques should be considered (Wilcox 2017). Data transformations are not ideal either, because they change the shape of the distributions, which contains important information about the nature of the effects. Also, once data are transformed, inferences are made on the transformed data, not on the original ones, an important caveat that tends to be swept under the carpet in articles’ discussions… Finally, truncating distributions introduce bias too, especially with the mean – see next section (Miller 1991; Ulrich & Miller 1994)!

At this stage, I don’t see much convincing evidence against using the median of RT distributions, if the goal is to use only one measure of location to summarise the entire distribution. Clearly, a better alternative is to not throw away all that information, by studying how entire distributions differ (Rousselet et al. 2017). For instance, explicit modelling of RT distributions can be performed with the excellent brms R package.

Other problems with the mean

In addition to being median biased, and a poor measure of central tendency for asymmetric distributions, the mean is also associated with several other important problems. Standard procedures using the mean lack of power, offer poor control over false positives, and lead to inaccurate confidence intervals. Detailed explanations of these problems are provided in Field & Wilcox (2017) and Wilcox & Rousselet (2018) for instance. For detailed illustrations of the problems associated with means in the one-sample case, when dealing with skewed distributions, see the companion reproducibility package on figshare.

If that was not enough, common outlier exclusion techniques lead to bias estimation of the mean (Miller, 1991). When applied to skewed distributions, removing any values more than 2 or 3 SD from the mean affects slow responses more than fast ones. As a consequence, the sample mean tends to underestimate the population mean. And this bias increases with sample size because the outlier detection technique does not work for small sample sizes, which results from the lack of robustness of the mean and the SD. The bias also increases with skewness. Therefore, when comparing distributions that differ in sample size, or skewness, or both, differences can be masked or created, resulting in inaccurate quantification of effect sizes.

Truncation using absolute thresholds (for instance RT < 300 ms or RT > 1,200 ms) also leads to potentially severe bias of the mean, median, standard deviation and skewness of RT distributions (Ulrich & Miller 1994). The median is much less affected by truncation bias than the mean though.

In the next and final post of this series, we will explore sampling bias in a real dataset, to see how much of a problem we’re really dealing with. Until then, thanks for reading.

[GO TO POST 4/4]

References

Field, A.P. & Wilcox, R.R. (2017) Robust statistical methods: A primer for clinical psychology and experimental psychopathology researchers. Behav Res Ther, 98, 19-38.

Miller, J. (1988) A warning about median reaction time. J Exp Psychol Hum Percept Perform, 14, 539-543.

Miller, J. (1991) Reaction-Time Analysis with Outlier Exclusion – Bias Varies with Sample-Size. Q J Exp Psychol-A, 43, 907-912.

Ratcliff, R. (1993) Methods for dealing with reaction time outliers. Psychol Bull, 114, 510-532.

Rousselet, G.A., Pernet, C.R. & Wilcox, R.R. (2017) Beyond differences in means: robust graphical methods to compare two groups in neuroscience. The European journal of neuroscience, 46, 1738-1748.

Ulrich, R. & Miller, J. (1994) Effects of Truncation on Reaction-Time Analysis. Journal of Experimental Psychology-General, 123, 34-80.

Wilcox, R.R. (2017) Introduction to Robust Estimation and Hypothesis Testing. Academic Press, 4th edition., San Diego, CA.

Wilcox, R.R. & Rousselet, G.A. (2018) A Guide to Robust Statistical Methods in Neuroscience. Curr Protoc Neurosci, 82, 8 42 41-48 42 30.

Reaction times and other skewed distributions: problems with the mean and the median (part 2/4)

As we saw in the previous post, the sample median is biased when sampling from skewed distributions. The bias increases with decreasing sample size. According to Miller (1988), because of this bias, group comparison can be affected if the two groups differ in skewness or sample size, or both. As a result, real differences can be lowered or increased, and non-existent differences suggested. In Miller’s own words:

“An important practical consequence of the bias in median reaction time is that sample medians must not be used to compare reaction times across experimental conditions when there are unequal numbers of trials in the conditions.”

Let’s evaluate this advice.

We assess the problem using a simulation in which we draw samples of same or different sizes from populations with the same skewness, using the same 12 distributions used by Miller (1988), as described previously.

Group 2 has size 200. Group 1 has size 10 to 200, in increments of 10.

After 10,000 iterations, here are the results for the mean:

figure_bias_diff_m

 

All the bias values are near zero, as expected.

Here are the results for the median:

figure_bias_diff_md

 

Bias increases with skewness and sample size difference and is particularly large for n = 10. At least about 90-100 trials in Group 1 are required to bring bias to values similar to the mean.

Next, let’s find out if we can correct the bias. Bias correction is performed in 2 ways:

  • using the bootstrap

  • using subsamples, following Miller’s suggestion.

Miller (1988) suggested:

“Although it is computationally quite tedious, there is a way to use medians to reduce the effects of outliers without introducing a bias dependent on sample size. One uses the regular median from Condition F and compares it with a special “average median” (Am) from Condition M. To compute Am, one would take from Condition M all the possible subsamples of Size f where f is the number of trials in Condition F. For each subsample one computes the subsample median. Then, Am is the average, across all possible subsamples, of the subsample medians. This procedure does not introduce bias, because all medians are computed on the basis of the same sample (subsample) size.”

Using all possible subsamples would take far too long. For instance, if one group has 5 observations and the other group has 20 observations, there are 15504 (choose(20,5)) subsamples to consider. Slightly larger sample sizes would force us to consider millions of subsamples. So instead we compute K random subsamples. I arbitrarily set K to 1,000. Although this is not what Miller (1988) suggested, the K loop shortcut should reduce bias to some extent if it is due to sample size differences. Here are the results:

figure_bias_diff_md_sub

 

The K loop approach works very well! Bias can also be handled by the bootstrap. Here is what we get using 200 bootstrap resamples for each simulation iteration:

figure_bias_diff_md_bbc

 

The bootstrap bias correction works very well too! So in the long-run, the bias in the estimation of differences between medians can be eliminated using the subsampling or the percentile bootstrap approaches. Because of the skewness of the sampling distributions, we also consider the median bias: the bias observed in a typical experiment. In that case, the difference between group means tends to underestimate the population difference:

figure_bias_diff_m_mdbias

For the median, the median bias is much lower than the standard (mean) bias, and near zero from n = 20.

figure_bias_diff_md_mdbias

Thus, for a typical experiment, the difference between group medians actually suffers less from bias than the difference between group means.

Conclusion

Miller’s (1988) advice was inappropriate because, when comparing two groups, bias in a typical experiment is actually negligible. To be cautious, when sample size is relatively small, it could be useful to report median effects with and without bootstrap bias correction. It would be even better to run simulations to determine the sample sizes required to achieve an acceptable measurement precision, irrespective of the estimator used.

Finally, data & code are available on github.

[GO TO POST 3/4]

What can we learn from 10,000 experiments?


The code and a notebook for this post are available on github.

Before we conduct an experiment, we decide on a number of trials per condition, and a number of participants, and then we hope that whatever we measure comes close to the population values we’re trying to estimate. In this post I’ll use a large dataset to illustrate how close we can get to the truth – at least in a lexical decision task. The data are from the French lexicon project:

Ferrand, L., New, B., Brysbaert, M., Keuleers, E., Bonin, P., Meot, A., Augustinova, M. & Pallier, C. (2010) The French Lexicon Project: lexical decision data for 38,840 French words and 38,840 pseudowords. Behav Res Methods, 42, 488-496.

After discarding participants who clearly did not pay attention, we have 967 participants who performed a word/non-word discrimination task. Each participant has about 1000 trials per condition. I only consider the reaction time (RT) data, but correct/incorrect data are also available. Here are RT distributions for 3 random participants (PXXX refers to their ID number in the dataset):

figure_flp_p113

figure_flp_p388

figure_flp_p965

The distributions are positively skewed, as expected for RT data, and participants tend to be slower in the non-word condition compared to the word condition. Usually, a single number is used to summarise each individual RT distribution. From 1000 values to 1, that’s some serious data compression. (For alternative strategies to compare RT distributions, see for instance this paper). In psychology, the mean is often used, but here the median gives a better indication of the location of the typical observation. Let’s use both. Here is the distribution across participants of median RT for the word and non-word conditions:

figure_all_p_medians

And the distribution of median differences:

figure_all_p_median_diff

Interestingly, the differences between the medians of the two conditions is also skewed: that’s because the two distributions tend to differ in skewness.

We can do the same for the mean:

figure_all_p_means

Mean differences:

figure_all_p_mean_diff

With this large dataset, we can play a very useful game. Let’s pretend that the full dataset is our population that we’re trying to estimate. Across all trials and all participants, the population medians are:

Word = 679.5 ms
Non-word = 764 ms
Difference = 78.5

the population means are:

Word = 767.6 ms
Non-word = 853.2 ms
Difference = 85.5

Now, we can perform experiments by sampling with replacement from our large population. I think we can all agree that a typical experiment does not have 1,000 trials per condition, and certainly does not have almost 1,000 participants! So what happens if we perform experiments in which we collect smaller sample sizes? How close can we get to the truth?

10,000 experiments: random samples of participants only

In the first simulation, we use all the trials (1,000) available for each participant but vary how many participants we sample from 10 to 100, in steps of 10. For each sample size, we draw 10,000 random samples of participants from the population, and we compute the mean and the median. For consistency, we compute group means of individual means, and group medians of individual medians. We have 6 sets of results to consider: for the word condition, the non-word condition, their difference; for the mean and for the median. Here are the results for the group medians in the word condition. The vertical red line marks the population median for the word condition.

figure_md_w_size

With increasing sample size, we gain in precision: on average the result of each experiment is closer to the population value. With a small sample size, the result of a single experiment can be quite far from the population. This is not surprising, and also explains why estimates from small n experiments tend to disagree, especially between an original study and subsequent replications.

To get a clearer sense of how close the estimates are to the population value, it is useful to consider highest density intervals (HDI). A HDI is the shorted interval that contains a certain proportion of observations: it shows the location of the bulk of the observations. Here we consider 50% HDI:

figure_md_w_size_hdi

In keeping with the previous figure, the intervals get smaller with increasing sample size. The intervals are also asymmetric: the right side is always closer to the population value than the left side. That’s because the sampling distributions are asymmetric. This means that the typical experiment will tend to under-estimate the population value.

We observe a similar pattern for the non-word condition, with the addition of two peaks in the distribution from n = 40. I’m not sure what’s causing it, or if it is a common shape. One could check by analysing the lexicon project data from other countries. Any volunteers? One thing for sure: the two peaks are caused by the median, because they are absent from the mean distributions (see notebook on github).

figure_md_nw_size

Finally, we can consider the distributions of group medians of median differences:

figure_md_diff_size

Not surprisingly, it has a similar shape to the previous distributions. It shows a landscape of expected effects. It would be interesting to see where the results of other, smaller, experiments fall in that distribution. The distribution could also be used to plan experiments to achieve a certain level of precision. This is rather unusual given the common obsession for statistical power. But experiments should be predominantly about quantifying effects, so a legitimate concern is to determine how far we’re likely to be from the truth in a given experiment. Using our simulated distribution of experiments, we can determine the probability of the absolute difference between an experiment and the population value to be larger than some cut-offs. In the figure below, we look at cut-offs from 5 ms to 50 ms, in steps of 5.

figure_md_diff_size_prob

The probability of being wrong by at least 5 ms is more than 50% for all sample sizes. But 5 ms is probably not a difference folks studying lexical decision would worry about. It might be an important difference in other fields. 10 ms might be more relevant: I certainly remember a conference in which group differences of 10 ms between attention conditions were seriously discussed, and their implications for theories of attention considered. If we care about a resolution of at least 10 ms, n=30 still gives us 44% chance of being wrong…

(If someone is looking for a cool exercise: you could determine the probability of two random experiments to differ by at least certain amounts.)

We can also look at the 50% HDI of the median differences:

figure_md_diff_size_hdi

As previously noted, the intervals shrink with increasing sample size, but are asymmetric, suggesting that the typical experiment will under-estimate the population difference. This asymmetry disappears for larger samples of group means:

figure_m_diff_size_hdi

Anyway, these last figures really make me think that we shouldn’t make such a big deal out of any single experiment, given the uncertainty in our measurements.

10,000 experiments: random samples of trials and participants

In the previous simulation, we sampled participants with replacement, using all their 1,000 trials. Typical experiments use fewer trials. Let say I plan to collect data from 20 participants, with 100 trials per condition. What does the sampling distribution look like? Using our large dataset, we can sample trials and participants to find out. Again, for consistency, group means are computed from individual means, and group medians are computed from individual medians. Here are the sampling distributions of the mean and the median in the word condition:

figure_sim_word

The vertical lines mark the population values. The means are larger than the medians because of the positive skewness of RT distributions.

In the non-word condition:

figure_sim_nonword

And the difference distributions:

figure_sim_difference

Both distributions are slightly positively skewed, and their medians fall to the left of the population values. The spread of expected values is quite large.

Similarly to simulation 1, we can determine the probability of being wrong by at least 5 ms, which is 39%. For 10 ms, the probability is 28%, and for 20 ms 14%. But that’s assuming 20 participants and 100 trials per condition. A larger simulation could investigate many combinations of sample sizes…

Finally, we consider the 50% HDI, which are both asymmetric with respect to the population values and with similar lengths:

figure_sim_difference_hdi

Conclusion

To answer the question from the title, I think the main things we can learn from the simulated sampling distributions above are:
– to embrace uncertainty
– modesty in our interpretations
– healthy skepticism about published research
One way to highlight uncertainty in published research would be to run simulations using descriptive statistics from the articles, to illustrate the range of potential outcomes that are compatible with the results. That would make for interesting stories I’m sure.

Bias & bootstrap bias correction


UPDATE: for more on the topic, see this article:

Reaction times and other skewed distributions: problems with the mean and the median

[Preprint] [Reproducibility package]


 


The code and a notebook for this post are available on github.

The bootstrap bias correction technique is described in detail in chapter 10 of this classic textbook:

Efron, B., & Tibshirani, R. J. (1994). An introduction to the bootstrap. CRC press.

A mathematical summary + R code are available here.


 

In a typical experiment, we draw samples from an unknown population and compute a summary quantity, or sample statistic, which we hope will be close to the true population value. For instance, in a reaction time (RT) experiment, we might want to estimate how long it takes for a given participant to detect a visual stimulus embedded in noise. Now, to get a good estimate of our RTs, we ask our participant to perform a certain number of trials, say 100. Then, we might compute the mean RT, to get a value that summarises the 100 trials. This mean RT is an estimate of the true, population, value. In that case the population would be all the unknowable reaction times that could be generated by the participant, in the same task, in various situations – for instance after x hours of sleep, y cups of coffee, z pints of beer the night before – typically all these co-variates that we do not control. (The same logic applies across participants). So for that one experiment, the mean RT might under-estimate or over-estimate the population mean RT. But if we ask our participant to come to the lab over and over, each time to perform 100 trials, the average of all these sample estimates will converge to the population mean. We say that the sample mean is an unbiased estimator of the population mean. Certain estimators, in certain situations, are however biased: no matter how many experiments we perform, the average of the estimates is systematically off, either above or below the population value.

Let’s say we’re sampling from this very skewed distribution. It is an ex-Gaussian distribution which looks a bit like a reaction time distribution, but could be another skewed quantity – there are plenty to choose from in nature. The mean is 600, the median is 509.

figure.kde.pop.m.md

Now imagine we perform experiments to try to estimate these population values. Let say we take 1,000 samples of 10 observations. For each experiment (sample), we compute the mean. These sample means are shown as grey vertical lines in the next figure. A lot of them fall very near the population mean (black vertical line), but some of them are way off.

figure.kde.1000m

The mean of these estimates is shown with the black dashed vertical line. The difference between the mean of the mean estimates and the population value is called bias. Here bias is small (2.5). Increasing the number of experiments will eventually lead to a bias of zero. In other words, the sample mean is an unbiased estimator of the population mean.

For small sample sizes from skewed distributions, this is not the case for the median. In the example below, the bias is 15.1: the average median across 1,000 experiments over-estimates the population median.

figure.kde.1000md

Increasing sample size to 100 reduces the bias to 0.7 and improves the precision of our estimates. On average, we get closer to the population median, and the distribution of sample medians has much lower variance.

figure.kde.1000md_n100

So bias and measurement precision depend on sample size. Let’s look at sampling distributions as a function of sample size. First, we consider the mean.

The sample mean is not (mean) biased

Using our population, let’s do 1000 experiments, in which we take different numbers of samples 1000, 500, 100, 50, 20, 10.

Then, we can illustrate how close all these experiments get to the true (population) value.

figure_sample_mean

As expected, our estimates of the population mean are less and less variable with increasing sample size, and they converge towards the true population value. For small samples, the typical sample mean tends to underestimate the true population value (yellow curve). But despite the skewness of the sampling distribution with small n, the average of the 1000 simulations/experiments is very close to the population value, for all sample sizes:

  • population = 677.8

  • average sample mean for n=10 = 683.8

  • average sample mean for n=20 = 678.3

  • average sample mean for n=50 = 678.1

  • average sample mean for n=100 = 678.7

  • average sample mean for n=500 = 678.0

  • average sample mean for n=1000 = 678.0

The approximation will get closer to the true value with more experiments. With 10,000 experiments of n=10, we get 677.3. The result is very close to the true value. That’s why we say that the sample mean is an unbiased estimator of the population mean.

It remains that with small n, the sample mean tends to underestimate the population mean.

The median of the sampling distribution of the mean in the previous figure is 656.9, which is 21 ms under the population value. So if the mean is not mean biased, it is actually median biased (for more on this topic, see this article).

Bias of the median

The sample median is biased when n is small and we sample from skewed distributions:

Miller, J. (1988) A warning about median reaction time. J Exp Psychol Hum Percept Perform, 14, 539-543.

However, the bias can be corrected using the bootstrap. Let’s first look at the sampling distribution of the median for different sample sizes.

figure_sample_median

Doesn’t look too bad, but for small sample sizes, on average the sample median over-estimates the population median:

  • population = 508.7

  • sample mean for n=10 = 522.1

  • sample mean for n=20 = 518.4

  • sample mean for n=50 = 511.5

  • sample mean for n=100 = 508.9

  • sample mean for n=500 = 508.7

  • sample mean for n=1000 = 508.7

Unlike what happened with the mean, the approximation does not get closer to the true value with more experiments. Let’s try 10,000 experiments of n=10:

  • sample mean = 523.9

There is systematic shift between average sample estimates and the population value: thus the sample median is a biased estimate of the population median. Fortunately, this bias can be corrected using the bootstrap.

Bias estimation and bias correction

A simple technique to estimate and correct sampling bias is the percentile bootstrap. If we have a sample of n observations:

  • sample with replacement n observations from our original sample

  • compute the estimate

  • perform steps 1 and 2 nboot times

  • compute the mean of the nboot bootstrap estimates

The difference between the estimate computed using the original sample and the mean of the bootstrap estimates is a bootstrap estimate of bias.

Let’s consider one sample of 10 observations from our skewed distribution. It’s median is: 535.9

The population median is 508.7, so our sample considerably over-estimate the population value.

Next, we sample with replacement from our sample, and compute bootstrap estimates of the median. The distribution obtained is a bootstrap estimate of the sampling distribution of the median. The idea is this: if the bootstrap distribution approximates, on average, the shape of the sampling distribution of the median, then we can use the bootstrap distribution to estimate the bias and correct our sample estimate. However, as we’re going to see, this works on average, in the long run. There is no guarantee for a single experiment.

figure_boot_md

Using the current data, the mean of the bootstrap estimates is  722.8.

Therefore, our estimate of bias is the difference between the mean of the bootstrap estimates and the sample median = 187.

To correct for bias, we subtract the bootstrap bias estimate from the sample estimate:

sample median – (mean of bootstrap estimates – sample median)

which is the same as:

2 x sample median – mean of bootstrap estimates.

Here the bias corrected sample median is 348.6. Quite a drop from 535.9. So the sample bias has been reduced dramatically, clearly too much. But bias is a long run property of an estimator, so let’s look at a few more examples. We take 100 samples of n = 10, and compute a bias correction for each of them. The arrows go from the sample median to the bias corrected sample median. The vertical black line shows the population median.

figure_mdbc_examples

  • population median =  508.7

  • average sample median =  515.1

  • average bias corrected sample median =  498.8

So across these experiments, the bias correction was too strong.

What happens if we perform 1000 experiments, each with n=10, and compute a bias correction for each one? Now the average of the bias corrected median estimates is much closer to the true median.

  • population median =  508.7

  • average sample median =  522.1

  • average bias corrected sample median =  508.6

It works very well!

But that’s not always the case: it depends on the estimator and on the amount of skewness.

I’ll cover median bias correction in more detail in a future post. For now, if you study skewed distributions (most bounded quantities such as time measurements are skewed) and use the median, you should consider correcting for bias. But it’s unclear in which situation to act: clearly, bias decreases with sample size, but with small sample sizes the bias can be poorly estimated, potentially resulting in catastrophic adjustments. Clearly more simulations are needed…