The R code to run the simulations and create the figures is on GitHub.
The alpha level, also called the type I error rate or significance level, is the long-run probability of a false positive: obtaining a significant test when there is in fact no effect. Alpha is traditionally given the arbitrary value of 0.05, meaning that in the long run, when there is no effect, 5% of our tests will be false positives. However, the proportion of significant results that are false positives can be much higher than that (Colquhoun, 2014). As a consequence, some have advocated lowering alpha to avoid fooling ourselves too often in the long run. Others have suggested avoiding the term “statistically significant” altogether and justifying the choice of alpha. Another very sensible suggestion is to not bother with arbitrary thresholds at all.
Colquhoun, D. (2014) An investigation of the false discovery rate and the misinterpretation of p-values. R Soc Open Sci, 1, 140216.
Here I want to address a related but different problem: assuming we’re happy with setting alpha to a particular value, say 0.05, is alpha actually equal to the expected, nominal, value in realistic situations?
Let’s check using one-sample estimation as an example. First, we assume we live in a place of magic, where unicorns are abundant (Micceri 1989): we draw random samples from a standard normal distribution. For each sample of a given size, we perform a t-test. We do that 5000 times and record the proportion of false positives. The results appear in the figure below:
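The procedure just described can be sketched in a few lines. This is a minimal Python illustration, not the R code used for the figures: the function names, the seed and the three sample sizes are my own choices, and the critical values are hard-coded from standard t tables (two-sided, alpha = 0.05) to keep the example free of dependencies.

```python
import math
import random
import statistics

# Two-sided 5% critical values of Student's t for df = n - 1,
# copied from standard t tables (assumption: hard-coded for illustration).
T_CRIT = {10: 2.262, 20: 2.093, 30: 2.045}

def t_statistic(sample, null_value=0.0):
    """One-sample t statistic for H0: population mean == null_value."""
    n = len(sample)
    se = statistics.stdev(sample) / math.sqrt(n)
    return (statistics.fmean(sample) - null_value) / se

def false_positive_rate(n, n_sim=5000, seed=1):
    """Proportion of significant t-tests when sampling from N(0, 1), so H0 is true."""
    rng = random.Random(seed)
    crit = T_CRIT[n]
    hits = 0
    for _ in range(n_sim):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if abs(t_statistic(sample)) > crit:
            hits += 1
    return hits / n_sim

for n in T_CRIT:
    print(n, false_positive_rate(n))
```

With 5000 iterations the simulation error on each proportion is roughly 0.003, so estimates should land close to 0.05 without matching it exactly.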
As expected under normality, alpha is very near 0.05 for all sample sizes. Bradley (1978) suggested that keeping the probability of a type I error between 0.025 and 0.075 is satisfactory if the goal is to achieve 0.05. So we illustrate that satisfactory zone as a grey band in the figure above and subsequent ones.
So far so good, but how many of the quantities we measure have perfectly symmetric distributions with well-behaved tails extending to infinity? Many (most?) quantities related to time measurements are positively skewed and bounded at zero: behavioural reaction times, onset and peak latencies from recordings of single neurones, EEG, MEG… Percent correct data are bounded [0, 1]. Physiological measurements are naturally bounded, or bounded by our measuring equipment (EEG amplifiers for instance). Brain imaging data can have non-normal distributions (Pernet et al. 2009). The list goes on…
So what happens when we sample from a skewed distribution? Let’s look at an example using a lognormal distribution. This time we run 10,000 iterations sampling from a normal distribution and a lognormal distribution, and each time we apply a t-test:
In the normal case (dashed blue curve), results are similar to those obtained in the previous figure: we’re very near 0.05 at all sample sizes. In the lognormal case (solid blue curve) the type I error rate is much larger than 0.05 for small sample sizes. It goes down with increasing sample size, but is still above 0.05 even with 500 observations. The point is clear: if we sample from skewed distributions, our type I error rate is larger than the nominal level, which means that in the long run, we will commit more false positives than we thought.
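Here is a rough Python sketch of that comparison for a single sample size (n = 20). It is an illustration under my own assumptions, not the original R code: the helper names and seed are hypothetical, the critical value is hard-coded from a t table, and the lognormal samples use log-mean 0 and log-SD 1. The null value tested is the true population mean, which is exp(0.5) for that lognormal, so every rejection is a false positive.

```python
import math
import random
import statistics

T_CRIT_DF19 = 2.093  # two-sided 5% critical t value for n = 20 (df = 19), from a t table

def rejects(sample, null_value, crit=T_CRIT_DF19):
    """True if the one-sample t-test on the mean is significant."""
    n = len(sample)
    se = statistics.stdev(sample) / math.sqrt(n)
    return abs((statistics.fmean(sample) - null_value) / se) > crit

def type_one_error(draw, null_value, n=20, n_sim=10000, seed=2):
    """Proportion of rejections when null_value is the true population mean."""
    rng = random.Random(seed)
    hits = sum(
        rejects([draw(rng) for _ in range(n)], null_value)
        for _ in range(n_sim)
    )
    return hits / n_sim

# Normal case: the population mean is 0.
normal_rate = type_one_error(lambda rng: rng.gauss(0.0, 1.0), 0.0)
# Lognormal case: the population mean is exp(mu + sigma^2 / 2) = exp(0.5).
lognormal_rate = type_one_error(
    lambda rng: rng.lognormvariate(0.0, 1.0), math.exp(0.5)
)
print("normal:", normal_rate, "lognormal:", lognormal_rate)
```

Extending this to other sample sizes only requires the matching critical value for each n.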
Fortunately, there is a solution: the detrimental effects of skewness can be limited by trimming the data. Under normality, t-tests on 20% trimmed means do not perform as well as t-tests on means: they trigger more false positives for small sample sizes (dashed green). Sampling from a lognormal distribution also increases the type I error rate of the 20% trimmed mean (solid green), but clearly not as much as what we observed for the mean. The median (red) performs even better, with estimates very near 0.05 whether we sample from normal or lognormal distributions.
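For readers unfamiliar with t-tests on trimmed means, here is a minimal sketch of one common version, the Tukey-McLaughlin approach described in Wilcox (2012): the trimmed mean is compared to the null value using a standard error based on the winsorized standard deviation. I am assuming this is the kind of test meant above; the function name is hypothetical, and the p value is left out because it would need a t distribution function from outside the standard library.

```python
import math
import statistics

def trimmed_mean_test(sample, null_value=0.0, trim=0.2):
    """Return (trimmed mean, t statistic, df) for a one-sample test on a trimmed mean."""
    x = sorted(sample)
    n = len(x)
    g = math.floor(trim * n)      # number of observations trimmed from each tail
    kept = x[g:n - g]
    tmean = statistics.fmean(kept)
    # Winsorize: replace each trimmed value by the nearest retained value.
    winsorized = [kept[0]] * g + kept + [kept[-1]] * g
    sw = statistics.stdev(winsorized)
    # Standard error of the trimmed mean, rescaled for the proportion kept.
    se = sw / ((1 - 2 * trim) * math.sqrt(n))
    df = n - 2 * g - 1            # degrees of freedom for the t reference distribution
    return tmean, (tmean - null_value) / se, df

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]   # one large outlier
tmean, t, df = trimmed_mean_test(data)
print("mean:", statistics.fmean(data), "20% trimmed mean:", tmean, "t:", t, "df:", df)
```

In this toy example the outlier drags the mean up to 14.5, while the 20% trimmed mean stays at 5.5, which is why trimming limits the damage done by skewness and outliers.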
To gain a broader perspective, we compare the mean, the 20% trimmed mean and the median in different situations. To do that, we use g & h distributions (Wilcox 2012). These distributions have a median of zero; the parameter g controls the asymmetry of the distribution, while the parameter h controls the tails.
Here are examples of distributions in which h = 0 and g is varied. The lognormal distribution corresponds to g = 1. The standard normal distribution is defined by g = h = 0.
Here are examples of distributions in which g = 0 and h is varied. The tails get heavier with larger values of h.
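To make the roles of g and h concrete, here is a small Python sketch of the usual way to generate g & h values: draw a standard normal value Z, then return Z·exp(hZ²/2) when g = 0 and ((exp(gZ) − 1)/g)·exp(hZ²/2) otherwise. The function name `ghdist` and the seed are my own choices for illustration.

```python
import math
import random
import statistics

def ghdist(z, g=0.0, h=0.0):
    """Transform a standard normal value z into a g & h value."""
    tail = math.exp(h * z * z / 2.0)          # h > 0 stretches the tails
    if g == 0.0:
        return z * tail                       # symmetric case
    return (math.exp(g * z) - 1.0) / g * tail # g > 0 skews to the right

rng = random.Random(3)
z = [rng.gauss(0.0, 1.0) for _ in range(20000)]
for g, h in [(0.0, 0.0), (1.0, 0.0), (0.0, 0.2)]:
    x = [ghdist(v, g, h) for v in z]
    # The median stays near zero in every case; with g > 0 the mean moves away from it.
    print(g, h, round(statistics.median(x), 3), round(statistics.fmean(x), 3))
```

Because the transformation maps z = 0 to 0 and preserves order, the median is zero whatever the values of g and h, as stated above.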
Let’s first look at the results for t-tests on means:
The type I error rate is strongly affected by asymmetry, and the effect gets worse with smaller sample sizes.
The effect of asymmetry is much less dramatic if we use a 20% trimmed mean:
And a test on the median doesn’t seem to be affected by asymmetry at all, irrespective of sample size:
The median performs better because it provides a more accurate estimation of location than the mean.
If, instead of manipulating the degree of asymmetry, we manipulate the thickness of the tails, the type I error rate goes down as the tails get heavier. This is because sampling from distributions with heavy tails tends to generate outliers, which inflate the variance and thereby shrink the t statistic. The consequence is an alpha below the nominal level and low statistical power (as seen in the next post).
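The two manipulations applied to t-tests on means (asymmetry via g, tail thickness via h) can be sketched as follows for n = 20. As before, this is a hedged Python illustration rather than the original R code: names, seed, number of iterations and the chosen g and h values are assumptions, and the critical value comes from a t table. The null value is the population mean, which is 0 when g = 0 and (exp(g²/2) − 1)/g when h = 0; the sketch does not handle other combinations.

```python
import math
import random
import statistics

T_CRIT_DF19 = 2.093  # two-sided 5% critical t value for n = 20 (df = 19), from a t table

def ghdist(z, g=0.0, h=0.0):
    """Transform a standard normal value z into a g & h value."""
    tail = math.exp(h * z * z / 2.0)
    if g == 0.0:
        return z * tail
    return (math.exp(g * z) - 1.0) / g * tail

def population_mean(g, h):
    """Population mean for the two simple cases covered by this sketch."""
    if g == 0.0:
        return 0.0                                  # symmetric around zero (h < 1)
    if h == 0.0:
        return (math.exp(g * g / 2.0) - 1.0) / g    # since E[exp(gZ)] = exp(g^2 / 2)
    raise ValueError("this sketch only covers g == 0 or h == 0")

def type_one_error(g, h, n=20, n_sim=4000, seed=4):
    """Proportion of significant t-tests on the mean when H0 is true."""
    rng = random.Random(seed)
    mu = population_mean(g, h)
    hits = 0
    for _ in range(n_sim):
        x = [ghdist(rng.gauss(0.0, 1.0), g, h) for _ in range(n)]
        t = (statistics.fmean(x) - mu) / (statistics.stdev(x) / math.sqrt(n))
        hits += abs(t) > T_CRIT_DF19
    return hits / n_sim

settings = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (0.0, 0.2), (0.0, 0.4)]
rates = {(g, h): type_one_error(g, h) for g, h in settings}
for (g, h), rate in rates.items():
    print("g =", g, "h =", h, "type I error rate:", rate)
```

Replacing the mean-based t statistic with a test on a trimmed mean or on the median would give the corresponding comparison for the other two estimators.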
Using a 20% trimmed mean improves matters considerably because trimming tends to get rid of outliers.
Using the median leads to even better results.
The simulations presented here are far from exhaustive:
- they are limited to the one-sample case;
- they only consider three statistical tests;
- they do not consider conjunctions of
But they help make an important point: when sampling from non-normal distributions, the type I error rate is probably not at the nominal level when making inferences on the mean. And the problem is exacerbated with small sample sizes. So, in all the current discussions about alpha levels, we must also consider the types of distributions we are investigating and the estimators used to make inferences about them. See for instance simulations looking at the two independent sample case here.
Bradley, J. V. (1978). Robustness? British Journal of Mathematical & Statistical Psychology, 31, 144-152.
Micceri, T. (1989) The Unicorn, the Normal Curve, and Other Improbable Creatures. Psychol Bull, 105, 156-166.
Pernet, C.R., Poline, J.B., Demonet, J.F. & Rousselet, G.A. (2009) Brain classification reveals the right cerebellum as the best biomarker of dyslexia. BMC Neurosci, 10, 67.
Wilcox, R.R. (2012) Introduction to robust estimation and hypothesis testing. Academic Press, San Diego, CA.
Wilcox, Rand; Rousselet, Guillaume (2017): A guide to robust statistical methods in neuroscience. figshare. https://doi.org/10.6084/m9.figshare.5114275.v1