Your power is lower than you think

The previous post considered alpha when sampling from normal and non-normal distributions. Here the simulations are extended to look at power in the one-sample case. Statistical power is the long-term probability of returning a significant test when there is an effect, or the probability of true positives.

Power depends on both sample size and effect size. This is illustrated in the figure below, which reports simulations with 5,000 iterations, using a t-test on means applied to samples from a normal distribution.

[Figure: power of the t-test on means as a function of sample size, for different effect sizes, under normality]
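Here is a minimal sketch of such a simulation; the grids of sample sizes and effect sizes below are illustrative choices, not necessarily those used for the figure.

```r
# Sketch: one-sample power simulation for the t-test on means under normality.
# nvec and esvec are illustrative assumptions.
set.seed(21)
nsim <- 5000                  # simulation iterations
nvec <- seq(10, 100, 10)     # sample sizes
esvec <- c(0.2, 0.5, 0.8)    # effect sizes (shifts from zero)
power <- matrix(NA, nrow = length(nvec), ncol = length(esvec),
                dimnames = list(nvec, esvec))
for (j in seq_along(esvec)) {
  for (i in seq_along(nvec)) {
    pvals <- replicate(nsim, t.test(rnorm(nvec[i], mean = esvec[j]))$p.value)
    power[i, j] <- mean(pvals <= 0.05)  # proportion of true positives
  }
}
round(power, 2)
```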

Now, let’s look at power in less idealistic conditions, for instance when sampling from a lognormal distribution, which is strongly positively skewed. This power simulation used 10,000 iterations and an effect size of 0.5, i.e. we sample from distributions that are shifted by 0.5 from zero.

[Figure: power as a function of sample size for the mean, 20% trimmed mean and median, sampling from normal (dashed) and lognormal (solid) distributions]

Under normality (dashed lines), as expected the mean performs better than the 20% trimmed mean and the median: we need smaller sample sizes to reach 80% power when using the mean. However, when sampling from a lognormal distribution the ranking of the three estimators is completely reversed: now the mean performs worst, the 20% trimmed mean performs much better, and the median performs best. So when sampling from a skewed distribution, the choice of statistical test can have large effects on power. In particular, a t-test on the mean can have very low power, whereas a t-test on a trimmed mean, or a test on the median, can provide much higher power.
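The sketch below gives the flavour of this comparison for the mean and the 20% trimmed mean. It is not the post's actual simulation code, the settings are illustrative, and tm.pval is a helper written for this example, using the winsorized standard error described in the Trimmed means post below. A test on the median needs a different procedure (for instance a percentile bootstrap) and is omitted here.

```r
# Sketch: power under lognormal sampling, t-test on means vs a one-sample
# test on 20% trimmed means (Tukey-McLaughlin). Illustrative settings only.
set.seed(44)
# Population values of the unshifted lognormal, estimated by Monte Carlo,
# so that shifting the whole distribution by 0.5 shifts each population
# quantity (mean, trimmed mean) by 0.5.
big <- rlnorm(1e6)
mu0 <- mean(big)              # population mean, close to exp(0.5)
tm0 <- mean(big, trim = 0.2)  # population 20% trimmed mean
# Helper written for this example: p value of a one-sample trimmed-mean test
tm.pval <- function(x, tr = 0.2, mu = 0) {
  n <- length(x)
  g <- floor(tr * n)
  xs <- sort(x)
  xs[1:g] <- xs[g + 1]              # winsorize lower tail
  xs[(n - g + 1):n] <- xs[n - g]    # winsorize upper tail
  se <- sd(xs) / ((1 - 2 * tr) * sqrt(n))
  2 * pt(-abs((mean(x, trim = tr) - mu) / se), df = n - 2 * g - 1)
}
nsim <- 2000; n <- 30; es <- 0.5    # illustrative settings
p.m  <- replicate(nsim, t.test(rlnorm(n) + es, mu = mu0)$p.value)
p.tm <- replicate(nsim, tm.pval(rlnorm(n) + es, mu = tm0))
mean(p.m <= 0.05)   # power of the t-test on means
mean(p.tm <= 0.05)  # power of the trimmed-mean test
```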

As we did in the previous post, let’s look at power in different situations in which we vary the asymmetry and the tails of the distributions. The effect size is 0.5.

Asymmetry manipulation

A t-test on means performs very well under normality (g=0), as we saw in the previous figure. However, as asymmetry increases, power is strongly affected. With large asymmetry (g>1) the t-test is biased: power initially decreases as sample size increases, before going up again in some situations.

[Figure: power as a function of sample size and g, t-test on means]

A t-test using a 20% trimmed mean is dramatically less affected by asymmetry than the mean.

[Figure: power as a function of sample size and g, t-test on 20% trimmed means]

The median also performs much better than the mean but it behaves differently from the mean and the 20% trimmed mean: power increases with increasing asymmetry!

[Figure: power as a function of sample size and g, test on medians]

Tail manipulation

What happens when we manipulate the tails instead? Remember that samples from distributions with heavy tails tend to contain outliers, which disproportionately affect the mean and the variance compared to robust estimators. Not surprisingly, t-tests on means are strongly affected by heavy tails.

[Figure: power as a function of sample size and h, t-test on means]

The 20% trimmed mean boosts power markedly, although it is still affected by heavy tails.

[Figure: power as a function of sample size and h, t-test on 20% trimmed means]

The median performs the best, showing very limited power drop with increasing tail thickness.

[Figure: power as a function of sample size and h, test on medians]

Conclusion

The simulations presented here are of course limited, but they serve as a reminder that power should be estimated using realistic distributions, for instance when the quantities of interest are known to follow skewed distributions, such as reaction times. The choice of estimator is also critical, and it would be wise to consider robust estimators whenever appropriate.

References

Wilcox, R.R. (2012) Introduction to robust estimation and hypothesis testing. Academic Press, San Diego, CA.

Wilcox, R.R. & Rousselet, G.A. (2017) A guide to robust statistical methods in neuroscience. figshare. https://doi.org/10.6084/m9.figshare.5114275.v1

Your alpha is probably not 0.05

The R code to run the simulations and create the figures is on github.

The alpha level, or type I error rate, or significance level, is the long-term probability of false positives: obtaining a significant test when there is in fact no effect. Alpha is traditionally given the arbitrary value of 0.05, meaning that in the long run, when there is no effect, we will declare significance in 5% of our tests. However, on average, we will be wrong much more often (Colquhoun, 2014). As a consequence, some have advocated lowering alpha to avoid fooling ourselves too often in the long run. Others have suggested avoiding the term “statistically significant” altogether and justifying the choice of alpha. Another very sensible suggestion is to not bother with arbitrary thresholds at all:

Colquhoun, D. (2014) An investigation of the false discovery rate and the misinterpretation of p-values. R Soc Open Sci, 1, 140216.

Justify Your Alpha: A Response to “Redefine Statistical Significance”

When the statistical tail wags the scientific dog

Here I want to address a related but different problem: assuming we’re happy with setting alpha to a particular value, say 0.05, is alpha actually equal to its expected, nominal value in realistic situations?

Let’s check using one-sample estimation as an example. First, we assume we live in a place of magic, where unicorns are abundant (Micceri 1989): we draw random samples from a standard normal distribution. For each sample of a given size, we perform a t-test. We do that 5,000 times and record the proportion of false positives. The results appear in the figure below:

[Figure: false positive rate of the t-test on means as a function of sample size, under normality]

As expected under normality, alpha is very near 0.05 for all sample sizes. Bradley (1978) suggested that keeping the probability of a type I error between 0.025 and 0.075 is satisfactory if the goal is to achieve 0.05. So we illustrate that satisfactory zone as a grey band in the figure above and subsequent ones.

So far so good, but how many quantities we measure have perfectly symmetric distributions with well-behaved tails extending to infinity? Many (most?) quantities related to time measurements are positively skewed and bounded at zero: behavioural reaction times, onset and peak latencies from recordings of single neurones, EEG, MEG… Percent correct data are bounded [0, 1]. Physiological measurements are naturally bounded, or bounded by our measuring equipment (EEG amplifiers for instance). Brain imaging data can have non-normal distributions (Pernet et al. 2009). The list goes on…

So what happens when we sample from a skewed distribution? Let’s look at an example using a lognormal distribution. This time we run 10,000 iterations sampling from a normal distribution and a lognormal distribution, and each time we apply a t-test:

[Figure: false positive rates as a function of sample size for the mean (blue), 20% trimmed mean (green) and median (red), sampling from normal (dashed) and lognormal (solid) distributions]

In the normal case (dashed blue curve), results are similar to those obtained in the previous figure: we’re very near 0.05 at all sample sizes. In the lognormal case (solid blue curve) the type I error rate is much larger than 0.05 for small sample sizes. It goes down with increasing sample size, but is still above 0.05 even with 500 observations. The point is clear: if we sample from skewed distributions, our type I error rate is larger than the nominal level, which means that in the long run, we will commit more false positives than we thought.
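A minimal sketch of the core of this simulation for the t-test on means is shown below. One way to make the null hypothesis true for the mean is to centre the lognormal samples on their population mean, exp(1/2); the sample sizes shown are illustrative.

```r
# Sketch: false positive rate of the one-sample t-test when sampling from a
# standard normal vs a mean-centred lognormal distribution.
set.seed(7)
nsim <- 10000
fpr <- function(rng, n) {
  mean(replicate(nsim, t.test(rng(n))$p.value) <= 0.05)
}
rnorm0  <- function(n) rnorm(n)               # population mean = 0
rlnorm0 <- function(n) rlnorm(n) - exp(1/2)   # lognormal centred on its mean
sapply(c(10, 30, 100), function(n) c(normal = fpr(rnorm0, n),
                                     lognormal = fpr(rlnorm0, n)))
```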

Fortunately, there is a solution: the detrimental effects of skewness can be limited by trimming the data. Under normality, t-tests on 20% trimmed means do not perform as well as t-tests on means: they trigger more false positives for small sample sizes (dashed green). Sampling from a lognormal distribution also increases the type I error rate of the 20% trimmed mean (solid green), but clearly not as much as what we observed for the mean. The median (red) performs even better, with estimates very near 0.05 whether we sample from normal or lognormal distributions.

To gain a broader perspective, we compare the mean, the 20% trimmed mean and the median in different situations. To do that, we use g & h distributions (Wilcox 2012). These distributions have a median of zero; the parameter g controls the asymmetry of the distribution, while the parameter h controls the tails.

Here are examples of distributions with h = 0 and varying g. The lognormal distribution corresponds to g = 1. The standard normal distribution is defined by g = h = 0.

[Figure: g-and-h distributions with h = 0 and varying g]

Here are examples of distributions with g = 0 and varying h. The tails get heavier with larger values of h.

[Figure: g-and-h distributions with g = 0 and varying h]
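Samples from g & h distributions can be generated by transforming standard normal draws. The function below is a minimal sketch of what Wilcox's ghdist() function does, assuming the standard g-and-h transformation.

```r
# Minimal g-and-h random number generator: transform standard normal draws.
# g controls asymmetry, h controls tail thickness; g = h = 0 gives a
# standard normal.
ghdist <- function(n, g = 0, h = 0) {
  z <- rnorm(n)
  x <- if (g == 0) z else (exp(g * z) - 1) / g  # asymmetry
  x * exp(h * z^2 / 2)                          # tail thickening
}
# example: g = 1, h = 0 corresponds to a (shifted) lognormal
hist(ghdist(10000, g = 1, h = 0), breaks = 100, main = "g = 1, h = 0")
```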

Asymmetry manipulation

Let’s first look at the results for t-tests on means:

[Figure: false positive rate as a function of sample size and g, t-test on means]

The type I error rate is strongly affected by asymmetry, and the effect gets worse with smaller sample sizes.

The effect of asymmetry is much less dramatic if we use a 20% trimmed mean:

[Figure: false positive rate as a function of sample size and g, t-test on 20% trimmed means]

And a test on the median doesn’t seem to be affected by asymmetry at all, irrespective of sample size:

[Figure: false positive rate as a function of sample size and g, test on medians]

The median performs better because it provides a more accurate estimation of location than the mean.

Tail manipulation

If instead of manipulating the degree of asymmetry we manipulate the thickness of the tails, the type I error rate now goes down as tails get heavier. This is because sampling from distributions with heavy tails tends to generate outliers, which tend to inflate the variance. The consequence is an alpha below the nominal level and low statistical power (as seen in the next post).

[Figure: false positive rate as a function of sample size and h, t-test on means]

Using a 20% trimmed mean improves matters considerably, because trimming tends to get rid of outliers.

[Figure: false positive rate as a function of sample size and h, t-test on 20% trimmed means]

Using the median leads to even better results.

[Figure: false positive rate as a function of sample size and h, test on medians]

Conclusion

The simulations presented here are far from exhaustive:

  • they are limited to the one-sample case;

  • they only consider three statistical tests;

  • they do not consider conjunctions of g and h values.

But they help make an important point: when sampling from non-normal distributions, the type I error rate is probably not at the nominal level when making inferences on the mean. And the problem is exacerbated with small sample sizes. So, in all the current discussions about alpha levels, we must also consider the types of distributions we are investigating and the estimators used to make inferences about them. See for instance simulations looking at the two independent sample case here.

References

Bradley, J. V. (1978). Robustness? British Journal of Mathematical & Statistical Psychology, 31, 144-152.

Micceri, T. (1989) The Unicorn, the Normal Curve, and Other Improbable Creatures. Psychol Bull, 105, 156-166.

Pernet, C.R., Poline, J.B., Demonet, J.F. & Rousselet, G.A. (2009) Brain classification reveals the right cerebellum as the best biomarker of dyslexia. BMC Neurosci, 10, 67.

Wilcox, R.R. (2012) Introduction to robust estimation and hypothesis testing. Academic Press, San Diego, CA.

Wilcox, R.R. & Rousselet, G.A. (2017) A guide to robust statistical methods in neuroscience. figshare. https://doi.org/10.6084/m9.figshare.5114275.v1

Trimmed means

The R code for this post is on github.

Trimmed means are robust estimators of central tendency. To compute a trimmed mean, we remove a predetermined amount of observations on each side of a distribution, and average the remaining observations. If you think you’re not familiar with trimmed means, you already know one famous member of this family: the median. Indeed, the median is an extreme trimmed mean, in which all observations are removed except one or two.
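You can check this in base R: at the time of writing, mean() falls back to the median when the trim argument is set to its maximum value of 0.5.

```r
x <- c(2, 4, 6, 8, 100)
mean(x, trim = 0.5)  # with trim = 0.5, mean() returns the median
median(x)            # 6 in both cases
```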

Using trimmed means confers two advantages:

  • trimmed means provide a better estimation of the location of the bulk of the observations than the mean when sampling from asymmetric distributions;
  • the standard error of the trimmed mean is less affected by outliers and asymmetry than that of the mean, so that tests using trimmed means can have more power than tests using the mean.

Important point: if we use a trimmed mean in an inferential test (see below), we make inferences about the population trimmed mean, not the population mean. The same is true for the median or any other measure of central tendency. So each robust estimator is a tool to answer a specific question, and this is why different estimators can return different answers…

Here is how we compute a 20% trimmed mean.

Let’s consider a sample of 20 observations:

39 92 75 61 45 87 59 51 87 12  8 93 74 16 32 39 87 12 47 50

First we sort them:

8 12 12 16 32 39 39 45 47 50 51 59 61 74 75 87 87 87 92 93

The number of observations to remove is floor(0.2 * 20) = 4. So we trim 4 observations from each end:

(8 12 12 16) 32 39 39 45 47 50 51 59 61 74 75 87 (87 87 92 93)

And we take the mean of the remaining observations, such that our 20% trimmed mean = mean(c(32,39,39,45,47,50,51,59,61,74,75,87)) = 54.92
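In R, this whole computation is a single call to mean() with the trim argument:

```r
x <- c(39, 92, 75, 61, 45, 87, 59, 51, 87, 12, 8, 93, 74, 16, 32, 39, 87, 12, 47, 50)
sort(x)              # 8 12 12 16 32 ... 87 92 93
mean(x, trim = 0.2)  # 54.91667: floor(0.2 * 20) = 4 observations trimmed per side
```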

Let’s illustrate the trimming process with a normal distribution and 20% trimming:

[Figure: 20% trimming applied to a normal distribution]

We can see how trimming gets rid of the tails of the distribution, to focus on the bulk of the observations. This behaviour is particularly useful when dealing with skewed distributions, as shown here:

[Figure: 20% trimming applied to a skewed (F) distribution]

In this skewed distribution (it’s an F distribution), there is more variability on the right side, which appears stretched compared to the left side. Because we trim the same amount on each side, trimming removes a longer chunk of the distribution on the right side than on the left side. As a consequence, the mean of the remaining points is more representative of the location of the bulk of the observations. This can be seen in the following examples.

[Figure: kernel density estimates of samples from a normal distribution (panel A) and a lognormal distribution (panel B), with the mean, 20% trimmed mean and median marked]

Panel A shows the kernel density estimate of 100 observations sampled from a standard normal distribution (MCT stands for measure of central tendency). By chance, the distribution is not perfectly symmetric, but the mean, 20% trimmed mean and median give very similar estimates, as expected. In panel B, however, the sample is from a lognormal distribution. Because of the asymmetry of the distribution, the mean is dragged towards the right side of the distribution, away from the bulk of the observations. The 20% trimmed mean is to the left of the mean, and the median further to the left, closer to the location of most observations. Thus, for asymmetric distributions, trimmed means provide more accurate information about central tendency than the mean.

**Q: “By trimming, don’t we lose information?”**

I have heard that question over and over. Statistical methods are only tools to answer specific questions, so the answer depends on your goal. I have never met anyone with a true interest in the mean: the mean is always used, implicitly or explicitly, as a tool to indicate the location of the bulk of the observations. Thus, if your goal is to estimate central tendency, then no, trimming doesn’t discard information: it actually increases the quality of the information about central tendency.

I have also heard this criticism: “I’m interested in the tails of the distributions and that’s why I use the mean; trimming gets rid of them”. Tails certainly have interesting stories to tell, but the mean is absolutely not the tool to study them, because it mingles all observations into one value, so we have no way to tell why means differ among samples. If you want to study entire distributions, there are fantastic graphical tools available (Rousselet, Pernet & Wilcox 2017).

Implementation

Base R has trimmed means built in: mean can be used by setting the trim argument to the desired amount of trimming, so that mean(x, trim = 0.2) gives a 20% trimmed mean.

In Matlab, try the tm function available here.

In Python, try the scipy.stats.trim_mean function, which trims a proportion of observations from each end (note that scipy.stats.tmean instead excludes observations outside given value limits). More Python functions are listed here.

Inferences

There are plenty of R functions using trimmed means on Rand Wilcox’s website.

We can use trimmed means instead of means in t-tests. However, the calculation of the standard error is different from the traditional t-test formula. This is because after trimming observations, the remaining observations are no longer independent. The formula for the adjusted standard error was originally proposed by Karen Yuen in 1974, and it involves winsorization. To winsorize a sample, instead of removing the most extreme observations, we replace them with the most extreme values remaining after trimming. So in our example, a 20% winsorized sample is:

32 32 32 32 32 39 39 45 47 50 51 59 61 74 75 87 87 87 87 87

Taking the mean of the winsorized sample gives a winsorized mean; taking the variance of the winsorized sample gives a winsorized variance, etc. I’ve never seen anyone use winsorized means; however, the winsorized variance is used to compute the standard error of the trimmed mean (Yuen 1974). There is also a full mathematical explanation in Wilcox (2012).
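Here is a minimal sketch of the computation, assuming the standard error formula given in Wilcox (2012); winsorize() is a helper written for this example, not a base R function.

```r
# Helper written for this example: replace the g most extreme values on each
# side with the closest remaining values (20% winsorizing by default)
winsorize <- function(x, tr = 0.2) {
  xs <- sort(x)
  n <- length(x)
  g <- floor(tr * n)
  xs[1:g] <- xs[g + 1]
  xs[(n - g + 1):n] <- xs[n - g]
  xs
}
x <- c(39, 92, 75, 61, 45, 87, 59, 51, 87, 12, 8, 93, 74, 16, 32, 39, 87, 12, 47, 50)
winsorize(x)  # 32 32 32 32 32 39 ... 87 87 87 87 87
# standard error of the 20% trimmed mean: winsorized standard deviation
# divided by (1 - 2*tr) * sqrt(n)  (Wilcox, 2012)
tr <- 0.2
sd(winsorize(x, tr)) / ((1 - 2 * tr) * sqrt(length(x)))
```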

You can use all the functions below to make inferences about means too, by setting tr=0. How much trimming to use is an empirical question that depends on the type of distributions you deal with. By default, all functions set tr=0.2 (20% trimming), which has been studied a lot and seems to provide a good compromise. Most functions will return an error, and suggest an alternative function, if you set tr=0.5: the standard error calculation is inaccurate for the median, and often the only satisfactory solution is to use a percentile bootstrap.

**Q: “With trimmed means, isn’t there a danger of users trying different amounts of trimming and reporting the one that gives them significant results?”**

This is indeed a possibility, but dishonesty is a property of the user, not a property of the tool. In fact, trying different amounts of trimming could be very informative about the nature of the effects. Reporting the different results, along with graphical representations, could help provide a more detailed description of the effects.

The Yuen t-test performs better than the t-test on means in many situations. For even better results, Wilcox recommends using trimmed means with a percentile-t bootstrap or a percentile bootstrap. With small amounts of trimming, the percentile-t bootstrap performs better; with at least 20% trimming, the percentile bootstrap is preferable. Details about these choices are available for instance in Wilcox (2012) and Wilcox & Rousselet (2017).

Yuen’s approach

1-alpha confidence interval for the trimmed mean: trimci(x,tr=.2,alpha=0.05)

Yuen t-test for 2 independent groups: yuen(x,y,tr=.2)

Yuen t-test for 2 dependent groups: yuend(x,y,tr=.2)

Bootstrap percentile-t method

One group: trimcibt(x,tr=.2,alpha=.05,nboot=599)

Two independent groups: yuenbt(x,y,tr=.2,alpha=.05,nboot=599)

Two dependent groups: ydbt(x,y,tr=.2,alpha=.05,nboot=599)

Percentile bootstrap approach

One group: trimpb(x,tr=.2,alpha=.05,nboot=2000)

Two independent groups: trimpb2(x,y,tr=.2,alpha=.05,nboot=2000)

Two dependent groups: dtrimpb(x,y=NULL,alpha=.05,con=0,est=mean)
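To see the logic behind the one-group percentile bootstrap, here is a minimal sketch of what trimpb() does (the general idea, not its exact code), using the defaults mentioned above:

```r
# Minimal percentile bootstrap CI for a 20% trimmed mean (logic of trimpb())
set.seed(1)
x <- c(39, 92, 75, 61, 45, 87, 59, 51, 87, 12, 8, 93, 74, 16, 32, 39, 87, 12, 47, 50)
nboot <- 2000
boot.tm <- replicate(nboot, mean(sample(x, replace = TRUE), trim = 0.2))
quantile(boot.tm, probs = c(0.025, 0.975))  # 95% percentile bootstrap CI
```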

Matlab

There are some Matlab functions here:

tm – trimmed mean

yuen – t-test for 2 independent groups

yuend – t-test for 2 dependent groups

winvar – winsorized variance

winsample – winsorized sample

wincov – winsorized covariance

These functions can be used with several estimators, including trimmed means:

pb2dg – percentile bootstrap for 2 dependent groups

pb2ig – percentile bootstrap for 2 independent groups

pbci – percentile bootstrap for 1 group

Several functions for trimming large arrays and computing confidence intervals are available in the LIMO EEG toolbox.

References

Yuen, K.K. (1974) The two-sample trimmed t for unequal population variances. Biometrika, 61, 165-170. https://doi.org/10.1093/biomet/61.1.165

Rousselet, G.A., Pernet, C.R. & Wilcox, R.R. (2017) Beyond differences in means: robust graphical methods to compare two groups in neuroscience. figshare. https://doi.org/10.6084/m9.figshare.4055970.v7

Wilcox, R.R. & Rousselet, G.A. (2017) A guide to robust statistical methods in neuroscience. bioRxiv 151811. https://doi.org/10.1101/151811

Wilcox, R.R. (2012) Introduction to robust estimation and hypothesis testing. Academic Press, San Diego, CA.