
Your power is lower than you think

The previous post considered alpha when sampling from normal and non-normal distributions. Here the simulations are extended to look at power in the one-sample case. Statistical power is the long-term probability of obtaining a significant result when there is an effect, i.e. the probability of true positives.

Power depends on both sample size and effect size. This is illustrated in the figure below, which reports simulations with 5,000 iterations, using a t-test on means applied to samples from a normal distribution.

[Figure: power_mean]
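Here is a minimal R sketch of this kind of power simulation (not the exact code behind the figure): for each sample size, draw samples from a normal distribution shifted by the effect size, apply a one-sample t-test, and record the proportion of p values below 0.05.

    nsim <- 5000                 # simulation iterations
    nvec <- seq(10, 100, 10)     # sample sizes (illustrative grid)
    es <- 0.5                    # effect size: shift from zero
    set.seed(21)
    power <- sapply(nvec, function(n) {
      pvals <- replicate(nsim, t.test(rnorm(n) + es)$p.value)
      mean(pvals < 0.05)         # proportion of true positives
    })
    round(power, 2)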

Now, let’s look at power in less idealistic conditions, for instance when sampling from a lognormal distribution, which is strongly positively skewed. This power simulation used 10,000 iterations and an effect size of 0.5, i.e. we sample from distributions that are shifted by 0.5 from zero.

[Figure: power]

Under normality (dashed lines), as expected the mean performs better than the 20% trimmed mean and the median: we need smaller sample sizes to reach 80% power when using the mean. However, when sampling from a lognormal distribution the performance of the three estimators is completely reversed: now the mean performs worse; the 20% trimmed mean performs much better; the median performs even better. So when sampling from a skewed distribution, the choice of statistical test can have large effects on power. In particular, a t-test on the mean can have very low power, whereas a t-test on a trimmed mean, or a test on the median can provide much larger power.
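One way to run a t-test on a trimmed mean is the Tukey-McLaughlin approach described in Wilcox (2012), in which the standard error of the trimmed mean is based on the Winsorized standard deviation. Here is a minimal R sketch (not the code used for the simulations):

    # One-sample t-test on a trimmed mean (Tukey-McLaughlin).
    # x: data vector; tr: trimming proportion; mu: null value.
    trim_ttest <- function(x, tr = 0.2, mu = 0) {
      n <- length(x)
      g <- floor(tr * n)                 # observations trimmed at each end
      xs <- sort(x)
      xw <- xs
      if (g > 0) {
        xw[1:g] <- xs[g + 1]             # Winsorize the lower tail
        xw[(n - g + 1):n] <- xs[n - g]   # Winsorize the upper tail
      }
      tm <- mean(x, trim = tr)           # trimmed mean
      se <- sd(xw) / ((1 - 2 * tr) * sqrt(n))
      tval <- (tm - mu) / se
      df <- n - 2 * g - 1
      2 * pt(-abs(tval), df)             # two-sided p value
    }

    trim_ttest(rnorm(30) + 0.5)          # example call

For the median, a percentile bootstrap test or one of the dedicated tests described in Wilcox (2012) can be used instead.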

As we did in the previous post, let’s look at power in different situations in which we vary the asymmetry and the tails of the distributions. The effect size is 0.5.

Asymmetry manipulation

A t-test on means performs very well under normality (g=0), as we saw in the previous figure. However, as asymmetry increases, power is strongly affected. With large asymmetry (g>1) the t-test is biased: starting from very small sample sizes, power decreases as the sample size increases, before going up again in some situations.

[Figure: figure_power_g_mean]

A t-test using a 20% trimmed mean is dramatically less affected by asymmetry than the mean.

[Figure: figure_power_g_tmean]

The median also performs much better than the mean but it behaves differently from the mean and the 20% trimmed mean: power increases with increasing asymmetry!

[Figure: figure_power_g_median]

Tail manipulation

What happens when we manipulate the tails instead? Remember that samples from distributions with heavy tails tend to contain outliers, which disproportionately affect the mean and the variance compared to robust estimators. Not surprisingly, t-tests on means are strongly affected by heavy tails.

[Figure: figure_power_h_mean]

The 20% trimmed mean markedly improves power, although it is still affected by heavy tails.

[Figure: figure_power_h_tmean]

The median performs the best, showing very limited power drop with increasing tail thickness.

[Figure: figure_power_h_median]

Conclusion

The simulations presented here are of course limited, but they serve as a reminder that power should be estimated using realistic distributions, for instance when the quantities of interest are known to follow skewed distributions, such as reaction times. The choice of estimator is also critical, and it would be wise to consider robust estimators whenever appropriate.

References

Wilcox, R.R. (2012) Introduction to robust estimation and hypothesis testing. Academic Press, San Diego, CA.

Wilcox, R.R. & Rousselet, G.A. (2017) A guide to robust statistical methods in neuroscience. figshare. https://doi.org/10.6084/m9.figshare.5114275.v1


Your alpha is probably not 0.05

The R code to run the simulations and create the figures is on github.

The alpha level, or type I error rate, or significance level, is the long-term probability of false positives: obtaining a significant test when there is in fact no effect. Alpha is traditionally given the arbitrary value of 0.05, meaning that, in the long run, 5% of our tests of true null hypotheses will return false positives. However, on average, we will be wrong much more often (Colquhoun, 2014). As a consequence, some have advocated lowering alpha to avoid fooling ourselves too often in the long run. Others have suggested avoiding the term “statistically significant” altogether, and justifying the choice of alpha. Another very sensible suggestion is to not bother with arbitrary thresholds at all:

Colquhoun, D. (2014) An investigation of the false discovery rate and the misinterpretation of p-values. R Soc Open Sci, 1, 140216.

Justify Your Alpha: A Response to “Redefine Statistical Significance”

When the statistical tail wags the scientific dog

Here I want to address a related but different problem: assuming we’re happy with setting alpha to a particular value, say 0.05, is alpha actually equal to its expected, nominal value in realistic situations?

Let’s check using one-sample estimation as an example. First, we assume we live in a place of magic, where unicorns are abundant (Micceri 1989): we draw random samples from a standard normal distribution. For each sample of a given size, we perform a t-test. We do that 5,000 times and record the proportion of false positives. The results appear in the figure below:

[Figure: alpha_mean]
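A minimal R sketch of this kind of false-positive simulation (not the exact code, which is on github): the samples come from a standard normal distribution, so the null hypothesis of a zero mean is true and every rejection is a false positive.

    niter <- 5000                            # simulation iterations
    nvec <- c(10, 20, 50, 100, 200, 500)     # sample sizes (illustrative grid)
    set.seed(44)
    alpha_hat <- sapply(nvec, function(n) {
      pvals <- replicate(niter, t.test(rnorm(n))$p.value)
      mean(pvals < 0.05)                     # proportion of false positives
    })

For the lognormal case considered next, one way to keep the null hypothesis about the mean true is to test against the population mean of the sampled distribution (exp(0.5) for rlnorm with default parameters).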

As expected under normality, alpha is very near 0.05 for all sample sizes. Bradley (1978) suggested that keeping the probability of a type I error between 0.025 and 0.075 is satisfactory if the goal is to achieve 0.05. So we illustrate that satisfactory zone as a grey band in the figure above and subsequent ones.

So far so good, but how many quantities we measure have perfectly symmetric distributions with well-behaved tails extending to infinity? Many (most?) quantities related to time measurements are positively skewed and bounded at zero: behavioural reaction times, onset and peak latencies from recordings of single neurones, EEG, MEG… Percent correct data are bounded [0, 1]. Physiological measurements are naturally bounded, or bounded by our measuring equipment (EEG amplifiers for instance). Brain imaging data can have non-normal distributions (Pernet et al. 2009). The list goes on…

So what happens when we sample from a skewed distribution? Let’s look at an example using a lognormal distribution. This time we run 10,000 iterations sampling from a normal distribution and a lognormal distribution, and each time we apply a t-test:

[Figure: alpha]

In the normal case (dashed blue curve), results are similar to those obtained in the previous figure: we’re very near 0.05 at all sample sizes. In the lognormal case (solid blue curve) the type I error rate is much larger than 0.05 for small sample sizes. It goes down with increasing sample size, but is still above 0.05 even with 500 observations. The point is clear: if we sample from skewed distributions, our type I error rate is larger than the nominal level, which means that in the long run, we will commit more false positives than we thought.

Fortunately, there is a solution: the detrimental effects of skewness can be limited by trimming the data. Under normality, t-tests on 20% trimmed means do not perform as well as t-tests on means: they trigger more false positives for small sample sizes (dashed green). Sampling from a lognormal distribution also increases the type I error rate of the 20% trimmed mean (solid green), but clearly not as much as what we observed for the mean. The median (red) performs even better, with estimates very near 0.05 whether we sample from normal or lognormal distributions.

To gain a broader perspective, we compare the mean, the 20% trimmed mean and the median in different situations. To do that, we use g & h distributions (Wilcox 2012). These distributions have a median of zero; the parameter g controls the asymmetry of the distribution, while the parameter h controls the tails.

Here are examples of distributions with h = 0 while g is varied. The lognormal distribution corresponds to g = 1. The standard normal distribution is defined by g = h = 0.

[Figure: figure_g_distributions]

Here are examples of distributions with g = 0 while h is varied. The tails get heavier with larger values of h.

[Figure: figure_h_distributions]
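For readers who want to experiment, g & h observations can be generated by transforming standard normal draws (the transformation is described in Wilcox, 2012). A minimal R sketch of such a generator:

    # Generate n observations from a g & h distribution.
    # g controls asymmetry, h controls tail thickness; g = h = 0 gives N(0,1).
    ghdist <- function(n, g = 0, h = 0) {
      z <- rnorm(n)
      if (g > 0) {
        x <- (exp(g * z) - 1) / g * exp(h * z^2 / 2)
      } else {
        x <- z * exp(h * z^2 / 2)
      }
      x
    }

    # g = 1, h = 0 corresponds to a lognormal shape with a median of zero
    hist(ghdist(10000, g = 1, h = 0), breaks = 100)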

Asymmetry manipulation

Let’s first look at the results for t-tests on means:

[Figure: figure_alpha_g_mean]

The type I error rate is strongly affected by asymmetry, and the effect gets worse with smaller sample sizes.

The effect of asymmetry is much less dramatic if we use a 20% trimmed mean:

[Figure: figure_alpha_g_tmean]

And a test on the median doesn’t seem to be affected by asymmetry at all, irrespective of sample size:

[Figure: figure_alpha_g_median]

The median performs better because it provides a more accurate estimate of location than the mean when distributions are skewed.

Tail manipulation

If, instead of manipulating the degree of asymmetry, we manipulate the thickness of the tails, the type I error rate now goes down as the tails get heavier. This is because sampling from distributions with heavy tails tends to generate outliers, which tend to inflate the variance. The consequence is an alpha below the nominal level and low statistical power (as seen in the next post).
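A quick, self-contained check of that intuition in R, using the g = 0 branch of the g & h transformation: heavier tails inflate the sample standard deviation, which shrinks the t statistic.

    set.seed(3)
    z <- rnorm(1e6)
    sd(z)                        # close to 1 under normality (h = 0)
    sd(z * exp(0.2 * z^2 / 2))   # noticeably larger for h = 0.2 (heavier tails)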

[Figure: figure_alpha_h_mean]

Using a 20% trimmed mean improves matters considerably because trimming tends to get rid of outliers.

[Figure: figure_alpha_h_tmean]

Using the median leads to even better results.

[Figure: figure_alpha_h_median]

Conclusion

The simulations presented here are by far not exhaustive:

  • they are limited to the one-sample case;

  • they only consider three statistical tests;

  • they do not consider conjunctions of g and h values.

But they help make an important point: when sampling from non-normal distributions, the type I error rate is probably not at the nominal level when making inferences on the mean. And the problem is exacerbated with small sample sizes. So, in all the current discussions about alpha levels, we must also consider the types of distributions we are investigating and the estimators used to make inferences about them. See for instance simulations looking at the two independent sample case here.

References

Bradley, J.V. (1978) Robustness? British Journal of Mathematical & Statistical Psychology, 31, 144-152.

Micceri, T. (1989) The Unicorn, the Normal Curve, and Other Improbable Creatures. Psychol Bull, 105, 156-166.

Pernet, C.R., Poline, J.B., Demonet, J.F. & Rousselet, G.A. (2009) Brain classification reveals the right cerebellum as the best biomarker of dyslexia. BMC Neurosci, 10, 67.

Wilcox, R.R. (2012) Introduction to robust estimation and hypothesis testing. Academic Press, San Diego, CA.

Wilcox, R.R. & Rousselet, G.A. (2017) A guide to robust statistical methods in neuroscience. figshare. https://doi.org/10.6084/m9.figshare.5114275.v1

How to illustrate a 2×2 mixed ERP design

Let’s consider a simple mixed ERP design with 2 repeated measures (2 tasks) and 2 independent groups of participants (young and older participants). The Matlab code and the data are available on github. The data are time-courses of mutual information, with one vector time-course per participant and task. These results are preliminary and have not been published yet, but you can get an idea of how we use mutual information in the lab in recent publications (Ince et al. 2016a, 2016b; Rousselet et al. 2014). The code and illustrations presented in the rest of the post are not specific to mutual information.

Our 2 x 2 experimental design could be analysed using the LIMO EEG toolbox for instance, by computing a 2 x 2 ANOVA at every time point, and correcting for multiple comparisons using cluster-based bootstrap statistics (Pernet et al. 2011, 2015). LIMO EEG has been used to investigate task effects for instance (Rousselet et al. 2011). But here, instead of ANOVAs, I’d like to concentrate on graphical representations and a non-parametric assessment of our simple group design, to focus on effect sizes and to demonstrate how a few figures can tell a rich data-driven story.

First, we illustrate the 4 cells of our design. Figure 1 shows separately each group and each task: in each cell all participants are superimposed using thin coloured lines. We can immediately see large differences among participants and between groups, with overall smaller effects (mutual information) in older participants. There also seem to be task differences, in particular in young participants, who tend to show more sustained effects past 200 ms in the expressive task than in the gender task.

[Figure: fig1_gpmi2x2]

Figure 1

To complement the individual traces, we can add measures of central tendency. The mean is shown with a thick green line, the median with a thick black line. See how the mean can be biased compared to the median in the presence of extreme values. The median was calculated using the Harrell-Davis estimator of the 50th quantile. To illustrate the group median with a measure of uncertainty, we can add a 95% percentile bootstrap confidence interval for instance (Figure 2).

[Figure: fig2_gpmi2x2_ci]
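The code for this post is in Matlab, but the same quantities are easy to obtain in R. As a rough sketch, at each time point one could compute the Harrell-Davis estimate of the median across participants and a percentile bootstrap confidence interval, for instance using hdquantile from the Hmisc package:

    library(Hmisc)   # hdquantile implements the Harrell-Davis quantile estimator

    # Harrell-Davis median + percentile bootstrap confidence interval
    hd_median_ci <- function(x, nboot = 2000, alpha = 0.05) {
      boot_md <- replicate(nboot, hdquantile(sample(x, replace = TRUE), probs = 0.5))
      list(estimate = hdquantile(x, probs = 0.5),
           ci = quantile(boot_md, probs = c(alpha / 2, 1 - alpha / 2)))
    }

    hd_median_ci(rlnorm(20))   # example with 20 hypothetical participants

Applied at every time point, this gives the median time-course and its confidence interval.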

We can immediately see discrepancies between the median time-courses and their confidence intervals on the one hand, and the individual time-courses on the other hand. There are indeed many distributions of participants that can lead to the same average time-course. That’s why it is essential to show individual results, at least in some illustrations.

In our 2 x 2 design, we now have 3 aspects to consider: group differences, task differences and their interactions. We illustrate them in turn.

Age group differences for each task

We can look at the group differences in each task separately, as shown in Figure 3. The median of each group is shown with its 95% percentile bootstrap confidence interval. On average, older participants tend to have weaker mutual information than young participants – less than half around 100-200 ms post-stimulus. This will need to be better quantified, for instance by reporting the median of all pairwise differences.

[Figure: fig3_gpmi_group_diff]

Figure 3

Under each panel showing the median + CI for each group, we plot the time-course of the group differences (young - older), with a confidence interval. For group comparisons we cannot illustrate individuals, because participants are not paired. However, we can illustrate all the bootstrap samples, shown in grey. Each sample was obtained by the steps below (a minimal R sketch follows the list):

  • sampling with replacement Ny observations among the Ny young observers
  • sampling with replacement No observations among the No older observers
  • computing the median of each group
  • subtracting the two medians
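Here is a rough R version of that procedure for a single time point (the post’s own implementation is in Matlab and uses the Harrell-Davis estimator; plain medians are used below to keep the sketch short):

    # Percentile bootstrap of the difference between group medians (young - older).
    # y, o: vectors of participant values at one time point.
    boot_group_diff <- function(y, o, nboot = 2000, alpha = 0.05) {
      d <- replicate(nboot,
        median(sample(y, replace = TRUE)) - median(sample(o, replace = TRUE)))
      list(diff = median(y) - median(o),
           ci = quantile(d, probs = c(alpha / 2, 1 - alpha / 2)),
           boot = d)   # keep the bootstrap samples to plot them in grey
    }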

It is particularly important to illustrate the bootstrap distributions if they are skewed or contain outliers, or both, to check that the confidence intervals provide a good summary. If the bootstrap samples are very skewed, highest density intervals might be a good alternative to classic confidence intervals.

The lower panels of Figure 3 reveal relatively large group differences in a narrow window within 200 ms. The effect also appears to be stronger in the expressive task. Technically, one could also say that the effects are statistically significant, in a frequentist sense, when the 95% confidence intervals do not include zero. But not much is gained from that because some effects are large and others are small. Correction for multiple comparisons would also be required.

Task differences for each group

Figure 4 has a similar layout to Figure 3, now focusing on the task differences. The top panels suggest that the group medians don’t differ much between tasks, except maybe in young participants around 300-500 ms.

[Figure: fig4_gpmi_task_effects]

Figure 4

Because task effects are paired, we are not limited to the comparison of the medians between tasks; we can also illustrate the individual task differences and the medians of these differences [1]. These are shown in the bottom panels of Figure 4. In both groups, the individual differences are large and the time-courses of the task differences are scattered around zero, except in the young group starting around 300 ms, where most participants have positive differences (expressive > gender).

[1] When the mean is used as a measure of central tendency, these two perspectives are identical, because the difference between two means is the same as the mean of the pairwise differences. However, this is not the case for the median: the difference between medians is not the same as the median of the differences. Because we are interested in effect sizes, it is more informative to report descriptive statistics of the pairwise differences. The advantage of the Matlab code provided with this post is that instead of looking at the median, we can also look at other quantiles, thus getting a better picture of the strength of the effects.
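A tiny R example, with hypothetical paired data, makes the distinction concrete:

    set.seed(1)
    x <- rlnorm(20)            # condition 1 (e.g. expressive task)
    y <- x + rlnorm(20) - 1    # condition 2, paired with condition 1
    mean(x) - mean(y); mean(x - y)   # equal for the mean (up to floating point)
    median(x) - median(y)            # difference between medians
    median(x - y)                    # median of the pairwise differences: generally different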

Interaction between tasks and groups

Finally, in Figure 5 we consider the interactions between task and group factors. To do that we first superimpose the medians of the task differences with their confidence intervals (top panel). These traces are the same as those shown in the bottom panels of Figure 4. I can’t say I’m very happy with the top panel of Figure 5 because the two traces are difficult to compare. Essentially they don’t seem to differ much, except maybe for the late effect in young participants being higher than what is observed in older participants.

[Figure: fig5_gpmi_task_group_interaction]

In the lower panel of Figure 5 we illustrate the age group differences (young - older) between the medians of the pairwise task differences. Again, confidence intervals are provided, along with the original bootstrap samples. Overall, there is very little evidence for a 2 x 2 interaction, suggesting that the age group differences are fairly stable across tasks. Put another way, the weak task effects don’t appear to change much in the two age groups.

References

Ince, R.A., Jaworska, K., Gross, J., Panzeri, S., van Rijsbergen, N.J., Rousselet, G.A. & Schyns, P.G. (2016a) The Deceptively Simple N170 Reflects Network Information Processing Mechanisms Involving Visual Feature Coding and Transfer Across Hemispheres. Cereb Cortex.

Ince, R.A., Giordano, B.L., Kayser, C., Rousselet, G.A., Gross, J. & Schyns, P.G. (2016b) A statistical framework for neuroimaging data analysis based on mutual information estimated via a gaussian copula. Hum Brain Mapp.

Pernet, C.R., Chauveau, N., Gaspar, C. & Rousselet, G.A. (2011) LIMO EEG: a toolbox for hierarchical LInear MOdeling of ElectroEncephaloGraphic data. Comput Intell Neurosci, 2011, 831409.

Pernet, C.R., Latinus, M., Nichols, T.E. & Rousselet, G.A. (2015) Cluster-based computational methods for mass univariate analyses of event-related brain potentials/fields: A simulation study. Journal of neuroscience methods, 250, 85-93.

Rousselet, G.A., Gaspar, C.M., Wieczorek, K.P. & Pernet, C.R. (2011) Modeling Single-Trial ERP Reveals Modulation of Bottom-Up Face Visual Processing by Top-Down Task Constraints (in Some Subjects). Front Psychol, 2, 137.

Rousselet, G.A., Ince, R.A., van Rijsbergen, N.J. & Schyns, P.G. (2014) Eye coding mechanisms in early human face event-related potentials. J Vis, 14, 7.

Problems with small sample sizes

In psychology and neuroscience, the typical sample size is too small. I’ve recently seen several neuroscience papers with n = 3-6 animals. For instance, this article uses n = 3 mice per group in a one-way ANOVA. This is a real problem because small sample size is associated with:

  • low statistical power

  • inflated false discovery rate

  • inflated effect size estimation

  • low reproducibility

Here is a list of excellent publications covering these points:

Button, K.S., Ioannidis, J.P., Mokrysz, C., Nosek, B.A., Flint, J., Robinson, E.S. & Munafo, M.R. (2013) Power failure: why small sample size undermines the reliability of neuroscience. Nature reviews. Neuroscience, 14, 365-376.

Colquhoun, D. (2014) An investigation of the false discovery rate and the misinterpretation of p-values. R Soc Open Sci, 1, 140216.

Forstmeier, W., Wagenmakers, E.J. & Parker, T.H. (2016) Detecting and avoiding likely false-positive findings – a practical guide. Biol Rev Camb Philos Soc.

Lakens, D., & Albers, C. J. (2017, September 10). When power analyses based on pilot data are biased: Inaccurate effect size estimators and follow-up bias. Retrieved from psyarxiv.com/b7z4q

See also these two blog posts on small n:

When small samples are problematic

Low Power & Effect Sizes

Small sample size also prevents us from properly estimating and modelling the populations we sample from. As a consequence, small n stops us from answering a fundamental, yet often ignored empirical question: how do distributions differ?

This important aspect is illustrated in the figure below. Columns show distributions that differ in four different ways. The rows illustrate samples of different sizes. The scatterplots were jittered using ggforce::geom_sina in R. The vertical black bars indicate the mean of each sample. In row 1, examples 1, 3 and 4 have exactly the same mean. In example 2 the means of the two distributions differ by 2 arbitrary units. The remaining rows illustrate random subsamples of data from row 1. Above each plot, the t value, mean difference and its confidence interval are reported. Even with 100 observations we might struggle to approximate the shape of the parent population. Without additional information, it can be difficult to determine if an observation is an outlier, particularly for skewed distributions. In column 4, samples with n = 20 and n = 5 are very misleading.

[Figure: figure1]
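As a rough, self-contained sketch of how this kind of panel can be built (hypothetical data; the actual figure code is more involved):

    library(ggplot2)
    library(ggforce)   # provides geom_sina

    set.seed(7)
    df <- data.frame(
      group = rep(c("group 1", "group 2"), each = 100),
      value = c(rnorm(100), rlnorm(100))   # hypothetical normal and skewed samples
    )

    ggplot(df, aes(x = group, y = value)) +
      geom_sina(alpha = 0.5) +             # jittered scatterplot of all observations
      stat_summary(fun = mean, fun.min = mean, fun.max = mean,
                   geom = "crossbar", width = 0.4)   # black bar marking each mean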

Small sample size could be less of a problem in a Bayesian framework, in which information from prior experiments can be incorporated in the analyses. In the blind and significance-obsessed frequentist world, small n is a recipe for disaster.