Monthly Archives: June 2016

The Harrell-Davis quantile estimator

Quantiles are robust and useful descriptive statistics. They belong to the family of L-estimators, which is to say that they are based on the linear combination of order statistics. They are several ways to compute quantiles. For instance, in R, the function quantile has 9 options. In Matlab, the quantile & prctile functions offer only 1 option. Here I’d like to introduce briefly yet another option: the Harrell-Davis quantile estimator (Harrell & Davis, 1982). It is the weighted average of all the order statistics (Figure 2). And, in combination with the percentile bootstrap, it is a useful tool to derive confidence intervals of quantiles (Wilcox 2012), as we will see quickly in this post. It is also a useful tool to derive confidence intervals of the difference between quantiles of two groups, as we will see in another post. As discussed previously in the percentile bootstrap post, to make accurate confidence intervals, we need to combine an estimator with a particular confidence interval building procedure, and the right combo is not obvious depending on the data at hand.

Before we motor on, a quick google search suggests that there is recent work to try to improve the Harrell-Davis estimator, so this not to say that this estimator is the best in all situations. But according to Rand Wilcox it works well in many situations, and we do use it a lot in the lab…

Let’s look at data from a paper on visual processing speed estimation (Bieniek et al. 2015). We consider ERP onsets from 120 participants aged 18 to 81.

The sorted ages are:

18 18 19 19 19 19 20 20 21 21 21 21 21 21 21 22 22 22 22 22 22 22 22 23 23 23 24 24 24 25 26 28 28 29 29 30 30 31 31 32 32 32 33 34 34 35 35 36 37 38 40 40 41 41 42 42 43 43 44 45 45 45 45 48 49 49 50 51 54 54 55 56 58 59 59 60 60 61 62 62 62 63 63 63 64 64 64 64 65 65 66 66 66 66 66 66 67 67 67 67 68 68 68 68 68 69 70 70 70 71 72 72 72 75 76 77 78 79 81 81

Fig1-age distribution

Figure 1. Age distribution.

The Matlab code to reproduce all the figures in this post is available on github. There is also a list of R functions from Rand Wilcox’s toolbox.

How do we compute Harrell-Davis quantiles of the age distribution? Figure 2 shows the Harrell-Davis weights for the deciles of the age distribution.

Fig2-weights

Figure 2. Decile weights.

The deciles are obtained by multiplying the sorted ages by the weights in Figure 2, which gives us:

21.1, 23.3, 29.7, 37.0, 45.3, 56.1, 63.3, 66.6, 70.4

For comparison, the age deciles from Matlab’s prctile function are:

21, 23, 30, 36, 45, 57, 64, 66, 70

Now, we can update the scatterplot in Figure 1 with the deciles:

Fig3-age deciles

Figure 3. Scatterplot + age deciles. The thick vertical black line marks the 50th quantiles.

We can also compute a confidence interval for a Harrell-Davis quantile. There are two ways to do that:

  • using a percentile bootstrap of the quantile (pbci approach);
  • using a percentile bootstrap estimate of the standard error of the quantile, which is then plugged into a confidence interval formula (pbse approach).

Using the code available with this post, we can try the two approaches on the median:

  • pbci approach gives 45.31 [35.89, 54.73]
  • pbse approach gives 45.31 [38.49, 54.40]

The two methods return similar upper bounds, but quite different lower bounds. Because they are both based on random resampling with replacement, running the same analysis several times will each time also give slightly different results. Actually, this is one important criterion to select a good bootstrap confidence interval technique: despite random sampling, using the same technique many times should provide overall similar results. Another important criterion is the probability coverage: if we build a 95% confidence interval, we want that confidence interval to contain the population value we’re trying to estimate 95% of the time. That’s right, the probability attached to a confidence interval is a long run coverage: assuming a population with a certain median, if we perform the same experiment over and over, every time drawing a sample of n observations and computing an (1-alpha)% confidence interval using the same technique, (1-alpha)% of these confidence intervals will contain the population median. So, if everything is fine (n is large enough, the number of bootstrap samples is large enough, the combination of bootstrap technique and estimator is appropriate), alpha% of the time (usually 5%), a confidence interval WILL NOT include the population parameter of interest. This implies that given the 1,000s of neuroscience & psychology experiments performed every year, 100s of paper report the wrong confidence intervals – but this possibility is never considered in the articles’ conclusions…

In many situations, the long run probability coverage can be actually much lower or much higher than (1-alpha). So can we check that we’re building accurate confidence intervals, at least in the long run? For that, we’ve got to run simulations. Here is an example. First, we create a fake population, for instance with a skewed distribution, which could reflect our belief of the nature of the population we’re studying:

Fig4-sim population

Figure 4. Population of 1,000,000 values with a 10 degrees of freedom chi2 distribution.

Second, we compute benchmark values, e.g. median, mean…

Third, we run simulations in which we perform fake experiments with a given sample size, and then compute confidence intervals of certain quantities. Finally, we check how often the different confidence intervals actually contain the population parameters (probability coverage):

  • pbse(hd) = 0.9530
  • pbci(hd) = 0.9473
  • pbci(median) = 0.9452
  • pbci(mean) = 0.9394

They’re all very close to 95%. However, the confidence intervals of hd created using the pbse approach tended to be larger than those created using the pbci approach. The confidence intervals for the mean missed the population mean 1% of the time compared to the expected 95% – that’s because they tended to be shorter than the other 3. The bootstrap estimates of the sampling distribution of hd, the median and the mean, as well as the width of the confidence intervals can be explored using the code on github.

Of course, no one is ever going to run 10,000 times the same experiment! And these results assume a certain population, a certain number of observations per experiment, and a certain number of bootstrap samples. We would need a more systematic exploration of the different combinations of options to be sure the present results are not special cases.

To be clear: there is absolutely no guarantee that any particular confidence interval contains the population parameter you’re trying to estimate. So be humble, and don’t make such a big deal about your confidence intervals, especially if you have small sample sizes.

Personally, more and more I use confidence intervals to try to describe the variability in the sample at hand. For that purpose, and to avoid potential inferential problems associated with confidence intervals, I think it is more satisfactory to use highest density intervals HDI. I will post R & Matlab functions to compute the HDI of the bootstrap quantiles on github at some stage. By reporting HDI, there are no associated p values and we minimise the temptation to cross proton streams (i.e. dichotomise a continuous variable to make a binary decision – MacCallum et al. 2002).

Finally, we consider something a bit more interesting than the age of our participants: the distribution of ERP onsets.

Here are the onsets in milliseconds:

Fig7-onset distribution

Figure 5. Onsets.

And the deciles with their confidence intervals, which provide a very nice summary of the distribution:

Fig8-onset deciles

Figure 6. Onset deciles with confidence intervals.

If you’re interested, I’ve also attempted a Bayesian estimation of the onset data using R and JAGS. However, I don’t know yet how to perform quantile estimation – please get in touch if you can help.

Conclusion

Now you’ve got the tools to describe a distribution in detail. There is no particular reason why we should be obsessed with the mean, especially when robust and more informative statistics are available. Next, I will show you how to compare all the deciles of two distributions using a mighty tool: the shift function. This will, of course, rely on the Harrell-Davis estimator and the bootstrap.

References

Bieniek, M.M., Bennett, P.J., Sekuler, A.B. & Rousselet, G.A. (2015) A robust and representative lower bound on object processing speed in humans. The European journal of neuroscience.

Harrell, F.E. & Davis, C.E. (1982) A new distribution-free quantile estimator. Biometrika, 69, 635-640.

MacCallum RC, Zhang S, Preacher KJ, Rucker DD. 2002. On the practice of dichotomization of quantitative variables. Psychological Methods 7: 19-40

Wilcox, R.R. (2012) Introduction to robust estimation and hypothesis testing. Academic Press.

Advertisements

How to chase ERP monsters hiding behind bars

I think detailed and informative illustrations of results is the most important step in the statistical analysis process, whether we’re looking at a single distribution, comparing groups, or dealing with more complex brain imaging data. Without careful illustrations, it can be difficult, sometimes impossible, to understand our results and to convey them to an audience. Yet, from specialty journals to Science & Nature, the norm is still to hide rich distributions behind bar graphs or one of their equivalents. For instance, in ERP (event-related potential) research, the equivalent of a bar graph looks like this:

figure1

Figure 1. ERP averages in 2 conditions. Paired design, n=30, cute little red star indicates p<0.05.

All the figures in this post can be reproduced using Matlab code available on github.

Figure 1 is very much standard in the field. It comes with a little star to attract your attention to one time point that has reached the magic p<0.05 threshold. Often, the ERP figure will be complemented with a bar graph:

figure1b

Figure 1b. Bar graph of means +/- SEM for conditions 1 & 2.

Ok, what’s wrong with this picture? You might argue that the difference is small, and that the statistical tests have probably not been corrected for multiple comparisons. And in many cases, you would be right. But many ERP folks would reply that because they focus their analyses on peaks, they do not need to correct for multiple comparisons. Well, unless you have a clear hypothesis for each peak, then you should at least correct for the number of peaks or time windows of interest tested if you’re willing to flag any effect p<0.05. I would also add that looking at peaks is wasteful and defeats the purpose of using EEG: it is much more informative to map the full time-course of the effects across all sensors, instead of throwing valuable data away (Rousselet & Pernet, 2011).

Another problem with Figure 1 is the difficulty in mentally subtracting two time-courses, which can lead to underestimating differences occurring between peaks. So, in the next figure, we show the mean difference as well:

figure2

Figure 2. Mean ERPs + mean difference. The black vertical line marks the time of the largest absolute difference between conditions.

Indeed, there is a modest bump in the difference time-course around the time of the significant effect marked by the little star. The effect looks actually more sustained than it appears by just looking at the time-courses of the two original conditions – so we learn something by looking at the difference time-course. The effect is much easier to interpret by adding some measure of accuracy, for instance a 95% confidence interval:

figure3

Figure 3. Mean ERPs + mean difference + confidence interval.

We could also show confidence intervals for condition 1 and condition 2 mean ERPs, but we are primarily interested in how they differ, so the focus should be on the difference. Figure 3 reveals that the significant effect is associated with a confidence interval only very slightly off the zero mark. Although p<0.05, the confidence interval suggests a weak effect, and Bayesian estimation might actually suggest no evidence against the null (Wetzels et al. 2011). And this is why the focus should be on robust effect sizes and their illustration, instead of binary outcomes resulting from the application of arbitrary thresholds. How do we proceed in this case? A simple measure of effect size is to report the difference, which in our case can be illustrated by showing the time-course of the difference for every participant (see a nice example in Kovalenko et al. 2012). And what’s lurking under the hood here? Monsters?

figure4

Figure 4. Mean ERPs + mean difference + confidence interval + individual differences.

Yep, it’s a mess of spaghetti monsters!

After contemplating a figure like that, I would be very cautious about my interpretation of the results. For instance, I would try to put the results into context, looking carefully at effect sizes and how they compare to other manipulations, etc. I would also be very tempted to run a replication of the experiment. This can be done in certain experimental situations on the same participants, to see if effect sizes are similar across sessions (Bieniek et al. 2015). But I would certainly not publish a paper making big claims out of these results, just because p<0.05.

So what can we say about the results? If we look more closely at the distribution of differences at the time of the largest group difference (marked by a vertical line in Figure 2), we can make another observation:

figure5

Figure 5. Distribution of individual differences at the time of the maximum absolute group difference.

About 2/3 of participants show an effect in the same direction as the group effect (difference < 0). So, in addition to the group effect, there are large individual differences. This is not surprising. What is surprising is the usual lack of consideration for individual differences in most neuroscience & psychology papers I have come across. Typically, results portrayed in Figure 1 would be presented like this:

“We measured our favourite peak in two conditions. It was larger in condition 1 than in condition 2 (p<0.05), as predicted by our hypothesis. Therefore, when subjected to condition 1, our brains process (INSERT FAVOURITE STIMULUS, e.g. faces) more (INSERT FAVOURITE PROCESS, e.g. holistically).”

Not only this is a case of bad reverse inference, it is also inappropriate to generalise the effect to the entire human population, or even to all participants in the sample – 1/3 showed an effect in the opposite direction after all. Discrepancies between group statistics and single-participant statistics are not unheard of, if you dare to look (Rousselet et al. 2011).

Certainly, more subtle and honest data description would go a long way towards getting rid of big claims, ghost effects and dodgy headlines. But how many ERP papers have you ever seen with figures such as Figure 4 and Figure 5? How many papers contain monsters behind bars? Certainly, “my software does not have that option” doesn’t cut it; these figures are easy to make in Matlab, R or Python. If you don’t know how, ask a colleague, post questions on online forums, there are plenty of folks eager to help. For Matlab code, you could start here for instance.

Now: the final blow. The original ERP data used for this post are real and have huge effect sizes (check Figure A2 here for instance). However, the effect marked by a little star in Figure 1 is a false positive: there are no real effects in this dataset! The current data were generated by sampling trials with replacement from a pool of 7680 trials, to which pink noise was added, to create 30 fake participants and 2 fake conditions. I ran the fake data making process several times and selected the version that gave me a significant peak difference, because, you know, I love peaks. So yes, we’ve been looking at noise all along. And I’m sure there is plenty of noise out there in published papers. But it is very difficult to tell, because standard ERP figures are so poor.

How do we fix this?

  • make detailed, honest figures of your effects;
  • post your data to an online repository for other people to scrutinise them;
  • conclude honestly about what you’ve measured (e.g. “I only analyse the mean, I don’t know how other aspects of the distributions behave”), without unwarranted generalisation (“1/3 of my participants did not show the group effect”);
  • replicate new effects;
  • report p values if you want, but do not obsess over the 0.05 threshold, it is arbitrary, and continuous distributions should not be dichotomised (MacCallum et al. 2002);
  • focus on effect sizes.

References

Bieniek, M.M., Bennett, P.J., Sekuler, A.B. & Rousselet, G.A. (2015) A robust and representative lower bound on object processing speed in humans. The European journal of neuroscience.

Kovalenko, L.Y., Chaumon, M. & Busch, N.A. (2012) A pool of pairs of related objects (POPORO) for investigating visual semantic integration: behavioral and electrophysiological validation. Brain Topogr, 25, 272-284.

MacCallum RC, Zhang S, Preacher KJ, Rucker DD. 2002. On the practice of dichotomization of quantitative variables. Psychological Methods 7: 19-40

Rousselet, G.A. & Pernet, C.R. (2011) Quantifying the Time Course of Visual Object Processing Using ERPs: It’s Time to Up the Game. Front Psychol, 2, 107.

Rousselet, G.A., Gaspar, C.M., Wieczorek, K.P. & Pernet, C.R. (2011) Modeling Single-Trial ERP Reveals Modulation of Bottom-Up Face Visual Processing by Top-Down Task Constraints (in Some Subjects). Front Psychol, 2, 137.

Wetzels, R., Matzke, D., Lee, M.D., Rouder, J.N., Iverson, G.J. & Wagenmakers, E.J. (2011) Statistical Evidence in Experimental Psychology: An Empirical Comparison Using 855 t Tests. Perspectives on Psychological Science, 6, 291-298.