Category Archives: review

Correlations in neuroscience: are small n, interaction fallacies, lack of illustrations and confidence intervals the norm?

As a reviewer, editor and reader of research articles, I’m regularly annoyed by the low standards in correlation analyses. In my experience with such articles, typically:

  • Pearson’s correlation, a non-robust measure of association, is used;
  • R and p values are reported, but not confidence intervals;
  • sample sizes tend to be small, leading to large estimation bias and inflated effect sizes in the literature;
  • R values and confidence intervals are not considered when interpreting the results;
  • instead, most analyses are reported as significant or non-significant (p<0.05), leading to the conclusion that an association exists or not (frequentist fallacy);
  • often figures illustrating the correlations are absent;
  • the explicit or implicit comparison of two correlations is done without a formal test (interaction fallacy).

To find out if my experience was in fact representative of the typical paper, I had a look at all papers published in 2017 in the European Journal of Neuroscience, where I’m a section editor. I care about the quality of the research published in EJN, so this is not an attempt at blaming a journal in particular, rather it’s a starting point to address a general problem. I really hope the results presented below will serve as a wake-up call for all involved and will lead to improvements in correlation analyses. Also, I bet if you look systematically at articles published in other neuroscience journals you’ll find the same problems. If you’re not convinced, go ahead, prove me wrong 😉 

I proceeded like this: for all 2017 articles (volumes 45 and 46), I searched for “correl” and scanned the figures for scatterplots. If both searches were negative, the article was categorised as not containing a correlation analysis, so I might have missed a few. When at least one correlation was present, I looked for these details: 

  • n
  • estimator
  • confidence interval
  • R
  • p value
  • consideration of effect sizes
  • figure illustrating positive result
  • figure illustrating negative result
  • interaction test.

164 articles reported no correlation.

7 articles used regression analyses; in 3 of them the sample sizes were as low as n=6, n=10 and n=12.

48 articles reported correlations.

Sample size

The norm was to not report degrees of freedom or sample size along with the correlation analyses or their illustrations. In 7 articles, the sample sizes were very difficult or impossible to guess. In the others, sample sizes varied a lot, both within and between articles. To confirm sample sizes, I counted the observations in scatterplots when they were available and not too crowded – this was a tedious job and I probably got some estimations and checks wrong. Anyway, I shouldn’t have to do all these checks, so something went wrong during the reviewing process. 

To simplify the presentation of the results, I collapsed the sample size estimates across articles. Here is the distribution: 

[Figure: distribution of sample size estimates, collapsed across articles]

The figure omits 3 outliers with n= 836, 1397, 1407, all from the same article.

The median sample size is 18, which is far too low to provide sufficiently precise estimation.
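
To make this concrete, here is a minimal simulation, in Python, of the sampling distribution of Pearson’s r when n=18. The true population correlation (0.4) and the bivariate normal model are assumptions of mine, for illustration only.

# Sampling variability of Pearson's r at the median sample size observed here
import numpy as np

rng = np.random.default_rng(1)
rho, n, nsim = 0.4, 18, 10000
cov = [[1, rho], [rho, 1]]
rs = np.empty(nsim)
for i in range(nsim):
    x, y = rng.multivariate_normal([0, 0], cov, size=n).T   # one simulated experiment
    rs[i] = np.corrcoef(x, y)[0, 1]

print(np.round(np.percentile(rs, [2.5, 50, 97.5]), 2))

Under these assumptions, sample correlations ranging from slightly negative to above 0.7 are all compatible with the same population value of 0.4: a single estimate at n=18 tells us very little.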

Estimator

The issue with low sample sizes is made worse by the predominant use of Pearson’s correlation or the lack of consideration for the type of estimator. Indeed, 21 articles did not mention the estimator used at all, but presumably they used Pearson’s correlation.

Among the 27 articles that did mention which estimator was used:

  • 11 used only Pearson’s correlation;
  • 11 used only Spearman’s correlation;
  • 4 used Pearson’s and Spearman’s correlations;
  • 1 used Spearman’s and Kendall’s correlations.

So the majority of studies used an estimator that is well-known for its lack of robustness and its inaccurate confidence intervals and p values (Pernet, Wilcox & Rousselet, 2012).
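
To illustrate what the lack of robustness means in practice, here is a small Python sketch with made-up data: a single extreme observation is enough to manufacture a sizeable Pearson correlation between two otherwise independent variables, while Spearman’s rank-based estimator is much less affected.

# One bivariate outlier added to otherwise independent observations
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(size=20)
y = rng.normal(size=20)        # x and y are independent by construction
x[0], y[0] = 10.0, 10.0        # single extreme point

r_p, _ = stats.pearsonr(x, y)
r_s, _ = stats.spearmanr(x, y)
print("Pearson  r = %.2f" % r_p)
print("Spearman r = %.2f" % r_s)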

R & p values

Most articles reported R and p values. Only 2 articles did not report R values. The same 2 articles also omitted p values, simply mentioning that the correlations were not significant. Another 3 articles did not report p values along with the R values.

Confidence interval

Only 3 articles reported confidence intervals, without mentioning how they were computed. 1 article reported percentile bootstrap confidence intervals for Pearson’s correlations, which is the recommended procedure for this estimator (Pernet, Wilcox & Rousselet, 2012).
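
For reference, here is a bare-bones Python sketch of a percentile bootstrap confidence interval for a correlation. It is only a starting point: the procedure recommended by Pernet, Wilcox & Rousselet (2012) for Pearson’s correlation adjusts the percentile bounds as a function of sample size, which this plain version does not do.

# Percentile bootstrap CI for a correlation (x, y are numpy arrays of equal length)
import numpy as np

def bootci_corr(x, y, nboot=2000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    boot_r = np.empty(nboot)
    for b in range(nboot):
        idx = rng.integers(0, n, size=n)              # resample pairs with replacement
        boot_r[b] = np.corrcoef(x[idx], y[idx])[0, 1]
    return np.percentile(boot_r, [100 * alpha / 2, 100 * (1 - alpha / 2)])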

Consideration for effect sizes

Given the lack of interest in measurement uncertainty demonstrated by the absence of confidence intervals in most articles, it is not surprising that only 5 articles mentioned the size of the correlation when presenting the results. All other articles simply reported the correlations as significant or not.

Illustrations

In contrast with the absence of confidence intervals and consideration for effect sizes, 23 articles included illustrations of positive results. 4 articles reported only negative results, which leaves 21 articles that failed to illustrate their positive correlation results. 

Among the 40 articles that reported negative results, only 13 illustrated them, which suggests a strong bias towards positive results.

Interaction test

Finally, I looked for interaction fallacies (Nieuwenhuis, Forstmann & Wagenmakers 2011). In the context of correlation analyses, you commit an interaction fallacy when you present two correlations, one significant, the other not, implying that the 2 differ, but without explicitly testing the interaction. In other versions of the interaction fallacy, two significant correlations with the same sign are presented together, implying either that the 2 are similar, or that one is stronger than the other, without providing a confidence interval for the correlation difference. You can easily guess the other flavours… 

10 articles presented only one correlation, so there was no scope for the interaction fallacy. Among the 38 articles that presented more than one correlation, only one provided an explicit test for the comparison of 2 correlations. However, the authors omitted the explicit test for their next comparison!
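
To avoid the fallacy, the comparison itself needs an estimate and a confidence interval. For two independent correlations, one simple option is to bootstrap the difference, as in this Python sketch (function name and defaults are mine):

# Percentile bootstrap CI for the difference between two independent correlations
import numpy as np

def bootci_corr_diff(x1, y1, x2, y2, nboot=2000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    diffs = np.empty(nboot)
    for b in range(nboot):
        i1 = rng.integers(0, len(x1), size=len(x1))
        i2 = rng.integers(0, len(x2), size=len(x2))
        r1 = np.corrcoef(x1[i1], y1[i1])[0, 1]
        r2 = np.corrcoef(x2[i2], y2[i2])[0, 1]
        diffs[b] = r1 - r2
    return np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])

If the confidence interval for the difference is wide or includes zero, claiming that one correlation is larger than the other is unwarranted, whatever the individual p values say.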

Recommendations

In conclusion, at least in 2017 EJN articles, the norm is to estimate associations using small sample sizes and a non-robust estimator, to not provide confidence intervals and to not consider effect sizes and measurement uncertainty when presenting the results. Also, positive results are more likely to be illustrated than negative ones. Finally, interaction fallacies are mainstream.

How can we do a better job?

If you want to do a correlation analysis, consider your sample size carefully to assess statistical power and, even better, your long-term estimation precision. If you have a small n, I wouldn’t even look at the correlation. 
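
As a rough illustration of precision planning, here is a sketch based on Fisher’s z approximation: given a guessed population correlation and a target 95% confidence interval half-width (both numbers are placeholders), it returns the smallest n expected to achieve that precision.

# Smallest n for a target CI half-width, using the Fisher z approximation
import numpy as np

def ci_halfwidth(rho, n):
    z = np.arctanh(rho)
    lo = np.tanh(z - 1.96 / np.sqrt(n - 3))
    hi = np.tanh(z + 1.96 / np.sqrt(n - 3))
    return (hi - lo) / 2

rho_guess, target = 0.4, 0.15
n = 10
while ci_halfwidth(rho_guess, n) > target:
    n += 1
print(n)

With these assumptions the answer is well over a hundred observations, far beyond the median n=18 reported above.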

Do not use Pearson’s correlation unless you have well-behaved and large samples, and you are only interested in linear relationships; otherwise explore robust measures of associations and techniques that provide valid confidence intervals (Pernet, Wilcox & Rousselet, 2012; Wilcox & Rousselet, 2018).

Reporting

These details are essential in articles reporting correlation analyses:

  • sample size for each correlation;
  • estimator of association;
  • R value;
  • confidence interval;
  • scatterplot illustration of every correlation, irrespective of the p value;
  • explicit comparison test of all correlations explicitly or implicitly compared;
  • consideration of effect sizes (R values) and their uncertainty (confidence intervals) in the interpretation of the results.

Report p values if you want, but they are not essential and should not be given a special status (McShane et al. 2018).

Finally, are you sure you really want to compute a correlation?

“Why then are correlation coefficients so attractive? Only bad reasons seem to come to mind. Worst of all, probably, is the absence of any need to think about units for either variable. Given two perfectly meaningless variables, one is reminded of their meaninglessness when a regression coefficient is given, since one wonders how to interpret its value. A correlation coefficient is less likely to bring up the unpleasant truth—we think we know what r = -.7 means. Do we? How often? Sweeping things under the rug is the enemy of good data analysis. Often, using the correlation coefficient is “sweeping under the rug” with a vengeance. Being so disinterested in our variables that we do not care about their units can hardly be desirable.”
Analyzing data: Sanctification or detective work?

John W. Tukey.
 American Psychologist, Vol 24(2), Feb 1969, 83-91. http://dx.doi.org/10.1037/h0027108

 

References

McShane, B.B., Gal, D., Gelman, A., Robert, C. & Tackett, J.L. (2018) Abandon Statistical Significance. arXiv preprint.

Nieuwenhuis, S., Forstmann, B.U. & Wagenmakers, E.J. (2011) Erroneous analyses of interactions in neuroscience: a problem of significance. Nat Neurosci, 14, 1105-1107.

Pernet, C.R., Wilcox, R. & Rousselet, G.A. (2012) Robust correlation analyses: false positive and power validation using a new open source matlab toolbox. Front Psychol, 3, 606.

Rousselet, G.A. & Pernet, C.R. (2012) Improving standards in brain-behavior correlation analyses. Frontiers in human neuroscience, 6, 119.

Wilcox, R.R. & Rousselet, G.A. (2018) A Guide to Robust Statistical Methods in Neuroscience. Curr Protoc Neurosci, 82, 8.42.1-8.42.30.

[preprint]

A quick review of Reader et al. EJN 2018

Here is a quick review of a paper that just came out in the European Journal of Neuroscience. I flagged one concern on Twitter, and the authors asked me to elaborate, so here I provide a short assessment of the sort I write when scanning through a paper I get to edit for the journal. This short assessment would then be added to the reviewers’ comments in my decision letter, or would be used as feedback as part of an editorial rejection.

Repetitive transcranial magnetic stimulation reveals a role for the left inferior parietal lobule in matching observed kinematics during imitation
Reader et al. EJN 2018

Participants: please report min, max, median age

Instead of splitting the analysis between hand and finger gestures, it would have been more powerful to perform a hierarchical analysis including a gesture factor. This sort of item analysis could dramatically alter the outcome of your analyses:
https://wellcomeopenresearch.org/articles/1-23/v2

If the separate analysis of different gestures was not planned, then the sub-analyses must be corrected for multiple comparisons. In your case, it seems that most effects would not be significant, following the usual arbitrary p<0.05 cut-off.

The permutation + t-test procedure to investigate sequences is quite good. It would probably give you more power to use the size of the t values, by computing cluster sums instead of cluster lengths. Also, you could consider TFCE, which provides a cluster-free approach, with Matlab code available:
https://www.ncbi.nlm.nih.gov/pubmed/18501637

The multiple sequence tests should be corrected for multiple comparisons, for instance by using a maximum cluster-sum statistic, where the max is taken across all comparisons.
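
To make the suggestion concrete, here is a minimal Python sketch of the cluster-sum idea (not the authors’ code, and independent of any toolbox). Applying the same function to permuted data and keeping the maximum cluster sum across all comparisons gives the corrected null distribution.

# Sum of |t| values within each suprathreshold cluster of a 1D sequence
import numpy as np

def cluster_sums(tvals, threshold):
    above = np.abs(tvals) > threshold
    sums, current = [], 0.0
    for t, a in zip(tvals, above):
        if a:
            current += abs(t)
        elif current:
            sums.append(current)
            current = 0.0
    if current:
        sums.append(current)
    return sums   # compare max(sums) to the permutation distribution of maxima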

The use of t-tests on means should be justified.

The results are presented mostly using t and p values, and whether an arbitrary p<0.05 threshold was reached. Results would be better presented by focusing on effect sizes, in their original units, individual differences, and mentioning t and p values only in passing, as pieces of evidence without a special status.

It’s great to use figures with scatterplots; however, for paired designs it is essential to also show distributions of pairwise differences. See guidelines for instance here:
https://onlinelibrary.wiley.com/doi/full/10.1111/ejn.13400

The little stars and crosses should be removed from all figures. The emphasis should be on effect sizes and individual differences, which are best portrayed by showing scatterplots of pairwise differences. Also, using different tests, whether parametric or not, would give different p values, so the exact p values are only indicative, especially if the goal is simply to determine whether p<0.05.

The discussion should better qualify the results. For instance, this conclusion uses p values to incorrectly imply that the null hypothesis can be accepted, and that effect sizes were zero:
“Whilst stimulation did not influence imitation accuracy, there was some evidence for differences between accuracy in meaningful and meaningless action performance.”

The sample size is rather small, so the discussion should mention the risk of effect sizes being completely off.

Having to contact the corresponding author to get data and code is not a good short-term or long-term solution. Instead, please upload the material to a third-party repository.

“the fact that” is an awful construction that can be avoided by rephrasing. For instance:
“may reflect the fact that the differences”
->
“may reflect that the differences”

Can someone tell if a person is alive or dead based on a photograph?

In this post I review the now retracted paper:

Delorme A, Pierce A, Michel L and Radin D (2016) Prediction of Mortality Based on Facial Characteristics. Front. Hum. Neurosci. 10:173. doi: 10.3389/fnhum.2016.00173

In typical Frontiers’ style, the reason for the retraction is obscure.

In December 2016, I made negative comments about the paper on Twitter. Arnaud Delorme (first author, and whom I’ve known for over 20 years) got in touch, asking for clarifications about my points. I said I would write something eventually, so here it is.

The story is simple: some individuals claim to be able to determine if a person is alive or dead based on a photograph. The authors got hold of 12 such individuals and asked them to perform a dead/alive/don’t know discrimination task. EEG was measured while participants viewed 394 photos of individuals alive or dead (50/50).

Here are some of the methodological problems.

Stimuli

Participants were from California. Some photographs were of US politicians outside California. Participants did not recognise any individuals from the photographs, but unconscious familiarity could still influence behaviour and EEG – who knows?

More importantly, if participants make general claims about their abilities, why not use photographs of individuals from another country altogether? Even better, another culture?

Behaviour

The average group performance of the participants was 53.6%. So as a group, they really can’t do the task. (If you want to argue they can, I challenge you to seek treatment from a surgeon with a 53.6% success record.) Yet, a t-test is reported with p=0.005. Let’s not pay too much attention to the inappropriateness of t-tests for percent correct data. The crucial point is that the participants did not make a claim about their performance as a group: each one of them claimed to be able to tease apart the dead from the living based on photographs. So participants should be assessed individually. Here are the individual performances (percent correct):

(52.3, 56.7, 53.3, 56.0, 56.6, 51.8, 61.3, 55.3, 50.0, 51.6, 49.5, 49.4)

5 participants have results flagged as significant. One in particular has a performance of 61.3% correct. So how does it compare to participants without super abilities? Well, astonishingly, there is no control group! (Yep, and that paper was peer-reviewed.)
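
For a sense of scale, here is a quick binomial check of the best individual score, assuming each participant judged all 394 photographs and that chance is 50% (the per-participant trial count is my assumption, and I am guessing the flagged results come from tests of this kind):

# Binomial test against chance for the best individual performance
from scipy import stats

n_trials = 394
k = round(0.613 * n_trials)          # ~242 correct responses
print(stats.binomtest(k, n_trials, p=0.5).pvalue)

The p value is indeed tiny, but on its own it proves little: without a control group and repeated testing, there is no way to attribute it to the claimed ability.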

Given the extraordinary claims made by the participants, I would have expected a large sample of control participants, to clearly demonstrate that the “readers” perform well beyond normal. I would also have expected the readers to be tested on multiple occasions, to demonstrate the reliability of the effect.

There are two other problems with the behavioural performance:

  • participants’ responses were biased towards the ‘dead’ response, so a sensitivity analysis, such as d’ or a non-parametric equivalent, should have been used (see the sketch after this list).

  • performance varied a lot across the 3 sets of images that composed the 394 photographs. This suggests that the results are not image independent, which could in part be due to the 3 sets containing different proportions of dead and alive persons.
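
On the first point, here is a minimal d’ sketch with hypothetical hit and false-alarm counts (the paper does not report a full confusion matrix), treating ‘dead’ photographs as the signal category:

# d' with a log-linear correction to avoid infinite z-scores at extreme rates
from scipy import stats

def dprime(hits, misses, fas, crs):
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (fas + 0.5) / (fas + crs + 1)
    return stats.norm.ppf(hit_rate) - stats.norm.ppf(fa_rate)

# hypothetical counts: 197 'dead' trials and 197 'alive' trials
print(dprime(hits=120, misses=77, fas=100, crs=97))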

EEG

The ERP analyses were performed at the group level using a 2 x 2 design: alive/dead x correct/incorrect classification. One effect is reported with p<0.05: a larger amplitude for incorrect than correct around 100 ms post-stimulus, only for pictures of dead persons. A proper spatial-temporal cluster correction for multiple comparisons was applied. There is no clear interpretation of the effect in the paper, except a quick suggestion that it could be due to low-level image properties or an attention effect. A non-specific attention effect is possible, because sorting ERPs based on behavioural performance can be misleading, as explained here. The effect could also be a false positive – in the absence of replication and comparison to a control group, it’s impossible to tell.

To be frank, I don’t understand why EEG was measured at all. I guess if the self-proclaimed readers could do the task at all, it would be interesting to look at the time-course of the brain activity related to the task. But the literature on face recognition shows very little modulation due to identity, except in priming tasks or using SSVEP protocols – so not likely to show anything with single image presentations. If there was something to exploit, the analysis should be performed at the participant level, perhaps using multivariate logistic regression, with cross-validation, to demonstrate a link between brain activity and image type. Similarly to behaviour, each individual result from the “readers” should be compared to a large set of control results, from participants who cannot perform the behavioural task.
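
For the participant-level decoding suggestion, a minimal sketch could look like this, assuming trials-by-features EEG data for one participant; the feature set, classifier settings and scoring are placeholders, not a recommendation of a specific pipeline.

# Cross-validated logistic regression decoding image type from single-trial EEG features
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(394, 64))     # trials x features (placeholder data)
y = rng.integers(0, 2, size=394)   # 0 = alive, 1 = dead (placeholder labels)

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=10, scoring='roc_auc')
print(scores.mean())

Each reader’s cross-validated score would then be compared to the distribution of scores from a large group of control participants.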

Conclusion

In conclusion, this paper should never have been sent for peer-review. That would have saved everyone involved a lot of time. There is nothing in the paper supporting the authors’ conclusion:

“Our results support claims of individuals who report that some as-yet unknown features of the face predict mortality. The results are also compatible with claims about clairvoyance warrants further investigation.”

If the authors are serious about studying clairvoyance, they should embark on a much more ambitious study. To save time and money, I would suggest dropping EEG, and focusing instead on the creation of a large bank of images from various countries and cultures, and on repeated measurements of readers and many controls.