
Improving statistical reporting in psychology

Schubert et al. (2025) describe important steps to improve statistical reporting in psychology. I strongly encourage all psychologists (and neuroscientists) to read this goldmine of an article. There is no doubt that implementing their suggestions would improve the typical psychology article. However useful, some of the suggested steps are insufficient, and some of the proposed examples of good practice are, at the very least, missed opportunities to educate the community. Here are the main examples of good practice presented in their Table 1, and some of the issues with these statements:

Hypothesis and design

“We hypothesized that participants in a high-load visual working memory condition would have lower recall accuracy (and longer RTs) than those in a low-load condition.”

Although more specific than simply stating that the conditions will differ, it is unclear how that statement translates into a testable hypothesis. The most common, but rarely justified, approach is to compare group means. However, that approach makes strong assumptions about the symmetry of the distributions, or about stochastic dominance, or both. The mean is also not robust and asks a very specific question about the data, often not the most interesting one. And what about individual participants? Is a significant group difference in means necessary and sufficient to support the theory? What if a large proportion of participants do not show the group effect?
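
For instance, in the simulated sketch below (the distributions and all values are made-up assumptions, not anything from the article), two groups have essentially identical means but differ markedly in the tails; comparing deciles reveals a pattern that a comparison of means cannot capture:

```r
# Made-up reaction times: two groups with similar means but different shapes.
set.seed(1)
n <- 1000
low  <- rnorm(n, mean = 500, sd = 50)            # roughly symmetric RTs
high <- 400 + rgamma(n, shape = 2, scale = 50)   # skewed RTs, same population mean

round(c(mean(low), mean(high)))   # the sample means are very close
t.test(low, high)$p.value         # population means are identical, so a test on
                                  # means asks a question with a known, boring answer

# decile differences change sign and grow in the right tail: the groups differ,
# but not in a way a single mean difference can describe
probs <- seq(0.1, 0.9, 0.1)
round(quantile(high, probs) - quantile(low, probs))
```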

Sample size justification

“Our a priori power analysis (80% power, α = 0.05) for detecting a medium effect size (d = 0.50) required 52 participants. We oversampled to 60 in anticipation of attrition.”

Obviously, what is missing here is the statistical test for which the sample size was estimated. This is clearly emphasised in the main text of Schubert et al. (2025), a point that should be prominently featured in Table 1 too. In my experience, most articles that contain some sort of sample size justification omit the statistical test. In fact, statistical tests are often completely absent from the method section. Or one test is mentioned, but the results report other ones, often including a more complex ANOVA, for which the sample size will be insufficient. Schubert et al. (2025) address this critical point by suggesting “to base the sample size on the test that requires the largest sample to ensure adequate power across all analyses”. Typically that would be the most complex interaction. But in doing so, one must also consider the shape of the interaction, as the following reference and the simulation sketch after it illustrate:

Sommet, N., Weissman, D. L., Cheutin, N., & Elliot, A. J. (2023). How Many Participants Do I Need to Test an Interaction? Conducting an Appropriate Power Analysis and Achieving Sufficient Power to Detect an Interaction. Advances in Methods and Practices in Psychological Science, 6(3).
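
To illustrate, here is a rough simulation sketch (the cell means, SD and cell sizes are assumptions chosen for illustration, not values from either article): for the same simple effect, a cross-over interaction is much easier to detect than an attenuated one.

```r
# Simulation sketch: power to detect a 2 x 2 between-subjects interaction
# depends on its shape, not just on the size of the simple effects.
set.seed(1)
sim_power <- function(n_per_cell, cell_means, nsim = 1000) {
  p <- replicate(nsim, {
    dat <- expand.grid(A = c(0, 1), B = c(0, 1))     # 2 x 2 design
    dat <- dat[rep(1:4, each = n_per_cell), ]
    mu  <- cell_means[dat$A * 2 + dat$B + 1]         # pick each row's cell mean
    dat$y <- rnorm(nrow(dat), mean = mu, sd = 1)
    summary(lm(y ~ A * B, data = dat))$coefficients["A:B", "Pr(>|t|)"]
  })
  mean(p < 0.05)
}
# cell order: A0B0, A0B1, A1B0, A1B1 (means in SD units)
crossed    <- c( 0.25, -0.25, -0.25, 0.25)  # effect of A reverses across B
attenuated <- c( 0.00,  0.00,  0.00, 0.50)  # effect of A present only when B = 1
sim_power(64, crossed)      # interaction contrast = 1.0 -> high power
sim_power(64, attenuated)   # interaction contrast = 0.5 -> much lower power
```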

Schubert et al. (2025) do a great job of flagging other common issues with sample size justifications, including choosing a realistic expected effect size and dealing with multiple statistical tests. However, the notion that “sample size planning is technically straightforward for simple statistical modeling approaches” is misguided. Standard tests and matching power calculators all make assumptions that are necessarily violated by any data we encounter in psychology: all quantities we measure are bounded or skewed, usually both. In addition, ANOVAs are typically applied to data after aggregation across trials, which makes the common yet insane assumption of infinite measurement precision (that is, no measurement error at the participant level). And the number of trials is usually not considered in the sample size justification. There is the added difficulty of analysis dependencies. For instance, some form of outlier removal is applied, which affects degrees of freedom, but this is not considered in the analyses or the sample size calculations: removing outliers and applying a t-test as if nothing had happened is inappropriate. For all these reasons, power analyses from standard calculators are wrong. The same goes for p values. For instance, in the example above, a sample size of 52 participants was estimated, but 60 participants were sampled; the p values will be affected by that extra over-sampling step, yet the step won’t be considered in the analyses, leading to inaccurate p values.
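
As an illustration of the measurement-precision point above, here is a simulation sketch (the effect size, trial noise and number of trials are assumptions for illustration, not values from Schubert et al., 2025):

```r
# Sketch: a standard power calculator assumes each participant's score is
# measured without error; when scores are the mean of a limited number of
# noisy trials, actual power is lower than the nominal value.
set.seed(1)
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.8)  # ~64 per group

n <- 64; n_trials <- 10; trial_sd <- 3; nsim <- 5000
pvals <- replicate(nsim, {
  g1 <- rnorm(n, 0,   1)                             # latent participant scores
  g2 <- rnorm(n, 0.5, 1)                             # latent effect: d = 0.5
  o1 <- g1 + rnorm(n, 0, trial_sd / sqrt(n_trials))  # observed: mean of 10 trials
  o2 <- g2 + rnorm(n, 0, trial_sd / sqrt(n_trials))
  t.test(o1, o2)$p.value
})
mean(pvals < 0.05)   # simulated power: well below the nominal 80%
```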

Outlier and missing data

“In line with our preregistered plan, reaction times exceeding 3 SDs from the condition mean were treated as outliers and removed from subsequent analyses, affecting 2.5% of trials. Five participants who withdrew mid-study were excluded from final analyses (final N = 55).”

Because the mean and the SD are not robust, this rejection rule is not robust either: as outliers get larger, they inflate the mean and the SD, making themselves less likely to be detected (a phenomenon known as masking). Also, the rule implies a symmetric distribution, which is unlikely for reaction times. Given that trials and participants were removed, the authors should also explain how they calculated the p values conditional on this removal procedure; the same goes for the power analysis. Good luck calculating a p value that properly reflects all the conditional analysis and pre-processing steps:

Kruschke, J. K. (2013). Bayesian estimation supersedes the t test. Journal of Experimental Psychology: General, 142, 573–603

Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14(5), 779–804

In practice, every p value you have ever calculated or read is wrong because processing dependencies are ignored.

Schubert et al. (2025) make two excellent points related to the statement above: one, what looks like an outlier might be a legitimate member of a skewed or heavy-tailed distribution; two, an alternative to detecting outliers is to use robust methods that can handle them by default, such as making inferences about trimmed means. What Schubert et al. (2025) fail to mention, as noted in the previous section, is that removing outliers and then applying standard methods to the remaining data is inappropriate, because that procedure affects the standard error (their reference 68 covers that important point). Of course, another option is to use an appropriate mixed-effect model, for instance one involving shifted lognormal distributions for reaction time data.
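
As a toy illustration of why the mean ± 3 SD rule fails where robust alternatives do not (the values are made up, and only a handful of trials are used so the arithmetic is easy to follow):

```r
# Toy example: two clearly aberrant trials inflate the mean and the SD so much
# that the mean +/- 3 SD rule fails to flag them (masking); a rule based on the
# median and the MAD, which are robust, flags both.
rt <- c(460, 480, 500, 510, 520, 530, 550, 2000, 2100)   # made-up RTs in ms

sum(abs(rt - mean(rt)) > 3 * sd(rt))       # 0 trials flagged by the 3 SD rule
sum(abs(rt - median(rt)) / mad(rt) > 3)    # 2 trials flagged by the MAD rule

# one robust alternative to ad hoc rejection: use a trimmed mean
mean(rt)               # dragged upwards by the two extreme trials
mean(rt, trim = 0.25)  # 25% trimmed mean, barely affected by them
```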

Statistical model specification

“A 2 × 2 repeated-measures ANOVA (Condition: high-load vs. low-load; Time: pre vs. post) on recall accuracy was conducted. The assumption of normality was examined and met.”

To be complete, that statement should specify that the inference was on means (as opposed to trimmed means, medians or other quantiles). More crucially, ANOVAs on accuracy data are inappropriate. And testing normality assumptions is not a thing: accuracy data are never normally distributed; if you claim normality because a test returned p > 0.05, you commit a classic statistical fallacy (absence of evidence is not evidence of absence); the list goes on…
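
To see why “examined and met” is hollow, here is a quick simulation sketch (the number of participants, number of trials and accuracy level are made-up assumptions): accuracy scores generated from a binomial process are bounded, discrete and skewed, so they cannot be normally distributed, yet a normality test can easily return p > 0.05 simply because it lacks power at such sample sizes.

```r
# Sketch: proportion correct out of 20 trials, for 12 participants, simulated
# from a binomial distribution, so never normally distributed. The proportion
# of samples that "pass" a Shapiro-Wilk check reflects the (lack of) power of
# the test at this sample size, not whether the assumption holds.
set.seed(1)
nsim <- 5000
pass <- replicate(nsim, {
  acc <- rbinom(12, size = 20, prob = 0.85) / 20
  shapiro.test(acc)$p.value > 0.05
})
mean(pass)   # p > 0.05 does not establish normality
```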

Schubert et al. (2025) make this comment in the main text: “If assumptions are violated, researchers should describe how these were addressed—for example, by […] using a nonparametric alternative”. This is a very common strategy, but unless we make strong assumptions about the populations, non-parametric (typically rank-based) statistics and their parametric counterparts do not test the same hypotheses. Similarly, inferences about means and inferences about trimmed means ask different questions about the data.
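
Here is a simulation sketch of that point (made-up populations): two groups with identical population means but different shapes. The t-test and the Wilcoxon–Mann–Whitney test disagree, not because one is an assumption-free drop-in for the other, but because they target different quantities.

```r
# Sketch: same population means, different shapes. The t-test asks about the
# difference in means; the rank-based test is sensitive to P(X > Y).
set.seed(1)
n <- 200; nsim <- 2000
res <- replicate(nsim, {
  g1 <- rlnorm(n)                           # skewed, population mean = exp(0.5)
  g2 <- rnorm(n, mean = exp(0.5), sd = 1)   # symmetric, same population mean
  c(t.test(g1, g2)$p.value, wilcox.test(g1, g2)$p.value)
})
rowMeans(res < 0.05)   # rejection rates: low for the t-test, high for Wilcoxon
```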

Inferential statistics

“A significant Condition × Time interaction emerged for recall accuracy, F(1, 54) = 4.37, p = 0.04, partial ω² = 0.06 (95% CI [0.00, 0.23]). Post hoc analyses revealed that accuracy decreased significantly from pre- to post-test in the high-load condition (M difference = −0.15, SE = 0.04, p < 0.001, d = 0.68 (95% CI [0.39, 0.97])), while accuracy did not significantly differ in the low-load condition (M difference = −0.02, SE = 0.04, p = 0.630, d = 0.09 (95% CI [−0.18, 0.36])). This pattern indicates that memory load impaired performance over time, with a medium-to-large effect size for the decline in the high-load condition.”

This statement contains several issues, most of which are actually addressed in the article: the claim about the lack of effect based on a large p value is a statistical fallacy (use a test of equivalence to support a claim of a negligible effect); the categorisation of the effect size using a one-size-fits-all trichotomy is outdated and must be removed.

Other issues are not addressed. Implicitly, the statement suggests that so-called “post hoc” tests are carried out only after finding a significant interaction. In practice, if you have specific expectations, it is perfectly legitimate to pre-register any contrasts of interest, without running an omnibus ANOVA. More importantly, graphical representations and extra analyses should be used to support the interpretation of an interaction, in particular to determine whether it is removable, that is, whether it would disappear under a monotonic transformation of the measurement scale. Interactions are typically reported with the implicit assumption that the measure (here accuracy) maps linearly onto the construct of interest, without addressing the problem of coordination.
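
Here is a minimal numerical sketch of a removable interaction (the cell values are assumptions for illustration): the effects of load and time are perfectly additive on the logit scale, yet an interaction appears once the very same cells are expressed as accuracy.

```r
# Hypothetical cell means on the logit scale: additive effects, no interaction.
logit_means <- rbind(pre  = c(low = 2.0, high = 1.2),
                     post = c(low = 1.5, high = 0.7))
acc_means <- plogis(logit_means)   # the same cells expressed as accuracy

apply(logit_means, 1, diff)  # load effect (high - low): -0.8 at pre and at post
apply(acc_means, 1, diff)    # about -0.11 at pre, -0.15 at post: an "interaction"
```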

Null results

“We observed no significant effect of load on RT, F(1, 87) = 0.01, p = 0.92, partial ω² = 0.00 (95% CI [0.00, 0.01]). We further ran an equivalence test (TOST) using ±0.20 as our smallest effect size of interest, and the 90% CI for d was fully contained within these bounds, suggesting the effect of load on RT is practically negligible. Both the lower-bound test, t(87) = 2.08, one-sided p = 0.020, and the upper-bound test, t(87) = −1.67, one-sided p = 0.049, were statistically significant.”

This is good practice, although the equivalence bound of ±0.20 would need to be justified. Also, there are more powerful tests of equivalence than TOST. And if using TOST, or the equivalent confidence-interval approach, keep in mind that standard methods are not robust.
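
For reference, here is a minimal TOST sketch in base R, with made-up reaction-time data and an equivalence bound expressed in raw units rather than the standardised ±0.20 from the example; a robust version would replace the t-tests with, for instance, trimmed-mean or bootstrap equivalents.

```r
# Minimal TOST sketch: two one-sided t-tests against the equivalence bounds.
# The data and the 30 ms bound are assumptions for illustration.
set.seed(1)
low_rt  <- rnorm(45, 600, 80)
high_rt <- rnorm(44, 605, 80)
bound <- 30   # smallest difference (in ms) considered meaningful

# reject "difference <= -30 ms" and "difference >= +30 ms"
p_lower <- t.test(high_rt, low_rt, mu = -bound, alternative = "greater")$p.value
p_upper <- t.test(high_rt, low_rt, mu =  bound, alternative = "less")$p.value
max(p_lower, p_upper)   # equivalence is declared if this is below alpha

# equivalently, check whether the 90% CI for the difference lies within the bounds
t.test(high_rt, low_rt, conf.level = 0.90)$conf.int
```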

In the article, Schubert et al. (2025) go on to tackle common reporting errors and misconceptions. Most of the points are excellent, except when they suggest using the cocor R package to compare correlations, in the context of the classic interaction fallacy. This package is not recommended because it relies on parametric methods that assume bivariate normality. Use the percentile bootstrap instead.
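
A percentile-bootstrap comparison of two independent correlations takes only a few lines of base R (the data below are made up; for small samples and robust correlation measures, see Wilcox’s work for refinements):

```r
# Percentile bootstrap for the difference between two independent correlations.
set.seed(1)
n1 <- 80; n2 <- 80
g1 <- data.frame(x = rnorm(n1)); g1$y <- 0.5 * g1$x + rnorm(n1)   # made-up group 1
g2 <- data.frame(x = rnorm(n2)); g2$y <- 0.2 * g2$x + rnorm(n2)   # made-up group 2

nboot <- 5000
boot_diff <- replicate(nboot, {
  b1 <- g1[sample(n1, replace = TRUE), ]   # resample participants within each group
  b2 <- g2[sample(n2, replace = TRUE), ]
  cor(b1$x, b1$y) - cor(b2$x, b2$y)
})
quantile(boot_diff, c(0.025, 0.975))   # percentile bootstrap CI for the difference
cor(g1$x, g1$y) - cor(g2$x, g2$y)      # observed difference
```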

Schubert et al. (2025) flag common misinterpretations of confidence intervals and p values. These errors cannot be flagged often enough. However, the description of Bayesian credible intervals as having the probabilistic interpretation often assigned to confidence intervals is misguided: even a credible interval is conditional on the data; there is no magic in the Bayesian world.