How to test for variance homogeneity

It’s simple: don’t do it! (Or do it, but for other reasons; see below.)

“A Levene’s test was used to test if the distributions have equal variances.”

“To establish normality, we used a Shapiro-Wilk test with p > 0.05; and for equal variances we used Levene’s test.”

This approach of first checking assumptions and then deciding which test to perform suffers from the limitations described in my previous post. In that post, I explained why you should avoid normality tests in applied work, though they can be used in a statistics class to cover a lot of core topics. The same goes for tests of variance homogeneity, which are often used in conjunction with normality tests. In particular, methods to detect differences in variance between groups can have low power and inaccurate type I error rates (false positives), the topics of this post. Yet how to capture variance differences is worth looking at, because in some situations a legitimate question is whether distributions differ in spread, or more specifically in variance. Tests of variance differences have a wide range of applications in fields such as neuroscience, economics, genetics, and quality control (Conover et al. 1981; Li et al. 2015; Ritchie et al. 2018; Patil & Kulkarni, 2022). Thus, it is useful to consider the false positive rate and power of these methods in different situations.

Before exploring error rates for methods aimed at detecting variance differences, it’s worth pointing out a paper by Zimmerman (2004), which looked at the error rates of a combo approach, in which a test of variance homogeneity is performed first, followed by either a standard t-test or a Welch t-test depending on the outcome. This approach of conducting a preliminary check is not recommended: the false positives (type I errors) and the power of the combo depend on the relative and absolute sample sizes, as well as on the magnitude of the variance difference between groups, leading to poor performance in realistic situations. What works best is to use methods for unequal variances by default (heteroscedastic methods), a recommendation echoed in more recent work (Delacre et al. 2017). In the presence of skewness or outliers, or both, power can be boosted by using trimmed means in conjunction with parametric or bootstrap methods (Wilcox & Rousselet, 2023). In general, it is a bad idea to rely on methods that make strong assumptions about the sampled populations and to hope for the best.
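In R, for instance, this recommendation is already the default behaviour: t.test() computes the Welch test unless you explicitly set var.equal = TRUE. A minimal example:

```r
# Welch's t-test is the default in R: no preliminary variance check needed
set.seed(1)
x <- rnorm(20, mean = 0, sd = 1)
y <- rnorm(40, mean = 0.5, sd = 2)  # different n and different variance
t.test(x, y)                    # Welch's t-test (var.equal = FALSE by default)
t.test(x, y, var.equal = TRUE)  # standard t-test, not recommended
```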

Now let’s look at power and type I error rates for a few methods aimed at comparing variances. There are so many methods available that it is hard to settle on a few candidates. For instance, Conover et al. (1981) compared 56 tests of variance homogeneity, and many more tests have been proposed since then. Conover et al. (1981) recommended, among others, the Brown-Forsythe test, which is the same as Levene’s test but uses the median instead of the mean, and the Fligner-Killeen test. Zimmerman (2004) used Levene’s test. So we’ll use these three tests here. They belong to a family of tests in which absolute or squared distances between observations and a measure of central tendency are compared using parametric (t-tests, ANOVAs) or rank-based methods (Conover et al. 1981, 2018). Levene’s test uses the mean to centre the distributions, whereas the Brown-Forsythe and Fligner-Killeen tests use the median. In addition, we’ll look at Bartlett’s test, which is known to be very sensitive to departures from normality, and a percentile bootstrap method to compare variances (Wilcox 2002). As we will see, all these tests perform poorly in the presence of skewness, for reasons explained in Conover et al. (2018), who go on to suggest better ones.
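As a concrete illustration, here is how the first four tests can be run in R on two groups, followed by a minimal percentile bootstrap sketch; pb_var_test is my own simplified illustration in the spirit of Wilcox (2002), not his implementation.

```r
library(car)  # provides leveneTest()

set.seed(44)
scores <- c(rnorm(30, sd = 1), rnorm(30, sd = 1.5))
group <- factor(rep(c("g1", "g2"), each = 30))

bartlett.test(scores, group)                # Bartlett
leveneTest(scores, group, center = mean)    # Levene (mean-centred)
leveneTest(scores, group, center = median)  # Brown-Forsythe (median-centred)
fligner.test(scores, group)                 # Fligner-Killeen (rank-based)

# Minimal percentile bootstrap for the difference in variances:
# a sketch in the spirit of Wilcox (2002), not his implementation
pb_var_test <- function(x, y, nboot = 1000, alpha = 0.05) {
  boot.diff <- replicate(nboot,
    var(sample(x, replace = TRUE)) - var(sample(y, replace = TRUE)))
  ci <- quantile(boot.diff, c(alpha / 2, 1 - alpha / 2))
  list(ci = ci, reject = ci[1] > 0 || ci[2] < 0)  # reject if 0 is outside the CI
}
pb_var_test(scores[group == "g1"], scores[group == "g2"])
```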

The code to reproduce the simulations and the figures is available on GitHub. The simulations involved 10,000 iterations, with 1,000 bootstrap samples per iteration for the percentile bootstrap method.

Simulation 1: normality

Let’s first look at what happens under normality, an unlikely situation in many fields, but a useful benchmark. We consider only 2 groups. One population has a standard deviation of 1; the other population has a standard deviation that varies from 1 to 2 in steps of 0.1. Sample size varies from 10 to 100, in steps of 10. Here are the populations we sample from:
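To make the simulation logic concrete, here is a minimal sketch for one cell of the design, using Bartlett’s test as an example; the full code covering all methods and parameter combinations is on GitHub.

```r
# Rejection rate of Bartlett's test for one design cell: n = 30 per group,
# second population SD = 1.5. Setting sd2 to 1 estimates the type I error
# rate instead of power. The full simulation loops over n and SD.
set.seed(21)
nsim <- 10000
n <- 30
sd2 <- 1.5
pvals <- replicate(nsim, {
  x <- rnorm(n, sd = 1)
  y <- rnorm(n, sd = sd2)
  bartlett.test(list(x, y))$p.value
})
mean(pvals <= 0.05)  # proportion of rejections at the 0.05 threshold
```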

The results for the 5 methods are summarised in the next figure:

The results for SD=1 correspond to the type I error rate (false positives), which should be around 5%, because that is the arbitrary threshold I chose here. That is the case for Bartlett’s test, but notice how Levene’s test overshoots, and how the Brown-Forsythe (BF) and Fligner-Killeen (FK) tests undershoot at the lowest sample sizes. The percentile bootstrap method is systematically under 0.05 at all sample sizes. This is a reminder that your alpha is not necessarily what you think it is. The differences among methods for SD=1 (false positives) are easier to see in this figure:

As the standard deviation in the second population increases, power increases for all methods, as expected. Notice the large sample sizes required to detect variance differences. How do the methods compare for the largest population difference? Bartlett performed best, FK came last:

Simulation 2: skewness

Now, what happens when the populations are skewed? Here we consider g-and-h distributions with parameters g=1 and h=0, which gives a distribution with the same shape as a lognormal, but shifted to have a median of zero. The g-and-h distributions are nicely described and illustrated in Yan & Genton (2019). We vary the standard deviation from 1 to 2, leading to these populations:

This is not to suggest that such distributions are often encountered in applied work. Rather, methods that can maintain false positive rates near the nominal level and keep high power when dealing with these distributions should be able to handle a large variety of situations, and can therefore be recommended. Also, fun fact: the distributions above have the same median (0) and the same skewness, but differ in mean and variance.
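For readers who want to experiment, g-and-h samples are obtained by transforming standard normal draws; a minimal sketch of the generator (the standard Tukey transformation; with g = 0 and h = 0 it reduces to the normal distribution):

```r
# Transform standard normal draws into g-and-h draws:
# g controls skewness, h controls tail thickness.
# With g = 1 and h = 0, the shape matches a lognormal, but the median is 0.
ghdist <- function(n, g = 0, h = 0) {
  z <- rnorm(n)
  x <- if (g == 0) z else (exp(g * z) - 1) / g
  x * exp(h * z^2 / 2)
}
set.seed(8)
samp <- ghdist(10000, g = 1, h = 0) * 1.5  # rescaling changes the SD; the median stays 0
c(median = median(samp), mean = mean(samp), sd = sd(samp))
```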

Here is what we get when we sample from skewed populations:

Compared to results obtained under normality, the maximum power is lower (more illustrations of that later), but look at what happens to the false positives (SD=1): they skyrocket for Bartlett and Levene, and increase less dramatically for FK. The BF test handles skewness very well, whereas the percentile bootstrap is a bit liberal. Let’s compare the false positives of the different methods in one figure:

The same for maximum power (SD=2):

Obviously, the higher power of Bartlett and FK cannot be trusted given their huge false positive rates.

Simulation 3: false positives as a function of skewness

Here we sample from g-and-h distributions that vary from g=0 to g=1. So we explore the space between the two previous simulations parametrically. First, we consider false positives (no difference in variance). Here are the populations we sample from:

And the results for the 5 methods:

Comparison of the 5 methods for g=1:

BF and the percentile bootstrap tests are clearly more robust to skewness than the other methods. Bartlett’s test is useless under skewness, so why is it so readily available in stat packages? In R, bartlett.test doesn’t even come with a warning.

Simulation 4: true positives as a function of skewness

We proceed as in simulation 3, but now the populations always differ by one standard deviation:

All the methods are strongly affected by skewness, with power dropping more for methods that are better at preserving the type I error rate at the nominal level (BF and percentile bootstrap):

A reminder that if you rely on power calculators that assume normality, your power is probably lower than you think.

Conclusion

None of the methods considered here were satisfactory. Only the Brown-Forsythe test and Wilcox’s bootstrap method controlled the type I error rate under non-normality, but their power was strongly affected by skewness. Conover et al. (2018) have proposed alternative methods to maintain high power in the presence of skewness. They recommend methods in which distributions are centred using the global mean or median (across groups), a simple step that improves performance considerably over the subtraction of the mean or median separately in each group (as used here and by default in R). See their discussion for an explanation of the lack of robustness to skewness of standard tests when individual group means or medians are used instead of the global ones. Conover et al. (2018) also considered the lognormal distribution, which corresponds to the g-and-h distribution with g=1 studied here, so their proposed methods should perform much better than the ones we considered. And there are plenty more tests on the market. For instance, Patil & Kulkarni (2022) have proposed a new method that promises high power in a range of situations. Please post a comment if you know of R packages that implement modern robust methods.
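To make the global centring idea concrete, here is a minimal sketch of a Levene-type test using the median of the combined sample. This illustrates the centring step only, not the full procedures recommended by Conover et al. (2018), and it assumes the groups share the same location, as in the simulations above, where all populations have median 0.

```r
# Levene-type test with global centring: absolute deviations are computed
# from the median of the combined sample, not from each group's median.
# Assumes the groups share the same location.
global_centre_test <- function(scores, group) {
  z <- abs(scores - median(scores))  # median of the combined sample
  oneway.test(z ~ group)             # heteroscedastic ANOVA on the deviations
}
```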

Finally, comparing variances is a very specific question. More broadly, one might be interested in differences in spread between distributions. For this more general question, other tools are available, relying on robust measures of scale (Wilcox, 2017, chapter 5) and quantile approaches (Rousselet et al. 2017). The distinction between variance and spread is important, because differences in variance could be driven by or masked by outliers or skewness, which might not affect a robust estimator of scale.
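As a minimal illustration of that distinction, spread can be compared using a robust scale estimator such as the MAD, combined with a percentile bootstrap; the estimator and the bootstrap scheme below are illustrative choices, not the specific methods covered by Wilcox (2017) or Rousselet et al. (2017).

```r
# Compare spread with a robust estimator of scale (the MAD), using a
# percentile bootstrap CI for the difference. The outliers in y inflate
# its variance but barely affect its MAD.
set.seed(3)
x <- rnorm(50)
y <- c(rnorm(48), 10, -10)  # same spread as x, plus two outliers
boot.diff <- replicate(2000,
  mad(sample(x, replace = TRUE)) - mad(sample(y, replace = TRUE)))
quantile(boot.diff, c(0.025, 0.975))  # CI for the difference in MADs
```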

References

Conover, W.J., Johnson, M.E., & Johnson, M.M. (1981) A Comparative Study of Tests for Homogeneity of Variances, with Applications to the Outer Continental Shelf Bidding Data. Technometrics, 23, 351–361.

Conover, W.J., Guerrero-Serrano, A.J., & Tercero-Gómez, V.G. (2018) An update on ‘a comparative study of tests for homogeneity of variance.’ Journal of Statistical Computation and Simulation, 88, 1454–1469.

Delacre, M., Lakens, D., & Leys, C. (2017) Why Psychologists Should by Default Use Welch’s t-test Instead of Student’s t-test. International Review of Social Psychology, 30(1).

Li, X., Qiu, W., Morrow, J., DeMeo, D.L., Weiss, S.T., Fu, Y., & Wang, X. (2015) A Comparative Study of Tests for Homogeneity of Variances with Application to DNA Methylation Data. PLoS One, 10, e0145295.

Patil, K.P. & Kulkarni, H.V. (2022) An uniformly superior exact multi-sample test procedure for homogeneity of variances under location-scale family of distributions. Journal of Statistical Computation and Simulation, 92, 3931–3957.

Ritchie, S.J., Cox, S.R., Shen, X., Lombardo, M.V., Reus, L.M., Alloza, C., Harris, M.A., Alderson, H.L., Hunter, S., Neilson, E., Liewald, D.C.M., Auyeung, B., Whalley, H.C., Lawrie, S.M., Gale, C.R., Bastin, M.E., McIntosh, A.M., & Deary, I.J. (2018) Sex Differences in the Adult Human Brain: Evidence from 5216 UK Biobank Participants. Cereb Cortex, 28, 2959–2975.

Rousselet, G.A., Pernet, C.R., & Wilcox, R.R. (2017) Beyond differences in means: robust graphical methods to compare two groups in neuroscience. European Journal of Neuroscience, 46, 1738–1748.

Wilcox, R.R. (2002) Comparing the variances of two independent groups. Br J Math Stat Psychol, 55, 169–175.

Wilcox, R.R. (2017) Introduction to Robust Estimation and Hypothesis Testing, 4th edn. Academic Press.

Wilcox, R.R. & Rousselet, G.A. (2023) An Updated Guide to Robust Statistical Methods in Neuroscience. Current Protocols, 3, e719.

Yan, Y. & Genton, M.G. (2019) The Tukey g-and-h distribution. Significance, 16, 12–13. https://doi.org/10.1111/j.1740-9713.2019.01273.x

Zimmerman, D.W. (2004) A note on preliminary tests of equality of variances. Br J Math Stat Psychol, 57, 173–181.
