Fligner-Killeen Test in R: Robust Alternative to Levene's Test
The Fligner-Killeen test checks whether several groups share the same variance, without assuming the data is normally distributed. Use it when residual diagnostics reveal heavy tails, skew, or outliers, and you still need a defensible homogeneity check before ANOVA or a t-test.
Bartlett's test fails badly under non-normality and Levene's test handles it acceptably; Fligner-Killeen handles it best, and it ships with base R via fligner.test().
What does the Fligner-Killeen test do?
Fligner-Killeen tests whether several groups share the same variance, without assuming the data is normally distributed. That makes it the safest pick when your residuals look skewed or have heavy tails, where Bartlett's test would lie to you. Let's run it on the built-in InsectSprays data, which records insect counts under six different sprays. We expect very different spreads, since some sprays kill nearly everything (low variance) while others miss many plots (high variance).
Pass the formula count ~ spray and the data frame, and the function returns a chi-square statistic and a p-value.
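The call is one line (InsectSprays ships with base R, so this runs as-is):

```r
# Test whether insect counts have equal variance across the six sprays
fk <- fligner.test(count ~ spray, data = InsectSprays)
fk
```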
The chi-squared statistic is 14.5 on 5 degrees of freedom, with p = 0.013. Since p is well below 0.05, we reject the null hypothesis of equal variances. The six sprays do not produce the same spread of insect counts, which means a standard one-way ANOVA on this data would violate the equal-variance assumption.
Under the hood, the test centres each observation by subtracting its group's median, takes the absolute deviation, ranks all of those across the full sample, and converts each rank to a normal score (base R uses qnorm((1 + r/(n + 1))/2), the Conover et al. variant). It then asks: do the average scores differ across groups? If a group has unusually large absolute deviations, its average score will be higher.
The chi-square test statistic has the form:
$$X^2 = \frac{\sum_{i=1}^{k} n_i (\bar{a}_i - \bar{a})^2}{V^2}$$
Where:
- $k$ = number of groups
- $n_i$ = sample size in group $i$
- $\bar{a}_i$ = mean rank score in group $i$
- $\bar{a}$ = overall mean rank score
- $V^2$ = variance of the rank scores
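The statistic can be reproduced by hand in a few lines. This is a sketch of the normal-score computation described above, checked against fligner.test() on InsectSprays; the variable names are mine:

```r
x <- InsectSprays$count
g <- InsectSprays$spray
n <- length(x)

# 1. Centre on the group median, take absolute deviations
dev <- abs(x - ave(x, g, FUN = median))

# 2. Rank the deviations across the full sample, map ranks to normal scores
a <- qnorm((1 + rank(dev) / (n + 1)) / 2)

# 3. Compare each group's mean score to the overall mean score
a_bar  <- mean(a)
ni     <- as.vector(table(g))
ai_bar <- tapply(a, g, mean)
V2     <- var(a)

stat <- sum(ni * (ai_bar - a_bar)^2) / V2
stat                                          # the chi-square statistic
pchisq(stat, df = nlevels(g) - 1, lower.tail = FALSE)  # the p-value
```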
Try it: Run the Fligner-Killeen test on the chickwts data, which records the weights of chicks fed six different feed types. Use the formula weight ~ feed. Save the result to ex_chick.
Click to reveal solution
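One possible solution (chickwts is a built-in dataset):

```r
# Equal variance of chick weights across the six feed types?
ex_chick <- fligner.test(weight ~ feed, data = chickwts)
ex_chick
```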
Explanation: p = 0.54 is far above 0.05, so we have no evidence that the variance differs across feed types. A standard one-way ANOVA's equal-variance assumption looks safe here.
How do you run fligner.test() in R?
fligner.test() accepts two interfaces. The first is the formula interface, where you write y ~ g and pass data = your_df. The second is the list interface, where you pass a list of numeric vectors, one per group. Both produce identical results, but the formula form is faster to type when your data is already in a tidy data frame.
Here is the formula interface on PlantGrowth, which compares plant weights under three conditions: ctrl, trt1, trt2.
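A minimal sketch of the formula interface:

```r
# Formula interface: response ~ grouping factor, plus the data frame
fk_formula <- fligner.test(weight ~ group, data = PlantGrowth)
fk_formula
```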
The p-value of 0.31 is comfortably above 0.05, so we have no evidence that the three groups have different variances. Plant weights spread out roughly equally under control and the two treatments.
Now the same data through the list interface. Use split() to break the numeric column into a list keyed by the grouping factor, then hand that list straight to fligner.test().
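The list-interface version might look like this:

```r
# Split the numeric column into a list of vectors, one per group
groups <- split(PlantGrowth$weight, PlantGrowth$group)

# Pass the list straight in -- no formula, no data argument
fk_list <- fligner.test(groups)
fk_list
```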
The chi-squared statistic, df, and p-value match the formula version exactly. Only the data: line in the printout changes, which reflects how you fed the data in.
weight ~ group reads as "weight by group". Reach for the list interface only when your data already lives as separate vectors and you do not want to pack them into a data frame just to test variances.

Try it: Run fligner.test() on the iris dataset to test whether Sepal.Length has equal variance across the three Species. Use the formula interface and save the result to ex_iris_fk.
Click to reveal solution
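One possible solution:

```r
# Equal variance of sepal length across the three iris species?
ex_iris_fk <- fligner.test(Sepal.Length ~ Species, data = iris)
ex_iris_fk
```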
Explanation: p = 0.50 indicates no evidence of unequal variance across species, so the equal-variance assumption holds for Sepal.Length.
How do you interpret the p-value?
The decision rule is the same as any hypothesis test. Choose a significance level $\alpha$ (usually 0.05). If the p-value is below $\alpha$, reject the null and conclude variances differ. If it is at or above $\alpha$, you have no evidence to reject equal variances. Use that decision to choose between standard ANOVA (assumes equal variance), Welch's ANOVA (does not), or a transformation.
A boxplot makes the InsectSprays result visceral. Sprays C, D, and E hug the floor, while A, B, and F spread widely. The Fligner-Killeen p-value of 0.013 puts a number on what your eyes already see.
The boxplots back up the test. Sprays C and E have boxes only a few units tall, while A and F span a much wider range. Pulling out $p.value from the test object gives you the raw number for use in a report or for chaining into a downstream decision in your script.
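A sketch of the plot and the p-value extraction, using base graphics (variable names are mine):

```r
# Visualise the spreads the test is quantifying
boxplot(count ~ spray, data = InsectSprays,
        main = "Insect counts by spray", ylab = "count")

# Pull the raw p-value out of the htest object for use downstream
fk <- fligner.test(count ~ spray, data = InsectSprays)
fk$p.value
```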
Now check the test calibrates correctly under the null. Simulate three groups drawn from the same Normal distribution, where every variance is exactly 1. The p-value should land somewhere in [0, 1] with no bias toward small values.
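A simulation sketch. The original seed is not shown, so the seed below is illustrative and your p-value will differ from the 0.79 quoted next; the key point is that it should not be systematically small:

```r
set.seed(1)  # illustrative seed; not the one behind the article's p = 0.79
sim <- data.frame(
  y = c(rnorm(30, mean = 0, sd = 1),   # three groups, different means,
        rnorm(30, mean = 2, sd = 1),   # identical variance of 1
        rnorm(30, mean = 5, sd = 1)),
  g = rep(c("A", "B", "C"), each = 30)
)
fligner.test(y ~ g, data = sim)
```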
p = 0.79 is well above 0.05, exactly as it should be when all three variances are truly identical. The means differ, but Fligner-Killeen ignores that; it tests spread, not location.
Try it: Simulate three groups of size 25 from rt(25, df = 3) (heavy-tailed t-distribution), with the second group multiplied by 3 to inflate its scale. Run fligner.test() and check whether it detects the inflated variance. Save the test object to ex_sim.
Click to reveal solution
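One possible solution. The seed is illustrative (the article's seed is not shown), so your exact p-value will vary, but the inflated scale in group B should be detected:

```r
set.seed(1)  # illustrative seed
y <- c(rt(25, df = 3),        # group A: heavy-tailed, scale 1
       3 * rt(25, df = 3),    # group B: scale inflated 3x (variance 9x)
       rt(25, df = 3))        # group C: scale 1
g <- factor(rep(c("A", "B", "C"), each = 25))

ex_sim <- fligner.test(y, g)
ex_sim
```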
Explanation: The p-value is 0.0001, comfortably below 0.05. Fligner-Killeen catches the 9x variance inflation in group B even though the data has heavy t-distributed tails, where Bartlett's test would have produced a misleading result driven by the tails rather than the true scale difference.
When should you use it instead of Bartlett's or Levene's?
R offers three tests for equal variance across groups: bartlett.test(), car::leveneTest(), and fligner.test(). They differ in what they assume and how they fail. The table below summarises the trade-offs.
| Test | Assumes normality? | Robust to outliers? | Power on Normal data | Recommended when |
|---|---|---|---|---|
| bartlett.test() | Yes (strong) | No | High | Data is clearly Normal and outlier-free |
| car::leveneTest() | No (centres on mean or median) | Moderate | Moderate | Mild departures from normality |
| fligner.test() | No (rank-based, median-centred) | High | Moderate | Heavy tails, skew, or outliers |
The simulation below shows the difference. Build three groups from a t-distribution with df = 3 (heavy tails) and identical scale. Bartlett's test should incorrectly flag a difference, because tails inflate the sample variance unevenly across groups. Fligner-Killeen should not.
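A sketch of that simulation. It needs the car package installed for leveneTest(); the seed is illustrative (the article's seed is not shown), so the exact p-values will differ from the ones quoted next:

```r
set.seed(1)  # illustrative seed
y <- rt(90, df = 3)                         # heavy tails, same true scale
g <- factor(rep(c("A", "B", "C"), each = 30))

bartlett.test(y, g)$p.value                 # prone to false positives here
fligner.test(y, g)$p.value                  # should stay calibrated

if (requireNamespace("car", quietly = TRUE)) {
  car::leveneTest(y, g)[1, "Pr(>F)"]        # middle ground
}
```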
Bartlett's test returns p = 0.01, falsely declaring unequal variances. The three groups all have the same true scale; Bartlett is fooled by the heavy tails. Levene (p = 0.32) and Fligner-Killeen (p = 0.41) correctly fail to reject. On heavy-tailed data, Bartlett's test inflates the false-positive rate, which is exactly the failure mode Fligner-Killeen was designed to fix.
Levene's test works on the raw absolute deviations from each group's centre (car::leveneTest defaults to center = "median"), so a single extreme value still inflates its group's average deviation. Fligner-Killeen works on the ranks of those deviations, where an outlier just becomes "the largest rank" and stops mattering. That is why it survives the heaviest tails.

Try it: Use airquality to test whether Ozone has equal variance across Month. Run all three tests and save the three p-values into a named vector called ex_air. (Hint: filter out NAs first with na.omit().)
Click to reveal solution
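One possible solution (needs the car package installed for the Levene p-value):

```r
# Keep only the complete Ozone/Month rows, and treat Month as a factor
air <- na.omit(airquality[, c("Ozone", "Month")])
air$Month <- factor(air$Month)

ex_air <- c(
  bartlett = bartlett.test(Ozone ~ Month, data = air)$p.value,
  levene   = car::leveneTest(Ozone ~ Month, data = air)[1, "Pr(>F)"],
  fligner  = fligner.test(Ozone ~ Month, data = air)$p.value
)
ex_air
```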
Explanation: All three tests agree the variances differ across months, but Bartlett's p-value is dramatically smaller than the other two. That gap is a fingerprint of skewed data: Bartlett over-rejects because Ozone is right-skewed. The Fligner-Killeen p (0.0098) is the most defensible number to report.
What if variances are unequal?
A small Fligner-Killeen p-value tells you not to use a standard one-way ANOVA, because its F-test assumes equal variances and will produce wrong p-values otherwise. You have three good options.
Welch's ANOVA is the simplest fix. It modifies the F-test to account for unequal variances and gives you back a valid p-value with very little effort. Use oneway.test(..., var.equal = FALSE). A transformation like log() or sqrt() can stabilise variance when the unequal spread is driven by skew. Kruskal-Wallis is the right call when both location and spread differ, since it tests for any distributional difference between groups.
Here is Welch's ANOVA on the InsectSprays data, where Fligner-Killeen rejected equal variance.
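The one-argument switch looks like this:

```r
# Welch's ANOVA: same formula as aov(), but variances are not pooled
welch <- oneway.test(count ~ spray, data = InsectSprays, var.equal = FALSE)
welch
```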
p = 8e-12 says the spray means really do differ. Welch handled the unequal variance flagged by Fligner-Killeen automatically, by adjusting the denominator degrees of freedom from the standard ANOVA's 66 down to 30, which keeps the test honest. You did not need to transform the data or switch to a rank test, just flip one argument.
If the groups differ in shape as well as spread, switch to kruskal.test(), which makes no shape assumption. The decision is laid out in detail in When to Use Nonparametric Tests in R, the parent post for this article.

Try it: PlantGrowth passed the Fligner-Killeen check earlier, so a standard ANOVA is fine. Still, run Welch's ANOVA on it as a robust default and save the test object to ex_pg_welch. The p-value should be similar to a regular aov().
Click to reveal solution
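One possible solution:

```r
# Welch's ANOVA on data that already passed the equal-variance check
ex_pg_welch <- oneway.test(weight ~ group, data = PlantGrowth,
                           var.equal = FALSE)
ex_pg_welch
```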
Explanation: p = 0.017, similar to what aov() would produce here. Welch's adjustment is mild because Fligner-Killeen confirmed the variances were already comparable. Welch makes a great default for one-way comparisons, since it costs you almost nothing when variances are equal but saves you when they are not.
Practice Exercises
Exercise 1: Fligner-Killeen vs Bartlett under skew
Generate three groups of size 50 from a chi-squared distribution with 2 df (right-skewed, all the same true scale). Run both Bartlett and Fligner-Killeen. Save Bartlett's p-value to my_bart_p and Fligner-Killeen's p-value to my_flig_p. Bartlett should over-reject because of the skew.
Click to reveal solution
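One possible solution. The seed is illustrative (the article's seed is not shown), so your p-values will differ from the 0.072 and 0.81 discussed next:

```r
set.seed(1)  # illustrative seed
y <- rchisq(150, df = 2)                     # right-skewed, same true scale
g <- factor(rep(c("A", "B", "C"), each = 50))

my_bart_p <- bartlett.test(y, g)$p.value     # expected to over-reject
my_flig_p <- fligner.test(y, g)$p.value      # expected to stay calibrated
c(bartlett = my_bart_p, fligner = my_flig_p)
```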
Explanation: Bartlett's p = 0.072 is borderline and would flip to "significant" with a slightly different seed. Fligner's p = 0.81 is rock-solid: it correctly says the variances are equal, despite the skew. On any real-world right-skewed data (waiting times, counts, incomes), Bartlett's false-positive rate is unsafe and Fligner is the right default.
Exercise 2: Real-data workflow with Welch fallback
Use airquality. Test whether Ozone variance differs by Month with Fligner-Killeen. If the test rejects equal variance, run Welch's ANOVA. Save the Fligner p-value to my_flig_air and the Welch p-value to my_welch_p.
Click to reveal solution
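One possible solution, written so the Welch step runs only when Fligner-Killeen actually rejects:

```r
# Complete cases only, Month as a grouping factor
air <- na.omit(airquality[, c("Ozone", "Month")])
air$Month <- factor(air$Month)

# Step 1: variance check
my_flig_air <- fligner.test(Ozone ~ Month, data = air)$p.value

# Step 2: if equal variance is rejected, fall back to Welch's ANOVA
my_welch_p <- if (my_flig_air < 0.05) {
  oneway.test(Ozone ~ Month, data = air, var.equal = FALSE)$p.value
} else {
  NA_real_  # standard aov() would be acceptable here
}
c(fligner = my_flig_air, welch = my_welch_p)
```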
Explanation: Fligner-Killeen flags unequal variance (p = 0.010), so a standard aov() is unsafe. Welch's ANOVA returns p < 0.0001, which is a defensible result that does not assume equal variances. This two-step workflow is the standard pattern: check variance, then either keep ANOVA or switch to Welch.
Complete Example
End-to-end on mtcars: does fuel efficiency (mpg) have equal variance across cylinder counts (cyl)? Test, decide, then choose the right group-mean comparison.
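A sketch of the full workflow (factor(cyl) turns the numeric cylinder count into a grouping factor):

```r
# 1. Assumption check: equal mpg variance across cylinder counts?
fk <- fligner.test(mpg ~ factor(cyl), data = mtcars)
fk$p.value

# 2. Equal variance is rejected, so use Welch's ANOVA instead of aov()
welch <- oneway.test(mpg ~ factor(cyl), data = mtcars, var.equal = FALSE)
welch$p.value
```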
Three lines of decision-relevant output. Fligner-Killeen rejects equal variance at p = 0.036, so we move from aov() to oneway.test(..., var.equal = FALSE). Welch returns p ≈ 1.7e-5, an emphatic rejection of equal mpg means across cylinder counts. The whole workflow, from assumption check to final inference, fits in three function calls.
fligner.test() drops missing values silently. The default na.action = na.omit removes any row with NA in either the response or the grouping variable before testing. That is usually what you want, but if you have heavy missingness, run summary(your_df) first to confirm you are not dropping a meaningful chunk of the sample.

Summary
| Question | Answer |
|---|---|
| What does it test? | Whether the variances of two or more groups are equal. |
| What does it assume? | Independent observations within and across groups. No normality required. |
| When should I use it? | When residuals are non-normal, heavy-tailed, or have outliers, where Bartlett over-rejects. |
| How do I run it? | fligner.test(y ~ g, data = your_df) or fligner.test(list_of_vectors). |
| How do I interpret the p-value? | p < 0.05 → variances differ. p ≥ 0.05 → no evidence of difference (not proof of equality). |
| What if variances differ? | Use oneway.test(..., var.equal = FALSE) (Welch ANOVA), or transform, or switch to Kruskal-Wallis. |
The Fligner-Killeen test is your default homogeneity-of-variance check whenever you cannot fully trust the normality of your data. It costs nothing to run, ships with base R, and stays calibrated under the heavy tails that wreck Bartlett's test.
References
- Fligner, M. A., & Killeen, T. J. (1976). "Distribution-free two-sample tests for scale." Journal of the American Statistical Association 71(353), 210–213. Link
- Conover, W. J., Johnson, M. E., & Johnson, M. M. (1981). "A comparative study of tests for homogeneity of variances, with applications to the outer continental shelf bidding data." Technometrics 23(4), 351–361. Link
- R Core Team. fligner.test() documentation. Link
- R Core Team. oneway.test() documentation (Welch's ANOVA). Link
- Fox, J., & Weisberg, S. (2019). An R Companion to Applied Regression, 3rd ed. Sage. Link
- Crawley, M. J. (2013). The R Book, 2nd ed. Wiley. Chapter 11: Analysis of Variance. Link
- Wilcox, R. R. (2022). Introduction to Robust Estimation and Hypothesis Testing, 5th ed. Academic Press.
Continue Learning
- When to Use Nonparametric Tests in R. The parent decision guide. When variance fails and shape differs, this article walks you to the right rank-based test.
- Normality and Variance Tests in R. Pairs Fligner-Killeen with Shapiro-Wilk, Anderson-Darling, and Bartlett in a complete pre-ANOVA assumption-checking workflow.
- Kruskal-Wallis Test in R. The rank-based one-way ANOVA alternative, the right move when groups differ in shape as well as scale.