forcats fct_lump() in R: Collapse Rare Factor Levels
The forcats fct_lump() function collapses rare factor levels in R into a single "Other" category, turning a long-tailed factor into a tidy handful of groups.
    fct_lump(x)                         # auto-lump rare levels into "Other"
    fct_lump(x, n = 5)                  # keep the 5 most common levels
    fct_lump(x, prop = 0.1)             # keep levels above 10% frequency
    fct_lump(x, n = -3)                 # keep the 3 least common levels, lump the rest
    fct_lump(x, w = weights)            # rank levels by weighted counts
    fct_lump(x, other_level = "Misc")   # rename the catch-all bucket
    fct_lump_min(x, min = 10)           # keep levels seen at least 10 times
Need explanation? Read on for examples and pitfalls.
What fct_lump() does in one sentence
fct_lump() collapses a factor's infrequent levels into a single "Other" level. It comes from the forcats package, part of the tidyverse. A categorical column with dozens of rare values, such as product names or cities, is hard to chart and model. fct_lump() keeps the levels that matter and sweeps the long tail into one labelled bucket, controlled by a count (n), a proportion (prop), or an automatic heuristic.
Syntax
fct_lump() takes a factor plus one rule for how many levels to keep. The signature is short:
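For reference, here is a sketch of the usage line as it appears in recent forcats documentation. The defaults in the comment are an assumption about your installed version, so the snippet also prints the live signature with args().

```r
library(forcats)

# Usage as documented in recent forcats releases (assumed; verify locally):
# fct_lump(f, n, prop, w = NULL, other_level = "Other",
#          ties.method = c("min", "average", "first", "last", "random", "max"))

# Print the argument list of the version that is actually installed
args(fct_lump)
```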
The arguments are:
- f: a factor or a character vector. Other types, such as numeric or logical vectors, must be converted with factor() first.
- n: keep the n most common levels and lump the rest. A negative n flips it, keeping the n least common levels and lumping everything else.
- prop: keep levels that appear at least proportion prop of the time. A negative prop lumps the common levels instead.
- w: an optional numeric weight vector, one value per observation, so levels are ranked by summed weight rather than a plain row count.
- other_level: the name of the catch-all level. Defaults to "Other".
- ties.method: how to break ties when two levels share a count near the n cutoff.
Supply either n or prop, never both. With neither, fct_lump() uses a heuristic that lumps the least common levels as long as the "Other" level stays the smallest.
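To make the w argument concrete, here is a minimal sketch on a made-up vector with made-up weights (both hypothetical, not taken from any dataset). Level "a" wins on row count, but "b" wins once each row is weighted.

```r
library(forcats)

# Hypothetical data: "a" appears most often, but the single "b" row
# carries a large weight.
x <- c("a", "a", "a", "b", "c")
weights <- c(1, 1, 1, 10, 1)

by_count  <- fct_lump(x, n = 1)               # rank levels by row count
by_weight <- fct_lump(x, n = 1, w = weights)  # rank levels by summed weight

levels(by_count)
levels(by_weight)
```

The same n = 1 rule keeps a different level in each call, because only the ranking input changed.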
fct_lump() examples
Example 1 shows the automatic heuristic with no n or prop. Load forcats and call fct_lump() on a vector with a clear long tail.
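A minimal sketch of such a call follows. The vector x is hypothetical, chosen so that two levels dominate and three levels appear once each.

```r
library(forcats)

# Hypothetical long-tailed vector: a x 6, b x 4, and c, d, e once each
x <- c(rep("a", 6), rep("b", 4), "c", "d", "e")

# No n and no prop: fct_lump() falls back to its automatic heuristic
lumped <- fct_lump(x)
table(lumped)
```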
The rare levels c, d, and e (one each) merge into Other, which totals 3. fct_lump() stopped there because lumping b as well would make Other larger than the levels it kept.
Example 2 keeps a fixed number of levels with n. This is the most common use: a bar chart or model needs a manageable count of categories.
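A sketch of this call, assuming ggplot2 is installed so that its mpg dataset (234 rows, 15 manufacturers) is available:

```r
library(forcats)

# mpg ships with ggplot2; manufacturer is a character column
makers <- fct_lump(ggplot2::mpg$manufacturer, n = 3)

# Three named levels plus the catch-all
table(makers)
```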
The three busiest manufacturers stay named; the other twelve collapse into Other.
The row count of mpg is unchanged. What changes is the levels attribute: twelve sparse categories become one. That is why fct_lump() is safe to drop into a pipeline ahead of count(), ggplot2, or a model formula.
Example 3 keeps levels above a frequency threshold with prop. Use prop when the cutoff should scale with the data instead of a hard count.
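A sketch of the proportion rule on the same column, again assuming ggplot2 is installed for the mpg dataset:

```r
library(forcats)

# Keep only manufacturers that account for more than 10% of the rows
big_makers <- fct_lump(ggplot2::mpg$manufacturer, prop = 0.1)

# Share of rows per surviving level
round(prop.table(table(big_makers)), 3)
```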
prop = 0.1 keeps any manufacturer that makes up at least 10 percent of the 234 rows. Four brands clear the bar; the rest become Other.
Example 4 renames the catch-all bucket with other_level. The default label "Other" is generic, so set a name that reads well on a chart axis or in a report.
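A sketch of the rename, assuming ggplot2 is installed for mpg. The label "Smaller brands" is only an illustration; any string works.

```r
library(forcats)

# Same lumping rule as before, but with a chart-friendly catch-all label
brands <- fct_lump(
  ggplot2::mpg$manufacturer,
  n = 4,
  other_level = "Smaller brands"
)

levels(brands)
```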
To order the bars by size, wrap the lumped factor in fct_infreq(): fct_infreq(fct_lump(x, n = 5)) shows the five named bars ranked tallest to shortest, with the Other bar wherever its total lands. The two functions are designed to chain.
fct_lump() vs the fct_lump_*() family
fct_lump() is the umbrella; the fct_lump_*() variants each expose one rule. Reach for the variant that names the rule you want.
| Function | What it keeps | Example |
|---|---|---|
| fct_lump_n() | The n most common levels | fct_lump_n(x, n = 5) |
| fct_lump_prop() | Levels above a proportion | fct_lump_prop(x, prop = 0.1) |
| fct_lump_min() | Levels seen at least min times | fct_lump_min(x, min = 10) |
| fct_lump_lowfreq() | Everything except the rare tail, chosen automatically | fct_lump_lowfreq(x) |
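The correspondence between the umbrella function and its variants can be checked directly. This is a sketch that assumes a recent forcats release, where fct_lump() dispatches to the variants, and ggplot2 for the mpg data.

```r
library(forcats)

x <- ggplot2::mpg$manufacturer

# Each umbrella call should match its explicitly named variant
same_n    <- identical(fct_lump(x, n = 5), fct_lump_n(x, n = 5))
same_prop <- identical(fct_lump(x, prop = 0.1), fct_lump_prop(x, prop = 0.1))
same_auto <- identical(fct_lump(x), fct_lump_lowfreq(x))
c(n = same_n, prop = same_prop, auto = same_auto)

# fct_lump_min() has no umbrella equivalent: keep levels seen 10+ times
at_least_10 <- fct_lump_min(x, min = 10)
table(at_least_10)
```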
fct_lump_n(x, 5) says exactly what fct_lump(x, n = 5) does.
Common pitfalls
fct_lump() rejects n and prop supplied together. The two arguments are competing rules, so passing both is an error rather than a silent choice.
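A sketch that triggers the conflict on a hypothetical vector and captures the condition with tryCatch() so the script keeps running. It assumes a recent forcats release, and the exact wording of the message may vary by version.

```r
library(forcats)

x <- c("a", "a", "a", "b", "b", "c", "d")  # hypothetical values

# Supplying both rules at once; capture the error instead of stopping
msg <- tryCatch(
  {
    fct_lump(x, n = 2, prop = 0.1)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg
```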
Pick one rule: use n for a fixed number of bars, prop for a frequency cutoff that scales with the data.
The "Other" bucket can outgrow every level you keep. When you lump aggressively, the catch-all sums many small levels and can become the single largest category.
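A sketch of aggressive lumping on the mpg manufacturers (assuming ggplot2 is installed), with the result sorted so the largest bucket is listed first:

```r
library(forcats)

# Keep only the two most common manufacturers
two_makers <- fct_lump(ggplot2::mpg$manufacturer, n = 2)

# Sort the counts so the dominant bucket is obvious
sort(table(two_makers), decreasing = TRUE)
```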
With n = 2, Other holds 163 of 234 rows and dwarfs the two named brands.
If Other dominates the result, raise n, switch to fct_lump_min() with a sensible threshold, or lump by prop so the cutoff tracks the data.
Try it yourself
Try it: Collapse the manufacturer column of mpg so only the 5 most common brands remain, with everything else in "Other". Save the factor to ex_makers.
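One possible solution, sketched under the assumption that ggplot2 is installed for the mpg dataset:

```r
library(forcats)

# Keep the 5 most common manufacturers, lump the rest into "Other"
ex_makers <- fct_lump(ggplot2::mpg$manufacturer, n = 5)

levels(ex_makers)
table(ex_makers)
```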
Explanation: fct_lump() ranks the levels by count, keeps the five most common manufacturers, and merges the remaining ten into a single "Other" level. The kept levels stay in alphabetical order, with "Other" appended last.
Related forcats functions
These forcats functions pair naturally with fct_lump() for level management.
- fct_other(): keep or drop named levels explicitly, lumping the rest into "Other".
- fct_collapse(): merge groups of levels into named categories by hand.
- fct_recode(): rename levels one by one.
- fct_infreq(): order levels by frequency, a natural follow-up after lumping.
- Categorical Data in R: the full guide to factors.
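As a rough sketch of how two of these helpers sit alongside fct_lump(), again assuming ggplot2 is installed for mpg: fct_infreq() reorders an already lumped factor by size, and fct_other() lumps by name instead of by frequency.

```r
library(forcats)

x <- ggplot2::mpg$manufacturer

# Lump first, then order every level (including "Other") by frequency
ranked <- fct_infreq(fct_lump(x, n = 5))
levels(ranked)

# Keep two brands chosen by name, whatever their counts
named <- fct_other(x, keep = c("dodge", "toyota"))
levels(named)
```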
See the forcats reference for the official documentation.
FAQ
What is the difference between fct_lump() and fct_lump_n()?
They do the same job with different ergonomics. fct_lump(x, n = 5) and fct_lump_n(x, 5) both keep the five most common levels and lump the rest. fct_lump() is the older umbrella function that switches behavior based on which argument you pass. fct_lump_n() is the newer variant that names the rule directly. Recent forcats versions supersede fct_lump() in favor of the explicit fct_lump_n(), fct_lump_prop(), and fct_lump_min() functions, so new code reads better with the variant.
How do I rename the "Other" category created by fct_lump()?
Pass the other_level argument with the label you want, for example fct_lump(x, n = 4, other_level = "Smaller brands"). The catch-all level then carries your name instead of the default "Other". This is worth doing whenever the factor will appear on a chart axis or in a report, where a generic "Other" reads poorly. The argument works the same way across fct_lump_n(), fct_lump_prop(), and the other variants.
Does fct_lump() work on character vectors?
Yes. fct_lump() accepts a factor or a character vector. When you pass a character vector, fct_lump() converts it to a factor first, then returns a factor with the rare levels collapsed. The original values are preserved for the levels that are kept. Numeric and logical vectors are not accepted directly, so convert them with factor() first. If you need the result back as plain text, wrap the call in as.character().
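A short sketch of the round trip from character vector to factor and back, using a hypothetical vector of city names:

```r
library(forcats)

# Hypothetical character vector, not a factor
cities <- c("Oslo", "Oslo", "Oslo", "Rome", "Rome", "Lima", "Kyiv")

lumped <- fct_lump(cities, n = 2)
is.factor(lumped)              # the result is a factor

# Convert back to plain text when a character vector is needed
lumped_chr <- as.character(lumped)
lumped_chr
```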
How do I keep the rarest levels instead of the most common?
Pass a negative n or prop. The call fct_lump(x, n = -3) keeps the three least common levels and lumps every other level into "Other". Likewise fct_lump(x, prop = -0.1) lumps any level above 10 percent frequency. This inversion is useful when the frequent categories are the noise and the rare ones are the signal, such as flagging unusual error codes or uncommon transactions.
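A small sketch of the negative n case on made-up error codes (the vector and its level names are hypothetical). Per the forcats documentation, a negative n preserves the least common levels, so the frequent codes end up in the catch-all.

```r
library(forcats)

# Hypothetical log of status codes: one dominant level and a rare tail
codes <- c(
  rep("ok", 50), rep("timeout", 5), rep("retry", 4),
  rep("disk_full", 2), "segfault"
)

# Keep the 2 rarest codes; everything more common is lumped together
rare_only <- fct_lump(codes, n = -2, other_level = "Common")
table(rare_only)
```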