forcats fct_lump_min() in R: Lump Levels by Frequency Floor
The forcats fct_lump_min() function keeps every factor level in R that appears at least min times and collapses the rarer levels into a single "Other" category, so a raw frequency floor decides which categories survive.
fct_lump_min(x, min = 10) # keep levels seen >= 10 times fct_lump_min(x, min = 100, w = wt) # rank levels by summed weight fct_lump_min(x, min = 10, other_level = "Misc") # rename the catch-all bucket fct_lump_min(df$col, min = 5) # works on a data-frame column fct_count(fct_lump_min(x, min = 10)) # tally the lumped result fct_infreq(fct_lump_min(x, min = 10)) # order the lumped result
Need explanation? Read on for examples and pitfalls.
What fct_lump_min() does in one sentence
fct_lump_min() keeps factor levels that clear a count threshold and lumps the rest into "Other". It comes from the forcats package in the tidyverse. You pass a factor and an integer min, and the function keeps every level seen at least min times while merging every rarer level into one labelled bucket. Unlike fct_lump_n(), which keeps a fixed number of levels, fct_lump_min() lets the data decide how many categories survive.
Syntax
fct_lump_min() takes a factor and a minimum count. The signature is short and has no tie-breaking argument:
The arguments are:
f: a factor, or any vector that can be coerced to one (character, numeric, or logical).min: the frequency floor. A level is kept when its count is greater than or equal tomin; anything below the floor is lumped.w: an optional numeric weight vector, one value per observation. Levels are then judged by summed weight rather than a plain row count.other_level: the name of the catch-all level. Defaults to"Other".
fct_lump_min() never drops rows. It only rewrites the levels attribute, so the result has the same length as the input and is safe to drop into a pipeline.
fct_lump_min() examples
Example 1 keeps the well-populated levels of a survey factor. The partyid column of gss_cat, a dataset bundled with forcats, has ten political-affiliation levels with a thin tail.
Seven levels clear the floor of 1,000 rows. The three rare levels fall below it and collapse into a single Other level holding 548 rows.
min encodes that minimum sample size directly. You do not pick a category count; you pick the smallest group you are willing to trust, and the data tells you how many survive.Example 2 makes the threshold concrete on a small vector. A handful of letter grades shows exactly which counts pass and which get lumped.
A, B, and D each appear at least twice and stay named. C and F appear once, so they merge into Other, which holds their two combined rows.
Example 3 ranks levels by a weighted total instead of a row count. Pass a numeric w so a level clears the floor on its summed weight, not how many rows carry it.
By row count, cat and dog clear a floor of 2. By spend, dog totals only 15 and gets lumped, while fox and owl survive on a single high-value row each.
Example 4 renames the bucket and orders the result by frequency. The default "Other" label reads poorly on a chart, and fct_lump_min() does not sort, so chain fct_infreq().
fct_infreq() ranks the bars by frequency in one extra step.fct_lump_min() vs the other fct_lump_*() functions
fct_lump_min() is one of four lumping variants, each naming a different keep rule. Pick the variant whose rule matches the decision you actually want to make.
| Function | Keep rule | Example |
|---|---|---|
| fct_lump_min() | Levels seen at least min times |
fct_lump_min(x, min = 10) |
| fct_lump_n() | The n most common levels |
fct_lump_n(x, n = 5) |
| fct_lump_prop() | Levels above a proportion | fct_lump_prop(x, prop = 0.1) |
| fct_lump_lowfreq() | Only the rare tail, automatically | fct_lump_lowfreq(x) |
Use fct_lump_min() when a raw sample-size floor matters, fct_lump_n() when you need a predictable category count, and fct_lump_prop() when the cutoff should scale with the size of the data.
fct_lump(); recent versions split it into the explicit fct_lump_n(), fct_lump_prop(), fct_lump_min(), and fct_lump_lowfreq() functions. The umbrella still runs, but the named variants state intent more clearly and are the recommended form for new code.Common pitfalls
An existing "Other" level absorbs the lumped rows silently. If the factor already has a level named "Other", fct_lump_min() merges the rare tail into it rather than creating a fresh bucket.
The relig factor already has an "Other" level with 224 rows. After lumping, it holds 553, because six rare levels quietly merged into it. Rename the bucket with other_level if that conflation would mislead a reader.
min above every level count collapses the whole factor into one bucket. If no level clears the floor, fct_lump_min() returns a factor with a single "Other" level and raises no error or warning. A pipeline that expected named categories then breaks downstream. Check the maximum level count before choosing min.The floor is inclusive, so a level exactly at min is kept. fct_lump_min() keeps levels whose count is greater than or equal to min, not strictly greater. A B seen twice survives min = 2 but is lumped at min = 3.
Try it yourself
Try it: Collapse the marital column of gss_cat so only marital statuses seen at least 5,000 times remain, with everything rarer in "Other". Save the factor to ex_marital.
Click to reveal solution
Explanation: fct_lump_min() keeps each marital level whose count reaches 5,000 ("Divorced", "Married", "Never married") and merges the rarer levels ("No answer", "Separated", "Widowed") into a single "Other" level appended last.
Related forcats functions
These forcats functions pair naturally with fct_lump_min() for level management.
- fct_lump_n(): keep a fixed number of levels instead of a frequency floor.
- fct_lump(): the umbrella lumping function that switches on
n,prop, or a heuristic. - fct_infreq(): order levels by frequency, the natural follow-up after lumping.
- fct_other(): keep or drop named levels explicitly instead of by count.
- Categorical Data in R: the full guide to factors.
See the forcats reference for the official documentation.
FAQ
What does fct_lump_min() do in R?
fct_lump_min() collapses the rare levels of a factor into a single "Other" category. You give it a factor and an integer min, and it keeps every level seen at least min times while merging every rarer level into one bucket. The result has the same number of rows as the input; only the levels attribute changes. It suits any case where a category needs enough observations to be trustworthy.
What is the difference between fct_lump_min() and fct_lump_n()?
Both lump rare levels into "Other", but they use different keep rules. fct_lump_min() keeps every level whose count clears a frequency floor, so the number of surviving levels depends on the data. fct_lump_n() keeps a fixed count of levels, the n most common ones, regardless of how frequent they actually are. Use fct_lump_min() when a raw sample-size threshold matters and fct_lump_n() when you need a predictable number of categories.
Does fct_lump_min() keep a level whose count equals min?
Yes. The floor is inclusive: fct_lump_min() keeps levels whose count is greater than or equal to min, not strictly greater. A level seen exactly min times survives, and a level seen min - 1 times is lumped. If you want a strict cutoff, set min one higher than the boundary count you want to exclude. This inclusive behavior is consistent across the forcats lumping variants.
How do I rename the "Other" level created by fct_lump_min()?
Pass the other_level argument with your preferred label, for example fct_lump_min(x, min = 10, other_level = "Rare groups"). The catch-all level then carries that name instead of the default "Other". Renaming is especially important when the factor already has a level named "Other", because fct_lump_min() would otherwise merge the rare tail into that existing level and inflate its count without any warning.
Can fct_lump_min() rank levels by a value other than count?
Yes, through the w argument. Supply a numeric vector the same length as the factor, and fct_lump_min() compares each level's summed weight against min instead of its row count. For example, keeping products whose total revenue reaches a threshold: fct_lump_min(product, min = 1000, w = revenue). A level that appears in few rows but carries large weights can then clear the floor, while a frequent low-weight level falls below it.