dplyr n_distinct() in R: Count Unique Values Fast
The n_distinct() function in dplyr counts the number of unique values in one or more vectors. It is the fast, dplyr-native equivalent of length(unique(x)).
n_distinct(x)                  # unique values in x
n_distinct(x, na.rm = TRUE)    # exclude NAs
n_distinct(x, y)               # unique combinations of x and y
df |> summarise(n_unique = n_distinct(col))
df |> group_by(g) |> summarise(n_unique = n_distinct(col))
length(unique(x))              # base R equivalent (slower)
Need explanation? Read on for examples and pitfalls.
What n_distinct() does in one sentence
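n_distinct(..., na.rm = FALSE) returns the integer count of unique values in a vector, or of unique combinations when you pass several vectors. It counts with a hash-based approach instead of building the unique vector first, which can make it faster than length(unique(x)) on large inputs; the size of the gap depends on your dplyr version and the data type.
It is the standard "how many unique customers / products / categories?" function in dplyr.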
Syntax
n_distinct(..., na.rm = FALSE). Pass one or more vectors; counts unique combinations across them.
n_distinct(x) can be faster than length(unique(x)). dplyr uses a hash-based approach that avoids materializing the unique vector, and the difference shows up mainly on large vectors (a million elements or more).
Five common patterns
1. Unique values in one column
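A minimal sketch of this pattern, using the built-in mtcars data as an assumed example (the dataset choice is mine, not the original article's):

```r
library(dplyr)

# cyl in mtcars takes the values 4, 6 and 8
n_cyl <- n_distinct(mtcars$cyl)
n_cyl
#> [1] 3

# Base R gives the same count here
length(unique(mtcars$cyl))
#> [1] 3
```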
2. Inside summarise
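A short sketch of n_distinct() as a summary function, again assuming mtcars as the example data. The output column names (n_rows, n_cyl, n_gear) are illustrative:

```r
library(dplyr)

overall <- mtcars |>
  summarise(
    n_rows = n(),              # total rows
    n_cyl  = n_distinct(cyl),  # distinct cylinder counts
    n_gear = n_distinct(gear)  # distinct gear counts
  )

overall
#>   n_rows n_cyl n_gear
#> 1     32     3      3
```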
3. Multi-column combinations
Pass multiple vectors to count unique combinations.
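As a sketch, assuming mtcars again: counting the (cyl, gear) pairs that actually occur, both directly and inside summarise().

```r
library(dplyr)

# Distinct (cyl, gear) pairs present in the data
n_pairs <- n_distinct(mtcars$cyl, mtcars$gear)
n_pairs
#> [1] 8

# The same count inside a pipeline
pairs_tbl <- mtcars |>
  summarise(n_combos = n_distinct(cyl, gear))
```

Not every possible pair appears: 8-cylinder cars in mtcars never have 4 gears, so the count is 8 rather than 3 x 3 = 9.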
4. Excluding NAs
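A small made-up vector shows the effect of na.rm (the values are illustrative only):

```r
library(dplyr)

x <- c(1, 2, 2, NA, NA)

with_na    <- n_distinct(x)                # NA counts as one distinct value
without_na <- n_distinct(x, na.rm = TRUE)  # NAs dropped before counting

with_na
#> [1] 3
without_na
#> [1] 2
```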
5. Inside mutate (per-group)
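A sketch of the per-group mutate pattern, assuming mtcars and an illustrative column name n_gear_in_cyl. Unlike summarise(), every row is kept and the group-level count is repeated on each row:

```r
library(dplyr)

cars_flagged <- mtcars |>
  group_by(cyl) |>
  mutate(n_gear_in_cyl = n_distinct(gear)) |>  # one count per cyl group
  ungroup()

# One line per group, to inspect the counts that were attached
cars_flagged |>
  distinct(cyl, n_gear_in_cyl) |>
  arrange(cyl)
```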
n_distinct() differs from n(): it counts UNIQUE values, not ROWS. n() returns the group size; n_distinct(col) returns how many distinct values appear in that column. Easy to confuse but very different semantics.
n_distinct() vs length(unique()) vs distinct() vs n()
Five functions that answer "uniqueness" questions in dplyr / R.
| Function | Returns | Best for |
|---|---|---|
| n_distinct(x) | Integer count | Quick count, dplyr summarise |
| length(unique(x)) | Integer count | Base R, equivalent but slower |
| dplyr::distinct(df, col) | Filtered tibble | "Show me the unique rows" |
| unique(x) | Vector of unique values | Inspect what those values are |
| n() | Group size (row count) | Different question |
When to use which:
- n_distinct(x) inside summarise/mutate.
- length(unique(x)) for base R; same result, slightly slower.
- distinct(df, col) to keep one row per unique value.
- unique(x) to see the actual unique values.
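To make the comparison concrete, here is a sketch that runs each function on the same toy tibble. The tibble df and its columns g and col are made up for illustration:

```r
library(dplyr)

df <- tibble(
  g   = c("a", "a", "b", "b", "b"),
  col = c(1, 1, 2, 3, 3)
)

n_distinct(df$col)       # 3: a count
length(unique(df$col))   # 3: the same count via base R
unique(df$col)           # 1 2 3: the values themselves
distinct(df, col)        # tibble with one row per unique value

# n() counts rows, n_distinct() counts unique values
by_group <- df |>
  group_by(g) |>
  summarise(rows = n(), uniq = n_distinct(col))
# g = "a": rows 2, uniq 1
# g = "b": rows 3, uniq 2
```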
A practical workflow
The "audit" pattern is the most common n_distinct use case in summary tables.
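A sketch of such an audit on a hypothetical event log. The events tibble and its columns (date, user_id, item_id) are assumptions for illustration, not data from the original article:

```r
library(dplyr)

# Hypothetical event log: one row per session
events <- tibble(
  date    = as.Date(c("2024-01-01", "2024-01-01", "2024-01-01",
                      "2024-01-02", "2024-01-02")),
  user_id = c("u1", "u2", "u1", "u3", "u1"),
  item_id = c("i1", "i1", "i2", "i2", "i3")
)

audit <- events |>
  summarise(
    n_rows  = n(),
    n_users = n_distinct(user_id),
    n_items = n_distinct(item_id)
  )
# n_rows = 5, n_users = 3, n_items = 3
```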
A first-pass dataset audit: how many rows, how many unique users, how many unique items. Tells you the table's shape at a glance.
For per-group audits:
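A sketch under the same assumptions: a hypothetical events tibble where each row is treated as one session, grouped by day.

```r
library(dplyr)

# Hypothetical event log: one row per session
events <- tibble(
  date    = as.Date(c("2024-01-01", "2024-01-01", "2024-01-01",
                      "2024-01-02", "2024-01-02")),
  user_id = c("u1", "u2", "u1", "u3", "u1")
)

daily <- events |>
  group_by(date) |>
  summarise(
    n_sessions = n(),                  # rows per day
    n_users    = n_distinct(user_id)   # unique users per day
  )
# 2024-01-01: 3 sessions, 2 users
# 2024-01-02: 2 sessions, 2 users
```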
Daily session and unique-user counts.
Common pitfalls
Pitfall 1: NA counted as distinct. Default na.rm = FALSE includes NA in the count. n_distinct(c(1, NA, 1)) returns 2 (1 and NA). Add na.rm = TRUE to exclude.
Pitfall 2: passing multiple columns counts combinations. n_distinct(x, y) counts unique (x, y) PAIRS, not unique values in either vector. To count each separately, call n_distinct() once per vector.
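Both pitfalls can be reproduced with a few made-up vectors (x, y and z are illustrative only):

```r
library(dplyr)

# Pitfall 1: NA is counted unless na.rm = TRUE
z <- c(1, NA, 1)
n_distinct(z)                # 2: the values 1 and NA
n_distinct(z, na.rm = TRUE)  # 1

# Pitfall 2: several vectors means pairs, not separate counts
x <- c("a", "a", "b")
y <- c(1, 2, 2)
n_distinct(x, y)             # 3 pairs: (a, 1), (a, 2), (b, 2)
n_distinct(x)                # 2
n_distinct(y)                # 2
```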
n_distinct() and length(unique(x)) usually return the same count, but they CAN differ on edge cases such as factors and NA handling. Pick one in a project and stick with it for consistency.
Performance note
n_distinct() uses a hash-based implementation that counts distinct values without building the full unique vector, which can save memory and time on big inputs. How much faster it is than length(unique(x)) depends on the dplyr version, the data type, and the number of distinct values, so benchmark on your own data before relying on a specific speedup. For everyday data sizes (thousands to hundreds of thousands of rows) both functions feel instant, so pick by readability and consistency. Inside dplyr pipelines, n_distinct is the idiomatic choice. For data.table users, uniqueN() plays the same role.
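A rough way to check the difference on your own machine, using only base R timing. The vector size and the number of repetitions are arbitrary assumptions, and results will vary by R and dplyr version:

```r
library(dplyr)

set.seed(1)
# One million integers drawn from 100,000 possible values
x <- sample.int(1e5, size = 1e6, replace = TRUE)

# Both approaches should agree on the count
count_dplyr <- n_distinct(x)
count_base  <- length(unique(x))
stopifnot(count_dplyr == count_base)

# Time ten repetitions of each; compare the "elapsed" figures
system.time(for (i in 1:10) n_distinct(x))
system.time(for (i in 1:10) length(unique(x)))
```

For more careful measurements, a dedicated benchmarking package gives repeated runs and memory figures, but the loop above is enough for a first impression.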
Try it yourself
Try it: For each cyl group in mtcars, count the unique number of gears AND the unique number of carburetors. Save to ex_uniq.
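One possible solution, as a sketch. The output column names n_gear and n_carb are my choice; the exercise only fixes the name ex_uniq:

```r
library(dplyr)

ex_uniq <- mtcars |>
  group_by(cyl) |>
  summarise(
    n_gear = n_distinct(gear),  # unique gear values per cyl
    n_carb = n_distinct(carb)   # unique carburetor counts per cyl
  )

ex_uniq
# cyl 4: n_gear 3, n_carb 2
# cyl 6: n_gear 3, n_carb 3
# cyl 8: n_gear 2, n_carb 4
```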
Explanation: n_distinct(gear) counts unique gear values per cyl. Same for carb. Two summary stats per group.
Related dplyr functions
After mastering n_distinct, look at:
- n(): row count of current group
- count(df, g): count rows per group
- distinct(df, col): keep one row per unique value
- unique(x): base R; show the unique values
- summarise(): standard aggregation context
- group_by(): per-group n_distinct
For "show me the actual unique rows", distinct(df, col, .keep_all = TRUE) is the cleaner tool.
FAQ
What does n_distinct do in dplyr?
n_distinct(x) returns the count of unique values in x as an integer. It can be faster than length(unique(x)) on large inputs and integrates with summarise / mutate.
What is the difference between n_distinct and length(unique())?
Both normally return the same count, though edge cases can differ. n_distinct is hash-based, often faster on large vectors, and is the dplyr-native idiom. length(unique()) is base R and works anywhere.
How do I exclude NAs from n_distinct?
Pass na.rm = TRUE: n_distinct(x, na.rm = TRUE). Default counts NA as one of the distinct values.
How do I count unique combinations of multiple columns?
n_distinct(x, y, z) counts unique tuples across the three vectors. Pass each column as a separate argument.
Is n_distinct different from n() in dplyr?
Yes. n() counts ROWS in the current group. n_distinct(col) counts UNIQUE values in a column. Different questions.