dplyr distinct() in R: Remove Duplicate Rows
The distinct() function in dplyr removes duplicate rows from a data frame. Without arguments it dedupes on all columns; with column names it dedupes on those columns only. Use .keep_all = TRUE to keep the rest of the row.
    distinct(df)                                # all columns must match
    distinct(df, cyl)                           # unique cyl values, returns 1 col
    distinct(df, cyl, .keep_all = TRUE)         # unique by cyl, keep first row
    distinct(df, cyl, gear)                     # unique combinations, returns 2 cols
    distinct(df, cyl, gear, .keep_all = TRUE)   # unique by cyl+gear, keep all cols
    n_distinct(df$cyl)                          # count unique values
    df |> count(cyl) |> filter(n > 1)           # find duplicates
Need explanation? Read on for examples and pitfalls.
What distinct() does in one sentence
distinct() returns rows where the specified columns form a unique combination. Without arguments, it considers ALL columns. With column names, only those columns determine uniqueness. By default, it returns ONLY the deduplicating columns; use .keep_all = TRUE to keep every column of the first matching row.
Unlike base R unique(), distinct() integrates into pipelines, supports .keep_all semantics, and works on grouped data frames. It is also often faster than unique() on large data frames, although the size of the gap depends on the data.
Syntax
distinct() takes a data frame, optional columns to dedup on, and the .keep_all flag. With no columns named, it dedups on all columns; with columns named, it dedups on those only.
The full signature:
distinct(.data, ..., .keep_all = FALSE)
... is zero or more columns to dedup on. .keep_all = FALSE (default) returns only the dedup columns; TRUE returns the first occurrence of each unique combination with all columns.
.keep_all = FALSE returns only the dedup columns. distinct(mtcars, cyl) returns a 3-row, 1-column data frame (the unique cyl values). To keep the full row of the first occurrence, set .keep_all = TRUE.
Six common patterns
1. Dedup on all columns
The two identical (1, "x") rows collapse to one. Rows must match on EVERY column to be considered duplicates.
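A minimal sketch of this pattern, using a hypothetical toy data frame (the column names id and label are assumptions made for illustration):

```r
library(dplyr)

# Hypothetical toy data: rows 1 and 2 are identical on every column,
# row 3 shares only the id with them
df <- tibble(
  id    = c(1, 1, 1, 2),
  label = c("x", "x", "y", "y")
)

# No columns named, so a row counts as a duplicate only if ALL columns match
deduped <- distinct(df)
deduped
```

The second (1, "x") row is dropped, while (1, "y") survives because its label differs.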
2. Unique values in one column
Only the dedup column is returned. Result is a 1-column data frame, not a vector.
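As a sketch, the built-in mtcars data set shows the difference between the one-column result and a plain vector. Using pull() to extract the vector is one option among several:

```r
library(dplyr)

# One-column data frame holding the unique cyl values
cyl_values <- distinct(mtcars, cyl)
is.data.frame(cyl_values)   # TRUE
dim(cyl_values)             # 3 rows, 1 column

# If a plain vector is needed, pull the column out afterwards
cyl_vector <- pull(cyl_values, cyl)
cyl_vector
```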
3. Unique combinations of multiple columns
The result has one row per UNIQUE COMBINATION of the listed columns.
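A short sketch with mtcars, where the combinations of cyl and gear stand in for any pair of key columns:

```r
library(dplyr)

# One row per (cyl, gear) pair that actually occurs in the data
combos <- distinct(mtcars, cyl, gear)
names(combos)   # only the two dedup columns are returned
nrow(combos)    # number of distinct pairs present
```

Only pairs that occur in the data appear, so the row count can be lower than the product of the unique values in each column.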
4. Keep first occurrence with all columns (.keep_all = TRUE)
The FIRST occurrence of each unique cyl value is kept, with all other columns intact.
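A minimal sketch on mtcars, assuming "first" simply means the first row in the current row order:

```r
library(dplyr)

# One row per cyl value, with every other column kept from the first match
first_per_cyl <- distinct(mtcars, cyl, .keep_all = TRUE)

ncol(first_per_cyl)   # all 11 mtcars columns survive
first_per_cyl[, c("cyl", "mpg", "hp")]
```

Because the kept row depends on row order, sorting with arrange() before distinct() changes which row survives.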
5. Count duplicates instead of removing them
Sometimes you want to KNOW which rows are duplicated before deciding what to do. count() plus filter(n > 1) lists every duplicate group.
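The count-then-filter idea can be sketched on hypothetical data (the columns id and value are made up for illustration):

```r
library(dplyr)

# Hypothetical data: (1, "a") appears twice, (3, "c") three times
df <- tibble(
  id    = c(1, 1, 2, 3, 3, 3),
  value = c("a", "a", "b", "c", "c", "c")
)

# count() adds a column n with the size of each group;
# filter(n > 1) keeps only groups that occur more than once
dup_groups <- df |>
  count(id, value) |>
  filter(n > 1)
dup_groups
```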
6. Distinct within groups
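A plain distinct() call on value is global, not per group. In the example, (A, 1) and (B, 3) are each kept once (first occurrence), and the duplicate (A, 1) is dropped. To deduplicate within groups instead, include the grouping column in the call or apply group_by() first.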
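The contrast between global and per-group deduplication can be sketched as follows. The data frame and its column names group and value are hypothetical, chosen so that the value 1 appears in both groups:

```r
library(dplyr)

# Hypothetical data: value 1 occurs in group A (twice) and in group B
df <- tibble(
  group = c("A", "A", "B", "B"),
  value = c(1, 1, 3, 1)
)

# Global: one row per value, whichever group it was first seen in
global <- distinct(df, value, .keep_all = TRUE)

# Within groups: the grouping column becomes part of the uniqueness key
within <- df |>
  group_by(group) |>
  distinct(value) |>
  ungroup()

nrow(global)   # 2 rows: (A, 1) and (B, 3)
nrow(within)   # 3 rows: (A, 1), (B, 3), (B, 1)
```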
distinct(df, x) and distinct(df, x, .keep_all = TRUE) look similar but return DIFFERENT widths. Without .keep_all, you get only column x. With .keep_all, you get every column. Pick based on what the next step in your pipeline needs.
distinct() vs base R unique()
Base R unique() works on vectors, matrices, and data frames; distinct() is data-frame specific but adds .keep_all and fits naturally into pipelines.
| Task | dplyr | Base R |
|---|---|---|
| Dedup full data frame | distinct(df) | unique(df) |
| Unique values, one column | distinct(df, x) | unique(df$x) (returns vector) |
| Unique by column, keep all | distinct(df, x, .keep_all=TRUE) | (multi-step: df[!duplicated(df$x), ]) |
| Multi-col unique combinations | distinct(df, x, y) | unique(df[, c("x","y")]) |
| Pipeline-friendly | Yes | Awkward |
When to use which:
- Use distinct() inside any dplyr pipeline.
- Use unique() for vectors and outside the tidyverse.
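A small side-by-side sketch of two rows from the comparison, run on a hypothetical data frame with columns x and y:

```r
library(dplyr)

# Hypothetical toy data
df <- tibble(x = c("a", "a", "b"), y = c(1, 2, 2))

# Unique values of one column: data frame vs. vector
x_frame  <- distinct(df, x)   # one-column tibble
x_vector <- unique(df$x)      # character vector

# Unique by x while keeping the other columns of the first occurrence
dplyr_rows <- distinct(df, x, .keep_all = TRUE)
base_rows  <- df[!duplicated(df$x), ]
```

Both approaches keep the same two rows here; the dplyr form states the intent in one call.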
Common pitfalls
Pitfall 1: forgetting .keep_all = TRUE. distinct(mtcars, cyl) returns a 3-row, 1-column data frame. If you wanted "one row per cyl with all columns", you need .keep_all = TRUE. New users frequently miss this and end up with surprisingly skinny output.
Pitfall 2: distinct does not sort. The result keeps rows in their FIRST-OCCURRENCE order, not alphabetical or numeric order. To sort the result, chain arrange() after.
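A quick sketch of the ordering behavior on mtcars:

```r
library(dplyr)

# Result follows first-occurrence order in the data
unsorted <- distinct(mtcars, cyl)
unsorted$cyl   # 6 4 8

# Chain arrange() when a sorted result is needed
sorted <- mtcars |>
  distinct(cyl) |>
  arrange(cyl)
sorted$cyl     # 4 6 8
```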
distinct() treats NA values as identical. Two rows with x = NA are considered duplicates and only one is kept. This matches unique() behavior. If you want NA rows treated as distinct, you need a workaround (e.g., set the NA rows aside with filter(is.na(x)), dedup the rest, then add them back with bind_rows()).
Pitfall 3: distinct on a grouped frame may behave unexpectedly. group_by(df, g) |> distinct(x) deduplicates WITHIN each group and keeps g in the output. The result is unique x values per group, not globally unique. distinct() has no .by argument, so if you need per-group uniqueness, naming the grouping column directly, as in distinct(df, g, x), is usually clearer.
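Both behaviors can be checked on a small hypothetical data frame (columns g and x are made up for illustration):

```r
library(dplyr)

# Hypothetical data: x = 1 occurs in both groups, NA occurs twice in group b
df <- tibble(
  g = c("a", "a", "b", "b", "b"),
  x = c(1, 1, 1, NA, NA)
)

# NA values are treated as equal to each other: the result is 1 and NA
global_x <- distinct(df, x)

# On a grouped frame, the grouping column joins the uniqueness key
per_group <- df |>
  group_by(g) |>
  distinct(x) |>
  ungroup()

# Ungrouped call that names the grouping column directly
explicit <- distinct(df, g, x)

nrow(global_x)    # 2
nrow(per_group)   # 3
nrow(explicit)    # 3
```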
Try it yourself
Try it: From mtcars, find the unique combinations of cyl and am (transmission). Save to ex_combos.
Explanation: distinct(cyl, am) returns one row per unique combination of cyl and am. There are 6 combinations (4 cyl manual, 4 cyl auto, 6 cyl manual, 6 cyl auto, 8 cyl manual, 8 cyl auto). The order matches first-occurrence in the data.
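One possible solution, written as a short pipeline:

```r
library(dplyr)

# Unique (cyl, am) pairs from mtcars, in first-occurrence order
ex_combos <- mtcars |>
  distinct(cyl, am)

ex_combos
nrow(ex_combos)   # 6
```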
Related dplyr functions
After mastering distinct(), look at:
- n_distinct(): count unique values without returning them
- count(): count rows per group, useful for finding duplicates
- slice(): explicit row selection by position; slice(1, .by = g) keeps first per group
- slice_max() / slice_min(): keep top per group by some value (better than distinct for "latest version" cases)
- Base R unique(), duplicated(): vector-level dedup helpers
For "keep the latest record per id", reach for slice_max(df, version, n = 1, by = id) instead of distinct, since you usually want the highest version, not just the first.
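To make the difference concrete, here is a sketch on a hypothetical records table (the columns id, version, and payload are assumptions). It relies on the by argument of slice_max(), which assumes dplyr 1.1.0 or later:

```r
library(dplyr)

# Hypothetical data: several versions per id, not stored in version order
records <- tibble(
  id      = c(1, 1, 2, 2, 2),
  version = c(1, 2, 1, 3, 2),
  payload = c("a", "b", "c", "d", "e")
)

# distinct() keeps whichever row happens to come first for each id
first_seen <- distinct(records, id, .keep_all = TRUE)
first_seen$version   # 1 1

# slice_max() keeps the row with the highest version for each id
latest <- slice_max(records, version, n = 1, by = id)
latest$version       # 2 3
```

If two rows tie for the highest version, slice_max() returns both by default; pass with_ties = FALSE to keep exactly one row per id.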
FAQ
How do I remove duplicate rows in dplyr?
distinct(df) removes rows that are duplicates across ALL columns. distinct(df, x) deduplicates by column x only and returns just that column. distinct(df, x, .keep_all = TRUE) deduplicates by x but keeps every column from the first occurrence.
What is the difference between distinct and unique in R?
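unique() is base R and works on vectors, matrices, and data frames. distinct() is dplyr; it is data-frame specific but supports .keep_all, integrates with pipelines, and is often faster on large data. distinct() does not accept a bare vector, so for a vector use unique(); it returns the same unique values that distinct() returns for a one-column data frame.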
How do I count distinct values in dplyr?
Use n_distinct(x) inside summarise(): summarise(df, n = n_distinct(x)) returns the count of unique values. For per-group counts: summarise(df, n = n_distinct(x), .by = g).
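A minimal sketch of both forms on hypothetical data. The .by argument of summarise() assumes dplyr 1.1.0 or later; on older versions group_by() does the same job:

```r
library(dplyr)

# Hypothetical data: group a has one unique x, group b has two
df <- tibble(
  g = c("a", "a", "b", "b", "b"),
  x = c(1, 1, 2, 3, 3)
)

# Count of unique x values over the whole data frame
overall <- summarise(df, n = n_distinct(x))
overall$n     # 3

# Count of unique x values within each group
per_group <- summarise(df, n = n_distinct(x), .by = g)
per_group$n   # 1 2
```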
What does .keep_all do in distinct?
.keep_all = TRUE keeps EVERY column of the first matching row, not just the dedup columns. .keep_all = FALSE (default) returns only the columns named in the dedup arguments. The flag matters when you dedup by some columns but want to retain others.
How do I find rows that are duplicated (not remove them)?
df |> count(x, y) |> filter(n > 1) lists every (x, y) combination that appears more than once. To get the actual duplicate ROWS: df |> group_by(x, y) |> filter(n() > 1) or df[duplicated(df[, c("x","y")]) | duplicated(df[, c("x","y")], fromLast = TRUE), ] in base R.
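The three approaches can be compared on a hypothetical data frame in which only the (1, "a") key repeats (all column names are made up for illustration):

```r
library(dplyr)

# Hypothetical data: key (1, "a") appears twice; x = 3 repeats but with different y
df <- tibble(
  x    = c(1, 1, 2, 3, 3),
  y    = c("a", "a", "b", "c", "d"),
  note = c("first", "second", "only", "p", "q")
)

# Which key combinations repeat, and how often
dup_keys <- df |>
  count(x, y) |>
  filter(n > 1)

# The full rows behind those keys (dplyr)
dup_rows <- df |>
  group_by(x, y) |>
  filter(n() > 1) |>
  ungroup()

# The same rows in base R: flag duplicates scanning from both directions
key_cols  <- df[, c("x", "y")]
base_rows <- df[duplicated(key_cols) | duplicated(key_cols, fromLast = TRUE), ]
```

The fromLast = TRUE pass is needed because duplicated() on its own does not flag the first occurrence of each repeated key.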