janitor get_dupes() in R: Find Duplicate Rows With Counts
The get_dupes() function in janitor returns every duplicate row in a data frame along with a dupe_count column showing how many times each combination appears. Unlike distinct(), which keeps one of each, get_dupes() surfaces ALL copies so you can inspect what is duplicated and why.
    get_dupes(df)                              # duplicates across all columns
    get_dupes(df, customer_id)                 # duplicates by one column
    get_dupes(df, customer_id, order_date)     # duplicates by composite key
    df |> janitor::get_dupes(email)            # pipe-friendly
    get_dupes(df, starts_with("name"))         # tidyselect helpers
    get_dupes(df, -id, -timestamp)             # all columns except these
    nrow(get_dupes(df, email))                 # how many duplicate rows exist
Need explanation? Read on for examples and pitfalls.
What get_dupes() does in one sentence
get_dupes() filters a data frame down to rows whose chosen columns appear more than once, then adds a dupe_count column so you can see how many copies of each combination exist. It always returns ALL copies of every duplicate, not just the extra ones.
Reach for it when a key that is supposed to be unique might not be. Audit a customer table, check a join key before merging, or spot rows loaded twice.
Syntax
get_dupes() takes a data frame and an optional list of columns; if no columns are passed, it uses all of them. The result is a data frame (a tibble if the input was one) sorted by the key columns so duplicates of the same key sit next to each other.
The full signature is short:
get_dupes(dat, ...)
dat is the data frame; ... accepts bare column names, tidyselect helpers, or negated columns. The output puts the key columns first, then an extra column called dupe_count, then the remaining columns.
Tip: run get_dupes() straight after every import. Running read_csv("file.csv") |> get_dupes(primary_key) immediately flags whether your supposedly unique key is actually unique. Catching duplication at the door prevents silent fan-out joins downstream that quietly double row counts and inflate every aggregation.
Six common patterns
1. Find duplicates across all columns
With no arguments, get_dupes() compares the entire row. Rows 2 and 3 are exact copies (2, B, 2), and rows 4 and 5 are exact copies (3, C, 5). Each group's dupe_count shows the size of the cluster.
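A minimal sketch of the whole-row check. The data frame below is a stand-in: the column names (customer_id, product, qty) and the final row are assumptions chosen to match the rows described above.

```r
library(janitor)

# Hypothetical toy data: rows 2-3 and rows 4-5 are exact copies
orders <- data.frame(
  customer_id = c(1, 2, 2, 3, 3, 3),
  product     = c("A", "B", "B", "C", "C", "D"),
  qty         = c(1, 2, 2, 5, 5, 7)
)

# No columns named, so every column takes part in the comparison
all_dupes <- get_dupes(orders)

# 4 rows come back: both copies of (2, B, 2) and both copies of (3, C, 5),
# each with dupe_count = 2
all_dupes
```

The call also prints a message noting that no variable names were specified and all columns were used.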
2. Duplicates by one column
Passing one column finds rows that share a value there even if other columns differ. Customer 3 appears 3 times because the same id ordered both C and D.
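As a sketch, assuming the same hypothetical orders data (column names are illustrative, not from the original article):

```r
library(janitor)

# Hypothetical toy data: customer 3 ordered product C twice and product D once
orders <- data.frame(
  customer_id = c(1, 2, 2, 3, 3, 3),
  product     = c("A", "B", "B", "C", "C", "D"),
  qty         = c(1, 2, 2, 5, 5, 7)
)

# Only customer_id defines the key; product and qty may differ within a group
by_customer <- get_dupes(orders, customer_id)

# 5 rows: customer 2 (dupe_count = 2) and customer 3 (dupe_count = 3)
by_customer
```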
3. Duplicates by composite key
Pass multiple bare column names to define a composite key. This is the most common production use: confirming that (customer_id, order_date) or (user_id, event_id) uniquely identifies a row.
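A short sketch of a composite-key check. The data and the amount column are made up for illustration; only the (customer_id, order_date) key comes from the text above.

```r
library(janitor)

# Hypothetical orders: customer 1 has two orders on different dates (fine),
# customer 2 has two orders on the same date (a key violation)
orders <- data.frame(
  customer_id = c(1, 1, 2, 2, 3),
  order_date  = as.Date(c("2024-01-05", "2024-01-06",
                          "2024-02-01", "2024-02-01", "2024-03-10")),
  amount      = c(10, 12, 30, 30.5, 8)
)

# Both columns together form the key
key_dupes <- get_dupes(orders, customer_id, order_date)

# 2 rows: the two customer-2 orders dated 2024-02-01
key_dupes
```

Customer 1 does not appear in the result because the repeated id is paired with different dates.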
4. Tidyselect helpers
A first attempt such as get_dupes(df, starts_with("name")) fails when no columns START with "name", so use a prefix that really exists in your data.
get_dupes() accepts any tidyselect helper: starts_with(), ends_with(), contains(), matches(), where(). Useful when your duplicate-key spans several columns sharing a naming convention.
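A minimal sketch of a prefix-based key, assuming hypothetical columns name_first and name_last. dplyr is loaded here only so that starts_with() is attached in the session.

```r
library(janitor)
library(dplyr)  # attaches starts_with() and the other tidyselect helpers

# Hypothetical people table; the name_ prefix is an assumed naming convention
people <- data.frame(
  id         = 1:4,
  name_first = c("Ann", "Ann", "Bob", "Cy"),
  name_last  = c("Lee", "Lee", "Ray", "Ray"),
  city       = c("Oslo", "Rome", "Lima", "Lima")
)

# The key is every column whose name starts with "name_"
name_dupes <- get_dupes(people, starts_with("name_"))

# 2 rows: both "Ann Lee" records, even though id and city differ
name_dupes
```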
5. Negate columns to ignore
Negation excludes columns from the duplicate check while keeping them in the output. Common pattern: ignore timestamps or surrogate ids when looking for semantic duplicates (same user, same event, different time).
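As a sketch of the negation pattern, assuming a hypothetical events table with a surrogate id and a timestamp:

```r
library(janitor)

# Hypothetical event log: rows 1 and 2 are the same user and event,
# recorded at different times under different surrogate ids
events <- data.frame(
  id        = 1:4,
  user      = c("u1", "u1", "u2", "u3"),
  event     = c("login", "login", "login", "purchase"),
  timestamp = as.POSIXct(c("2024-05-01 09:00:00", "2024-05-01 09:00:03",
                           "2024-05-01 10:15:00", "2024-05-02 14:30:00"),
                         tz = "UTC")
)

# Compare on everything except id and timestamp, i.e. on (user, event)
semantic_dupes <- get_dupes(events, -id, -timestamp)

# 2 rows: both u1 logins; id and timestamp are still present in the output
semantic_dupes
```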
6. Use it as a test in a pipeline
Because get_dupes() returns a tibble, you can use nrow() to count violations in a script. Adding this check to every ETL step beats hunting doubled metrics later.
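One way such a check might look in a script. The clean and dirty data frames are invented for illustration, and the wording of the message is an arbitrary choice.

```r
library(janitor)

# Hypothetical inputs: one with a unique id, one with a repeated id
clean <- data.frame(id = 1:3, value = c("a", "b", "c"))
dirty <- data.frame(id = c(1, 2, 2), value = c("a", "b", "c"))

# Passes silently (apart from get_dupes()'s "no duplicates" message)
stopifnot(nrow(get_dupes(clean, id)) == 0)

# Count violations instead of stopping, e.g. to log them
n_bad <- nrow(get_dupes(dirty, id))
if (n_bad > 0) {
  message(n_bad, " rows violate the one-row-per-id assumption")
}
```

Whether to stop or merely log is a design choice: a hard stop suits pipelines where a duplicated key would corrupt later joins, while a logged count suits exploratory runs.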
get_dupes() returns ALL copies, not just the extras. If a value appears 3 times, you get all 3 rows back, with dupe_count = 3 on each. This is the opposite of duplicated() in base R, which returns FALSE for the first occurrence and TRUE for the rest. The all-copies behavior is what makes get_dupes() an inspection tool rather than a filter.
get_dupes() vs distinct() vs duplicated()
Three tools deal with duplicate rows; pick by what answer you need.
| Task | get_dupes() | distinct() | duplicated() |
|---|---|---|---|
| List every duplicate row | yes, with count | no | no |
| Keep one row per key | no | yes | manual |
| Flag rows as duplicate | no | no | yes, logical vector |
| Show the dupe_count | yes | no | no |
| Return type | data frame with dupe_count | data frame, same class as input | logical vector |
| Best for inspection | yes | no | no |
| Best for cleaning | no | yes | yes |
When to use which:
- Use get_dupes() to audit and inspect. The output answers "which rows are duplicates, and how many times?"
- Use dplyr::distinct() to deduplicate. The output is the cleaned data with one row per key.
- Use base R duplicated() when you need a logical flag to combine with other conditions, for example df[!duplicated(df$id) & df$valid, ].
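The three tools can be compared side by side on one small input. This is a sketch on made-up data, not an example from the janitor documentation.

```r
library(janitor)
library(dplyr)

# Hypothetical data: id 2 appears twice, id 3 appears three times
df <- data.frame(
  id    = c(1, 2, 2, 3, 3, 3),
  value = c("a", "b", "c", "d", "e", "f")
)

# Inspect: every copy of every duplicated id, plus dupe_count (5 rows)
inspected <- get_dupes(df, id)

# Clean: first row per id, other columns kept (3 rows)
deduped <- distinct(df, id, .keep_all = TRUE)

# Flag: FALSE for the first occurrence, TRUE for later ones
flags <- duplicated(df$id)
flags
# FALSE FALSE TRUE FALSE TRUE TRUE
```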
For a side-by-side walkthrough of the deduplication option, see dplyr distinct() in R.
In Python's pandas, the closest equivalent is df[df.duplicated(subset=['key'], keep=False)] followed by a groupby-size to add a count column. Pandas has no single function with get_dupes()'s combined filter-and-count behavior, which is why janitor packs both steps into one call.
Common pitfalls
Pitfall 1: NA values are treated as equal. Two rows with NA in the key column count as duplicates of each other. If you want NAs ignored, drop them first with tidyr::drop_na(df, key) |> get_dupes(key).
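A small sketch of the NA behavior, using an invented key column:

```r
library(janitor)

# Hypothetical data with two missing keys
df <- data.frame(key = c("a", NA, NA, "b"), x = 1:4)

# The two NA rows are reported as duplicates of each other
with_na <- get_dupes(df, key)
with_na

# Dropping NAs first leaves no duplicated keys (zero rows plus a message)
without_na <- df |> tidyr::drop_na(key) |> get_dupes(key)
```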
Pitfall 2: forgetting that all copies are returned. A duplicate group of 5 rows produces 5 rows in the output, not 4 (extras only). If you want only the extras, meaning every copy after the first, use base R instead: df[duplicated(df$key), ].
Floating-point keys can hide duplicates. Values that print the same (such as 0.1 + 0.2 and 0.3) are NOT equal under the hood, so get_dupes() will miss them. For numeric keys, round to a fixed precision first: df |> mutate(key = round(key, 4)) |> get_dupes(key).
Pitfall 3: empty result is normal. If no duplicates exist, get_dupes() returns a zero-row data frame plus a console message: No duplicate combinations found of: <columns>. Treat the message as a pass, not a warning.
Try it yourself
Try it: Use get_dupes() on the built-in iris data frame to find rows where the same Sepal.Length and Species combination appears more than once. Save the result to ex_dupes and count them.
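One possible solution, sketched with the built-in iris data; the object name ex_dupes comes from the exercise.

```r
library(janitor)

# Composite key: Sepal.Length within Species
ex_dupes <- get_dupes(iris, Sepal.Length, Species)

# Number of rows that belong to some duplicate group
nrow(ex_dupes)

# Key columns come first, then dupe_count, then the remaining columns
head(ex_dupes)
```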
Explanation: Passing the two bare column names defines a composite key. Iris has many repeated Sepal.Length values within each species, so most rows participate in some duplicate group. The returned rows include EVERY copy of each duplicated combination, with dupe_count showing the group size; nrow(ex_dupes) gives the total.
Related janitor functions
After mastering get_dupes(), look at:
- clean_names(): standardize messy column names before deduplication so keys actually match
- remove_empty(): drop fully empty rows or columns that can otherwise show up as duplicates
- tabyl(): cross-tabulate the duplicated keys to see which values cluster
- dplyr::distinct(): keep one row per key once you have confirmed the duplicates are real
- dplyr::count(): count unique values without filtering, complementary to get_dupes()
For a tour of the wider package, see the janitor package guide. The official reference is sfirke.github.io/janitor.
FAQ
What does janitor get_dupes() do?
get_dupes() returns every row in a data frame whose chosen columns appear more than once, with an added dupe_count column showing the size of each duplicate group. It is an inspection tool: you see all copies, not just the extras, so you can decide whether each duplicate is a data-entry mistake, a legitimate repeat event, or a join that fanned out unexpectedly. The function leaves the input unchanged and returns a new tibble.
How is get_dupes() different from distinct()?
distinct() deduplicates: it keeps one row per unique combination and discards the rest, leaving you with clean data. get_dupes() does the opposite: it surfaces the duplicates so you can examine them. Use get_dupes() to audit and understand the problem, then use distinct() to actually remove duplicates once you have decided which version of each row to keep.
Does get_dupes() treat NA values as duplicates?
Yes. Two rows whose key columns are both NA are treated as a duplicate pair and appear in the output with dupe_count = 2. If you do not want NAs to count, filter them out before calling: df |> tidyr::drop_na(key_column) |> get_dupes(key_column). This matches how SQL GROUP BY typically treats NULLs for the purpose of duplicate detection.
Can I use get_dupes() to test a primary key in a script?
Yes, and it is a common pattern. Wrap the call in nrow(): stopifnot(nrow(get_dupes(df, id)) == 0). The assertion passes silently when the key is unique and throws a hard error the moment a duplicate appears, which is exactly what you want in a data pipeline that assumes one row per id.
Why does my numeric column miss duplicates that look identical?
Floating-point values that print the same can differ in their underlying bits, so get_dupes() treats them as distinct. To avoid this, round to a fixed precision before the duplicate check: df |> mutate(amount = round(amount, 2)) |> get_dupes(amount). For string or integer columns the issue does not apply; comparisons there are exact.
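A brief sketch of the floating-point case, using an invented amount column:

```r
library(janitor)
library(dplyr)

# Hypothetical amounts: 0.1 + 0.2 prints as 0.3 but is stored differently
df <- data.frame(amount = c(0.1 + 0.2, 0.3, 1.5))

# Exact comparison: the two "0.3" values are treated as distinct (zero rows)
raw <- get_dupes(df, amount)

# After rounding to 2 decimals they collapse to the same value (2 rows)
rounded <- df |>
  mutate(amount = round(amount, 2)) |>
  get_dupes(amount)
rounded
```

The rounding precision is a judgment call: it should be coarse enough to absorb representation noise but fine enough not to merge values that are meant to differ.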