janitor get_dupes() in R: Find Duplicate Rows With Counts

The get_dupes() function in janitor returns every duplicate row in a data frame along with a dupe_count column showing how many times each combination appears. Unlike distinct(), which keeps one of each, get_dupes() surfaces ALL copies so you can inspect what is duplicated and why.

By Selva Prabhakaran · Published May 22, 2026 · Last updated May 22, 2026

⚡ Quick Answer

get_dupes(df)                                  # duplicates across all columns
get_dupes(df, customer_id)                     # duplicates by one column
get_dupes(df, customer_id, order_date)         # duplicates by composite key
df |> janitor::get_dupes(email)                # pipe-friendly
get_dupes(df, starts_with("name"))             # tidyselect helpers
get_dupes(df, -id, -timestamp)                 # all columns except these
nrow(get_dupes(df, email))                     # how many duplicate rows exist

Need explanation? Read on for examples and pitfalls.

What get_dupes() does in one sentence

get_dupes() filters a data frame down to rows whose chosen columns appear more than once, then adds a dupe_count column so you can see how many copies of each combination exist. It always returns ALL copies of every duplicate, not just the extra ones.

Reach for it when a key that is supposed to be unique might not be. Audit a customer table, check a join key before merging, or spot rows loaded twice.

Syntax

get_dupes() takes a data frame and an optional list of columns; if no columns are passed, it uses all of them. The result is sorted by the key columns, so duplicates of the same key sit next to each other.

R: Load janitor and a sample data frame

library(janitor)
library(dplyr)

orders <- data.frame(
  customer_id = c(1, 2, 2, 3, 3, 3, 4),
  product     = c("A", "B", "B", "C", "C", "D", "E"),
  qty         = c(1, 2, 2, 5, 5, 1, 3)
)
orders
#>   customer_id product qty
#> 1           1       A   1
#> 2           2       B   2
#> 3           2       B   2
#> 4           3       C   5
#> 5           3       C   5
#> 6           3       D   1
#> 7           4       E   3

The full signature is short:

get_dupes(dat, ...)

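dat is the data frame; ... accepts bare column names, tidyselect helpers, or negated columns. The output keeps the class of the input (a base data frame stays a data frame, a tibble stays a tibble) and gains one extra column called dupe_count, placed right after the key columns.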
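A minimal sketch of the output shape, assuming janitor and dplyr are installed. The sales data frame is made up for illustration; the block checks where dupe_count lands and which class comes back for a base data frame versus a tibble.

```r
library(janitor)
library(dplyr)

# Hypothetical data: region "east" appears twice
sales <- data.frame(
  region = c("east", "east", "west"),
  amount = c(10, 20, 30)
)

dupes_df  <- get_dupes(sales, region)             # base data.frame in
dupes_tbl <- get_dupes(as_tibble(sales), region)  # tibble in

names(dupes_df)                # key column, then dupe_count, then the rest
inherits(dupes_df, "tbl_df")   # FALSE: a plain data.frame stays a data.frame
inherits(dupes_tbl, "tbl_df")  # TRUE: a tibble stays a tibble
```

If downstream code relies on tibble printing or behavior, converting with as_tibble() before the call is one way to make the class explicit.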

Tip

Pipe get_dupes() straight after every import. Running read_csv("file.csv") |> get_dupes(primary_key) immediately flags whether your supposedly unique key is actually unique. Catching duplication at the door prevents silent fan-out joins downstream that quietly double row counts and inflate every aggregation.
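The import check can be sketched as follows. This is an illustration under assumptions: the file is a temporary CSV written on the spot, order_id is a hypothetical key, and base read.csv() stands in for read_csv() so the example needs only janitor.

```r
library(janitor)

# Hypothetical file: a small CSV in which order 102 was loaded twice
path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(order_id = c(101, 102, 102), total = c(5, 9, 9)),
  path,
  row.names = FALSE
)

# Check the supposedly unique key immediately after reading
import_dupes <- read.csv(path) |> get_dupes(order_id)
nrow(import_dupes)  # 2: both copies of order 102
```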

Six common patterns

1. Find duplicates across all columns

R: Whole-row duplicates

orders |> get_dupes()
#> No variable names specified - using all columns.
#>   customer_id product qty dupe_count
#> 1           2       B   2          2
#> 2           2       B   2          2
#> 3           3       C   5          2
#> 4           3       C   5          2

With no arguments, get_dupes() compares the entire row. Rows 2 and 3 are exact copies (2, B, 2), and rows 4 and 5 are exact copies (3, C, 5). Each group's dupe_count shows the size of the cluster.

2. Duplicates by one column

R: Duplicates by single key

orders |> get_dupes(customer_id)
#>   customer_id dupe_count product qty
#> 1           2          2       B   2
#> 2           2          2       B   2
#> 3           3          3       C   5
#> 4           3          3       C   5
#> 5           3          3       D   1

Passing one column finds rows that share a value there even if other columns differ. Customer 3 has a dupe_count of 3 because that id appears on three rows: two orders of C and one of D.

3. Duplicates by composite key

R: Duplicates by multiple columns

orders |> get_dupes(customer_id, product)
#>   customer_id product dupe_count qty
#> 1           2       B          2   2
#> 2           2       B          2   2
#> 3           3       C          2   5
#> 4           3       C          2   5

Pass multiple bare column names to define a composite key. This is the most common production use: confirming that (customer_id, order_date) or (user_id, event_id) uniquely identifies a row.

4. Tidyselect helpers

R: Choose columns by pattern

people <- tibble(
  first_name = c("Ann", "Ann", "Bob", "Cara"),
  last_name  = c("Lee", "Lee", "Tan", "Park"),
  email      = c("a@x", "a@y", "b@x", "c@x")
)

# Mistake: no column in `people` starts with "name", so this selects no key columns
people |> get_dupes(starts_with("name"))

The first attempt misses because no column starts with "name"; first_name and last_name both end with it. A tidyselect helper that matches nothing selects zero columns rather than the ones you meant, so confirm the pattern against names(people). Use a helper that matches:

R: Tidyselect with a matching suffix

people |> get_dupes(ends_with("name"))
#> # A tibble: 2 x 4
#>   first_name last_name dupe_count email
#>   <chr>      <chr>          <int> <chr>
#> 1 Ann        Lee                2 a@x
#> 2 Ann        Lee                2 a@y

get_dupes() accepts any tidyselect helper: starts_with(), ends_with(), contains(), matches(), where(). This is useful when your duplicate key spans several columns that share a naming convention.

5. Negate columns to ignore

R: All columns except some

logs <- tibble(
  user_id   = c(1, 1, 2, 2),
  event     = c("click", "click", "view", "view"),
  timestamp = c(1, 2, 3, 4)
)

logs |> get_dupes(-timestamp)
#> # A tibble: 4 x 4
#>   user_id event dupe_count timestamp
#>     <dbl> <chr>      <int>     <dbl>
#> 1       1 click          2         1
#> 2       1 click          2         2
#> 3       2 view           2         3
#> 4       2 view           2         4

Negation excludes columns from the duplicate check while keeping them in the output. Common pattern: ignore timestamps or surrogate ids when looking for semantic duplicates (same user, same event, different time).

6. Use it as a test in a pipeline

R: Assert a primary key is unique

n_dupes <- nrow(get_dupes(orders, customer_id, product))

if (n_dupes > 0) {
  message("Found ", n_dupes, " duplicate (customer_id, product) rows")
} else {
  message("Key is unique")
}
#> Found 4 duplicate (customer_id, product) rows

Because get_dupes() returns a data frame, nrow() counts the violating rows directly in a script. Adding this check to every ETL step beats hunting doubled metrics later.

Key Insight

get_dupes() returns ALL copies, not just the extras. If a value appears 3 times, you get all 3 rows back, with dupe_count = 3 on each. This differs from duplicated() in base R, which returns FALSE for the first occurrence and TRUE for the rest, so it flags only the extras. The all-copies behavior is what makes get_dupes() an inspection tool rather than a filter.
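To make the difference concrete, here is a small sketch on made-up data (the visits data frame is hypothetical). It counts how many rows each approach reports for an id that appears three times.

```r
library(janitor)

# Hypothetical data: id 7 appears three times
visits <- data.frame(
  id   = c(7, 7, 7, 8),
  page = c("a", "b", "c", "a")
)

all_copies <- get_dupes(visits, id)  # every row in the duplicated group
extras     <- duplicated(visits$id)  # FALSE for the first occurrence, TRUE after

nrow(all_copies)  # 3 rows, each carrying dupe_count = 3
sum(extras)       # 2 rows flagged
```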

get_dupes() vs distinct() vs duplicated()

Three tools deal with duplicate rows; pick by what answer you need.

Task	`get_dupes()`	`distinct()`	`duplicated()`
List every duplicate row	yes, with count	no	no
Keep one row per key	no	yes	manual
Flag rows as duplicate	no	no	yes, logical vector
Show the dupe_count	yes	no	no
Returns input class	yes, plus dupe_count	yes	no, logical vector
Best for inspection	yes	no	no
Best for cleaning	no	yes	yes

When to use which:

Use get_dupes() to audit and inspect. The output answers "which rows are duplicates, and how many times?"
Use dplyr::distinct() to deduplicate. The output is the cleaned data with one row per key.
Use base R duplicated() when you need a logical flag to combine with other conditions, for example df[!duplicated(df$id) & df$valid, ].
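The three choices above can be sketched side by side on one small, made-up data frame (the column names id and valid are illustrative).

```r
library(janitor)
library(dplyr)

# Hypothetical data: id 1 is repeated, and only some rows are valid
df <- data.frame(
  id    = c(1, 1, 2, 3),
  valid = c(TRUE, TRUE, FALSE, TRUE)
)

audit   <- get_dupes(df, id)                    # inspect: both id = 1 rows
deduped <- distinct(df, id, .keep_all = TRUE)   # clean: first row per id
flagged <- df[!duplicated(df$id) & df$valid, ]  # flag: first row per id AND valid

nrow(audit)    # 2
nrow(deduped)  # 3
flagged$id     # 1 3
```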

For a side-by-side walkthrough of the deduplication option, see dplyr distinct() in R.

Note

Coming from Python pandas? The closest equivalent is df[df.duplicated(subset=['key'], keep=False)] followed by a groupby-size to add a count column. Pandas has no single function with get_dupes()'s combined filter-and-count behavior, which is why janitor packs both steps into one call.

Common pitfalls

Pitfall 1: NA values are treated as equal. Two rows with NA in the key column count as duplicates of each other. If you want NAs ignored, drop them first with tidyr::drop_na(df, key) |> get_dupes(key).
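A short sketch of the NA behavior, assuming janitor and tidyr are installed; the contacts data frame is invented for the example.

```r
library(janitor)
library(tidyr)

# Hypothetical data: two rows have a missing key
contacts <- data.frame(
  email = c(NA, NA, "a@x", "b@x"),
  name  = c("Ann", "Bob", "Cara", "Dev")
)

# The two NA rows are reported as duplicates of each other
with_na <- get_dupes(contacts, email)

# Dropping missing keys first leaves nothing to report
without_na <- contacts |> drop_na(email) |> get_dupes(email)

nrow(with_na)     # 2
nrow(without_na)  # 0
```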

Pitfall 2: forgetting that all copies are returned. A duplicate group of 5 rows produces 5 rows in the output, not the 4 extras. If you want only the extras, filter on base R duplicated(), which flags every occurrence after the first: df |> filter(duplicated(key)).
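One way to isolate only the extra copies is base R's duplicated() inside dplyr::filter(). The sketch below uses that technique on made-up data, rather than a join, because it also behaves predictably when the duplicate rows are identical in every column.

```r
library(janitor)
library(dplyr)

# Hypothetical data: key "k1" appears three times
df <- data.frame(
  key   = c("k1", "k1", "k1", "k2"),
  value = 1:4
)

all_copies  <- get_dupes(df, key)             # every copy of "k1"
extras_only <- df |> filter(duplicated(key))  # only the copies after the first

nrow(all_copies)   # 3
nrow(extras_only)  # 2
```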

Warning

Numeric precision can cause silent false negatives. Two floating-point values that look identical when printed (0.1 + 0.2 and 0.3) are NOT equal under the hood, so get_dupes() will miss them. For numeric keys, round to a fixed precision first: df |> mutate(key = round(key, 4)) |> get_dupes(key).
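The floating-point case can be reproduced in a few lines. This is a sketch on a made-up prices data frame, with 4 decimal places chosen arbitrarily as the rounding precision.

```r
library(janitor)
library(dplyr)

# 0.1 + 0.2 and 0.3 print alike at default precision but are different doubles
prices <- data.frame(
  key  = c(0.1 + 0.2, 0.3),
  item = c("a", "b")
)

raw_dupes     <- get_dupes(prices, key)  # the two keys do not match
rounded_dupes <- prices |>
  mutate(key = round(key, 4)) |>
  get_dupes(key)                         # after rounding, both rows match

nrow(raw_dupes)      # 0
nrow(rounded_dupes)  # 2
```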

Pitfall 3: empty result is normal. If no duplicates exist, get_dupes() returns a zero-row tibble plus a console message: No duplicate combinations found of: <columns>. Treat the message as a pass, not a warning.

Try it yourself

Try it: Use get_dupes() on the built-in iris data frame to find rows where the same Sepal.Length and Species combination appears more than once. Save the result to ex_dupes and count them.

R: Your turn: iris duplicates

# Try it: find duplicates by Sepal.Length and Species
ex_dupes <- iris |>
  get_dupes(
    # your code here
  )
nrow(ex_dupes)
#> Expected: > 100

R: Solution

ex_dupes <- iris |> get_dupes(Sepal.Length, Species)
nrow(ex_dupes)
#> [1] 129

Explanation: Passing the two bare column names defines a composite key. Iris has many repeated Sepal.Length values within each species, so most rows participate in some duplicate group. The 129 returned rows include EVERY copy of each duplicated combination, with dupe_count showing the group size.

Related janitor functions

After mastering get_dupes(), look at:

clean_names(): standardize messy column names before deduplication so keys actually match
remove_empty(): drop fully empty rows or columns that can otherwise show up as duplicates
tabyl(): cross-tabulate the duplicated keys to see which values cluster
dplyr::distinct(): keep one row per key once you have confirmed the duplicates are real
dplyr::count(): count unique values without filtering, complementary to get_dupes()

For a tour of the wider package, see the janitor package guide. The official reference is sfirke.github.io/janitor.

FAQ

What does janitor get_dupes() do?

get_dupes() returns every row in a data frame whose chosen columns appear more than once, with an added dupe_count column showing the size of each duplicate group. It is an inspection tool: you see all copies, not just the extras, so you can decide whether each duplicate is a data-entry mistake, a legitimate repeat event, or a join that fanned out unexpectedly. The function leaves the input unchanged and returns a new data frame of the same class.

How is get_dupes() different from distinct()?

distinct() deduplicates: it keeps one row per unique combination and discards the rest, leaving you with clean data. get_dupes() does the opposite: it surfaces the duplicates so you can examine them. Use get_dupes() to audit and understand the problem, then use distinct() to actually remove duplicates once you have decided which version of each row to keep.

Does get_dupes() treat NA values as duplicates?

Yes. Two rows whose key columns are both NA are treated as a duplicate pair and appear in the output with dupe_count = 2. If you do not want NAs to count, filter them out before calling: df |> tidyr::drop_na(key_column) |> get_dupes(key_column). This matches how SQL GROUP BY typically treats NULLs for the purpose of duplicate detection.

Can I use get_dupes() to test a primary key in a script?

Yes, and it is a common pattern. Wrap the call in nrow(): stopifnot(nrow(get_dupes(df, id)) == 0). The assertion passes silently when the key is unique and throws a hard error the moment a duplicate appears, which is exactly what you want in a data pipeline that assumes one row per id.
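A minimal sketch of that assertion on two made-up data frames. The tryCatch() wrapper is there only so the failing case can be shown without stopping the script; in a real pipeline you would let the error propagate.

```r
library(janitor)

clean <- data.frame(id = 1:3)         # hypothetical table with a unique key
dirty <- data.frame(id = c(1, 1, 2))  # hypothetical table with a repeated key

# Passes silently; suppressMessages() hides the "no duplicates" note
stopifnot(nrow(suppressMessages(get_dupes(clean, id))) == 0)

# Fails: the error is caught here only to display the outcome
outcome <- tryCatch(
  {
    stopifnot(nrow(get_dupes(dirty, id)) == 0)
    "passed"
  },
  error = function(e) "failed: duplicate ids"
)
outcome
```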

Why does my numeric column miss duplicates that look identical?

Floating-point values that print the same can differ in their underlying bits, so get_dupes() treats them as distinct. To avoid this, round to a fixed precision before the duplicate check: df |> mutate(amount = round(amount, 2)) |> get_dupes(amount). For string or integer columns the issue does not apply; comparisons there are exact.
