dplyr join_by() in R: Modern Key Specification for Joins

The join_by() helper in dplyr 1.1 specifies join keys using formula-style syntax, supporting equality, inequality, between, and rolling joins. It generalizes the older string-based by argument.

⚡ Quick Answer
left_join(x, y, join_by(id))                   # equality (same as by="id")
left_join(x, y, join_by(id, date))             # multi-column equality
left_join(x, y, join_by(x_id == y_id))         # mapped names
left_join(x, y, join_by(date >= start_date))   # inequality
left_join(x, y, join_by(between(d, start, end))) # range match
left_join(x, y, join_by(closest(date >= ts)))   # rolling join

Need explanation? Read on for examples and pitfalls.

📊 Is join_by() the right tool?
STARTequality joins on common name(s)join_by(id) or by = "id"mapped namesjoin_by(x_id == y_id) or by = c("x_id" = "y_id")inequality / interval / range matchesjoin_by() (NEW capability)closest / rolling joinjoin_by(closest(...))old codeby = "id" still workshelper not needed for simple casesby = "id" is fine

What join_by() does in one sentence

join_by(...) returns a join specification object that the join verbs (left_join, inner_join, etc.) use to determine matching rules; it supports ==, <, <=, >, >=, between(), and closest(). Introduced in dplyr 1.1, it adds inequality and rolling joins to the dplyr toolbox.

Syntax

join_by(...). Each argument is an unquoted column name, equality, or inequality expression involving one column from each side.

Run live
Run live, no install needed. Every R block on this page runs in your browser. Click Run, edit the code, re-run instantly. No setup.
REquality and inequality together
library(dplyr) library(tibble) events <- data.frame(event = c("a","b","c"), time = c(5, 15, 25)) windows <- data.frame(window = c("morning","mid","late"), start = c(0, 10, 20), end = c(10, 20, 30)) left_join(events, windows, join_by(time >= start, time < end)) #> event time window start end #> 1 a 5 morning 0 10 #> 2 b 15 mid 10 20 #> 3 c 25 late 20 30

  
Tip
For pure equality joins, by = "id" is shorter and equally clear. Reach for join_by() when you need inequality, range, or rolling joins (capabilities the old by could not express).

Five common patterns

1. Equality (same as by)

Rjoin_by equivalent of by = 'id'
left_join(customers, orders, join_by(id))

  

2. Mapped column names

Rjoin_by with column mapping
left_join(customers, orders, join_by(id == user_id))

  

Same as by = c("id" = "user_id").

3. Inequality (event in window)

REvent time within a window
events <- data.frame(event = c("a","b"), time = c(5, 15)) windows <- data.frame(start = c(0, 10), end = c(10, 20), window = c("AM","PM")) left_join(events, windows, join_by(time >= start, time < end))

  

Each event matches the window whose start <= time < end. NOT possible with the old by.

4. Range match with between

Rtime falls between [start, end]
left_join(events, windows, join_by(between(time, start, end)))

  

between(x, low, high) is sugar for x >= low & x <= high.

5. Rolling / closest join

RMost recent quote at-or-before each trade
trades <- data.frame(trade_id = 1:3, ts = as.POSIXct(c("2024-01-01 10:00:01", "2024-01-01 10:00:05", "2024-01-01 10:00:09"))) quotes <- data.frame(quote_id = 1:4, ts = as.POSIXct(c("2024-01-01 10:00:00", "2024-01-01 10:00:03", "2024-01-01 10:00:06", "2024-01-01 10:00:10")), price = c(100, 101, 102, 103)) left_join(trades, quotes, join_by(closest(ts >= ts))) #> trade_id ts.x quote_id ts.y price #> 1 1 t1 1 q1 100 #> 2 2 t2 2 q2 101 #> 3 3 t3 3 q3 102

  

closest(ts >= ts) matches each trade with the latest quote whose ts <= the trade's ts. The classic time-series "as-of" join.

Key Insight
join_by() adds INEQUALITY and ROLLING joins to dplyr. Pre-1.1, these required workarounds (cross_join + filter, fuzzyjoin package). Now they are first-class. Inequality joins are the killer feature: "match each event to the active window" in one line.

join_by() vs by argument vs fuzzyjoin

Three ways to specify join keys in dplyr.

Approach Equality Inequality Rolling dplyr version
by = "id" Yes No No All
by = c("x" = "y") Yes (mapped) No No All
join_by(id) Yes Yes Yes 1.1+
fuzzyjoin::*_join Yes Yes Yes (separate package)

When to use which:

  • by = "id" for simple equality joins.
  • join_by(...) for inequality, range, or rolling joins.
  • fuzzyjoin package for fuzzy / regex / interval joins (predates 1.1).

A practical workflow

The "as-of" join is the killer use case for join_by(closest(...)).

RAs-of trade-quote join
trades |> left_join(quotes, join_by(closest(ts >= ts), symbol))

  

For each trade, get the latest quote at-or-before its timestamp, matched on symbol. Standard finance / time-series pattern.

For range joins:

REvent matched to its price period
events |> left_join(price_periods, join_by(between(event_date, start_date, end_date)))

  

Each event matches the price period that contains its date.

Common pitfalls

Pitfall 1: column name ambiguity in inequality. join_by(time >= start) works because time is in events and start is in windows. If both tables had time, you'd need to disambiguate.

Pitfall 2: many-to-many warnings. Inequality joins often produce many-to-many matches. Pass relationship = "many-to-many" to suppress, or "one-to-one" to error if duplicates.

Warning
join_by() requires dplyr 1.1.0 or later (released 2023). Older dplyr installations only support the string by argument. Check packageVersion("dplyr") if confused.

Why join_by matters for time-series work

Before join_by, "match each event to the most recent quote at-or-before its time" required either a self-join with filter (slow on big data) or the fuzzyjoin / data.table packages (extra dependency). join_by(closest(ts >= ts)) makes this a one-line dplyr call. Same applies to "match each transaction to its active price period" using between(). These two patterns (rolling and range) cover most time-series and event-window joins. Range joins also appear in interval-based analytics (matching events to bins, calls to billing periods). Once you internalize join_by, dplyr replaces a lot of clunky multi-step workarounds.

Try it yourself

Try it: Match each mtcars car to its mpg "tier" using interval comparison. Tier = "low" if mpg < 18, "mid" if 18 <= mpg < 25, "high" otherwise. Save to ex_tier.

RYour turn: bin mpg into tiers via join_by
tiers <- data.frame( tier = c("low","mid","high"), lo = c(-Inf, 18, 25), hi = c(18, 25, Inf) ) mtcars_n <- mtcars |> tibble::rownames_to_column("car") ex_tier <- mtcars_n |> # your code here head(ex_tier[, c("car","mpg","tier")]) #> Expected: each car gets its tier label

  
Click to reveal solution
RSolution
ex_tier <- mtcars_n |> left_join(tiers, join_by(mpg >= lo, mpg < hi)) head(ex_tier[, c("car","mpg","tier")]) #> car mpg tier #> 1 Mazda RX4 21.0 mid #> 2 Mazda RX4 Wag 21.0 mid #> 3 Datsun 710 22.8 mid #> 4 Hornet 4 Drive 21.4 mid #> 5 Hornet Sportabout 18.7 mid #> 6 Valiant 18.1 mid

  

Explanation: join_by(mpg >= lo, mpg < hi) matches each car to the tier whose interval contains its mpg.

After mastering join_by, look at:

  • left_join() / inner_join() etc.: use join_by inside
  • between(): range comparison helper
  • closest(): rolling-join helper (within join_by)
  • data.table::merge(roll = TRUE): rolling joins in data.table
  • fuzzyjoin package: fuzzy / interval / regex joins

For very large rolling joins, data.table's roll = TRUE is faster than dplyr 1.1's join_by(closest(...)).

FAQ

What does join_by do in dplyr?

join_by(...) returns a join specification used by left_join / inner_join / etc. Supports equality, inequality, between, and closest expressions.

What is the difference between join_by and the by argument?

by = "id" is the older string-based form, supports only equality. join_by(id) is the new (1.1+) form, supports equality plus inequality, range, and rolling joins.

How do I do an inequality join in dplyr?

Use join_by: left_join(events, windows, join_by(time >= start, time < end)). Each event matches windows whose start <= time < end.

What is a rolling join in dplyr?

A rolling (or "as-of") join matches each row to the closest record in the other table by time or numeric distance. Express with join_by(closest(ts >= ts)).

Do I need dplyr 1.1 for join_by?

Yes. join_by was introduced in dplyr 1.1.0 (Jan 2023). Earlier versions only support the string by argument.