dplyr join_by() in R: Modern Key Specification for Joins
The join_by() helper in dplyr 1.1 specifies join keys using formula-style syntax, supporting equality, inequality, between, and rolling joins. It generalizes the older string-based by argument.
left_join(x, y, join_by(id)) # equality (same as by="id") left_join(x, y, join_by(id, date)) # multi-column equality left_join(x, y, join_by(x_id == y_id)) # mapped names left_join(x, y, join_by(date >= start_date)) # inequality left_join(x, y, join_by(between(d, start, end))) # range match left_join(x, y, join_by(closest(date >= ts))) # rolling join
Need explanation? Read on for examples and pitfalls.
What join_by() does in one sentence
join_by(...) returns a join specification object that the join verbs (left_join, inner_join, etc.) use to determine matching rules; it supports ==, <, <=, >, >=, between(), and closest(). Introduced in dplyr 1.1, it adds inequality and rolling joins to the dplyr toolbox.
Syntax
join_by(...). Each argument is an unquoted column name, equality, or inequality expression involving one column from each side.
by = "id" is shorter and equally clear. Reach for join_by() when you need inequality, range, or rolling joins (capabilities the old by could not express).Five common patterns
1. Equality (same as by)
2. Mapped column names
Same as by = c("id" = "user_id").
3. Inequality (event in window)
Each event matches the window whose start <= time < end. NOT possible with the old by.
4. Range match with between
between(x, low, high) is sugar for x >= low & x <= high.
5. Rolling / closest join
closest(ts >= ts) matches each trade with the latest quote whose ts <= the trade's ts. The classic time-series "as-of" join.
join_by() adds INEQUALITY and ROLLING joins to dplyr. Pre-1.1, these required workarounds (cross_join + filter, fuzzyjoin package). Now they are first-class. Inequality joins are the killer feature: "match each event to the active window" in one line.join_by() vs by argument vs fuzzyjoin
Three ways to specify join keys in dplyr.
| Approach | Equality | Inequality | Rolling | dplyr version |
|---|---|---|---|---|
by = "id" |
Yes | No | No | All |
by = c("x" = "y") |
Yes (mapped) | No | No | All |
join_by(id) |
Yes | Yes | Yes | 1.1+ |
fuzzyjoin::*_join |
Yes | Yes | Yes | (separate package) |
When to use which:
by = "id"for simple equality joins.join_by(...)for inequality, range, or rolling joins.fuzzyjoinpackage for fuzzy / regex / interval joins (predates 1.1).
A practical workflow
The "as-of" join is the killer use case for join_by(closest(...)).
For each trade, get the latest quote at-or-before its timestamp, matched on symbol. Standard finance / time-series pattern.
For range joins:
Each event matches the price period that contains its date.
Common pitfalls
Pitfall 1: column name ambiguity in inequality. join_by(time >= start) works because time is in events and start is in windows. If both tables had time, you'd need to disambiguate.
Pitfall 2: many-to-many warnings. Inequality joins often produce many-to-many matches. Pass relationship = "many-to-many" to suppress, or "one-to-one" to error if duplicates.
join_by() requires dplyr 1.1.0 or later (released 2023). Older dplyr installations only support the string by argument. Check packageVersion("dplyr") if confused.Why join_by matters for time-series work
Before join_by, "match each event to the most recent quote at-or-before its time" required either a self-join with filter (slow on big data) or the fuzzyjoin / data.table packages (extra dependency). join_by(closest(ts >= ts)) makes this a one-line dplyr call. Same applies to "match each transaction to its active price period" using between(). These two patterns (rolling and range) cover most time-series and event-window joins. Range joins also appear in interval-based analytics (matching events to bins, calls to billing periods). Once you internalize join_by, dplyr replaces a lot of clunky multi-step workarounds.
Try it yourself
Try it: Match each mtcars car to its mpg "tier" using interval comparison. Tier = "low" if mpg < 18, "mid" if 18 <= mpg < 25, "high" otherwise. Save to ex_tier.
Click to reveal solution
Explanation: join_by(mpg >= lo, mpg < hi) matches each car to the tier whose interval contains its mpg.
Related dplyr / tidyr functions
After mastering join_by, look at:
left_join()/inner_join()etc.: use join_by insidebetween(): range comparison helperclosest(): rolling-join helper (within join_by)data.table::merge(roll = TRUE): rolling joins in data.tablefuzzyjoinpackage: fuzzy / interval / regex joins
For very large rolling joins, data.table's roll = TRUE is faster than dplyr 1.1's join_by(closest(...)).
FAQ
What does join_by do in dplyr?
join_by(...) returns a join specification used by left_join / inner_join / etc. Supports equality, inequality, between, and closest expressions.
What is the difference between join_by and the by argument?
by = "id" is the older string-based form, supports only equality. join_by(id) is the new (1.1+) form, supports equality plus inequality, range, and rolling joins.
How do I do an inequality join in dplyr?
Use join_by: left_join(events, windows, join_by(time >= start, time < end)). Each event matches windows whose start <= time < end.
What is a rolling join in dplyr?
A rolling (or "as-of") join matches each row to the closest record in the other table by time or numeric distance. Express with join_by(closest(ts >= ts)).
Do I need dplyr 1.1 for join_by?
Yes. join_by was introduced in dplyr 1.1.0 (Jan 2023). Earlier versions only support the string by argument.