dplyr cross_join() in R: Cartesian Product of Two Tables
The cross_join() function in dplyr returns the Cartesian product of two data frames, pairing every row of x with every row of y. Result rows = nrow(x) * nrow(y).
cross_join(x, y)                              # all combinations
cross_join(x, y, suffix = c(".x", ".y"))      # disambiguate column names
tidyr::expand_grid(...)                       # similar; lighter input
dplyr::full_join(x, y, by = character())      # equivalent in older dplyr
nrow(cross_join(x, y))                        # = nrow(x) * nrow(y)
Need explanation? Read on for examples and pitfalls.
What cross_join() does in one sentence
cross_join(x, y) returns a data frame containing every combination of rows from x and y; the result has nrow(x) * nrow(y) rows. No key is used; no rows are dropped.
This is the SQL CROSS JOIN. Useful for generating all-pair combinations, full grids, or pairwise comparison sets.
Syntax
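cross_join(x, y, ..., copy = FALSE, suffix = c(".x", ".y")). There is no by argument; every row of x pairs with every row of y. suffix is only applied to column names that appear in both tables.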
Use cross_join to generate all combinations for grid expansion or pairwise comparison. For simple vector combinations, tidyr::expand_grid() is more direct.
Five common patterns
1. Generate all combinations
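A minimal sketch of this pattern, assuming two small illustrative tables (the names sizes and colors are made up for this example):

```r
library(dplyr)

# Hypothetical inputs: 3 sizes and 2 colors
sizes  <- tibble(size = c("S", "M", "L"))
colors <- tibble(color = c("red", "blue"))

# Every size is paired with every color
combos <- cross_join(sizes, colors)
combos
nrow(combos)
```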
3 * 2 = 6 rows.
2. Pairwise distance setup
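One way to set this up, sketched with a hypothetical cities table on a flat grid (the coordinates and the "_from" / "_to" suffixes are illustrative choices, not part of dplyr):

```r
library(dplyr)

# Hypothetical city coordinates on a flat grid
cities <- tibble(
  city = c("A", "B", "C"),
  x = c(0, 3, 6),
  y = c(0, 4, 8)
)

pairs <- cities |>
  # Self cross join: all 9 ordered pairs, columns suffixed to tell the sides apart
  cross_join(cities, suffix = c("_from", "_to")) |>
  # Keep each unordered pair once and drop self-pairs
  filter(city_from < city_to) |>
  # Euclidean distance between the two cities in each pair
  mutate(dist = sqrt((x_from - x_to)^2 + (y_from - y_to)^2))

pairs
```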
cross_join then filter for non-redundant pairs is the standard pattern for "all unique pairs".
3. Combine with computation
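A short sketch, assuming hypothetical products and rates tables: build the grid with cross_join, then compute a derived column with mutate.

```r
library(dplyr)

# Hypothetical inputs: base prices and regional tax rates
products <- tibble(product = c("widget", "gadget"), price = c(10, 20))
rates    <- tibble(region = c("north", "south", "west"),
                   tax = c(0.05, 0.08, 0.10))

priced <- products |>
  cross_join(rates) |>                        # 2 * 3 = 6 product/region rows
  mutate(price_with_tax = price * (1 + tax))  # computed per combination

priced
```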
4. Suffix to disambiguate
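A sketch with two hypothetical tables that share an id column. The "_left" / "_right" suffixes are arbitrary choices for illustration; only the colliding column gets a suffix.

```r
library(dplyr)

# Hypothetical tables that both contain an `id` column
left  <- tibble(id = 1:2, value = c("a", "b"))
right <- tibble(id = 1:3, score = c(10, 20, 30))

# Default suffixes are applied to the colliding column only
default_names <- names(cross_join(left, right))

# Custom suffixes make the origin of each column explicit
joined <- cross_join(left, right, suffix = c("_left", "_right"))
names(joined)
```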
5. Older dplyr equivalent
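A sketch comparing the two forms on hypothetical tables x and y. dplyr 1.1.0 soft-deprecated the by = character() idiom, so newer versions may warn; the warning is suppressed here only to keep the output clean.

```r
library(dplyr)

# Hypothetical inputs
x <- tibble(a = 1:2)
y <- tibble(b = c("u", "v", "w"))

# Pre-1.1 idiom: an empty `by` requests a cross join
old_style <- suppressWarnings(full_join(x, y, by = character()))

# dplyr >= 1.1 idiom
new_style <- cross_join(x, y)

# Both should contain the same 2 * 3 = 6 combinations
all.equal(old_style, new_style)
```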
cross_join (added in dplyr 1.1) is the modern, explicit form.
For large inputs, consider tidyr::expand_grid() (memory-efficient for vector inputs) or generating combinations on the fly.
cross_join() vs expand_grid() vs full_join() vs combn()
Four ways to generate combinations in R.
| Function | Input | Output | Best for |
|---|---|---|---|
| cross_join(x, y) | Two data frames | All-pair data frame | Two existing tables |
| tidyr::expand_grid() | Vectors / lists | All-pair tibble | Variadic vector input |
| full_join(by = character()) | Two data frames | All-pair data frame | Pre-1.1 dplyr |
| utils::combn(x, m) | Vector | Combinations | Choose m of n |
When to use which:
- cross_join for two existing data frames.
- expand_grid when starting from vectors.
- combn for "choose m" combinations.
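The three choices can be compared side by side. This is a sketch on made-up inputs; note that combn() returns a matrix with one column per combination rather than a data frame.

```r
library(dplyr)
library(tidyr)

# Starting from vectors: expand_grid builds the grid directly
from_vectors <- expand_grid(n = 1:3, letter = c("a", "b"))

# Starting from two existing data frames: cross_join
from_tables <- cross_join(tibble(n = 1:3), tibble(letter = c("a", "b")))

# "Choose 2 of 3": combn gives unordered selections, one per column
choose2 <- combn(c("a", "b", "c"), 2)

nrow(from_vectors)
nrow(from_tables)
dim(choose2)
```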
A practical workflow
Use cross_join for "what-if" grid analysis.
3 products * 4 volumes * 3 discounts = 36 scenarios. Compute total_cost for each scenario; the resulting grid is useful for sensitivity analysis.
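The scenario grid described above could be sketched as follows. All table names, column names, prices, and the total_cost formula are illustrative assumptions.

```r
library(dplyr)

# Hypothetical scenario inputs
products  <- tibble(product = c("basic", "plus", "pro"),
                    unit_price = c(5, 8, 12))
volumes   <- tibble(volume = c(100, 500, 1000, 5000))
discounts <- tibble(discount = c(0, 0.05, 0.10))

scenarios <- products |>
  cross_join(volumes) |>     # 3 * 4 = 12 rows
  cross_join(discounts) |>   # 12 * 3 = 36 rows
  mutate(total_cost = unit_price * volume * (1 - discount))

nrow(scenarios)
```

Chaining cross_join calls like this is also how more than two tables are combined; the row count multiplies at every step.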
Common pitfalls
Pitfall 1: row explosion. cross_join of two 10k tables = 100M rows. Always check sizes first.
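One way to enforce the size check is a small guard function. safe_cross_join and its max_rows default are hypothetical, not part of dplyr; this is a sketch of the idea rather than a recommended threshold.

```r
library(dplyr)

# Hypothetical helper: refuse to cross join when the result would be too large
safe_cross_join <- function(x, y, max_rows = 1e6) {
  # as.numeric() avoids integer overflow for very large tables
  expected <- as.numeric(nrow(x)) * nrow(y)
  if (expected > max_rows) {
    stop(
      "cross_join would produce ",
      format(expected, big.mark = ",", scientific = FALSE),
      " rows; the limit is ",
      format(max_rows, big.mark = ",", scientific = FALSE)
    )
  }
  cross_join(x, y)
}

# Small inputs pass through unchanged
small <- safe_cross_join(tibble(a = 1:10), tibble(b = 1:10))
nrow(small)
```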
Pitfall 2: column name conflicts. Both tables having a column named id produces id.x and id.y in the result; rename or pass suffix to customize.
cross_join has NO by argument by design. It is for "no key, all combinations" semantics. If you need a key-based join, use left_join / inner_join instead.
When to use the cross_join + filter pattern
The cross_join + filter pattern handles "all valid pairs" computations elegantly but at a high memory cost. For pairwise distance, comparison, or compatibility checks, the natural expression is cross_join followed by filter(valid). The cost is that the intermediate Cartesian table is huge: for n = 1,000 cities, cross_join produces 1M rows before filter reduces it. This is fine for n in the hundreds but problematic for n in the tens of thousands. Alternatives include nested loops with early termination, or the proxy and vegan packages for specialized distance computation. As a rule, if the result of cross_join would exceed a few million rows, design a smarter algorithm.
Try it yourself
Try it: Generate all combinations of 3 cylinder counts and 2 transmission types. Save to ex_grid.
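One possible solution, assuming the cylinder counts and transmission codes follow the mtcars convention (cyl of 4, 6, 8 and am of 0, 1):

```r
library(dplyr)

# Assumed values, mirroring the cyl and am columns of mtcars
cyls <- tibble(cyl = c(4, 6, 8))
ams  <- tibble(am = c(0, 1))

ex_grid <- cross_join(cyls, ams)
ex_grid
nrow(ex_grid)
```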
Explanation: Every cyl pairs with every am: 3 * 2 = 6 rows.
Related dplyr / tidyr functions
After mastering cross_join, look at:
- tidyr::expand_grid(): vector-input alternative
- tidyr::expand(): complete grid from existing data
- tidyr::complete(): fill missing combinations
- full_join(): match by key, keeping unmatched rows from both tables
- combn() / expand.grid(): base R alternatives
- tidyr::crossing(): like expand_grid(), but de-duplicates and sorts its inputs
For "fill missing combinations in existing data", tidyr::complete() is the right tool.
FAQ
What does cross_join do in dplyr?
cross_join(x, y) returns the Cartesian product: every row of x paired with every row of y. The result has nrow(x) * nrow(y) rows.
What is the difference between cross_join and expand_grid?
cross_join takes two data frames. expand_grid takes vectors or lists (more flexible) and returns a tibble of all combinations.
How do I do a cross join in older dplyr?
full_join(x, y, by = character()) worked before cross_join was added. cross_join was introduced in dplyr 1.1.
Why is my cross_join so slow / large?
Because the result grows as nrow(x) * nrow(y). Two 1,000-row tables produce 1,000,000 rows. Always check sizes before running.
Can cross_join take more than 2 tables?
Not directly, but you can chain: df1 |> cross_join(df2) |> cross_join(df3). Be careful with row count explosion.