data.table merge() in R: Join Two Tables by Key
The data.table merge() function joins two data.tables on one or more shared key columns, supporting inner, left, right and full outer joins through the all.x, all.y and all arguments.
merge(x, y, by = "id") # inner join on id merge(x, y, by = "id", all.x = TRUE) # left join merge(x, y, by = "id", all.y = TRUE) # right join merge(x, y, by = "id", all = TRUE) # full outer join merge(x, y, by.x = "id", by.y = "code") # keys named differently merge(x, y) # join on shared key columns x[y, on = "id"] # data.table-native join
Need explanation? Read on for examples and pitfalls.
What merge() does in one sentence
merge() combines two data.tables into one by matching rows on shared key columns. You pass two data.tables, name the column or columns they have in common, and get back a single data.table where each row pairs a left-side record with its matching right-side record.
By default merge() keeps only rows that match in both tables, which is an inner join. The all.x, all.y and all arguments widen that to left, right and full outer joins. The method is merge.data.table, a data.table-aware version of base R's merge(), so it accepts data.table inputs and returns a data.table.
Syntax
merge() for data.tables mirrors the base R signature with join-type switches. The full call exposes every option you need for column matching and join type.
The arguments you reach for most often are:
x,y: the two data.tables to join.xis the left table,yis the right.by: a character vector of column names present in both tables. If omitted,merge()uses the shared key columns, or the shared column names if neither table is keyed.by.x,by.y: use these when the join columns have different names in each table.all,all.x,all.y: control the join type. All default toFALSE, giving an inner join.sort: ifTRUE(default), the result is sorted by the join columns.suffixes: the tags appended to non-join columns that share a name.
Examples by use case
Start with two small data.tables that share a dept_id column. One holds employees, the other holds department names.
An inner join keeps only rows that match in both tables. Department 30 has no name and department 40 has no employee, so both drop out.
A left join keeps every row of x and fills unmatched right columns with NA. Set all.x = TRUE to keep Dan even though department 30 is missing from dept.
Right and full outer joins keep unmatched rows from the other side. all.y = TRUE keeps every department; all = TRUE keeps every row from both tables.
Use by.x and by.y when the key columns have different names. Here the lookup table calls the key code instead of dept_id.
merge(x, y, by = "id", all.x = TRUE) is x.merge(y, on="id", how="left"). The all.x / all.y / all switches map to pandas how="left" / "right" / "outer".merge() vs the x[y] join
data.table offers a second join style with bracket syntax, x[y, on = ...]. It does the same matching work but reads as a subset and returns columns in x-first order.
The two approaches differ in defaults and intent:
| Aspect | merge(x, y, by =) |
x[y, on =] |
|---|---|---|
| Default join | Inner | Right (every row of y) |
| Result sorted | Yes, by key | No, follows y order |
| Update by reference | No | Yes, with := |
| Reads like | A SQL join | A keyed subset |
The decision rule is short. Use merge() when you want a SQL-style join with explicit all.x / all.y control, or when porting code from base R or dplyr. Use x[y, on = ] when you want speed inside a data.table pipeline or need to update columns in place during the join.
merge() exists so base R and dplyr users have a drop-in path, while x[y, on = ] unlocks data.table-only features like := updates and rolling joins.Common pitfalls
A many-to-many match silently multiplies rows. When a key value repeats on both sides, merge() returns every pairing, so two left rows and two right rows become four.
anyDuplicated(dt, by = "k") on each table. If a key is meant to be unique but is not, the join inflates row counts and quietly corrupts every downstream aggregate. A duplicated key on both sides is the most common cause of a result that is suddenly far larger than expected.Shared non-key columns get .x and .y suffixes. If both tables carry a column with the same name that is not a join key, merge() keeps both and renames them rather than dropping one.
The result is sorted by the key, not left in input order. merge() sorts by the join columns by default. Pass sort = FALSE if you need to preserve the original row order of x.
setkey(emp, dept_id) and setkey(dept, dept_id) lets merge() skip the sort step and run a pure binary join. For a one-off merge it makes little difference, but inside a loop the saved sorts add up fast.Try it yourself
Try it: Join emp to dept so that every employee is kept even when the department is unknown. Save the result to ex_join.
Click to reveal solution
Explanation: all.x = TRUE makes a left join, so all four employees survive. Dan's department (30) is absent from dept, so his dept value is filled with NA instead of dropping the row.
Related data.table functions
merge() works alongside the rest of data.table's combining and joining toolkit. Explore these next:
rbindlist(): stack many data.tables that share columns into one table.setkey(): sort and tag key columns so joins skip the sort step.foverlaps(): join two tables on overlapping numeric or date ranges.setDT(): convert a data.frame or list to a data.table by reference before joining.[.data.table: thex[y, on = ]bracket join, the native alternative tomerge().
See the official merge.data.table reference for the complete argument list.
FAQ
How do you merge two data.tables in R?
Load the data.table package, then call merge(x, y, by = "key") with the name of the column the two tables share. By default this is an inner join, returning only rows whose key value appears in both tables. The result is a new data.table sorted by the key column. Add all.x, all.y or all to switch to a left, right or full outer join.
What is the difference between merge() and the x[y] join in data.table?
Both run the same fast binary join. merge() defaults to an inner join, sorts the result by the key, and reads like a SQL join, which suits code ported from base R or dplyr. The x[y, on = ] bracket form defaults to a right join, keeps y's row order, and can update columns in place with :=. Use the bracket form inside a data.table pipeline.
How do I do a left join in data.table?
Pass all.x = TRUE to merge(): merge(x, y, by = "id", all.x = TRUE). Every row of x is kept, and columns from y that have no match are filled with NA. This matches a SQL LEFT JOIN. The bracket equivalent is y[x, on = "id"], which keeps every row of the table written inside the brackets.
Does data.table merge() need keys to be set?
No. merge() works on unkeyed data.tables; you just name the join columns with by. Setting a key with setkey() is an optimization, not a requirement. It lets merge() skip an internal sort, which speeds up repeated joins on large tables. For a single join on modest data the difference is negligible.
Why does merge() create .x and .y columns?
When both tables contain a column with the same name that is not a join key, merge() cannot keep one name for two columns. It appends the suffixes tags, .x for the left table and .y for the right, so both columns survive. Rename or drop the unwanted one afterward, or change the defaults with the suffixes argument.