data.table dcast() in R: Reshape Long Data to Wide Format
The dcast() function from data.table reshapes a long data.table into wide format by spreading one column's values across new columns named after the levels of another column, with optional aggregation when duplicate id-by-variable combinations exist.
dcast(dt, id ~ variable) # default reshape dcast(dt, id ~ variable, value.var = "value") # specify value column dcast(dt, id ~ variable, fun.aggregate = sum) # aggregate duplicates dcast(dt, id ~ variable, fill = 0) # fill missing combos dcast(dt, id ~ variable, value.var = c("v1", "v2")) # multiple value cols dcast(dt, id + grp ~ variable) # multiple LHS ids dcast(dt, id ~ year + qtr, value.var = "x") # cross multiple RHS dcast(dt, id ~ ., value.var = "x", fun.aggregate = mean) # collapse to one col
Need explanation? Read on for examples and pitfalls.
What dcast() does in one sentence
dcast() spreads a long table into a wide one using a formula. You hand it a long data.table and a formula of the form id_cols ~ category_col, and it returns a wide data.table where each unique value of the category column becomes its own column, populated by the values from value.var.
The reason data.table dcast exists alongside base reshape() is speed and ergonomics. The implementation is written in C and pre-allocates the output, and it accepts an aggregation function inline so a pivot and a summarise step run in a single call.
Syntax
dcast() takes a long data table, a formula, and optional value, aggregator, and fill arguments. Only the data and formula are required; everything else has sensible defaults.
The full signature is dcast(data, formula, fun.aggregate = NULL, ..., margins = NULL, subset = NULL, fill = NULL, drop = TRUE, value.var = guess(data), verbose = getOption("datatable.verbose")). Its arguments are:
data: the longdata.tableto reshape.formula: an expression likeid_cols ~ category_cols. Left side becomes row keys; right side names the column spread into new columns. Use+to combine columns on either side.value.var: the column whose values fill the new wide cells. Pass a character vector to spread several at once.fun.aggregate: a function that collapses duplicate id-by-category rows into one value. Defaults tolengthwith a warning if omitted when duplicates exist.fill: value written into missing id-by-category cells. Defaults toNA.drop: whenTRUE, unused factor levels in the formula are dropped; setFALSEto keep empty columns.
value.var explicitly when more than one non-id column exists. dcast() will guess, and the guess is usually right, but spelling it out makes the call self-documenting and removes the friendly "Aggregate function missing, defaulting to 'length'" warning that hides real aggregation bugs.Reshaping long data: four common patterns
1. Basic pivot with one id and one category
Specifying id ~ category is the minimal pivot. The unique values in the category column become the new column names; the value column fills the cells.
Six long rows collapse into three wide rows. Every unique child value becomes a column; missing combinations show up as NA.
2. Aggregate duplicates with fun.aggregate
When an id-by-category combination appears more than once, dcast() needs an aggregator. Pass any single-value-returning function to fun.aggregate: sum, mean, median, length, or a user-defined function.
Store S1 has two January rows that sum to 150; store S2 has two February rows that sum to 180. Without fun.aggregate, dcast() warns and uses length, which returns counts rather than totals.
3. Spread multiple value columns in one pass
value.var accepts a character vector to spread several measures at once. The result has one set of wide columns per measure, with names prefixed by the measure name.
Both name and gender spread by child, producing four wide columns. This is the inverse of a melt() call with patterns(), in one pre-allocated pass.
4. Fill missing combinations with fill = value
By default, absent id-by-category combinations become NA. Pass fill to substitute a different value, which is handy for counts and incidence tables where zero is the correct answer.
Warehouse B has no gadget row, so without fill = 0 that cell would be NA. For counts, presence-absence matrices, or contingency tables, fill = 0 is usually the right answer.
dcast() formula with fun.aggregate is a pivot table in one line. It collapses duplicates, spreads categories into columns, and fills gaps in a single C pass, where a tidyverse equivalent needs group_by() plus summarise() plus pivot_wider().dcast() vs base reshape() and tidyr pivot_wider()
All three reshape long to wide but differ in speed, syntax, and feature coverage. dcast() is fastest and the only one with built-in aggregation; pivot_wider() reads cleanest; base reshape() needs no extra package.
| Feature | data.table dcast() |
base reshape() |
tidyr pivot_wider() |
|---|---|---|---|
| Speed on large data | Fastest (written in C) | Slow | Moderate |
| Inline aggregation | fun.aggregate = sum |
Not supported | values_fn = sum |
| Multiple value columns | value.var = c("a", "b") |
Awkward | values_from = c(a, b) |
| Fill missing combinations | fill = 0 |
Not supported | values_fill = 0 |
| Return type | data.table |
data.frame |
tibble |
| Available without extras | needs data.table |
base R | needs tidyr |
Pick dcast() when the input is already a data.table, when aggregation belongs in the same step, or when the data is large. Reach for pivot_wider() in a tidyverse pipeline, and use base reshape() only when extra packages are unavailable.
dcast(dt, id ~ var, value.var = "x") is df.pivot(index = "id", columns = "var", values = "x"). To aggregate duplicates, use df.pivot_table(..., aggfunc = "sum").Common pitfalls
Pitfall 1: forgetting fun.aggregate when duplicates exist. If the same id-by-category combination appears more than once and no aggregator is given, dcast() defaults to length and prints a warning. The result is a count, not the value you expected.
The a cell for id 1 shows 2 because two rows match, not 10 or 20. Pass fun.aggregate = sum (or mean, first) for the expected behaviour.
Pitfall 2: value.var is guessed and the guess can be wrong. When several non-id columns exist, dcast() picks the first non-formula column and warns. If you intended a different column, the wide table is silently filled with the wrong values.
Pitfall 3: factor levels in formula columns are dropped by default. drop = TRUE removes columns for unused factor levels. For a full grid (e.g. a monthly report covering months with no data), pass drop = FALSE.
fun.aggregate. Treat the warning as an error during data validation.Try it yourself
Try it: Pivot the long table scores_long below into wide format. Use student as the row id, spread by subject, and fill empty cells with zero. Save the result to ex_wide.
Click to reveal solution
Explanation: student ~ subject lays out one row per student and one column per subject; value.var = "score" tells dcast() which column to spread; fill = 0 writes zero into the cells where a student did not take a subject, so the result is a clean three-by-four wide table.
Related data.table functions
These functions pair naturally with dcast() when reshaping, aggregating, or combining tables:
melt(): the inverse ofdcast(); collapses many columns into one variable and one value column.merge(): joins two tables on a shared key, often used to attach lookups before or after a cast.rbindlist(): stacks a list of tables by row, useful when several long tables need to be cast together.setnames(): renames columns in place, handy for cleaning the auto-generated names after a multi-measure cast.setkey(): sets a sort key on adata.table, which speeds up the merge that often follows a cast.
FAQ
What is the difference between dcast and melt in data.table?
dcast() reshapes long to wide by spreading one column's values across new columns; melt() does the reverse, stacking many measure columns into a variable and value pair. They are inverses: dcast(melt(dt, id.vars = "id"), id ~ variable) returns the original wide table. Use dcast() to get one row per observation with one column per measure.
How do you aggregate while reshaping with data.table dcast?
Pass a function to fun.aggregate. For example, dcast(sales, store ~ month, value.var = "revenue", fun.aggregate = sum) sums duplicate store-by-month rows during the pivot. Any single-value-returning function works: mean, median, min, max, or a user-defined wrapper like function(x) sum(x, na.rm = TRUE). This collapses group-by and pivot into one call.
How do you spread multiple value columns at once with dcast?
Pass a character vector to value.var. For example, dcast(records, family ~ child, value.var = c("name", "gender")) returns one set of wide columns per measure, with the measure name prefixed to the category value (e.g. name_c1, gender_c1). It runs in a single pre-allocated pass.
Is data.table dcast faster than tidyr pivot_wider?
Yes, on large data. dcast() is written in C and pre-allocates the result, so on millions of rows it typically runs several times faster than pivot_wider(). On small tables the difference is negligible. Use dcast() when speed matters or the input is already a data.table; reach for pivot_wider() in a tidyverse pipeline.
For the official argument reference, see the data.table melt and dcast vignette.