data.table Exercises in R: 50 Real Practice Problems
Fifty practice problems on data.table: the i/j/by syntax, in-place modification with :=, joins, special symbols, and performance idioms. Real scenarios with hidden solutions.
Section 1. i, j, by basics (8 problems)
Exercise 1.1: Filter rows in i
Scenario: Return rows where mpg > 25.
Difficulty: Beginner
Click to reveal solution
Explanation: First argument inside [ is i (row filter). Column names work directly without quotes or $.
Exercise 1.2: Select columns in j
Scenario: Return car name and mpg.
Difficulty: Beginner
Click to reveal solution
Explanation: .() is shorthand for list(). Returns a data.table. To get a vector, use mt[, mpg]. To select with a string vector use mt[, c("car","mpg"), with = FALSE] or mt[, .SD, .SDcols = c("car","mpg")].
Exercise 1.3: Filter and select together
Scenario: Return car name and mpg for 4-cyl cars.
Difficulty: Beginner
Click to reveal solution
Explanation: i and j combine: filter then select.
Exercise 1.4: Group with by
Scenario: Mean mpg per cyl group.
Difficulty: Beginner
Click to reveal solution
Explanation: Third argument is by. Result has one row per group.
Exercise 1.5: Multiple stats per group
Scenario: Per cyl: count, mean mpg, sd mpg.
Difficulty: Intermediate
Click to reveal solution
Explanation: .N is the per-group row count. Multiple expressions in .() give multiple output columns.
Exercise 1.6: Group by multiple keys
Scenario: Mean mpg per (cyl, gear).
Difficulty: Intermediate
Click to reveal solution
Explanation: by = .(...) accepts multiple group columns. Same syntax as j.
Exercise 1.7: Order by multiple columns
Scenario: Sort cars by cyl ascending, mpg descending.
Difficulty: Intermediate
Click to reveal solution
Explanation: order() inside i sorts. Negate for descending. data.table's order is fast (uses radix sort).
Exercise 1.8: Chained operations
Scenario: Filter cyl == 6, group by gear, return mean mpg, sort descending.
Difficulty: Intermediate
Click to reveal solution
Explanation: Chaining: append another [...] for additional steps. Avoids intermediate assignments.
Section 2. Modify in place with := (8 problems)
Exercise 2.1: Add a new column
Scenario: Add kpl column = mpg * 0.425.
Difficulty: Beginner
Click to reveal solution
Explanation: := modifies in place, no copy. The dt is changed even though we didn't reassign. Use copy() to avoid side effects on the original.
Exercise 2.2: Update a subset of rows
Scenario: Set mpg = 100 for cyl == 4 rows.
Difficulty: Intermediate
Click to reveal solution
Explanation: i + := updates only matching rows.
Exercise 2.3: Add multiple columns at once
Scenario: Add columns disp_l (disp / 61.024) and weight_kg (wt * 453.592).
Difficulty: Intermediate
Click to reveal solution
Explanation: Backtick syntax :=() lets you assign multiple columns in one call. Equivalent: dt[, c("disp_l","weight_kg") := list(disp / 61.024, wt * 453.592)].
Exercise 2.4: Conditional update with case-like logic
Scenario: Add tier column: "small" if cyl=4, "mid" if cyl=6, "large" if cyl=8.
Difficulty: Intermediate
Click to reveal solution
Explanation: fcase() is data.table's case_when equivalent. Faster than nested ifelse.
Exercise 2.5: Delete a column
Scenario: Remove the vs column.
Difficulty: Beginner
Click to reveal solution
Explanation: Assigning NULL via := deletes the column in place.
Exercise 2.6: Per-group transform
Scenario: Add mpg_z = z-score of mpg WITHIN each cyl group.
Difficulty: Advanced
Click to reveal solution
Explanation: := combined with by computes per-group AND fills back into the original rows. Equivalent to dplyr's group_by + mutate.
Exercise 2.7: Replace NA with column mean
Scenario: A copy of airquality. Replace NA in Ozone with the column mean.
Difficulty: Intermediate
Click to reveal solution
Explanation: Filter rows in i, assign in j. Use aq$Ozone inside mean to compute on the FULL column (not just the filtered subset).
Exercise 2.8: Update by reference (no copy)
Scenario: Demonstrate that := does NOT copy by checking address before/after.
Difficulty: Advanced
Click to reveal solution
Explanation: data.table's killer feature: := updates in place, avoiding the copy that df$col <- ... causes. Critical for working with large tables.
Section 3. Joins (8 problems)
Exercise 3.1: Inner join with X[Y, on=]
Scenario: Two data.tables: products and orders. Join on product_id.
Difficulty: Beginner
Click to reveal solution
Explanation: X[Y, on=] syntax. nomatch = 0 makes it inner. Default (nomatch = NA) is right-outer when looking from Y's side.
Exercise 3.2: Left join
Scenario: All orders, augmented with product info if available.
Difficulty: Intermediate
Click to reveal solution
Explanation: products[orders, on=...] keeps all rows of orders (the right side becomes the driver). Behaves like left-join orders -> products.
Exercise 3.3: Anti-join
Scenario: Find orders with no matching product.
Difficulty: Intermediate
Click to reveal solution
Explanation: ! before the right side performs anti-join. Returns rows of x not in y.
Exercise 3.4: Multi-column join key
Scenario: Join on (region, product).
Difficulty: Intermediate
Click to reveal solution
Explanation: Vector in on= for multiple keys. Both data.tables must have those column names.
Exercise 3.5: Update by join
Scenario: Add price column to sales by lookup, in place.
Difficulty: Advanced
Click to reveal solution
Explanation: Update by reference using a join. i.price refers to the price column from the joined table (the i side). ASIA gets NA because no match. Powerful for "lookup-and-augment" with no copy.
Exercise 3.6: Rolling join
Scenario: Match each transaction time to the most-recent price snapshot at or before it.
Difficulty: Advanced
Click to reveal solution
Explanation: roll = TRUE is a "rolling forward" join: when no exact match, use the most-recent prior row. roll = -Inf rolls backward; roll = N rolls within N units only.
Exercise 3.7: Non-equi join
Scenario: Match each transaction amount to a tier whose min_amount <= amount.
Difficulty: Advanced
Click to reveal solution
Explanation: on = .(col1 <= col2) does an inequality join. mult = "last" picks the highest matching threshold; default (all) returns all matches.
Exercise 3.8: Cross-join (cartesian)
Scenario: Build all combinations of region (3 values) and product (4 values).
Difficulty: Intermediate
Click to reveal solution
Explanation: CJ() (cross-join) returns all combinations as a data.table. Useful for completing missing combos before a left-join.
Section 4. Special symbols and functions (8 problems)
Exercise 4.1: .N, group size
Scenario: Return the count of cars per cyl using .N.
Difficulty: Beginner
Click to reveal solution
Explanation: .N is the row count. Inside by-context: per-group count. Without by: total rows.
Exercise 4.2: .SD, subset of data
Scenario: Compute mean of all numeric columns per cyl group.
Difficulty: Intermediate
Click to reveal solution
Explanation: .SD ("Subset of Data") is each group's data frame in the j-context. lapply applies a function across columns. .SDcols restricts to selected columns.
Exercise 4.3: .SDcols with patterns
Scenario: Compute mean of every column starting with "Sepal" per Species in iris.
Difficulty: Intermediate
Click to reveal solution
Explanation: patterns() accepts regex; selects matching columns for .SDcols.
Exercise 4.4: .I, row indices
Scenario: Get row indices of all 4-cyl cars.
Difficulty: Intermediate
Click to reveal solution
Explanation: .I is the integer vector of row positions. With logical condition, returns indices where TRUE.
Exercise 4.5: First/last row per group
Scenario: First row per cyl using head.
Difficulty: Intermediate
Click to reveal solution
Explanation: head(.SD, 1) takes the first row of each group's subset. Tail for last.
Exercise 4.6: Top N per group
Scenario: Top 2 highest-mpg cars per cyl.
Difficulty: Advanced
Click to reveal solution
Explanation: Sort first, then head per group. Or: mt[, .SD[order(-mpg)][1:2], by = cyl] does it in one step.
Exercise 4.7: .GRP, group counter
Scenario: Add a column group_id numbering each cyl group sequentially.
Difficulty: Advanced
Click to reveal solution
Explanation: .GRP is the group counter (1, 2, 3, ...). Useful to label groups numerically without dealing with factor levels.
Exercise 4.8: Group-wise rolling computation with .SD
Scenario: Per cyl, compute lag of mpg.
Difficulty: Advanced
Click to reveal solution
Explanation: data.table::shift returns a lagged or leading vector. With by, computes per group. Equivalent to dplyr's lag.
Section 5. Performance patterns (8 problems)
Exercise 5.1: Set a key for fast lookup
Scenario: Set cyl as a key on mt and verify with key().
Difficulty: Intermediate
Click to reveal solution
Explanation: setkey sorts the table by the key columns AND records the key. Subsequent dt["4"] is a binary search (much faster than a scan).
Exercise 5.2: Multi-column key
Scenario: Set keys on (cyl, gear).
Difficulty: Intermediate
Click to reveal solution
Explanation: Multiple key columns enable fast lookup on prefixes. dt[.(4, 3)] would binary-search to cyl=4 AND gear=3.
Exercise 5.3: Set indexes for ad-hoc queries
Scenario: Add an index on am without resorting the table.
Difficulty: Advanced
Click to reveal solution
Explanation: Indexes are like keys but don't reorder rows. Cheaper to maintain; subsequent filters on am use the index automatically.
Exercise 5.4: fread for fast import
Scenario: A CSV with 100K rows. Read it with fread; observe speed advantage.
Difficulty: Intermediate
Click to reveal solution
Explanation: fread is the fastest CSV reader in R. Auto-detects separator, types, header. Returns a data.table directly.
Exercise 5.5: fwrite for fast export
Scenario: Write mt to a CSV using fwrite.
Difficulty: Beginner
Click to reveal solution
Explanation: fwrite is parallelized, multi-threaded write. Faster than write.csv on large files.
Exercise 5.6: Avoid implicit copies
Scenario: Demonstrate the difference between dt[, x := x*2] (in place) and dt$x <- dt$x * 2 (copies).
Difficulty: Advanced
Click to reveal solution
Explanation: := updates in place; $<- shallow-copies. For large tables, the difference is dramatic.
Exercise 5.7: Group-by-reference loop pattern
Scenario: For each cyl, compute mean mpg and ASSIGN it back as a column on the original (so each row gets its group mean).
Difficulty: Advanced
Click to reveal solution
Explanation: := + by computes per-group AND broadcasts back to all rows of that group. dplyr equivalent: group_by + mutate.
Exercise 5.8: Memory-efficient column ops
Scenario: Multiply 5 numeric columns by 1000 in place, in a single call.
Difficulty: Advanced
Click to reveal solution
Explanation: (cols) evaluates the variable, assigning to the named columns. lapply over .SD applies the operation to each. One pass, no copies.
Section 6. Advanced and dplyr migration (10 problems)
Exercise 6.1: dplyr filter -> data.table
Scenario: Translate mtcars |> filter(cyl == 6, mpg > 20) to data.table.
Difficulty: Beginner
Click to reveal solution
Explanation: Conditions in i. Comma in i is OR? No: comma is the position separator; & for AND.
Exercise 6.2: dplyr mutate -> data.table
Scenario: Translate mutate(kpl = mpg * 0.425) to data.table.
Difficulty: Beginner
Click to reveal solution
Explanation: := for assignment.
Exercise 6.3: dplyr summarise -> data.table
Scenario: Translate group_by(cyl) |> summarise(n = n(), mpg = mean(mpg)).
Difficulty: Intermediate
Click to reveal solution
Explanation: by in third arg of [], j as the summary list.
Exercise 6.4: dplyr slice_max -> data.table
Scenario: Top 3 highest-mpg cars overall (no grouping).
Difficulty: Intermediate
Click to reveal solution
Explanation: Sort then slice. data.table doesn't have a slice_max equivalent but order + head is concise.
Exercise 6.5: dplyr left_join -> data.table
Scenario: Translate left_join(orders, products, by = "product_id").
Difficulty: Intermediate
Click to reveal solution
Explanation: Reverse the order: in X[Y, on=] think of Y as the driver. products[orders, ...] means "for each row of orders, look up products" -> left-join orders -> products.
Exercise 6.6: dplyr count -> data.table
Scenario: count(diamonds, cut) equivalent.
Difficulty: Beginner
Click to reveal solution
Explanation: .N + by = canonical count. Shorter than dplyr's count.
Exercise 6.7: Multiple summaries with names
Scenario: Per cyl: mean and sd of mpg, hp.
Difficulty: Advanced
Click to reveal solution
Explanation: Spell out each summary in j. For programmatic: lapply over .SD with .SDcols.
Exercise 6.8: Conditional update on multiple columns
Scenario: For all 4-cyl rows, multiply mpg by 1.05 AND set kpl = mpg * 0.425.
Difficulty: Advanced
Click to reveal solution
Explanation: Combine i + multi-column := in one call. Note: kpl uses the UPDATED mpg (data.table evaluates left-to-right within :=()).
Exercise 6.9: Pivot to wide
Scenario: From long table (region, quarter, sales), pivot to wide.
Difficulty: Advanced
Click to reveal solution
Explanation: data.table::dcast is the wide-pivot. Formula syntax: rows ~ cols. value.var holds the cell values.
Exercise 6.10: Pivot to long
Scenario: Reverse the pivot from 6.9.
Difficulty: Advanced
Click to reveal solution
Explanation: data.table::melt is the long-pivot counterpart. id.vars are kept as-is; the rest are stacked into name/value pairs.
What to do next
After 50 exercises, the i/j/by mental model should feel native and := should be your default for any large-table mutation. Natural follow-ups:
- dplyr-Exercises (shipped) for comparison and dplyr-equivalents of every pattern here.
- Data-Wrangling-Exercises (shipped) for cross-package wrangling.
- Performance-Optimization-Exercises (coming) covers profiling, parallel, and memory tricks beyond data.table.