data.table Exercises: 12 High-Performance Data Manipulation Problems
Twelve hands-on exercises to master the data.table package in R — from basic row filtering and by= aggregation to chained operations, joins, rolling joins, reshaping, and update-by-reference with :=. Every exercise has a runnable solution you can execute right in your browser.
Introduction
data.table is among the fastest general-purpose data manipulation packages in R. It handles millions of rows on a laptop, uses a compact DT[i, j, by] syntax, and modifies data in place to avoid memory copies. But the compactness is a double-edged sword — the syntax rewards practice. Reading one tutorial will not make you fluent. Solving problems will.
These 12 exercises build that fluency. The first four cover the essentials: filtering rows, picking columns, and computing summaries on a single table. The middle four use by= grouping, .SD, chaining, and update-by-reference. The last four tackle joins, rolling joins, reshaping, and group-wise windowed calculations — the moves you use in real analysis work.
Each problem states the task, gives you a starter block to edit, and hides the solution behind a click-to-reveal. Write your own answer first, run it, compare with the reveal, then read the explanation. If you are new to data.table, skim the Quick Reference below before starting. If you want a conceptual refresher on tidy data manipulation, the parent dplyr filter and select tutorial covers the equivalent tidyverse verbs.
All code on this page runs in a shared R session — variables you create in one block are available in the next. Use distinct names like my_dt and my_result in your answers so you do not overwrite variables from earlier blocks.
Quick Reference
Here is a one-screen cheat sheet before you start. Skim it, then jump to Exercise 1.
| Task | Syntax | Example |
|---|---|---|
| Convert a data.frame | as.data.table() | DT <- as.data.table(mtcars) |
| Filter rows | DT[i, ] | DT[mpg > 20] |
| Select columns | DT[, j] | DT[, .(mpg, cyl)] |
| Compute new columns | DT[, .(new = ...)] | DT[, .(avg = mean(mpg))] |
| Group aggregation | DT[, j, by=] | DT[, mean(mpg), by=cyl] |
| Update by reference | := | DT[, kpl := mpg * 0.425] |
| Chain operations | DT[...][...] | DT[mpg > 20][, .N, by=cyl] |
| Apply to many columns | .SD, .SDcols | DT[, lapply(.SD, mean), .SDcols=1:3] |
| Row count per group | .N | DT[, .N, by=cyl] |
| Fast sort | setorder() | setorder(DT, -mpg) |
| Set join keys | setkey() | setkey(DT, cyl) |
| Join | X[Y, on=] | X[Y, on="id"] |
| Reshape wide to long | melt() | melt(DT, id.vars="id") |
| Reshape long to wide | dcast() | dcast(DT, id ~ var) |
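The shared session used by the exercises is created by a setup block along these lines (a sketch — the model column is appended after conversion so it lands in position 12, matching the description below):

```r
library(data.table)

dt_cars <- as.data.table(mtcars)       # 32 rows, 11 numeric columns
dt_cars[, model := rownames(mtcars)]   # model becomes the 12th column
dt_iris <- as.data.table(iris)         # 150 rows, 5 columns
```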
The shared session keeps dt_cars and dt_iris available to later blocks without reloading. Both tables now carry the data.table class (plus data.frame for backward compatibility). dt_cars has 32 rows and 12 columns — note the 12th column, model, was created from the row names. dt_iris has the classic 150 flower measurements. You are ready to begin.
Updates made with := change the original table without assignment. If you want to preserve the original, run copy() first — we cover this in Mistake 1 below.

Exercise 1: Convert and filter rows
Convert the built-in airquality dataset to a data.table called my_aq. Then select the rows where Temp is greater than 80 and Month equals 8. Save the filtered result to my_result and print its row count.
Expected output: a data.table with 18 rows and 6 columns.
Click to reveal solution
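One way to solve it:

```r
my_aq <- as.data.table(airquality)           # convert the built-in data.frame
my_result <- my_aq[Temp > 80 & Month == 8]   # both conditions must hold
nrow(my_result)                              # 18, per the expected output
```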
Explanation: Inside DT[i], bare column names like Temp and Month are evaluated in the table's scope — you do not write my_aq$Temp. The & operator combines two row conditions, and rows where either side is NA are dropped automatically. August (Month == 8) had 18 days where the temperature exceeded 80 degrees.
Exercise 2: Select columns and compute a new one
From dt_cars, create a new data.table my_result that contains only three columns: model, mpg, and a new column kpl (kilometres per litre, computed as mpg * 0.425). Do not modify dt_cars.
Expected output: 32 rows and 3 columns.
Click to reveal solution
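A solution along these lines (it assumes dt_cars from the setup block):

```r
# Build a NEW table in j; dt_cars itself is not modified
my_result <- dt_cars[, .(model, mpg, kpl = mpg * 0.425)]
dim(my_result)   # 32 rows, 3 columns
```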
Explanation: .() is shorthand for list() inside j. When j returns a list, data.table builds a new table with those columns. Existing columns like model and mpg are referenced by name; new columns get a name via =. Because we assigned the result to my_result, the original dt_cars is untouched — no := was used.
Exercise 3: Compound conditions with | and &
From dt_cars, find rows where the car is a six-cylinder (cyl == 6) or has a manual transmission (am == 1), and in either case has more than 100 horsepower. Then keep only the columns model, cyl, hp, and am. Save to my_result.
Expected output: 11 rows, 4 columns.
Click to reveal solution
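One possible solution (assuming dt_cars from the setup block):

```r
# Parenthesise the OR so it is evaluated before the AND
my_result <- dt_cars[(cyl == 6 | am == 1) & hp > 100, .(model, cyl, hp, am)]
```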
Explanation: R evaluates & before |, so without the parentheses your condition would become cyl == 6 | (am == 1 & hp > 100) — a different question entirely. Always parenthesise when mixing | and &. The column list .(model, cyl, hp, am) sits in j; because we also passed an i condition, this is a single filter-then-select in one call.
Exercise 4: Group aggregation with by=
Using dt_cars, compute the mean mpg and the number of cars per cylinder count. Name the columns avg_mpg and n_cars. Save to my_result and sort by cyl ascending.
Expected output: 3 rows (one per cylinder count: 4, 6, 8).
Click to reveal solution
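A solution consistent with the explanation below:

```r
my_result <- dt_cars[, .(avg_mpg = mean(mpg), n_cars = .N), by = cyl]
setorder(my_result, cyl)   # sort the summary in place, ascending
my_result
```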
Explanation: .N is a special symbol that holds the number of rows in each group. Combined with by = cyl, data.table splits the table into cylinder groups, computes mean(mpg) and .N per group, and returns a 3-row summary. setorder() sorts the result in place — faster than order() because no copy is made. Four-cylinder cars average 26.7 mpg; eight-cylinder cars only 15.1.
Exercise 5: Multi-column aggregation with .SD
Compute the mean of all four numeric columns (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) in dt_iris, grouped by Species. Save to my_result.
Expected output: 3 rows (one per species), 5 columns.
Click to reveal solution
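One way to write it:

```r
num_cols <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
my_result <- dt_iris[, lapply(.SD, mean), by = Species, .SDcols = num_cols]
```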
Explanation: .SD stands for "Subset of Data" — it is a data.table containing the current group's rows for the columns listed in .SDcols. lapply(.SD, mean) applies mean() to each column of that subset. Without .SDcols, .SD would include every non-grouping column, including Species if it were numeric. Virginica irises have petals roughly 3.8x longer than setosa.
Exercise 6: Chain operations with DT[...][...]
Using dt_cars, find the top 3 cars (by hp) among automatic transmissions (am == 0), then keep only model, hp, and mpg. Solve it in one expression using chaining. Save to my_result.
Expected output: 3 rows, 3 columns.
Click to reveal solution
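A chained solution, one bracket per step:

```r
# Filter, then sort descending, then slice, then select
my_result <- dt_cars[am == 0][order(-hp)][1:3][, .(model, hp, mpg)]
```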
Explanation: Each [] consumes the table returned by the previous step, so you can read chains left to right: "take automatics, sort by horsepower descending, take the first three, keep these columns." The minus sign in order(-hp) sorts descending. This is equivalent to piping in dplyr — one operation per bracket keeps each step readable.
Each [] returns a fresh data.table that becomes the input for the next bracket. This keeps memory usage low and reads almost like English — filter, then sort, then slice, then pick columns.

Exercise 7: Update by reference with :=
Add two new columns to dt_cars in place: kpl (kilometres per litre = mpg * 0.425) and power_weight (hp / wt). Use the multi-column := form. Then print the first 3 rows of the updated table.
Expected output: dt_cars now has 14 columns (was 12).
Click to reveal solution
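One way to solve it:

```r
# Multi-column := — character vector of names on the left, list of values on the right
dt_cars[, c("kpl", "power_weight") := .(mpg * 0.425, hp / wt)]
head(dt_cars, 3)
```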
Explanation: The := operator modifies dt_cars in place — no assignment needed on the left. The multi-column form uses a character vector of names on the left and a list of expressions on the right. Because the update happens in place, it is memory-efficient: data.table does not copy the whole table to add two columns.
Once you run dt_cars[, kpl := mpg * 0.425], the column sticks — no reassignment required. If you need an untouched copy for later, run safe <- copy(dt_cars) BEFORE the update.

Exercise 8: Sort with setorder and setkey
Sort dt_cars in place by cyl ascending and mpg descending using setorder(). Then set cyl as the key using setkey() for fast filtering. Verify the sort order and the key.
Expected output: key(dt_cars) returns "cyl", and the first row is a 4-cylinder with the highest mpg.
Click to reveal solution
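A solution consistent with the expected output:

```r
setorder(dt_cars, cyl, -mpg)   # in-place sort: cyl ascending, mpg descending
setkey(dt_cars, cyl)           # index on cyl for fast lookups and joins
key(dt_cars)                   # "cyl"
head(dt_cars, 1)               # the 4-cylinder with the highest mpg
```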
Explanation: setorder() sorts a data.table in place (no copy), using a fast radix algorithm. Prefix a column with - to sort it descending. setkey() tells data.table to keep the table indexed on cyl, which makes future filters like dt_cars[.(4)] or joins on cyl much faster. Note setorder() does not create a key, and setkey() also physically reorders rows — they are related but distinct.
Exercise 9: Join two data.tables
Build two small tables from starwars-style data. Join them on planet_id to see each person's planet name. Keep all rows from the people table (left join).
Join people with planets on planet_id so every person keeps their row even if no planet matches. Save to my_result.
Expected output: 4 rows — Chewie's planet is NA because planet 30 is not in planets.
Click to reveal solution
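A solution with the starter tables reconstructed to match the expected output — the names and most ids here are illustrative; only the fact that planet_id 30 is absent from planets is specified by the exercise:

```r
people  <- data.table(id = 1:4,
                      name = c("Luke", "Leia", "Han", "Chewie"),
                      planet_id = c(1, 2, 22, 30))
planets <- data.table(planet_id = c(1, 2, 22),
                      planet = c("Tatooine", "Alderaan", "Corellia"))

# planets[people, on=...] keeps every people row: a left join onto people
my_result <- planets[people, on = "planet_id"]
```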
Explanation: In data.table's join syntax X[Y, on=], the rows come from Y — each row of Y is looked up in X, and unmatched rows are kept with NAs. So planets[people, on="planet_id"] returns all four people rows with matched planet columns attached. Chewie's row has NA for planet because no planet with id 30 exists. This is equivalent to dplyr::left_join(people, planets, by="planet_id").
Exercise 10: Rolling join
You have sensor readings and need to match each event to the most recent sensor reading BEFORE the event time. This is a rolling join.
Roll the value from readings forward onto each event, so each event gets the most recent prior reading. Save to my_result.
Expected output: 3 rows — event times 2, 5, 9 get values 10, 30, 70.
Click to reveal solution
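A solution with the starter tables reconstructed from the explanation — the readings at t = 1 (value 10) and t = 3 (value 30) are stated; the t = 7 reading is an assumption chosen so that the event at t = 9 receives value 70:

```r
readings <- data.table(time = c(1, 3, 7), value = c(10, 30, 70))
events   <- data.table(time = c(2, 5, 9))

# roll = TRUE carries the last prior reading forward onto each event time
my_result <- readings[events, on = "time", roll = TRUE]
my_result$value   # 10 30 70
```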
Explanation: With roll = TRUE, when an event time does not match a reading time exactly, data.table carries the last prior reading forward. Event at t=2 has no exact reading, so the reading at t=1 (value 10) is used. Event at t=5 gets the reading at t=3 (value 30). This is the standard pattern for matching time-stamped events to price ticks, sensor states, or version history.
Exercise 11: Reshape wide to long with melt
Reshape dt_iris so that the four numeric measurements collapse into two columns: measurement (holding names like "Sepal.Length") and value (holding the number). Keep Species as the id column. Save to my_long.
Expected output: 600 rows (150 flowers x 4 measurements), 3 columns.
Click to reveal solution
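One way to solve it:

```r
my_long <- melt(dt_iris, id.vars = "Species",
                variable.name = "measurement", value.name = "value")
dim(my_long)   # 600 rows, 3 columns
```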
Explanation: melt() pivots wide tables into long (tidy) format. id.vars is the column to keep intact (repeated for each measurement). The four measurement columns become the values of measurement, and their numbers become the values of value. The row count multiplies by the number of melted columns: 150 x 4 = 600. This long shape is what ggplot2 expects for facetting.
Exercise 12: Lag and cumulative sum by group
Using dt_iris, create a new data.table my_result that for each Species has:
- row = row number within species
- length = Sepal.Length
- prev_length = the previous row's Sepal.Length within the same species
- cum_length = running total of Sepal.Length within the same species
Keep only species setosa and the first 5 rows for brevity.
Expected output: 5 rows showing lag and cumulative sum.
Click to reveal solution
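A solution consistent with the explanation below:

```r
my_result <- dt_iris[Species == "setosa",
                     .(row = seq_len(.N),
                       length = Sepal.Length,
                       prev_length = shift(Sepal.Length),   # lag within the group
                       cum_length = cumsum(Sepal.Length)),  # running total
                     by = Species][1:5]
```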
Explanation: shift() is data.table's lag/lead function — by default it shifts values down by 1, so row i receives row i-1's value. The first row of each group has NA because there is no prior value. cumsum() runs a cumulative sum, resetting at each group boundary because of by = Species. These two are the most common window functions in practice.
Pass type = "lead" to shift() for the opposite direction: row i receives row i+1's value. The last row of each group becomes NA.

Common Mistakes and How to Fix Them
Mistake 1: Modifying a data.table inside a function without copy()
Functions that use := will silently change the caller's table.
Wrong:
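A representative reconstruction (the function name add_kpl is illustrative):

```r
add_kpl <- function(dt) {
  dt[, kpl := mpg * 0.425]   # := mutates the caller's table
  dt
}

safe <- as.data.table(mtcars)
res  <- add_kpl(safe)
"kpl" %in% names(safe)       # TRUE — safe changed behind the caller's back
```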
Why it is wrong: data.table passes tables by reference, and := mutates. The caller did not ask for safe to change.
Correct:
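For example (same illustrative function, now with copy()):

```r
add_kpl <- function(dt) {
  dt <- copy(dt)             # detach from the caller's object first
  dt[, kpl := mpg * 0.425]
  dt
}

safe <- as.data.table(mtcars)
res  <- add_kpl(safe)
"kpl" %in% names(safe)       # FALSE — the caller's table is untouched
```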
Call copy() at the top of any function that uses := on its argument.
Mistake 2: Forgetting the leading comma in DT[, j]
If you want to select columns only, you must keep the comma.
Wrong:
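A minimal example of the mistake:

```r
# No comma: .(...) in i is read as a join/key lookup, not a column selection
try(dt_cars[.(mpg, cyl)])   # fails (or does a surprising join) instead of selecting
```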
Why it is wrong: DT[.(...)] with no comma is interpreted as a join/key-lookup. With a leading comma DT[, .(mpg, cyl)], the expression is read as j.
Correct:
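With the comma in place:

```r
my_result <- dt_cars[, .(mpg, cyl)]   # leading comma: j-expression selects columns
```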
Mistake 3: Assigning a chain that starts with :=
:= returns the table invisibly — assigning it creates confusion.
Wrong:
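For example:

```r
new_dt <- dt_cars[, kpl := mpg * 0.425]   # new_dt is the SAME object as dt_cars
new_dt[, kpl := NULL]                     # ...so this also removes kpl from dt_cars
```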
Why it is wrong: := mutates and returns the original table. The new result is not a copy; any later edit to one affects the other.
Correct:
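For example:

```r
dt_cars[, kpl := mpg * 0.425]   # standalone statement: update in place

# If you need an independent modified table, copy first:
new_dt <- copy(dt_cars)
new_dt[, kpl := kpl * 2]        # dt_cars is unaffected
```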
Mistake 4: Grouping inside j instead of using by=
Using tapply() or manual subsetting inside j defeats the optimiser.
Wrong:
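For example:

```r
# Works, but runs tapply() inside j, skipping data.table's internal group
# optimiser — and returns a bare named vector, not a data.table
my_result <- dt_cars[, tapply(mpg, cyl, mean)]
```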
Why it is wrong: data.table has a fast by= optimiser that skips R-level overhead. Using tapply() inside j bypasses it.
Correct:
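For example:

```r
my_result <- dt_cars[, .(avg_mpg = mean(mpg)), by = cyl]   # optimised grouped mean
```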
Summary
| Concept | Key syntax | Exercises |
|---|---|---|
| Filter rows | DT[i] | 1, 3 |
| Select/compute columns | DT[, .(...)] | 2, 3 |
| Group aggregate | DT[, j, by=] | 4, 5 |
| Multi-column ops | .SD, .SDcols | 5 |
| Chain operations | DT[...][...][...] | 6 |
| Update in place | := | 7 |
| Sort / key | setorder(), setkey() | 8 |
| Join | X[Y, on=] | 9, 10 |
| Reshape | melt(), dcast() | 11 |
| Window / group-wise | shift(), cumsum() with by= | 12 |
If you solved all 12 without peeking, you can handle 90% of real-world data.table tasks.
FAQ
Is data.table really faster than dplyr?
For filter, group-by, and join on large tables (1M+ rows), yes — often 2-10x faster. For small tables, the gap is invisible. The real win is memory: := updates in place, so you can work with tables close to your RAM limit without running out.
How do I install data.table?
install.packages("data.table") in R. It is a zero-dependency package and installs in seconds. No system libraries needed.
Does := return the updated table?
It returns the table invisibly for chaining, but you should not assign the result. Call DT[, col := value] as a standalone statement. Assigning the result creates a second name for the same object, not a copy.
What does .SD mean?
"Subset of Data." Inside j, .SD is a data.table containing the current group's rows for the columns listed in .SDcols. It exists so you can write lapply(.SD, fn) to apply a function across many columns without listing them one by one.
References
- Dowle, M., Srinivasan, A. — data.table package documentation. CRAN.
- data.table GitHub repository — Rdatatable/data.table.
- data.table vignettes — Introduction, Reference semantics, Keys and fast binary search.
- Atrebas — data.table cheat sheet.
- Wickham, H., Grolemund, G. — R for Data Science, 2nd edition.
- Dowle, M. — data.table FAQ vignette.
What's Next?
- dplyr filter() and select() Exercises — practise the tidyverse equivalent of Exercises 1-3 above.
- dplyr group_by() and summarise() Exercises — contrast Exercises 4-5 with the dplyr aggregation pattern.
- R Joins Tutorial — deeper dive into inner, left, right, full, and rolling joins.