data.table Exercises: 12 High-Performance Data Manipulation Problems, Solved Step-by-Step
Twelve hands-on exercises to master the data.table package in R, from row filtering and by= aggregation to chained operations, joins, rolling joins, reshaping, and update-by-reference with :=. Each problem has a starter block, a click-to-reveal solution, and an explanation so you can check your answer and learn the idiom.
The first four exercises cover the essentials: filtering rows, picking columns, and computing summaries on a single table. The middle four use by= grouping, .SD, chaining, and update-by-reference. The last four tackle joins, rolling joins, reshaping, and group-wise windowed calculations, the moves you use in real analysis work. Write your own answer first, run it, compare with the reveal, and read the explanation.
How do you load data.table and build the working datasets?
Every exercise below builds on two tables: dt_cars (the built-in mtcars dataset with row names promoted to a model column) and dt_iris (the classic 150-flower measurements). Run the block once; all code on this page shares one R session, so later exercises can reference both tables without reloading.
dt_cars has 32 rows and 12 columns; the 12th column, model, was created from the row names. dt_iris has the classic 150 flower measurements across 5 columns. Both tables now carry the data.table class (plus data.frame for backward compatibility, which means any data.frame function still works on them).
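The setup block itself is not reproduced above; a minimal sketch that produces the two tables as described:

```r
# Load the package (install once with install.packages("data.table"))
library(data.table)

# mtcars: promote the row names to a regular column named model
# (appended with :=, so model becomes the 12th column)
dt_cars <- as.data.table(mtcars)[, model := rownames(mtcars)]

# iris: a plain conversion -- 150 rows, 5 columns
dt_iris <- as.data.table(iris)
```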
:= changes the original table without assignment. If you want to preserve the original, run copy() first; we cover this trap in the Common Mistakes section.
What data.table syntax do you need to know?
Here is a one-screen cheat sheet before you start. Skim it, then jump to Exercise 1.
| Task | Syntax | Example |
|---|---|---|
| Convert a data.frame | as.data.table() | DT <- as.data.table(mtcars) |
| Filter rows | DT[i, ] | DT[mpg > 20] |
| Select columns | DT[, j] | DT[, .(mpg, cyl)] |
| Compute new columns | DT[, .(new = ...)] | DT[, .(avg = mean(mpg))] |
| Group aggregation | DT[, j, by=] | DT[, mean(mpg), by=cyl] |
| Update by reference | := | DT[, kpl := mpg * 0.425] |
| Chain operations | DT[...][...] | DT[mpg > 20][, .N, by=cyl] |
| Apply to many columns | .SD, .SDcols | DT[, lapply(.SD, mean), .SDcols=1:3] |
| Row count per group | .N | DT[, .N, by=cyl] |
| Fast sort | setorder() | setorder(DT, -mpg) |
| Set join keys | setkey() | setkey(DT, cyl) |
| Join | X[Y, on=] | X[Y, on="id"] |
| Reshape wide to long | melt() | melt(DT, id.vars="id") |
| Reshape long to wide | dcast() | dcast(DT, id ~ var) |
The whole package boils down to one mental model, DT[i, j, by]: i picks rows, j operates on columns, and by groups. Each exercise below works one corner of that triple. Use distinct names like my_dt and my_result in your answers so you do not overwrite variables from earlier blocks.
Exercise 1: Convert and filter rows
Convert the built-in airquality dataset to a data.table called my_aq. Then select the rows where Temp is greater than 80 and Month equals 8. Save the filtered result to my_result and print its row count.
Expected output: a data.table with 18 rows and 6 columns.
Click to reveal solution
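A solution sketch (the reveal block is not reproduced here):

```r
library(data.table)

# Convert the built-in data.frame, then filter in i with bare column names
my_aq <- as.data.table(airquality)
my_result <- my_aq[Temp > 80 & Month == 8]
nrow(my_result)
```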
Explanation: Inside DT[i], bare column names like Temp and Month are evaluated in the table's scope; you do not write my_aq$Temp. The & operator combines two row conditions, and rows where either side is NA are dropped automatically. August (Month == 8) had 18 days where the temperature exceeded 80 degrees.
Exercise 2: Select columns and compute a new one
From dt_cars, create a new data.table my_result that contains only three columns: model, mpg, and a new column kpl (kilometres per litre, computed as mpg * 0.425). Do not modify dt_cars.
Expected output: 32 rows and 3 columns.
Click to reveal solution
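A solution sketch (the two setup lines are repeated so the block runs on its own):

```r
library(data.table)
dt_cars <- as.data.table(mtcars)[, model := rownames(mtcars)]

# .() builds a list in j, so a brand-new table is returned; dt_cars is untouched
my_result <- dt_cars[, .(model, mpg, kpl = mpg * 0.425)]
```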
Explanation: .() is shorthand for list() inside j. When j returns a list, data.table builds a new table with those columns. Existing columns like model and mpg are referenced by name; new columns get a name via =. Because we assigned the result to my_result, the original dt_cars is untouched, no := was used.
Exercise 3: Compound conditions with precedence
From dt_cars, find rows where the car is a six-cylinder (cyl == 6) OR has a manual transmission (am == 1) AND has more than 100 horsepower. Then keep only the columns model, cyl, hp, and am. Save to my_result.
Expected output: 11 rows, 4 columns.
Click to reveal solution
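A solution sketch; on the standard mtcars data this returns 11 rows:

```r
library(data.table)
dt_cars <- as.data.table(mtcars)[, model := rownames(mtcars)]

# Parenthesise the OR: & binds tighter than |
my_result <- dt_cars[(cyl == 6 | am == 1) & hp > 100, .(model, cyl, hp, am)]
```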
Explanation: R evaluates & before |, so without the parentheses your condition would become cyl == 6 | (am == 1 & hp > 100), a different question entirely. Always parenthesise when mixing | and &. The column list .(model, cyl, hp, am) sits in j; because we also passed an i condition, this is a single filter-then-select in one call.
Exercise 4: Group aggregation with by=
Using dt_cars, compute the mean mpg and the number of cars per cylinder count. Name the columns avg_mpg and n_cars. Save to my_result and sort by cyl ascending.
Expected output: 3 rows (one per cylinder count: 4, 6, 8).
Click to reveal solution
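A solution sketch:

```r
library(data.table)
dt_cars <- as.data.table(mtcars)[, model := rownames(mtcars)]

# One j expression per output column; .N counts rows per group
my_result <- dt_cars[, .(avg_mpg = mean(mpg), n_cars = .N), by = cyl]
setorder(my_result, cyl)   # sort in place, ascending by cyl
my_result
```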
Explanation: .N is a special symbol that holds the number of rows in each group. Combined with by = cyl, data.table splits the table into cylinder groups, computes mean(mpg) and .N per group, and returns a 3-row summary. setorder() sorts the result in place, faster than order() because no copy is made. Four-cylinder cars average 26.7 mpg; eight-cylinder cars only 15.1.
Exercise 5: Multi-column aggregation with .SD
Compute the mean of all four numeric columns (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) in dt_iris, grouped by Species. Save to my_result.
Expected output: 3 rows (one per species), 5 columns.
Click to reveal solution
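A solution sketch:

```r
library(data.table)
dt_iris <- as.data.table(iris)

# .SD holds each group's rows for exactly the columns named in .SDcols
num_cols <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
my_result <- dt_iris[, lapply(.SD, mean), by = Species, .SDcols = num_cols]
```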
Explanation: .SD stands for "Subset of Data"; it is a data.table containing the current group's rows for the columns listed in .SDcols. lapply(.SD, mean) applies mean() to each column of that subset. Without .SDcols, .SD would include every non-grouping column. Virginica irises have petals roughly 3.8x longer than setosa.
Exercise 6: Chain operations with DT[...][...]
Using dt_cars, find the top 3 cars (by hp) among automatic transmissions (am == 0), then keep only model, hp, and mpg. Solve it in one expression using chaining. Save to my_result.
Expected output: 3 rows, 3 columns.
Click to reveal solution
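A solution sketch:

```r
library(data.table)
dt_cars <- as.data.table(mtcars)[, model := rownames(mtcars)]

# filter -> sort descending -> take 3 -> select columns, one bracket per step
my_result <- dt_cars[am == 0][order(-hp)][1:3][, .(model, hp, mpg)]
```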
Explanation: Each [] consumes the table returned by the previous step, so you can read chains left to right: "take automatics, sort by horsepower descending, take the first three, keep these columns." The minus sign in order(-hp) sorts descending. This is equivalent to piping in dplyr, one operation per bracket keeps each step readable.
Each [] returns a fresh data.table that becomes the input for the next bracket, so the chain reads almost like English: filter, then sort, then slice, then pick columns.
Exercise 7: Update by reference with :=
Add two new columns to dt_cars in place: kpl (kilometres per litre = mpg * 0.425) and power_weight (hp / wt). Use the multi-column := form. Then print the first 3 rows of the updated table.
Expected output: dt_cars now has 14 columns (was 12).
Click to reveal solution
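A solution sketch (dt_cars is re-created here so the block runs standalone; in the shared session it already exists):

```r
library(data.table)
dt_cars <- as.data.table(mtcars)[, model := rownames(mtcars)]

# Multi-column := -- a character vector of names on the left,
# a list of expressions on the right; no assignment arrow needed
dt_cars[, c("kpl", "power_weight") := .(mpg * 0.425, hp / wt)]
head(dt_cars, 3)
```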
Explanation: The := operator modifies dt_cars in place, no assignment needed on the left. The multi-column form uses a character vector of names on the left and a list of expressions on the right. Because the update happens in place, it is memory-efficient: data.table does not copy the whole table to add two columns.
Once you run dt_cars[, kpl := mpg * 0.425], the column sticks. If you need an untouched copy for later, run safe <- copy(dt_cars) BEFORE the update.
Exercise 8: Sort with setorder and setkey
Sort dt_cars in place by cyl ascending and mpg descending using setorder(). Then set cyl as the key using setkey() for fast filtering. Verify the sort order and the key.
Expected output: key(dt_cars) returns "cyl", and the first row is a 4-cylinder with the highest mpg.
Click to reveal solution
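A solution sketch (dt_cars re-created so the block runs standalone):

```r
library(data.table)
dt_cars <- as.data.table(mtcars)[, model := rownames(mtcars)]

setorder(dt_cars, cyl, -mpg)   # in place: cyl ascending, mpg descending
setkey(dt_cars, cyl)           # index on cyl; also physically sorts by cyl

key(dt_cars)                   # "cyl"
dt_cars[1, .(model, cyl, mpg)] # a 4-cylinder with the highest mpg (33.9)
```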
Explanation: setorder() sorts a data.table in place (no copy), using a fast radix algorithm. Prefix a column with - to sort it descending. setkey() tells data.table to keep the table indexed on cyl, which makes future filters like dt_cars[.(4)] or joins on cyl much faster. Note that setkey() also physically reorders rows; setorder() and setkey() are related but distinct.
Exercise 9: Join two data.tables
Build two small tables from starwars-style data. Join them on planet_id to see each person's planet name. Keep all rows from the people table (left join).
Join people with planets on planet_id so every person keeps their row even if no planet matches. Save to my_result.
Expected output: 4 rows, Chewie's planet is NA because planet 30 is not in planets.
Click to reveal solution
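The starter tables are not shown above, so the data below is a guess shaped to match the described output (four people, with Chewie pointing at a planet_id that planets lacks):

```r
library(data.table)

# Hypothetical starter data -- names and ids are illustrative
people  <- data.table(name      = c("Luke", "Leia", "Han", "Chewie"),
                      planet_id = c(1, 2, 22, 30))
planets <- data.table(planet_id = c(1, 2, 22),
                      planet    = c("Tatooine", "Alderaan", "Corellia"))

# X[Y, on=] keeps every row of Y: all four people survive the join
my_result <- planets[people, on = "planet_id"]
```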
Explanation: In data.table's join syntax X[Y, on=], each row of Y is looked up in X, so the result keeps every row of Y. Thus planets[people, on="planet_id"] returns all four people rows with matched planet columns attached. Chewie's row has NA for planet because no planet with id 30 exists. This is equivalent to dplyr::left_join(people, planets, by="planet_id").
Exercise 10: Rolling join
You have sensor readings and need to match each event to the most recent sensor reading BEFORE the event time. This is a rolling join.
Roll the value from readings forward onto each event, so each event gets the most recent prior reading. Save to my_result.
Expected output: 3 rows, event times 2, 5, 9 get values 10, 30, 70.
Click to reveal solution
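The starter block is not shown; readings at times 1, 3, and 7 with values 10, 30, and 70 reproduce the expected output:

```r
library(data.table)

readings <- data.table(time = c(1, 3, 7), value = c(10, 30, 70))
events   <- data.table(time = c(2, 5, 9))

# roll = TRUE carries the last prior reading forward onto each event time
my_result <- readings[events, on = "time", roll = TRUE]
my_result
```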
Explanation: With roll = TRUE, when an event time does not match a reading time exactly, data.table carries the last prior reading forward. Event at t=2 has no exact reading, so the reading at t=1 (value 10) is used. Event at t=5 gets the reading at t=3 (value 30). This is the standard pattern for matching time-stamped events to price ticks, sensor states, or version history.
Exercise 11: Reshape wide to long with melt
Reshape dt_iris so that the four numeric measurements collapse into two columns: measurement (holding names like "Sepal.Length") and value (holding the number). Keep Species as the id column. Save to my_long.
Expected output: 600 rows (150 flowers x 4 measurements), 3 columns.
Click to reveal solution
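A solution sketch:

```r
library(data.table)
dt_iris <- as.data.table(iris)

# Everything except the id column is melted into (measurement, value) pairs
my_long <- melt(dt_iris,
                id.vars       = "Species",
                variable.name = "measurement",
                value.name    = "value")
dim(my_long)   # 600 rows, 3 columns
```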
Explanation: melt() pivots wide tables into long (tidy) format. id.vars is the column to keep intact (repeated for each measurement). The four measurement columns become the values of measurement, and their numbers become the values of value. The row count multiplies by the number of melted columns: 150 x 4 = 600. This long shape is what ggplot2 expects for facetting.
Exercise 12: Lag and cumulative sum by group
Using dt_iris, create a new data.table my_result that for each Species has:
- row = row number within species
- length = Sepal.Length
- prev_length = the previous row's Sepal.Length within the same species
- cum_length = running total of Sepal.Length within the same species
Keep only species setosa and the first 5 rows for brevity.
Expected output: 5 rows showing lag and cumulative sum.
Click to reveal solution
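A solution sketch:

```r
library(data.table)
dt_iris <- as.data.table(iris)

# shift() lags within each by= group; cumsum() restarts at each group too
my_result <- dt_iris[, .(row         = seq_len(.N),
                         length      = Sepal.Length,
                         prev_length = shift(Sepal.Length),
                         cum_length  = cumsum(Sepal.Length)),
                     by = Species][Species == "setosa"][1:5]
```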
Explanation: shift() is data.table's lag/lead function; by default it shifts values down by 1, so row i receives row i-1's value. The first row of each group has NA because there is no prior value. cumsum() runs a cumulative sum, resetting at each group boundary because of by = Species. These two are the most common window functions in practice.
Use shift(x, type = "lead") for the opposite direction: row i receives row i+1's value, and the last row of each group becomes NA.
What are the most common data.table mistakes?
Four traps catch almost everyone the first week. Walk through each wrong/right pair so you can spot the pattern when you hit it in your own code.
Mistake 1: Modifying a data.table inside a function without copy()
Functions that use := will silently change the caller's table.
Wrong:
Why it is wrong: data.table passes tables by reference, and := mutates. The caller did not ask for safe to change.
Correct:
Call copy() at the top of any function that uses := on its argument.
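A sketch of the wrong/right pair (the function and column names are illustrative):

```r
library(data.table)

# Wrong: := mutates the caller's table through the reference
add_kpl_bad <- function(d) {
  d[, kpl := mpg * 0.425]   # silently changes the table that was passed in
  d
}

# Correct: copy() first, then mutate the private copy
add_kpl_good <- function(d) {
  d <- copy(d)
  d[, kpl := mpg * 0.425]
  d
}

safe <- as.data.table(mtcars)
res  <- add_kpl_good(safe)
"kpl" %in% names(safe)   # FALSE -- the caller's table is untouched
```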
Mistake 2: Forgetting the leading comma in DT[, j]
If you want to select columns only, you must keep the comma.
Wrong:
Why it is wrong: DT[.(...)] with no comma is interpreted as a join/key-lookup. With a leading comma DT[, .(mpg, cyl)], the expression is read as j.
Correct:
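A sketch of the wrong/right pair:

```r
library(data.table)
DT <- as.data.table(mtcars)

# Wrong: without the comma, .(mpg, cyl) is read as i (a join/key lookup),
# not a column selection:
# DT[.(mpg, cyl)]   # errors -- this is not a select

# Correct: the leading comma places the expression in j
my_cols <- DT[, .(mpg, cyl)]
```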
Mistake 3: Assigning a chain that starts with :=
:= returns the table invisibly; assigning the result creates confusion.
Wrong:
Why it is wrong: := mutates and returns the original table. The new result is not a copy; any later edit to one affects the other.
Correct:
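A sketch of the wrong/right pair:

```r
library(data.table)
DT <- as.data.table(mtcars)

# Wrong: DT2 is NOT a copy -- := returns the same table invisibly,
# so any later edit to DT2 also edits DT
DT2 <- DT[, kpl := mpg * 0.425]

# Correct: update in place with no assignment at all...
DT[, kpl := mpg * 0.425]
# ...and take an explicit copy() when you truly need an independent table
DT_indep <- copy(DT)[, kpl := 0]
```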
Mistake 4: Grouping inside j instead of using by=
Using tapply() or manual subsetting inside j defeats the optimiser.
Wrong:
Why it is wrong: data.table has a fast by= optimiser that skips R-level overhead. Using tapply() inside j bypasses it.
Correct:
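A sketch of the wrong/right pair:

```r
library(data.table)
DT <- as.data.table(mtcars)

# Wrong: tapply() inside j does the grouping at R level, bypassing the
# by= optimiser (and losing the group key column)
slow <- DT[, .(avg = as.numeric(tapply(mpg, cyl, mean)))]

# Correct: by= lets data.table compute mean() per group in optimised C code
fast <- DT[, .(avg = mean(mpg)), by = cyl]
```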
Complete Example
Now let's stitch several exercises together into a realistic mini-analysis. Imagine you want to answer the question: "Which transmission type is more fuel-efficient, and how does the answer change by cylinder count?" The pipeline below filters, aggregates, joins a label table, sorts, and reshapes for plotting, all in one readable flow.
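The pipeline block itself is not reproduced here; a sketch that follows the described flow (the mpg > 15 filter threshold and the trans label table are assumptions, so exact numbers may differ from the narrative below):

```r
library(data.table)
dt_cars <- as.data.table(mtcars)[, model := rownames(mtcars)]

# A small label table to join readable transmission names onto the summary
trans <- data.table(am = c(0, 1), transmission = c("Automatic", "Manual"))

# filter -> aggregate -> join
stats <- dt_cars[mpg > 15][, .(avg_mpg = round(mean(mpg), 1)), by = .(cyl, am)]
summary_long <- trans[stats, on = "am"]   # keeps every stats row (left join)
setorder(summary_long, cyl)

# reshape: one row per cyl, one column per transmission -- plot-ready
summary_wide <- dcast(summary_long, cyl ~ transmission, value.var = "avg_mpg")
summary_wide
```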
Manuals beat automatics in the 4- and 6-cylinder classes where both appear, by roughly 5 mpg for 4-cylinder cars and 1.5 mpg for 6-cylinder. Eight-cylinder manuals are scarce, so depending on the mpg cutoff the Manual column can be sparse or NA there. The final wide table is plot-ready: one row per x-axis category, one column per bar group. This filter → aggregate → join → reshape pattern is the 80% workflow of real data.table analysis.
Summary
| Concept | Key syntax | Exercises |
|---|---|---|
| Filter rows | DT[i] | 1, 3 |
| Select/compute columns | DT[, .(...)] | 2, 3 |
| Group aggregate | DT[, j, by=] | 4, 5 |
| Multi-column ops | .SD, .SDcols | 5 |
| Chain operations | DT[...][...][...] | 6 |
| Update in place | := | 7 |
| Sort / key | setorder(), setkey() | 8 |
| Join | X[Y, on=] | 9, 10 |
| Reshape | melt(), dcast() | 11 |
| Window / group-wise | shift(), cumsum() with by= | 12 |
If you solved all 12 without peeking, you can handle 90% of real-world data.table tasks. If a few stumped you, revisit the relevant exercise and the Common Mistakes section, the traps caught you for a reason, and the second attempt is where the syntax sticks.
References
- Dowle, M., Srinivasan, A., data.table package documentation. CRAN. Link
- data.table GitHub repository, Rdatatable/data.table. Link
- data.table vignettes, Introduction, Reference semantics, Keys and fast binary search. Link
- Atrebas, data.table cheat sheet. Link
- Wickham, H., Grolemund, G., R for Data Science, 2nd edition. Link
- Dowle, M., data.table FAQ vignette. Link
Continue Learning
- dplyr filter() and select() Exercises, practise the tidyverse equivalent of Exercises 1-3 above.
- dplyr group_by() and summarise() Exercises, contrast Exercises 4-5 with the dplyr aggregation pattern.
- R Joins Tutorial, deeper dive into inner, left, right, full, and rolling joins.