DuckDB + duckplyr in R: Query 100M Rows Faster Than pandas on a Laptop
duckplyr is a drop-in replacement for dplyr that runs your existing dplyr code on DuckDB's columnar engine — no SQL needed, no data loaded into memory until you ask for it.
Introduction
You already know dplyr. You can filter, mutate, group, and summarise in your sleep. The problem appears when your data outgrows RAM — a 2 GB CSV that crashes read.csv(), a Parquet file with 100 million rows.
duckplyr solves this by swapping dplyr's in-memory engine for DuckDB's analytical engine. You write the same filter(), mutate(), summarise() code. duckplyr translates it into DuckDB operations that stream through data without loading it all at once. If DuckDB cannot handle a particular operation, duckplyr falls back to regular dplyr automatically.
In this tutorial you will learn what duckplyr does differently from plain dplyr, how to query CSV and Parquet files without loading them into memory, which operations fall back to dplyr, when to use duckplyr vs raw SQL, and how the three approaches compare on speed.
Install both packages once with install.packages(c("duckdb", "duckplyr")).

What is duckplyr and how does it differ from dplyr?
duckplyr is a tidyverse package that overrides dplyr's core verbs — filter(), mutate(), summarise(), group_by(), select(), arrange(), and joins — so they execute on DuckDB instead of in R's memory. The key difference is lazy materialisation: duckplyr builds a query plan but does not compute anything until you call collect() or print the result.
Loading the package is all it takes. Your existing dplyr code works unchanged.
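A minimal sketch of that first pipeline. The specific filter-and-aggregate query on mtcars is illustrative, and the result is saved as mtcars_result so it can be reused later in the tutorial:

```r
library(duckplyr)  # attaching duckplyr re-routes dplyr verbs through DuckDB

# An ordinary dplyr chain: filter, group, aggregate
mtcars_result <- mtcars |>
  filter(cyl >= 6) |>
  group_by(cyl) |>
  summarise(mean_hp = mean(hp), n = n()) |>
  arrange(cyl)

mtcars_result
```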
The output is identical to what plain dplyr would produce. The difference is invisible: DuckDB executed the filter and aggregation using its columnar engine, which processes data in vectorised batches instead of row by row. For small data frames like mtcars, the speed difference is negligible. For millions of rows, it is dramatic.
Try it: Using duckplyr, filter mtcars to rows where hp > 100, group by cyl, and compute the mean mpg. Save to ex_result.
Click to reveal solution
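One possible solution, assuming library(duckplyr) is attached as in the examples above:

```r
# The hp > 100 filter and mean(mpg) both translate natively to DuckDB
ex_result <- mtcars |>
  filter(hp > 100) |>
  group_by(cyl) |>
  summarise(mean_mpg = mean(mpg))

ex_result
```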
Explanation: The same dplyr chain runs on DuckDB's engine because library(duckplyr) is loaded.
How do you query CSV and Parquet files without loading them into memory?
The biggest payoff of duckplyr is querying files directly on disk. df_from_csv() and df_from_parquet() create lazy references to files — no data enters R memory until you collect().
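A sketch of the CSV path. The file name flights.csv and its columns (carrier, dep_delay) are assumptions borrowed from the nycflights13 dataset:

```r
library(duckplyr)

flights <- df_from_csv("flights.csv")  # lazy reference: nothing is read yet

delays <- flights |>
  filter(dep_delay > 60) |>            # still lazy: only a query plan so far
  group_by(carrier) |>
  summarise(n_delayed = n(), mean_delay = mean(dep_delay))

collect(delays)                        # DuckDB scans the file only now
</imports>
```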
Everything before collect() is a query plan. DuckDB reads only the columns and rows it needs — if your CSV has 50 columns but you select 3, only 3 are scanned. This is why duckplyr can handle files larger than your available RAM.
Parquet files work the same way but are faster because Parquet stores data in a columnar, compressed format that DuckDB reads natively.
Tip: if you query the same file repeatedly, convert the CSV to Parquet once with arrow::write_parquet(), then query with df_from_parquet() forever after.

Try it: Using df_from_parquet("flights.parquet"), find the 3 destinations (dest) with the most flights in January (month == 1). Save to ex_pq.
Click to reveal solution
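One possible solution (file name as given in the exercise):

```r
ex_pq <- df_from_parquet("flights.parquet") |>
  filter(month == 1) |>
  count(dest, sort = TRUE) |>  # group by dest, count, sort descending
  slice_head(n = 3) |>
  collect()

ex_pq
```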
Explanation: count(dest, sort = TRUE) groups by destination and counts, sorted descending. The whole pipeline stays lazy until collect().
How does duckplyr handle operations it cannot translate?
duckplyr covers most common dplyr verbs and functions, but not everything. When it encounters an operation it cannot translate — a custom R function inside mutate(), for example — it falls back to regular dplyr. The fallback loads the data into memory and processes it the traditional way.
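For contrast, here is a sketch of a pipeline built only from functions duckplyr can translate (the miles-to-km conversion is an arbitrary example):

```r
library(duckplyr)

# Arithmetic and mean() both have DuckDB translations, so no fallback occurs
native <- mtcars |>
  mutate(kpl = mpg * 0.425) |>
  group_by(cyl) |>
  summarise(mean_kpl = mean(kpl))

native
```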
That ran entirely on DuckDB. Now watch what happens with a custom R function:
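A sketch of a pipeline that triggers fallback; to_kpl() is a made-up helper standing in for any custom R function:

```r
to_kpl <- function(x) x * 0.425  # DuckDB cannot look inside an R closure

# The mutate() step cannot be translated, so duckplyr materialises the data
# and hands this step to plain dplyr (watch for the "materializing" message)
fell_back <- mtcars |>
  mutate(kpl = to_kpl(mpg)) |>
  group_by(cyl) |>
  summarise(mean_kpl = mean(kpl))
```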
The "materializing" message means duckplyr fell back to dplyr for that step. The result is still correct — fallback is a safety net, not an error. But the performance benefit disappears for that operation because the data was pulled into R memory.
Set options(duckdb.materialize_message = TRUE) during development to see when fallback happens. If a critical pipeline falls back on every step, you are paying duckplyr's overhead with none of its speed.

Try it: Write a duckplyr pipeline that groups mtcars by cyl and computes min_mpg and max_mpg using only built-in functions (no custom R functions). Save to ex_fallback.
Click to reveal solution
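One possible solution:

```r
# min() and max() both translate to DuckDB aggregates, so no fallback
ex_fallback <- mtcars |>
  group_by(cyl) |>
  summarise(min_mpg = min(mpg), max_mpg = max(mpg))

ex_fallback
```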
Explanation: min() and max() are translated natively by DuckDB, so no fallback occurs.
When should you use duckplyr vs raw DuckDB SQL?
duckplyr and raw DuckDB SQL access the same engine. The choice is about ergonomics, not speed.
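A sketch of the raw-SQL route through the duckdb R client; the query mirrors the filter-and-aggregate example on mtcars from the first section:

```r
library(DBI)

con <- dbConnect(duckdb::duckdb())
duckdb::duckdb_register(con, "mtcars", mtcars)  # expose the data frame as a view

dbGetQuery(con, "
  SELECT cyl, AVG(hp) AS mean_hp, COUNT(*) AS n
  FROM mtcars
  WHERE cyl >= 6
  GROUP BY cyl
  ORDER BY cyl
")

dbDisconnect(con, shutdown = TRUE)
```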
The duckplyr version of that same query is the mtcars_result code from the first section. Both produce identical results at identical speed.
| Use duckplyr when... | Use raw SQL when... |
|---|---|
| You already think in dplyr verbs | You need window functions (OVER, PARTITION BY) |
| Your team reads R better than SQL | The query has CTEs or recursive logic |
| You want automatic fallback to dplyr | You are joining files from different formats in one query |
| Prototyping interactively in the console | Performance-tuning a specific query plan |
Try it: Write a duckplyr pipeline on mtcars that filters to cars with wt < 3, groups by gear, and counts the rows. Sort by count descending. Save to ex_duckplyr.
Click to reveal solution
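One possible solution:

```r
# count() combines group_by(), summarise(n = n()), and the descending sort
ex_duckplyr <- mtcars |>
  filter(wt < 3) |>
  count(gear, sort = TRUE)

ex_duckplyr
```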
Explanation: count(gear, sort = TRUE) is shorthand for group_by(gear) |> summarise(n = n()) |> arrange(desc(n)). duckplyr translates the whole chain.
How fast is duckplyr compared to dplyr and data.table?
Speed depends on data size, operation type, and whether the data fits in memory. For small data frames (under 100K rows), all three are fast enough. The gap opens on millions of rows, especially for grouped aggregations.
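One way to measure this yourself. The big_df setup (1 million rows) and the system.time() comparison are illustrative, and your numbers will differ by machine:

```r
set.seed(42)
n <- 1e6
big_df <- data.frame(
  group = sample(letters, n, replace = TRUE),
  value = rnorm(n)
)

library(dplyr)  # time plain dplyr first, before duckplyr overrides it
system.time(
  big_df |> group_by(group) |> summarise(total = sum(value))
)

library(duckplyr)  # from here on, the identical code runs on DuckDB
system.time(
  big_df |> group_by(group) |> summarise(total = sum(value))
)
```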
On 1 million rows, duckplyr is roughly 6x faster than dplyr and competitive with data.table. The advantage grows with data size: on 100 million rows or files that exceed RAM, duckplyr can finish where dplyr cannot even start.
Try it: Using big_df from above, write a duckplyr pipeline that filters rows where value > 0, groups by group, and computes total = sum(value). Save to ex_bench.
Click to reveal solution
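One possible solution:

```r
ex_bench <- big_df |>
  filter(value > 0) |>
  group_by(group) |>
  summarise(total = sum(value))

ex_bench
```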
Explanation: filter(value > 0) keeps roughly half the rows, then group_by() |> summarise(total = sum(value)) aggregates per group — all on DuckDB's engine.
Common Mistakes and How to Fix Them
Mistake 1: Forgetting that library(duckplyr) overrides dplyr
❌ Wrong:
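A sketch of the anti-pattern; my_data and my_custom_check() are hypothetical names:

```r
library(duckplyr)  # silently overrides dplyr for the whole session

# Every step uses a custom R function, so every step falls back:
# DuckDB overhead with none of the speed
flagged <- my_data |>
  mutate(flag = my_custom_check(value)) |>
  filter(flag)
```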
Why it is wrong: Loading duckplyr replaces dplyr's methods for the entire session. Every dplyr verb now goes through DuckDB first and falls back if unsupported. If your pipeline has many custom R functions, you pay overhead on every fallback.
✅ Correct:
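One option is to opt in per data frame instead of attaching the package globally. The conversion helper shown here, as_duckplyr_df(), is version-dependent, so check your duckplyr release's documentation for the current name:

```r
# Route only this aggregation through DuckDB; the rest of the session
# keeps using plain dplyr (my_data is a hypothetical data frame)
fast_part <- duckplyr::as_duckplyr_df(my_data) |>
  group_by(region) |>
  summarise(total = sum(value))
```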
Mistake 2: Calling collect() too early
❌ Wrong:
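A sketch of the mistake, using the flights.csv file from earlier sections:

```r
# collect() first pulls EVERY row into R memory...
flights <- df_from_csv("flights.csv") |> collect()

# ...so this filter runs on an ordinary in-memory data frame
late <- flights |> filter(dep_delay > 120)
```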
Why it is wrong: collect() materialises the entire dataset into R memory. The subsequent filter() runs on an in-memory data frame — you lost DuckDB's file-scanning advantage.
✅ Correct:
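The same query with collect() moved to the end:

```r
# The filter is part of the lazy plan: DuckDB scans the file and
# returns only the matching rows to R
late <- df_from_csv("flights.csv") |>
  filter(dep_delay > 120) |>
  collect()
```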
Mistake 3: Assuming all dplyr functions are translated
❌ Wrong:
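A sketch of the mistake (the tailnum column is borrowed from the flights data):

```r
# str_detect() has no DuckDB translation, so this step falls back to dplyr
n1_planes <- df_from_csv("flights.csv") |>
  filter(stringr::str_detect(tailnum, "^N1")) |>
  collect()
```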
Why it is wrong: duckplyr translates standard dplyr verbs and base R functions (mean, sd, sum, grepl, etc.) but not tidyverse extension functions like str_detect() or lubridate::ymd().
✅ Correct:
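The base-R equivalent stays in-engine:

```r
# grepl() is translated to DuckDB's regex matching, so the filter runs in-engine
n1_planes <- df_from_csv("flights.csv") |>
  filter(grepl("^N1", tailnum)) |>
  collect()
```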
Practice Exercises
Exercise 1: File-to-summary pipeline
Write a complete pipeline that reads flights.csv with df_from_csv(), filters to December flights with departure delay over 30 minutes, groups by carrier, computes the mean and max delay, and collects the result. Sort by mean delay descending.
Click to reveal solution
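One possible solution (column names assume the nycflights13 schema):

```r
ex1 <- df_from_csv("flights.csv") |>
  filter(month == 12, dep_delay > 30) |>
  group_by(carrier) |>
  summarise(
    mean_delay = mean(dep_delay),
    max_delay  = max(dep_delay)
  ) |>
  arrange(desc(mean_delay)) |>
  collect()

ex1
```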
Explanation: The entire pipeline stays lazy until collect(). DuckDB scans only the month, dep_delay, and carrier columns from the CSV, filters in-engine, aggregates, sorts, and returns just the summary rows to R.
Exercise 2: Benchmark duckplyr vs dplyr on your machine
Generate a 5-million-row data frame with columns id (1:5e6), category (sample of LETTERS), and amount (runif). Benchmark a grouped sum by category using both dplyr and duckplyr. Report which is faster and by how much.
Click to reveal solution
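One possible solution. Absolute timings depend on your machine, so no numbers are asserted here:

```r
library(dplyr)  # benchmark plain dplyr before duckplyr overrides it

set.seed(1)
n <- 5e6
df <- data.frame(
  id       = 1:n,
  category = sample(LETTERS, n, replace = TRUE),
  amount   = runif(n)
)

t_dplyr <- system.time(
  df |> group_by(category) |> summarise(total = sum(amount))
)["elapsed"]

library(duckplyr)  # same code, DuckDB engine
t_duckplyr <- system.time(
  df |> group_by(category) |> summarise(total = sum(amount))
)["elapsed"]

unname(t_dplyr / t_duckplyr)  # speedup factor on this machine
```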
Explanation: The grouped sum on 5 million rows highlights DuckDB's vectorised execution. dplyr processes groups in R's interpreter; duckplyr delegates to DuckDB's compiled C++ engine.
Summary
| Approach | Best for | Limitation |
|---|---|---|
| duckplyr | dplyr users querying large files on disk | Falls back on custom R functions |
| Raw DuckDB SQL | Complex queries (window functions, CTEs, multi-file joins) | Requires SQL knowledge |
| dplyr | Small in-memory data, rapid prototyping | Cannot handle out-of-memory data |
| data.table | Fast in-memory operations with concise syntax | Still requires data to fit in RAM |
Key takeaways:
- library(duckplyr) is all you need — it overrides dplyr methods silently
- df_from_csv() and df_from_parquet() create lazy references to files on disk
- Keep pipelines lazy as long as possible — call collect() at the very end
- Use base R functions (mean, grepl, sum) over tidyverse extensions for best translation coverage
- duckplyr and raw SQL share the same engine — pick the interface that fits the task
FAQ
Does duckplyr work with all dplyr verbs?
Most core verbs are supported: filter(), select(), mutate(), summarise(), group_by(), arrange(), slice_head()/slice_tail(), and all join types. Verbs that need R-specific evaluation — like rowwise() or do() — fall back to dplyr. The coverage grows with each duckplyr release.
Can I use duckplyr with remote Parquet files on S3?
Yes. DuckDB's httpfs extension lets you query Parquet files on S3, GCS, or any HTTP URL. Install and load it on duckplyr's default connection with db_exec("INSTALL httpfs") followed by db_exec("LOAD httpfs"), then pass the S3 URL to df_from_parquet("s3://bucket/path.parquet"). Authentication uses standard AWS environment variables.
Should I switch all my scripts to duckplyr?
For interactive analysis and new pipelines, yes — library(duckplyr) is a free speed boost with automatic fallback. For production pipelines with heavy custom R functions, profile first: if most steps fall back to dplyr, the overhead may not be worth it. You can always use duckplyr for the I/O-heavy parts and dplyr for the R-heavy parts in the same script.
References
- duckplyr documentation — official tidyverse site. duckplyr.tidyverse.org
- DuckDB Blog — duckplyr announcement. duckdb.org/2024/04/02/duckplyr
- DuckDB R Client documentation. duckdb.org/docs/stable/clients/r
- Appsilon — R dplyr vs DuckDB benchmarks. appsilon.com/post/r-dplyr-vs-duckdb
- duckplyr large data vignette. duckplyr.tidyverse.org/articles/large.html
What's Next?
- DuckDB in R — The parent tutorial covering DuckDB's SQL interface, in-process architecture, and when to use SQL vs dplyr syntax.
- Connect R to Any Database: DBI — Learn the DBI package for connecting to PostgreSQL, MySQL, SQLite, and other databases from R.