DuckDB in R: Query 100 Million Rows on Your Laptop in Under 2 Seconds
DuckDB is an in-process columnar database engine that runs inside your R session — no server, no setup, no need to load everything into RAM — letting you query millions of rows from CSV, Parquet, or data frames using SQL or dplyr syntax.
Introduction
Your laptop can query 100 million rows in under 2 seconds. No Spark cluster, no cloud database, no special hardware. DuckDB makes this possible by embedding a full analytical database engine directly inside your R process.
DuckDB is an OLAP (Online Analytical Processing) database designed for analytical queries. Unlike row-based databases such as SQLite or PostgreSQL, DuckDB stores data in columns. This columnar layout means aggregations like "average sales by region" only read the columns they need, skipping everything else.
For R users who hit memory walls with dplyr or data.table, DuckDB is a practical escape hatch. It queries files on disk without loading them entirely into RAM. You can run SQL directly, or use familiar dplyr verbs through the duckplyr package. In this tutorial, you will learn both approaches — from connecting and querying data frames to reading CSV and Parquet files, with benchmarks showing when DuckDB outperforms the alternatives.
What Is DuckDB and Why Should R Users Care?
Most databases follow a client-server model. You install a server (PostgreSQL, MySQL), start it, then connect from R. DuckDB takes a different approach. It is an in-process database, meaning the entire engine runs inside your R session as a library. There is nothing to install, configure, or keep running.
The second key difference is how DuckDB stores data. Traditional databases store data row by row. DuckDB stores data column by column. When you run a query like "compute the average of column X grouped by column Y," DuckDB reads only those two columns. A row-based database would read every column in every row.

Figure 1: DuckDB runs inside your R process and queries data frames, CSV, and Parquet files directly.
These two properties — in-process and columnar — make DuckDB extremely fast for analytical workloads. Published benchmarks show DuckDB completing aggregations on 100 million rows in under 4 seconds, while base R takes over 480 seconds and dplyr takes around 110 seconds for the same task [3].
Let's install and create your first connection. You need two packages: duckdb (the engine) and DBI (the standard R database interface).
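A minimal setup might look like this (both packages are on CRAN):

```r
# Install once from CRAN (uncomment on first use)
# install.packages(c("duckdb", "DBI"))

library(DBI)
library(duckdb)

# An in-memory database: lives in RAM, gone when the session ends
con <- dbConnect(duckdb())

# Confirm the engine is alive
dbGetQuery(con, "SELECT 42 AS answer")

dbDisconnect(con, shutdown = TRUE)
```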
The dbConnect(duckdb()) call creates an in-memory database. Everything happens in RAM, which is the fastest option for interactive analysis. You now have a full SQL engine ready to accept queries.
How Do You Connect to DuckDB from R?
DuckDB offers two connection modes. In-memory databases live entirely in RAM and vanish when you close R. File-based databases persist to disk and survive between sessions. The choice depends on your workflow.
For exploratory analysis where you are querying data and discarding results, in-memory is ideal. For projects where you build a database over time, file-based connections let you pick up where you left off.
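The two modes differ only in the `dbdir` argument; a sketch (the file name is just an example):

```r
library(DBI)
library(duckdb)

# In-memory: lives in RAM, vanishes when R closes
con_mem <- dbConnect(duckdb())

# File-based: persists to disk and survives between sessions
con_file <- dbConnect(duckdb(), dbdir = "my_analysis.duckdb")

dbDisconnect(con_mem, shutdown = TRUE)
dbDisconnect(con_file, shutdown = TRUE)
```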
Once connected, you interact with DuckDB using standard DBI functions. Let's create a table, insert some data, and query it. This is pure SQL, exactly like working with SQLite or PostgreSQL through DBI.
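A sketch with an illustrative `sales` table (four regions, so the aggregation returns four summary rows):

```r
library(DBI)
library(duckdb)
con <- dbConnect(duckdb())

# Create a table and insert a few rows: plain SQL via DBI
dbExecute(con, "CREATE TABLE sales (region VARCHAR, amount DOUBLE)")
dbExecute(con, "
  INSERT INTO sales VALUES
    ('North', 120), ('North',  80),
    ('South', 200), ('South', 150),
    ('East',   90), ('East',   60),
    ('West',  170), ('West',  130)
")

# Aggregate with GROUP BY: one summary row per region
result <- dbGetQuery(con, "
  SELECT region, COUNT(*) AS n, AVG(amount) AS avg_amount
  FROM sales
  GROUP BY region
  ORDER BY region
")
result

dbDisconnect(con, shutdown = TRUE)
```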
The result comes back as a regular R data frame. DuckDB ran the GROUP BY aggregation internally using its columnar engine, then handed back the four summary rows. Notice how this follows the same DBI workflow as any other database — the only difference is duckdb() as the driver.
How Do You Query R Data Frames with DuckDB?
One of DuckDB's most powerful features is querying R data frames directly with SQL. The duckdb_register() function exposes an existing data frame as a virtual table inside DuckDB. No data is copied — DuckDB reads the data frame's memory directly.
This means you can write SQL queries against mtcars, iris, or any data frame you have loaded, and DuckDB will execute them using its optimized columnar engine.

Figure 2: The DuckDB query pipeline: connect, register data, write SQL, and get results as an R data frame.
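Registering mtcars and aggregating by cylinder count might look like this (table name is our choice):

```r
library(DBI)
library(duckdb)
con <- dbConnect(duckdb())

# Expose mtcars as a virtual table: no copy is made
duckdb_register(con, "mtcars_tbl", mtcars)

dbGetQuery(con, "
  SELECT cyl,
         ROUND(AVG(mpg), 1) AS avg_mpg,
         ROUND(AVG(hp))     AS avg_hp
  FROM mtcars_tbl
  GROUP BY cyl
  ORDER BY cyl
")

dbDisconnect(con, shutdown = TRUE)
```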
DuckDB processed mtcars without copying it. The 4-cylinder cars average 26.7 mpg with 83 hp, while 8-cylinder cars average only 15.1 mpg but deliver 209 hp. This is the same result you would get with dplyr's group_by() and summarise(), but executed by DuckDB's query optimizer.
You can run more complex queries too. Let's find the most efficient car in each cylinder group.
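One wrinkle: mtcars keeps car names as row names, which SQL cannot see, so a sketch would first move them into a proper column:

```r
library(DBI)
library(duckdb)
con <- dbConnect(duckdb())

# Move row names into a column so SQL can report the car names
cars <- data.frame(car = rownames(mtcars), mtcars)
duckdb_register(con, "cars", cars)

# Rank cars by mpg within each cylinder group, keep the winner
dbGetQuery(con, "
  SELECT car, cyl, mpg
  FROM (
    SELECT car, cyl, mpg,
           ROW_NUMBER() OVER (PARTITION BY cyl ORDER BY mpg DESC) AS rn
    FROM cars
  )
  WHERE rn = 1
  ORDER BY cyl
")

dbDisconnect(con, shutdown = TRUE)
```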
Window functions like ROW_NUMBER() OVER() are native to DuckDB's SQL engine. They would require complex workarounds in base R but are straightforward in SQL. The most fuel-efficient 4-cylinder car gets 33.9 mpg, while the best 8-cylinder car manages only 19.2 mpg.
How Do You Query CSV and Parquet Files Directly?
DuckDB can query CSV and Parquet files on disk without loading them into R first. This is the key feature for large data. You write SQL with read_csv_auto() or read_parquet() in the FROM clause, and DuckDB handles the rest.
For CSV files, DuckDB automatically detects column types, delimiters, and headers. For Parquet files, DuckDB pushes filters and column selections down into the file scan, reading only what the query needs.
Let's create a sample CSV file and query it directly.
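A sketch using a temporary file with made-up category data:

```r
library(DBI)
library(duckdb)
con <- dbConnect(duckdb())

# Write a small sample CSV to a temporary file
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(
    category = rep(c("A", "B", "C"), each = 100),
    value    = runif(300, 0, 100)
  ),
  csv_path, row.names = FALSE
)

# Query the file directly: it never becomes an R data frame
dbGetQuery(con, sprintf("
  SELECT category, COUNT(*) AS n, ROUND(AVG(value), 1) AS avg_value
  FROM read_csv_auto('%s')
  GROUP BY category
  ORDER BY category
", csv_path))

dbDisconnect(con, shutdown = TRUE)
```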
DuckDB read the CSV, detected the column types, grouped by category, and returned the summary — all in one SQL statement. The CSV file never entered R's memory as a data frame.
You can also push filters into the file scan. DuckDB reads only the rows and columns that match your query.
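A sketch with a small sample file standing in for a large CSV on disk:

```r
library(DBI)
library(duckdb)
con <- dbConnect(duckdb())

# A small sample file (stand-in for a large CSV)
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(category = rep(c("A", "B", "C"), each = 100),
             value    = runif(300, 0, 100)),
  csv_path, row.names = FALSE
)

# The WHERE clause is applied during the scan itself
dbGetQuery(con, sprintf("
  SELECT category, COUNT(*) AS n, MAX(value) AS max_value
  FROM read_csv_auto('%s')
  WHERE value > 50 AND category <> 'C'
  GROUP BY category
  ORDER BY category
", csv_path))

dbDisconnect(con, shutdown = TRUE)
```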
The WHERE clause filters rows during the scan. DuckDB does not load the entire file and then filter — it skips non-matching rows as it reads. For Parquet files this is even more efficient, because Parquet stores metadata that lets DuckDB skip entire row groups.
How Does duckplyr Replace dplyr for Large Data?
If you prefer dplyr syntax over SQL, the duckplyr package gives you the best of both worlds. It is a drop-in replacement for dplyr that routes computations through DuckDB's engine instead of R's memory.
When you load duckplyr, it overrides dplyr's data frame methods. Your existing filter(), mutate(), summarise(), and group_by() code runs unchanged — but DuckDB executes it behind the scenes. If DuckDB cannot translate a particular operation, duckplyr falls back to standard dplyr automatically.

Figure 3: Two ways to query DuckDB: SQL via dbGetQuery() or dplyr verbs via duckplyr.
Let's see duckplyr in action. First, you convert a data frame to a DuckDB-backed tibble with as_duckdb_tibble().
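A sketch using mtcars (the miles-per-gallon to km-per-litre conversion is just an illustrative mutate):

```r
library(duckplyr)

# Convert a data frame into a DuckDB-backed tibble
cars_duck <- as_duckdb_tibble(mtcars)

result <- cars_duck |>
  filter(hp > 100) |>
  mutate(kpl = mpg * 0.425) |>   # illustrative unit conversion
  group_by(cyl) |>
  summarise(n = n(), avg_kpl = mean(kpl)) |>
  collect()                       # triggers execution

result
```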
The code looks identical to regular dplyr. The only differences are loading duckplyr and using as_duckdb_tibble() to create the input. Behind the scenes, DuckDB's query optimizer planned and executed the filter, mutate, and summarise as a single efficient operation.
The collect() call at the end triggers execution. Until you call collect(), duckplyr builds a lazy query plan without running anything. This is similar to how dbplyr works with remote databases.
How Fast Is DuckDB Compared to dplyr and data.table?
DuckDB's speed advantage depends on your data size and query type. For small data frames already in memory, data.table is hard to beat. But as data grows beyond a few million rows — especially when reading from disk — DuckDB pulls ahead dramatically.
Here are published benchmark results from multiple independent studies.
| Scenario | Base R | dplyr | data.table | DuckDB |
|---|---|---|---|---|
| 100M rows, group + mean (from CSV) [3] | 486 sec | 110 sec | ~15 sec | 3.9 sec |
| 38M rows, group + median (Parquet) [4] | — | 43 sec | — | 2.2 sec |
| 10M rows, aggregate + sort (disk) [5] | — | 34x slower | ~10x slower | 1x (baseline) |
| 1M rows, in-memory group_by | ~2 sec | ~0.5 sec | ~0.1 sec | ~0.3 sec |
For in-memory operations on small to medium data, data.table is typically fastest. Its C-optimized routines and reference semantics make it extremely efficient. But data.table requires loading the entire dataset into RAM first.
DuckDB's advantage comes from three sources: columnar storage that reads only needed columns, parallel execution that uses all CPU cores, and disk-based scanning that avoids loading entire files into memory.
Let's run a simple illustration using mtcars. On a small dataset like this, the differences are negligible — the benchmarks above with millions of rows are where DuckDB shines.
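A rough timing sketch of the same aggregation three ways (with `system.time()`; any timing tool would do):

```r
library(DBI)
library(duckdb)
library(dplyr)
library(data.table)

con <- dbConnect(duckdb())
duckdb_register(con, "mtcars_tbl", mtcars)
dt <- as.data.table(mtcars)

# Same group-and-average three ways; on 32 rows all timings are tiny
system.time(dbGetQuery(con, "SELECT cyl, AVG(mpg) AS avg_mpg FROM mtcars_tbl GROUP BY cyl"))
system.time(mtcars |> group_by(cyl) |> summarise(avg_mpg = mean(mpg)))
system.time(dt[, .(avg_mpg = mean(mpg)), by = cyl])

dbDisconnect(con, shutdown = TRUE)
```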
With 32 rows, all three approaches finish in milliseconds. The overhead of DuckDB's query optimizer is not worth it at this scale. The real advantage appears with millions of rows, where DuckDB's columnar scanning and parallel execution leave dplyr and base R far behind.
When Should You Use DuckDB Instead of Other Tools?
DuckDB is not always the right choice. Here is a decision framework based on your data size, storage format, and query patterns.
| Factor | Use DuckDB | Use data.table | Use dplyr |
|---|---|---|---|
| Data size | Millions to billions of rows | Up to ~100M rows (in RAM) | Up to ~10M rows |
| Data location | Files on disk (CSV, Parquet) | Already in R memory | Already in R memory |
| Query type | Aggregations, joins, window functions | Fast row-level operations | Readable pipelines |
| Memory constraint | RAM is limited | RAM is sufficient | RAM is sufficient |
| SQL fluency | Comfortable with SQL | Prefer R syntax | Prefer R syntax |
DuckDB versus SQLite is a common question. SQLite is an OLTP (transactional) database — optimized for inserting, updating, and deleting individual rows. DuckDB is OLAP — optimized for reading and aggregating large batches of rows. If your workload is "insert one row at a time," use SQLite. If your workload is "summarise a million rows," use DuckDB.
DuckDB versus Spark is another comparison. Spark distributes computation across multiple machines. DuckDB runs on a single machine. If your data fits on one machine (even a laptop), DuckDB is simpler and often faster. If you need a cluster, Spark is the right tool.
Common Mistakes and How to Fix Them
Mistake 1: Forgetting to close the DuckDB connection
❌ Wrong:
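A sketch of the anti-pattern:

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb())
dbGetQuery(con, "SELECT 1 AS ok")
# ...script ends here without dbDisconnect(): resources leak
```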
Why it is wrong: DuckDB holds memory and file locks. Forgetting to disconnect leaks resources and can prevent other processes from accessing file-based databases.
✅ Correct:
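A corrected sketch:

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb())
dbGetQuery(con, "SELECT 1 AS ok")
dbDisconnect(con, shutdown = TRUE)  # releases memory and file locks
```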
Mistake 2: Loading the entire file into R before querying
❌ Wrong:
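A sketch of the anti-pattern (the file path is hypothetical):

```r
library(DBI)
library(duckdb)
con <- dbConnect(duckdb())

# Loads the whole file into RAM first, then hands it to DuckDB
big <- read.csv("big_file.csv")   # hypothetical path; RAM already spent here
duckdb_register(con, "big", big)
dbGetQuery(con, "SELECT COUNT(*) AS n FROM big")
```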
Why it is wrong: This defeats DuckDB's main advantage. You already consumed all the RAM loading the file. DuckDB can query the file directly from disk without loading it.
✅ Correct:
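A corrected sketch (same hypothetical path):

```r
library(DBI)
library(duckdb)
con <- dbConnect(duckdb())

# DuckDB scans the file on disk; R never holds the full data
dbGetQuery(con, "SELECT COUNT(*) AS n FROM read_csv_auto('big_file.csv')")

dbDisconnect(con, shutdown = TRUE)
```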
Mistake 3: Using string concatenation for SQL parameters
❌ Wrong:
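A sketch of the anti-pattern (assuming a `sales` table with a `region` column):

```r
user_region <- "North"  # imagine this came from user input

# String concatenation puts untrusted input straight into the SQL
dbGetQuery(con, paste0(
  "SELECT * FROM sales WHERE region = '", user_region, "'"
))
```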
Why it is wrong: This opens the door to SQL injection attacks. A malicious input like '; DROP TABLE sales; -- could destroy your data.
✅ Correct:
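A corrected sketch using a parameterised query (the `?` placeholder and `params` argument come from DBI):

```r
user_region <- "North"

# The value is bound separately and never interpreted as SQL
dbGetQuery(con,
  "SELECT * FROM sales WHERE region = ?",
  params = list(user_region)
)
```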
Mistake 4: Expecting DuckDB to handle multiple simultaneous writers
❌ Wrong:
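A sketch of the anti-pattern (imagine the two connections opened by two separate R processes):

```r
library(DBI)
library(duckdb)

# Process 1 opens the database file for writing...
con_a <- dbConnect(duckdb(), dbdir = "shared.duckdb")

# ...and a second R process tries to open the same file for writing.
# DuckDB allows only one writer; the second connection will fail.
con_b <- dbConnect(duckdb(), dbdir = "shared.duckdb")
```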
Why it is wrong: DuckDB allows only one writer at a time. Multiple simultaneous readers are fine, but concurrent writes will fail or produce undefined behavior.
✅ Correct:
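A corrected sketch: one writer, and any other processes open the file read-only via the `read_only` argument:

```r
library(DBI)
library(duckdb)

# The single writer process:
con_w <- dbConnect(duckdb(), dbdir = "shared.duckdb")

# Reader processes open the same file read-only:
con_r <- dbConnect(duckdb(), dbdir = "shared.duckdb", read_only = TRUE)
```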
Mistake 5: Forgetting to collect duckplyr results
❌ Wrong:
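A sketch of the anti-pattern:

```r
library(duckplyr)

res <- as_duckdb_tibble(mtcars) |>
  filter(mpg > 20) |>
  summarise(n = n())

# res may still be a lazy query plan here; functions that expect a
# plain, materialized data frame can behave unexpectedly
```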
Why it is wrong: duckplyr uses lazy evaluation. The query has not executed yet. Some R functions expect a materialized data frame and will behave unexpectedly on a lazy query plan.
✅ Correct:
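A corrected sketch:

```r
library(duckplyr)

res <- as_duckdb_tibble(mtcars) |>
  filter(mpg > 20) |>
  summarise(n = n()) |>
  collect()   # forces execution; res is now a materialized tibble

res
```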
Practice Exercises
Exercise 1: Basic DuckDB aggregation
Connect to DuckDB, register the iris data frame, and write a SQL query that calculates the mean and standard deviation of Sepal.Length for each Species. Return results ordered by mean descending.
Click to reveal solution
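A solution sketch (the table name `iris_tbl` is our choice; note the double quotes around the dotted column name):

```r
library(DBI)
library(duckdb)
con <- dbConnect(duckdb())
duckdb_register(con, "iris_tbl", iris)

dbGetQuery(con, '
  SELECT Species,
         AVG("Sepal.Length")    AS mean_sepal_length,
         STDDEV("Sepal.Length") AS sd_sepal_length
  FROM iris_tbl
  GROUP BY Species
  ORDER BY mean_sepal_length DESC
')

dbDisconnect(con, shutdown = TRUE)
```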
Explanation: duckdb_register() exposes iris as a SQL table. Column names with dots need quoting in SQL. STDDEV() computes the standard deviation, and we ORDER BY the mean descending so virginica (largest sepals) appears first.
Exercise 2: Create and query a temporary table
Create a DuckDB connection, then use SQL to create a table called "students" with columns name (VARCHAR), grade (INTEGER), and score (DOUBLE). Insert 6 rows of your choice. Write a query that returns the average score per grade, but only for grades with more than 1 student.
Click to reveal solution
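A solution sketch (the six rows are an example set: grades 10 and 11 each have multiple students, grade 12 only has Eve):

```r
library(DBI)
library(duckdb)
con <- dbConnect(duckdb())

dbExecute(con, "CREATE TABLE students (name VARCHAR, grade INTEGER, score DOUBLE)")
dbExecute(con, "
  INSERT INTO students VALUES
    ('Alice', 10, 88.5), ('Bob',   10, 92.0),
    ('Carol', 11, 75.0), ('Dan',   11, 81.5),
    ('Frank', 11, 69.0), ('Eve',   12, 95.0)
")

# HAVING filters groups after aggregation
dbGetQuery(con, "
  SELECT grade, AVG(score) AS avg_score, COUNT(*) AS n
  FROM students
  GROUP BY grade
  HAVING COUNT(*) > 1
  ORDER BY grade
")

dbDisconnect(con, shutdown = TRUE)
```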
Explanation: HAVING filters groups after aggregation. Grade 12 has only 1 student (Eve), so it is excluded. Grades 10 and 11 each have 2+ students and pass the filter.
Exercise 3: Use duckplyr for grouped summarisation
Use duckplyr to convert the starwars dataset (from dplyr) to a DuckDB-backed tibble. Then find the tallest and heaviest character in each species that has at least 2 members. Remove rows with missing height or mass first.
Click to reveal solution
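A solution sketch. One caveat we assume here: starwars contains list columns (films, vehicles, starships) that a database table cannot hold, so we select the needed columns first:

```r
library(duckplyr)

starwars_duck <- dplyr::starwars |>
  dplyr::select(name, species, height, mass) |>  # drop list columns
  as_duckdb_tibble()

starwars_duck |>
  filter(!is.na(height), !is.na(mass)) |>
  group_by(species) |>
  summarise(
    n          = n(),
    max_height = max(height),
    max_mass   = max(mass)
  ) |>
  filter(n >= 2) |>
  collect()
```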
Explanation: The pipeline filters NAs, groups by species, computes max height and mass, then keeps only species with 2+ members. duckplyr routes all operations through DuckDB. The collect() call at the end materializes the result.
Exercise 4: Multi-step SQL pipeline
Write a single SQL query against the mtcars data frame that: (a) computes the average mpg, hp, and qsec for each combination of cyl and am (transmission type), (b) ranks each group by average mpg descending using a window function, and (c) returns only the top 3 groups by average mpg.
Click to reveal solution
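A solution sketch:

```r
library(DBI)
library(duckdb)
con <- dbConnect(duckdb())
duckdb_register(con, "mtcars_tbl", mtcars)

dbGetQuery(con, "
  SELECT cyl, am, avg_mpg, avg_hp, avg_qsec
  FROM (
    SELECT cyl, am,
           AVG(mpg)  AS avg_mpg,
           AVG(hp)   AS avg_hp,
           AVG(qsec) AS avg_qsec,
           ROW_NUMBER() OVER (ORDER BY AVG(mpg) DESC) AS mpg_rank
    FROM mtcars_tbl
    GROUP BY cyl, am
  )
  WHERE mpg_rank <= 3
  ORDER BY mpg_rank
")

dbDisconnect(con, shutdown = TRUE)
```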
Explanation: The inner query groups by cyl and am, then ROW_NUMBER() assigns a rank based on avg_mpg descending. The outer query filters to the top 3. Manual transmission 4-cylinder cars are the most fuel-efficient group at 28.1 mpg.
Putting It All Together
Let's build a complete analysis pipeline. We will create a realistic sales dataset, store it in DuckDB, query it with SQL, and then query the same data with duckplyr — demonstrating both interfaces on the same problem.
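A setup sketch; the columns and values of the `sales` table are invented for the example:

```r
library(DBI)
library(duckdb)

set.seed(42)
con <- dbConnect(duckdb())

# Simulated sales data: date, product, region, revenue
n <- 50000
sales <- data.frame(
  date    = sample(seq(as.Date("2025-01-01"), as.Date("2025-12-31"), by = "day"),
                   n, replace = TRUE),
  product = sample(c("Laptop", "Tablet", "Phone"), n, replace = TRUE),
  region  = sample(c("North", "South", "East", "West"), n, replace = TRUE),
  revenue = round(runif(n, 50, 2000), 2)
)

# Persist it as a real DuckDB table (connection stays open for the queries below)
dbWriteTable(con, "sales", sales)
```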
Now let's answer a business question with SQL: "Which product-region combination generated the most total revenue in Q4 2025?"
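A SQL sketch, assuming the `sales` table described above (columns date, product, region, revenue):

```r
dbGetQuery(con, "
  SELECT product, region, ROUND(SUM(revenue)) AS total_revenue
  FROM sales
  WHERE date >= DATE '2025-10-01' AND date <= DATE '2025-12-31'
  GROUP BY product, region
  ORDER BY total_revenue DESC
  LIMIT 1
")
```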
Let's answer the same question with duckplyr for comparison.
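A duckplyr sketch of the same question, again assuming the `sales` data frame from the setup step:

```r
library(duckplyr)

sales_duck <- as_duckdb_tibble(sales)

sales_duck |>
  filter(date >= as.Date("2025-10-01"), date <= as.Date("2025-12-31")) |>
  group_by(product, region) |>
  summarise(total_revenue = round(sum(revenue))) |>
  ungroup() |>
  arrange(desc(total_revenue)) |>
  collect() |>
  head(1)
```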
Both approaches produce identical results. The SQL version is more explicit and familiar to database users. The duckplyr version reads like standard R tidyverse code. Choose whichever fits your team and workflow.
Summary
| Function | Package | Purpose |
|---|---|---|
| dbConnect(duckdb()) | duckdb, DBI | Create a DuckDB connection (in-memory or file) |
| dbGetQuery(con, sql) | DBI | Run a SQL query and return results as a data frame |
| dbExecute(con, sql) | DBI | Run SQL that does not return results (CREATE, INSERT) |
| duckdb_register(con, name, df) | duckdb | Expose a data frame as a virtual table (zero-copy) |
| duckdb_unregister(con, name) | duckdb | Remove a virtual table registration |
| read_csv_auto(path) | DuckDB SQL | Query a CSV file directly in SQL |
| read_parquet(path) | DuckDB SQL | Query a Parquet file directly in SQL |
| as_duckdb_tibble(df) | duckplyr | Convert a data frame to a DuckDB-backed tibble |
| collect() | dplyr/duckplyr | Materialize a lazy duckplyr query into a tibble |
| dbDisconnect(con, shutdown) | DBI | Close connection and release resources |
Key takeaways:
- DuckDB is an in-process columnar database — no server required
- It queries data frames, CSV, and Parquet files directly using SQL
- duckplyr provides a dplyr-compatible interface backed by DuckDB
- DuckDB excels at analytical queries on large data (millions to billions of rows)
- For small in-memory data, data.table or dplyr may be faster due to lower overhead
- Always close your connection with dbDisconnect(con, shutdown = TRUE)
FAQ
Can DuckDB handle data larger than my RAM?
Yes. DuckDB can query files on disk (CSV, Parquet, JSON) without loading them into memory. It streams data through its columnar engine, processing chunks at a time. For extremely large datasets, use Parquet files — DuckDB can skip entire row groups using metadata, reading only what the query needs.
Is duckplyr a full replacement for dplyr?
Nearly. duckplyr supports most dplyr verbs and automatically falls back to standard dplyr for any operation DuckDB cannot translate. You can load it alongside dplyr without breaking existing code. Use methods_restore() to switch back to pure dplyr if needed.
How does DuckDB compare to SQLite?
SQLite is optimized for transactional workloads (OLTP) — inserting, updating, and reading individual rows. DuckDB is optimized for analytical workloads (OLAP) — aggregating, joining, and scanning large batches of rows. Use SQLite for application databases. Use DuckDB for data analysis.
Can I use DuckDB with Shiny apps?
Yes. DuckDB works well as a backend for Shiny apps that need fast analytical queries. Create the connection in server.R and query data on demand. Since DuckDB is in-process, there is no external database to manage in production.
Does DuckDB support joins across different data sources?
Yes. You can join a registered R data frame with a Parquet file on disk in a single SQL query. DuckDB treats all registered tables and file scans as part of the same database, regardless of where the data physically lives.
References
- DuckDB Foundation — R Client Documentation. Link
- Müller, K. and Raasveldt, M. — duckplyr: A DuckDB-Backed Version of dplyr. tidyverse.org (2024). Link
- Turner, S.D. — "DuckDB vs dplyr vs base R." Paired Ends (2024). Link
- Cooney, D. — "R Dplyr vs DuckDB." Appsilon (2024). Link
- DuckDB Foundation — "The Return of the H2O.ai Database-like Ops Benchmark." (2023). Link
- Posit Team — "duckplyr: dplyr Powered by DuckDB — A High-Level Overview." Posit Blog (2024). Link
- DuckDB Foundation — Official Documentation. Link
What's Next?
- Connect R to Any Database: DBI + RSQLite, RPostgres, and RMySQL — Learn the DBI foundation that DuckDB builds on. DBI provides the universal database interface that dbConnect(), dbGetQuery(), and parameterised queries all come from.
- dplyr group_by and summarise — Master the dplyr aggregation verbs that duckplyr accelerates. Understanding group_by() and summarise() will make your DuckDB pipelines more effective.
- Importing Data in R — Learn all the ways to get data into R, from CSV and Excel to databases and APIs. DuckDB complements these methods for large-file workflows.