dbplyr in R: Write dplyr Code That Runs on Any SQL Database
dbplyr is a dplyr backend that translates your filter(), mutate(), and summarise() calls into SQL and runs them on a database — so you query millions of rows without loading them into R or writing a single line of SQL.
Introduction
You already know dplyr for in-memory data frames. You know DBI for connecting to databases. dbplyr is the bridge between the two: it lets you point dplyr at a database table and write the same verbs you already use. Behind the scenes, dbplyr translates your R code into SQL, sends it to the database, and brings back only the results.
The key idea is laziness. When you call filter() or summarise() on a database table through dbplyr, nothing happens immediately. dbplyr builds up a SQL query. Only when you call collect() does the query actually run and the results enter R memory. This means you can chain ten dplyr verbs together, and the database executes them as a single optimised SQL statement.
All code in this tutorial uses RSQLite, an in-process SQLite engine that needs no server or setup, so every block is self-contained and runnable as-is.
How does dbplyr connect dplyr to a database?
The entry point is tbl(). You give it a DBI connection and a table name, and it returns a lazy table reference — an object that looks like a data frame but lives on the database.
Let's set up an in-memory SQLite database, copy mtcars into it, and create a lazy reference.
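A minimal setup sketch. The object names (con, mtcars_db) are the ones the rest of this tutorial assumes; iris is copied in as well because a later exercise uses it.

```r
# dplyr supplies the verbs, dbplyr the SQL translation,
# DBI + RSQLite the connection itself
library(dplyr)
library(dbplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Copy built-in data frames into the database as (temporary) tables
copy_to(con, mtcars)
copy_to(con, iris)

# A lazy reference: looks like a data frame, lives on the database
mtcars_db <- tbl(con, "mtcars")
mtcars_db
```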
Notice the header says Source: table<mtcars> and Database: sqlite. This is not an R data frame — it is a promise to query the database when you need the data. The [?? x 11] means dbplyr does not even know how many rows the table has, because it has not counted them yet.
**tbl() never loads data into R.** It creates a lightweight reference. Even printing it only fetches the first few rows for display. The full table stays on the database until you call collect().

Try it: Create a lazy reference to the iris table (already copied above) and print it. How many columns does it show?
Click to reveal solution
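One possible solution, written self-contained so it runs on a fresh session (iris_db is an illustrative name):

```r
library(dplyr)
library(dbplyr)

# Setup as in the introduction
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, iris)

iris_db <- tbl(con, "iris")
iris_db   # header reports [?? x 5]: five columns, row count unknown
```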
Explanation: tbl(con, "iris") creates the same kind of lazy reference. The 5 columns match the iris dataset: 4 numeric measurements and Species.
How does dbplyr translate dplyr verbs into SQL?
The magic of dbplyr is that your dplyr code becomes SQL. You can see the generated SQL at any time with show_query(). This is invaluable for debugging and for learning SQL by seeing what your dplyr code compiles to.
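For example, a filter-and-select pipeline (query1 is the lazy reference the next paragraph describes):

```r
library(dplyr)
library(dbplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
mtcars_db <- copy_to(con, mtcars)

query1 <- mtcars_db |>
  filter(mpg > 25) |>
  select(mpg, cyl, hp, wt)

# Nothing has run yet; inspect the SQL dbplyr would send
show_query(query1)
```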
dbplyr translated filter(mpg > 25) into WHERE (mpg > 25.0) and select(mpg, cyl, hp, wt) into the SELECT clause. The pipeline has not executed — query1 is still a lazy reference.
Now try a grouped aggregation:
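A sketch of a grouped aggregation (na.rm = TRUE silences dbplyr's reminder that SQL aggregates always drop missing values):

```r
library(dplyr)
library(dbplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
mtcars_db <- copy_to(con, mtcars)

agg <- mtcars_db |>
  group_by(cyl) |>
  summarise(mean_mpg = mean(mpg, na.rm = TRUE), n = n())

show_query(agg)
```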
group_by(cyl) became GROUP BY cyl, mean(mpg) became AVG(mpg), and n() became COUNT(*). dbplyr knows how to map R aggregation functions to their SQL equivalents.
**Use show_query() to learn SQL.** If you are new to SQL, write your analysis in dplyr and call show_query() to see what the equivalent SQL looks like. It is one of the fastest ways to pick up SQL syntax.

Try it: Write a pipeline on mtcars_db that filters to am == 1 (manual transmission), groups by gear, and counts rows. Call show_query() on it.
Click to reveal solution
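A solution sketch, self-contained so it runs on a fresh session:

```r
library(dplyr)
library(dbplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
mtcars_db <- copy_to(con, mtcars)

manual_by_gear <- mtcars_db |>
  filter(am == 1) |>      # manual transmission only
  group_by(gear) |>
  summarise(n = n())

show_query(manual_by_gear)
```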
Explanation: dbplyr maps filter(am == 1) to WHERE, group_by(gear) to GROUP BY, and n() to COUNT(*).
When does dbplyr actually execute the query?
dbplyr is lazy by design. Nothing runs until you force it. There are three ways to force execution:
- **collect()** — pulls all results into an R data frame
- **Printing** — fetches the first few rows for display
- **compute()** — executes the query but stores the result as a temporary table on the database (not in R)
The most common pattern is: build a lazy pipeline, verify with show_query(), then collect() the final result.
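A sketch of that pattern (result is the tibble the next paragraph refers to):

```r
library(dplyr)
library(dbplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
mtcars_db <- copy_to(con, mtcars)

result <- mtcars_db |>
  group_by(cyl) |>
  summarise(mean_mpg = mean(mpg, na.rm = TRUE)) |>
  collect()           # the query runs here, on the database

result          # a regular 3-row tibble in R memory
class(result)
```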
After collect(), result is a normal tibble in R memory. You can plot it, pass it to model functions, or save it to a file. The database connection is no longer needed for this data.
**collect() on a huge table pulls everything into RAM.** Always filter and aggregate on the database first, then collect the small result. A pattern like tbl(con, "big_table") |> collect() |> filter(...) defeats the entire purpose of dbplyr.

Try it: Build a lazy pipeline on mtcars_db that selects only mpg and cyl, filters to cyl == 4, and collects the result. How many rows come back?
Click to reveal solution
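A self-contained solution sketch:

```r
library(dplyr)
library(dbplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
mtcars_db <- copy_to(con, mtcars)

res <- mtcars_db |>
  select(mpg, cyl) |>
  filter(cyl == 4) |>   # the database does the filtering
  collect()

nrow(res)  # 11
```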
Explanation: select() and filter() run on the database. collect() pulls only the 11 matching rows into R. The order of select() and filter() does not matter — dbplyr optimises the SQL either way.
Which dplyr verbs does dbplyr support?
dbplyr translates most core dplyr verbs: filter(), select(), mutate(), summarise(), group_by(), arrange(), rename(), distinct(), slice_head(), slice_min(), slice_max(), and all join types (inner_join(), left_join(), right_join(), full_join(), semi_join(), anti_join()).
Here is a join across two database tables:
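A sketch of such a join. The cyl_info lookup table and its engine_type labels are illustrative choices; the Try-it below reuses the cyl_info_db reference.

```r
library(dplyr)
library(dbplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
mtcars_db <- copy_to(con, mtcars)

# Small lookup table copied to the same database (labels are illustrative)
cyl_info_db <- copy_to(con, data.frame(
  cyl = c(4, 6, 8),
  engine_type = c("small", "medium", "large")
), name = "cyl_info")

by_engine <- mtcars_db |>
  inner_join(cyl_info_db, by = "cyl") |>
  group_by(engine_type) |>
  summarise(mean_mpg = mean(mpg, na.rm = TRUE)) |>
  collect()

by_engine   # one row per engine_type
```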
The join, grouping, and aggregation all happen on the database in one SQL statement. Only the 3-row summary enters R.
**If a function has no SQL translation inside mutate(), dbplyr will error rather than silently give wrong results.** Stick to arithmetic, base R math functions, and dplyr helpers. Use show_query() to verify translation before collect().

Try it: Join mtcars_db with cyl_info_db, then compute the max hp per engine_type. Collect the result.
Click to reveal solution
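A self-contained solution sketch (it recreates the illustrative cyl_info lookup table from the join example above):

```r
library(dplyr)
library(dbplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
mtcars_db <- copy_to(con, mtcars)
cyl_info_db <- copy_to(con, data.frame(
  cyl = c(4, 6, 8),
  engine_type = c("small", "medium", "large")
), name = "cyl_info")

hp_summary <- mtcars_db |>
  inner_join(cyl_info_db, by = "cyl") |>
  group_by(engine_type) |>
  summarise(max_hp = max(hp, na.rm = TRUE)) |>
  collect()

hp_summary
```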
Explanation: inner_join() + group_by() + summarise(max(...)) all translate to a single SQL INNER JOIN ... GROUP BY ... MAX(...) statement.
How do you mix dbplyr with raw SQL?
Sometimes you need SQL that dbplyr cannot generate — complex window frames, CTEs, recursive queries, or database-specific syntax. You can pass raw SQL to tbl() using the sql() helper.
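A sketch of the pattern, using a HAVING clause (the object name and the > 16 threshold are illustrative):

```r
library(dplyr)
library(dbplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)

# Hand-written SQL wrapped as a lazy dbplyr reference
efficient_groups <- tbl(con, sql("
  SELECT cyl, AVG(mpg) AS mean_mpg, COUNT(*) AS n
  FROM mtcars
  GROUP BY cyl
  HAVING AVG(mpg) > 16
"))

efficient_groups |> collect()   # cyl 8 (avg mpg 15.1) is filtered out
```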
The HAVING clause filters groups after aggregation — a clause with no dedicated dplyr verb. By wrapping raw SQL in tbl(con, sql(...)), the result is still a lazy dbplyr reference that you can chain more dplyr verbs onto.
**You can chain dplyr verbs onto raw SQL.** tbl(con, sql("SELECT ...")) |> filter(...) |> collect() works: dbplyr wraps your SQL as a subquery and adds the filter on top. This lets you combine hand-tuned SQL with dplyr convenience.

Try it: Write a raw SQL query inside tbl(con, sql(...)) that selects all columns from mtcars where wt > 4. Collect the result.
Click to reveal solution
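A self-contained solution sketch:

```r
library(dplyr)
library(dbplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)

heavy <- tbl(con, sql("SELECT * FROM mtcars WHERE wt > 4")) |>
  collect()

nrow(heavy)  # 4
```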
Explanation: sql() passes the string directly to the database. tbl() wraps it as a lazy reference, and collect() pulls the 4 matching rows into R.
Common Mistakes and How to Fix Them
Mistake 1: Collecting the entire table before filtering
❌ Wrong:
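A sketch of the anti-pattern (object names are illustrative; mtcars stands in for a huge table):

```r
library(dplyr)
library(dbplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)

# ❌ collect() first: every row crosses the wire before filter() runs
bad <- tbl(con, "mtcars") |>
  collect() |>        # whole table materialised in R memory...
  filter(mpg > 25)    # ...filtering happens locally, too late
```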
Why it is wrong: collect() materialises the entire table into R memory before filter() runs. If the table has 100 million rows, R will crash or crawl.
✅ Correct:
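The same pipeline with the order fixed:

```r
library(dplyr)
library(dbplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)

# ✅ filter() first: the database does the work, only matches come back
good <- tbl(con, "mtcars") |>
  filter(mpg > 25) |>
  collect()
```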
Mistake 2: Using R functions that dbplyr cannot translate
❌ Wrong:
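A sketch of the failure, using an illustrative custom helper that no database knows about (wrapped in try() so the block runs to completion):

```r
library(dplyr)
library(dbplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
mtcars_db <- copy_to(con, mtcars)

# zscore() is plain R with no SQL translation (illustrative helper)
zscore <- function(x) (x - mean(x)) / sd(x)

# ❌ dbplyr passes zscore() through to SQLite, which has no such
# function, so the query errors when it actually runs at collect()
bad <- try(
  mtcars_db |>
    mutate(z = zscore(mpg)) |>
    collect(),
  silent = TRUE
)
```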
Why it is wrong: Not all R functions have SQL equivalents. ifelse() translates on some backends but not all. Check with show_query() before collecting.
✅ Correct:
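One way to fix it: shrink the data on the database, then apply the R-only function after collect():

```r
library(dplyr)
library(dbplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
mtcars_db <- copy_to(con, mtcars)

zscore <- function(x) (x - mean(x)) / sd(x)

# ✅ reduce on the database, collect, then use any R function locally
good <- mtcars_db |>
  select(mpg) |>
  collect() |>
  mutate(z = zscore(mpg))
```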
Mistake 3: Forgetting to disconnect
❌ Wrong:
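A sketch of the mistake (tempfile() stands in for a real database file):

```r
library(dplyr)
library(dbplyr)

db_path <- tempfile(fileext = ".sqlite")   # stand-in for a real db file
con2 <- DBI::dbConnect(RSQLite::SQLite(), db_path)
copy_to(con2, mtcars)

results <- tbl(con2, "mtcars") |> filter(mpg > 25) |> collect()
# ❌ script ends here: con2 is never closed, so the file stays locked
```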
Why it is wrong: Open connections consume resources and may lock database files. SQLite in particular keeps file locks until disconnection.
✅ Correct:
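The same script with the connection closed when the work is done:

```r
library(dplyr)
library(dbplyr)

db_path <- tempfile(fileext = ".sqlite")
con2 <- DBI::dbConnect(RSQLite::SQLite(), db_path)
copy_to(con2, mtcars)

results <- tbl(con2, "mtcars") |> filter(mpg > 25) |> collect()

DBI::dbDisconnect(con2)   # ✅ release the connection and any file locks
# Inside a function, on.exit(DBI::dbDisconnect(con2), add = TRUE)
# placed right after connecting guarantees cleanup even on error
```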
Practice Exercises
Exercise 1: End-to-end dbplyr pipeline
Connect to a new in-memory SQLite database, copy the airquality dataset into it, create a lazy reference, filter to rows where Month == 7, group by Day, compute mean_temp = mean(Temp), show the generated SQL, and collect the result.
Click to reveal solution
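A solution sketch (con_aq and july are illustrative names):

```r
library(dplyr)
library(dbplyr)

con_aq <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con_aq, airquality)

july <- tbl(con_aq, "airquality") |>
  filter(Month == 7) |>
  group_by(Day) |>
  summarise(mean_temp = mean(Temp, na.rm = TRUE))

show_query(july)          # verify the SQL before running it
result <- collect(july)
nrow(result)              # 31: one row per July day

DBI::dbDisconnect(con_aq)
```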
Explanation: The pipeline stays lazy until collect(). The database filters to July, groups by day, and computes the average temperature — all in one SQL statement.
Exercise 2: Join and compare SQL
Using the original con connection (reconnect if needed), create a table am_labels with columns am (0, 1) and trans_type ("automatic", "manual"). Join it with mtcars on the database, compute mean mpg per trans_type, show the SQL, and collect. Compare the generated SQL to what you would write by hand.
Click to reveal solution
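A self-contained solution sketch (it sets up its own connection rather than reusing an earlier one):

```r
library(dplyr)
library(dbplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
mtcars_db <- copy_to(con, mtcars)

am_labels_db <- copy_to(con, data.frame(
  am = c(0, 1),
  trans_type = c("automatic", "manual")
), name = "am_labels")

summary_q <- mtcars_db |>
  inner_join(am_labels_db, by = "am") |>
  group_by(trans_type) |>
  summarise(mean_mpg = mean(mpg, na.rm = TRUE))

show_query(summary_q)   # compare to the SQL you would write by hand
trans_summary <- collect(summary_q)
trans_summary           # automatic ~17.1 mpg, manual ~24.4 mpg
```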
Explanation: The join, grouping, and aggregation all compile to a single SQL query. Manual-transmission cars average 7 mpg more than automatics.
Summary
| Concept | How it works |
|---|---|
| tbl(con, "table") | Creates a lazy reference — no data loaded |
| show_query() | Reveals the SQL that dbplyr generated |
| collect() | Executes the query and pulls results into R |
| compute() | Executes the query and stores results as a temp table on the database |
| Joins | inner_join(), left_join(), etc. translate to SQL JOINs |
| Raw SQL | tbl(con, sql("SELECT ...")) passes SQL directly |
| Fallback | Functions dbplyr cannot translate raise an error |
| Fallback | Functions dbplyr cannot translate raise an error |
Key rule: filter and aggregate on the database, collect only the small result.
FAQ
Does dbplyr work with PostgreSQL, MySQL, and other databases?
Yes. dbplyr works with any database that has a DBI-compatible driver. The SQL it generates adapts to the backend — SQLite, PostgreSQL, MySQL, SQL Server, BigQuery, and others each get dialect-appropriate SQL. Install the relevant driver package (RPostgres, RMySQL, odbc) and tbl() works the same way.
How do I know which functions dbplyr can translate?
Call show_query() on your pipeline before collect(). If dbplyr cannot translate a function, it raises an error immediately — it will not silently give wrong results. The dbplyr documentation lists all supported translations at dbplyr.tidyverse.org/articles/translation-function.html.
Should I use dbplyr or write SQL directly?
Use dbplyr when your analysis fits naturally into dplyr verbs — filter, group, summarise, join. Switch to raw SQL (via tbl(con, sql(...))) for window functions, CTEs, recursive queries, or database-specific syntax. You can mix both in the same script.
References
- dbplyr documentation — official tidyverse site. dbplyr.tidyverse.org
- dbplyr function translation reference. dbplyr.tidyverse.org/articles/translation-function.html
- Wickham, H. — R for Data Science, 2nd ed., Ch 21 Databases. r4ds.hadley.nz/databases.html
- DBI package documentation. dbi.r-dbi.org
- RSQLite package. rsqlite.r-dbi.org
What's Next?
- Connect R to Any Database: DBI — The parent tutorial covering dbConnect(), dbGetQuery(), parameterised queries, and the DBI interface that dbplyr builds on.
- DuckDB in R — An alternative analytical engine that also speaks dplyr (via duckplyr) and excels at querying Parquet and CSV files directly.