arrow open_dataset() in R: Query Multi-File Datasets
The arrow open_dataset() function points R at a folder of Parquet or CSV files and treats them as one queryable table. It does not load the data, so you can filter and summarise datasets far larger than your computer's memory.
open_dataset("data/") # open a Parquet folder
open_dataset("data/", format = "csv") # open a CSV folder
open_dataset("data/", partitioning = "year") # Hive-partitioned folder
open_dataset(c("a.parquet", "b.parquet")) # open an explicit file list
open_dataset("data/") |> filter(x > 0) |> collect() # query then pull to R
open_dataset("data/") |> nrow() # count rows, no load
Need explanation? Read on for examples and pitfalls.
What open_dataset() does
open_dataset() builds a lazy view over many files at once. You give it a directory, and it scans the file layout, reads each file's metadata, and returns a Dataset object. No rows are pulled into R. The object simply knows where the data lives and what columns it has.
That laziness is the whole point. A Dataset can span hundreds of Parquet files and hundreds of gigabytes, yet open_dataset() returns in a fraction of a second. You then attach dplyr verbs, and the Apache Arrow engine pushes that work down to the files, reading only the columns and row groups your query touches.
open_dataset() gives you a promise to read files later. Nothing is materialised until you call collect(), so the expensive work happens once, on exactly the rows you asked for.
Syntax and key arguments
The first argument is a path, and format tells Arrow how to parse it. Everything else has a sensible default, so most calls are one or two arguments long.
The format argument defaults to "parquet", so reading a CSV folder needs format = "csv" explicitly. The partitioning argument reads Hive-style folder names such as year=2024/ and turns them back into a column. Set unify_schemas = TRUE when files were written at different times and their columns no longer line up.
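As a sketch of these arguments in use (the folder paths here are hypothetical, chosen only to illustrate each argument):

```r
library(arrow)

ds_csv   <- open_dataset("logs/", format = "csv")          # non-default format must be named
ds_part  <- open_dataset("sales/", partitioning = "year")  # folder names become a year column
ds_mixed <- open_dataset("archive/", unify_schemas = TRUE) # reconcile columns across files
```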
open_dataset() examples
These examples build a real dataset folder, then query it. Each block runs in order and shares state, so variables created early stay available.
First, write a partitioned dataset so there is something multi-file to open.
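One way to create such a folder is to partition the built-in mtcars data frame by cyl; the folder name cars_ds is an assumption used throughout these sketches:

```r
library(arrow)

# Write mtcars as a Parquet dataset partitioned by cylinder count.
# cyl has three distinct values (4, 6, 8), so this produces three files.
write_dataset(mtcars, "cars_ds", partitioning = "cyl")

list.files("cars_ds", recursive = TRUE)  # cyl=4/..., cyl=6/..., cyl=8/...
```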
Now open that folder. The printed summary confirms the file count and schema without reading a single row.
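Assuming a partitioned folder such as the hypothetical cars_ds, opening it might look like:

```r
library(arrow)

cars <- open_dataset("cars_ds")
cars  # prints the file format, columns, and types without reading any rows
```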
Attach dplyr verbs and finish with collect() to pull the result into a tibble. Arrow runs the filter and column selection across all three files for you.
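A query along these lines would do it (column names are from mtcars; cars_ds is the hypothetical folder of mtcars partitioned by cyl):

```r
library(arrow)
library(dplyr)

open_dataset("cars_ds") |>
  filter(mpg > 25) |>       # pushed down to the Parquet files
  select(mpg, hp, cyl) |>   # only these columns are read from disk
  collect()                 # materialise the result as a tibble
```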
For a folder of CSV files, pass format = "csv". The query interface is identical once the dataset is open.
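A sketch with a CSV folder, again using mtcars and a hypothetical folder name:

```r
library(arrow)
library(dplyr)

# Write a CSV version of the same data (cars_csv is an assumed folder name).
write_dataset(mtcars, "cars_csv", format = "csv", partitioning = "cyl")

open_dataset("cars_csv", format = "csv") |>
  filter(hp > 150) |>
  collect()
```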
nrow(cars) and schema(cars) answer structural questions instantly because they read metadata, not data. Use them to size a query before you commit to pulling rows.
open_dataset() vs read_parquet() and alternatives
Reach for open_dataset() when the data spans many files or exceeds memory. For a single file that fits in RAM, the simpler readers are a better fit.
| Function | Reads | Loads into memory | Best for |
|---|---|---|---|
| open_dataset() | A folder of many files | No, lazy until collect() | Larger-than-memory, partitioned data |
| read_parquet() | One Parquet file | Yes, immediately | A single file that fits in RAM |
| read_csv() | One CSV file | Yes, immediately | Small to medium text files |
| arrow_table() | An in-memory object | Already in memory | Wrapping data you already have |
The decision rule is simple. If you can name one file and it fits in memory, use read_parquet(). If you have a directory, partitioned data, or a dataset bigger than RAM, use open_dataset() and let Arrow stream it.
Common pitfalls
Three mistakes account for most open_dataset() confusion. Each has a one-line fix.
Forgetting collect() is the most common. A pipeline without it returns an unevaluated query, not a data frame.
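A sketch of the symptom and the fix, assuming the hypothetical cars_ds folder from the examples:

```r
library(arrow)
library(dplyr)

q <- open_dataset("cars_ds") |> filter(mpg > 25)
class(q)    # an arrow_dplyr_query, not a data frame
collect(q)  # adding collect() returns a tibble
```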
Opening a CSV folder without format = "csv" makes Arrow try to parse text as Parquet and fail. Always set format for anything that is not Parquet.
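Side by side, assuming a hypothetical cars_csv folder of CSV files:

```r
library(arrow)

# Wrong: Arrow assumes Parquet and fails to parse the CSV bytes
# open_dataset("cars_csv")

# Right: name the format explicitly
open_dataset("cars_csv", format = "csv")
```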
When the files in a folder have mismatched schemas, open_dataset() may error or drop rows on collect(). Pass unify_schemas = TRUE so Arrow reconciles the columns instead of assuming every file matches the first.
Try it yourself
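As a sketch (the folder name is hypothetical, standing in for files written at different times):

```r
library(arrow)

# The dataset schema becomes the union of all file schemas;
# columns missing from a given file come back as NA.
ds <- open_dataset("archive/", unify_schemas = TRUE)
```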
Try it: Open the cars dataset from the examples, keep only 6-cylinder cars, and compute their mean mpg. Save the number to ex_mpg.
Click to reveal solution
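One solution, assuming the dataset was written to a hypothetical cars_ds folder of mtcars partitioned by cyl:

```r
library(arrow)
library(dplyr)

ex_mpg <- open_dataset("cars_ds") |>
  filter(cyl == 6) |>
  summarise(mean_mpg = mean(mpg)) |>
  collect() |>
  pull(mean_mpg)

ex_mpg
```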
Explanation: The filter and summarise run inside the Arrow engine, and collect() brings back the single summary row. pull() then extracts the number from the one-row tibble.
Related arrow functions
These functions pair with open_dataset() in a typical Parquet workflow.
- read_parquet() reads a single Parquet file straight into a data frame.
- write_parquet() saves one data frame to a single Parquet file.
- read_feather() reads an Arrow Feather or IPC file.
- write_dataset() writes a data frame back out as a partitioned folder.
- collect() materialises a lazy Dataset query into an R tibble.
For a fuller tour of the package, see the Apache Arrow in R guide and the official arrow package documentation.
FAQ
Does open_dataset() load the whole dataset into memory?
No. open_dataset() only reads file metadata and returns a lazy Dataset object. Data is read when you call collect(), and even then Arrow reads only the columns and rows your query needs. This is what lets a single laptop query datasets that are far larger than its RAM, because the full table never has to exist in memory at once.
What file formats does open_dataset() support?
It supports Parquet, Arrow IPC and Feather, and delimited text formats including CSV and TSV. Parquet is the default, so other formats need an explicit format argument, such as format = "csv". Parquet is usually the best choice because its columnar layout lets Arrow skip unneeded columns and row groups during a query.
How is open_dataset() different from read_parquet()?
read_parquet() reads exactly one file and loads it fully into memory right away. open_dataset() points at a directory of many files and stays lazy until collect(). Use read_parquet() for a single file that fits in RAM, and open_dataset() for a folder, a partitioned layout, or any dataset too big to load at once.
Can open_dataset() read partitioned folders?
Yes. When folders are named in Hive style, such as year=2024/month=01/, open_dataset() reads those names and turns them back into columns you can filter on. The default partitioning = hive_partition() handles this automatically. Filtering on a partition column lets Arrow skip whole folders, which is the single biggest speedup on large datasets.