Apache Arrow in R: Read Parquet Files & Run Fast In-Memory Analytics
The arrow package brings the Apache Arrow columnar analytics engine to R. It reads Parquet files in milliseconds, queries datasets that don't fit in memory, and shares data with Python without copying, turning slow CSV pipelines into snappy, type-safe workflows.
If you've ever waited two minutes for a CSV to load, only to discover that R guessed the wrong column types, this tutorial is for you. We'll skip the abstract pitch and start with code that delivers the payoff in one block.
What does the arrow package actually do for R?
The fastest way to feel why Arrow matters is to write a small data frame to Parquet and read it back. Watch what happens to the file size, the column types, and the time it takes, all in one block. We'll generate a 50,000-row tibble, save it as both Parquet and CSV, and compare.
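The block below is a minimal sketch of that experiment. The object names arrow_demo, parquet_path, and csv_path are reused throughout the tutorial; the seed and the value distributions are arbitrary choices for illustration.

```r
library(arrow)
library(readr)
library(tibble)

set.seed(42)  # arbitrary seed so the numbers are reproducible

# 50,000 rows: integer id, numeric value, four-level category, real Date
arrow_demo <- tibble(
  id       = 1:50000,
  value    = rnorm(50000, mean = 100, sd = 15),
  category = sample(c("A", "B", "C", "D"), 50000, replace = TRUE),
  created  = as.Date("2024-01-01") + sample(0:364, 50000, replace = TRUE)
)

parquet_path <- tempfile(fileext = ".parquet")
csv_path     <- tempfile(fileext = ".csv")

write_parquet(arrow_demo, parquet_path)
write_csv(arrow_demo, csv_path)

# File sizes on disk (bytes)
c(parquet = file.size(parquet_path), csv = file.size(csv_path))

# Read times
system.time(read_parquet(parquet_path))
system.time(read_csv(csv_path, show_col_types = FALSE))

# Types survive the round trip: created is still a Date
demo_back <- read_parquet(parquet_path)
str(demo_back)
```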
Three things just happened. Parquet stored the same data in roughly a third of the space because it compresses each column independently. The read_parquet() call returned a tibble in one line, no col_types argument, no parsing warnings. And the created column came back as a real Date, not a character string that you'd have to fix with as.Date() afterwards.
The arrow package wraps a complete columnar in-memory engine written in C++. Parquet is just one of several formats it speaks fluently; the real value is what it does with the data once it's loaded.

Try it: Write arrow_demo to a second Parquet file using snappy compression and report the file size.
Click to reveal solution
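One possible solution (snappy_path is just a throwaway name):

```r
snappy_path <- tempfile(fileext = ".parquet")

# Snappy is already the default codec, so this matches the earlier file size
write_parquet(arrow_demo, snappy_path, compression = "snappy")
file.size(snappy_path)
```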
Explanation: Snappy is the default codec, so the file size matches parquet_path almost exactly. Try compression = "zstd" if you want smaller files at a small CPU cost.
How do you read Parquet files in R?
The basic call is one line: read_parquet(path). The interesting argument is col_select, which tells Arrow to read only the columns you ask for. For wide tables (think 200-column survey exports), this can turn a 30-second read into a 1-second read because the unread columns never leave disk.
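A sketch of such a column-pruned read, reusing parquet_path from the block above:

```r
# Only id and value are decoded; category and created never leave disk
slim <- read_parquet(parquet_path, col_select = c(id, value))
names(slim)
dim(slim)
```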
Notice that category and created never entered R's memory. The Parquet file's footer told Arrow exactly where the id and value columns lived on disk, and only those byte ranges were read. CSV cannot do this; it has to parse every comma to find column boundaries.
Type preservation is the other quiet win. Watch what happens to a tibble with four different column types:
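For instance (the column names and values here are made up for the demonstration):

```r
typed <- tibble::tibble(
  count = 1:5,                                          # integer
  ratio = c(0.1, 0.25, 0.5, 0.75, 0.99),                # double
  day   = as.Date("2024-06-01") + 0:4,                  # Date
  grade = factor(c("low", "mid", "high", "mid", "low"),
                 levels = c("low", "mid", "high"))      # factor with explicit levels
)

typed_path <- tempfile(fileext = ".parquet")
arrow::write_parquet(typed, typed_path)

typed_back <- arrow::read_parquet(typed_path)
str(typed_back)               # integer, double, Date, factor: all intact
levels(typed_back$grade)
```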
Every type round-tripped: integer stayed integer (not coerced to double), the date stayed a Date, and the factor kept its levels in the original order. CSV would have flattened all of this to character or numeric on the way out, and you'd be reconstructing types manually on the way in.
starts_with("date_"), where(is.numeric), or any other tidyselect expression to col_select, not just bare column names. This makes column-level filtering as expressive as dplyr::select().Try it: Read only the created column from parquet_path and confirm it comes back as a Date.
Click to reveal solution
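One possible solution:

```r
created_only <- read_parquet(parquet_path, col_select = "created")
class(created_only$created)   # "Date"
```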
Explanation: Passing a single column name as a string works just as well as the bare-name form. The result is a one-column tibble with the date type preserved.
Why is Parquet so much faster than CSV?
Three reasons stack up. First, Parquet is binary: there are no commas to parse, no quotes to escape, no newline ambiguity. Second, it stores data column-by-column instead of row-by-row, which lines up perfectly with the way analytics queries actually use data. Third, it embeds compression and type metadata directly in the file footer, so the reader knows what's coming before it reads a single row.
| Feature | CSV | Parquet |
|---|---|---|
| Storage layout | Row-by-row text | Column-by-column binary |
| Column types | Guessed on read | Stored in file metadata |
| Compression | None (or gzip whole file) | Per-column (snappy, zstd, gzip) |
| Read subset of columns | Must parse entire file | Reads only requested columns |
| Missing values | The string "NA" | Native null encoding |
| Cross-language | Readable everywhere, but slow to parse | Zero-copy interop with R, Python, Spark, DuckDB, Java |
Numbers beat tables. Let's actually time a write-and-read round trip on a 100,000-row dataset:
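A sketch of that benchmark; the object names (bench_df, parquet_read, csv_read, and so on) are arbitrary:

```r
library(arrow)
library(readr)

set.seed(1)
bench_df <- tibble::tibble(
  id    = 1:100000,
  value = rnorm(100000),
  group = sample(letters[1:10], 100000, replace = TRUE)
)

bench_parquet <- tempfile(fileext = ".parquet")
bench_csv     <- tempfile(fileext = ".csv")

# Time the writes
parquet_write <- system.time(write_parquet(bench_df, bench_parquet))
csv_write     <- system.time(write_csv(bench_df, bench_csv))

# Time the reads
parquet_read <- system.time(read_parquet(bench_parquet))
csv_read     <- system.time(read_csv(bench_csv, show_col_types = FALSE))

rbind(
  parquet = c(write = unname(parquet_write["elapsed"]),
              read  = unname(parquet_read["elapsed"])),
  csv     = c(write = unname(csv_write["elapsed"]),
              read  = unname(csv_read["elapsed"]))
)
```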
The exact numbers will vary by machine, but the ratio is what matters. A 10-15x read speedup is typical even for a modest 100k-row file. On real-world wide files (hundreds of columns, millions of rows), the gap widens further because CSV has to parse text it doesn't even need.
Parquet's one real trade-off is that the files are binary: you can't head or cat them at the shell. Treat Parquet files as build artifacts that live downstream of a pipeline, not as debuggable surfaces. Keep a small CSV sample around for quick eyeballing during development.

Try it: Print only the read-time speedup ratio (Parquet vs CSV) to one decimal place.
Click to reveal solution
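One possible solution, reusing the timing objects from the benchmark above:

```r
speedup <- csv_read["elapsed"] / parquet_read["elapsed"]
round(unname(speedup), 1)
```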
Explanation: system.time() returns a named vector; ["elapsed"] is the wall-clock seconds. Dividing the CSV time by the Parquet time gives the ratio, which we round to one decimal place for readability.
How do you query datasets larger than RAM with open_dataset()?
This is the feature that justifies installing Arrow even on small datasets. open_dataset() does not read data into memory; it returns a lazy reference to a file or a directory of files. You then write dplyr verbs against that reference, and Arrow's C++ engine executes the whole pipeline on disk when you call collect(). Only the result, typically a small summary, ever enters R's memory.
To make this concrete, let's split our arrow_demo tibble into four Parquet files (simulating a partitioned dataset) and run a lazy query against the directory:
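A sketch of that setup; the per-file names and the value > 100 filter are arbitrary choices for the demonstration:

```r
library(arrow)
library(dplyr)

# Write one Parquet file per category into a shared directory
ds_dir <- file.path(tempdir(), "arrow_demo_split")
dir.create(ds_dir, showWarnings = FALSE)

for (lev in unique(arrow_demo$category)) {
  write_parquet(
    filter(arrow_demo, category == lev),
    file.path(ds_dir, paste0("part-", lev, ".parquet"))
  )
}

# Lazy reference: nothing is read into memory yet
lazy_ds <- open_dataset(ds_dir)
lazy_ds

# The pipeline only runs when collect() is called
lazy_ds |>
  filter(value > 100) |>
  group_by(category) |>
  summarise(mean_value = mean(value), n = n()) |>
  collect()
```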
The lazy_ds object holds zero rows in R memory; it's a pointer plus a schema. Every dplyr verb you chain against it just records the operation. The work happens at collect(), when Arrow scans the four Parquet files in C++, applies the filter and aggregation, and returns a 4-row tibble. The original 50,000 rows never sit in R memory at the same time.
The longer you keep the pipeline lazy before calling collect(), the less RAM you need.

Try it: Run a group_by(category) |> summarise(mean_value = mean(value)) against lazy_ds (without the filter step) and collect.
Click to reveal solution
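One possible solution:

```r
lazy_ds |>
  group_by(category) |>
  summarise(mean_value = mean(value)) |>
  collect()
```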
Explanation: The full pipeline runs in Arrow's C++ engine. R never holds 50,000 rows at once, only the 4-row aggregated result.
How do you write Parquet files (with partitioning)?
write_parquet() produces a single file. write_dataset() produces a directory of files, optionally partitioned by one or more columns. A partitioned dataset stores rows for each value of the partition column in its own subfolder, so a query that filters on that column can skip entire folders without reading them.
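A sketch of that write, reusing arrow_demo and the part_dir2 name used below:

```r
library(arrow)

part_dir2 <- file.path(tempdir(), "arrow_demo_partitioned")

# One Hive-style subfolder per category value: category=A/, category=B/, ...
write_dataset(arrow_demo, part_dir2, partitioning = "category")

list.dirs(part_dir2, recursive = FALSE)
```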
The category=A/ folder layout is called Hive-style partitioning, and it's the standard format that Arrow, Spark, DuckDB, and Presto all understand. When you later open_dataset(part_dir2) and write filter(category == "B"), Arrow inspects the folder names and reads only category=B/part-0.parquet. Three quarters of the data is never read at all.
Partition on the low-cardinality columns you filter on most often, the ones that end up in WHERE clauses. Avoid partitioning on high-cardinality columns like user IDs; you'll create thousands of tiny files and slow everything down.

Try it: List the files inside part_dir2 and confirm there are four (one per category folder).
Click to reveal solution
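One possible solution:

```r
files <- list.files(part_dir2, recursive = TRUE)
files
length(files)   # 4: one Parquet file per category folder
```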
Explanation: Setting recursive = TRUE walks into the category=*/ subfolders. Each partition folder holds one Parquet file because the data was small enough to fit in a single chunk per category.
When should you use Arrow over data.table or duckdb?
Arrow, data.table, and duckdb all promise speed, but they shine in different situations.
| Tool | Best at | Use when |
|---|---|---|
| `arrow` | Parquet I/O, larger-than-RAM data, cross-language interop | Reading or writing Parquet, querying datasets that don't fit in memory, sharing data with Python or Spark |
| `data.table` | In-memory speed on data that fits in RAM | You have one big in-memory table and need fast joins, group-bys, or rolling operations |
| `duckdb` | SQL-style analytics on local files | You think in SQL or want window functions, complex joins, or OLAP-style queries |
The good news is that you don't have to pick one. arrow::to_duckdb() hands a lazy Arrow dataset to DuckDB mid-pipeline, so you can read partitioned Parquet with Arrow and run a complex SQL query with DuckDB without copying the data between them. Similarly, you can as_tibble() an Arrow result and continue with data.table for the in-memory part of the pipeline.
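A rough sketch of that hand-off, assuming the duckdb package is installed; the ranking query is just an illustration of something you'd send to DuckDB rather than Arrow's engine:

```r
library(arrow)
library(dplyr)

# Top three values per category, computed by DuckDB via a window function,
# without materialising the full dataset in R
lazy_ds |>
  to_duckdb() |>
  group_by(category) |>
  mutate(value_rank = min_rank(desc(value))) |>
  filter(value_rank <= 3) |>
  collect()
```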
Try it: Given a 5 GB Parquet file from which you only need 3 of 50 columns, which tool would you reach for first, and why?
Click to reveal solution
Explanation: When the bottleneck is "I need a slice of a wide Parquet file," Arrow is the shortest path to the answer. Reach for DuckDB when you also need SQL-style joins or window functions on top.
Practice Exercises
Exercise 1: Partitioned mtcars round trip
Write mtcars as a Parquet dataset partitioned by cyl, open it lazily, filter to rows where mpg > 20, collect the result, and verify the row count matches a pure-dplyr filter on the in-memory mtcars.
Click to reveal solution
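One possible solution:

```r
library(arrow)
library(dplyr)

mtcars_dir <- file.path(tempdir(), "mtcars_by_cyl")
write_dataset(mtcars, mtcars_dir, partitioning = "cyl")

arrow_rows <- open_dataset(mtcars_dir) |>
  filter(mpg > 20) |>
  collect() |>
  nrow()

dplyr_rows <- mtcars |>
  filter(mpg > 20) |>
  nrow()

arrow_rows == dplyr_rows   # TRUE
```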
Explanation: Partitioning mtcars by cyl creates three folders (cyl=4, cyl=6, cyl=8). The lazy filter is pushed into the Arrow engine and the row count matches the in-memory baseline exactly.
Exercise 2: Lazy aggregation pipeline
Build a 4-step lazy pipeline against the partitioned mtcars dataset from Exercise 1: filter mpg > 15, group by cyl, summarise mean_hp = mean(hp) and mean_wt = mean(wt), arrange by mean_hp descending, collect. Verify the result matches a pure-dplyr baseline run on the in-memory mtcars.
Click to reveal solution
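One possible solution, reusing mtcars_dir from Exercise 1 (the cyl partition column is read back as an integer, so it's converted before the comparison):

```r
library(arrow)
library(dplyr)

arrow_result <- open_dataset(mtcars_dir) |>
  filter(mpg > 15) |>
  group_by(cyl) |>
  summarise(mean_hp = mean(hp), mean_wt = mean(wt)) |>
  arrange(desc(mean_hp)) |>
  collect() |>
  mutate(cyl = as.numeric(cyl))   # partition column comes back as integer

dplyr_result <- mtcars |>
  filter(mpg > 15) |>
  group_by(cyl) |>
  summarise(mean_hp = mean(hp), mean_wt = mean(wt), .groups = "drop") |>
  arrange(desc(mean_hp))

all.equal(as.data.frame(arrow_result), as.data.frame(dplyr_result))
```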
Explanation: Every dplyr verb is pushed into Arrow's C++ engine. Only the 3-row summary returns to R. all.equal() confirms the lazy result is numerically identical to the in-memory baseline.
Complete Example
A realistic end-to-end flow: simulate 100,000 rows of sales data, write it as a Parquet dataset partitioned by region, run a lazy aggregation, and compare against the CSV baseline.
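A sketch of that flow; the column names, value ranges, and the quarter grouping are assumptions made for the demonstration:

```r
library(arrow)
library(dplyr)
library(readr)

set.seed(2024)
n <- 100000

# Simulated sales data, with the quarter precomputed for the aggregation
sales <- tibble::tibble(
  region  = sample(c("North", "South", "East", "West"), n, replace = TRUE),
  sold_on = as.Date("2024-01-01") + sample(0:364, n, replace = TRUE),
  product = sample(c("basic", "plus", "pro"), n, replace = TRUE),
  amount  = round(runif(n, 5, 500), 2)
)
sales$quarter <- quarters(sales$sold_on)   # "Q1" .. "Q4"

# Hive-partitioned Parquet dataset (one folder per region) plus a CSV baseline
sales_dir <- file.path(tempdir(), "sales_by_region")
sales_csv <- tempfile(fileext = ".csv")
write_dataset(sales, sales_dir, partitioning = "region")
write_csv(sales, sales_csv)

# Lazy pipeline: only the region=North/ folder is scanned,
# and only the 4-row summary reaches R
arrow_time <- system.time(
  north_by_quarter <- open_dataset(sales_dir) |>
    filter(region == "North") |>
    group_by(quarter) |>
    summarise(total_sales = sum(amount), n_sales = n()) |>
    arrange(quarter) |>
    collect()
)

# CSV baseline: parse everything, then filter and aggregate in memory
csv_time <- system.time(
  csv_result <- read_csv(sales_csv, show_col_types = FALSE) |>
    filter(region == "North") |>
    group_by(quarter) |>
    summarise(total_sales = sum(amount), n_sales = n()) |>
    arrange(quarter)
)

north_by_quarter
c(arrow = unname(arrow_time["elapsed"]), csv = unname(csv_time["elapsed"]))
```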
The pipeline above does three things that CSV cannot match. Partitioning by region means the filter(region == "North") step reads only the North folder, a quarter of the data. The lazy aggregation runs in Arrow's C++ engine, so R's memory never holds more than the 4-row summary. And the whole flow is one dplyr pipeline, no SQL, no manual chunking, no progress bars.
Summary
| Function | Purpose |
|---|---|
| `read_parquet(path)` | Read a Parquet file into a tibble |
| `read_parquet(path, col_select)` | Read only selected columns |
| `write_parquet(df, path)` | Write a tibble to a single Parquet file |
| `open_dataset(path)` | Open a lazy dataset (one file or directory) |
| `write_dataset(df, path, partitioning)` | Write a partitioned multi-file dataset |
| `collect()` | Run a lazy Arrow query and return a tibble |
| `to_duckdb()` | Hand an Arrow dataset to DuckDB for SQL queries |
Three takeaways to carry forward. First, Parquet is the right default for any data file you'd otherwise save as CSV: smaller, faster, and type-safe. Second, open_dataset() plus dplyr verbs lets you query datasets larger than RAM with code that looks identical to in-memory dplyr. Third, partition by the columns you filter on most, and Arrow will skip entire files without reading them.
References
- Apache Arrow R Package, official documentation. Link
- Wickham, H., Çetinkaya-Rundel, M., Grolemund, G., R for Data Science (2e), Chapter 22: Arrow. Link
- Apache Arrow, Working with multi-file data sets. Link
- Apache Parquet, official format specification. Link
- Posit, "Big Data in R with Arrow" workshop materials. Link
- Apache Arrow R Cookbook, recipes for common Parquet tasks. Link
Continue Learning
- Importing Data in R, the parent guide covering all import formats
- readr vs read.csv vs fread, for plain CSV speed comparisons
- data.table in R, the in-memory speed champion that pairs well with Arrow