Apache Arrow in R: Read Parquet Files & Run Fast In-Memory Analytics
The arrow package brings the Apache Arrow columnar analytics engine to R. It reads Parquet files in milliseconds, queries datasets that don't fit in memory, and shares data with Python without copying, turning slow CSV pipelines into snappy, type-safe workflows.
If you've ever waited two minutes for a CSV to load, only to discover that R guessed the wrong column types, this tutorial is for you. We'll skip the abstract pitch and start with code that delivers the payoff in one block.
What does the arrow package actually do for R?
The fastest way to feel why Arrow matters is to write a small data frame to Parquet and read it back. Watch what happens to the file size, the column types, and the time it takes, all in one block. We'll generate a 50,000-row tibble, save it as both Parquet and CSV, and compare.
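The block below is a minimal sketch of that experiment. The object names arrow_demo, parquet_path, and csv_path are reused throughout the tutorial; the seed and the value distributions are arbitrary choices for illustration.

```r
library(arrow)
library(readr)
library(tibble)

set.seed(42)  # arbitrary seed so the numbers are reproducible

# 50,000 rows: integer id, numeric value, four-level category, real Date
arrow_demo <- tibble(
  id       = 1:50000,
  value    = rnorm(50000, mean = 100, sd = 15),
  category = sample(c("A", "B", "C", "D"), 50000, replace = TRUE),
  created  = as.Date("2024-01-01") + sample(0:364, 50000, replace = TRUE)
)

parquet_path <- tempfile(fileext = ".parquet")
csv_path     <- tempfile(fileext = ".csv")

write_parquet(arrow_demo, parquet_path)
write_csv(arrow_demo, csv_path)

# File sizes on disk (bytes)
c(parquet = file.size(parquet_path), csv = file.size(csv_path))

# Read times
system.time(read_parquet(parquet_path))
system.time(read_csv(csv_path, show_col_types = FALSE))

# Types survive the round trip: created is still a Date
demo_back <- read_parquet(parquet_path)
str(demo_back)
```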
Three things just happened. Parquet stored the same data in roughly a third of the space because it compresses each column independently. The read_parquet() call returned a tibble in one line, no col_types argument, no parsing warnings. And the created column came back as a real Date, not a character string that you'd have to fix with as.Date() afterwards.
The arrow package wraps a complete columnar in-memory engine written in C++. Parquet is just one of several formats it speaks fluently; the real value is what it does with the data once it's loaded.

Try it: Write arrow_demo to a second Parquet file using snappy compression and report the file size.
Click to reveal solution
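One possible solution (snappy_path is just a throwaway name):

```r
snappy_path <- tempfile(fileext = ".parquet")

# Snappy is already the default codec, so this matches the earlier file size
write_parquet(arrow_demo, snappy_path, compression = "snappy")
file.size(snappy_path)
```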
Explanation: Snappy is the default codec, so the file size matches parquet_path almost exactly. Try compression = "zstd" if you want smaller files at a small CPU cost.
How do you read Parquet files in R?
The basic call is one line: read_parquet(path). The interesting argument is col_select, which tells Arrow to read only the columns you ask for. For wide tables (think 200-column survey exports), this can turn a 30-second read into a 1-second read because the unread columns never leave disk.
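A sketch of such a column-pruned read, reusing parquet_path from the block above:

```r
# Only id and value are decoded; category and created never leave disk
slim <- read_parquet(parquet_path, col_select = c(id, value))
names(slim)
dim(slim)
```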
Notice that category and created never entered R's memory. The Parquet file's footer told Arrow exactly where the id and value columns lived on disk, and only those byte ranges were read. CSV cannot do this; it has to parse every comma to find column boundaries.
Type preservation is the other quiet win. Watch what happens to a tibble with four different column types:
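For instance (the column names and values here are made up for the demonstration):

```r
typed <- tibble::tibble(
  count = 1:5,                                          # integer
  ratio = c(0.1, 0.25, 0.5, 0.75, 0.99),                # double
  day   = as.Date("2024-06-01") + 0:4,                  # Date
  grade = factor(c("low", "mid", "high", "mid", "low"),
                 levels = c("low", "mid", "high"))      # factor with explicit levels
)

typed_path <- tempfile(fileext = ".parquet")
arrow::write_parquet(typed, typed_path)

typed_back <- arrow::read_parquet(typed_path)
str(typed_back)               # integer, double, Date, factor: all intact
levels(typed_back$grade)
```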
Every type round-tripped: integer stayed integer (not coerced to double), the date stayed a Date, and the factor kept its levels in the original order. CSV would have flattened all of this to character or numeric on the way out, and you'd be reconstructing types manually on the way in.
starts_with("date_"), where(is.numeric), or any other tidyselect expression to col_select, not just bare column names. This makes column-level filtering as expressive as dplyr::select().Try it: Read only the created column from parquet_path and confirm it comes back as a Date.
Click to reveal solution
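One possible solution:

```r
created_only <- read_parquet(parquet_path, col_select = "created")
class(created_only$created)   # "Date"
```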
Explanation: Passing a single column name as a string works just as well as the bare-name form. The result is a one-column tibble with the date type preserved.
Why is Parquet so much faster than CSV?
Three reasons stack up. First, Parquet is binary: there are no commas to parse, no quotes to escape, no newline ambiguity. Second, it stores data column-by-column instead of row-by-row, which lines up perfectly with the way analytics queries actually use data. Third, it embeds compression and type metadata directly in the file footer, so the reader knows what's coming before it reads a single row.
| Feature | CSV | Parquet |
|---|---|---|
| Storage layout | Row-by-row text | Column-by-column binary |
| Column types | Guessed on read | Stored in file metadata |
| Compression | None (or gzip whole file) | Per-column (snappy, zstd, gzip) |
| Read subset of columns | Must parse entire file | Reads only requested columns |
| Missing values | The string "NA" | Native null encoding |
| Cross-language | Readable everywhere, but slow to parse | Zero-copy interop with R, Python, Spark, DuckDB, Java |
Numbers beat tables. Let's actually time a write-and-read round trip on a 100,000-row dataset:
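A sketch of that benchmark; the object names (bench_df, parquet_read, csv_read, and so on) are arbitrary:

```r
library(arrow)
library(readr)

set.seed(1)
bench_df <- tibble::tibble(
  id    = 1:100000,
  value = rnorm(100000),
  group = sample(letters[1:10], 100000, replace = TRUE)
)

bench_parquet <- tempfile(fileext = ".parquet")
bench_csv     <- tempfile(fileext = ".csv")

# Time the writes
parquet_write <- system.time(write_parquet(bench_df, bench_parquet))
csv_write     <- system.time(write_csv(bench_df, bench_csv))

# Time the reads
parquet_read <- system.time(read_parquet(bench_parquet))
csv_read     <- system.time(read_csv(bench_csv, show_col_types = FALSE))

rbind(
  parquet = c(write = unname(parquet_write["elapsed"]),
              read  = unname(parquet_read["elapsed"])),
  csv     = c(write = unname(csv_write["elapsed"]),
              read  = unname(csv_read["elapsed"]))
)
```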
The exact numbers will vary by machine, but the ratio is what matters. A 10-15x read speedup is typical even for a modest 100k-row file. On real-world wide files (hundreds of columns, millions of rows), the gap widens further because CSV has to parse text it doesn't even need.
Parquet's one real trade-off is that the files are binary: you can't head or cat them at the shell. Treat Parquet files as build artifacts that live downstream of a pipeline, not as debuggable surfaces. Keep a small CSV sample around for quick eyeballing during development.

Try it: Print only the read-time speedup ratio (Parquet vs CSV) to one decimal place.
Click to reveal solution
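One possible solution, reusing the timing objects from the benchmark above:

```r
speedup <- csv_read["elapsed"] / parquet_read["elapsed"]
round(unname(speedup), 1)
```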
Explanation: system.time() returns a named vector; ["elapsed"] is the wall-clock seconds. Dividing the CSV time by the Parquet time gives the ratio, which we round to one decimal place for readability.
How do you query datasets larger than RAM with open_dataset()?
This is the feature that justifies installing Arrow even on small datasets. open_dataset() does not read data into memory; it returns a lazy reference to a file or a directory of files. You then write dplyr verbs against that reference, and Arrow's C++ engine executes the whole pipeline on disk when you call collect(). Only the result, typically a small summary, ever enters R's memory.
To make this concrete, let's split our arrow_demo tibble into four Parquet files (simulating a partitioned dataset) and run a lazy query against the directory:
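A sketch of that setup; the per-file names and the value > 100 filter are arbitrary choices for the demonstration:

```r
library(arrow)
library(dplyr)

# Write one Parquet file per category into a shared directory
ds_dir <- file.path(tempdir(), "arrow_demo_split")
dir.create(ds_dir, showWarnings = FALSE)

for (lev in unique(arrow_demo$category)) {
  write_parquet(
    filter(arrow_demo, category == lev),
    file.path(ds_dir, paste0("part-", lev, ".parquet"))
  )
}

# Lazy reference: nothing is read into memory yet
lazy_ds <- open_dataset(ds_dir)
lazy_ds

# The pipeline only runs when collect() is called
lazy_ds |>
  filter(value > 100) |>
  group_by(category) |>
  summarise(mean_value = mean(value), n = n()) |>
  collect()
```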
The lazy_ds object holds zero rows in R memory; it's a pointer plus a schema. Every dplyr verb you chain against it just records the operation. The work happens at collect(), when Arrow scans the four Parquet files in C++, applies the filter and aggregation, and returns a 4-row tibble. The original 50,000 rows never sit in R memory at the same time.
The longer you keep the pipeline lazy before calling collect(), the less RAM you need.

Try it: Run a group_by(category) |> summarise(mean_value = mean(value)) against lazy_ds (without the filter step) and collect.
Click to reveal solution
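One possible solution:

```r
lazy_ds |>
  group_by(category) |>
  summarise(mean_value = mean(value)) |>
  collect()
```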
Explanation: The full pipeline runs in Arrow's C++ engine. R never holds 50,000 rows at once, only the 4-row aggregated result.
How do you write Parquet files (with partitioning)?
write_parquet() produces a single file. write_dataset() produces a directory of files, optionally partitioned by one or more columns. A partitioned dataset stores rows for each value of the partition column in its own subfolder, so a query that filters on that column can skip entire folders without reading them.
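A sketch of that write, reusing arrow_demo and the part_dir2 name used below:

```r
library(arrow)

part_dir2 <- file.path(tempdir(), "arrow_demo_partitioned")

# One Hive-style subfolder per category value: category=A/, category=B/, ...
write_dataset(arrow_demo, part_dir2, partitioning = "category")

list.dirs(part_dir2, recursive = FALSE)
```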
The category=A/ folder layout is called Hive-style partitioning, and it's the standard format that Arrow, Spark, DuckDB, and Presto all understand. When you later open_dataset(part_dir2) and write filter(category == "B"), Arrow inspects the folder names and reads only category=B/part-0.parquet. Three quarters of the data is never read at all.
Partition on the low-cardinality columns you filter on most often, the ones that end up in WHERE clauses. Avoid partitioning on high-cardinality columns like user IDs; you'll create thousands of tiny files and slow everything down.

Try it: List the files inside part_dir2 and confirm there are four (one per category folder).
Click to reveal solution
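One possible solution:

```r
files <- list.files(part_dir2, recursive = TRUE)
files
length(files)   # 4: one Parquet file per category folder
```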
Explanation: Setting recursive = TRUE walks into the category=*/ subfolders. Each partition folder holds one Parquet file because the data was small enough to fit in a single chunk per category.
When should you use Arrow over data.table or duckdb?
Arrow, data.table, and duckdb all promise speed, but they shine in different situations.
| Tool | Best at | Use when |
|---|---|---|
| `arrow` | Parquet I/O, larger-than-RAM data, cross-language interop | Reading or writing Parquet, querying datasets that don't fit in memory, sharing data with Python or Spark |
| `data.table` | In-memory speed on data that fits in RAM | You have one big in-memory table and need fast joins, group-bys, or rolling operations |
| `duckdb` | SQL-style analytics on local files | You think in SQL or want window functions, complex joins, or OLAP-style queries |
The good news is that you don't have to pick one. arrow::to_duckdb() hands a lazy Arrow dataset to DuckDB mid-pipeline, so you can read partitioned Parquet with Arrow and run a complex SQL query with DuckDB without copying the data between them. Similarly, you can as_tibble() an Arrow result and continue with data.table for the in-memory part of the pipeline.
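A rough sketch of that hand-off, assuming the duckdb package is installed; the ranking query is just an illustration of something you'd send to DuckDB rather than Arrow's engine:

```r
library(arrow)
library(dplyr)

# Top three values per category, computed by DuckDB via a window function,
# without materialising the full dataset in R
lazy_ds |>
  to_duckdb() |>
  group_by(category) |>
  mutate(value_rank = min_rank(desc(value))) |>
  filter(value_rank <= 3) |>
  collect()
```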
Try it: Given a 5 GB Parquet file from which you only need 3 of 50 columns, which tool would you reach for first, and why?
Click to reveal solution
Explanation: When the bottleneck is "I need a slice of a wide Parquet file," Arrow is the shortest path to the answer. Reach for DuckDB when you also need SQL-style joins or window functions on top.
Practice Exercises
Exercise 1: Partitioned mtcars round trip
Write mtcars as a Parquet dataset partitioned by cyl, open it lazily, filter to rows where mpg > 20, collect the result, and verify the row count matches a pure-dplyr filter on the in-memory mtcars.
Click to reveal solution
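One possible solution:

```r
library(arrow)
library(dplyr)

mtcars_dir <- file.path(tempdir(), "mtcars_by_cyl")
write_dataset(mtcars, mtcars_dir, partitioning = "cyl")

arrow_rows <- open_dataset(mtcars_dir) |>
  filter(mpg > 20) |>
  collect() |>
  nrow()

dplyr_rows <- mtcars |>
  filter(mpg > 20) |>
  nrow()

arrow_rows == dplyr_rows   # TRUE
```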
Explanation: Partitioning mtcars by cyl creates three folders (cyl=4, cyl=6, cyl=8). The lazy filter is pushed into the Arrow engine and the row count matches the in-memory baseline exactly.
Exercise 2: Lazy aggregation pipeline
Build a 4-step lazy pipeline against the partitioned mtcars dataset from Exercise 1: filter mpg > 15, group by cyl, summarise mean_hp = mean(hp) and mean_wt = mean(wt), arrange by mean_hp descending, collect. Verify the result matches a pure-dplyr baseline run on the in-memory mtcars.
Click to reveal solution
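One possible solution, reusing mtcars_dir from Exercise 1 (the cyl partition column is read back as an integer, so it's converted before the comparison):

```r
library(arrow)
library(dplyr)

arrow_result <- open_dataset(mtcars_dir) |>
  filter(mpg > 15) |>
  group_by(cyl) |>
  summarise(mean_hp = mean(hp), mean_wt = mean(wt)) |>
  arrange(desc(mean_hp)) |>
  collect() |>
  mutate(cyl = as.numeric(cyl))   # partition column comes back as integer

dplyr_result <- mtcars |>
  filter(mpg > 15) |>
  group_by(cyl) |>
  summarise(mean_hp = mean(hp), mean_wt = mean(wt), .groups = "drop") |>
  arrange(desc(mean_hp))

all.equal(as.data.frame(arrow_result), as.data.frame(dplyr_result))
```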
Explanation: Every dplyr verb is pushed into Arrow's C++ engine. Only the 3-row summary returns to R. all.equal() confirms the lazy result is numerically identical to the in-memory baseline.
Complete Example
A realistic end-to-end flow: simulate 100,000 rows of sales data, write it as a Parquet dataset partitioned by region, run a lazy aggregation, and compare against the CSV baseline.
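A sketch of that flow; the column names, value ranges, and the quarter grouping are assumptions made for the demonstration:

```r
library(arrow)
library(dplyr)
library(readr)

set.seed(2024)
n <- 100000

# Simulated sales data, with the quarter precomputed for the aggregation
sales <- tibble::tibble(
  region  = sample(c("North", "South", "East", "West"), n, replace = TRUE),
  sold_on = as.Date("2024-01-01") + sample(0:364, n, replace = TRUE),
  product = sample(c("basic", "plus", "pro"), n, replace = TRUE),
  amount  = round(runif(n, 5, 500), 2)
)
sales$quarter <- quarters(sales$sold_on)   # "Q1" .. "Q4"

# Hive-partitioned Parquet dataset (one folder per region) plus a CSV baseline
sales_dir <- file.path(tempdir(), "sales_by_region")
sales_csv <- tempfile(fileext = ".csv")
write_dataset(sales, sales_dir, partitioning = "region")
write_csv(sales, sales_csv)

# Lazy pipeline: only the region=North/ folder is scanned,
# and only the 4-row summary reaches R
arrow_time <- system.time(
  north_by_quarter <- open_dataset(sales_dir) |>
    filter(region == "North") |>
    group_by(quarter) |>
    summarise(total_sales = sum(amount), n_sales = n()) |>
    arrange(quarter) |>
    collect()
)

# CSV baseline: parse everything, then filter and aggregate in memory
csv_time <- system.time(
  csv_result <- read_csv(sales_csv, show_col_types = FALSE) |>
    filter(region == "North") |>
    group_by(quarter) |>
    summarise(total_sales = sum(amount), n_sales = n()) |>
    arrange(quarter)
)

north_by_quarter
c(arrow = unname(arrow_time["elapsed"]), csv = unname(csv_time["elapsed"]))
```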
The pipeline above does three things that CSV cannot match. Partitioning by region means the filter(region == "North") step reads only the North folder, a quarter of the data. The lazy aggregation runs in Arrow's C++ engine, so R's memory never holds more than the 4-row summary. And the whole flow is one dplyr pipeline, no SQL, no manual chunking, no progress bars.
Summary
| Function | Purpose |
|---|---|
| `read_parquet(path)` | Read a Parquet file into a tibble |
| `read_parquet(path, col_select)` | Read only selected columns |
| `write_parquet(df, path)` | Write a tibble to a single Parquet file |
| `open_dataset(path)` | Open a lazy dataset (one file or directory) |
| `write_dataset(df, path, partitioning)` | Write a partitioned multi-file dataset |
| `collect()` | Run a lazy Arrow query and return a tibble |
| `to_duckdb()` | Hand an Arrow dataset to DuckDB for SQL queries |
Three takeaways to carry forward. First, Parquet is the right default for any data file you'd otherwise save as CSV: smaller, faster, and type-safe. Second, open_dataset() plus dplyr verbs lets you query datasets larger than RAM with code that looks identical to in-memory dplyr. Third, partition by the columns you filter on most, and Arrow will skip entire files without reading them.
References
- Apache Arrow R Package, official documentation. Link
- Wickham, H., Çetinkaya-Rundel, M., Grolemund, G., R for Data Science (2e), Chapter 22: Arrow. Link
- Apache Arrow, Working with multi-file data sets. Link
- Apache Parquet, official format specification. Link
- Posit, "Big Data in R with Arrow" workshop materials. Link
- Apache Arrow R Cookbook, recipes for common Parquet tasks. Link
Continue Learning
- Importing Data in R, the parent guide covering all import formats
- readr vs read.csv vs fread, for plain CSV speed comparisons
- data.table in R, the in-memory speed champion that pairs well with Arrow