Apache Arrow in R: Read Parquet Files & Run Fast In-Memory Analytics
The arrow package reads Parquet files (compressed, columnar, typed) in a fraction of the time CSV takes. It also queries datasets larger than RAM without loading them fully, and transfers data to Python with zero copying.
CSV files are text: slow to parse, untyped, and bloated. A 1 GB CSV typically shrinks to 100–200 MB as Parquet, often reads around 10x faster, and keeps its column types. If you work with large data, Parquet + arrow is the upgrade path from CSV.
Why Parquet Over CSV?
| Feature | CSV | Parquet |
|---|---|---|
| File size (1M rows) | ~100 MB | ~15 MB |
| Read speed | Slow (parse text) | Fast (binary, columnar) |
| Column types | Guessed on read | Stored in file metadata |
| Read subset of columns | Must read entire file | Reads only selected columns |
| Compression | None (or gzip entire file) | Per-column (snappy, zstd, gzip) |
| Missing values | "NA" text string | Native null representation |
| Cross-language | Universal but slow | R, Python, Spark, Rust, Java |
Reading and Writing Parquet
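A minimal round trip might look like the sketch below. It assumes the arrow package is installed and uses the built-in mtcars data with a temporary file path. Both are illustrative choices, not part of the original tutorial.

```r
library(arrow)

# Hypothetical location: a temporary file, so nothing is left behind
path <- tempfile(fileext = ".parquet")

# Write a data frame to Parquet
write_parquet(mtcars, path)

# Read the whole file back into R
cars <- read_parquet(path)

dim(cars)            # same number of rows and columns as mtcars
sapply(cars, class)  # column types come from the file, not from guessing
```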
Read Only Specific Columns
For wide datasets, passing col_select to read_parquet() is a major speed win: columns you don't select are never loaded.
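As a sketch of column selection, the example below writes mtcars to a temporary Parquet file (an illustrative stand-in for a wide dataset) and reads back only a few columns. col_select accepts either a character vector of names or a tidyselect helper such as starts_with().

```r
library(arrow)

# Hypothetical file built from mtcars so the example is self-contained
path <- tempfile(fileext = ".parquet")
write_parquet(mtcars, path)

# Select columns by name
few <- read_parquet(path, col_select = c("mpg", "cyl", "hp"))
names(few)

# Select columns with a tidyselect helper: names starting with "d"
d_cols <- read_parquet(path, col_select = starts_with("d"))
names(d_cols)
```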
Lazy Queries with open_dataset()
open_dataset() opens a Parquet file (or directory of files) without loading data into memory. You write dplyr-style queries that Arrow's C++ engine executes at collect() time.
For datasets that don't fit in RAM, open_dataset() + dplyr verbs + collect() lets you filter and aggregate on disk. Only the final result enters R memory.
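The lazy workflow can be sketched as follows. The dataset directory here is hypothetical: it is built from mtcars with write_dataset() purely so the example runs, whereas in practice you would point open_dataset() at your own Parquet file or directory.

```r
library(arrow)
library(dplyr)

# Hypothetical dataset directory: one Parquet file per value of cyl
ds_dir <- tempfile()
write_dataset(mtcars, ds_dir, format = "parquet", partitioning = "cyl")

# Opening the dataset scans file metadata only; no rows are loaded yet
ds <- open_dataset(ds_dir)

# The dplyr verbs build a query; nothing runs until collect()
result <- ds %>%
  filter(hp > 100) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), n = n()) %>%
  arrange(cyl) %>%
  collect()

result
```

Only the three summary rows reach R; the filtering and aggregation happen in Arrow before collect() returns.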
Feather Format
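Feather (version 2) is the Arrow IPC file format: Arrow's in-memory layout written to disk. It is usually faster than Parquet for R-to-R transfers because there is little or no decoding overhead, but the files are larger. Note that write_feather() applies LZ4 compression by default when that codec is available; pass compression = "uncompressed" to skip compression entirely.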
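A short Feather round trip, as a sketch: the temporary path and the use of iris are illustrative assumptions. The compression argument is set explicitly here so the file really is uncompressed.

```r
library(arrow)

# Hypothetical temporary path for the Feather file
path <- tempfile(fileext = ".feather")

# Write without compression to avoid any decompression cost on read
write_feather(iris, path, compression = "uncompressed")

# Read it back into R
back <- read_feather(path)

# Factor columns keep their levels across the round trip
levels(back$Species)
```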
When to Use Arrow
| Scenario | Use Arrow? | Why |
|---|---|---|
| Files > 100 MB | Yes | Parquet is faster than CSV |
| Only need a few columns from a wide file | Yes | Columnar reads only needed columns |
| Data shared between R and Python | Yes | Zero-copy via Arrow memory format |
| Datasets larger than RAM | Yes | open_dataset() queries on disk |
| Small files < 10 MB | Optional | read_csv is fine for small files |
| Need human-readable format | No | Parquet is binary |
Practice Exercises
Exercise 1: Round-Trip Parquet
Write iris to Parquet, read back only Petal columns + Species, and verify types are preserved.
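One possible solution is sketched below. The temporary path is an arbitrary choice, and the checks at the end are just one way to verify that types survived the round trip.

```r
library(arrow)

# Write iris to a temporary Parquet file
path <- tempfile(fileext = ".parquet")
write_parquet(iris, path)

# Read back only the Petal columns and Species
petal <- read_parquet(
  path,
  col_select = c("Petal.Length", "Petal.Width", "Species")
)

str(petal)

# Verify names and types: numeric measurements, factor species
stopifnot(
  identical(names(petal), c("Petal.Length", "Petal.Width", "Species")),
  is.numeric(petal$Petal.Length),
  is.numeric(petal$Petal.Width),
  is.factor(petal$Species),
  identical(levels(petal$Species), levels(iris$Species))
)
```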
Summary
| Function | Purpose |
|---|---|
| write_parquet(df, path) | Save as compressed Parquet |
| read_parquet(path) | Read Parquet into R |
| read_parquet(path, col_select=) | Read specific columns only |
| open_dataset(path) | Open lazily for dplyr queries |
| collect() | Execute lazy query, bring into R |
| write_feather(df, path) | Save as Feather (Arrow IPC format) |
| read_feather(path) | Read Feather into R |
FAQ
Do I need to install anything besides the R package?
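Usually not. On Windows and macOS, install.packages("arrow") installs a CRAN binary that bundles the Arrow C++ library, so no system dependencies are needed. On Linux, the package either downloads a prebuilt C++ library or builds it from source, which can take noticeably longer. Either way it is a large initial install (~100 MB).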
Can I read Parquet files created by Python or Spark?
Yes. Parquet is a cross-language standard. Files created by PySpark, pandas, DuckDB, or any Arrow-compatible tool are fully readable in R and vice versa.
How does Arrow compare to DuckDB for large data?
Arrow excels at file I/O and cross-language interop. DuckDB excels at SQL-style analytics. They integrate well: arrow::to_duckdb() passes an Arrow table or dataset to DuckDB without copying the data, so DuckDB can query it directly. Use Arrow for reading/writing and DuckDB for complex queries.
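One way to hand Arrow data to DuckDB from R is arrow's to_duckdb(), sketched below. This assumes the duckdb and dbplyr packages are installed in addition to arrow and dplyr; the guard skips the example otherwise. The mtcars table is an illustrative stand-in for a real dataset.

```r
library(arrow)
library(dplyr)

# Only run when the optional packages are available (an assumption of this sketch)
if (requireNamespace("duckdb", quietly = TRUE) &&
    requireNamespace("dbplyr", quietly = TRUE)) {

  # An in-memory Arrow Table; open_dataset() output works the same way
  tbl_arrow <- arrow_table(mtcars)

  # to_duckdb() registers the Arrow data as a DuckDB table;
  # the following verbs are translated to SQL and run by DuckDB
  res <- tbl_arrow %>%
    to_duckdb() %>%
    group_by(cyl) %>%
    summarise(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
    arrange(cyl) %>%
    collect()

  print(res)
}
```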
What's Next?
- Importing Data in R — the parent tutorial covering all formats
- readr vs read.csv vs fread — CSV reader comparison
- Pipe Operator — chain Arrow reads with dplyr