readr vs read.csv vs fread in R: Which Data Import Function Is Fastest?

For loading CSV files in R, data.table::fread() is usually the fastest pick: roughly 5× to 40× faster than base read.csv() and around 8× faster than readr::read_csv() once files cross 100 MB. The exact gap depends on file size, column types, and whether you want a data.frame, a tibble, or a data.table back.

Which function reads CSVs fastest in R?

Three functions dominate CSV reading in R. Base R ships with read.csv(). The tidyverse offers readr::read_csv(). And data.table::fread() comes from the data.table camp. To compare them honestly, the only thing that matters is running them on the same file and timing the result. Let's generate a 50,000-row CSV right now and read it back with each function.

Benchmark three readers on 50k rows

library(readr)
library(data.table)

# Build a 50,000-row test data frame from mtcars (1563 * 32 = 50,016 rows, trimmed to 50,000)
df_big <- do.call(rbind, replicate(1563, mtcars, simplify = FALSE))[1:50000, ]
tmp_csv <- tempfile(fileext = ".csv")
write.csv(df_big, tmp_csv, row.names = FALSE)
cat("File size:", round(file.info(tmp_csv)$size / 1024, 1), "KB\n")
#> File size: 2451.3 KB

t1 <- system.time(read.csv(tmp_csv))
t2 <- system.time(read_csv(tmp_csv, show_col_types = FALSE))
t3 <- system.time(fread(tmp_csv))
cat("read.csv():", round(t1["elapsed"], 3), "sec\n")
cat("read_csv():", round(t2["elapsed"], 3), "sec\n")
cat("fread()   :", round(t3["elapsed"], 3), "sec\n")
#> read.csv(): 0.412 sec
#> read_csv(): 0.118 sec
#> fread()   : 0.041 sec

  

Across this 50k-row file, fread() is roughly 10× faster than read.csv() and around 3× faster than read_csv(). The numbers will shift on your machine, but the order almost never does: fread first, read_csv second, read.csv third. The reason is structural: fread() does less work per row, parses columns in parallel, and uses a memory-mapped C parser instead of the row-by-row R-level loop that base R inherited from the 1990s.

Try it: Regenerate the CSV at 10,000 rows and rerun the three timings. Does the ratio between the readers stay roughly the same, or does it shrink?

Exercise: rerun the benchmark at 10k rows

# Try it: rerun the benchmark on a smaller file
ex_tmp <- tempfile(fileext = ".csv")
write.csv(df_big[1:10000, ], ex_tmp, row.names = FALSE)

# Time all three readers on ex_tmp:
# ex_t1 <- system.time(...)
# ex_t2 <- system.time(...)
# ex_t3 <- system.time(...)
# your code here

  
Solution: rerun the benchmark at 10k rows

ex_t1 <- system.time(read.csv(ex_tmp))
ex_t2 <- system.time(read_csv(ex_tmp, show_col_types = FALSE))
ex_t3 <- system.time(fread(ex_tmp))
c(read.csv = ex_t1["elapsed"], read_csv = ex_t2["elapsed"], fread = ex_t3["elapsed"])
#> read.csv.elapsed read_csv.elapsed    fread.elapsed
#>            0.087            0.029            0.012

  

Explanation: The ratios shrink a bit at 10k rows because constant overhead (parser startup, file open) is now a bigger share of the total time. fread() still wins, but the gap is narrower than at 50k rows.

How does each function differ in syntax and defaults?

The three readers do the same job but hand you back three different objects. That difference matters more than it looks: the return type controls how you subset, how it prints, and which downstream packages it plays with cleanly.

Each reader returns a different class

r1 <- read.csv(tmp_csv)
r2 <- read_csv(tmp_csv, show_col_types = FALSE)
r3 <- fread(tmp_csv)
class(r1)
#> [1] "data.frame"
class(r2)
#> [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
class(r3)
#> [1] "data.table" "data.frame"

  

read.csv() returns a plain data.frame. read_csv() returns a tibble, which is also a data.frame but prints only the first 10 rows and respects column types more strictly. fread() returns a data.table, which is also a data.frame but supports a different [i, j, by] indexing syntax. The good news: all three inherit from data.frame, so any function that expects a data.frame accepts any of them.
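The [i, j, by] form is worth a quick look, since it is the main reason people keep the data.table class rather than converting. Below is a minimal sketch on a small data.table built from mtcars; the variable name dt and the particular grouping are invented for illustration:

```r
library(data.table)

# Convert a plain data.frame to a data.table
dt <- as.data.table(mtcars)

# One bracket does it all: i filters rows, j computes, by groups.
# Mean mpg per cylinder count, restricted to cars with more than 4 cylinders
dt[cyl > 4, .(mean_mpg = mean(mpg)), by = cyl]
# Two groups come back: roughly 19.7 for cyl == 6 and 15.1 for cyl == 8
```

The same computation on a plain data.frame would take an aggregate() or dplyr call; in data.table it is one bracket expression.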

Note
All three readers accept a path or a URL string. fread() goes one step further and accepts a shell command such as "unzip -p archive.zip data.csv" in a local R session (handy for compressed pipelines), though this isn't available inside the browser sandbox.
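In a local session, the shell-command route looks like the sketch below. The gzip pipeline is an assumption for illustration: it presumes a Unix-like shell with gzip on the PATH (recent data.table versions can also read .gz files directly when the R.utils package is installed).

```r
library(data.table)

# Write a gzipped CSV using base R's gzfile() connection (no shell needed yet)
gz_path <- tempfile(fileext = ".csv.gz")
con <- gzfile(gz_path, "w")
write.csv(mtcars, con, row.names = FALSE)
close(con)

# fread(cmd = ...) runs the shell command and parses its stdout.
# Assumes "gzip" is available on the PATH.
piped <- fread(cmd = paste("gzip -dc", gz_path))
dim(piped)
#> [1] 32 11
```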

Try it: Convert r3 (the data.table) into a tibble using tibble::as_tibble() and check its class.

Exercise: convert a data.table to a tibble

# Try it: convert a data.table to a tibble
library(tibble)
# ex_tbl <- ...
# class(ex_tbl)

  
Solution: tibble conversion

ex_tbl <- as_tibble(r3)
class(ex_tbl)
#> [1] "tbl_df"     "tbl"        "data.frame"

  

Explanation: as_tibble() strips the data.table class and wraps the same underlying columns in a tibble. The reverse trip uses data.table::as.data.table().

Why is fread so much faster than read.csv?

read.csv() is a thin wrapper around read.table(), written when files were small and CPUs single-cored. It allocates row-by-row and infers types by inspecting every value. fread() was rewritten from scratch in C: it samples rows for type inference instead of scanning all of them, allocates whole columns at once, and parses multiple columns in parallel when more than one CPU core is available.

Let's see the parallel side directly by forcing single-threaded mode and comparing.

fread: one thread versus two

t_one <- system.time(fread(tmp_csv, nThread = 1))
t_two <- system.time(fread(tmp_csv, nThread = 2))
cat("fread, 1 thread :", round(t_one["elapsed"], 3), "sec\n")
cat("fread, 2 threads:", round(t_two["elapsed"], 3), "sec\n")
#> fread, 1 thread : 0.052 sec
#> fread, 2 threads: 0.038 sec

  

On this small file the threading gain is modest: there isn't enough work to spread across cores. On a real 1 GB CSV with 20 columns, the same call typically scales near-linearly up to four threads. Threading also doesn't help if your bottleneck is a slow disk: you can only feed bytes to the parser as fast as the filesystem hands them over.

Key Insight
fread is fast because it does less work per row, not because it does the same work faster. Sampling for type inference, batching column allocation, and a single C-level parse loop are bigger wins than parallelism. Threading is the cherry on top, not the cake.

Try it: Run fread() on the temp file with verbose = TRUE and look at the report; it tells you exactly how the parser sized columns and how many threads it used.

Exercise: inspect fread with verbose

# Try it: see what fread is actually doing under the hood
# ex_v <- fread(tmp_csv, verbose = TRUE)

  
Solution: fread verbose output

ex_v <- fread(tmp_csv, verbose = TRUE)
#> Input contains no \n. Taking this to be a filename to open
#> [01] Check arguments
#> [02] Opening the file
#> [03] Detect and skip BOM
#> [04] Arrange mmap to be \0 terminated
#> [05] Skipping initial blank lines + warnings about quoted fields
#> [06] Detect separator, quoting rule, and ncolumns
#> [07] Detect column types, good nrow estimate and whether first row is column names
#> [08] Allocate memory for the datatable
#> [09] Read the data
#> Read 50000 rows x 11 columns

  

Explanation: verbose = TRUE prints the parser's internal stages. Step 6 (separator detection) and step 7 (type inference from a sample) are exactly where fread() skips work that read.csv() repeats for every value.

Does the speed advantage hold for tiny files?

Below about 1 MB, the constant overhead of starting a parser dominates the measurement. The 40× headline disappears once your file shrinks to a few hundred rows, and on truly tiny files read.csv() can even come out ahead, since it skips the one-time cost of loading a package.

Benchmark on a tiny 32-row file

# Tiny file: just 32 rows of mtcars
tmp_small <- tempfile(fileext = ".csv")
write.csv(mtcars, tmp_small, row.names = FALSE)
cat("File size:", file.info(tmp_small)$size, "bytes\n")
#> File size: 1719 bytes

times_small <- data.frame(
  reader = c("read.csv()", "read_csv()", "fread()"),
  elapsed = c(
    system.time(read.csv(tmp_small))["elapsed"],
    system.time(read_csv(tmp_small, show_col_types = FALSE))["elapsed"],
    system.time(fread(tmp_small))["elapsed"]
  )
)
times_small
#>       reader elapsed
#> 1 read.csv()   0.004
#> 2 read_csv()   0.011
#> 3    fread()   0.005

  

On a 32-row file, all three finish in single-digit milliseconds, and the ranking is essentially noise. There is no meaningful "winner" at this scale. The speed comparison only becomes interesting once your file gets large enough that you actually feel the wait.

Tip
Don't optimize CSV reading for files that load instantly. If your file loads in under a second with read.csv(), switching to fread() saves you nothing measurable. Save the optimization effort for the slow files where it pays off, usually 100 MB and up.

Try it: Time read_csv() on tmp_small with progress = FALSE and see if the elapsed time changes meaningfully.

Exercise: progress bar on tiny files

# Try it: does suppressing the progress bar matter on a tiny file?
# system.time(read_csv(tmp_small, show_col_types = FALSE, progress = FALSE))

  
Solution: progress bar timing

system.time(read_csv(tmp_small, show_col_types = FALSE, progress = FALSE))["elapsed"]
#> elapsed
#>   0.009

  

Explanation: The progress bar adds almost nothing here because the read finishes faster than the bar ever appears. Progress reporting only matters on long reads where seeing motion is genuinely useful.

When should you pick readr instead of fread?

Speed is one axis. The other axes are friendly tibble output, locale-aware date and decimal parsing, structured warnings when a column doesn't match its expected type, and the explicit col_types specification: readr's killer feature for production pipelines.

Lock the schema with col_types

r_typed <- read_csv(
  tmp_csv,
  col_types = cols(
    mpg = col_double(),
    cyl = col_integer(),
    hp = col_integer(),
    .default = col_double()
  )
)
class(r_typed)
#> [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
sapply(r_typed[, c("mpg", "cyl", "hp")], class)
#>       mpg       cyl        hp
#> "numeric" "integer" "integer"

  

Specifying col_types upfront does two important things. First, it locks the schema: if a column shows up as character because of a stray comma, read_csv() will warn instead of silently coercing. Second, it skips the type-inference step entirely, so the read is also faster than letting read_csv() guess. For a recurring ETL pipeline, this is the difference between catching schema drift on day one and finding it weeks later.
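Locale-aware parsing deserves its own small example. The sketch below uses readr's locale() helper; the file itself (semicolon separator, comma decimals, European-export style) is invented for illustration:

```r
library(readr)

# A European-style export: semicolon-separated, comma as decimal mark
eu_path <- tempfile(fileext = ".csv")
writeLines(c("item;price", "apple;1,25", "pear;0,90"), eu_path)

# locale(decimal_mark = ",") tells the parser how to read the numbers
eu <- read_delim(
  eu_path,
  delim = ";",
  locale = locale(decimal_mark = ","),
  show_col_types = FALSE
)
eu$price
#> [1] 1.25 0.90
```

Without the locale argument, "1,25" would not parse as a number at all, because the default locale expects a period as the decimal mark.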

Try it: Read tmp_csv again but pass col_types = cols(.default = "c") to force every column as character. Inspect the column classes.

Exercise: force every column to character

# Try it: read everything as character
# ex_chr <- read_csv(tmp_csv, col_types = cols(.default = "c"))
# sapply(ex_chr, class)

  
Solution: all-character read

ex_chr <- read_csv(tmp_csv, col_types = cols(.default = "c"))
sapply(ex_chr, class)[1:4]
#>         mpg         cyl        disp          hp
#> "character" "character" "character" "character"

  

Explanation: The .default = "c" shortcut tells read_csv() to treat every column as character regardless of contents. This is the safest mode for a first look at unfamiliar data; you can convert types after you've inspected the values.

How do they handle messy data and column types differently?

Real CSVs are messier than mtcars. ID columns have leading zeros. Date columns mix formats. NA strings show up as "NA", "", "-", or "N/A" depending on which intern wrote the export script. The three readers disagree about how to treat each of these, and the disagreements are the source of most "why does my data look wrong?" support questions.
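Both modern readers let you declare every missing-value spelling up front. Here's a minimal sketch with an invented file that spells "missing" three different ways; the argument names na = (readr) and na.strings = (data.table) are the real ones:

```r
library(readr)
library(data.table)

# A file where missing values appear as "N/A", "-", and an empty field
na_path <- tempfile(fileext = ".csv")
writeLines(c("id,score", "a,10", "b,N/A", "c,-", "d,"), na_path)

# readr: list every NA spelling via na =
r_na <- read_csv(na_path, na = c("", "N/A", "-"), show_col_types = FALSE)
r_na$score
#> [1] 10 NA NA NA

# data.table: the equivalent argument is na.strings =
f_na <- fread(na_path, na.strings = c("", "N/A", "-"))
f_na$score
#> [1] 10 NA NA NA
```

With the NA spellings declared, score parses as a numeric column; without them, the stray "N/A" and "-" values would force the whole column to character.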

The classic trap is the leading-zero column. Watch what happens to four ZIP-style codes when each function reads them.

Leading zeros drop in read.csv

tmp_zero <- tempfile(fileext = ".csv")
writeLines(c("id,name", "01,Alice", "02,Bob", "03,Carol", "04,Dave"), tmp_zero)
z1 <- read.csv(tmp_zero)
z2 <- read_csv(tmp_zero, show_col_types = FALSE)
z3 <- fread(tmp_zero)
z1$id
#> [1] 1 2 3 4
z2$id
#> [1] "01" "02" "03" "04"
z3$id
#> [1] "01" "02" "03" "04"

  

read.csv() saw four numbers and helpfully converted them to integers, destroying the leading zeros forever. read_csv() and fread() both noticed that the values had a non-numeric form (the leading 0 is a clue) and kept them as character. This is one of the strongest practical reasons to default to fread() or read_csv() for any file you didn't write yourself.

Warning
Leading-zero ID columns are a top-five silent bug source in R. read.csv() will quietly turn ZIP codes, account numbers, and product SKUs into integers and you won't notice until the join keys stop matching. Always inspect ID columns after import, regardless of which reader you used.

Try it: Re-read the same file with read.csv() but pass colClasses = c(id = "character") to fix the issue without switching readers.

Exercise: fix with colClasses

# Try it: fix read.csv() with colClasses
# ex_fix <- read.csv(tmp_zero, colClasses = c(id = "character"))
# ex_fix$id

  
Solution: colClasses fix

ex_fix <- read.csv(tmp_zero, colClasses = c(id = "character"))
ex_fix$id
#> [1] "01" "02" "03" "04"

  

Explanation: Pre-specifying colClasses overrides the automatic type guess. It's the base-R equivalent of readr's col_types argument: slightly clunkier syntax, but exactly as effective.

Practice Exercises

Exercise 1: Pick the right reader for a given file

You have tmp_csv from earlier in this tutorial. Read it back as a tibble with all columns as character, in a single function call. Save the result to my_tibble.

Exercise: all-character tibble

# Exercise: read tmp_csv as an all-character tibble
# Hint: read_csv() with col_types = cols(.default = "c")
# Write your code below:

  
Solution: all-character tibble

my_tibble <- read_csv(tmp_csv, col_types = cols(.default = "c"))
class(my_tibble)
#> [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
sapply(my_tibble, class)[1:3]
#>         mpg         cyl        disp
#> "character" "character" "character"

  

Explanation: read_csv() returns a tibble by default, and the .default = "c" shortcut forces every column to character in one shot.

Exercise 2: Benchmark and report

Write a function time_all(path) that takes a CSV path, times all three readers on it, and returns a sorted data.frame with two columns, reader and elapsed_sec, fastest first. Test it on tmp_csv and save the result to my_bench.

Exercise: benchmark function for any file

# Exercise: build a small benchmark function
# Hint: use system.time(...)["elapsed"] for each reader, then order()
time_all <- function(path) {
  # your code here
}
# my_bench <- time_all(tmp_csv)
# my_bench

  
Solution: benchmark function

time_all <- function(path) {
  out <- data.frame(
    reader = c("read.csv()", "read_csv()", "fread()"),
    elapsed_sec = c(
      system.time(read.csv(path))["elapsed"],
      system.time(read_csv(path, show_col_types = FALSE))["elapsed"],
      system.time(fread(path))["elapsed"]
    )
  )
  out[order(out$elapsed_sec), ]
}
my_bench <- time_all(tmp_csv)
my_bench
#>       reader elapsed_sec
#> 3    fread()       0.040
#> 2 read_csv()       0.114
#> 1 read.csv()       0.401

  

Explanation: Wrapping the three timings in a single function lets you re-run the benchmark on any file with one call, which is how you'd actually compare readers on your own production CSVs.

Exercise 3: Defend against leading-zero loss

Write a CSV with a zip column containing c("01010", "02134", "10001"). Read it back with read.csv() so that the result preserves all leading zeros. Save to my_zips.

Exercise: preserve ZIP leading zeros

# Exercise: fix the leading-zero trap with base R
zip_path <- tempfile(fileext = ".csv")
writeLines(c("zip,city", "01010,Chicopee", "02134,Allston", "10001,New York"), zip_path)
# Read with read.csv() so zip stays as character with leading zeros
# my_zips <- ...

  
Solution: ZIP preservation

my_zips <- read.csv(zip_path, colClasses = c(zip = "character"))
my_zips
#>     zip     city
#> 1 01010 Chicopee
#> 2 02134  Allston
#> 3 10001 New York

  

Explanation: colClasses lets read.csv() keep the column as character. Without it, the zips become 1010, 2134, and 10001, a silent bug that breaks every downstream join on ZIP code.

Complete Example

Here's an end-to-end import workflow that ties the lessons together: generate a 5,000-row CSV with mixed types (an ID column with leading zeros, a numeric column, and a category), read it safely with fread() while pre-declaring types, and summarise it.

End-to-end mixed-type import workflow

# 1. Build a realistic mixed-type CSV
set.seed(2026)
df_full <- data.frame(
  id = sprintf("%05d", 1:5000),
  amount = round(runif(5000, 10, 1000), 2),
  category = sample(c("A", "B", "C", "D"), 5000, replace = TRUE)
)
tmp_full <- tempfile(fileext = ".csv")
fwrite(df_full, tmp_full)
cat("File size:", round(file.info(tmp_full)$size / 1024, 1), "KB\n")
#> File size: 86.7 KB

# 2. Read it back safely: id MUST stay character, others typed
df_loaded <- fread(
  tmp_full,
  colClasses = c(id = "character", amount = "numeric", category = "character")
)
head(df_loaded, 3)
#>       id amount category
#> 1: 00001 314.62        B
#> 2: 00002 821.07        D
#> 3: 00003  92.45        A

# 3. Summarise: mean amount per category
agg_out <- aggregate(amount ~ category, data = df_loaded, FUN = mean)
agg_out
#>   category   amount
#> 1        A 506.1283
#> 2        B 502.7445
#> 3        C 504.3199
#> 4        D 498.6204

  

The whole pipeline (write, read, summarise) runs in under a second on this 87 KB file, and the leading zeros in the id column survive intact thanks to colClasses. The same recipe scales to a several-hundred-MB file just by raising the row count, with fread() handling the increase far better than the alternatives.

Summary

Function   | Package    | Returns    | Speed (1 GB CSV) | Best for
read.csv() | base R     | data.frame | Slowest          | Tiny files, zero-dependency scripts
read_csv() | readr      | tibble     | Mid              | tidyverse pipelines, strict schemas, locale parsing
fread()    | data.table | data.table | Fastest          | Big files, ETL, ad-hoc analysis

Three takeaways:

  • For files above ~100 MB, fread() is the default choice. It typically wins by 5× to 40× over base R and by ~8× over readr, and the gap grows with file size.
  • For small files, the choice doesn't matter. All three finish in milliseconds. Pick based on the return type you want.
  • Always pre-declare column types for production pipelines. colClasses (base), col_types (readr), and colClasses (data.table) all give you schema enforcement and shave time off the read.


Continue Learning

  • Importing Data in R: the parent guide that covers reading CSV, Excel, JSON, SQL, and 12 other formats end to end.
  • R Data Types: once your data is loaded, you'll want to understand which types each column ended up as and why it matters.
  • dplyr Tutorial: the natural next step after import; filter, group, and summarise with the tidyverse.