readr vs read.csv vs fread in R: Which Data Import Function Is Fastest?
For loading CSV files in R, data.table::fread() is usually the fastest pick, roughly 5× to 40× faster than base read.csv() and around 8× faster than readr::read_csv() once files cross 100 MB. The honest gap depends on file size, column types, and whether you want a data.frame, a tibble, or a data.table on the way out.
Which function reads CSVs fastest in R?
Three functions dominate CSV reading in R. Base R ships with read.csv(). The tidyverse offers readr::read_csv(). And data.table::fread() comes from the data.table camp. To compare them honestly, the only thing that matters is running them on the same file and timing the result. Let's generate a 50,000-row CSV right now and read it back with each function.
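A minimal benchmark sketch follows. The file and column names here are illustrative choices, readr and data.table are assumed installed, and the exact timings will vary by machine and package version.

```r
library(readr)       # read_csv()
library(data.table)  # fread()

# Generate a 50,000-row CSV with mixed column types
set.seed(42)
n <- 50000
dat <- data.frame(
  id    = sprintf("%06d", seq_len(n)),
  value = rnorm(n),
  group = sample(letters[1:5], n, replace = TRUE)
)
tmp_csv <- tempfile(fileext = ".csv")
write.csv(dat, tmp_csv, row.names = FALSE)

# Time each reader on the same file
system.time(r1 <- read.csv(tmp_csv))                          # base R
system.time(r2 <- read_csv(tmp_csv, show_col_types = FALSE))  # readr
system.time(r3 <- fread(tmp_csv))                             # data.table
```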
Across this 50k-row file, fread() is roughly 10× faster than read.csv() and around 3× faster than read_csv(). The numbers will shift on your machine, but the order almost never does: fread first, read_csv second, read.csv third. The reason is structural: fread() does less work per row, parses columns in parallel, and uses a memory-mapped C parser instead of the row-by-row R-level loop that base R inherited from the 1990s.
Try it: Regenerate the CSV at 10,000 rows and rerun the three timings. Does the ratio between the readers stay roughly the same, or does it shrink?
Click to reveal solution
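One self-contained way to run the experiment, wrapped in a helper so both row counts use identical code (the helper name and columns are illustrative; timings will differ on your machine):

```r
library(readr)
library(data.table)

time_readers <- function(n) {
  dat <- data.frame(id = seq_len(n), value = rnorm(n))
  tmp <- tempfile(fileext = ".csv")
  write.csv(dat, tmp, row.names = FALSE)
  c(read.csv = system.time(read.csv(tmp))[["elapsed"]],
    read_csv = system.time(read_csv(tmp, show_col_types = FALSE))[["elapsed"]],
    fread    = system.time(fread(tmp))[["elapsed"]])
}

time_readers(10000)  # compare against a time_readers(50000) run
```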
Explanation: The ratios shrink a bit at 10k rows because constant overhead (parser startup, file open) is now a bigger share of the total time. fread() still wins, but the gap is narrower than at 50k rows.
How does each function differ in syntax and defaults?
The three readers do the same job but hand you back three different objects. That difference matters more than it looks: the return type controls how you subset, how it prints, and which downstream packages it plays with cleanly.
read.csv() returns a plain data.frame. read_csv() returns a tibble, which is also a data.frame but prints only the first 10 rows and respects column types more strictly. fread() returns a data.table, which is also a data.frame but supports a different [i, j, by] indexing syntax. The good news: all three inherit from data.frame, so any function that expects a data.frame accepts any of them.
fread() goes one step further and accepts a shell command like "unzip -p archive.zip data.csv" in a local R session, handy for compressed pipelines, though not available inside the browser sandbox.

Try it: Convert r3 (the data.table) into a tibble using tibble::as_tibble() and check its class.
Click to reveal solution
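A sketch of the round trip. The r3 here is a small stand-in data.table, since the one read earlier depends on the tutorial's temp file:

```r
library(data.table)
library(tibble)

r3 <- data.table(x = 1:3, y = letters[1:3])  # stand-in for the fread() result
tb <- as_tibble(r3)
class(tb)  # "tbl_df" "tbl" "data.frame"

# And the reverse trip
dt <- as.data.table(tb)
class(dt)  # "data.table" "data.frame"
```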
Explanation: as_tibble() strips the data.table class and wraps the same underlying columns in a tibble. The reverse trip uses data.table::as.data.table().
Why is fread so much faster than read.csv?
read.csv() is a thin wrapper around read.table(), written when files were small and CPUs single-cored. It allocates row-by-row and infers types by inspecting every value. fread() was rewritten from scratch in C: it samples rows for type inference instead of scanning all of them, allocates whole columns at once, and parses multiple columns in parallel when more than one CPU core is available.
Let's see the parallel side directly by forcing single-threaded mode and comparing.
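A sketch of the comparison using data.table's real thread controls, setDTthreads() and getDTthreads(); the 100,000-row temp file is a stand-in:

```r
library(data.table)

tmp <- tempfile(fileext = ".csv")
fwrite(data.table(a = rnorm(1e5), b = sample(letters, 1e5, TRUE)), tmp)

old_threads <- getDTthreads()
setDTthreads(1)              # force single-threaded parsing
t1 <- system.time(fread(tmp))[["elapsed"]]

setDTthreads(old_threads)    # restore the default thread count
tn <- system.time(fread(tmp))[["elapsed"]]

c(single_thread = t1, default_threads = tn)
```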
On this small file the threading gain is modest: there isn't enough work to spread across cores. On a real 1 GB CSV with 20 columns, the same call typically scales near-linearly up to four threads. Threading also doesn't help if your bottleneck is a slow disk: you can only feed bytes to the parser as fast as the filesystem hands them over.
Try it: Run fread() on the temp file with verbose = TRUE and look at the report; it tells you exactly how the parser sized columns and how many threads it used.
Click to reveal solution
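One way to run it, writing a small stand-in temp file first:

```r
library(data.table)

tmp <- tempfile(fileext = ".csv")
fwrite(data.table(x = 1:1000, y = rnorm(1000)), tmp)

# verbose = TRUE prints separator detection, the type-inference sample,
# and the number of threads used
dt <- fread(tmp, verbose = TRUE)
```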
Explanation: verbose = TRUE prints the parser's internal stages. Step 6 (separator detection) and step 7 (type inference from a sample) are exactly where fread() skips work that read.csv() repeats for every value.
Does the speed advantage hold for tiny files?
Below about 1 MB, the constant overhead of starting a parser dominates the measurement. The 40× headline disappears once your file shrinks to a few hundred rows, and on truly tiny files, read.csv() can even win because it doesn't pay the cost of loading a package.
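To see this directly, shrink the experiment to 32 rows. This is a sketch; tmp_small and its columns are stand-in names:

```r
dat32 <- data.frame(x = 1:32, y = sample(letters, 32, TRUE))
tmp_small <- tempfile(fileext = ".csv")
write.csv(dat32, tmp_small, row.names = FALSE)

system.time(read.csv(tmp_small))
system.time(readr::read_csv(tmp_small, show_col_types = FALSE))
system.time(data.table::fread(tmp_small))
```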
On a 32-row file, all three finish in single-digit milliseconds, and the ranking is essentially noise. There is no meaningful "winner" at this scale. The speed comparison only becomes interesting once your file gets large enough that you actually feel the wait.
If a file this small is already loading fine with read.csv(), switching to fread() saves you nothing measurable. Save the optimization effort for the slow files where it pays off, usually 100 MB and up.

Try it: Time read_csv() on tmp_small with progress = FALSE and see if the elapsed time changes meaningfully.
Click to reveal solution
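A self-contained version of the timing, with tmp_small recreated here as a stand-in 32-row file:

```r
tmp_small <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:32), tmp_small, row.names = FALSE)

system.time(readr::read_csv(tmp_small, show_col_types = FALSE))
system.time(readr::read_csv(tmp_small, show_col_types = FALSE, progress = FALSE))
```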
Explanation: The progress bar adds almost nothing here because the read finishes faster than the bar ever appears. Progress reporting only matters on long reads where seeing motion is genuinely useful.
When should you pick readr instead of fread?
Speed is one axis. The other axes are: friendly tibble output, locale-aware date and decimal parsing, structured warnings when a column doesn't match its expected type, and the explicit col_types specification, readr's killer feature for production pipelines.
Specifying col_types upfront does two important things. First, it locks the schema: if a column shows up as character because of a stray comma, read_csv() will warn instead of silently coercing. Second, it skips the type-inference step entirely, so the read is also faster than letting read_csv() guess. For a recurring ETL pipeline, this is the difference between catching schema drift on day one and finding it weeks later.
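A sketch of a locked schema. The file and column names are hypothetical; cols(), col_character(), and col_double() are readr's real specification interface:

```r
library(readr)

tmp <- tempfile(fileext = ".csv")
write_csv(data.frame(id = c("007", "008"), amount = c(19.99, 5.25)), tmp)

spec <- cols(
  id     = col_character(),  # never coerced to numeric
  amount = col_double()
)
dat <- read_csv(tmp, col_types = spec)  # no guessing pass; warns on mismatch
sapply(dat, class)
```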
Try it: Read tmp_csv again but pass col_types = cols(.default = "c") to force every column as character. Inspect the column classes.
Click to reveal solution
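A self-contained version, with tmp_csv recreated here as a small stand-in file:

```r
library(readr)

tmp_csv <- tempfile(fileext = ".csv")
write_csv(data.frame(id = 1:3, value = c(1.5, 2.5, 3.5)), tmp_csv)

all_chr <- read_csv(tmp_csv, col_types = cols(.default = "c"))
sapply(all_chr, class)  # every column comes back "character"
```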
Explanation: The .default = "c" shortcut tells read_csv() to treat every column as character regardless of contents. This is the safest mode for a first look at unfamiliar data; you can convert types after you've inspected the values.
How do they handle messy data and column types differently?
Real CSVs are messier than mtcars. ID columns have leading zeros. Date columns mix formats. NA strings show up as "NA", "", "-", or "N/A" depending on which intern wrote the export script. The three readers disagree about how to treat each of these, and the disagreements are the source of most "why does my data look wrong?" support questions.
The classic trap is the leading-zero column. Watch what happens to four ZIP-style codes when each function reads them.
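A sketch with four zero-padded codes. Note that what read_csv() and fread() do here depends on package version and settings; fread()'s keepLeadingZeros argument (off by default in recent data.table) is the explicit switch:

```r
zips <- c("01010", "02134", "10001", "00501")
tmp <- tempfile(fileext = ".csv")
writeLines(c("id", zips), tmp)

read.csv(tmp)$id                                    # numeric: zeros destroyed
readr::read_csv(tmp, show_col_types = FALSE)$id     # check what your version guessed
data.table::fread(tmp)$id
data.table::fread(tmp, keepLeadingZeros = TRUE)$id  # character, zeros kept
```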
read.csv() saw four numbers and helpfully converted them to integers, destroying the leading zeros forever. Whether read_csv() and fread() preserve the codes depends on version and settings: recent data.table also drops leading zeros unless you pass keepLeadingZeros = TRUE, and readr can still guess double for zero-padded values. The reliable habit, whichever reader you use, is to declare ID-like columns as character explicitly.
read.csv() will quietly turn ZIP codes, account numbers, and product SKUs into integers, and you won't notice until the join keys stop matching. Always inspect ID columns after import, regardless of which reader you used.

Try it: Re-read the same file with read.csv() but pass colClasses = c(id = "character") to fix the issue without switching readers.
Click to reveal solution
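A minimal version of the fix, recreating the stand-in file so the snippet runs on its own:

```r
tmp <- tempfile(fileext = ".csv")
writeLines(c("id", "01010", "02134", "10001", "00501"), tmp)

# A named colClasses vector pins the id column to character before parsing
fixed <- read.csv(tmp, colClasses = c(id = "character"))
fixed$id  # "01010" "02134" "10001" "00501"
```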
Explanation: Pre-specifying colClasses overrides the automatic type guess. It's the base-R equivalent of readr's col_types argument: slightly clunkier syntax, but exactly as effective.
Practice Exercises
Exercise 1: Pick the right reader for a given file
You have tmp_csv from earlier in this tutorial. Read it back as a tibble with all columns as character, in a single function call. Save the result to my_tibble.
Click to reveal solution
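One possible solution, with tmp_csv recreated here as a stand-in so the snippet runs standalone:

```r
library(readr)

tmp_csv <- tempfile(fileext = ".csv")  # stand-in for the tutorial's file
write_csv(data.frame(id = 1:3, value = c(1.1, 2.2, 3.3)), tmp_csv)

my_tibble <- read_csv(tmp_csv, col_types = cols(.default = "c"))
class(my_tibble)  # "tbl_df" "tbl" "data.frame"
```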
Explanation: read_csv() returns a tibble by default, and the .default = "c" shortcut forces every column to character in one shot.
Exercise 2: Benchmark and report
Write a function time_all(path) that takes a CSV path, times all three readers on it, and returns a sorted data.frame with two columns, reader and elapsed_sec, fastest first. Test it on tmp_csv and save the result to my_bench.
Click to reveal solution
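A sketch of one solution, using the column names the exercise specifies (the test file at the end is a stand-in for tmp_csv):

```r
time_all <- function(path) {
  readers <- list(
    read.csv = function(p) utils::read.csv(p),
    read_csv = function(p) readr::read_csv(p, show_col_types = FALSE, progress = FALSE),
    fread    = function(p) data.table::fread(p)
  )
  elapsed <- vapply(readers, function(f) system.time(f(path))[["elapsed"]], numeric(1))
  out <- data.frame(reader = names(elapsed), elapsed_sec = unname(elapsed))
  out[order(out$elapsed_sec), ]  # fastest first
}

tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x = rnorm(1000)), tmp, row.names = FALSE)
my_bench <- time_all(tmp)
my_bench
```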
Explanation: Wrapping the three timings in a single function lets you re-run the benchmark on any file with one call, which is how you'd actually compare readers on your own production CSVs.
Exercise 3: Defend against leading-zero loss
Write a CSV with a zip column containing c("01010", "02134", "10001"). Read it back with read.csv() so that the result preserves all leading zeros. Save to my_zips.
Click to reveal solution
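One solution:

```r
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(zip = c("01010", "02134", "10001")), tmp, row.names = FALSE)

# colClasses keeps the zip column as character, preserving the zeros
my_zips <- read.csv(tmp, colClasses = c(zip = "character"))
my_zips$zip  # "01010" "02134" "10001"
```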
Explanation: colClasses lets read.csv() keep the column as character. Without it, the zips become 1010, 2134, and 10001, a silent bug that breaks every downstream join on ZIP code.
Complete Example
Here's an end-to-end import workflow that ties the lessons together: generate a 5,000-row CSV with mixed types (an ID column with leading zeros, a numeric column, and a category), read it safely with fread() while pre-declaring types, and summarise it.
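A sketch of that workflow. The column names are illustrative; colClasses is fread()'s real argument for pre-declaring types:

```r
library(data.table)

# 1. Generate a 5,000-row CSV with mixed types
set.seed(1)
n <- 5000
dat <- data.frame(
  id       = sprintf("%05d", sample(99999, n, replace = TRUE)),
  amount   = round(runif(n, 1, 500), 2),
  category = sample(c("A", "B", "C"), n, replace = TRUE)
)
path <- tempfile(fileext = ".csv")
fwrite(dat, path)

# 2. Read it safely, pre-declaring the id column so leading zeros survive
dt <- fread(path, colClasses = c(id = "character"))

# 3. Summarise with data.table's [i, j, by] syntax
dt[, .(mean_amount = mean(amount), n_rows = .N), by = category]
```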
The whole pipeline (write, read, summarise) runs in under a second on this 87 KB file, and the leading zeros in the id column survive intact thanks to colClasses. The same recipe scales to a several-hundred-MB file just by raising the row count, with fread() handling the increase far better than the alternatives.
Summary
| Function | Package | Returns | Speed (1 GB CSV) | Best for |
|---|---|---|---|---|
| read.csv() | base R | data.frame | Slowest | Tiny files, zero-dependency scripts |
| read_csv() | readr | tibble | Mid | tidyverse pipelines, strict schemas, locale parsing |
| fread() | data.table | data.table | Fastest | Big files, ETL, ad-hoc analysis |
Three takeaways:
- For files above ~100 MB, fread() is the default choice. It typically wins by 5× to 40× over base R and by ~8× over readr, and the gap grows with file size.
- For small files, the choice doesn't matter. All three finish in milliseconds. Pick based on the return type you want.
- Always pre-declare column types for production pipelines. colClasses (base), col_types (readr), and colClasses (data.table) all give you schema enforcement and shave time off the read.
References
- data.table, fread() reference manual.
- readr, read_csv() reference.
- R Core Team, read.csv() documentation (utils package).
- Wickham, H. & Grolemund, G., R for Data Science, 2nd Edition, Chapter 7: Data Import.
- Gillespie, C. & Lovelace, R., Efficient R Programming, Chapter 5: Input/Output.
- Appsilon, Fast Data Loading from Files to R.
- Cook, D., Speeding up Reading and Writing in R.
Continue Learning
- Importing Data in R, the parent guide that covers reading CSV, Excel, JSON, SQL, and 12 other formats end to end.
- R Data Types, once your data is loaded, you'll want to understand which types each column ended up as and why it matters.
- dplyr Tutorial, the natural next step after import: filter, group, and summarise with the tidyverse.