R Data Import Exercises: 10 read_csv(), read_excel() Practice Problems
Ten practical exercises drill read_csv(), read_delim(), column types, missing values, and multi-file imports in R, each with a runnable solution and inline output you can verify in your browser.
Introduction
Reading the readr reference page is one thing. Applying it to a file with stray metadata rows, leading-zero zip codes, and three different spellings of "missing" is another. These ten problems close that gap. Each one targets a specific import skill that trips real analysts on their first messy CSV.
You will start with straight CSV reads, then move to column-type control and NA handling, and finish by combining multiple files with mismatched columns. All solutions run in one shared R session, so use ans1, ans2, ... in your own attempts to avoid overwriting the setup datasets. One exercise uses readxl for Excel files; its code and expected output are shown inline.
If read_csv() and col_types are new to you, skim the parent Importing Data in R tutorial first. Otherwise, run the Setup block and begin.
Setup: The CSV Snippets We Will Use
Instead of scattering raw CSV strings across every exercise, we define them once here. These are tiny on purpose, a few rows each, so you can eyeball the expected output and catch mistakes without scrolling. Run the block below once; every exercise after it reuses these objects.
Take a quick look at the shape of each snippet. csv_basic has 3 rows and 4 columns. csv_na contains five rows with missing values expressed as blank, N/A, -, and -999, a deliberate mess. csv_big is the only snippet with real volume (1,000 rows), and it only exists so Exercise 8 can demonstrate n_max.
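A minimal version of the setup block, consistent with the shapes described above. The column names and cell values beyond what the exercises specify (item names, departments, cities, the y and z columns of csv_big) are illustrative assumptions, not the original data.

```r
library(readr)

# Each snippet is a string with embedded newlines, so read_csv()
# treats it as literal data rather than a file path.
csv_basic <- "item,price,qty,in_stock
widget,2.50,10,TRUE
gadget,5.25,3,FALSE
gizmo,1.75,8,TRUE"

csv_pipe <- "id|name|dept
1|Ana|Sales
2|Ben|IT
3|Cal|HR"

csv_zip <- "zipcode,phone,city
01234,0115550100,Lowell
90210,3105550199,Beverly Hills"

# Five rows; exactly four cells are missing, spelled four different ways.
csv_na <- "id,name,score
1,Ana,88
2,,N/A
3,Cal,-
4,Dan,-999
5,Eve,91"

csv_dates <- "date,sales
03/30/2026,120
03/31/2026,95"

csv_meta <- "Report: Weekly Sales
Generated: 2026-03-30
---
region,units,revenue
East,10,250
West,7,175"

csv_jan <- "product,units
A,10
B,5"

csv_feb <- "product,units
A,12
B,7"

# 1,000 rows of filler so Exercise 8 can demonstrate n_max.
csv_big <- paste0(
  "x,y,z\n",
  paste(1:1000, (1:1000) * 2, rep(c("a", "b"), 500),
        sep = ",", collapse = "\n"))
```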
Warm-Up: Read CSV Basics (Exercises 1-3)
The first three problems build muscle memory for read_csv() and read_delim(). If you can finish them without peeking at the reveal, you own the fundamentals. Each expected result is stated up front so you can self-check.
Exercise 1: Read a CSV and report the inferred column types
Use read_csv() to parse csv_basic into a data frame, then print the parsed tibble. Save it as ans1. Expected: 3 rows, 4 columns, with price numeric, qty numeric (readr reads whole numbers as doubles), and in_stock logical.
Click to reveal solution
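One possible solution, using the csv_basic object from the Setup block:

```r
# Pass the string directly; read_csv() treats text with newlines
# as literal data. show_col_types = FALSE suppresses the spec message.
ans1 <- read_csv(csv_basic, show_col_types = FALSE)
ans1  # 3 rows x 4 columns; check the <dbl>/<lgl> tags under each name
```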
Explanation: read_csv() reads comma-separated text (here, a string passed directly instead of a file path) and infers each column's type from the first 1,000 rows. price becomes <dbl> because it has decimals, qty also becomes <dbl> because readr uses double as its default numeric type, and in_stock becomes <lgl> because its values are all TRUE/FALSE. show_col_types = FALSE hides the auto-spec message, useful in tutorials, but keep it on when debugging real data.
Exercise 2: Read a pipe-delimited file
read_csv() expects commas. For other delimiters, use read_delim(). Parse csv_pipe and save the result as ans2. Expected: 3 rows, 3 columns.
Click to reveal solution
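A solution sketch, using the csv_pipe object from the Setup block:

```r
# read_delim() is the general reader; only the delimiter changes.
ans2 <- read_delim(csv_pipe, delim = "|", show_col_types = FALSE)
dim(ans2)  # expected: 3 rows, 3 columns
```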
Explanation: read_delim() is the general-purpose reader; you tell it the delimiter and it handles everything else the same way read_csv() does. Pipes (|) are common in database exports because commas appear inside free-text fields, so switching the delimiter avoids the need for quoting. The readr shortcuts read_csv(), read_tsv(), and read_csv2() (semicolon, European convention) are just read_delim() with the delimiter pre-set.
Exercise 3: Preserve leading zeros in ID columns
read_csv() sees 01234 as the number 1234 and drops the leading zero. For zip codes, product SKUs, and phone numbers that is a silent data-loss bug. Parse csv_zip so that zipcode and phone are read as strings. Save as ans3.
Click to reveal solution
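A solution sketch, using the csv_zip object from the Setup block:

```r
# Override only the identifier columns; others keep inferred types.
ans3 <- read_csv(csv_zip,
                 col_types = cols(zipcode = col_character(),
                                  phone   = col_character()))
ans3$zipcode  # leading zeros preserved, e.g. "01234" stays "01234"
```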
Explanation: The col_types argument lets you override read_csv()'s type guesses on a per-column basis. col_character() forces the column to stay as text, so leading zeros survive. You only need to declare the columns you want to override; the rest keep their inferred types. The shorthand col_types = "ccc" (one letter per column) does the same thing when every column is the same type.
read_csv() looks at the top of the file to guess types. That guess is wrong whenever a column looks numeric but is actually an identifier: zip codes, SKUs, phone numbers, product codes with leading zeros. Always declare col_types for identifier columns, even when the first few rows look clean.
Core Challenges: Column Types and Missing Data (Exercises 4-6)
Real CSV files rarely come clean. These three exercises fix the three most common header-to-data quirks: missing values in disguise, dates in regional formats, and report metadata stuffed above the real header row.
Exercise 4: Handle multiple NA representations in one file
csv_na uses four different conventions for missing values: blank, N/A, -, and -999. Read it so that all four become NA in R. Save to ans4 and confirm the total NA count equals 4.
Click to reveal solution
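A solution sketch, using the csv_na object from the Setup block:

```r
# Declare every spelling of "missing" before type inference runs.
ans4 <- read_csv(csv_na,
                 na = c("", "NA", "N/A", "-", "-999"),
                 show_col_types = FALSE)
sum(is.na(ans4))  # expected: 4
```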
Explanation: read_csv() treats "" and "NA" as missing by default. Anything else (N/A, -, -999, or a custom sentinel) has to be declared through the na argument. Pass a character vector of every spelling you want converted, and readr applies them before type inference runs. That matters: if you forget "-999", the score column would be read as numeric with an outlier, and imputing the mean would quietly poison your analysis.
Exercise 5: Parse dates in MM/DD/YYYY format
US-style dates (03/30/2026) are read as character strings by default. Parse csv_dates so the date column comes back as a real Date. Save to ans5 and confirm class(ans5$date) is "Date".
Click to reveal solution
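A solution sketch, using the csv_dates object from the Setup block:

```r
# col_date() parses at read time using strptime format codes.
ans5 <- read_csv(csv_dates,
                 col_types = cols(date = col_date(format = "%m/%d/%Y")))
class(ans5$date)  # expected: "Date"
```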
Explanation: col_date() turns a character column into a true Date at read time, using the strptime format codes (%Y year, %m month, %d day). Parsing dates up front is worth the extra keystrokes; once the column is a Date you get sort, filter, and arithmetic for free. The alternative, parsing later with lubridate::mdy(), works but means your column is wrong until you remember to fix it.
Exercise 6: Skip metadata header rows
csv_meta has three lines of human-readable metadata (Report:, Generated:, ---) before the real header. Read it so the tibble has two rows and three columns. Save to ans6.
Click to reveal solution
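A solution sketch, using the csv_meta object from the Setup block:

```r
# Skip the three metadata lines; the fourth line becomes the header.
ans6 <- read_csv(csv_meta, skip = 3, show_col_types = FALSE)
dim(ans6)  # expected: 2 rows, 3 columns
```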
Explanation: skip = N tells readr to ignore the first N lines entirely; they never enter the parser, so the next line becomes the header. This works for any fixed-length preamble. For variable-length metadata (for example, "skip until you hit a blank line"), use comment = "#" or read the whole file with read_lines() and filter before parsing. Pair skip with col_names when the file has no header at all.
skip counts lines, not rows. If a metadata block wraps inside quotes or spans multiple logical lines, skip = 3 may cut through the middle of a record. Always preview raw files with read_lines(path, n_max = 10) before guessing how many lines to skip.
Advanced: Multiple Files and Real-World Edge Cases (Exercises 7-10)
The final four exercises cover the situations where an import job becomes a data pipeline: filtering columns at read time, peeking at giant files, combining several files into one tibble, and reading a genuine Excel workbook.
Exercise 7: Read only selected columns
Reading every column when you only need three is a waste on wide files. Use col_select to parse csv_big and keep only the x column. Save to ans7 and confirm it has one column and 1,000 rows.
Click to reveal solution
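A solution sketch, using the csv_big object from the Setup block:

```r
# col_select takes tidyselect syntax; unselected columns are never parsed.
ans7 <- read_csv(csv_big, col_select = x, show_col_types = FALSE)
dim(ans7)  # expected: 1000 rows, 1 column
```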
Explanation: col_select supports the full tidyselect vocabulary, the same DSL dplyr::select() uses. You can pass bare names (col_select = x), helpers (col_select = starts_with("value_")), or negation (col_select = !c(notes, metadata)). Columns you drop are never parsed, so on wide files the performance gain is real, not just a post-read convenience.
Exercise 8: Read only the first few rows
When you want a fast preview of a file (schema, first few values, rough shape), you do not need to read the whole thing. Use n_max to read the first 5 rows of csv_big and save to ans8.
Click to reveal solution
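A solution sketch, using the csv_big object from the Setup block:

```r
# n_max caps the number of data rows; the header is still read.
ans8 <- read_csv(csv_big, n_max = 5, show_col_types = FALSE)
nrow(ans8)  # expected: 5
```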
Explanation: n_max caps the number of data rows read, not counting the header. Pair it with col_types = cols(.default = col_character()) when you just want to look at raw text and do not care about type inference yet. For really large files, n_max is much faster than reading everything and then calling head(), because read_csv() stops parsing the moment it hits the row limit.
Exercise 9: Combine two CSVs and tag the source
You have two monthly sales files in csv_jan and csv_feb. Read both, add a month column to each showing which file it came from, and stack them into one tibble. Save to ans9. Expected: 4 rows, 3 columns.
Click to reveal solution
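A solution sketch, using the csv_jan and csv_feb objects from the Setup block (dplyr supplies mutate() and bind_rows()):

```r
library(dplyr)

# Tag each piece with its source BEFORE binding, while you still
# know which file each row came from.
ans9 <- bind_rows(
  read_csv(csv_jan, show_col_types = FALSE) |> mutate(month = "Jan"),
  read_csv(csv_feb, show_col_types = FALSE) |> mutate(month = "Feb")
)
dim(ans9)  # expected: 4 rows, 3 columns
```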
Explanation: bind_rows() stacks data frames vertically and aligns columns by name, not position; missing columns become NA, which saves you from silent misalignment bugs. Tagging each piece with its source (month = "Jan") before binding is the key habit: once rows are mixed, you cannot tell them apart. For many files, wrap this pattern in purrr::map_dfr(files, ~read_csv(.x) |> mutate(source = .x)) to read and tag in one pass.
bind_rows() over rbind() for imported data. bind_rows() handles mismatched columns, type promotion, and factor levels gracefully. Base R's rbind() errors on the first column mismatch and silently mangles factor columns when types differ across files.
Exercise 10: Read an Excel workbook with readxl
Excel files need a different package: readxl::read_excel(). You pass a file path and (optionally) a sheet name or index. Write the code to read the first sheet of "sales_2026.xlsx" into ans10.
This exercise cannot run in the shared browser session because it needs a local file and the readxl package. Run the code below in local RStudio or any desktop R session. The expected output is shown so you can still verify your code against it.
Click to reveal solution
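A solution sketch. It assumes "sales_2026.xlsx" exists in your working directory; the file itself is not part of this tutorial's setup.

```r
library(readxl)

# sheet accepts a 1-based index or a sheet name as a string.
ans10 <- read_excel("sales_2026.xlsx", sheet = 1)
ans10
```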
Explanation: read_excel() auto-detects .xls vs .xlsx from the file extension, so one function covers both formats. The sheet argument accepts either a number (1-based index) or a sheet name as a string. Use excel_sheets("file.xlsx") first if you want to inspect sheet names before reading. For multi-sheet reads in one call, purrr::map(excel_sheets(path), ~read_excel(path, sheet = .x)) returns a named list of tibbles, one per sheet.
readr and readxl share one mental model: you describe the file, and R parses it. Whether the source is CSV, TSV, pipe-delimited, or Excel, the workflow is the same: point at the file, declare the column types you care about, and let the parser handle the rest. Learning one package teaches you most of the other.
Common Mistakes and How to Fix Them
Three import habits that quietly break downstream analysis, with fixes you can copy.
Mistake 1: Using base read.csv() instead of read_csv()
Bad:
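A hedged illustration; the filename "sales.csv" is hypothetical:

```r
# Bad: base reader returns a plain data frame, mangles non-syntactic
# column names, and (pre-R 4.0) turned strings into factors.
df <- read.csv("sales.csv")
```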
Good:
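A hedged illustration; the filename "sales.csv" is hypothetical:

```r
library(readr)

# Good: returns a tibble, never creates factors, parses faster.
df <- read_csv("sales.csv")
```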
Why it matters: Base read.csv() has historic defaults that bite: it returns a plain data frame (not a tibble), it silently rewrites non-syntactic column names (spaces and special characters become dots), and pre-R 4.0 it converted every character column to a factor. read_csv() returns a tibble, leaves column names alone, and never creates factors. It is also 2-10x faster on large files because its parser is written in C++.
Mistake 2: Losing leading zeros on identifier columns
Bad:
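A hedged illustration; the filename and zipcode column are hypothetical:

```r
library(readr)

# Bad: zipcode is guessed as numeric, so "01234" silently becomes 1234.
customers <- read_csv("customers.csv")
```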
Good:
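A hedged illustration; the filename and column names are hypothetical:

```r
library(readr)

# Good: force identifier columns to character so leading zeros survive.
customers <- read_csv("customers.csv",
                      col_types = cols(zipcode = col_character(),
                                       phone   = col_character()))
```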
Why it matters: readr sees digits and guesses "number". That is correct for quantities but wrong for identifiers. The fix is a one-line col_types override. Make it a rule: any column whose values are codes, IDs, account numbers, or phone numbers starts life as col_character(), regardless of what the first few rows look like.
Mistake 3: Forgetting to declare custom NA strings
Bad:
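A hedged illustration; the filename is hypothetical:

```r
library(readr)

# Bad: -999 stays in the data as a real number, and "-" can flip a
# numeric column to character. Every downstream mean() is now wrong.
scores <- read_csv("scores.csv")
```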
Good:
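A hedged illustration; the filename is hypothetical:

```r
library(readr)

# Good: declare every missing-value spelling before trusting any calculation.
scores <- read_csv("scores.csv", na = c("", "NA", "N/A", "-", "-999"))
```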
Why it matters: Analytics code downstream (mean(), sum(), filter()) does not know that -999 means missing. Leaving sentinel values in place contaminates every summary statistic. Check the data dictionary of every new source and pass every missing-value spelling to the na argument before you trust any calculation.
Summary
Quick reference for each exercise and the skill it tests.
| Exercise | Skill | Key function / argument |
|---|---|---|
| 1 | Parse a standard CSV | `read_csv()` |
| 2 | Parse a non-comma delimiter | `read_delim(delim = "\|")` |
| 3 | Preserve leading zeros | `col_types = cols(col_character())` |
| 4 | Handle multiple NA conventions | `na = c(...)` |
| 5 | Parse dates in MM/DD/YYYY | `col_date(format = "%m/%d/%Y")` |
| 6 | Skip metadata header lines | `skip = N` |
| 7 | Read only selected columns | `col_select` |
| 8 | Preview first N rows of a big file | `n_max = N` |
| 9 | Combine multiple CSVs | `bind_rows()` + source tag |
| 10 | Read Excel workbooks | `readxl::read_excel()` |
FAQ
When should I use read_csv() instead of base read.csv()?
Always, for new code. read_csv() is faster on large files, returns a tibble, parses dates at read time, and never silently converts strings to factors. Base read.csv() is fine for legacy scripts that rely on its quirks, but every new import should default to the readr family.
How do I force a column to a specific type during read?
Use the col_types argument with cols(): read_csv("file.csv", col_types = cols(id = col_character(), date = col_date("%Y-%m-%d"))). Any column you do not mention keeps its inferred type. The compact shorthand col_types = "cDd" (one letter per column: c character, D date, d double) works when you want to override every column at once.
What NA strings does read_csv() treat as missing by default?
Only two: an empty string ("") and the literal text "NA". Everything else, N/A, -, -999, NULL, None, #N/A, is read as a normal value. Pass the full set to the na argument whenever you import data from a new source.
How do I read a very large CSV efficiently in R?
For files over a few hundred megabytes, data.table::fread() is usually the fastest option and auto-detects delimiter, header, and column types. If you need a tibble, stay on readr::read_csv() but pre-declare col_types to skip type inference, and use col_select to drop columns you will not use. For files too large for memory, use vroom::vroom() (lazy column reads) or arrow::read_csv_arrow() (Apache Arrow backend), both of which defer I/O until you actually touch each column.
References
- readr documentation, read_csv() reference
- readr documentation, cols() and column specification
- Wickham, H. & Grolemund, G., R for Data Science, 2nd Edition, Chapter 7: Data Import
- Tidyverse blog, readr 2.0.0 release notes
- readxl documentation, read_excel() reference
- R Core Team, R Data Import/Export manual
- data.table documentation, fread() reference
- vroom package, fast delimited file reading
Continue Learning
Now that you can import CSVs, handle column types, and combine files, build on these foundations:
- Importing Data in R, review the parent tutorial if any exercise stumped you
- Tidy Data in R, reshape imported data into long or wide form for analysis
- dplyr Exercises, practise the filter, select, and summarise verbs you will reach for right after an import