janitor Package in R: Clean Messy Data with 5 Lines of Code
The janitor package provides simple functions like clean_names(), tabyl(), and remove_empty() that fix the messiest spreadsheet exports in just a few lines of R code.
Introduction
Spreadsheet exports are messy. Column names arrive with spaces, special characters, and inconsistent capitalization. Rows that looked fine in Excel turn out to be blank padding. Duplicate records hide in plain sight. If you have ever spent 30 minutes wrestling a CSV into shape before the real analysis begins, janitor is for you.
The janitor package is a lightweight CRAN package built specifically for the dirty work of data cleaning. It gives you a small toolkit of focused functions, each solving one common headache. Instead of writing five lines of base R regex to fix column names, you call clean_names(). Instead of manually scanning for duplicates, you call get_dupes().
In this tutorial, you will learn how to use the five most important janitor functions: clean_names(), remove_empty(), get_dupes(), tabyl(), and row_to_names(). Every code block runs directly in your browser, so you can experiment as you read. By the end, you will clean a messy dataset from scratch in five lines.
How Does clean_names() Fix Column Names?
Messy column names are the single most common data quality issue. Spreadsheet authors use spaces, mix upper and lower case, and sprinkle in special characters. R can handle these names, but only if you wrap every reference in backticks. That slows you down and invites typos.
clean_names() converts all column names to a consistent format. The default is snake_case: all lowercase, words separated by underscores, special characters removed.
Let's create a data frame with intentionally ugly column names and clean them.
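A minimal sketch of what that might look like. The data frame, its values, and the column names here are hypothetical stand-ins for a messy spreadsheet export:

```r
library(janitor)

# Hypothetical data frame with deliberately ugly column names:
# spaces, double spaces, mixed case, and stray punctuation
messy_df <- data.frame(
  `First Name`     = c("Ada", "Grace"),
  `ANNUAL  Salary` = c(98000, 105000),
  `Growth %`       = c(0.12, 0.09),
  `Bonus ($)`      = c(5000, 7500),
  check.names = FALSE
)

clean_df <- clean_names(messy_df)
names(clean_df)
#> "first_name" "annual_salary" "growth_percent" "bonus"
```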
Every space became an underscore, the percent sign became the word "percent", and stray punctuation such as the dollar sign was stripped. Double spaces collapsed into single separators. Now you can type clean_df$first_name instead of messy_df$`First Name`.
The case argument controls the naming convention. Here are the most useful options.
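A quick illustration with a hypothetical two-column data frame, showing three of the case values clean_names() accepts:

```r
library(janitor)

df <- data.frame(`Order Date` = 1, `Unit Price` = 2, check.names = FALSE)

names(clean_names(df, case = "snake"))           # "order_date" "unit_price"
names(clean_names(df, case = "lower_camel"))     # "orderDate"  "unitPrice"
names(clean_names(df, case = "screaming_snake")) # "ORDER_DATE" "UNIT_PRICE"
```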
Most R users stick with the default snake_case, which matches tidyverse conventions. Use lower_camel if your project follows JavaScript-style naming.
A good habit is to pipe every import straight into the cleaner: df <- read_csv("data.csv") |> clean_names(). This prevents column-name bugs before they start.

Try it: Create a data frame with three columns named "Employee ID", "Start Date!", and "Salary (USD)". Clean the names and print them.
Click to reveal solution
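One possible solution, using placeholder values for the three columns:

```r
library(janitor)

df <- data.frame(
  `Employee ID`  = 1:3,
  `Start Date!`  = c("2024-01-02", "2024-02-10", "2024-03-15"),
  `Salary (USD)` = c(50000, 62000, 58000),
  check.names = FALSE
)

df <- clean_names(df)
names(df)
#> "employee_id" "start_date" "salary_usd"
```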
Explanation: clean_names() strips the exclamation mark, converts parentheses and spaces to underscores, and lowercases everything.
How Do You Remove Empty Rows and Columns?
Spreadsheet users love blank rows for visual spacing and blank columns for alignment. When you import that file into R, those blanks become rows and columns full of NA values. They add noise to every summary statistic and break many functions.
remove_empty() strips rows and columns that are entirely NA. You control which dimension to clean with the which argument.
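A sketch with hypothetical data: rows 2 and 4 and the columns blank_a and blank_b are entirely NA, mimicking spreadsheet padding.

```r
library(janitor)

padded <- data.frame(
  name    = c("Ada", NA, "Grace", NA),
  score   = c(10, NA, 20, NA),
  blank_a = NA,   # recycled to an all-NA column
  blank_b = NA
)

tidy <- remove_empty(padded, which = c("rows", "cols"))
dim(tidy)  # 2 rows, 2 columns remain
```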
Two all-NA rows and two all-NA columns disappeared. The data that matters stayed intact.
A related function, remove_constant(), drops columns where every value is the same. These columns carry zero information.
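For example, with a hypothetical sales table whose region column holds the same value in every row:

```r
library(janitor)

sales <- data.frame(
  id     = 1:3,
  region = c("North", "North", "North"),  # constant: zero information
  amount = c(100, 200, 150)
)

remove_constant(sales)  # region is dropped; id and amount remain
```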
The region column had "North" in every row, so remove_constant() dropped it.
"" is not considered empty by remove_empty(). Convert empty strings to NA first with dplyr::na_if(x, "") or mutate(across(where(is.character), ~na_if(.x, ""))).Try it: Create a 4-row data frame where rows 2 and 4 are all NA and one column is entirely NA. Remove the empty rows and columns.
Click to reveal solution
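One possible solution, with made-up city data:

```r
library(janitor)

df <- data.frame(
  city  = c("Lyon", NA, "Nice", NA),   # rows 2 and 4 are all NA
  pop   = c(513000, NA, 342000, NA),
  empty = NA                           # an entirely NA column
)

remove_empty(df, which = c("rows", "cols"))
```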
Explanation: remove_empty() with which = c("rows", "cols") strips both dimensions in one call.
How Does get_dupes() Find Duplicate Records?
Duplicate records corrupt aggregations, inflate counts, and break joins. Base R's duplicated() returns a logical vector, which is useful but limited. You still need extra steps to see the actual duplicate rows and understand why they matched.
get_dupes() returns a data frame of the duplicate rows along with a dupe_count column showing how many times each combination appears. You specify which columns to check.
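Here is a sketch with a hypothetical contacts table. Alice's row is entered twice and Bob's three times, with Bob's third entry using a different email:

```r
library(janitor)

contacts <- data.frame(
  name  = c("Alice", "Bob", "Alice", "Bob", "Carol", "Bob"),
  email = c("a@co.com", "b@co.com", "a@co.com", "b@co.com",
            "c@co.com", "b2@co.com")
)

dupes <- get_dupes(contacts, name)
dupes  # Alice rows carry dupe_count 2, Bob rows dupe_count 3
```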
Alice appears twice and Bob appears three times. The dupe_count column tells you the frequency instantly. Notice that Bob's third record has a different email, so checking by name and email together would separate it.
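Checking by both columns looks like this (the same hypothetical contacts table is re-created so the snippet stands alone):

```r
library(janitor)

contacts <- data.frame(
  name  = c("Alice", "Bob", "Alice", "Bob", "Carol", "Bob"),
  email = c("a@co.com", "b@co.com", "a@co.com", "b@co.com",
            "c@co.com", "b2@co.com")
)

get_dupes(contacts, name, email)  # the unique Bob/b2@co.com row drops out
```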
Now Bob's b2@co.com record is excluded because the name-email combination is unique.
By contrast, duplicated() returns TRUE/FALSE, so you need extra filtering to inspect the actual rows. get_dupes() does the filtering and counting in one step, making it faster to diagnose the problem.

Try it: Create a data frame of 5 orders where order_id 101 appears twice and order_id 103 appears twice. Use get_dupes() to find them.
Click to reveal solution
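One possible solution, with invented order amounts:

```r
library(janitor)

orders <- data.frame(
  order_id = c(101, 102, 101, 103, 103),
  amount   = c(50, 75, 50, 120, 120)
)

get_dupes(orders, order_id)  # four rows: 101 twice, 103 twice
```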
Explanation: Passing order_id to get_dupes() groups by that column and returns only rows where the count exceeds 1.
How Does tabyl() Replace table() for Frequency Tables?
Base R's table() returns an array object that is awkward to manipulate. You cannot easily pipe it into dplyr, add percentage columns, or export it to a report. The result looks cluttered in the console and lacks percentage breakdowns.
tabyl() returns a proper data frame with counts, percentages, and valid percentages (excluding NAs). It plugs directly into tidyverse pipelines and pairs with a suite of adorn_* functions for formatting.
Let's compare the two approaches on the mtcars dataset.
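A minimal side-by-side sketch:

```r
library(janitor)

table(mtcars$cyl)   # base R: a named array of counts, no percentages

tabyl(mtcars, cyl)  # janitor: a data frame with n and percent columns
```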
The tabyl() output is already a data frame. You get counts and percentages without any extra work.
For two-way cross-tabulations, tabyl() accepts two variables. The adorn_* functions then layer on totals, percentage formatting, and combined count-percent displays.
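One possible version of that chain, following the vignette-style ordering of the adorn_* calls:

```r
library(janitor)

mtcars |>
  tabyl(cyl, gear) |>               # two-way cross-tabulation
  adorn_totals("row") |>            # total row on raw counts
  adorn_percentages("row") |>       # row-wise proportions
  adorn_pct_formatting(digits = 1) |>  # "80.0%"-style formatting
  adorn_ns() |>                     # raw counts in parentheses
  adorn_title("combined")           # combined variable-name header
```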
In five lines, you built a publication-ready cross-tabulation with row percentages, raw counts in parentheses, a total row, and a combined header. Try replicating that with base R's table() and prop.table().
Order matters: call adorn_totals() before adorn_percentages(), because totals should be computed on raw counts. Then format percentages, append counts with adorn_ns(), and finish with adorn_title().

Try it: Create a one-way frequency table of mtcars$gear, then add a total row and format percentages to one decimal place.
Click to reveal solution
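One possible solution:

```r
library(janitor)

mtcars |>
  tabyl(gear) |>
  adorn_totals("row") |>
  adorn_pct_formatting(digits = 1)
```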
Explanation: adorn_totals("row") adds a sum row, and adorn_pct_formatting() converts decimal proportions to readable percentages.
How Does row_to_names() Fix Header-Less Spreadsheets?
Some Excel exports bury the real column headers in row 2, 3, or even deeper. The first rows contain merged title cells or metadata that R reads as data. You end up with column names like X1, X2, X3 and actual headers sitting inside the data frame.
row_to_names() promotes any row to become the column names, then optionally removes the rows above it.
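A sketch with a hypothetical export where row 1 is report metadata and row 2 holds the real headers:

```r
library(janitor)

raw <- data.frame(
  X1 = c("Sales Report 2024", "city", "Boston", "Austin"),
  X2 = c(NA, "population", "675647", "961855")
)

fixed <- row_to_names(raw, row_number = 2)
fixed  # columns are now city / population; data starts at Boston
```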
Row 1 (the report metadata) and row 2 (now the header) were both removed. The data starts clean from row 3 onward.
Another common spreadsheet headache is Excel date serial numbers. When a date column reads as 45292 instead of 2024-01-01, excel_numeric_to_date() converts it back.
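For example, converting a couple of serials (under Excel's default "modern" date system):

```r
library(janitor)

serials <- c(45292, 45306)
excel_numeric_to_date(serials)
# 45292 is 2024-01-01; 45306 is 2024-01-15
```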
The serial number 45292 corresponds to January 1, 2024. This function handles the quirky Excel date origin automatically.
To keep the rows above the promoted row, set remove_rows_above = FALSE. The promoted row itself is removed by default (remove_row = TRUE).

Try it: Create a data frame where row 3 contains the real headers ("City", "Population", "Area"). Use row_to_names() to fix it.
Click to reveal solution
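One possible solution, with two rows of invented metadata above the headers:

```r
library(janitor)

raw <- data.frame(
  X1 = c("Census extract", "Updated 2024", "City", "Lyon", "Nice"),
  X2 = c(NA, NA, "Population", "513000", "342000"),
  X3 = c(NA, NA, "Area", "47.9", "28.6")
)

row_to_names(raw, row_number = 3)  # headers: City, Population, Area
```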
Explanation: row_to_names(row_number = 3) promotes the third row to column names and removes everything above it.
Common Mistakes and How to Fix Them
Mistake 1: Forgetting to reassign after clean_names()
❌ Wrong:
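A sketch of the mistake, with a hypothetical one-column data frame:

```r
library(janitor)

bad_df <- data.frame(`Messy Name` = 1:3, check.names = FALSE)
clean_names(bad_df)  # prints a cleaned copy...
names(bad_df)        # ...but bad_df still has "Messy Name"
```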
Why it is wrong: clean_names() returns a new data frame. It does not modify the original in place. If you forget to capture the result, bad_df still has the messy names.
✅ Correct:
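The fix is to reassign (same hypothetical data frame):

```r
library(janitor)

bad_df <- data.frame(`Messy Name` = 1:3, check.names = FALSE)
bad_df <- clean_names(bad_df)  # capture the returned data frame
names(bad_df)                  # "messy_name"
```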
Mistake 2: Using remove_empty() on columns with empty strings
❌ Wrong:
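A sketch of the mistake, assuming a column b full of empty strings:

```r
library(janitor)

df <- data.frame(a = 1:3, b = c("", "", ""))
remove_empty(df, which = "cols")  # column b survives: "" is not NA
```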
Why it is wrong: Column b contains empty strings "", not NA. remove_empty() only removes columns that are entirely NA. The column stays.
✅ Correct:
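Convert the empty strings to NA first, then remove:

```r
library(janitor)
library(dplyr)

df <- data.frame(a = 1:3, b = c("", "", ""))
df |>
  mutate(across(where(is.character), ~ na_if(.x, ""))) |>
  remove_empty(which = "cols")  # now b is all NA and gets dropped
```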
Mistake 3: Expecting get_dupes() to deduplicate
❌ Wrong:
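A sketch of the mistake, with a tiny hypothetical table:

```r
library(janitor)

df <- data.frame(id = c(1, 1, 2))
deduped <- get_dupes(df, id)  # wrong: this RETURNS the duplicate rows
deduped                       # both id-1 rows, plus a dupe_count column
```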
Why it is wrong: get_dupes() reports duplicates -- it does not remove them. Use dplyr::distinct() to actually deduplicate.
✅ Correct:
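Use distinct() to actually drop the duplicates:

```r
library(dplyr)

df <- data.frame(id = c(1, 1, 2))
deduped <- distinct(df)  # keeps one row per unique combination
deduped                  # ids 1 and 2
```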
Mistake 4: Wrong row_number in row_to_names()
❌ Wrong:
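A sketch of the off-by-one mistake, assuming the real headers sit in row 2:

```r
library(janitor)

raw <- data.frame(
  X1 = c("Quarterly export", "city", "Boston"),
  X2 = c(NA, "population", "675647")
)

row_to_names(raw, row_number = 1)  # metadata becomes the header; one name is NA
```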
Why it is wrong: Row 1 contains metadata, not the real headers. The off-by-one error gives you wrong column names and an NA column.
✅ Correct:
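Point row_to_names() at the row that actually holds the headers:

```r
library(janitor)

raw <- data.frame(
  X1 = c("Quarterly export", "city", "Boston"),
  X2 = c(NA, "population", "675647")
)

row_to_names(raw, row_number = 2)  # the real header row
```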
Practice Exercises
Exercise 1: Clean a messy customer dataset
You receive a data frame with ugly column names, two empty rows, one empty column, and duplicate records. Clean it up using a janitor pipeline: fix the names, remove empty rows/columns, and identify duplicates.
Click to reveal solution
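One possible solution, with an invented customer table containing two all-NA rows, one all-NA column, and a duplicated customer 101:

```r
library(janitor)

customers <- data.frame(
  `Customer ID` = c(101, 102, 101, NA, NA),
  `Full Name`   = c("Ann", "Ben", "Ann", NA, NA),
  `Notes`       = NA,   # entirely empty column
  check.names = FALSE
)

cleaned <- customers |>
  clean_names() |>
  remove_empty(which = c("rows", "cols"))

get_dupes(cleaned, customer_id)  # surfaces the duplicated customer 101
```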
Explanation: The pipeline chains clean_names() to fix headers, remove_empty() to drop NA rows/columns, and get_dupes() to surface the duplicate customer 101.
Exercise 2: Build a formatted frequency report
Create a two-way cross-tabulation of mtcars by cyl (rows) and gear (columns). Add column totals, display column percentages formatted to one decimal place, append raw counts, and add a combined title.
Click to reveal solution
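One possible solution, computing totals on raw counts before switching to column percentages:

```r
library(janitor)

mtcars |>
  tabyl(cyl, gear) |>
  adorn_totals("row") |>               # a Total row summing each gear column
  adorn_percentages("col") |>          # column-wise percentages
  adorn_pct_formatting(digits = 1) |>
  adorn_ns() |>                        # raw counts in parentheses
  adorn_title("combined")
```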
Explanation: adorn_percentages("col") computes column-wise percentages. The chain builds a formatted report showing that 80% of 3-gear cars have 8 cylinders.
Exercise 3: Fix a broken spreadsheet export
You have a spreadsheet export where row 3 contains the real headers, the first two rows are metadata, there are empty columns, and a date column contains Excel serial numbers. Fix everything.
Click to reveal solution
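One possible solution, with an invented export: two metadata rows, headers in row 3, an empty column, and Excel serial dates:

```r
library(janitor)
library(dplyr)

raw <- data.frame(
  X1 = c("Sales export", "Q1 2024", "Region", "North", "South"),
  X2 = c(NA, NA, "Order Date", "45292", "45306"),
  X3 = NA_character_   # empty column
)

fixed <- raw |>
  row_to_names(row_number = 3) |>
  remove_empty(which = c("rows", "cols")) |>
  clean_names() |>
  mutate(order_date = excel_numeric_to_date(as.numeric(order_date)))

fixed
```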
Explanation: The pipeline promotes row 3 to headers, strips the blank row and column, cleans the names, and converts Excel serial dates to real dates.
Putting It All Together
Here is the complete "5 lines of code" promise. You start with a realistic messy dataset and clean it end to end.
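A sketch of that end-to-end pipeline. The 7-row, 4-column input is hypothetical: a title row, a buried header row, a blank row, an all-NA column, Excel serial dates, and one duplicate record.

```r
library(janitor)
library(dplyr)

messy <- data.frame(
  X1 = c("Customer Export", "customer", "Ann",   "Ben",   NA, "Ann",   "Cy"),
  X2 = c(NA,                "signup",   "45292", "45306", NA, "45292", "45320"),
  X3 = c(NA,                "region",   "North", "North", NA, "North", "South"),
  X4 = NA_character_
)

clean <- messy |>
  row_to_names(row_number = 2) |>                                # 1 promote headers
  remove_empty(which = c("rows", "cols")) |>                     # 2 drop blank padding
  clean_names() |>                                               # 3 snake_case names
  mutate(signup = excel_numeric_to_date(as.numeric(signup))) |>  # 4 real dates
  add_count(customer, signup, name = "dupe_count")               # 5 flag duplicates

clean  # 4 clean rows; the two Ann records carry dupe_count 2
```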
From a 7-row, 4-column mess to a clean, typed, duplicate-flagged dataset in five lines. Every janitor function did exactly one job and did it well.
Summary
| Function | What It Does | When to Use It |
|---|---|---|
| clean_names() | Converts column names to snake_case | After every file import |
| remove_empty() | Drops all-NA rows and/or columns | Spreadsheet imports with blank padding |
| remove_constant() | Drops columns with a single repeated value | After remove_empty(), before analysis |
| get_dupes() | Returns duplicate rows with counts | Data quality audits |
| tabyl() | Tidy frequency tables with percentages | Replacing base R table() |
| adorn_*() | Formats tabyl output (totals, %, counts) | Building report-ready tables |
| row_to_names() | Promotes a data row to column names | Excel files with buried headers |
| excel_numeric_to_date() | Converts Excel serial dates to R dates | Imported date columns showing numbers |
The janitor package does not try to do everything. It solves the eight most common data cleaning problems with simple, composable functions. Pair it with dplyr for transformations and you can handle almost any messy dataset.
FAQ
Is janitor compatible with the tidyverse pipe?
Yes. Every janitor function takes a data frame as its first argument and returns a data frame. This makes it fully compatible with both the native pipe |> and magrittr's %>%. You can chain read_csv() |> clean_names() |> remove_empty() seamlessly.
Can clean_names() handle non-English characters?
Yes. clean_names() transliterates accented characters to ASCII equivalents. For example, "Résumé" becomes "resume", and "Straße" (German sharp s) becomes "strasse". This uses the snakecase package internally.
Does tabyl() work with more than three variables?
No. tabyl() supports one-way, two-way, and three-way tables (1, 2, or 3 variables). For higher-dimensional frequency tables, use dplyr::count() which handles any number of grouping variables.
How is get_dupes() different from dplyr::distinct()?
They serve opposite purposes. get_dupes() shows you which rows are duplicated and how many times. distinct() removes duplicates and keeps only unique rows. Use get_dupes() to investigate, then distinct() to clean.
Can I use janitor with data.table?
clean_names() works on data.tables directly. However, tabyl() and the adorn_* functions expect data frames or tibbles. Convert with as.data.frame() first, or use data.table's native .[, .N, by = ...] for frequency counts.
References
- Firke, S. — janitor: Simple Tools for Examining and Cleaning Dirty Data. CRAN vignette.
- Firke, S. — janitor GitHub repository.
- CRAN — janitor package reference manual (v2.2.1).
- Wickham, H. & Grolemund, G. — R for Data Science, 2nd Edition. O'Reilly (2023). Chapter 6: Data Tidying.
- Rapp, A. — Easy data cleaning with the janitor package.
- rdrr.io — clean_names() function reference.
- R-bloggers — Easy data cleaning with the janitor package (2024).
What's Next?
- Data Quality Checking in R — The parent guide covering the full data quality workflow, from missing values to outlier detection.
- Missing Values in R — How to detect, count, remove, and impute NA values in your datasets.
- dplyr Tutorial — Master the core data transformation verbs that pair perfectly with janitor's cleaning functions.