Tidy Data in R: What It Is & Why Every Analysis Depends on Getting It Right
Tidy data has one variable per column, one observation per row, and one value per cell. Most R tools — ggplot2, dplyr, statistical models — assume tidy data. Getting the shape right is often 80% of the work.
Look at a spreadsheet where months are spread across columns: Jan, Feb, Mar, each holding sales numbers. That's convenient for humans to read but painful for R to analyze. Tidy data would have a single "month" column and a single "sales" column — more rows, but every R function works instantly.
The Three Rules
Here's a tidy dataset. Each variable (country, year, population) lives in its own column. Each row is one observation (one country in one year).
Now here's the same information stored in a messy (wide) layout. The years have become column names — a variable's values are spread across the header row.
messy <- data.frame(
country = c("France", "Germany"),
pop_2024 = c(67.0, 83.0),
pop_2025 = c(67.4, 83.2)
)
cat("Messy (wide):\n")
print(messy)
cat("\nWhy is this messy? 'year' is a variable, but its values (2024, 2025)")
cat("\nare trapped inside column names instead of living in their own column.\n")
Rule
Meaning
Common violation
Each variable is a column
One concept per column
Year values used as column names
Each observation is a row
One measured unit per row
Multiple measurements crammed into one row
Each value is a cell
One data point per cell
"120/80" storing two blood-pressure numbers
Why Tidy Data Matters
With tidy data, dplyr and ggplot2 just work — no gymnastics needed.
Try doing either of those with the wide version — you'd need manual column selection, custom functions, or reshaping first. Tidy data removes that friction.
pivot_longer(): Wide to Long
When variable values are spread across column names, stack them into rows with pivot_longer().
names_prefix strips the leading text. names_transform converts the extracted name to the right type. This avoids a separate mutate step after pivoting.
pivot_wider(): Long to Wide
Sometimes you need the reverse — spread rows into columns for a summary table or a specific visualization.
Pattern 4: Multiple Observational Units in One Table
# Messy: student info + test scores mixed together
# Fix: split into two tables linked by student_id
students <- data.frame(id = 1:3, name = c("Alice","Bob","Carol"), age = c(20,22,21))
scores <- data.frame(student_id = c(1,1,2,2,3,3), test = c("midterm","final","midterm","final","midterm","final"), score = c(88,92,76,81,95,97))
cat("Students table:\n"); print(students)
cat("\nScores table:\n"); print(scores)
cat("\nJoin when you need both:\n")
merge(students, scores, by.x = "id", by.y = "student_id")
Pattern 5: Implicit Missing Values
library(tidyr)
# Messy: Product B has no February row — it's implicitly missing
sales <- data.frame(
product = c("A","A","A","B","B"),
month = c("Jan","Feb","Mar","Jan","Mar"),
revenue = c(100, 120, 110, 200, 180)
)
cat("Implicit NA (B-Feb missing):\n"); print(sales)
# Make it explicit
complete(sales, product, month, fill = list(revenue = 0))
Round-Trip: Wide → Long → Wide
A good test of understanding: pivot to long, then back to wide, and verify the data is unchanged.
library(tidyr)
original <- data.frame(
city = c("NYC", "LA", "Chicago"),
jan = c(33, 58, 26),
feb = c(35, 60, 29),
mar = c(44, 62, 39)
)
long <- pivot_longer(original, cols = -city,
names_to = "month", values_to = "temp")
back_to_wide <- pivot_wider(long, names_from = month, values_from = temp)
cat("Original:\n"); print(original)
cat("\nRound-tripped:\n"); print(back_to_wide)
cat("\nIdentical:", all.equal(original, as.data.frame(back_to_wide)), "\n")
Practice Exercises
Exercise 1: Reshape Weather Data
Convert this wide weather table to tidy (long) format with columns: city, month, temperature.
library(tidyr)
weather <- data.frame(
city = c("NYC", "LA", "Chicago"),
jan_temp = c(33, 58, 26),
feb_temp = c(35, 60, 29),
mar_temp = c(44, 62, 39)
)
# Reshape to long format
# Hint: use names_prefix to strip "_temp"
**Explanation:** `names_from` tells R which column's values become new column names. `values_from` tells it where the cell values come from.
Exercise 3: Fix Mixed-Variable Column
Separate this messy "stat" column into proper tidy format.
library(tidyr)
messy <- data.frame(
country = c("France","France","Germany","Germany"),
stat = c("pop_millions","area_km2","pop_millions","area_km2"),
value = c(67, 551695, 83, 357022)
)
# Convert so "pop_millions" and "area_km2" are separate columns
Click to reveal solution
```r
library(tidyr)
messy <- data.frame(
country = c("France","France","Germany","Germany"),
stat = c("pop_millions","area_km2","pop_millions","area_km2"),
value = c(67, 551695, 83, 357022)
)
pivot_wider(messy, names_from = stat, values_from = value)
**Explanation:** When variables are stored as rows (stat column holds "pop_millions" and "area_km2"), `pivot_wider` turns each unique value into its own column.
Summary
Concept
Function
Direction
Wide → Long
pivot_longer(cols, names_to, values_to)
Column names become values in a new column
Long → Wide
pivot_wider(names_from, values_from)
Row values become new column names
Split one column
separate(col, into, sep)
"120/80" → systolic + diastolic
Combine columns
unite(col, ..., sep)
year + month → "2026-03"
Fill implicit NAs
complete(vars, fill)
Add missing combinations
FAQ
How do I know if my data is tidy?
Ask three questions: (1) Is each variable in its own column? (2) Is each observation in its own row? (3) Is each value in its own cell? If any answer is "no," you need reshaping.
When is wide format actually better?
Wide format is fine for display tables, correlation matrices, and some time series operations. Use tidy (long) format for analysis and plotting. Convert to wide only for final presentation.
What replaced gather() and spread()?
pivot_longer() replaced gather() and pivot_wider() replaced spread() in tidyr 1.0 (2019). The new functions have a clearer API and handle more edge cases. Use them for all new code.
Does pivot_longer always increase the number of rows?
Yes. It trades columns for rows. If you pivot 4 quarter columns, each original row becomes 4 rows. The total data is the same — just reshaped.