Data Wrangling With dplyr
dplyris the most popular R package for data manipulation. It provides a consistent set of verbs that help you solve the most common data manipulation challenges: selecting columns, filtering rows, creating new variables, summarising, and joining datasets.
Why dplyr?
Base R can accomplish most data manipulation tasks, but dplyr makes your code more readable, faster, and easier to debug. The key advantages are:
- Readable syntax — verb-based functions describe what you’re doing
- Pipe operator (
%>%) — chain operations together naturally - Speed — optimized C++ backend for large datasets
- Consistency — same grammar works with data frames, databases, and Spark
Setup
Install and load dplyr (part of the tidyverse):
install.packages("dplyr") # install once
library(dplyr) # load for each sessionWe’ll use the built-in mtcars dataset throughout this tutorial:
head(mtcars)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
#> Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
#> Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#> Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
#> Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 11. select() — Pick Columns
Use select() to keep only the columns you need:
# Select specific columns
mtcars %>% select(mpg, cyl, hp) %>% head(3)
#> mpg cyl hp
#> Mazda RX4 21.0 6 110
#> Mazda RX4 Wag 21.0 6 110
#> Datsun 710 22.8 4 93
# Select a range of columns
mtcars %>% select(mpg:hp) %>% head(3)
# Drop columns with minus sign
mtcars %>% select(-cyl, -disp) %>% head(3)
# Select columns by pattern
mtcars %>% select(starts_with("d")) %>% head(3)
#> disp drat
#> Mazda RX4 160 3.90
#> Mazda RX4 Wag 160 3.90
#> Datsun 710 108 3.85Other selection helpers: ends_with(), contains(), matches(), everything().
2. filter() — Pick Rows
Use filter() to keep rows that match conditions:
# Cars with mpg greater than 25
mtcars %>% filter(mpg > 25)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#> Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
#> Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
#> Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
#> Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
#> Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
# Multiple conditions (AND)
mtcars %>% filter(mpg > 20, cyl == 4) %>% nrow()
#> [1] 10
# OR conditions
mtcars %>% filter(cyl == 4 | cyl == 6) %>% nrow()
#> [1] 18
# Using %in% for multiple values
mtcars %>% filter(cyl %in% c(4, 6)) %>% nrow()
#> [1] 183. mutate() — Create New Columns
Use mutate() to add new columns or transform existing ones:
# Create new columns
mtcars %>%
mutate(
kpl = mpg * 0.4251, # miles per gallon to km per liter
hp_per_cyl = hp / cyl, # horsepower per cylinder
is_efficient = mpg > 20 # logical flag
) %>%
select(mpg, kpl, hp, cyl, hp_per_cyl, is_efficient) %>%
head(4)
#> mpg kpl hp cyl hp_per_cyl is_efficient
#> Mazda RX4 21.0 8.9271 110 6 18.33333 TRUE
#> Mazda RX4 Wag 21.0 8.9271 110 6 18.33333 TRUE
#> Datsun 710 22.8 9.6923 93 4 23.25000 TRUE
#> Hornet 4 Dr 21.4 9.0969 110 6 18.33333 TRUEUse transmute() if you only want to keep the new columns (drops all others).
4. arrange() — Sort Rows
Use arrange() to sort your data:
# Sort by mpg ascending
mtcars %>% arrange(mpg) %>% head(3)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4
#> Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4
#> Camaro Z28 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4
# Sort descending
mtcars %>% arrange(desc(mpg)) %>% head(3)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
#> Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#> Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
# Sort by multiple columns
mtcars %>% arrange(cyl, desc(mpg)) %>% head(5)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
#> Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#> Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
#> Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
#> Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 15. summarise() + group_by() — Aggregate Data
The real power of dplyr comes from combining group_by() with summarise():
# Overall summary
mtcars %>%
summarise(
avg_mpg = mean(mpg),
max_hp = max(hp),
n = n()
)
#> avg_mpg max_hp n
#> 1 20.09062 335 32
# Group by cylinder count
mtcars %>%
group_by(cyl) %>%
summarise(
avg_mpg = round(mean(mpg), 1),
avg_hp = round(mean(hp), 1),
count = n()
)
#> # A tibble: 3 x 4
#> cyl avg_mpg avg_hp count
#> <dbl> <dbl> <dbl> <int>
#> 1 4 26.7 82.6 11
#> 2 6 19.7 122.3 7
#> 3 8 15.1 209.2 14Other useful summary functions: median(), sd(), min(), max(), first(), last(), n_distinct().
6. The Pipe Operator %>%
The pipe passes the result of the left side as the first argument to the right side. It turns nested calls into a readable sequence:
# Without pipe (nested, hard to read):
head(arrange(filter(mtcars, cyl == 4), desc(mpg)), 3)
# With pipe (reads top-to-bottom):
mtcars %>%
filter(cyl == 4) %>%
arrange(desc(mpg)) %>%
head(3)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
#> Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#> Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2R 4.1+ native pipe: You can also use |> (built into R, no package needed). It works identically for most cases.
7. Joining Datasets
dplyr provides SQL-style joins for combining two data frames by a common key:
# Create two example data frames
orders <- data.frame(
id = c(1, 2, 3, 4),
product = c("A", "B", "A", "C")
)
prices <- data.frame(
product = c("A", "B", "D"),
price = c(10, 20, 30)
)
# Inner join: only matching rows
inner_join(orders, prices, by = "product")
#> id product price
#> 1 1 A 10
#> 2 3 A 10
#> 3 2 B 20
# Left join: keep all rows from left table
left_join(orders, prices, by = "product")
#> id product price
#> 1 1 A 10
#> 2 2 B 20
#> 3 3 A 10
#> 4 4 C NA
# Anti join: rows in left with NO match in right
anti_join(orders, prices, by = "product")
#> id product
#> 1 4 COther joins: right_join(), full_join(), semi_join().
8. Useful Helper Functions
# count() - shortcut for group_by + summarise(n = n())
mtcars %>% count(cyl, sort = TRUE)
#> cyl n
#> 1 8 14
#> 2 4 11
#> 3 6 7
# slice() - pick rows by position
mtcars %>% slice(1:3)
# distinct() - unique rows
mtcars %>% distinct(cyl, gear)
#> cyl gear
#> 1 6 4
#> 2 4 4
#> 3 8 3
#> 4 6 3
#> 5 4 3
#> 6 4 5
#> 7 8 5
#> 8 6 5
# rename() - rename columns
mtcars %>% rename(miles_per_gallon = mpg) %>% head(2)
# pull() - extract a column as a vector
mtcars %>% filter(cyl == 4) %>% pull(mpg)
#> [1] 22.8 24.4 22.8 32.4 30.4 33.9 21.5 27.3 26.0 30.4 21.49. Putting It All Together
A complete analysis pipeline using all the verbs:
# Question: For each cylinder group, which cars have above-average mpg?
result <- mtcars %>%
group_by(cyl) %>%
mutate(avg_mpg_in_group = mean(mpg)) %>%
filter(mpg > avg_mpg_in_group) %>%
select(mpg, cyl, hp, wt, avg_mpg_in_group) %>%
arrange(cyl, desc(mpg)) %>%
ungroup()
print(result)
#> # A tibble: 15 x 5
#> mpg cyl hp wt avg_mpg_in_group
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 33.9 4 65 1.835 26.7
#> 2 32.4 4 66 2.200 26.7
#> 3 30.4 4 52 1.615 26.7
#> 4 30.4 4 113 1.513 26.7
#> 5 27.3 4 66 1.935 26.7
#> ...Summary
| Verb | Purpose | SQL Equivalent |
|---|---|---|
select() | Pick columns | SELECT |
filter() | Pick rows | WHERE |
mutate() | Create/transform columns | SELECT expr AS col |
arrange() | Sort rows | ORDER BY |
summarise() | Aggregate rows | GROUP BY + aggregation |
group_by() | Group for aggregation | GROUP BY |
*_join() | Combine tables | JOIN |
For more on data manipulation, see the dplyr documentation and the dplyr cheat sheet.