r-statistics.co by Selva Prabhakaran


Data Wrangling With dplyr

dplyr is the most popular R package for data manipulation. It provides a consistent set of verbs that help you solve the most common data manipulation challenges: selecting columns, filtering rows, creating new variables, summarising, and joining datasets.

Why dplyr?

Base R can accomplish most data manipulation tasks, but dplyr makes your code more readable, faster, and easier to debug. The key advantages are:

  1. Readable syntax — verb-based functions describe what you’re doing
  2. Pipe operator (%>%) — chain operations together naturally
  3. Speed — optimized C++ backend for large datasets
  4. Consistency — same grammar works with data frames, databases, and Spark

Setup

Install and load dplyr (part of the tidyverse):

install.packages("dplyr")  # install once
library(dplyr)              # load for each session

We’ll use the built-in mtcars dataset throughout this tutorial:

head(mtcars)
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

1. select() — Pick Columns

Use select() to keep only the columns you need:

# Select specific columns
mtcars %>% select(mpg, cyl, hp) %>% head(3)
#>                mpg cyl  hp
#> Mazda RX4     21.0   6 110
#> Mazda RX4 Wag 21.0   6 110
#> Datsun 710    22.8   4  93

# Select a range of columns
mtcars %>% select(mpg:hp) %>% head(3)

# Drop columns with minus sign
mtcars %>% select(-cyl, -disp) %>% head(3)

# Select columns by pattern
mtcars %>% select(starts_with("d")) %>% head(3)
#>                disp drat
#> Mazda RX4       160 3.90
#> Mazda RX4 Wag   160 3.90
#> Datsun 710      108 3.85

Other selection helpers: ends_with(), contains(), matches(), everything().

2. filter() — Pick Rows

Use filter() to keep rows that match conditions:

# Cars with mpg greater than 25
mtcars %>% filter(mpg > 25)
#>                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
#> Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
#> Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
#> Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
#> Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
#> Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2

# Multiple conditions (AND)
mtcars %>% filter(mpg > 20, cyl == 4) %>% nrow()
#> [1] 10

# OR conditions
mtcars %>% filter(cyl == 4 | cyl == 6) %>% nrow()
#> [1] 18

# Using %in% for multiple values
mtcars %>% filter(cyl %in% c(4, 6)) %>% nrow()
#> [1] 18

3. mutate() — Create New Columns

Use mutate() to add new columns or transform existing ones:

# Create new columns
mtcars %>%
  mutate(
    kpl = mpg * 0.4251,           # miles per gallon to km per liter
    hp_per_cyl = hp / cyl,        # horsepower per cylinder
    is_efficient = mpg > 20      # logical flag
  ) %>%
  select(mpg, kpl, hp, cyl, hp_per_cyl, is_efficient) %>%
  head(4)
#>                mpg     kpl  hp cyl hp_per_cyl is_efficient
#> Mazda RX4     21.0  8.9271 110   6   18.33333         TRUE
#> Mazda RX4 Wag 21.0  8.9271 110   6   18.33333         TRUE
#> Datsun 710    22.8  9.6923  93   4   23.25000         TRUE
#> Hornet 4 Dr   21.4  9.0969 110   6   18.33333         TRUE

Use transmute() if you only want to keep the new columns (drops all others).

4. arrange() — Sort Rows

Use arrange() to sort your data:

# Sort by mpg ascending
mtcars %>% arrange(mpg) %>% head(3)
#>                      mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
#> Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
#> Camaro Z28          13.3   8  350 245 3.73 3.840 15.41  0  0    3    4

# Sort descending
mtcars %>% arrange(desc(mpg)) %>% head(3)
#>                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Toyota Corolla 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
#> Fiat 128       32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1
#> Honda Civic    30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2

# Sort by multiple columns
mtcars %>% arrange(cyl, desc(mpg)) %>% head(5)
#>                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
#> Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
#> Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
#> Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
#> Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1

5. summarise() + group_by() — Aggregate Data

The real power of dplyr comes from combining group_by() with summarise():

# Overall summary
mtcars %>%
  summarise(
    avg_mpg = mean(mpg),
    max_hp = max(hp),
    n = n()
  )
#>    avg_mpg max_hp  n
#> 1 20.09062    335 32

# Group by cylinder count
mtcars %>%
  group_by(cyl) %>%
  summarise(
    avg_mpg = round(mean(mpg), 1),
    avg_hp = round(mean(hp), 1),
    count = n()
  )
#> # A tibble: 3 x 4
#>     cyl avg_mpg avg_hp count
#>   <dbl>   <dbl>  <dbl> <int>
#> 1     4    26.7   82.6    11
#> 2     6    19.7  122.3     7
#> 3     8    15.1  209.2    14

Other useful summary functions: median(), sd(), min(), max(), first(), last(), n_distinct().

6. The Pipe Operator %>%

The pipe passes the result of the left side as the first argument to the right side. It turns nested calls into a readable sequence:

# Without pipe (nested, hard to read):
head(arrange(filter(mtcars, cyl == 4), desc(mpg)), 3)

# With pipe (reads top-to-bottom):
mtcars %>%
  filter(cyl == 4) %>%
  arrange(desc(mpg)) %>%
  head(3)
#>                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Toyota Corolla 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
#> Fiat 128       32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1
#> Honda Civic    30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2

R 4.1+ native pipe: You can also use |> (built into R, no package needed). It works identically for most cases.

7. Joining Datasets

dplyr provides SQL-style joins for combining two data frames by a common key:

# Create two example data frames
orders <- data.frame(
  id = c(1, 2, 3, 4),
  product = c("A", "B", "A", "C")
)

prices <- data.frame(
  product = c("A", "B", "D"),
  price = c(10, 20, 30)
)

# Inner join: only matching rows
inner_join(orders, prices, by = "product")
#>   id product price
#> 1  1       A    10
#> 2  3       A    10
#> 3  2       B    20

# Left join: keep all rows from left table
left_join(orders, prices, by = "product")
#>   id product price
#> 1  1       A    10
#> 2  2       B    20
#> 3  3       A    10
#> 4  4       C    NA

# Anti join: rows in left with NO match in right
anti_join(orders, prices, by = "product")
#>   id product
#> 1  4       C

Other joins: right_join(), full_join(), semi_join().

8. Useful Helper Functions

# count() - shortcut for group_by + summarise(n = n())
mtcars %>% count(cyl, sort = TRUE)
#>   cyl  n
#> 1   8 14
#> 2   4 11
#> 3   6  7

# slice() - pick rows by position
mtcars %>% slice(1:3)

# distinct() - unique rows
mtcars %>% distinct(cyl, gear)
#>   cyl gear
#> 1   6    4
#> 2   4    4
#> 3   8    3
#> 4   6    3
#> 5   4    3
#> 6   4    5
#> 7   8    5
#> 8   6    5

# rename() - rename columns
mtcars %>% rename(miles_per_gallon = mpg) %>% head(2)

# pull() - extract a column as a vector
mtcars %>% filter(cyl == 4) %>% pull(mpg)
#>  [1] 22.8 24.4 22.8 32.4 30.4 33.9 21.5 27.3 26.0 30.4 21.4

9. Putting It All Together

A complete analysis pipeline using all the verbs:

# Question: For each cylinder group, which cars have above-average mpg?
result <- mtcars %>%
  group_by(cyl) %>%
  mutate(avg_mpg_in_group = mean(mpg)) %>%
  filter(mpg > avg_mpg_in_group) %>%
  select(mpg, cyl, hp, wt, avg_mpg_in_group) %>%
  arrange(cyl, desc(mpg)) %>%
  ungroup()

print(result)
#> # A tibble: 15 x 5
#>      mpg   cyl    hp    wt avg_mpg_in_group
#>    <dbl> <dbl> <dbl> <dbl>            <dbl>
#>  1  33.9     4    65 1.835             26.7
#>  2  32.4     4    66 2.200             26.7
#>  3  30.4     4    52 1.615             26.7
#>  4  30.4     4   113 1.513             26.7
#>  5  27.3     4    66 1.935             26.7
#>  ...

Summary

VerbPurposeSQL Equivalent
select()Pick columnsSELECT
filter()Pick rowsWHERE
mutate()Create/transform columnsSELECT expr AS col
arrange()Sort rowsORDER BY
summarise()Aggregate rowsGROUP BY + aggregation
group_by()Group for aggregationGROUP BY
*_join()Combine tablesJOIN

For more on data manipulation, see the dplyr documentation and the dplyr cheat sheet.