dplyr mutate(): Create New Columns, Transform Existing Ones, 8 Real Examples
In dplyr, mutate() adds new columns or modifies existing ones, letting you engineer features, apply conditional logic, and transform many columns in a single expression. It's the verb you'll reach for every time the raw data needs shaping before analysis.
How does mutate() create a new column?
mutate() takes a data frame and any number of new_column = expression arguments. Each expression can reference other columns by bare name and is computed vectorially across all rows.
One line: new column, computed from an existing one. Multiple new columns at once work too, and later ones can reference earlier ones within the same mutate() call.
This left-to-right resolution is a feature, not a bug. It lets you build derived columns in logical order without chaining multiple mutate() calls.
Try it: Add a weight_tons column to mtcars by dividing wt by 2.205 (kips per ton).
Click to reveal solution
wt / 2.205 is computed element-wise across all 32 rows and stored into the new column weight_tons. The select(wt, weight_tons) afterwards just trims the output to the two relevant columns so you can eyeball the conversion side-by-side, the rest of the data frame is unchanged.
How do you modify an existing column in place?
Assign to a column name that already exists and mutate() overwrites it. This is the dplyr way to clean, rescale, or retype data.
Overwriting is safe: the original data frame isn't mutated, and a new one is returned. If you want to keep both the old and new values, use a different name like score_pct.
Try it: Upper-case the Species column of iris.
Click to reveal solution
Because Species is a factor, toupper() coerces it to character first and the mutate() replaces the column with the upper-cased version, the factor levels are lost. If you need to preserve factor structure, wrap the result in factor() afterwards: Species = factor(toupper(Species)).
How do you apply conditional logic with if_else() and case_when()?
if_else() is dplyr's strict version of base ifelse(): it checks that both branches return the same type, so you can't accidentally get a character column back from a numeric operation.
When you need more than two branches, case_when() is cleaner than nested if_else():
The TRUE ~ "large" is the catch-all, every row that didn't match an earlier condition falls through to it. Always include one, or you'll get NAs where no branch matched.
case_when() conditions are evaluated top-to-bottom and the first match wins. Order them from most-specific to most-general, and put the TRUE ~ ... catch-all last.Try it: Use case_when() to add a cylinder_class column: "small" for 4 cyl, "mid" for 6, "large" for 8.
Click to reveal solution
case_when() evaluates each condition top-down and assigns the matching label to a new character column. Because every row's cyl is one of 4, 6, or 8, no row falls through the three branches, you'd only need a TRUE ~ "other" catch-all if a fourth cylinder count were possible.
How do you transform many columns at once with across()?
When you need to apply the same function to several columns, across() is the answer. It plugs into mutate() (and summarise()) and takes two arguments: which columns and what function.
The ~ round(., 1) is a compact anonymous function: . stands for the current column. You can also pass a named function directly: across(where(is.numeric), log).
To run multiple functions at once and get multiple new columns, pass a named list:
The .names = "{.col}_{.fn}" glue pattern controls the output column names, {.col} is the original column name and {.fn} is the function name.
Try it: Use across() to take the log of every numeric column in iris.
Click to reveal solution
across(where(is.numeric), log) walks every column, keeps only the numeric ones, and applies log() element-wise, the factor column Species is left untouched. Passing log as a bare function name is the cleanest form when no extra arguments are needed; reach for the ~ lambda syntax (~ log(.)) only when you need to wire in other arguments.
How do you rank, lag, and lead values within a column?
dplyr bundles a set of window functions specifically for ordering and time-series work. rank(), dense_rank(), row_number(), lag(), and lead() are the ones you'll reach for most.
lag() shifts the vector back by one (introducing NA at the start); lead() shifts forward. rank_rev uses desc() to rank highest-revenue as rank 1. All four are vectorized, no loops needed.
group_by(): df |> group_by(store) |> mutate(day_over_day = revenue - lag(revenue)) computes a per-store change, handling each group independently.Try it: Add a revenue_change column to sales using lag().
Click to reveal solution
lag(revenue) shifts the column one row down, placing NA in row 1 because there's no previous day to compare against. Subtracting the shifted vector from the original produces a day-over-day delta, positive for a jump, negative for a drop. Pair this with group_by() if you need per-group deltas across several stores at once.
How do you drop and rename columns cleanly?
Within mutate(), assigning NULL removes a column. If you only want to rename without adding anything, use rename(), cleaner than a full select().
rename_with() applies a function to rename many columns at once, useful for converting styles en masse (e.g., . to _).
Try it: Use rename_with(toupper) to upper-case every column name in mtcars.
Click to reveal solution
rename_with(toupper) applies the function to every column name at once, no .cols argument means "all columns". The data itself isn't touched; only the column labels change. Pass a tidyselect helper as the second argument (e.g., rename_with(toupper, starts_with("m"))) to target a subset of columns.
What does transmute() do differently?
transmute() is mutate() that drops every column you didn't mention. Use it when you want a clean new data frame with only the computed columns (and any existing columns you explicitly kept).
In modern dplyr (1.0+), mutate(.keep = "none") does the same thing and is slightly more discoverable. Use whichever reads clearly in context.
Try it: Use transmute() to return only mpg and a new log_mpg column from mtcars.
Click to reveal solution
transmute() behaves like mutate() but discards every column you didn't name, the result here has only mpg and the newly computed log_mpg, while the other nine columns of mtcars are dropped. Reach for it when you want a minimal output data frame without chaining a select() afterwards.
Practice Exercises
Exercise 1: Feature engineering
On mtcars, add three features: power_to_weight (hp/wt), displacement_per_cyl (disp/cyl), and fast (TRUE if qsec < 17).
Show solution
Exercise 2: Standardize columns
Rescale every numeric column of iris to z-scores using across().
Show solution
Exercise 3: Conditional transformation
Create an mpg_grade column on mtcars using case_when(): "A" if mpg ≥ 25, "B" if 18-25, "C" if 15-18, "D" otherwise.
Show solution
Putting It All Together
A typical feature-engineering pipeline: load, clean strings, add derived features, transform numeric columns, summarize.
Five new columns, one clean type conversion, one selection, all in one pipeline. That's what idiomatic dplyr feature engineering looks like.
Summary
| Pattern | Verb | Example |
|---|---|---|
| Add column | mutate() |
mutate(kpl = mpg * 0.425) |
| Modify column | mutate() (overwrite) |
mutate(score = score / 100) |
| Drop column | mutate(x = NULL) |
or use select(-x) |
| Two-way condition | if_else() |
if_else(mpg > 25, "high", "low") |
| Multi-way condition | case_when() |
case_when(x > 5 ~ "big", TRUE ~ "small") |
| Same op on many columns | across() |
mutate(across(where(is.numeric), log)) |
| Window function | lag(), lead(), rank() |
mutate(delta = x - lag(x)) |
| Rename one column | rename() |
rename(new = old) |
| Rename many | rename_with() |
rename_with(tolower) |
| Keep only new cols | transmute() or .keep="none" |
transmute(kpl = mpg * 0.425) |
References
- dplyr mutate() reference
- across() reference
- case_when() reference
- R for Data Science, Data Transformation
- dplyr cheat sheet
Continue Learning
- dplyr filter() and select(), the verbs you use before mutate().
- dplyr group_by() + summarise(), aggregate mutated features.
- dplyr arrange(), slice(), top_n(), ordering and top-N queries.