dplyr select() in R: Choose, Rename and Reorder Columns
The select() function in dplyr returns a data frame with only the columns you ask for. You can pick columns by name, position, range, or pattern, and rename or reorder them in the same call.
select(df, name, height, mass) # by name select(df, -films, -vehicles) # drop with minus select(df, name:eye_color) # range with colon select(df, starts_with("hair")) # by prefix select(df, where(is.numeric)) # by type select(df, character = name) # rename inside select select(df, name, mass, everything()) # reorder, keep rest
Need explanation? Read on for examples and pitfalls.
What select() does in one sentence
select() is a column subsetter. You pass a data frame and a list of columns (or rules for choosing columns), and you get back a data frame with just those columns. Unlike base R [, ...], it gives you helpers for pattern matching (starts_with(), contains(), where()), supports negative selection with -, and combines naturally with the pipe |>.
When you reach for select(), you usually have one of three goals: drop the columns you do not need before further analysis, bring forward the columns that matter for a plot or a join, or rename and reorder columns while you pick them. All three are first-class in select(), which is why almost every dplyr pipeline starts or ends with it.
Syntax
select() takes a data frame plus one or more column expressions. The expressions can be bare names, ranges, negative names, helper functions like starts_with(), or rename pairs like new = old.
The full signature is:
select(.data, ...)
.data is a data frame, tibble, or grouped data frame. The ... argument takes one or more column expressions, which can be any of:
- Bare column names:
name, height, mass - Negative names to exclude:
-name - Ranges with the colon operator:
name:mass(everything fromnametomass) - Helper functions:
starts_with("hair"),contains("color"),where(is.numeric) - Renames inside the call:
new_name = old_name
The return value is always a data frame (or tibble) with only the chosen columns, in the order you specified.
select() works with or without |>. The pipe just makes longer pipelines readable. Inside a pipe, the data argument is implicit, so you skip .data.Six common patterns
1. Pick columns by name
2. Drop columns with a minus sign
3. Use ranges with the colon operator
4. Pattern-matching helpers
Tidyselect helpers shine when you have many columns. Use starts_with(), contains(), and where() to match by prefix, substring, or column type without typing every name.
5. Rename inside select()
If you only want to rename without dropping other columns, use rename() instead.
6. Reorder by listing in the order you want
everything() fills in the remaining columns, preserving their original order.
select() vs base R column subsetting
select() wraps tidyselect, the same helper grammar used elsewhere in dplyr. That is the real reason to prefer it over base R bracket subsetting: the same starts_with(), where(), any_of() helpers also work inside summarise(across(...)), pivot_longer(), and mutate(across(...)). Learning them once pays off across the entire tidyverse.
| Task | dplyr | Base R |
|---|---|---|
| Pick by name | select(df, a, b) |
df[, c("a", "b")] |
| Drop by name | select(df, -a) |
df[, !names(df) %in% "a"] |
| Pick by pattern | select(df, starts_with("x")) |
df[, grep("^x", names(df))] |
| Pick by type | select(df, where(is.numeric)) |
df[, sapply(df, is.numeric)] |
| Rename + select | select(df, new = old) |
two steps |
When to use which:
- Use
select()for readable pipelines and pattern matching. - Use base R
[, ...]when you have no other tidyverse code in the project, or when you want zero package dependencies.
select(). Once you learn starts_with(), contains(), where(), any_of(), and all_of(), the same syntax works in mutate(across(...)), summarise(across(...)), pivot_longer(), and other dplyr/tidyr verbs. This is why investing in select() pays off everywhere.Common pitfalls
Pitfall 1: forgetting parentheses on helpers. Helpers are functions, not bare names.
Pitfall 2: confusing select() with filter(). select() picks columns; filter() picks rows. New users sometimes try select(starwars, height > 100), which errors. For row selection by condition, use filter().
select() and filter() is the most common dplyr mistake. If a query errors with "object not found" inside select(), check whether you actually meant filter() for a row condition. Same the other way: if you get all rows back when you expected fewer, you may have used select() where filter() was needed.Pitfall 3: .data quoting in functions. Inside your own function, refer to columns with .data[[col]] or {{ col }} to avoid name lookup surprises. This is the tidy-evaluation rabbit hole; for one-off scripts, the bare-name form is fine.
Try it yourself
Try it: From starwars, select only the columns name, height, and mass, and rename mass to body_mass_kg. Save to ex_compact.
Click to reveal solution
Explanation: Inside select(), the form new_name = old_name renames a column at the same time it picks it. Other columns are dropped because they are not listed.
Related dplyr functions
After mastering select(), look at:
rename(): rename columns without dropping anyrelocate(): move columns to a specific positionpull(): extract one column as a vector instead of a data framemutate(): create or transform columnseverything(),last_col(),any_of(),all_of(): more tidyselect helpers
any_of() and all_of() deserve a special note. Both take a character vector of column names. all_of() errors if any name is missing, while any_of() silently keeps only the names that exist. Use all_of() when missing columns indicate a real bug; use any_of() when columns may legitimately appear in some inputs and not others, for example across different export versions of the same source.
FAQ
How do I select multiple columns by index in dplyr?
Use a vector of positions: select(starwars, c(1, 3, 5)) returns the first, third, and fifth columns. You can also use a range: select(starwars, 1:5) returns the first five columns.
What is the difference between select() and pull() in dplyr?
select() returns a data frame even if you ask for a single column. pull() extracts one column and returns it as a plain vector. Use select(df, x) when downstream code expects a data frame; use pull(df, x) when you want to feed the column to a function that takes a vector.
Can I select columns conditionally?
Yes, with where() for type-based selection: select(starwars, where(is.numeric)). For more complex predicates: select(starwars, where(~ mean(.x, na.rm = TRUE) > 100)) keeps numeric columns whose mean exceeds 100.
How do I rename a column without dropping the others?
Use rename(): starwars |> rename(character = name). rename() keeps every column and only changes the names you specify. select() with renaming would drop columns you didn't list.
Does dplyr select() work with grouped data frames?
Yes. Grouping variables are always retained, even if you don't list them. starwars |> group_by(species) |> select(name) returns species and name.