dplyr contains() in R: Select Columns by Substring
The contains() helper in dplyr selects columns whose names CONTAIN a given substring (anywhere in the name). It is the substring-match tidyselect helper, complementing starts_with and ends_with.
df |> select(contains("score")) # any column with "score" in name
df |> select(contains("Length")) # case-insensitive default
df |> select(contains("X", ignore.case = FALSE))
df |> mutate(across(contains("amt"), ~ .x * 1.1))
df |> select(-contains("temp")) # drop substring-matched
Need explanation? Read on for examples and pitfalls.
What contains() does in one sentence
contains(match) selects columns whose names contain the literal substring match anywhere. It is used inside dplyr functions that accept tidyselect, such as select(), rename_with(), and across().
Syntax
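contains(match, ignore.case = TRUE, vars = NULL). match is a character vector of literal substrings (not regex), ignore.case controls case sensitivity, and vars is normally left as NULL so the helper uses the columns of the current data.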
Five common patterns
1. Substring match
Both "score_a" and "x_score" match.
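A minimal sketch of the substring match, assuming a hypothetical data frame whose column names (id, score_a, x_score, notes) are invented for illustration:

```r
library(dplyr)

# Hypothetical data: the column names are made up for this example
df <- data.frame(
  id      = 1:3,
  score_a = c(90, 85, 70),
  x_score = c(1.2, 3.4, 5.6),
  notes   = c("a", "b", "c")
)

# "score" appears at the start of one name and at the end of the other
picked <- df |> select(contains("score"))
names(picked)
# "score_a" "x_score"
```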
2. Apply across by substring
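As a sketch of this pattern, the example below scales every column whose name contains "amt" by 10%. The data frame and its column names are assumptions made up for illustration:

```r
library(dplyr)

# Hypothetical data: two "amt" columns plus an id that must stay untouched
df <- data.frame(
  id        = 1:3,
  amt_gross = c(100, 200, 300),
  tax_amt   = c(10, 20, 30)
)

# across() applies the function only to the columns contains() selects
raised <- df |> mutate(across(contains("amt"), ~ .x * 1.1))
raised
```

Columns that do not match, such as id here, pass through unchanged.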
3. Drop by substring
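A short sketch of dropping columns by substring, using a hypothetical data frame (column names invented for illustration). Negating with - or with ! should give the same result:

```r
library(dplyr)

# Hypothetical data: two columns carry the "temp" token
df <- data.frame(
  id        = 1:2,
  temp_flag = c(TRUE, FALSE),
  value     = c(1.5, 2.5),
  x_temp    = c(0, 0)
)

kept_minus <- df |> select(-contains("temp"))
kept_bang  <- df |> select(!contains("temp"))
names(kept_minus)
# "id" "value"
```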
4. Case-sensitive
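The sketch below contrasts the default case-insensitive match with ignore.case = FALSE. The data frame and column names are assumptions chosen so that the two calls differ:

```r
library(dplyr)

# Hypothetical data: one name with an uppercase "X", one with a lowercase "x"
df <- data.frame(
  X_raw    = 1:2,
  x_scaled = c(0.1, 0.2),
  y        = c(5, 6)
)

loose  <- df |> select(contains("X"))                       # default: ignores case
strict <- df |> select(contains("X", ignore.case = FALSE))  # exact case only

names(loose)
# "X_raw" "x_scaled"
names(strict)
# "X_raw"
```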
5. Multiple substrings
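A minimal sketch of passing several substrings at once, assuming a hypothetical data frame with made-up column names. A column is selected if its name contains any one of the substrings:

```r
library(dplyr)

# Hypothetical data: "usd" and "qty" tokens appear in different positions
df <- data.frame(
  id        = 1:2,
  price_usd = c(9.99, 19.99),
  qty       = c(3, 1),
  total_usd = c(29.97, 19.99),
  unit_qty  = c(1, 1)
)

# A character vector means OR: names containing "usd" or "qty"
multi <- df |> select(contains(c("usd", "qty")))
sort(names(multi))
# "price_usd" "qty" "total_usd" "unit_qty"
```

The names are sorted here only to make the result easy to read; the example makes no claim about the column order select() returns.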
contains is the most permissive of the literal-string helpers: it matches the substring anywhere in the name, whereas starts_with and ends_with only match at the start or the end. Use contains when you don't know exactly where the token sits in the name.
contains() vs starts_with() vs ends_with() vs matches()
| Helper | Matches |
|---|---|
| starts_with("x") | Prefix |
| ends_with("y") | Suffix |
| contains("ab") | Anywhere |
| matches("regex") | Regex |
Use contains when the substring's position varies.
A practical workflow
The "audit" pattern uses contains for fuzzy matching of token names.
NA counts for any column with "amount" in the name. Robust to naming inconsistencies.
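One way the audit could look, sketched on a hypothetical data frame (the column names and values are invented for illustration):

```r
library(dplyr)

# Hypothetical data: "amount" appears as a prefix in one name, a suffix in another
df <- data.frame(
  id           = 1:3,
  amount_gross = c(1, NA, 3),
  net_amount   = c(NA, NA, 2)
)

# Count missing values in every column whose name contains "amount"
na_audit <- df |>
  summarise(across(contains("amount"), ~ sum(is.na(.x))))
na_audit
```

Because the match is positional-agnostic, both amount_gross and net_amount are audited without listing them by hand.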
For renaming groups of columns:
Uppercase any column with "score" in its name.
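A minimal sketch of the renaming step using rename_with(), assuming a hypothetical data frame with made-up column names:

```r
library(dplyr)

# Hypothetical data: two "score" columns and one unrelated column
df <- data.frame(
  id      = 1:2,
  score_a = c(90, 85),
  x_score = c(1.2, 3.4)
)

# Apply toupper() only to the names that contain "score"
renamed <- df |> rename_with(toupper, .cols = contains("score"))
names(renamed)
# "id" "SCORE_A" "X_SCORE"
```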
Common pitfalls
Pitfall 1: contains is literal, not regex. contains("a.b") matches the literal "a.b" (dot included). For regex, use matches.
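The difference can be seen in a small sketch. The data frame below is hypothetical, with names chosen so that the literal and regex interpretations of "a.b" disagree:

```r
library(dplyr)

# Hypothetical data: "a.b" has a real dot, "axb" has a letter in that position
df <- data.frame(a.b = 1, axb = 2, ab = 3)

literal <- df |> select(contains("a.b"))  # dot is a literal dot
regex   <- df |> select(matches("a.b"))   # dot means "any single character"

names(literal)
# "a.b"
names(regex)
# "a.b" "axb"
```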
Pitfall 2: case-insensitive default surprises. contains("ID") matches "user_id" and "ID_2" because of ignore.case = TRUE. Pass FALSE if strict.
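A short sketch of this pitfall, using a hypothetical data frame whose column names mirror the ones mentioned above:

```r
library(dplyr)

# Hypothetical data: "id" appears in lowercase and in uppercase
df <- data.frame(user_id = 1:2, ID_2 = 3:4, name = c("a", "b"))

default_match <- df |> select(contains("ID"))                       # ignores case
strict_match  <- df |> select(contains("ID", ignore.case = FALSE))  # uppercase only

names(default_match)
# "user_id" "ID_2"
names(strict_match)
# "ID_2"
```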
Note: contains() accepts a vector of substrings. contains(c("a","b")) selects names containing either "a" OR "b"; a name does not need to contain both. For "AND" logic, use & between two contains calls.
Try it yourself
Try it: Select all iris columns containing "Petal". Save to ex_petal.
Solution:
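One possible solution, sketched with the built-in iris data set:

```r
library(dplyr)

# Keep only the columns whose names contain "Petal"
ex_petal <- iris |> select(contains("Petal"))
names(ex_petal)
# "Petal.Length" "Petal.Width"
```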
Explanation: Two iris columns contain "Petal". Sepal.* columns are excluded.
Related tidyselect helpers
After mastering contains, look at:
- starts_with()/ends_with(): stricter position-based
- matches(): regex
- everything(): all
- where(): predicate
- all_of()/any_of(): explicit name vector
For complex patterns, combine helpers with &, |, !.
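As an illustrative sketch on the built-in iris data, the | and ! operators can be mixed with contains() and other helpers such as where():

```r
library(dplyr)

# Union: columns containing "Sepal", plus any factor column
sepal_or_factor <- iris |> select(contains("Sepal") | where(is.factor))

# Complement: everything except columns containing "Width"
no_width <- iris |> select(!contains("Width"))

names(no_width)
# "Sepal.Length" "Petal.Length" "Species"
```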
FAQ
What does contains do in dplyr?
contains(match) selects columns whose names contain the substring match anywhere.
Is contains case-sensitive?
Not by default: matching ignores case unless you pass ignore.case = FALSE.
Can contains accept multiple substrings?
Yes. contains(c("a","b")) matches names containing either "a" or "b" (OR logic; a name does not need to contain both).
What is the difference between contains and matches?
contains is literal substring; matches uses regex. contains(".") matches a literal period; matches(".") is "any character".
How do I require a column to contain BOTH "a" AND "b"?
Combine: contains("a") & contains("b"). Both conditions must match.
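A minimal sketch of the AND combination, assuming a hypothetical data frame whose column names carry two tokens ("sales" and "2023") invented for illustration:

```r
library(dplyr)

# Hypothetical data: only one column carries both tokens
df <- data.frame(
  sales_2023 = 1:2,
  sales_2024 = 3:4,
  costs_2023 = 5:6,
  region     = c("N", "S")
)

# & keeps only the names that satisfy both conditions
both <- df |> select(contains("sales") & contains("2023"))
names(both)
# "sales_2023"
```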