stringr str_subset() in R: Filter Strings by Pattern
The str_subset() function in stringr keeps elements of a character vector that match a regex pattern. It is the tidyverse replacement for base R grep(pattern, x, value = TRUE).
    str_subset(x, "pattern")                            # regex (default)
    str_subset(x, fixed("text"))                        # literal match
    str_subset(x, regex("pat", ignore_case = TRUE))     # case-insensitive
    str_subset(x, "pattern", negate = TRUE)             # keep non-matches
    str_subset(x, "^a")                                 # starts with "a"
    str_subset(x, "ing$")                               # ends with "ing"
    str_subset(na.omit(x), "pattern")                   # drop NAs explicitly before filtering
Need explanation? Read on for examples and pitfalls.
What str_subset() does in one sentence
str_subset(string, pattern) returns a character vector containing only the elements of string that match pattern. The pattern is a regular expression by default; wrap with fixed() for literal string matching.
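It is the workhorse for filtering character vectors. For input without missing values, str_subset(x, p) returns the same result as x[str_detect(x, p)], but it reads better and is more concise in pipelines. The one difference is NA handling: str_subset() drops NA elements, while logical indexing with str_detect() keeps them as NA.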
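A minimal sketch of that relationship, assuming stringr is installed. The vector x is made up for illustration and contains no missing values.

```r
library(stringr)

# Hypothetical input with no NA elements
x <- c("apple", "banana", "cherry")

# Values that contain "an"
str_subset(x, "an")
#> [1] "banana"

# The same filter written as logical indexing
x[str_detect(x, "an")]
#> [1] "banana"
```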
Syntax
str_subset(string, pattern, negate = FALSE) takes a character vector and a regex pattern, returning the subset that matches.
Five common patterns
1. Basic regex filter
By default the pattern is a regular expression. "run" matches anywhere in each string, so "running", "runs", and "runner" all qualify; "ran" and "walking" are dropped.
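A small example of the filter described above. The vector name verbs is an assumption; the words are the ones mentioned in the text.

```r
library(stringr)

# Illustrative input; "ran" and "walking" do not contain "run"
verbs <- c("running", "ran", "runs", "walking", "runner")

# Keep every element that contains "run" anywhere
str_subset(verbs, "run")
#> [1] "running" "runs"    "runner"
```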
2. Literal (non-regex) match with fixed()
fixed("1.5") treats the pattern as a plain string. Without fixed(), the . is regex for "any character" and would also match "1a5".
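A short sketch contrasting the two behaviours, using a made-up vector called amounts.

```r
library(stringr)

# Illustrative input
amounts <- c("1.5", "1a5", "2.5", "15")

# Regex: the dot matches any single character, so "1a5" also qualifies
str_subset(amounts, "1.5")
#> [1] "1.5" "1a5"

# Literal: only the exact text "1.5" qualifies
str_subset(amounts, fixed("1.5"))
#> [1] "1.5"
```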
3. Case-insensitive filter
Wrap the pattern in regex(pattern, ignore_case = TRUE) to ignore case. "BANANA" is excluded; everything else contains "apple" somewhere, regardless of capitalization.
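One way this could look in practice. The fruit vector is an assumption chosen so that only "BANANA" lacks the text "apple".

```r
library(stringr)

# Illustrative input with mixed capitalization
fruit <- c("apple", "Apple pie", "PINEAPPLE", "BANANA", "crabapple")

# Case-insensitive match on "apple"
str_subset(fruit, regex("apple", ignore_case = TRUE))
#> [1] "apple"     "Apple pie" "PINEAPPLE" "crabapple"

# For comparison, the default case-sensitive match
str_subset(fruit, "apple")
#> [1] "apple"     "crabapple"
```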
4. Negate (keep non-matching strings)
negate = TRUE flips the filter to keep elements that do NOT match. Here \\d matches any digit, so labels containing a digit are dropped.
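A minimal illustration with a hypothetical labels vector, where some labels contain digits and some do not.

```r
library(stringr)

# Illustrative input
labels <- c("item1", "total", "row22", "header", "x9")

# Keep only the labels that contain no digit
str_subset(labels, "\\d", negate = TRUE)
#> [1] "total"  "header"
```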
5. Anchored patterns for start and end matching
^ anchors the pattern to the start of each string, $ to the end. Combine them (^foo$) to match the whole string exactly.
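The three anchoring styles can be sketched as follows. The words vector is made up for illustration.

```r
library(stringr)

# Illustrative input
words <- c("apple", "banana", "avocado", "sing", "singing", "a")

# Starts with "a"
str_subset(words, "^a")
#> [1] "apple"   "avocado" "a"

# Ends with "ing"
str_subset(words, "ing$")
#> [1] "sing"    "singing"

# Whole string is exactly "sing"
str_subset(words, "^sing$")
#> [1] "sing"
```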
str_subset() vs alternatives
Four functions cover almost every "filter a character vector by pattern" job. Pick by return type, not by habit: values, logicals, indices, or base R equivalent.
| Function | Returns | Use when |
|---|---|---|
| str_subset(x, p) | character vector of matches | You want the matching strings themselves |
| str_detect(x, p) | logical vector, same length as x | You need a TRUE/FALSE mask (for filter() or & / \|) |
| str_which(x, p) | integer vector of indices | You need positions (for slicing, ordering, joining) |
| grep(p, x, value = TRUE) | character vector of matches | You are writing base R with no dependencies |
All four are vectorized over x. Pick the one whose return type fits the next step in your pipeline.
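To make the return types concrete, here is the same pattern run through all four functions on a small made-up vector.

```r
library(stringr)

# Illustrative input
x <- c("cat", "dog", "catalog")

str_subset(x, "cat")            # the matching values
#> [1] "cat"     "catalog"

str_detect(x, "cat")            # a logical mask, same length as x
#> [1]  TRUE FALSE  TRUE

str_which(x, "cat")             # integer positions of the matches
#> [1] 1 3

grep("cat", x, value = TRUE)    # base R, same values as str_subset()
#> [1] "cat"     "catalog"
```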
Common pitfalls
Pitfall 1: regex special characters are not treated literally. str_subset(x, "1.5") matches "1a5", "125", "1.5", and so on, because . is regex for "any character". Use fixed("1.5") or escape the dot: "1\\.5".
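A quick check of the pitfall and both fixes, on a hypothetical vector x.

```r
library(stringr)

# Illustrative input
x <- c("1.5", "1a5", "125", "2.5")

# Unescaped dot: matches any character between "1" and "5"
str_subset(x, "1.5")
#> [1] "1.5" "1a5" "125"

# Escaped dot: matches a literal "."
str_subset(x, "1\\.5")
#> [1] "1.5"

# fixed(): the whole pattern is literal text
str_subset(x, fixed("1.5"))
#> [1] "1.5"
```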
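Pitfall 2: expecting NA elements to survive. str_subset(c("a", NA, "ab"), "a") returns c("a", "ab"). A missing value never matches, so str_subset() silently drops it. The seemingly equivalent x[str_detect(x, "a")] behaves differently: str_detect() returns NA for a missing element, and subsetting with a logical NA puts an NA in the result. Wrapping the input in na.omit() is harmless with str_subset() but not needed; it matters for the str_detect() form.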
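A short check of how missing values are handled, as observed in current stringr releases (the package's own examples note that missings never match). The vector x is illustrative.

```r
library(stringr)

# Illustrative input with one missing value
x <- c("a", NA, "ab")

# str_subset() drops the NA
str_subset(x, "a")
#> [1] "a"  "ab"

# Logical indexing with str_detect() keeps it, because the mask is TRUE NA TRUE
x[str_detect(x, "a")]
#> [1] "a"  NA   "ab"
```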
To filter the rows of a data frame, use filter(df, str_detect(col, "pattern")) with dplyr instead. str_subset() is only for atomic character vectors.
Pitfall 3: empty result returns character(0), not NULL. When no strings match, str_subset() returns a zero-length character vector. Check with length(result) > 0 before downstream code that assumes at least one element.
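A minimal sketch of guarding against an empty result. The names x and result are illustrative.

```r
library(stringr)

# Illustrative input; nothing contains "z"
x <- c("alpha", "beta")
result <- str_subset(x, "z")

result
#> character(0)

is.null(result)
#> [1] FALSE

# Guard before using the first element
if (length(result) > 0) {
  message("First match: ", result[1])
} else {
  message("No matches")
}
```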
Try it yourself
Try it: Filter the built-in state.name vector to keep only states whose name starts with "New". Save the result to ex_states.
Explanation: The pattern "^New" anchors the match to the start of each state name, so it keeps only states whose name begins with "New". Four states qualify.
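One possible solution, using the built-in state.name vector and the variable name ex_states requested in the exercise:

```r
library(stringr)

# Keep states whose name starts with "New"
ex_states <- str_subset(state.name, "^New")
ex_states
#> [1] "New Hampshire" "New Jersey"    "New Mexico"    "New York"
```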
Related stringr functions
After mastering str_subset, look at:
- str_detect(): TRUE/FALSE mask version of the same logical test
- str_which(): integer index version, for slicing or positional joins
- str_extract(): extract the matched substring rather than the whole string
- str_count(): count matches within each string
- str_replace(): replace the matched pattern with new text
For complex patterns, the regex(), fixed(), coll(), and boundary() modifier functions in stringr give precise control over match behavior. Read the stringr regex reference for the full pattern syntax.
FAQ
How do I filter a vector of strings by a pattern in R?
Use stringr::str_subset(x, "pattern"). It returns a character vector containing only the elements of x that match the pattern. The pattern is a regex by default; wrap with fixed() for a literal match. It is the tidyverse equivalent of grep(pattern, x, value = TRUE).
What is the difference between str_subset and str_detect in R?
str_detect() returns a logical vector the same length as the input (TRUE where the pattern matches). str_subset() returns only the matching elements. For input without missing values, str_subset(x, p) gives the same result as x[str_detect(x, p)]; with NA elements, str_subset() drops them while the indexing form keeps them as NA. Use str_subset() when you want the values and str_detect() when you want a mask to combine with other conditions.
Is str_subset the same as grep with value=TRUE?
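Yes, for simple cases. str_subset(x, "p") and grep("p", x, value = TRUE) return the same character vector. Differences: str_subset() spells inversion as negate = TRUE (grep() uses invert = TRUE) and handles modifiers like fixed() and regex() consistently. grep() needs no packages beyond base R and offers perl = TRUE for PCRE. The two also use different regex engines (ICU in stringr, TRE or PCRE in base R), so complex patterns can behave differently.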
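A side-by-side sketch on a made-up vector, including the base R counterpart of negate, which is grep()'s invert argument.

```r
library(stringr)

# Illustrative input
x <- c("pear", "apple", "grape", "plum")

# Matching values
str_subset(x, "ap")
#> [1] "apple" "grape"
grep("ap", x, value = TRUE)
#> [1] "apple" "grape"

# Non-matching values
str_subset(x, "ap", negate = TRUE)
#> [1] "pear" "plum"
grep("ap", x, value = TRUE, invert = TRUE)
#> [1] "pear" "plum"
```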
How do I do a case-insensitive str_subset in R?
Wrap the pattern in regex(pattern, ignore_case = TRUE). Example: str_subset(x, regex("apple", ignore_case = TRUE)) keeps strings matching "apple", "Apple", "APPLE", and so on.
Can str_subset filter rows of a data frame?
No. str_subset() only accepts atomic character vectors. To filter data frame rows by a string column, use dplyr: filter(df, str_detect(col, "pattern")). str_detect() produces the logical mask that filter() needs.
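A minimal sketch of the row filter, assuming dplyr is installed. The data frame df and its columns id and fruit are made up for illustration.

```r
library(dplyr)
library(stringr)

# Illustrative data frame
df <- data.frame(
  id = 1:4,
  fruit = c("apple", "banana", "pineapple", "cherry")
)

# Keep rows whose fruit column contains "apple";
# str_detect() supplies the logical mask that filter() expects
apple_rows <- filter(df, str_detect(fruit, "apple"))
apple_rows$fruit
#> [1] "apple"     "pineapple"
```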