stringr str_extract() in R: Extract Pattern From Strings
The str_extract() function in stringr returns the FIRST match of a regex from each input string. str_extract_all() returns ALL matches as a list. Both are vectorized and pipe-friendly.
str_extract(x, "\\d+")                          # first run of digits
str_extract_all(x, "\\d+")                      # all matches per string
str_extract(x, "(?i)apple")                     # case-insensitive
str_extract(emails, "(?<=@)\\S+")               # lookbehind
str_match(x, "(\\d+)-(\\d+)")                   # numbered capture groups
str_extract_all(x, "\\d+", simplify = TRUE)     # matrix output
str_extract(x, "[A-Z]+")                        # first run of uppercase letters
Need explanation? Read on for examples and pitfalls.
What str_extract() does in one sentence
str_extract(string, pattern) returns the FIRST substring matching pattern in each input. Inputs with no match get NA. The output is the same length as the input.
It is the simplest way to extract numeric values, codes, or any structured text from messy strings. For multiple matches per string, use str_extract_all().
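As a minimal sketch of that behavior, the snippet below runs str_extract() on a small made-up vector (`x` and the product-code pattern are illustrative assumptions, not part of the original text). It shows the first match per string, NA for a non-match, and an output as long as the input.

```r
library(stringr)

# Hypothetical sample data, invented for illustration
x <- c("ref AB-123 ok", "nothing here", "CD-456 and EF-789")

# Two uppercase letters, a hyphen, three digits
codes <- str_extract(x, "[A-Z]{2}-\\d{3}")

codes                        # "AB-123" NA "CD-456"
length(codes) == length(x)   # TRUE: one result per input string
```

The third string contains two codes, but only the first one is returned.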
Syntax
str_extract(string, pattern). Pattern is regex by default.
str_extract() returns CHARACTER, even for digit patterns. str_extract("price 100", "\\d+") returns "100" (string). Convert to numeric explicitly: as.numeric(str_extract(...)).
Five common patterns
1. Extract digits
\\d+ matches one or more digits. The first match is returned per string.
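A short sketch of this pattern, using a hypothetical `prices` vector, might look like the following. It also shows the explicit conversion from character to numeric.

```r
library(stringr)

# Hypothetical input strings
prices <- c("price 100", "cost: 25 USD", "free")

digits  <- str_extract(prices, "\\d+")  # character: "100" "25" NA
amounts <- as.numeric(digits)           # numeric:   100   25  NA
```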
2. Extract all matches
str_extract_all returns a LIST of character vectors. Use simplify = TRUE for a matrix.
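The sketch below, on an invented vector `x`, contrasts the two output shapes. With simplify = TRUE, shorter rows are padded with empty strings.

```r
library(stringr)

# Hypothetical input: three, zero, and one match respectively
x <- c("a1 b22 c333", "none", "7")

all_list <- str_extract_all(x, "\\d+")
# list of length 3: c("1", "22", "333"), character(0), "7"

all_mat <- str_extract_all(x, "\\d+", simplify = TRUE)
# 3 x 3 character matrix; missing cells are ""
```

The list form keeps "no match" as a zero-length vector, while the matrix form is easier to bind onto a data frame.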
3. Capture groups with str_match
str_match returns a matrix: column 1 is the full match, columns 2+ are capture groups.
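For instance, a range such as "10-20" could be split into its two numbers roughly like this (the vector `x` is a made-up example):

```r
library(stringr)

# Hypothetical input strings
x <- c("range 10-20", "no range", "from 3-7 to 8-9")

m <- str_match(x, "(\\d+)-(\\d+)")
# column 1: full match   "10-20" NA "3-7"
# column 2: first group  "10"    NA "3"
# column 3: second group "20"    NA "7"

lower <- as.integer(m[, 2])
upper <- as.integer(m[, 3])
```

Like str_extract(), str_match() only reports the first match per string, so "8-9" in the third element is ignored.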
4. Lookbehind for "after-X" pattern
(?<=@) is a lookbehind: matches what FOLLOWS @ without including it in the result.
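A small sketch with a hypothetical `emails` vector shows the lookbehind in action: the @ must be present, but it is not part of the returned text.

```r
library(stringr)

# Hypothetical addresses, invented for illustration
emails <- c("ana@example.com", "bob@mail.example.org", "not-an-email")

# Everything after the @, up to the next whitespace
domains <- str_extract(emails, "(?<=@)\\S+")
domains   # "example.com" "mail.example.org" NA
```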
5. ALL CAPS substring
Regex character class [A-Z]+ matches uppercase letters.
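One thing worth checking in practice: [A-Z]+ also matches a single capital letter at the start of an ordinary word. The sketch below, on invented strings, compares it with a stricter variant that requires a whole word of at least two capitals (the stricter pattern is an illustrative suggestion, not from the original text).

```r
library(stringr)

# Hypothetical input strings
x <- c("send to NASA today", "lowercase only", "Mixed Case ID")

caps_any   <- str_extract(x, "[A-Z]+")           # "NASA" NA "M"
caps_words <- str_extract(x, "\\b[A-Z]{2,}\\b")  # "NASA" NA "ID"
```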
str_match() is more powerful than str_extract(). It returns capture groups as separate columns, perfect for parsing structured strings like dates, IDs, URLs.
str_extract() vs str_match() vs str_detect() vs str_locate()
Six stringr "find a match" functions, each returning a different shape.
| Function | Returns | Best for |
|---|---|---|
| str_extract() | Character vector of matched text | Pull a single substring per row |
| str_extract_all() | List of character vectors | When matches per row vary in count |
| str_match() | Matrix (full match + capture groups) | Parsing structured patterns into parts |
| str_detect() | Logical vector | Filter / boolean checks |
| str_locate() | Integer matrix (start, end positions) | Need character offsets, not text |
| str_subset() | Filtered character vector | Keep only strings that match |
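To make the shapes in the table concrete, here is a sketch that runs all six functions on the same two made-up strings.

```r
library(stringr)

# Hypothetical input: one string with digits, one without
x <- c("id 12 and 34", "none")

ext     <- str_extract(x, "\\d+")       # character vector: "12" NA
ext_all <- str_extract_all(x, "\\d+")   # list: c("12", "34"), character(0)
mat     <- str_match(x, "(\\d)(\\d)")   # matrix, row 1: "12" "1" "2"
det     <- str_detect(x, "\\d+")        # logical: TRUE FALSE
loc     <- str_locate(x, "\\d+")        # matrix, row 1: start 4, end 5
sub     <- str_subset(x, "\\d+")        # "id 12 and 34"
```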
When to use which:
- str_extract is the workhorse: simple, vectorized, returns the matched substring. Most string-mining tasks start here.
- str_match when the pattern has multiple meaningful parts (year + month + day, area code + number).
- str_detect for boolean conditions in filter() or if.
- str_locate for offset-based slicing.
A practical workflow combines them: str_detect to flag rows of interest, str_extract (or str_match) to pull the data, str_replace to clean up the source.
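That workflow could be sketched as follows. The `notes` vector and the invoice-number format are assumptions made up for the example.

```r
library(stringr)

# Hypothetical free-text notes
notes <- c("Invoice #1001 paid", "call back later", "Invoice #1002 overdue")

# 1. Flag rows of interest
has_invoice <- str_detect(notes, "#\\d+")

# 2. Pull the data from the flagged rows (digits after the #)
invoice_id <- str_extract(notes[has_invoice], "(?<=#)\\d+")

# 3. Clean up the source text by dropping the invoice reference
cleaned <- str_replace(notes, "\\s*#\\d+", "")
```

Here `invoice_id` holds "1001" and "1002", and `cleaned` reads "Invoice paid", "call back later", "Invoice overdue".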
Why regex matters here
str_extract() is essentially a regex engine wrapped in a friendly vectorized API. The function is only as powerful as the patterns you write. Three regex idioms cover most real-world extraction:
- Quantifiers (+, *, ?, {n,m}) control how many characters to match.
- Character classes ([A-Z], \\d, \\w, \\s) say WHAT to match.
- Anchors and lookarounds (^, $, (?<=), (?=)) say WHERE to match.
Combining these lets you extract email addresses, phone numbers, URLs, dates, prices, codes, hashtags, and almost any structured token from messy text. For more advanced patterns, the regex() helper lets you set flags like ignore_case = TRUE or multiline = TRUE.
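As a rough illustration, the sketch below combines a lookbehind, a character class, and quantifiers to pull a price, then uses the regex() helper with the two flags mentioned above. All input strings are invented for the example.

```r
library(stringr)

# Lookbehind (WHERE) + \\d (WHAT) + quantifiers (HOW MANY)
price <- str_extract("total: $19.99 due", "(?<=\\$)\\d+\\.\\d{2}")
price   # "19.99"

# ignore_case: match regardless of letter case
fruit <- c("Apple pie", "green APPLE", "banana")
any_case <- str_extract(fruit, regex("apple", ignore_case = TRUE))
any_case   # "Apple" "APPLE" NA

# multiline: ^ and $ match at line breaks, not only at string ends
txt <- "first line\nsecond line"
default_anchor <- str_extract(txt, "^second")                           # NA
line_anchor    <- str_extract(txt, regex("^second", multiline = TRUE))  # "second"
```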
Common pitfalls
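Pitfall 1: patterns that can match the empty string, and greedy quantifiers. A pattern such as \\d* can match zero characters, so str_extract() returns "" rather than NA when a string does not start with a digit; use \\d+ when at least one character is required. At the other extreme, str_extract(x, ".*") is greedy and matches the WHOLE string. To stop at the first delimiter, use a lazy quantifier followed by that delimiter (for example <.*?>) or an anchored pattern.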
Pitfall 2: NA output for non-matches. Strings without a match return NA. Check with is.na() after extracting.
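Both pitfalls can be reproduced in a few lines. The inputs below are made up for illustration.

```r
library(stringr)

x <- c("abc", "a1b2")

# A pattern that may match nothing returns "" instead of NA
empty_ok <- str_extract(x, "\\d*")   # "" ""
need_one <- str_extract(x, "\\d+")   # NA "1"

# Greedy vs lazy quantifier on a hypothetical HTML-like string
tag <- "<b>bold</b> and <i>italic</i>"
greedy <- str_extract(tag, "<.*>")    # the whole string
lazy   <- str_extract(tag, "<.*?>")   # "<b>"

# Keep only the successful extractions
matched <- need_one[!is.na(need_one)]  # "1"
```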
str_extract is CHARACTER even when matching digits. Always cast to the right type after: as.numeric(str_extract(x, "\\d+")). Forgetting causes downstream type errors.
Try it yourself
Try it: Extract the YEAR from each date string in dates. Save to ex_years (as integers).
Explanation: \\d{4} matches exactly 4 digits (the year). as.integer() converts the string output to an integer vector.
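The exercise does not show its input, so the `dates` vector below is a hypothetical stand-in. One possible solution along the lines of the explanation:

```r
library(stringr)

# Hypothetical input; the exercise's actual `dates` vector is not shown
dates <- c("2021-03-15", "15/04/2019", "Released 2023")

# First run of exactly four digits, converted to integer
ex_years <- as.integer(str_extract(dates, "\\d{4}"))
ex_years   # 2021 2019 2023
```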
Related stringr functions
After mastering str_extract, look at:
- str_match(): returns capture groups as columns
- str_extract_all(): every match per string
- str_detect(): just check for presence
- str_replace(): replace matched part
- str_locate(): find character positions of match
- str_subset(): filter strings that match
For complex extraction with many groups, str_match() with named groups, written (?<name>pattern), is the cleanest approach. stringr uses the ICU regex engine (via stringi) rather than Perl, and ICU supports this syntax.
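A sketch of named groups on a made-up `dates` vector follows. Whether the group names appear as matrix column names depends on the installed stringr/stringi version, which is an assumption here, so the example indexes columns by position.

```r
library(stringr)

# Hypothetical ISO-style dates
dates <- c("2024-05-17", "not a date")

m <- str_match(dates, "(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})")

# Positional access works regardless of whether names are carried through
year  <- as.integer(m[, 2])   # 2024 NA
month <- as.integer(m[, 3])   # 5 NA
day   <- as.integer(m[, 4])   # 17 NA
```

If your version does carry the names through, m[, "year"] is a more readable alternative to m[, 2].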
FAQ
How do I extract a regex match from a string in R?
Use stringr::str_extract(x, "pattern") for the first match. Use str_extract_all(x, "pattern") for all matches. Both accept regex by default; wrap with fixed() for literal text.
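The difference between a regex pattern and fixed() shows up as soon as the pattern contains a metacharacter. A minimal sketch on invented version strings:

```r
library(stringr)

# Hypothetical input strings
x <- c("v1.2 released", "v132 released")

as_regex   <- str_extract(x, "1.2")          # "." matches any character
as_literal <- str_extract(x, fixed("1.2"))   # "." must be a literal dot

as_regex     # "1.2" "132"
as_literal   # "1.2" NA
```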
What is the difference between str_extract and str_match in R?
str_extract() returns the matched substring as a character vector. str_match() returns a matrix where column 1 is the full match and columns 2+ are capture groups. Use str_match when you need to parse specific parts.
How do I extract numbers from a string in R?
as.numeric(str_extract(x, "\\d+")) extracts the first run of digits and converts to numeric. For all numbers, str_extract_all(x, "\\d+") returns a list.
What does str_extract return for non-matching strings?
NA. Strings with no match get NA in the result. Filter with na.omit() or !is.na() if you need only successful matches.
How do I make str_extract case-insensitive?
Wrap pattern: str_extract(x, regex("apple", ignore_case = TRUE)). Or use inline modifier: (?i)apple.