stringr str_extract_all() in R: Every Regex Match
The str_extract_all() function in stringr returns EVERY regex match per input string, as a list of character vectors (one vector per string). Use it whenever a single string can hold more than one match, such as multiple numbers, hashtags, or tokens.
    str_extract_all(x, "\\d+")                          # all digit runs per string
    str_extract_all(x, "\\d+", simplify = TRUE)         # matrix output, padded with ""
    unlist(str_extract_all(x, "\\d+"))                  # flatten to one vector
    str_extract_all(x, fixed("."))                      # literal, not regex
    str_extract_all(x, regex("a", ignore_case = TRUE))  # case-insensitive
    str_extract_all(text, "#\\w+")                      # every hashtag
    lengths(str_extract_all(x, "\\d+"))                 # matches per string
    str_extract_all(x, boundary("word"))                # word tokens
Need explanation? Read on for examples and pitfalls.
What str_extract_all() does in one sentence
str_extract_all(string, pattern) returns a list with one character vector per input string, holding every non-overlapping regex match in order. Strings with zero matches return character(0), not NA; only an NA input produces NA. The output is always a list, even for a length-one input.
The shape difference from str_extract() is the source of most confusion. str_extract() gives a flat character vector the same length as the input; str_extract_all() gives a list. Code written for one will not work unmodified on the other.
Syntax
str_extract_all(string, pattern, simplify = FALSE) accepts a vector and a regex pattern. Set simplify = TRUE to coerce the list into a character matrix padded with empty strings.
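As a minimal sketch of the default call, assume a hypothetical three-element vector x whose second string contains no digits:

```r
library(stringr)

# Hypothetical input: the second string has no digit run
x <- c("a1b22", "no digits", "333 and 4")

res <- str_extract_all(x, "\\d+")
res
# [[1]] "1" "22"
# [[2]] character(0)
# [[3]] "333" "4"
```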
The output is a length-3 list because the input had 3 strings. Element 2 is empty because that string had no digit run.
Use str_extract_all() whenever your regex uses a quantifier like +, *, or {n,} and you suspect the string may carry more than one hit. The single-match str_extract() silently drops everything past the first match, which is a frequent bug source in token-level work.
Five common patterns
1. Get every match as a flat vector
When you do not need to know which input each match came from, flatten the list with unlist() to get one vector you can summarise, sort, or deduplicate.
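A small sketch of the flattening step, using a made-up vector x of ID-like strings (the data is an assumption for illustration):

```r
library(stringr)

# Hypothetical input with a repeated value and a no-match string
x <- c("id 10, id 20", "none", "id 20 again")

all_ids <- unlist(str_extract_all(x, "\\d+"))
all_ids          # "10" "20" "20"
unique(all_ids)  # "10" "20"
length(all_ids)  # 3
```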
unlist() drops the string-to-match mapping. For aggregate stats like total count or unique values, this is exactly what you want.
2. Build a clean matrix with simplify = TRUE
For tabular reports, set simplify = TRUE. The result is a character matrix where row i holds the matches from string i, padded with empty strings so every row has the same width.
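The sketch below shows the matrix shape on a hypothetical three-string input; the per-row count at the end is one possible way to separate real matches from padding:

```r
library(stringr)

# Hypothetical input: 3, 0, and 2 digit runs respectively
x <- c("a1b22c333", "none", "4 and 55")

m <- str_extract_all(x, "\\d+", simplify = TRUE)
m
#      [,1] [,2] [,3]
# [1,] "1"  "22" "333"
# [2,] ""   ""   ""
# [3,] "4"  "55" ""

# Padding cells are "", so count real matches per row with nzchar()
n_matches <- apply(m, 1, function(row) sum(nzchar(row)))
n_matches  # 3 0 2
```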
This shape plugs straight into as.data.frame() or apply(). The empty cells stay as "", not NA, so test with nzchar() rather than is.na().
3. Tidy long-format with unnest_longer
The list output is awkward to summarise inside dplyr. Drop it into a tibble column and call tidyr::unnest_longer() for a tidy long frame, one row per match.
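A minimal sketch of that workflow, assuming a hypothetical tibble with id and text columns and a tidyr version recent enough to offer the keep_empty argument. keep_empty = TRUE is set explicitly here so the no-match row b survives as NA:

```r
library(stringr)
library(tibble)
library(tidyr)

# Hypothetical data: row "b" contains no digits
df <- tibble(
  id   = c("a", "b", "c"),
  text = c("x1 y22", "nothing here", "z333")
)

# Store the list output as a list-column
df$num <- str_extract_all(df$text, "\\d+")

# One row per match; keep_empty = TRUE retains "b" with NA
long <- unnest_longer(df, num, keep_empty = TRUE)
long[, c("id", "num")]
# a "1"
# a "22"
# b NA
# c "333"
```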
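By default, unnest_longer() drops parent rows whose list element is empty, so a no-match row such as b disappears. Pass keep_empty = TRUE to keep every parent row, with NA in the match column for the empty ones. Leave keep_empty at its default of FALSE if you want only matched rows.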
str_extract_all() paired with unnest_longer() is the cleanest way to go from messy free text to a row-per-match data frame. No loops, no mapply(), no manual binding. The pattern scales from 10 strings to 10 million without changing shape.
4. Extract every hashtag, mention, or URL
Social text is a natural fit because counts per string vary. The pattern stays simple; only the regex changes.
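As an illustration, the sketch below runs the hashtag pattern on a few invented posts and then reuses the same call shape for @mentions (the sample text and handles are made up):

```r
library(stringr)

# Hypothetical posts with varying numbers of hashtags
posts <- c("Loving #rstats and #dataviz", "no tags today", "#tidyverse tips")

tags <- str_extract_all(posts, "#\\w+")
lengths(tags)  # 2 0 1
unlist(tags)   # "#rstats" "#dataviz" "#tidyverse"

# Same call, different regex: @mentions
mentions <- str_extract_all("thanks @alice and @bob", "@\\w+")[[1]]
mentions       # "@alice" "@bob"
```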
The pattern #\\w+ matches a # followed by one or more word characters. Swap to https?://\\S+ for URLs or [A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+ for email-like tokens.
5. Word tokens with boundary()
For natural-language tokens, the boundary("word") matcher is locale-aware and handles punctuation cleanly without writing your own regex.
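A short sketch comparing the two tokenizers on an invented sentence with contractions:

```r
library(stringr)

# Hypothetical sentence containing apostrophes and punctuation
s <- "Don't panic, it's fine."

words_boundary <- str_extract_all(s, boundary("word"))[[1]]
words_boundary  # "Don't" "panic" "it's" "fine"

words_regex <- str_extract_all(s, "\\w+")[[1]]
words_regex     # "Don" "t" "panic" "it" "s" "fine"
```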
boundary("word") keeps Don't intact, while a naive \\w+ would split it on the apostrophe. Use the boundary matcher whenever you tokenize prose for downstream NLP work.
str_extract_all vs str_extract vs str_match_all
Three stringr matchers, three return shapes. Pick by what your downstream code expects.
| Function | Returns | Best for |
|---|---|---|
| str_extract() | Character vector, length of input | First match per string, NA for none |
| str_extract_all() | List of character vectors | Every match per string, variable counts |
| str_extract_all(..., simplify = TRUE) | Character matrix (padded) | Fixed-shape table for downstream apply |
| str_match() | Matrix: full match + capture groups | First match WITH capture groups |
| str_match_all() | List of matrices | Every match WITH capture groups |
| str_count() | Integer vector | Match count only, no text needed |
When to use which:
- str_extract_all is the workhorse for "give me every hit." Use it for tokenization, multi-hit parsing, and feature extraction.
- str_match_all when each match has parts (year + month + day, area-code + number). It returns capture groups as extra matrix columns.
- str_count when only the count matters; faster and lighter on memory.
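To make the return shapes concrete, the sketch below runs four of these matchers on the same hypothetical date strings with a three-group pattern:

```r
library(stringr)

# Hypothetical input: two dates in the first string, none in the second
x <- c("2021-03-15 and 2022-11-02", "no dates")
p <- "(\\d{4})-(\\d{2})-(\\d{2})"

first_only <- str_extract(x, p)      # "2021-03-15" NA
every_hit  <- str_extract_all(x, p)  # list: two dates, then character(0)
with_parts <- str_match_all(x, p)    # list of matrices: full match, year, month, day
n_hits     <- str_count(x, p)        # 2 0

with_parts[[1]][, 2]                 # years: "2021" "2022"
```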
Coming from Python: str_extract_all(x, p) behaves like [re.findall(p, s) for s in x] for patterns without capture groups. simplify = TRUE has no direct Python analogue; build the padded matrix with a list comprehension plus padding.
Common pitfalls
Pitfall 1: treating the output as a character vector. str_extract_all(x, "a")[1] returns a one-element LIST (the first character vector wrapped in a list), not a character vector. Use [[1]] to unwrap: str_extract_all(x, "a")[[1]].
Pitfall 2: using length() to count matches. length(str_extract_all(x, "a")) returns the input string count, not the total match count. For per-string counts use lengths() (note the s); for the grand total use sum(lengths(...)).
Pitfall 3: assuming zero-match rows survive unnest_longer(). A zero-match element is character(0). It keeps its row inside dplyr mutate(), but unnest_longer() drops that row by default because there is nothing to expand. Pass keep_empty = TRUE to keep the row with NA in the match column, or rely on the default when you want only matched rows.
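The first two pitfalls can be seen side by side in a short sketch on a made-up vector:

```r
library(stringr)

# Hypothetical input: 3, 0, and 2 occurrences of "a"
x <- c("banana", "cherry", "avocado")
res <- str_extract_all(x, "a")

class(res[1])      # "list": still wrapped
class(res[[1]])    # "character": unwrapped
length(res)        # 3, the number of input strings
lengths(res)       # 3 0 2, matches per string
sum(lengths(res))  # 5, total matches
```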
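str_extract_all("aaaa", "aa") returns c("aa", "aa"), NOT three matches. stringr scans left-to-right and consumes each match. If overlapping matches are needed, put the target in a capture group inside a lookahead, such as (?=(aa)), and read the group with str_match_all(). A bare lookahead like (?=aa) matches without consuming, so str_extract_all() would return only empty strings.
Try it yourself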
Try it: Extract every 4-digit year from the sentences in ex_text. Save the FLAT vector of all years as integers to ex_years.
Click to reveal solution
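One possible solution is sketched below. The ex_text vector here is a hypothetical stand-in, since the exercise data is defined elsewhere:

```r
library(stringr)

# Hypothetical stand-in for ex_text
ex_text <- c(
  "Founded in 1998, relaunched in 2015.",
  "Order 12345 shipped.",
  "See the 2020 report."
)

# Extract 4-digit tokens, flatten, and convert to integer
ex_years <- as.integer(unlist(str_extract_all(ex_text, "\\b\\d{4}\\b")))
ex_years  # 1998 2015 2020
```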
Explanation: \\b\\d{4}\\b matches exactly 4 digits with word boundaries on both sides. unlist() flattens the per-string list into one character vector, and as.integer() coerces to the integer type expected for years.
Related stringr functions
After mastering str_extract_all, look at:
- str_extract(): first match per string (sibling, flat-vector return)
- str_match_all(): every match plus capture groups, returned as a list of matrices
- str_locate_all(): start and end positions of every match
- str_count(): number of matches per string when text is not needed
- str_replace_all(): replace every match in place
- str_split(): split a string at every match
For full regex syntax, see the official stringr regular expressions vignette.
FAQ
How do I extract all regex matches from a string in R?
Use stringr::str_extract_all(string, "pattern"). It returns a list of character vectors, one per input string, with every non-overlapping match in left-to-right order. For a flat vector across all strings, wrap the call in unlist().
What is the difference between str_extract and str_extract_all in R?
str_extract() returns only the FIRST match per string, packaged as a flat character vector the same length as the input. str_extract_all() returns EVERY match per string, packaged as a list of character vectors. Use the all-variant whenever a string may carry more than one match.
How do I extract all numbers from a string in R?
unlist(str_extract_all(x, "\\d+")) extracts every run of digits from every string and returns one flat character vector. Wrap in as.numeric() or as.integer() to convert. Use "\\d*\\.?\\d+" to also catch decimals.
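A brief sketch of the decimal-aware pattern on invented strings, followed by the numeric conversion:

```r
library(stringr)

# Hypothetical input mixing integers and decimals
x <- c("pi is 3.14", "42 items at .5 each")

nums <- unlist(str_extract_all(x, "\\d*\\.?\\d+"))
nums                        # "3.14" "42" ".5"

nums_num <- as.numeric(nums)
nums_num                    # 3.14 42.00 0.50
```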
Does str_extract_all return overlapping matches?
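No. It scans left-to-right and consumes each match, so str_extract_all("aaaa", "aa") returns two matches, not three. For overlapping matches, wrap the target in a capture group inside a lookahead, such as (?=(aa)), and read the group with str_match_all(). The lookahead matches without consuming characters, so a bare (?=aa) passed to str_extract_all() yields only empty strings.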
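The sketch below contrasts the three behaviours on the same input. The capture-group-in-lookahead approach is one common workaround, not the only one:

```r
library(stringr)

# Default: matches are consumed, so no overlap
plain <- str_extract_all("aaaa", "aa")[[1]]
plain      # "aa" "aa"

# A bare lookahead is zero-width, so the extracted text is empty
zero_width <- str_extract_all("aaaa", "(?=aa)")[[1]]
zero_width # "" "" ""

# Capture inside the lookahead, then read column 2 of the match matrix
overlapping <- str_match_all("aaaa", "(?=(aa))")[[1]][, 2]
overlapping # "aa" "aa" "aa"
```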
How do I count matches per string with str_extract_all?
Use lengths(str_extract_all(x, pattern)) (note the s). It returns an integer vector of match counts per string. If you only need the count, str_count(x, pattern) is shorter and avoids materializing the matches.