R Regular Expressions: Pattern Matching with stringr (20 Examples)
Regular expressions (regex) are compact patterns that describe what text looks like (digits, letters, positions, repetitions) so you can match, extract, or replace it without writing loops. In R, stringr wraps the ICU regex engine in consistent, pipe-friendly functions like str_detect(), str_extract(), and str_replace() that behave predictably on character vectors.
How do character classes target specific character types?
Messy text is full of embedded values (order numbers, prices, phone fragments), and character classes are how regex tells them apart. A class like \\d matches any digit, \\w matches any word character (letters, digits, underscore), and \\s matches whitespace. Let's pull every run of digits out of realistic customer text and see the payoff immediately.
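The example code itself isn't shown here, so below is a minimal sketch; the customer strings are illustrative assumptions, not the original data:

```r
library(stringr)

# Hypothetical customer text with embedded numbers
customer_text <- c(
  "Order 1234 shipped",
  "Call 555 0199, total 42",
  "Ref 9",
  "no digits here"
)

# \\d+ matches each run of one or more consecutive digits
str_extract_all(customer_text, "\\d+")
```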
Each element of the result is a character vector of every match found in that string. The fourth line has no digits, so it returns character(0). The key detail is the + quantifier: \\d+ matches runs of one or more digits, not individual digits, which is why "1234" comes back as a single token instead of four.
Note: In regex notation a digit is \d, but R's string parser consumes one backslash first, so you always write "\\d" in R code to mean the regex \d. This trips up almost every beginner.
Example 2: Detect strings containing only letters
Sometimes you need to check whether a string is "clean", containing only alphabetic characters, nothing else. You combine a character class with anchors to lock the check to the whole string.
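A minimal sketch of this check; the sample vector is an illustrative assumption:

```r
library(stringr)

strings <- c("hello", "world123", "R", "two words")

# Anchored class: only letters allowed from start to end
str_detect(strings, "^[a-zA-Z]+$")
```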
The pattern ^[a-zA-Z]+$ says "from the very start to the very end, only letters." Without the ^ and $ anchors, "world123" would match because it contains letters. The anchors force the match to cover the whole string, which is how validation patterns work. We will dig deeper into anchors in the third H2.
Example 3: Replace all non-word characters
Cleaning text often means stripping punctuation and special characters before further processing. The shorthand class \\W matches any non-word character, anything that is not a letter, digit, or underscore.
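A sketch of the replacement described below; the messy string is an illustrative assumption:

```r
library(stringr)

messy <- "Total: $45.99 (incl. tax) #invoice"

# \\W+ collapses each run of non-word characters into a single space
str_replace_all(messy, "\\W+", " ")
```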
Every run of non-word characters (colons, hashes, dollar signs, parentheses, dots) gets replaced by a single space. This is a fast way to normalise text before tokenising or searching. The + is doing important work: without it, consecutive specials like ": $" would each produce a separate space instead of collapsing into one.
Example 4: Extract non-whitespace tokens
The opposite approach is sometimes useful: pull everything that is not whitespace. The pattern \\S+ matches one or more non-whitespace characters, which gives you a quick-and-dirty word tokeniser.
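A sketch of the \\S+ tokeniser; the sample line is an illustrative assumption:

```r
library(stringr)

line <- "Order #1234 shipped to: Anna"

# \\S+ grabs each maximal run of non-whitespace characters
str_extract_all(line, "\\S+")[[1]]
```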
Each "word" (including punctuation attached to it) becomes a separate element. The "#" stays glued to "1234" because there is no whitespace between them. For sophisticated tokenising you would reach for tidytext, but \\S+ covers a lot of the quick cases, log lines, URLs, comma-free CSVs, without importing a new package.
Try it: Write code that extracts every run of lowercase vowels from the string "Regular expressions are fun". Store the result in ex_vowels.
Click to reveal solution
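One solution consistent with the explanation that follows:

```r
library(stringr)

# [aeiou]+ matches each run of one or more lowercase vowels
ex_vowels <- str_extract_all("Regular expressions are fun", "[aeiou]+")[[1]]
ex_vowels
```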
Explanation: The character class [aeiou] matches any one lowercase vowel; the + turns that into "one or more in a row." str_extract_all() gathers every non-overlapping run, so "io" from "expressions" is returned as a single token.
How do quantifiers control how many characters to match?
Quantifiers specify how many times the preceding element should repeat. The four workhorses are ? (zero or one), + (one or more), * (zero or more), and {n,m} (between n and m times). By default quantifiers are greedy: they match as much text as they can while still letting the rest of the pattern succeed.
Example 5: Extract optional area codes from phone numbers
Phone numbers sometimes have an area code in parentheses and sometimes don't. The ? quantifier is perfect for "this bit may or may not be here."
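The pattern below is one plausible reconstruction of the checklist described next; making the whole area-code chunk optional as a unit is an assumption (it is what lets a bare "555-1234" still match):

```r
library(stringr)

phones <- c("(415) 555-1234", "415-555-1234", "555-1234")

# Optional area code: "(", 3 digits, ")", separator -- the whole group is optional
pattern <- "(\\(?\\d{3}\\)?[- ]?)?\\d{3}[- ]?\\d{4}"
str_extract(phones, pattern)
```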
The pattern reads like a checklist: optional (, then exactly 3 digits, optional ), optional dash-or-space, 3 digits, another optional separator, then 4 digits. Notice how "555-1234" still matches: the ? on the parenthesis, the ? on the first separator, and the flexible spacing combine to accept formats the original pattern designer didn't explicitly list.
Example 6: Match variable-length words
When you need words of a specific length range, {n,m} is the right tool. This example extracts words that are 3 to 6 characters long.
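A sketch with an illustrative sentence containing the words the explanation mentions:

```r
library(stringr)

sentence <- "I am a data scientist doing regex analysis in R"

# \\b...\\b pins the quantifier to complete words of 3-6 characters
str_extract_all(sentence, "\\b\\w{3,6}\\b")[[1]]
```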
The \\b marks a word boundary (covered in the next H2). Without those boundaries, "scientist" would partially match because it contains 3-to-6-letter substrings. The {3,6} quantifier enforces complete words in that length range: "scientist" (9 letters) and "analysis" (8 letters) are excluded, and single-letter "I" and "R" don't hit the lower bound of 3.
Example 7: Greedy vs lazy extraction
This is where most regex beginners get tripped up. Greedy quantifiers grab the longest possible match; lazy quantifiers (add ? after the quantifier) grab the shortest. The difference is huge when extracting between delimiters.
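A minimal sketch of the contrast; the HTML-ish string is an illustrative assumption:

```r
library(stringr)

html <- "<b>bold</b> and <i>italic</i>"

str_extract(html, "<.*>")   # greedy: runs to the last >
str_extract(html, "<.*?>")  # lazy: stops at the first >
```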
The greedy .* gobbled everything from the first < to the final >. The lazy .*? stopped at the first > it could reach. When you are pulling content between repeated delimiters (HTML tags, quoted strings, bracketed sections), lazy quantifiers almost always give the answer you actually wanted.
Tip: When in doubt, add ? after any quantifier to make it lazy. You can always remove it if you genuinely need the longest match.
Example 8: Validate fixed-format codes
Some codes follow an exact shape: US ZIP codes, for instance, are 5 digits, optionally followed by a dash and 4 more. The {n} quantifier enforces an exact count.
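A sketch of the ZIP validation; the test vector is an illustrative assumption:

```r
library(stringr)

zips <- c("90210", "90210-1234", "902101234", "123")

# Exactly 5 digits, then an optional "-NNNN" suffix, anchored at both ends
str_detect(zips, "^\\d{5}(-\\d{4})?$")
```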
The pattern reads as "start, exactly 5 digits, optionally a dash followed by exactly 4 digits, end." Without the anchors, "902101234" would sneak through because it contains 5 consecutive digits; the anchors force the entire string to match the shape, which is the whole point of validation.
Try it: Extract the content between the first pair of square brackets in the string "[INFO] start [WARN] bad [ERROR] crash". Use a lazy quantifier so you only get the first bracketed word. Store the result in ex_bracket.
Click to reveal solution
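One solution consistent with the explanation that follows; using a str_match() capturing group to drop the brackets is an assumption (you could equally strip them in a second step):

```r
library(stringr)

log_line <- "[INFO] start [WARN] bad [ERROR] crash"

# Lazy .*? stops at the first ]; the group captures the content without brackets
ex_bracket <- str_match(log_line, "\\[(.*?)\\]")[, 2]
ex_bracket
```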
Explanation: \\[ and \\] match literal brackets (they're escaped because [ has a special meaning in regex). The .*? is a lazy "anything" that stops at the first closing bracket instead of running all the way to the last one.
How do anchors and boundaries pin patterns to positions?
Anchors don't match characters; they match positions. The caret ^ is "start of string," the dollar $ is "end of string," and \\b marks a word boundary (the position between a word character and a non-word character). Anchors are essential for validation, because without them a pattern can match anywhere inside a string and you get false positives.
Example 9: Detect strings starting with a capital letter
A single anchor at the start ensures your check applies to the beginning of the string, not just any position inside it.
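A minimal sketch; the name vector is an illustrative assumption:

```r
library(stringr)

people <- c("Alice", "bob", "Carol", "dave99")

# ^[A-Z] requires an uppercase letter at position zero
str_detect(people, "^[A-Z]")
```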
The pattern ^[A-Z] says "at position zero, there must be an uppercase letter." Drop the ^ and any string containing an uppercase letter anywhere would match, which is almost certainly not what you want. Position anchors are how you turn "contains" checks into "starts with" or "ends with" checks.
Example 10: Extract the last word of a sentence
The $ anchor pins the pattern to the end of the string. Combined with \\w+, it captures the final word in one move.
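A sketch with an illustrative sentence:

```r
library(stringr)

# \\w+$ grabs the run of word characters touching the end of the string
str_extract("The quick brown fox jumps over the lazy dog", "\\w+$")
```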
\\w+$ means "one or more word characters at the end of the string." This is cleaner than splitting on spaces and taking the last element. One caveat: the string must actually end with a word character. Trailing whitespace or punctuation (as in "dog.") makes \\w+$ fail to match, so trim first with str_trim(), or tolerate a final punctuation mark with a pattern like \\w+(?=[.!?]?$).
Example 11: Replace whole words only using boundaries
Word boundaries \\b prevent accidental partial matches. This is one of the most underused regex features, and forgetting it is the top cause of surprised bug reports on find-and-replace jobs.
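A sketch of the whole-word replacement; the sample sentence is an illustrative assumption:

```r
library(stringr)

text <- "The cat saw a caterpillar; concatenate the cat files"

# \\bcat\\b only matches "cat" as a standalone word
str_replace_all(text, "\\bcat\\b", "dog")
```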
Without \\b, the pattern "cat" matches inside "caterpillar" and "concatenate," producing nonsense. Adding \\b on each side pins the match to positions where the word actually starts and ends. Any time you do find-and-replace on English text, wrap the target in boundaries by default.
Tip: Wrapping a target as \\bword\\b prevents the "caterpillar problem": accidentally matching inside longer words that happen to contain your target as a substring.
Example 12: Validate email format
Combining anchors with character classes gives you a validation pattern. This example checks for a basic email shape (enough for catching obvious non-emails, not for spec-perfect validation).
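The exact pattern isn't shown, so here is one plausible reconstruction of the described shape; the allowed-character classes are assumptions:

```r
library(stringr)

emails <- c("alice@gmail.com", "bob.smith@uni.ac.uk",
            "not-an-email", "alice@gmail.com extra")

# chars before @, a dotted domain, a TLD of 2+ letters, anchored at both ends
str_detect(emails, "^[\\w.+-]+@[\\w.-]+\\.[a-zA-Z]{2,}$")
```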
The pattern breaks down as: one or more allowed characters before the @, a domain with dots, and a top-level domain of at least 2 letters. The ^ and $ anchors demand that the entire string match; without them, "alice@gmail.com extra" would slip through. Real-world email validation is far more complex (RFC 5322 is a monster), but this catches the 80% case.
Try it: Detect which of the strings c("running", "sing", "stop", "playing", "bring it") end with the letters "ing". Store the logical vector in ex_ing.
Click to reveal solution
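One solution consistent with the explanation that follows:

```r
library(stringr)

# "ing$" requires the three letters to sit at the very end of each string
ex_ing <- str_detect(c("running", "sing", "stop", "playing", "bring it"), "ing$")
ex_ing
```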
Explanation: The $ anchor pins "ing" to the very end of the string. "bring it" ends in "it", not "ing", so it returns FALSE even though it contains "ing" as a substring; the anchor blocks the mid-string match.
How do groups and backreferences capture subpatterns?
Parentheses () create capturing groups. Each group remembers the text it matched separately from the full match. str_match() returns a matrix where column 1 is the full match and columns 2+ are the groups. Backreferences like \\1 let you reuse a captured group inside a replacement string to rearrange text.
Example 13: Extract phone number parts with groups
When you need to split a match into components, groups do the work. Each parenthesised subpattern becomes its own column in the str_match() result.
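A sketch of the group-based split; the phone string is an illustrative assumption:

```r
library(stringr)

phone <- "(415) 555-1234"

# Each (\\d{3}) / (\\d{4}) group becomes its own column in the result matrix
parts <- str_match(phone, "\\((\\d{3})\\)[- ]?(\\d{3})[- ]?(\\d{4})")
parts
```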
Each (\\d{3}) captures exactly 3 digits. The non-capturing parts (parentheses, dashes, spaces) are matched but not stored: they shape the match without showing up as columns. This is the go-to technique whenever you have structured text and need the pieces, not just the whole.
Tip: When you need the pieces of a match, reach for str_match() rather than str_extract(), which only ever returns the full match as a plain character vector.
Example 14: Swap first and last names with backreferences
Backreferences let you rearrange captured groups inside a replacement string. \\1 refers to the first group, \\2 to the second, and so on.
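A sketch of the swap; the name is an illustrative assumption:

```r
library(stringr)

# Capture two words, then emit them in reverse order with a comma
str_replace("Ada Lovelace", "(\\w+) (\\w+)", "\\2, \\1")
```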
The pattern (\\w+) (\\w+) captures two words separated by a space. In the replacement "\\2, \\1" we put the second capture first, add a comma, then the first capture. This is a one-liner for reformatting name columns in data frames, and the pattern naturally scales to any "flip these two tokens" task.
Note: Use \\1 and \\2 this way only in the replacement argument. Inside a pattern, a backreference means "match the same text the group just captured", which is useful for detecting repeated words but is a different use case.
Example 15: Use alternation inside groups
The pipe | inside a group matches either alternative. Think of it as an OR operator for patterns.
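A sketch of the alternation check; the fruit vector is an illustrative assumption chosen so only "grape" fails:

```r
library(stringr)

fruits <- c("strawberry", "blueberry", "orange", "lime", "grapefruit", "grape")

# Matches if any alternative appears anywhere in the string
str_detect(fruits, "(berry|orange|lemon|lime|grapefruit)")
```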
The group (berry|orange|lemon|lime|grapefruit) matches if any alternative appears. "strawberry" and "blueberry" match because they contain "berry"; only "grape" fails because none of the listed terms appear in it. Alternation is how you collapse several related checks into one pattern.
Example 16: Use non-capturing groups for cleaner patterns
Sometimes you need grouping for alternation or quantifiers but don't want that group to show up as a capture. The syntax (?:...) creates a non-capturing group.
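A sketch of the URL parsing the next paragraph describes; the URL vector and the domain class are assumptions:

```r
library(stringr)

urls <- c("https://example.com", "http://test.org", "ftp://files.net")

# (https?) captures the protocol; ([\\w.-]+) captures the domain
str_match(urls, "^(https?)://([\\w.-]+)")
```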
Here (https?) captures "http" or "https" as a group, with the trailing ? making the "s" optional. The result has two clean columns: protocol and domain. The FTP URL returns NA because it doesn't match the pattern. If we only needed grouping for the alternation and didn't care about capturing, we could have written (?:https?) instead, but here we do want the protocol back, so a capturing group is the right call.
Try it: Extract the file extension from "report_final.tar.gz", the last extension only. Use a capturing group and store the captured extension (not the leading dot) in ex_ext.
Click to reveal solution
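One solution consistent with the explanation that follows:

```r
library(stringr)

# Anchor at $ so only the final ".ext" matches; the group drops the dot
ex_ext <- str_match("report_final.tar.gz", "\\.([a-zA-Z0-9]+)$")[, 2]
ex_ext
```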
Explanation: The $ anchor forces the match to the end of the string so we only get the final extension, not "tar". The literal \\. matches the last dot, and the capturing group ([a-zA-Z0-9]+) grabs the characters after it. Column 2 of the str_match() result is the captured extension without the dot.
How do lookarounds match without consuming text?
Lookaround assertions check what comes before or after a position without including it in the match. There are four flavours: positive lookahead (?=...) (must be followed by), negative lookahead (?!...) (must NOT be followed by), positive lookbehind (?<=...) (must be preceded by), and negative lookbehind (?<!...) (must NOT be preceded by).
Think of a lookaround as a security guard checking your ID at a door. The guard looks at the ID but doesn't take it from you; the match only includes what's outside the lookaround. This "check but don't consume" behaviour is what makes lookarounds so useful for context-sensitive extraction.
Example 17: Extract dollar amounts using lookbehind
Lookbehinds let you match text that follows a specific prefix without including the prefix in the result.
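A sketch of the lookbehind extraction; the receipt string is an illustrative assumption:

```r
library(stringr)

receipt <- "Laptop $299.99, mouse $15, monitor €199"

# (?<=\\$) requires a $ immediately before, without including it in the match
str_extract_all(receipt, "(?<=\\$)\\d+(\\.\\d{2})?")[[1]]
```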
The (?<=\\$) assertion says "there must be a dollar sign immediately before this position." The dollar sign is checked but not included in the extracted text. The Euro amount is skipped because it lacks the $ prefix. This is much cleaner than extracting "$299.99" and then stripping the $ in a second step.
Example 18: Find words followed by a comma using lookahead
Lookaheads check what comes after the current position without consuming it.
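A sketch with an illustrative comma-separated record:

```r
library(stringr)

record <- "name, address, phone, date"

# \\w+(?=,) matches a word only if a comma follows it
str_extract_all(record, "\\w+(?=,)")[[1]]
```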
The pattern \\w+(?=,) matches one or more word characters that are followed by a comma; the comma is checked but not included in the match. "date" is excluded because nothing follows it. This is the right tool for context-aware extraction: when you care about a word's neighbour but don't want that neighbour in your result.
Example 19: Match numbers not preceded by a minus sign
Negative lookbehinds exclude matches that have a specific prefix.
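A sketch with an illustrative string containing the numbers the explanation mentions:

```r
library(stringr)

readings <- "values: 42, -7, 100, -3, 88"

# (?<!-) rejects digit runs immediately preceded by a minus sign
str_extract_all(readings, "(?<!-)\\b\\d+")[[1]]
```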
(?<!-)\\b\\d+ says "match a run of digits at a word boundary, but only if there is no minus sign immediately before." The negative lookbehind rejects -7 and -3 while keeping 42, 100, and 88. The \\b is still important: without it, the engine would happily match the "7" in "-7" starting from after the minus sign.
Note: ICU supports fixed-length lookbehinds like (?<=\\$) or (?<!-), but variable-length patterns like (?<=\\$+), with a quantifier inside the lookbehind, will error. If you need variable-length context, restructure the pattern or use multiple str_detect() calls.
Example 20: Validate password strength with multiple lookaheads
You can stack multiple lookaheads to enforce several conditions from the same starting position. This is the classic regex technique for multi-rule validation.
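A sketch of stacked lookaheads; the password vector and the exact special-character set [!@#$%^&*] are assumptions:

```r
library(stringr)

passwords <- c("Abcdef1!", "abcdefg1", "ABCDEF!!", "Ab1!")

# Three lookaheads (uppercase, digit, special) plus a minimum length of 8
str_detect(passwords, "^(?=.*[A-Z])(?=.*\\d)(?=.*[!@#$%^&*]).{8,}$")
```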
Each (?=.*[X]) lookahead asserts "somewhere in this string, there is a character matching [X]." The final .{8,} requires at least 8 characters total. Because lookaheads don't consume text, all three checks fire from the same starting position; they stack without interfering. Only "Abcdef1!" satisfies every rule.
Tip: If a stacked-lookahead pattern gets unwieldy, split it into separate str_detect() calls and combine them with &. A series of small checks is easier to read, debug, and unit-test than one giant regex.
Try it: Extract every word that is immediately followed by a question mark in the string "Why? How? When do we start? Now!". Store the result in ex_qwords.
Click to reveal solution
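One solution consistent with the explanation that follows:

```r
library(stringr)

# \\w+(?=\\?) matches a word only where a literal ? follows it
ex_qwords <- str_extract_all("Why? How? When do we start? Now!", "\\w+(?=\\?)")[[1]]
ex_qwords
```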
Explanation: \\w+(?=\\?) matches a run of word characters, but only where a question mark immediately follows. The lookahead (?=\\?) checks for the ? without including it in the match. "Now" is excluded because it is followed by !, not ?.
Practice Exercises
These capstone exercises combine multiple techniques from the tutorial. They're harder than the inline "Try it" prompts; expect to use 2-3 regex concepts per solution. Every exercise is solvable with only what was taught above.
Exercise 1: Extract all 4-digit years from mixed text
Pull every 4-digit number (likely a year) out of a sentence. Make sure your pattern rejects 2-digit and 6-digit numbers even if they appear nearby. Store the result in my_years.
Click to reveal solution
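One solution consistent with the explanation that follows; the sample sentence is an illustrative assumption:

```r
library(stringr)

sentence <- "Founded in 1999 and expanded in 2015; see ref 123456 and note 42"

# Word boundaries reject the 4-digit prefix of longer numbers like 123456
my_years <- str_extract_all(sentence, "\\b\\d{4}\\b")[[1]]
my_years
```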
Explanation: \\b\\d{4}\\b matches exactly 4 digits surrounded by word boundaries. Without boundaries, the pattern would greedily match the first four digits of "123456" and return "1234". "42" is also rejected because it has only 2 digits, not the required 4.
Exercise 2: Parse product codes into components
Product codes follow the format CAT-1234-XL: a 2-3 letter category, dash, 4 digits, dash, 1-3 letter size. Extract all three components into separate columns using str_match(). Store the result in my_parts.
Click to reveal solution
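One solution consistent with the explanation that follows; the code vector is an illustrative assumption:

```r
library(stringr)

codes <- c("CAT-1234-XL", "AB-5678-S", "bad-code")

# Anchored pattern with three capturing groups: category, number, size
my_parts <- str_match(codes, "^([A-Z]{2,3})-(\\d{4})-([A-Z]{1,3})$")
my_parts
```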
Explanation: Three capturing groups split the code into pieces. ([A-Z]{2,3}) captures 2-3 uppercase letters, (\\d{4}) captures exactly 4 digits, and ([A-Z]{1,3}) captures 1-3 uppercase letters. The ^ and $ anchors make sure the whole string matches, so a malformed code wouldn't partially succeed.
Exercise 3: Extract domain names from email addresses
Given a vector of email addresses, extract just the domain (everything after the @) using a lookbehind. Store the result in my_domains.
Click to reveal solution
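One solution consistent with the explanation that follows; the email vector is an illustrative assumption:

```r
library(stringr)

addresses <- c("alice@gmail.com", "bob@uni.ac.uk")

# (?<=@) asserts an @ just before the match without consuming it
my_domains <- str_extract(addresses, "(?<=@)[\\w.-]+")
my_domains
```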
Explanation: The positive lookbehind (?<=@) asserts the @ must precede the match without including it in the result. Then [\\w.-]+ grabs the domain characters (letters, digits, underscores, dots, hyphens). The @ itself is never in the output because lookarounds don't consume text.
Complete Example
Let's combine every technique from this tutorial in a realistic pipeline. You have a messy data frame of customer records where each row is a single pipe-delimited string, and you need to extract, validate, and clean several fields at once.
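The original pipeline isn't shown, so here is a sketch reconstructing the kind of six-mutate() pipeline described; the record strings, column names, and exact patterns are all assumptions (dplyr is assumed alongside stringr):

```r
library(stringr)
library(dplyr)

# Hypothetical pipe-delimited customer records
records <- tibble(raw = c(
  "John Smith|john@mail.com|(555) 123-4567|$250.00",
  "Jane Doe|jane AT mail.com|555-987-6543|$99.50",
  "Bob Lee|bob@web.org|5551112222|Free"
))

clean <- records %>%
  mutate(
    name     = str_extract(raw, "^[A-Za-z ]+"),                          # character class
    email    = str_extract(raw, "[\\w.+-]+@[\\w.-]+\\.[a-zA-Z]{2,}"),    # classes + quantifiers
    email_ok = !is.na(email),                                            # Jane fails: "AT" isn't @
    phone    = str_extract(raw, "\\(?\\d{3}\\)?[- ]?\\d{3}[- ]?\\d{4}"), # flexible quantifiers
    phone_ok = str_detect(raw, "(\\(\\d{3}\\) |\\d{3}-)\\d{3}-\\d{4}"),  # group + alternation, strict formats
    amount   = str_extract(raw, "(?<=\\$)\\d+\\.?\\d*")                  # lookbehind drops the $
  )
clean
```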
Every technique from the tutorial earns its keep here. Character classes extract names and email characters, quantifiers match flexible phone formats, groups and anchors validate strict phones, and a lookbehind pulls out the dollar amounts without the $ prefix. Jane's email fails validation because "AT" isn't @. Bob's amount is NA because "Free" has no dollar sign. The pipeline does in six mutate() lines what would take pages of manual string handling.
Summary
| Regex Concept | Key Syntax | Best stringr Function | When to Use |
|---|---|---|---|
| Character classes | [a-z], \\d, \\w, \\s | str_extract(), str_replace_all() | Match specific character types |
| Quantifiers | ?, +, *, {n,m}, *? | str_extract(), str_detect() | Control how many characters match |
| Anchors & boundaries | ^, $, \\b | str_detect(), str_replace_all() | Pin patterns to positions, validate formats |
| Groups & backreferences | (), \\1, (?:) | str_match(), str_replace() | Capture subpatterns, rearrange text |
| Lookaround assertions | (?=), (?!), (?<=), (?<!) | str_extract(), str_detect() | Match context without consuming it |
The mental model: regex describes what text looks like, and stringr gives you consistent, pipe-friendly functions to act on those descriptions. Start with the simple families (character classes + quantifiers) and reach for the richer ones (anchors, groups, lookarounds) only when you actually need the extra power.
References
- Wickham, H. & Grolemund, G., R for Data Science, 2nd Edition. Chapter 15: Regular Expressions.
- Wickham, H., stringr: Simple, Consistent Wrappers for Common String Operations. CRAN.
- stringr documentation, Regular Expressions vignette.
- R Core Team, Regular Expressions as used in R (?regex).
- Friedl, J.E.F., Mastering Regular Expressions, 3rd Edition. O'Reilly (2006).
- Sanchez, G., Handling Strings with R. Chapter 15: Boundaries and Lookarounds.
- ICU Regular Expressions documentation.
Continue Learning
- stringr in R: The parent tutorial covering the 15 core stringr functions. If you want the full picture of string manipulation beyond regex, start here.
- lubridate in R: Dates are the other common "messy text" problem. Learn how lubridate parses, extracts, and computes with dates and times.