R Regex Cheat Sheet: 30 Patterns With stringr Examples, Copy and Paste
Copy-paste regex pattern library for R: 30 patterns across six categories, each paired with a runnable stringr example and the output it produces.
How Do You Match Literal Text and Escape Metacharacters in R?
Regex starts with matching what you can see: letters, digits, and punctuation. Letters and numbers match themselves, but characters like ., $, and ( have special regex meaning and need escaping. In R strings you double the backslash: write \\. to match a literal period. Here are the five foundational literal-match patterns with runnable examples.
| # | Pattern | Regex | Description |
|---|---|---|---|
| 1 | Literal text | `abc` | Matches the exact characters "abc" |
| 2 | Any character | `.` | Matches any single character except newline |
| 3 | Escaped dot | `\\.` | Matches a literal period |
| 4 | Escaped backslash | `\\\\` | Matches a literal backslash |
| 5 | Escaped special | `\\$` | Matches a literal dollar sign, bracket, etc. |
The first code block loads stringr, creates the shared texts vector used throughout this cheat sheet, and runs three of the five literal-match patterns so you can see the output immediately.
str_detect() returns TRUE only for "Email: bob@mail.com" because that is the one string containing the literal word. The "P..ce" pattern matches "Price" because each . stands in for exactly one character. The final pattern "\\.\\d+" finds a literal dot followed by digits and pulls out the ".99" fraction from the price string.
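The three calls described above can be sketched as follows. The `texts` vector here is an assumed stand-in with made-up values (only the "Email: bob@mail.com" string comes from the text), so treat it as an illustration rather than the exact data used elsewhere in this sheet.

```r
library(stringr)

# Assumed sample data: a hypothetical stand-in for the shared `texts` vector
texts <- c("Order #1234", "Price: $19.99", "Email: bob@mail.com", "Phone: 555-0142")

# Pattern 1: literal text matches itself
has_email <- str_detect(texts, "Email")
has_email
#> [1] FALSE FALSE  TRUE FALSE

# Pattern 2: each . stands in for exactly one character
price_word <- str_extract(texts, "P..ce")
price_word
#> [1] NA      "Price" NA      NA

# Pattern 3: a literal dot followed by one or more digits
fraction <- str_extract(texts, "\\.\\d+")
fraction
#> [1] NA    ".99" NA    NA
```

Note that "Phone" does not match `P..ce` because its fourth character is "n", not "c".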
Write \\d in R where other languages write \d. The first backslash escapes the second for R's string parser; the second backslash reaches the regex engine. A single backslash in an ordinary string literal gives an "unrecognized escape" error.
Try it: Write a str_detect() call that returns TRUE only for strings containing a literal $ sign. Test it on ex_prices.
Explanation: $ is a regex anchor meaning "end of string", so you must escape it with \\$ to match the literal character. Without the escape, the regex engine would try to match an empty position at the end of every string.
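One possible solution is sketched below. The contents of `ex_prices` are hypothetical, chosen only so that some strings contain a dollar sign and some do not.

```r
library(stringr)

# Hypothetical test data for the exercise
ex_prices <- c("$5.00", "5 dollars", "EUR 12", "Total: $40")

# Escaped: matches only a literal dollar sign
has_dollar <- str_detect(ex_prices, "\\$")
has_dollar
#> [1]  TRUE FALSE FALSE  TRUE

# Unescaped: $ is the end-of-string anchor, so every string matches
anchor_only <- str_detect(ex_prices, "$")
anchor_only
#> [1] TRUE TRUE TRUE TRUE

# fixed() treats the whole pattern as literal text, so no escape is needed
has_dollar_fixed <- str_detect(ex_prices, fixed("$"))
```

`fixed()` is a reasonable alternative when the whole pattern is literal and you do not need any regex features.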
How Do Character Classes Group Related Characters?
Character classes match one character from a defined set. Square brackets create custom sets like [aeiou]. Shorthand classes like \\d save typing for common categories: digits, word characters, and whitespace.
| # | Pattern | Regex | Description |
|---|---|---|---|
| 6 | Custom set | `[aeiou]` | Matches any one character in the set |
| 7 | Range | `[a-z]` | Matches any lowercase letter |
| 8 | Negated set | `[^0-9]` | Matches any character NOT in the set |
| 9 | Digit shorthand | `\\d` | Matches any digit (same as [0-9]) |
| 10 | Word shorthand | `\\w` | Matches a letter, digit, or underscore |
| 11 | Whitespace shorthand | `\\s` | Matches a space, tab, or newline |
| 12 | POSIX alpha | `[[:alpha:]]` | Matches any letter (locale-aware) |
Each shorthand class has an uppercase negation: \\D matches non-digits, \\W matches non-word characters, and \\S matches non-whitespace. The next code block demonstrates the four most common patterns on our sample data plus a messy phone string.
The \\d+ pattern finds the first digit run in each string: "1234" in the order number, "19" before the price decimal, and "555" in the phone number. The negated set [^0-9] in the last call matches every non-digit character, so replacing those matches with an empty string leaves a clean 10-digit phone number. This is one of the most common data-cleaning patterns in R.
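A minimal sketch of the digit-extraction and digit-stripping calls follows. Both `texts` and `messy_phone` are assumed values invented for illustration.

```r
library(stringr)

# Assumed sample data (hypothetical values)
texts <- c("Order #1234", "Price: $19.99", "Email: bob@mail.com", "Phone: 555-0142")
messy_phone <- "(555) 123-4567"

# \\d+ returns the first run of digits in each string, or NA if there is none
first_digits <- str_extract(texts, "\\d+")
first_digits
#> [1] "1234" "19"   NA     "555"

# [^0-9] matches every non-digit; replacing with "" keeps only the digits
clean_phone <- str_replace_all(messy_phone, "[^0-9]", "")
clean_phone
#> [1] "5551234567"

# \\D is the shorthand negation of \\d and gives the same result here
clean_phone_short <- str_replace_all(messy_phone, "\\D", "")
```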
Write [[:digit:]], not [:digit:]. The outer brackets define the character class; the inner [:digit:] is the POSIX name. Forgetting the outer brackets is not always an error: depending on the regex engine, [:digit:] may be read as the plain set {:, d, i, g, t}, which produces a subtle wrong-match bug. The double-bracket form is the portable one.
Try it: Use a character class to extract every letter (upper or lower case) from ex_noise, returning them in a single vector.
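Explanation: The range [a-zA-Z] covers both lowercase and uppercase letters. str_extract_all() returns a list with one element per input string, and each element is a character vector of every match in that string; index with [[1]] or call unlist() to get a single vector. You could also write [[:alpha:]] for a locale-aware version.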
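A short sketch of this solution, using a made-up `ex_noise` string as an assumption:

```r
library(stringr)

# Hypothetical noisy input
ex_noise <- "a1!B2?c3"

# str_extract_all() returns a list; [[1]] pulls out the vector for the first string
letters_found <- str_extract_all(ex_noise, "[a-zA-Z]")[[1]]
letters_found
#> [1] "a" "B" "c"
```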
How Do Quantifiers Control Pattern Repetition?
Quantifiers tell the regex engine how many times to repeat the preceding element. By default, quantifiers are greedy: they match as much as possible. Adding ? after a quantifier makes it lazy, matching as little as possible.
| # | Pattern | Regex | Description |
|---|---|---|---|
| 13 | Zero or one | `?` | Matches 0 or 1 of the preceding element |
| 14 | One or more | `+` | Matches 1 or more (greedy) |
| 15 | Zero or more | `*` | Matches 0 or more (greedy) |
| 16 | Exact count | `{3}` | Matches exactly 3 repetitions |
| 17 | N or more | `{2,}` | Matches 2 or more repetitions |
| 18 | Range | `{2,4}` | Matches between 2 and 4 repetitions |
| 19 | Lazy one-or-more | `+?` | Matches 1 or more (as few as possible) |
Let's see how quantifiers affect extraction on phone numbers and HTML, the two classic examples where greediness catches people off guard.
The greedy <.+> swallows everything from the first < to the last >, one huge match. The lazy <.+?> stops at the first > it finds, returning just the opening <b> tag. This is the single most common regex surprise, and it's also why many HTML-scraping bugs exist.
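The greedy and lazy behavior described above can be reproduced with a small sketch. The `html` and phone strings are assumed examples.

```r
library(stringr)

# Hypothetical HTML fragment
html <- "<b>bold</b> and <i>italic</i>"

# Greedy: runs from the first < to the last >
greedy <- str_extract(html, "<.+>")
greedy
#> [1] "<b>bold</b> and <i>italic</i>"

# Lazy: stops at the first > it can reach
lazy <- str_extract(html, "<.+?>")
lazy
#> [1] "<b>"

# Exact counts and an optional hyphen on phone-like strings
full_number <- str_extract("Call 555-123-4567 today", "\\d{3}-\\d{3}-\\d{4}")
seven_digit <- str_detect(c("555-1234", "5551234"), "^\\d{3}-?\\d{4}$")
```

Here `full_number` is "555-123-4567", and `seven_digit` is TRUE for both inputs because `-?` allows the hyphen to be present or absent.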
If your regex returns too much, add ? after the quantifier. If it returns too little, remove the ?. This one rule explains most "why is my regex returning weird results?" bugs.
Try it: Extract every 4-digit year from ex_years as a character vector.
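Explanation: The exact-count quantifier {4} matches four consecutive digits, so runs shorter than four are skipped. On its own, however, \\d{4} will also match the first four digits of a longer run, such as a five-digit zip code. To accept only standalone four-digit numbers, wrap the pattern in word boundaries: \\b\\d{4}\\b. Either form is more precise than \\d+ when you specifically want 4-digit years.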
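The sketch below shows one way to approach this exercise on a made-up `ex_years` string. It also shows why word boundaries matter: `\\d{4}` alone picks up the first four digits of a longer run.

```r
library(stringr)

# Hypothetical input mixing years, a zip code, and a short number
ex_years <- "Founded 1998, rebuilt 2015, zip 90210, room 42"

# Without boundaries, the zip code contributes a partial match
loose <- str_extract_all(ex_years, "\\d{4}")[[1]]
loose
#> [1] "1998" "2015" "9021"

# With word boundaries, only standalone four-digit runs match
strict <- str_extract_all(ex_years, "\\b\\d{4}\\b")[[1]]
strict
#> [1] "1998" "2015"
```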
How Do Anchors Pin Patterns to String Positions?
Anchors match a position, not a character. They answer "where in the string?" without consuming any text. The caret ^ pins a pattern to the start. The dollar sign $ pins it to the end. Word boundaries \\b pin a pattern to the edge of a word.
| # | Pattern | Regex | Description |
|---|---|---|---|
| 20 | Start of string | `^` | Matches the beginning of the string |
| 21 | End of string | `$` | Matches the end of the string |
| 22 | Word boundary | `\\b` | Matches the position between a word and non-word char |
| 23 | Non-word boundary | `\\B` | Matches a position NOT at a word edge |
Anchors are essential for validation. Want to check if a string starts with a digit? Use ^\\d. Want to confirm a filename ends in .csv? Use \\.csv$.
Without anchors, "app" would match anywhere inside a string. The word-boundary pattern \\bapp\\b requires "app" to be a complete word, not part of "apple" or "application", so only the standalone "app" returns TRUE. Combining ^ and $ creates an exact-match test, a common technique for validation.
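A small sketch of these anchor patterns, using an assumed `words` vector:

```r
library(stringr)

# Hypothetical inputs
words <- c("app", "apple", "application", "happy")

# Unanchored: "app" is found anywhere inside each string
anywhere <- str_detect(words, "app")
anywhere
#> [1] TRUE TRUE TRUE TRUE

# Word boundaries: "app" must be a complete word
whole_word <- str_detect(words, "\\bapp\\b")
whole_word
#> [1]  TRUE FALSE FALSE FALSE

# ^ and $ together: the entire string must be exactly "app"
exact <- str_detect(words, "^app$")

# Validation-style check: does the string start with a digit?
starts_with_digit <- str_detect(c("9 lives", "cat"), "^\\d")
starts_with_digit
#> [1]  TRUE FALSE
```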
Outside brackets, ^ is an anchor meaning "start of string". Inside brackets, [^abc] means negation: any character that is NOT a, b, or c. Mixing these up produces silently wrong results, not errors.
Try it: Return a logical vector indicating which filenames in ex_files end with the .csv extension (escape the dot properly).
Explanation: \\. matches a literal period (the escape prevents it from matching any character), csv matches the literal extension, and $ anchors the match to the end of the string. Without the $, a name like "backup.csv.bak" would also match, because ".csv" appears in the middle; without the \\., "summaryXcsv" would slip through.
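The effect of each part of the pattern can be checked with a sketch like the one below. The filenames in `ex_files` are hypothetical, picked to trigger each failure mode.

```r
library(stringr)

# Hypothetical filenames
ex_files <- c("data.csv", "report.pdf", "csvfile.txt", "summaryXcsv", "backup.csv.bak")

# Correct: literal dot, then csv, anchored to the end
is_csv <- str_detect(ex_files, "\\.csv$")
is_csv
#> [1]  TRUE FALSE FALSE FALSE FALSE

# Missing $: ".csv" in the middle of a name also matches
no_anchor <- str_detect(ex_files, "\\.csv")
no_anchor
#> [1]  TRUE FALSE FALSE FALSE  TRUE

# Missing escape: . matches any character, so "Xcsv" passes
no_escape <- str_detect(ex_files, ".csv$")
no_escape
#> [1]  TRUE FALSE FALSE  TRUE FALSE
```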
How Do You Capture Groups and Alternate Patterns?
Groups wrap part of a pattern in parentheses. Capturing groups () let you extract submatches. Non-capturing groups (?:) organize patterns without capturing. The alternation operator | means "this or that".
| # | Pattern | Regex | Description |
|---|---|---|---|
| 24 | Capturing group | `(\\d{4})` | Captures matched text for extraction |
| 25 | Non-capturing group | `(?:ab)+` | Groups without capturing (for quantifiers) |
| 26 | Backreference | `(\\w+) \\1` | Matches a repeated word |
| 27 | Alternation | `cat\|dog` | Matches "cat" or "dog" |
Use str_match() instead of str_extract() when you need captured group contents. str_match() returns a matrix with the full match in column 1 and each captured group in the following columns.
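The backreference \\1 refers to whatever the first group captured. In the typo detector, (\\w+) \\1 matches any run of word characters followed by a space and the same text again, which makes it a lightweight duplicate-word finder. Because the pattern has no word boundaries, it can also fire inside words ("this is" contains "is is"); \\b(\\w+) \\1\\b is stricter. The str_match() call returns a matrix so you can index columns: [, 2] gives all years, [, 3] gives all months, and so on.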
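As a rough illustration of capturing groups, backreferences, and alternation, consider the sketch below. The `dates` and `typos` vectors are assumed sample data.

```r
library(stringr)

# Hypothetical ISO-style dates
dates <- c("2024-01-15", "2023-12-31")

# Column 1 is the full match; columns 2-4 are the three captured groups
date_parts <- str_match(dates, "(\\d{4})-(\\d{2})-(\\d{2})")
years  <- date_parts[, 2]
months <- date_parts[, 3]
years
#> [1] "2024" "2023"
months
#> [1] "01" "12"

# Backreference: \\1 must repeat whatever group 1 captured
typos <- c("the the cat", "a clean line")
has_duplicate <- str_detect(typos, "(\\w+) \\1")
has_duplicate
#> [1]  TRUE FALSE

# Alternation: either branch may match
is_pet <- str_detect(c("cat", "dog", "bird"), "cat|dog")
is_pet
#> [1]  TRUE  TRUE FALSE
```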
str_extract() always returns only the complete match text; your capturing groups get discarded. If you need the year, month, and day from a date pattern as separate values, str_match() gives you each group in its own column.
Try it: Extract just the 3-digit area code from ex_phone using a capturing group and str_match().
Explanation: \\( and \\) match literal parentheses (both are regex metacharacters). The capturing group (\\d{3}) captures the three digits between them. Indexing [, 2] pulls column 2 of the match matrix, which holds the first captured group: the area code without the parentheses.
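A sketch of this solution, with hypothetical values for `ex_phone`:

```r
library(stringr)

# Hypothetical phone numbers with the area code in parentheses
ex_phone <- c("(555) 123-4567", "(212) 555-0199")

# Column 2 of the str_match() matrix holds the first captured group
area_code <- str_match(ex_phone, "\\((\\d{3})\\)")[, 2]
area_code
#> [1] "555" "212"
```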
How Do Lookarounds Match Without Consuming Text?
Lookarounds are zero-width assertions. They check what is next to a position without including it in the match. A lookahead checks what follows. A lookbehind checks what precedes. Both are powerful for extracting text next to a known marker without including the marker itself.
| # | Pattern | Regex | Description |
|---|---|---|---|
| 28 | Positive lookahead | `(?=...)` | Asserts what follows matches |
| 29 | Negative lookahead | `(?!...)` | Asserts what follows does NOT match |
| 30 | Positive lookbehind | `(?<=...)` | Asserts what precedes matches |
These are most useful when you want to grab text adjacent to a delimiter, like the digits after a $ sign or the word before a colon, without pulling the delimiter into the result.
The lookbehind (?<=\\$) positions the match right after a dollar sign; the $ is checked but never included in the extracted text, so the result is a clean numeric string. The lookahead (?=:) works the same way but on the right: it matches a word only if a colon follows immediately.
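A minimal sketch of both assertions, assuming a made-up `prices` vector:

```r
library(stringr)

# Hypothetical labeled prices
prices <- c("Cost: $19.99", "Fee: $5")

# Lookbehind: digits (with optional decimals) that come right after a $
amounts <- str_extract(prices, "(?<=\\$)\\d+\\.?\\d*")
amounts
#> [1] "19.99" "5"

# Lookahead: a word that is immediately followed by a colon
labels <- str_extract(prices, "\\w+(?=:)")
labels
#> [1] "Cost" "Fee"
```

Neither the dollar sign nor the colon appears in the results, because lookarounds check the neighboring text without consuming it.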
Lookbehinds must have a bounded length. stringr's ICU engine accepts (?<=\\$) (one character) but rejects lookbehinds whose length is unbounded, such as (?<=\\$\\s*), which uses *. If the text before your target can vary without limit, match the prefix normally and pull out the part you want with a capturing group and str_match().
Try it: Extract the label (the word before =) from each string in ex_labels.
Explanation: \\w+ matches one or more word characters, and the lookahead (?==) requires an = to follow without including it in the match. The two equals signs look odd but the first is the literal character inside the lookahead (?=...).
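The lookahead solution can be sketched like this, using hypothetical key=value strings for `ex_labels`:

```r
library(stringr)

# Hypothetical key=value strings
ex_labels <- c("name=Alice", "age=30")

# \\w+ must be followed by =, but the = is not part of the match
label_names <- str_extract(ex_labels, "\\w+(?==)")
label_names
#> [1] "name" "age"
```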
Practice Exercises
Exercise 1: Validate email addresses
Given a vector of strings, return a logical vector marking which ones look like valid email addresses. A valid email has word characters, an @, more word characters, an escaped dot, and a 2-4 letter extension, all anchored from start to end.
Explanation: ^[\\w.]+ requires the string to start with one or more word characters or dots (the username). @ matches the literal separator. [\\w.]+\\. matches the domain name followed by a literal dot. [a-zA-Z]{2,4}$ matches a 2-4 letter top-level domain anchored at the end. Real-world email validation is much more complex, but this catches the common structural errors.
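Assembling the pieces from the explanation gives a pattern like the one below. The `emails` vector is invented test data, and the pattern is a structural check only, not a full email validator.

```r
library(stringr)

# Hypothetical inputs: two plausible addresses and two malformed ones
emails <- c("bob@mail.com", "no-at-sign.com", "a@b", "jane.doe@uni.edu")

# username @ domain . 2-4 letter extension, anchored at both ends
email_pattern <- "^[\\w.]+@[\\w.]+\\.[a-zA-Z]{2,4}$"

looks_valid <- str_detect(emails, email_pattern)
looks_valid
#> [1]  TRUE FALSE FALSE  TRUE
```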
Exercise 2: Parse URLs into scheme, host, and path
Given a vector of URLs, use capturing groups and str_match() to pull the scheme (http or https), the host, and the path into a matrix. Store the result in my_parts.
Explanation: (https?) captures the scheme; the ? makes the s optional. :// matches the separator literally. ([^/]+) captures the host by greedily matching any character that is not a forward slash. (/.*) captures everything from the first slash onward as the path. Each captured group appears in its own column in the matrix.
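Put together, the three groups can be tried on a couple of assumed URLs:

```r
library(stringr)

# Hypothetical URLs
urls <- c("https://example.com/docs/index.html", "http://r-project.org/")

# Column 1: full match; columns 2-4: scheme, host, path
my_parts <- str_match(urls, "(https?)://([^/]+)(/.*)")

scheme <- my_parts[, 2]
host   <- my_parts[, 3]
path   <- my_parts[, 4]
host
#> [1] "example.com"   "r-project.org"
path
#> [1] "/docs/index.html" "/"
```

With this pattern a URL that has no path at all (no slash after the host) does not match, so its row would be all NA; making the last group optional with `(/.*)?` is one way to relax that.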
Exercise 3: Clean and reformat phone numbers
Given a vector of messy phone-number strings, extract only the digits, then reformat to the standard XXX-XXX-XXXX pattern. Assume every input has exactly 10 digits.
Explanation: str_replace_all(..., "[^0-9]", "") strips every non-digit character, leaving a clean 10-digit string. The second call uses three capturing groups (\\d{3})(\\d{3})(\\d{4}) to split the digits and backreferences \\1, \\2, \\3 in the replacement to insert dashes between them. This is the idiomatic "clean then reformat" pattern for phone numbers.
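The two-step approach can be sketched as follows. The `messy` vector is made-up input in which every entry has exactly 10 digits, as the exercise assumes.

```r
library(stringr)

# Hypothetical messy inputs, each containing exactly 10 digits
messy <- c("(555) 123-4567", "555.123.4567", "555 123 4567")

# Step 1: strip everything that is not a digit
digits <- str_replace_all(messy, "[^0-9]", "")

# Step 2: split into 3-3-4 with groups and rejoin with dashes
formatted <- str_replace(digits, "(\\d{3})(\\d{3})(\\d{4})", "\\1-\\2-\\3")
formatted
#> [1] "555-123-4567" "555-123-4567" "555-123-4567"
```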
Putting It All Together
Let's combine multiple patterns in a realistic task: extracting structured data from messy server log entries into a clean data frame.
This single example uses six different pattern families from the cheat sheet: exact-count quantifiers, capturing groups, alternation, character classes, lookbehinds, and lookaheads. Each str_extract() or str_match() call targets one field. The result is a tidy data frame ready for filtering, grouping, or plotting.
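A sketch of such a pipeline is shown below. The log format, field names, and the `logs` vector are all assumptions made for illustration; real server logs will need patterns adapted to their own layout.

```r
library(stringr)

# Hypothetical log lines in an assumed format
logs <- c(
  "2024-03-01 12:00:05 ERROR user=alice status=500 took 120ms",
  "2024-03-01 12:00:09 INFO user=bob status=200 took 35ms"
)

# Exact-count quantifiers + capturing groups: column 1 is the whole date
date_parts <- str_match(logs, "(\\d{4})-(\\d{2})-(\\d{2})")

log_df <- data.frame(
  date   = date_parts[, 1],
  month  = date_parts[, 3],
  level  = str_extract(logs, "ERROR|WARN|INFO"),                  # alternation
  user   = str_extract(logs, "(?<=user=)[a-z]+"),                 # lookbehind + class
  status = as.integer(str_extract(logs, "(?<=status=)\\d{3}")),   # lookbehind + count
  ms     = as.integer(str_extract(logs, "\\d+(?=ms)")),           # lookahead
  stringsAsFactors = FALSE
)
log_df
```

Each column comes from one targeted call, which keeps the individual patterns short and easy to test on their own.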
Summary
Here is the complete 30-pattern reference in one table, sorted by category.
| # | Category | Pattern | Regex | What It Matches |
|---|---|---|---|---|
| 1 | Literal | Literal text | `abc` | Exact characters |
| 2 | Literal | Any character | `.` | Any char except newline |
| 3 | Literal | Escaped dot | `\\.` | Literal period |
| 4 | Literal | Escaped backslash | `\\\\` | Literal backslash |
| 5 | Literal | Escaped special | `\\$` | Literal dollar sign |
| 6 | Class | Custom set | `[aeiou]` | One char from the set |
| 7 | Class | Range | `[a-z]` | Any lowercase letter |
| 8 | Class | Negated set | `[^0-9]` | Any char NOT in set |
| 9 | Class | Digit | `\\d` | Any digit |
| 10 | Class | Word char | `\\w` | Letter, digit, underscore |
| 11 | Class | Whitespace | `\\s` | Space, tab, newline |
| 12 | Class | POSIX alpha | `[[:alpha:]]` | Any letter (locale-aware) |
| 13 | Quantifier | Zero or one | `?` | 0 or 1 repetition |
| 14 | Quantifier | One or more | `+` | 1 or more (greedy) |
| 15 | Quantifier | Zero or more | `*` | 0 or more (greedy) |
| 16 | Quantifier | Exact count | `{3}` | Exactly 3 repetitions |
| 17 | Quantifier | N or more | `{2,}` | 2 or more repetitions |
| 18 | Quantifier | Range | `{2,4}` | Between 2 and 4 |
| 19 | Quantifier | Lazy | `+?` | 1 or more (shortest) |
| 20 | Anchor | Start | `^` | Beginning of string |
| 21 | Anchor | End | `$` | End of string |
| 22 | Anchor | Word boundary | `\\b` | Edge of a word |
| 23 | Anchor | Non-boundary | `\\B` | NOT at a word edge |
| 24 | Group | Capturing | `(\\d{4})` | Captures for extraction |
| 25 | Group | Non-capturing | `(?:ab)+` | Groups without capturing |
| 26 | Group | Backreference | `(\\w+) \\1` | Matches repeated word |
| 27 | Group | Alternation | `cat\|dog` | Matches either option |
| 28 | Lookaround | Positive lookahead | `(?=...)` | Asserts what follows |
| 29 | Lookaround | Negative lookahead | `(?!...)` | Asserts what does NOT follow |
| 30 | Lookaround | Positive lookbehind | `(?<=...)` | Asserts what precedes |
Bookmark this table. The fastest way to use it is to open the page, Ctrl+F for the category you need, and copy the runnable example from the section above into your own script.