gsub() in R: Replace All Pattern Matches
The gsub() function in base R replaces ALL regex matches in a character vector with a replacement. sub() replaces only the FIRST match per string. Both are vectorized.
gsub("apple", "orange", x) # replace all
sub("apple", "orange", x) # replace first only
gsub("\\d+", "#", x) # each run of digits
gsub("\\s+", " ", x) # collapse whitespace
gsub("(\\d+)-(\\d+)", "\\2-\\1", x) # backreference swap
gsub("apple", "orange", x, fixed = TRUE) # literal
gsub("apple", "orange", x, ignore.case = TRUE) # case-insensitive
Need explanation? Read on for examples and pitfalls.
What gsub() does in one sentence
gsub(pattern, replacement, x) finds every regex match of pattern in each string of x and replaces it with replacement. sub() does the same but stops after the first match per string.
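As a quick illustration of the difference, here is a minimal sketch. The vector `x` is made up for this example.

```r
# Hypothetical input: one string with two matches, one with none.
x <- c("apple pie, apple tart", "no fruit here")

# gsub() replaces every match in each element.
all_replaced <- gsub("apple", "orange", x)

# sub() replaces only the first match in each element.
first_replaced <- sub("apple", "orange", x)

print(all_replaced)   # "orange pie, orange tart" "no fruit here"
print(first_replaced) # "orange pie, apple tart"  "no fruit here"
```

Elements without a match are returned unchanged, so the output always has the same length as the input.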
These are the standard base R functions for text cleaning: removing punctuation, normalizing whitespace, swapping codes, masking sensitive values.
Syntax
gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
Use fixed = TRUE for literal replacement. Without it, characters like . * + ( are regex metacharacters. gsub(".", "x", "abc") returns "xxx" because . matches every character. gsub(".", "x", "abc", fixed = TRUE) returns "abc" because the input contains no literal period.
Five common patterns
1. Remove punctuation
[[:punct:]] is a POSIX class for punctuation.
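A minimal sketch of this pattern, using a made-up input vector:

```r
# Hypothetical input with commas, periods, and other punctuation.
x <- c("Hello, world!", "R is fun... right?")

# Replace every punctuation character with the empty string.
no_punct <- gsub("[[:punct:]]", "", x)

print(no_punct) # "Hello world" "R is fun right"
```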
2. Collapse whitespace
\\s+ matches one or more whitespace characters; trimws() removes leading and trailing whitespace.
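The two steps might be combined as follows. This is a sketch on an invented input.

```r
# Hypothetical input with repeated spaces, a tab, and a newline.
x <- c("  too   many    spaces  ", "tabs\tand\nnewlines")

# Collapse every run of whitespace into a single space.
collapsed <- gsub("\\s+", " ", x)

# Then drop whatever single space is left at either end.
cleaned <- trimws(collapsed)

print(collapsed) # " too many spaces " "tabs and newlines"
print(cleaned)   # "too many spaces" "tabs and newlines"
```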
3. Replace digits with placeholder
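A small sketch of digit masking on made-up strings. Note the difference between `\\d+` (one placeholder per run of digits) and `\\d` (one placeholder per digit).

```r
# Hypothetical input containing digit runs.
x <- c("Order 12345 shipped", "Call 555-0199")

# One "#" per run of digits.
masked_runs <- gsub("\\d+", "#", x)

# One "#" per individual digit, which preserves the original length.
masked_each <- gsub("\\d", "#", x)

print(masked_runs) # "Order # shipped" "Call #-#"
print(masked_each) # "Order ##### shipped" "Call ###-####"
```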
4. Backreferences (capture and swap)
\\1, \\2, \\3 reference the 1st, 2nd, 3rd capture groups in the pattern.
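Two illustrative uses of capture groups, both on invented inputs: swapping the two sides of a range, and reordering the parts of an ISO-style date.

```r
# Hypothetical inputs.
ranges <- c("10-20", "range 3-7 and 8-9")
dates  <- c("2024-01-15", "1999-12-31")

# Swap the numbers on either side of the hyphen.
swapped <- gsub("(\\d+)-(\\d+)", "\\2-\\1", ranges)

# Reorder year-month-day into day/month/year.
reordered <- gsub("(\\d{4})-(\\d{2})-(\\d{2})", "\\3/\\2/\\1", dates)

print(swapped)   # "20-10" "range 7-3 and 9-8"
print(reordered) # "15/01/2024" "31/12/1999"
```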
5. Remove instead of replace
Replacing with "" is the standard "delete pattern" idiom.
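A short sketch of the delete idiom on made-up values, once with gsub() to strip every unwanted character and once with sub() to strip a single leading prefix:

```r
# Hypothetical inputs.
x <- c("Price: $1,299.00", "id_00042")

# Delete everything that is not a digit.
digits_only <- gsub("[^0-9]", "", x)

# Delete a leading "id_" prefix only (anchored, so at most one match).
no_prefix <- sub("^id_", "", x)

print(digits_only) # "129900" "00042"
print(no_prefix)   # "Price: $1,299.00" "00042"
```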
gsub is REGEX by default; if you want literal text, use fixed = TRUE. Forgetting this is a common source of bugs. gsub(".", "x", "abc") replaces EVERY character because . is a regex metacharacter. Always ask: is my pattern regex or literal?
gsub() vs sub() vs str_replace_all() vs chartr()
Five "find and replace" functions in R, with different scope.
| Function | Replaces | Regex | Best for |
|---|---|---|---|
| gsub() | All matches | Yes (default) | Standard regex replace |
| sub() | First match per string | Yes (default) | "Replace first occurrence" |
| stringr::str_replace_all() | All matches | Yes | Tidyverse pipelines |
| stringr::str_replace() | First match | Yes | Tidyverse, single replace |
| chartr() | Char-by-char map | No | Translate single chars |
When to use which:
- gsub for default base R replacement.
- sub when you only want the first hit (useful for "trim leading X" patterns).
- str_replace_all for tidyverse code.
- chartr for fast 1-to-1 character mapping (no regex needed).
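To make the base R choices concrete, here is a sketch on invented inputs. It sticks to base functions, so the stringr variants are not shown.

```r
# chartr(): map each character in `old` to the character at the same
# position in `new` (here l -> 0 and o -> 1). No regex involved.
translated <- chartr("lo", "01", "hello world")

# sub(): anchored "trim leading X" pattern, here leading zeros.
trimmed <- sub("^0+", "", c("000120", "42"))

# gsub(): replace every match, here every vowel.
no_vowels <- gsub("[aeiou]", "", "hello world")

print(translated) # "he001 w1r0d"
print(trimmed)    # "120" "42"
print(no_vowels)  # "hll wrld"
```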
A practical text-cleaning workflow
Most text cleaning is a chain of gsub calls. A typical pipeline:
- Lowercase: tolower(x)
- Strip punctuation: gsub("[[:punct:]]", "", x)
- Collapse whitespace: gsub("\\s+", " ", x)
- Trim: trimws(x)
- Standardize specific tokens: gsub("usa|united states", "US", x, ignore.case = TRUE)
Build the chain incrementally and inspect intermediate results. The biggest source of bugs is a regex that matches more (or less) than you intended.
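The pipeline could be wrapped in a helper along these lines. The function name `clean_text` and the sample input are hypothetical. One deliberate departure from the list above: the token step adds word boundaries (`\\b`, with perl = TRUE), on the assumption that you do not want "usa" replaced inside longer words such as "usage".

```r
# Hypothetical helper chaining the five steps in order.
clean_text <- function(x) {
  x <- tolower(x)                   # 1. lowercase
  x <- gsub("[[:punct:]]", "", x)   # 2. strip punctuation
  x <- gsub("\\s+", " ", x)         # 3. collapse whitespace
  x <- trimws(x)                    # 4. trim both ends
  # 5. standardize tokens; \\b keeps "usage" from becoming "USge"
  gsub("\\b(usa|united states)\\b", "US", x,
       ignore.case = TRUE, perl = TRUE)
}

raw <- c("  The U.S.A.   is big! ", "United States, usage: HIGH")
cleaned <- clean_text(raw)
print(cleaned) # "the US is big" "US usage high"
```

Running the steps one at a time on a few sample strings, as the text suggests, is the easiest way to see which step produced an unexpected result.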
Common pitfalls
Pitfall 1: backslash escaping. Regex \d must be written "\\d" in an R string. gsub("\d+", ...) fails to parse because "\d" is not a recognized string escape. Always double the backslash: \\d.
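A sketch of the escaping rules on made-up inputs. The raw-string form `r"(...)"` is an alternative not mentioned above and assumes R 4.0.0 or later.

```r
x <- "room 101, floor 3"

# Doubled backslash: the R string "\\d+" holds the three characters \d+.
escaped <- gsub("\\d+", "#", x)

# Raw string (assumes R >= 4.0.0): no doubling needed.
raw_form <- gsub(r"(\d+)", "#", x)

# Matching a literal backslash takes four backslashes in a normal string:
# the string holds \\, which the regex engine reads as one literal \.
path <- gsub("\\\\", "/", "C:\\temp\\file")

print(escaped)  # "room #, floor #"
print(raw_form) # "room #, floor #"
print(path)     # "C:/temp/file"
```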
Pitfall 2: greedy vs lazy quantifiers. gsub("<.*>", "", "<a>text<b>") returns "" because .* is greedy and matches from the first < to the last >. For a lazy match, use <.*?> (works in perl = TRUE mode) or a negated class such as <[^>]*>.
Pitfall 3: backreference without a capture group. gsub("\\d+", "(\\1)", x) does NOT work because there are no parentheses around \\d+. Wrap the pattern in (...) to capture it: gsub("(\\d+)", "(\\1)", x).
Performance and Unicode notes
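The greedy and lazy behaviours can be compared side by side. This sketch reuses the string from the pitfall and adds a negated character class as a third option.

```r
html <- "<a>text<b>"

# Greedy: .* runs from the first "<" to the last ">", so everything goes.
greedy <- gsub("<.*>", "", html)

# Lazy: .*? stops at the first ">" it can, so each tag is matched separately.
lazy <- gsub("<.*?>", "", html, perl = TRUE)

# Negated class: "anything except >" cannot run past the end of a tag.
negated <- gsub("<[^>]*>", "", html)

print(greedy)  # ""
print(lazy)    # "text"
print(negated) # "text"
```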
For most everyday inputs, gsub is fast enough that performance is not a concern: it runs in compiled C and handles vectors of millions of short strings comfortably. Performance matters in two situations: very long strings (megabyte-class text) and patterns prone to catastrophic backtracking (alternation inside a quantifier, e.g., (a|a)*). For megabyte-scale text, stringi::stri_replace_all_regex() is often faster and is Unicode-aware. For pathological patterns, simplify the pattern, or try perl = TRUE, which switches to the PCRE engine and may behave differently.
Unicode handling in base gsub depends on the locale; use stringi or stringr for consistent UTF-8 behaviour across platforms.
Try it yourself
Try it: Clean these phone numbers by stripping all non-digit characters. Save to ex_phones.
Explanation: \\D matches any NON-digit character. Replacing with "" removes them. Result is digits-only.
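The exercise data did not survive in this copy, so the phone numbers below are invented stand-ins; the approach is the one the explanation describes.

```r
# Hypothetical phone numbers in mixed formats (not the original exercise data).
phones <- c("(555) 123-4567", "555.987.6543", "+1 555 000 1111")

# \\D matches any non-digit; replacing with "" leaves digits only.
ex_phones <- gsub("\\D", "", phones)

print(ex_phones) # "5551234567" "5559876543" "15550001111"
```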
Related replace functions
After mastering gsub, look at:
- sub(): first-match variant
- stringr::str_replace_all(): tidyverse equivalent
- stringr::str_replace(): first-match tidyverse
- chartr(): character-by-character translation
- regexec() + regmatches(): extract capture groups for inspection
- tools::toTitleCase(): capitalize words
For replacing many specific patterns at once, stringr::str_replace_all(x, named_vector) is cleaner than chaining gsubs.
FAQ
What is the difference between gsub and sub in R?
gsub replaces ALL matches in each string. sub replaces only the FIRST match per string. They share the same arguments.
How do I replace multiple patterns at once with gsub?
Chain calls: gsub("p2", "r2", gsub("p1", "r1", x)). Or use stringr::str_replace_all(x, c("p1" = "r1", "p2" = "r2")) for a cleaner named-vector approach.
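In base R, the nested form can also be written as a loop over a named vector. This is a sketch with made-up patterns. The loop is not part of the answer above, and it treats patterns as literal text via fixed = TRUE, which is an assumption.

```r
x <- c("cat and dog", "dog eats cat food")

# Nested calls: the inner replacement runs first.
nested <- gsub("dog", "wolf", gsub("cat", "lion", x))

# Same result with a named vector (names are patterns, values replacements).
replacements <- c(cat = "lion", dog = "wolf")
looped <- x
for (p in names(replacements)) {
  looped <- gsub(p, replacements[[p]], looped, fixed = TRUE)
}

print(nested) # "lion and wolf" "wolf eats lion food"
print(looped) # "lion and wolf" "wolf eats lion food"
```

Replacements are applied in sequence, so order matters whenever one replacement's output could match a later pattern.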
How do I do a literal replace with gsub?
Pass fixed = TRUE: gsub(".", "x", x, fixed = TRUE) replaces literal periods, not "any character".
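A sketch contrasting the three ways to treat a period, on invented inputs:

```r
x <- c("3.14", "a.b.c")

# Regex mode: "." matches any character, so everything is replaced.
as_regex <- gsub(".", "_", x)

# Literal mode: only actual periods are replaced.
as_literal <- gsub(".", "_", x, fixed = TRUE)

# Escaped regex: equivalent to the literal version for this pattern.
as_escaped <- gsub("\\.", "_", x)

print(as_regex)   # "____" "_____"
print(as_literal) # "3_14" "a_b_c"
print(as_escaped) # "3_14" "a_b_c"
```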
What is a backreference in gsub?
\\1, \\2, etc. in the replacement refer to capture groups (parenthesized parts of the pattern). They let you reorder or reuse matched parts in the output.
How do I make gsub case-insensitive?
Pass ignore.case = TRUE. Or use the inline regex flag with perl = TRUE: gsub("(?i)apple", "fruit", x, perl = TRUE).
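Both forms give the same result on this made-up input. Note that neither restricts matches to whole words, so "pineapple" is affected too.

```r
x <- c("Apple pie", "APPLE tart", "pineapple")

# Argument form.
by_arg <- gsub("apple", "fruit", x, ignore.case = TRUE)

# Inline flag form; (?i) requires perl = TRUE.
by_flag <- gsub("(?i)apple", "fruit", x, perl = TRUE)

print(by_arg)  # "fruit pie" "fruit tart" "pinefruit"
print(by_flag) # "fruit pie" "fruit tart" "pinefruit"
```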