stringr in R: 15 Functions That Handle Every String Task You'll Actually Encounter

stringr is the tidyverse string toolkit. Every function starts with str_, takes the string as its first argument, and returns output whose shape is predictable from the input, usually a vector the same length. That consistency, missing from base R, is why it has become the default for cleaning, extracting, and reshaping text in R.

Why use stringr instead of base R string functions?

Base R has grepl, gsub, regmatches, substr, sub, nchar, toupper, and more. They work, but they disagree with each other about argument order, return type, and what "no match" means. stringr fixes all of this with one rule: the string comes first, the pattern comes second, and the output shape is predictable. Let's see the payoff on a messy vector of product names.

Trim whitespace and detect patterns

```r
library(stringr)

names <- c(" Laptop Pro 16 ", "USB-C Hub", "monitor-27in", "WiFi Router", NA)

# Find items that contain digits, ignoring whitespace and NA
str_detect(str_trim(names), "\\d")
#> [1]  TRUE FALSE  TRUE FALSE    NA
```

One line. str_trim strips whitespace, str_detect returns TRUE/FALSE for each element, NAs propagate cleanly. In base R you would write grepl("\\d", trimws(names)) and silently lose the NA behavior because grepl returns FALSE for NA input. That asymmetry is exactly the kind of bug stringr prevents.
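You can see the difference directly with a minimal comparison:

```r
library(stringr)

x <- c("a1", NA)

# Base R: grepl() coerces the NA input to FALSE
grepl("\\d", trimws(x))
#> [1]  TRUE FALSE

# stringr: the NA stays NA
str_detect(str_trim(x), "\\d")
#> [1] TRUE   NA
```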

stringr function families

Figure 1: The seven families of stringr functions. Pick a family based on what you want to do: detect, extract, replace, split, measure, modify, or format.

stringr organizes its ~40 functions into seven families. You rarely need more than 15 of them in daily work, and this post covers every one of those 15.

Tip
stringr ships inside the tidyverse, so library(tidyverse) loads it automatically. If you only need strings, library(stringr) is lighter.

Try it: Use str_detect() on the vector below to return TRUE for elements that contain "fox".

Exercise: detect fox in animals

```r
library(stringr)

animals <- c("red fox", "brown bear", "arctic fox", "deer")

# Your str_detect() call
```
Solution

```r
library(stringr)

animals <- c("red fox", "brown bear", "arctic fox", "deer")

str_detect(animals, "fox")
#> [1]  TRUE FALSE  TRUE FALSE
```

str_detect() returns one logical value per input element, TRUE where the pattern "fox" appears anywhere in the string and FALSE otherwise. The two fox entries match, while "brown bear" and "deer" do not.

How do you test if a string contains something with str_detect()?

str_detect(string, pattern) returns a logical vector, TRUE where the pattern matches, FALSE where it does not. It is the workhorse of every filter step that touches text.

Filter gmail addresses

```r
library(stringr)
library(dplyr)

emails <- tibble(
  id = 1:5,
  address = c("asha@gmail.com", "bilal@work.co", "cleo+tag@gmail.com",
              "daan@hotmail.com", "edu@gmail.com")
)

# Keep only gmail addresses
emails |> filter(str_detect(address, "@gmail\\.com$"))
#> # A tibble: 3 x 2
#>      id address
#>   <int> <chr>
#> 1     1 asha@gmail.com
#> 2     3 cleo+tag@gmail.com
#> 3     5 edu@gmail.com
```

Two details worth memorizing. First, \\. matches a literal dot, because . in regex means "any character". Forgetting to escape the dot is the single most common stringr bug. Second, $ anchors the match to the end of the string, so @gmail.com.backup would not match.

Three cousins of str_detect come up often:

Starts, which, and count helpers

```r
# str_starts / str_ends: cheaper than regex anchors for fixed prefixes
str_starts(emails$address, "asha")
#> [1]  TRUE FALSE FALSE FALSE FALSE

# str_which: returns indices of matches, not a logical vector
str_which(emails$address, "@gmail")
#> [1] 1 3 5

# str_count: how many matches per string
str_count("banana", "a")
#> [1] 3
```

Use str_detect when you will feed the result to filter() or if_else(). Use str_which when you need integer positions (rare). Use str_count when "how many" is the actual question.
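For instance, str_detect() slots straight into if_else() for labeling. The product vector and labels here are my own illustration:

```r
library(stringr)
library(dplyr)

products <- c("USB cable", "HDMI cable", "mouse")

# Branch on the logical vector that str_detect() returns
if_else(str_detect(products, "cable"), "accessory", "peripheral")
#> [1] "accessory"  "accessory"  "peripheral"
```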

Note
stringr patterns are ICU regex by default. They support most of what you know from PCRE with minor differences. Wrap a pattern in fixed() to turn off regex interpretation and in coll() for locale-aware matching.

Try it: Filter the tibble to rows where the title column contains the word "report", case-insensitive.

Exercise: detect ignoring case

```r
library(stringr)
library(dplyr)

docs <- tibble(
  id = 1:4,
  title = c("Sales Report Q1", "invoice 2024", "REPORT-draft", "memo")
)

# Hint: wrap the pattern in regex(..., ignore_case = TRUE)
```
Solution

```r
library(stringr)
library(dplyr)
library(tibble)

docs <- tibble(
  id = 1:4,
  title = c("Sales Report Q1", "invoice 2024", "REPORT-draft", "memo")
)

docs |> filter(str_detect(title, regex("report", ignore_case = TRUE)))
#> # A tibble: 2 x 2
#>      id title
#>   <int> <chr>
#> 1     1 Sales Report Q1
#> 2     3 REPORT-draft
```

Wrapping the pattern in regex(..., ignore_case = TRUE) tells stringr to match "report" regardless of capitalization, so both "Sales Report Q1" and "REPORT-draft" pass the filter. The plain strings "invoice 2024" and "memo" are dropped because they don't contain the word at all.

How do str_extract() and str_match() pull out parts of a string?

When you need the matched text itself, not just TRUE/FALSE, use str_extract for a simple match and str_match when you need capture groups.

Extract ISO dates from logs

```r
library(stringr)

logs <- c(
  "2026-04-01 INFO user=42 msg=login",
  "2026-04-01 WARN user=17 msg=bad_pwd",
  "2026-04-02 INFO user=99 msg=logout"
)

# Pull out the first ISO date in each string
str_extract(logs, "\\d{4}-\\d{2}-\\d{2}")
#> [1] "2026-04-01" "2026-04-01" "2026-04-02"

# Pull out every run of digits
str_extract_all(logs, "\\d+")
#> [[1]]
#> [1] "2026" "04"   "01"   "42"
#>
#> [[2]]
#> [1] "2026" "04"   "01"   "17"
#>
#> [[3]]
#> [1] "2026" "04"   "02"   "99"
```

str_extract returns one match per string (the first). str_extract_all returns a list, one vector per string, possibly of different lengths. The list shape is annoying but honest: you cannot fit variable-length results into a plain vector.
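When you do need to work with that list shape, base R's lengths() and unlist() are the usual tools; a quick sketch:

```r
library(stringr)

logs <- c(
  "2026-04-01 INFO user=42 msg=login",
  "2026-04-02 INFO user=99 msg=logout"
)

hits <- str_extract_all(logs, "\\d+")

# How many numbers did each string contain?
lengths(hits)
#> [1] 4 4

# Flatten when per-string grouping no longer matters
unlist(hits)
```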

For structured extraction, use capture groups with str_match:

Match with three capture groups

```r
str_match(logs, "(\\d{4}-\\d{2}-\\d{2}) (\\w+) user=(\\d+)")
#>      [,1]                      [,2]         [,3]   [,4]
#> [1,] "2026-04-01 INFO user=42" "2026-04-01" "INFO" "42"
#> [2,] "2026-04-01 WARN user=17" "2026-04-01" "WARN" "17"
#> [3,] "2026-04-02 INFO user=99" "2026-04-02" "INFO" "99"
```

Column 1 is the full match; columns 2+ are the capture groups in order. You can wrap this in a tibble and rename the columns for a quick parser. For the pipe-friendly version, tidyr::extract() does the same thing straight into a data frame.
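That tibble wrapping might look like this; the column names are my own choice:

```r
library(stringr)
library(tibble)

logs <- c(
  "2026-04-01 INFO user=42 msg=login",
  "2026-04-01 WARN user=17 msg=bad_pwd",
  "2026-04-02 INFO user=99 msg=logout"
)

m <- str_match(logs, "(\\d{4}-\\d{2}-\\d{2}) (\\w+) user=(\\d+)")

# Drop the full-match column, keep the three groups as named columns
parsed <- tibble(
  date  = m[, 2],
  level = m[, 3],
  user  = as.integer(m[, 4])
)
parsed
```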

regex anatomy for stringr

Figure 2: Anatomy of a regex pattern. Every stringr function uses these same pieces.

Warning
str_match returns a character matrix, not a list or tibble. If a string has no match, its row is all NAs; check with anyNA() before assuming success.
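A two-line check along those lines, with a deliberately unmatched string:

```r
library(stringr)

m <- str_match(c("user=42", "no id here"), "user=(\\d+)")
anyNA(m)
#> [1] TRUE

m[2, ]  # the unmatched row is all NA
#> [1] NA NA
```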

Try it: Extract the phone number (10 digits, possibly with dashes) from each string below.

Exercise: extract phone numbers

```r
library(stringr)

contacts <- c("call 555-123-4567 anytime", "fax: 9876543210", "no phone")

# Use str_extract with a suitable regex
```
Solution

```r
library(stringr)

contacts <- c("call 555-123-4567 anytime", "fax: 9876543210", "no phone")

str_extract(contacts, "\\d{3}-?\\d{3}-?\\d{4}")
#> [1] "555-123-4567" "9876543210"   NA
```

The pattern \\d{3}-?\\d{3}-?\\d{4} asks for three digits, an optional dash, three more digits, an optional dash, then four digits, so it matches both the dashed and undashed forms. The third string has no run of 10 digits, so str_extract() returns NA for it.

How does str_replace() change text inside strings?

str_replace(string, pattern, replacement) swaps the first match; str_replace_all swaps every match. The replacement string can reference capture groups with \\1, \\2, etc.

Replace dollar signs and commas in prices

```r
library(stringr)

prices <- c("$1,299", "$450", "$12,000")

# Strip dollar signs and commas, then convert
clean <- prices |>
  str_replace_all("[$,]", "") |>
  as.numeric()
clean
#> [1]  1299   450 12000
```

The character class [$,] matches either a dollar sign or a comma. One str_replace_all call handles both. The pipe into as.numeric gives you a numeric column ready for arithmetic.

Capture groups make format-switching trivial:

Reorder date parts with backreferences

```r
dates_us <- c("04/01/2026", "12/25/2025", "07/04/2026")

# Convert MM/DD/YYYY to YYYY-MM-DD (ISO)
str_replace(dates_us, "(\\d{2})/(\\d{2})/(\\d{4})", "\\3-\\1-\\2")
#> [1] "2026-04-01" "2025-12-25" "2026-07-04"
```

\\1, \\2, \\3 correspond to the three parenthesized groups in the pattern. This is vastly simpler than a nested substr() + paste0() dance.

For non-regex replacement, when your pattern contains special characters you do not want interpreted, wrap the pattern in fixed():

Fixed literal replacement

```r
# Literal replacement of a dotted string
str_replace("version 1.2.3", fixed("1.2.3"), "2.0.0")
#> [1] "version 2.0.0"
```

Without fixed(), the dots would match any character, so the pattern could also match a string like "1X2Y3" and replace the wrong text. Use fixed() whenever the pattern is a known literal.
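A side-by-side sketch of that failure mode, using a made-up version string:

```r
library(stringr)

# With a plain pattern the dots are wildcards, so "1x2y3" also matches
str_replace("build 1x2y3", "1.2.3", "2.0.0")
#> [1] "build 2.0.0"

# fixed() restricts the match to the literal characters
str_replace("build 1x2y3", fixed("1.2.3"), "2.0.0")
#> [1] "build 1x2y3"
```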

Tip
stringr also ships str_remove() and str_remove_all(), sugar for str_replace(..., ""). Both are clearer at the call site when you just want to delete text.
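For instance, stripping a known prefix reads more clearly with str_remove(); the filenames here are my own example:

```r
library(stringr)

files <- c("draft_report.txt", "draft_notes.txt")

# Equivalent to str_replace(files, "^draft_", "")
str_remove(files, "^draft_")
#> [1] "report.txt" "notes.txt"
```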

Try it: Normalize these filenames to lowercase kebab-case (lowercase, dashes not spaces).

Exercise: slugify filenames

```r
library(stringr)

files <- c("Sales Report Q1.pdf", "Customer List 2024.csv")

# Hint: str_replace_all(..., " ", "-") then str_to_lower()
```
Solution

```r
library(stringr)

files <- c("Sales Report Q1.pdf", "Customer List 2024.csv")

files |>
  str_replace_all(" ", "-") |>
  str_to_lower()
#> [1] "sales-report-q1.pdf"    "customer-list-2024.csv"
```

str_replace_all(" ", "-") swaps every space for a dash, and str_to_lower() then lowercases the full string including the file extension. Chaining the two in the pipe keeps the transformation readable and avoids an intermediate variable.

How do you split and join strings with str_split() and str_c()?

str_split cuts one string into pieces on a pattern; str_c is the opposite, gluing several vectors into one. Both are used constantly in data cleaning.

Split addresses into a matrix

```r
library(stringr)

addresses <- c(
  "21 Main St, Pune, 411001",
  "9 Park Ave, Lima, 15001",
  "42 Oak Rd, Berlin, 10115"
)

# Split on comma + optional space; simplify to a matrix
parts <- str_split(addresses, ",\\s*", simplify = TRUE)
parts
#>      [,1]         [,2]     [,3]
#> [1,] "21 Main St" "Pune"   "411001"
#> [2,] "9 Park Ave" "Lima"   "15001"
#> [3,] "42 Oak Rd"  "Berlin" "10115"
```

simplify = TRUE promotes the list result to a character matrix when every input has the same number of parts. When that is not guaranteed, leave it as the default list and iterate over it with purrr::map_chr(). str_split_fixed(x, pattern, n) is another option: it always returns exactly n columns, padding short rows with empty strings.
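The map_chr() pattern for ragged results looks like this; the deliberately uneven addresses are my own example:

```r
library(stringr)
library(purrr)

# Ragged input: the elements split into different numbers of parts
addresses <- c("Pune, 411001", "9 Park Ave, Lima, 15001")

parts <- str_split(addresses, ",\\s*")

# Grab the last piece of each element, whatever its length
map_chr(parts, ~ .x[length(.x)])
#> [1] "411001" "15001"
```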

Going the other way:

Concatenate with sep and collapse

```r
first <- c("Asha", "Bilal", "Cleo")
last <- c("Rao", "Khan", "Patel")

str_c(first, last, sep = " ")
#> [1] "Asha Rao"   "Bilal Khan" "Cleo Patel"

# Collapse a vector into one string
str_c(first, collapse = ", ")
#> [1] "Asha, Bilal, Cleo"
```

sep concatenates element-wise; collapse concatenates the whole vector into a single string. The two arguments compose: you can use both in one call when combining vectors and then flattening.

Combine sep and collapse

```r
str_c(first, last, sep = " ", collapse = "; ")
#> [1] "Asha Rao; Bilal Khan; Cleo Patel"
```
Note
str_c propagates missing values: any element-wise combination that involves an NA becomes NA. Wrap inputs in coalesce(x, "") if you want NAs treated as empty strings; note that base paste() is not a fix, since it pastes the literal text "NA" instead.
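A short demonstration of the NA behavior and the coalesce() workaround, with an NA injected into a name vector of my own:

```r
library(stringr)
library(dplyr)  # for coalesce()

first_na <- c("Asha", NA, "Cleo")
last <- c("Rao", "Khan", "Patel")

str_c(first_na, last, sep = " ")
#> [1] "Asha Rao"   NA           "Cleo Patel"

# coalesce() swaps NA for "" before combining
str_c(coalesce(first_na, ""), last, sep = " ")
#> [1] "Asha Rao"   " Khan"      "Cleo Patel"
```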

Try it: Split each string on the colon, then build a named vector from the result.

Exercise: split and name metadata

```r
library(stringr)

meta <- c("name: Asha", "age: 30", "city: Pune")

# str_split on ": " then map to a named vector
```
Solution

```r
library(stringr)

meta <- c("name: Asha", "age: 30", "city: Pune")

parts <- str_split(meta, ": ", simplify = TRUE)
named <- setNames(parts[, 2], parts[, 1])
named
#>   name    age   city
#> "Asha"   "30" "Pune"
```

simplify = TRUE turns the split list into a 3x2 character matrix, keys in column 1, values in column 2. setNames() then attaches the key column as names on the value column, producing a named character vector ready for lookup.

How do you clean whitespace, case, and padding?

Most real string cleanup involves four things: trimming whitespace, changing case, padding to a fixed width, and fixing length. stringr has one-liners for each.

Trim, squish, and case helpers

```r
library(stringr)

messy <- c(" Asha ", "BILAL", "cleo", " Daan", "Edu ")

str_trim(messy)    # drop leading + trailing whitespace
#> [1] "Asha"  "BILAL" "cleo"  "Daan"  "Edu"

str_squish(messy)  # also collapse repeated internal whitespace
#> [1] "Asha"  "BILAL" "cleo"  "Daan"  "Edu"

str_to_lower(messy)
#> [1] " asha " "bilal"  "cleo"   " daan"  "edu "

str_to_title(str_squish(messy))
#> [1] "Asha"  "Bilal" "Cleo"  "Daan"  "Edu"
```

Pair str_squish with str_to_title as a one-stop cleanup for names. str_squish is stronger than str_trim because it also collapses any internal runs of whitespace to a single space, critical when copy-pasting from spreadsheets.
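The sample vector above happens to have no internal runs, so str_trim and str_squish print identically there; a string that does have them shows the difference:

```r
library(stringr)

spaced <- "  too    many   spaces  "

str_trim(spaced)    # only the ends are cleaned
#> [1] "too    many   spaces"

str_squish(spaced)  # internal runs collapse too
#> [1] "too many spaces"
```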

Padding is the opposite problem: making short strings match a target width, typically for alignment.

Pad identifiers with leading zeros

```r
ids <- c("1", "12", "123", "1234")

str_pad(ids, width = 5, side = "left", pad = "0")
#> [1] "00001" "00012" "00123" "01234"
```

Zero-padded IDs are a classic need: think invoice numbers, customer codes, and file names that must sort lexically. side = "right" and side = "both" are also valid.
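The same family includes str_sub() for position-based subsetting, which pairs naturally with padded, fixed-width codes. The invoice codes here are my own example:

```r
library(stringr)

codes <- c("INV-2026-0001", "INV-2026-0042")

# Positive indices count from the start, negative from the end
str_sub(codes, 1, 3)
#> [1] "INV" "INV"

str_sub(codes, -4, -1)
#> [1] "0001" "0042"
```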

str_length answers "how many characters in this string?". It counts by code points, not bytes, so it is safe for non-ASCII text.

Length of multibyte strings

```r
str_length(c("hi", "hello", "नमस्ते"))
#> [1] 2 5 6
```

Choosing the right stringr function

Figure 3: A cheat sheet for picking the right stringr function based on the question you are trying to answer.

Tip
str_to_title uses locale rules, so "o'brien" becomes "O'Brien" in English but may behave differently in a Turkish locale due to the dotted/dotless I. For strict ASCII behavior, build it yourself with str_c(str_to_upper(str_sub(x, 1, 1)), str_to_lower(str_sub(x, 2))).

Try it: Clean the vector below so every element is title-case, single-spaced, and trimmed.

Exercise: squish and title-case animals

```r
library(stringr)

raw <- c(" red fox ", "BROWN BEAR", "arctic fox")

# Hint: chain str_squish then str_to_title
```
Solution

```r
library(stringr)

raw <- c(" red fox ", "BROWN BEAR", "arctic fox")

raw |> str_squish() |> str_to_title()
#> [1] "Red Fox"    "Brown Bear" "Arctic Fox"
```

str_squish() trims leading and trailing whitespace and collapses any internal run of spaces down to a single space. str_to_title() then capitalizes the first letter of each word, giving you a clean, uniform vector.

How do you use regex with stringr effectively?

Every stringr function takes a pattern, and by default that pattern is regex. A short tour of the five regex features you will actually use:

Tour of five regex features

```r
library(stringr)

x <- c("apple123", "banana45", "cherry", "PEACH-7", "Grape_12")

# 1. Character classes
str_extract(x, "[A-Za-z]+")
#> [1] "apple"  "banana" "cherry" "PEACH"  "Grape"

# 2. Quantifiers
str_extract(x, "\\d{2,}")
#> [1] "123" "45"  NA    NA    "12"

# 3. Anchors
str_detect(x, "^[A-Z]")
#> [1] FALSE FALSE FALSE  TRUE  TRUE

# 4. Alternation
str_detect(x, "apple|banana")
#> [1]  TRUE  TRUE FALSE FALSE FALSE

# 5. Groups
str_match(x, "([A-Za-z]+)(\\d*)")[, -1]
#>      [,1]     [,2]
#> [1,] "apple"  "123"
#> [2,] "banana" "45"
#> [3,] "cherry" ""
#> [4,] "PEACH"  ""
#> [5,] "Grape"  "12"
```

Five regex tools, five concepts. Most data-cleaning regex you will ever write is just a combination of these with careful escaping. Resist the temptation to build a 200-character super-regex; split it into two or three simpler steps that are easier to debug.
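As an illustration of that advice, two small steps often beat one dense pattern. The key=value strings here are my own example:

```r
library(stringr)

lines <- c("user=42;role=admin", "user=17;role=guest")

# Step 1: isolate the key=value pair you care about
role_part <- str_extract(lines, "role=\\w+")

# Step 2: strip the key, keeping only the value
str_remove(role_part, "^role=")
#> [1] "admin" "guest"
```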

When regex feels like overkill, wrap the pattern in fixed() for literal matching or coll() for locale-aware matching. When the pattern needs options like case-insensitive or dotall, use regex(..., ignore_case = TRUE).

Case-insensitive regex wrapper

```r
# Case-insensitive search
str_detect(c("Error", "ERROR", "error", "warn"),
           regex("error", ignore_case = TRUE))
#> [1]  TRUE  TRUE  TRUE FALSE
```
Warning
stringr patterns are written with doubled backslashes in R strings: "\\d" in the code becomes \d in the compiled regex. Forgetting the second backslash is the second-most-common stringr bug (after unescaped dots).

Try it: Use a regex to extract the hashtags from the tweet below into a character vector.

Exercise: extract hashtags

```r
library(stringr)

tweet <- "Loving the new R release! #rstats #tidyverse #coding"

# Hint: str_extract_all with "#\\w+"
```
Solution

```r
library(stringr)

tweet <- "Loving the new R release! #rstats #tidyverse #coding"

str_extract_all(tweet, "#\\w+")[[1]]
#> [1] "#rstats"    "#tidyverse" "#coding"
```

str_extract_all() returns a list (one element per input string), so [[1]] unwraps the single tweet's matches into a plain character vector. The pattern #\\w+ matches a # followed by one or more word characters (letters, digits, underscore), which is exactly the shape of a hashtag.

Practice Exercises

Exercise 1: Parse a messy log

Given this log vector, extract a tibble with columns timestamp, level, user_id, and message.

Practice one: parse log columns

```r
library(stringr)
library(tibble)
library(dplyr)

log <- c(
  "2026-04-01 10:30:15 [INFO] user=42 login successful",
  "2026-04-01 10:31:02 [WARN] user=17 password retry",
  "2026-04-01 10:31:45 [ERROR] user=99 db connection lost"
)

# Your code: use str_match with capture groups
```
Solution
```r
m <- str_match(log, "(\\S+ \\S+) \\[(\\w+)\\] user=(\\d+) (.*)")

tibble(
  timestamp = m[, 2],
  level = m[, 3],
  user_id = as.integer(m[, 4]),
  message = m[, 5]
)
```

Exercise 2: Normalize phone numbers

Take the messy vector and normalize each number to the format +1-AAA-BBB-CCCC, using its last 10 digits. Return NA for numbers with fewer than 10 digits.

Practice two: normalize phone format

```r
phones <- c("(555) 123-4567", "555.987.6543", "5551234567", "123-45", "+1 555 111 2222")
```
Solution
```r
digits <- str_extract_all(phones, "\\d") |>
  sapply(function(x) str_c(x, collapse = ""))
digits <- str_sub(digits, -10)  # keep the last 10 digits

ok <- str_length(digits) == 10
formatted <- ifelse(
  ok,
  str_c("+1-", str_sub(digits, 1, 3), "-",
        str_sub(digits, 4, 6), "-", str_sub(digits, 7, 10)),
  NA
)
formatted
```

Exercise 3: Find and count hashtags

Given a vector of tweets, return a tibble with columns hashtag and count, sorted descending.

Practice three: count hashtags per tweet

```r
tweets <- c(
  "Learning #rstats today #tidyverse",
  "#rstats community is the best #rstats",
  "tried #python but prefer #rstats",
  "no tags here"
)
```
Solution
```r
library(stringr)
library(tibble)
library(dplyr)

tags <- str_extract_all(tweets, "#\\w+") |> unlist()

tibble(hashtag = tags) |>
  count(hashtag, sort = TRUE, name = "count")
```

Complete Example

Here is a full cleaning pipeline on a customer table with messy names, emails, and phone numbers.

End-to-end customer cleanup

```r
library(stringr)
library(dplyr)
library(tibble)

raw <- tibble(
  id = 1:5,
  name = c(" ASHA rao ", "bilal Khan ", "cleo PATEL", " daan de Vries", "EDU Silva"),
  email = c("asha@GMAIL.com", "bilal@work.co", "cleo@gmail.com", "daan@HOTMAIL.com", "edu@gmail.com"),
  phone = c("(555) 111-2222", "5551234567", "555.999.8888", "not listed", "+1 555 777 6666")
)

clean <- raw |>
  mutate(
    name = name |> str_squish() |> str_to_title(),
    email = str_to_lower(email),
    domain = str_extract(email, "(?<=@).+$"),
    phone_digits = str_replace_all(phone, "\\D", ""),
    phone_clean = if_else(
      str_length(phone_digits) >= 10,
      str_c(
        "+1-", str_sub(phone_digits, -10, -8), "-",
        str_sub(phone_digits, -7, -5), "-",
        str_sub(phone_digits, -4, -1)
      ),
      NA_character_
    ),
    is_gmail = str_detect(domain, "^gmail\\.")
  ) |>
  select(id, name, email, domain, phone_clean, is_gmail)

clean
#> # A tibble: 5 x 6
#>      id name          email            domain      phone_clean     is_gmail
#>   <int> <chr>         <chr>            <chr>       <chr>           <lgl>
#> 1     1 Asha Rao      asha@gmail.com   gmail.com   +1-555-111-2222 TRUE
#> 2     2 Bilal Khan    bilal@work.co    work.co     +1-555-123-4567 FALSE
#> 3     3 Cleo Patel    cleo@gmail.com   gmail.com   +1-555-999-8888 TRUE
#> 4     4 Daan De Vries daan@hotmail.com hotmail.com NA              FALSE
#> 5     5 Edu Silva     edu@gmail.com    gmail.com   +1-555-777-6666 TRUE
```

Nine distinct stringr functions, one clean pipeline, tidy output ready for the next step in your workflow. The lookbehind regex (?<=@).+$ extracts everything after the @ without including the @ itself, a clean way to get the domain.

Summary

| Task | Function | Returns |
| --- | --- | --- |
| Test match | str_detect() | logical |
| Count matches | str_count() | integer |
| Find positions | str_which() | integer |
| Extract first match | str_extract() | character |
| Extract all matches | str_extract_all() | list |
| Extract with groups | str_match() | matrix |
| Replace first | str_replace() | character |
| Replace all | str_replace_all() | character |
| Split | str_split() | list |
| Join | str_c() | character |
| Trim whitespace | str_trim() / str_squish() | character |
| Change case | str_to_lower/upper/title() | character |
| Pad | str_pad() | character |
| Length | str_length() | integer |
| Subset by position | str_sub() | character |

Four rules worth internalizing:

  1. Escape dots. \\. matches a literal dot; . matches any character.
  2. Pattern interpretation. Plain strings are regex; fixed() is literal; regex() adds options.
  3. Single vs all. Most functions have a _all variant, pick the one that matches your question.
  4. List vs vector. _all extractors return lists because match counts vary; handle that shape explicitly.
