R String Manipulation Exercises: 10 stringr Practice Problems Solved
Practice R string manipulation with 10 exercises: combining text, searching patterns, extracting substrings, replacing text, and cleaning messy data. Uses both base R and stringr functions.
String manipulation is essential for data cleaning — fixing column names, parsing text fields, extracting patterns from messy data. These exercises progress from basic paste/grep to regex-powered text processing.
Easy (1-4): Basic String Operations
Exercise 1: Build Formatted Strings
Given vectors of first names, last names, and ages, create formatted strings like "Alice Smith (age 25)".
# Exercise 1: Format strings
first <- c("Alice", "Bob", "Carol")
last <- c("Smith", "Jones", "Williams")
ages <- c(25, 32, 28)
# Create: "Alice Smith (age 25)", etc.
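A sketch of one approach, using vectorized sprintf() (paste0() works just as well):

```r
first <- c("Alice", "Bob", "Carol")
last <- c("Smith", "Jones", "Williams")
ages <- c(25, 32, 28)

# sprintf() recycles over the vectors: one formatted string per person
labels <- sprintf("%s %s (age %d)", first, last, ages)
print(labels)
# "Alice Smith (age 25)"  "Bob Jones (age 32)"  "Carol Williams (age 28)"

# Equivalent with paste0():
# paste0(first, " ", last, " (age ", ages, ")")
```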
Key concept: gsub("[^0-9]", "", x) removes everything except digits; sprintf() then reformats with a consistent structure. This pattern works for any standardization task.
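As a quick illustration of that strip-then-reformat pattern (the phone numbers here are made up):

```r
phones <- c("(555) 123-4567", "555.987.6543", "5551112222")

# Step 1: strip everything that is not a digit
digits <- gsub("[^0-9]", "", phones)

# Step 2: rebuild each number in one consistent format
standardized <- sprintf("(%s) %s-%s",
                        substr(digits, 1, 3),
                        substr(digits, 4, 6),
                        substr(digits, 7, 10))
print(standardized)
# "(555) 123-4567"  "(555) 987-6543"  "(555) 111-2222"
```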
Exercise 6: Extract Data from Text
Parse structured text to extract names, values, and units.
# Exercise 6: Parse measurement strings
measurements <- c("Temperature: 72.5 F", "Humidity: 45 %",
                  "Pressure: 1013.25 hPa", "Wind Speed: 12.3 mph")
# Extract: parameter name, numeric value, and unit from each string
Solution
measurements <- c("Temperature: 72.5 F", "Humidity: 45 %",
                  "Pressure: 1013.25 hPa", "Wind Speed: 12.3 mph")
# Split on ": " to get name and value+unit
parts <- strsplit(measurements, ": ")
names_vec <- sapply(parts, `[`, 1)
# Extract numeric value and unit from the second part
value_unit <- sapply(parts, `[`, 2)
values <- as.numeric(gsub("[^0-9.]", "", value_unit))
units <- trimws(gsub("[0-9.]", "", value_unit))
# Create a data frame
result <- data.frame(
  parameter = names_vec,
  value = values,
  unit = units,
  stringsAsFactors = FALSE
)
print(result)
Key concept: Combine strsplit() for structured parts and gsub() with regex for extracting numbers vs text. gsub("[^0-9.]", "", x) keeps only digits and dots.
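If you prefer stringr (assuming the package is installed), str_match() can capture all three parts with one regex instead of splitting and stripping separately:

```r
library(stringr)

measurements <- c("Temperature: 72.5 F", "Humidity: 45 %",
                  "Pressure: 1013.25 hPa", "Wind Speed: 12.3 mph")

# Capture groups: (name): (numeric value) (unit)
m <- str_match(measurements, "^(.+): ([0-9.]+) (.+)$")

# Column 1 is the full match; columns 2-4 are the capture groups
result <- data.frame(parameter = m[, 2],
                     value = as.numeric(m[, 3]),
                     unit = m[, 4])
print(result)
```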
Exercise 7: Text Search and Replace
Clean a paragraph of text: trim it and collapse multiple spaces, standardize quotes, strip punctuation, count total words, and find the most common words.
# Exercise 7: Text cleaning
text <- ' The "quick" brown fox jumped over the lazy dog. The dog was not amused. The fox ran away. '
# 1. Trim and collapse multiple spaces to single
# 2. Replace curly quotes with straight quotes
# 3. Count total words
# 4. Find the 3 most common words
Solution
text <- ' The "quick" brown fox jumped over the lazy dog. The dog was not amused. The fox ran away. '
# 1. Trim and collapse spaces
clean <- trimws(text)
clean <- gsub("\\s+", " ", clean)
cat("Cleaned:", clean, "\n\n")
# 2. Replace special quotes (if any)
clean <- gsub("[\u201c\u201d]", '"', clean)
# 3. Count words
words <- tolower(unlist(strsplit(clean, "\\s+")))
words <- gsub("[^a-z]", "", words) # Remove punctuation
words <- words[nchar(words) > 0] # Remove empty strings
cat("Word count:", length(words), "\n\n")
# 4. Most common words
freq <- sort(table(words), decreasing = TRUE)
cat("Top 3 words:\n")
print(head(freq, 3))
Key concept: gsub("\\s+", " ", x) collapses all whitespace runs to single spaces. strsplit(x, "\\s+") splits on any whitespace. table() counts frequencies.
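stringr (assuming it is installed) offers one-call shortcuts for the trim-and-collapse and word-count steps: str_squish() trims the ends and collapses internal whitespace, and str_count() counts pattern matches directly.

```r
library(stringr)

text <- ' The "quick" brown fox   jumped over the lazy dog. '

clean <- str_squish(text)            # trim ends + collapse runs of whitespace
n_words <- str_count(clean, "\\S+")  # count maximal runs of non-whitespace

print(clean)
print(n_words)  # 9
```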
Hard (8-10): Real-World Text Processing
Exercise 8: CSV Line Parser
Write a function that parses a CSV line, handling quoted fields that may contain commas.
# Exercise 8: Parse CSV with quoted fields
# "John Smith","New York, NY","45","Engineer"
# The comma in "New York, NY" should NOT split the field
lines <- c(
  '"Alice","San Francisco, CA","30","Designer"',
  '"Bob","Austin, TX","25","Developer"',
  '"Carol","Portland, OR","35","Manager"'
)
# Parse each line into a vector of 4 fields
Solution
lines <- c(
  '"Alice","San Francisco, CA","30","Designer"',
  '"Bob","Austin, TX","25","Developer"',
  '"Carol","Portland, OR","35","Manager"'
)
parse_csv_line <- function(line) {
  # Use a regex that matches quoted fields
  matches <- regmatches(line, gregexpr('"[^"]*"', line))[[1]]
  # Remove surrounding quotes
  gsub('^"|"$', "", matches)
}
# Parse all lines
for (line in lines) {
  fields <- parse_csv_line(line)
  cat(sprintf("Name: %-8s City: %-20s Age: %s Job: %s\n",
              fields[1], fields[2], fields[3], fields[4]))
}
# Or build a data frame
records <- lapply(lines, parse_csv_line)
df <- do.call(rbind, lapply(records, function(r) {
  data.frame(name = r[1], city = r[2], age = as.integer(r[3]), job = r[4],
             stringsAsFactors = FALSE)
}))
cat("\nData frame:\n")
print(df)
Key concept: gregexpr('"[^"]*"', line) finds all quoted strings. [^"]* means "any characters except quotes." This handles commas inside quotes correctly.
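Note that this regex works because every field in the sample data is quoted; it would miss unquoted fields or escaped quotes. For real-world data, base R's CSV reader already implements the full quoting rules:

```r
lines <- c('"Alice","San Francisco, CA","30","Designer"',
           '"Bob","Austin, TX","25","Developer"',
           '"Carol","Portland, OR","35","Manager"')

# read.csv(text = ...) parses in-memory strings with standard CSV quoting
df <- read.csv(text = paste(lines, collapse = "\n"), header = FALSE,
               col.names = c("name", "city", "age", "job"))
print(df)
```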
Exercise 9: Log File Analysis
Parse web server log entries to extract IP addresses, timestamps, and status codes.
Key concept: regexpr() finds the first match, regmatches() extracts it. Each regex targets a specific part of the log format. This is how real log analysis works.
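A sketch of that approach, assuming Apache-style log lines (the example entries below are made up):

```r
logs <- c(
  '192.168.1.10 - - [15/Mar/2024:10:32:01 +0000] "GET /index.html HTTP/1.1" 200 5123',
  '10.0.0.5 - - [15/Mar/2024:10:35:42 +0000] "POST /login HTTP/1.1" 401 234'
)

# Helper: extract the first match of a pattern from each string
extract_first <- function(x, pattern) {
  regmatches(x, regexpr(pattern, x))
}

# IP address at the start of the line
ips <- extract_first(logs, "^[0-9]{1,3}(\\.[0-9]{1,3}){3}")

# Timestamp inside [...]; strip the brackets afterwards
timestamps <- gsub("[][]", "", extract_first(logs, "\\[[^]]+\\]"))

# Status code: first standalone 3-digit number (a simplification of the format)
statuses <- as.integer(extract_first(logs, " [0-9]{3} "))
```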
Exercise 10: Email Validator and Parser
Write functions to validate email addresses and extract the username and domain parts.
# Exercise 10: Email validation and parsing
emails <- c("alice@example.com", "bob.smith@company.co.uk", "invalid@",
            "@nodomain.com", "carol+tag@gmail.com", "not an email",
            "david@sub.domain.org", "eve@.com")
# 1. Validate each email (TRUE/FALSE)
# 2. For valid emails: extract username and domain
# 3. Count emails per domain
Solution
emails <- c("alice@example.com", "bob.smith@company.co.uk", "invalid@",
            "@nodomain.com", "carol+tag@gmail.com", "not an email",
            "david@sub.domain.org", "eve@.com")
# 1. Validate with regex
is_valid_email <- function(x) {
  grepl("^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$", x)
}
valid <- is_valid_email(emails)
cat("Validation:\n")
for (i in seq_along(emails)) {
  cat(sprintf("  %-30s %s\n", emails[i], if (valid[i]) "VALID" else "INVALID"))
}
# 2. Parse valid emails
valid_emails <- emails[valid]
parts <- strsplit(valid_emails, "@")
usernames <- sapply(parts, `[`, 1)
domains <- sapply(parts, `[`, 2)
cat("\nParsed valid emails:\n")
parsed <- data.frame(email = valid_emails, user = usernames, domain = domains,
                     stringsAsFactors = FALSE)
print(parsed)
# 3. Count per domain
cat("\nEmails per domain:\n")
print(table(domains))
Key concept: The email regex ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$ checks for: valid characters before @, valid domain, and a TLD of 2+ letters. strsplit(x, "@") cleanly separates username and domain.
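The same validate-then-split workflow written with stringr functions (assuming the package is installed):

```r
library(stringr)

emails <- c("alice@example.com", "bob.smith@company.co.uk", "invalid@",
            "carol+tag@gmail.com", "not an email")

pattern <- "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
valid <- str_detect(emails, pattern)

# str_split_fixed() returns a character matrix: one column per piece
parts <- str_split_fixed(emails[valid], "@", 2)
parsed <- data.frame(user = parts[, 1], domain = parts[, 2])
print(parsed)
```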
Summary: Skills Practiced
Exercises 1-4 (Easy): paste/sprintf, case conversion, grep, substr/strsplit
Exercises 5-7 (Medium): gsub with regex, text extraction, word frequency
Exercises 8-10 (Hard): CSV parsing, log analysis, email validation with regex
What's Next?
More exercise sets:
R Date/Time Exercises — lubridate practice problems
R apply Family Exercises — master apply, lapply, sapply, tapply
Or continue learning: Data Wrangling with dplyr tutorial.