tidyr separate() & unite() in R: Split & Combine Character Columns
separate_wider_delim(), separate_wider_position(), and separate_wider_regex() split one character column into several new columns. unite() does the reverse, glueing several columns into one string. These are the modern tidyr tools for reshaping columns by splitting and combining text.
Introduction
Real data often packs multiple pieces of information into a single column. A "date" column might contain "2026-04-06" when you need separate year, month, and day fields. An order ID like "US-2026-1042" hides region, year, and sequence number. Before you can filter, group, or visualize those pieces, you need to break them apart.
The tidyr package provides three modern functions for this job. Each one splits a character column into new columns using a different strategy: a delimiter character, fixed character widths, or a regex pattern. For the reverse direction, unite() pastes columns back together with a separator of your choice.
separate_wider_*() functions replace it with clearer names and better error handling. All code in this tutorial uses the modern functions. If you see separate() in older tutorials, the concepts are the same but the syntax differs.Every code block on this page runs in your browser. Click Run on the first block to load the library, then work through the rest top to bottom. Variables carry over between blocks.
How do you split a column on a delimiter?
The most common case is splitting on a known character like a dash, underscore, or comma. This is the job for separate_wider_delim(). You tell it which column to split, the delimiter to split on, and the names for the new columns.
Let's start with a small order dataset where region, year, and ID are packed into one column.
Each order_code holds three pieces separated by dashes. Let's split them into region, year, and id.
The original order_code column is gone (replaced by the three new ones). The new columns are all character type. If you need year as an integer, pipe into mutate(year = as.integer(year)) afterward.
names = c("region", NA, NA). The unnamed parts are silently discarded.Here is a second example: splitting a date string into year, month, and day.
The function reads left to right, assigning each piece between delimiters to the next name. Three dashes in the date produce two splits and three pieces, matching the three names you provided.
How do you split a fixed-width column?
Some data encodes meaning in character positions, not delimiters. A product code like "m1234" might pack a one-character category code followed by a four-digit serial. There is no separator to split on, but separate_wider_position() handles this by character widths.
You pass a named integer vector where the names become column names and the values are the character widths.
The widths must add up to the total character length of every value. If some rows are shorter, the function errors by default. You can relax this with the too_few argument (covered in the next section).
How do you split columns using a regex pattern?
When neither a single delimiter nor fixed widths describe your data, separate_wider_regex() is the escape hatch. You provide a named character vector of regex patterns. Named patterns become columns; unnamed patterns match text to skip.
Consider tags like "v2.1-beta" where you want the major version, minor version, and stage. The hyphen separates the version from the stage, but the dot separates major from minor.
Each named element in the patterns vector creates a column. Unnamed elements match text that gets discarded. The patterns are applied left to right, consuming the string as they go.
separate_wider_delim() when you have a single clean separator. Move to separate_wider_position() for fixed-width formats. Only reach for separate_wider_regex() when the other two cannot handle the pattern.How do you handle rows that do not split cleanly?
Real data is messy. Some rows may have fewer separators than expected, and others may have more. By default, all three separate_wider_*() functions throw an error when this happens. The too_few and too_many arguments control the behavior.
Let's see what happens when some orders have only two parts instead of three.
Row 2 had only two pieces, so too_few = "align_start" fills from the left and puts NA in the remaining column. Row 3 had four pieces, so too_many = "merge" keeps the extra part attached to the last column instead of erroring.
too_few and too_many, the function stops at the first row that does not match. Always inspect your data first with count() or str_count() to know how many pieces each row has.Here is a quick reference for the options:
| Argument | Option | What it does |
|---|---|---|
too_few |
"error" (default) |
Stops with an error message |
too_few |
"align_start" |
Fills columns left to right, puts NA in the rest |
too_few |
"align_end" |
Fills columns right to left, puts NA at the start |
too_many |
"error" (default) |
Stops with an error message |
too_many |
"drop" |
Silently drops extra pieces |
too_many |
"merge" |
Merges extra pieces into the last column |
Both arguments also accept "debug", which keeps the original column plus adds problem indicator columns so you can investigate.
How do you combine columns with unite()?
unite() is the complement of the split functions. It pastes two or more columns together with a separator and optionally removes the original columns.
Let's combine year, month, and day columns back into a date string.
The col argument names the new combined column. The sep argument controls the glue between values (default is "_"). By default, unite() removes the source columns. Set remove = FALSE to keep them.
When your data has missing values, unite() pastes the literal string "NA" by default. Use na.rm = TRUE to skip missing values.
Without na.rm = TRUE, the first row would read "123 Main St, NA, Portland". The na.rm argument drops the missing piece and adjusts the separators automatically.
Common Mistakes and How to Fix Them
Mistake 1: Using the deprecated separate() instead of separate_wider_delim()
❌ Wrong:
Why it is wrong: separate() is superseded in tidyr 1.3+. It still works but has weaker error handling and no too_few/too_many controls. New code should use the modern functions.
✅ Correct:
Mistake 2: Forgetting too_few when data has inconsistent splits
❌ Wrong:
Why it is wrong: The default too_few = "error" halts on the first row with fewer pieces than expected. If even one row is short, the entire call fails.
✅ Correct:
Mistake 3: Getting "NA" strings from unite() with missing values
❌ Wrong:
Why it is wrong: unite() converts NA to the string "NA" and pastes it. The result looks correct but contains a literal "NA" that passes is.na() as FALSE.
✅ Correct:
Practice Exercises
Exercise 1: Split email addresses
Split the email column into username and domain using the @ delimiter.
Click to reveal solution
Explanation: The @ symbol cleanly separates every email into exactly two pieces. Each piece gets the name you assigned.
Exercise 2: Split fixed-width census codes
Each census_code is exactly 7 characters: 2-character state, 3-digit county, 2-digit tract. Split it into three columns.
Click to reveal solution
Explanation: The widths 2 + 3 + 2 = 7 match the total length of each code. Each width grabs that many characters from left to right.
Exercise 3: Unite columns and handle NAs
Combine first_name, middle_name, and last_name into a single full_name column, skipping any missing values.
Click to reveal solution
Explanation: With na.rm = TRUE, unite() skips the missing middle names and produces clean full names without extra spaces or literal "NA" strings.
Putting It All Together
Let's combine splitting and combining in one realistic pipeline. You receive a dataset of employee records where the ID encodes department and hire year, and the name is a single column. You need to split the ID, split the name, and then unite department with last name for a label.
This pipeline chains two splits and one unite. The first split breaks the ID into department, year, and sequence. The second split breaks the name into first and last. Then unite() creates a department-last name label. Setting remove = FALSE keeps the source columns so you still have them for filtering or grouping.
Summary
Here is a quick comparison of the four functions covered in this tutorial:
| Function | Strategy | Use when... | Key argument |
|---|---|---|---|
separate_wider_delim() |
Split on a delimiter character | Data has a consistent separator (dash, comma, space) | delim |
separate_wider_position() |
Split by character widths | Data uses fixed-width encoding (2-char state + 3-digit code) | widths |
separate_wider_regex() |
Split using regex patterns | No single delimiter or fixed width works | patterns |
unite() |
Combine columns into one string | You need to paste columns back together | sep, na.rm |
Key takeaways:
- Start with
separate_wider_delim(). It handles 80% of split tasks. - Use
too_fewandtoo_manyto handle inconsistent data gracefully. - Set
na.rm = TRUEinunite()to avoid literal "NA" strings. - All new columns from splitting are character type. Convert with
as.integer()oras.double()as needed.
FAQ
When should I use separate_wider_regex() instead of separate_wider_delim()?
Use separate_wider_regex() when there is no single consistent delimiter. For example, a string like "v2.1-beta" has both dots and dashes as separators, plus a leading "v" to skip. A regex pattern handles all of these in one call.
Does unite() change column types?
Yes. unite() always produces a character column, even if the input columns are numeric or logical. If you need a typed result, parse the united column afterward with as.Date(), as.numeric(), or similar.
Can I split into rows instead of columns?
Yes, but you need different functions. separate_longer_delim() and separate_longer_position() split a column into multiple rows instead of multiple columns. Use the _wider_ variants when each piece should become its own column and the _longer_ variants when each piece should become its own row.
References
- Wickham, H. et al. — tidyr: Tidy Messy Data. CRAN package documentation. Link
- tidyr reference —
separate_wider_delim(),separate_wider_position(),separate_wider_regex(). Link - tidyr reference —
unite(). Link - Wickham, H. & Grolemund, G. — R for Data Science, 2nd Edition. O'Reilly (2023). Chapter 14: Strings. Link
- tidyr changelog — tidyr 1.3.0 release notes (separate_wider_* introduction). Link
What's Next?
- pivot_longer() and pivot_wider(): Reshape Data in R — The parent tutorial covering row-level reshaping, the complement to column-level splitting and combining.
- Missing Values in R: Detect, Count, Remove, and Impute NA — Splitting columns often reveals hidden NAs. Learn the full toolkit for handling them.