readr parse_factor() in R: Parse Text Into a Factor
The readr parse_factor() function converts a character vector into a factor and, unlike base R's factor(), warns you when a value falls outside the set of levels you expected.
parse_factor(c("a", "b", "a")) # levels inferred, in order of first appearance
parse_factor(x, levels = c("lo", "mid", "hi")) # fixed level set
parse_factor(x, levels = lv, ordered = TRUE) # ordered factor
parse_factor(x, levels = valid) # flags unexpected values
parse_factor(x, na = "missing") # treat a string as NA
parse_factor(x, include_na = FALSE) # no NA level
Need explanation? Read on for examples and pitfalls.
What parse_factor() does
parse_factor() turns text into a validated factor. You pass a character vector and an optional set of levels, and it returns a factor. Its defining feature is validation: when a value is not in the levels you supplied, readr records a parsing problem instead of silently discarding it.
With no levels argument, parse_factor() infers the level set from the unique values in the order they first appear, whereas factor() sorts them. The real value appears once you supply the levels yourself.
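As a minimal sketch of that difference, assuming readr is installed, the snippet below parses the same vector with parse_factor() and with factor(). The `sizes` vector and its stray "XL" entry are made up for illustration.

```r
library(readr)

# Hypothetical data: "XL" is not in the allowed set.
sizes <- c("S", "M", "XL", "M")
allowed <- c("S", "M", "L")

# parse_factor() returns NA for "XL" and warns about a parsing failure.
parsed <- parse_factor(sizes, levels = allowed)
is.na(parsed)

# factor() also returns NA for "XL", but raises no warning.
plain <- factor(sizes, levels = allowed)
is.na(plain)
```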
factor() converts a value outside its levels to NA in silence. That single difference is the whole reason to reach for the readr parser.
parse_factor() syntax and arguments
Most calls use only x and levels; the rest tune edge cases. The full signature is parse_factor(x, levels = NULL, ordered = FALSE, na = c("", "NA"), locale = default_locale(), include_na = TRUE, trim_ws = TRUE). Each argument controls one part of how text becomes a factor.
| Argument | What it controls | Default |
|---|---|---|
| x | The character vector to convert | (required) |
| levels | The allowed factor levels; NULL infers them | NULL |
| ordered | Whether the factor is ordered | FALSE |
| na | Strings to treat as missing | c("", "NA") |
| include_na | Whether NA becomes an explicit level | TRUE |
| trim_ws | Trim surrounding whitespace before matching | TRUE |
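The missing-value arguments are easiest to see in a small case. This is a sketch only: the `readings` vector and the choice of "missing" as the sentinel string are assumptions for illustration.

```r
library(readr)

# Hypothetical readings where the string "missing" marks an absent value.
readings <- c("lo", "missing", "hi", "mid")

parsed <- parse_factor(
  readings,
  levels = c("lo", "mid", "hi"),
  na = "missing",      # treat this string as NA
  include_na = FALSE   # keep NA out of the level set
)

is.na(parsed)    # only the second element is missing
levels(parsed)   # the three supplied levels
```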
The levels argument does double duty. It fixes the order in which the levels are stored, and it defines the valid set. Supplying it is what turns parse_factor() from a plain converter into a checker.
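A short sketch of the ordering behaviour, using a made-up `priority` vector, compares supplied levels with inferred ones:

```r
library(readr)

priority <- c("hi", "lo", "mid", "lo")   # hypothetical data

# Supplied levels are stored exactly in the order written.
fixed <- parse_factor(priority, levels = c("lo", "mid", "hi"))
levels(fixed)

# With levels = NULL, readr uses the order of first appearance instead.
inferred <- parse_factor(priority)
levels(inferred)
```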
The levels appear in the order you wrote them, not alphabetically, which matters for any plot or model that respects factor order.
Examples by use case
parse_factor() shines wherever a column has a known, fixed vocabulary. Survey responses, product grades, and experiment groups all draw from a finite set of valid values, and parse_factor() makes that set explicit and enforced.
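For a ranked vocabulary such as product grades, an ordered factor is the natural fit. The sketch below assumes a hypothetical grade scale running from "F" (worst) to "A" (best).

```r
library(readr)

# Hypothetical product grades, ranked from worst to best.
grade_levels <- c("F", "D", "C", "B", "A")
grade <- parse_factor(c("B", "A", "C", "A"), levels = grade_levels, ordered = TRUE)

print(grade)   # the Levels line is printed with < between the levels
grade > "B"    # TRUE only for the "A" entries
```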
When an ordered factor is printed, the < signs in the Levels line show the factor is ordered, so comparisons such as grade > "B" now return meaningful results.
The validation step is the part base R cannot do. When a value does not match any level, parse_factor() names the offending row.
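A minimal sketch of validation in action, using an invented survey column with one typo in the third position:

```r
library(readr)

# Hypothetical survey column; row 3 contains a typo.
responses <- c("yes", "no", "mabye", "maybe")
checked <- parse_factor(responses, levels = c("yes", "no", "maybe"))

checked             # the third element is NA
problems(checked)   # one row describing the failure
```

The tibble returned by problems() holds the row number, the expectation that was violated, and the actual value found.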
A misspelled value such as "mabye" becomes NA, and readr reports exactly which row failed. Call problems() on the result for the full report.
The other common use is import. Inside any readr reader, col_factor() runs parse_factor() on a column as the file loads.
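The sketch below shows one way to do this with read_csv(). It assumes readr 2.0 or later, where I() marks a string as literal CSV data; the `csv_text` contents and column names are made up for illustration.

```r
library(readr)

# Hypothetical inline CSV; row 3 has a typo in the answer column.
csv_text <- "id,answer\n1,yes\n2,no\n3,mabye\n"

survey <- read_csv(
  I(csv_text),
  col_types = cols(
    id = col_integer(),
    answer = col_factor(levels = c("yes", "no", "maybe"))
  )
)

levels(survey$answer)   # the supplied levels, in the supplied order
problems(survey)        # lists the value that failed validation
```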
col_factor(levels = ...) inside col_types parses and checks the column while the file loads. You get the same level validation as parse_factor() with no separate cleanup pass over the data frame.
parse_factor() vs factor() and as.factor()
All three build factors; only parse_factor() reports bad values. The choice comes down to whether you want unexpected input to fail loudly or disappear quietly.
| Function | Validates against levels | Result for value outside levels | Use when |
|---|---|---|---|
| parse_factor() | Yes, with a problem report | NA plus a warning naming the row | You want stray values caught |
| factor() | No | NA, silently | Base R, you trust the input |
| as.factor() | No, cannot set levels | Keeps the value as a new level | Quick conversion, order does not matter |
The decision rule is simple. If the column should hold only a known set of values, use parse_factor() so violations surface. If you already trust the data or do not care about level order, factor() or as.factor() are fine and add no readr dependency.
Common pitfalls
Three parse_factor() behaviours trip up new users. The first is forgetting that without levels, you get no validation at all. With levels = NULL, parse_factor() accepts every distinct value as a level, in order of first appearance, and will never warn you about a typo.
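A quick sketch of the first pitfall, with an invented vector containing a typo:

```r
library(readr)

responses <- c("yes", "no", "mabye")   # hypothetical data with a typo

unchecked <- parse_factor(responses)   # levels = NULL, so nothing to validate against
levels(unchecked)                      # the typo has become a level of its own
nrow(problems(unchecked))              # no problems were recorded
```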
The second is case and whitespace sensitivity. parse_factor() matches values exactly, so "Yes" does not match a level of "yes". Surrounding whitespace is trimmed by default, but letter case is never normalised, and any mismatch becomes NA.
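The sketch below illustrates the case problem and one possible workaround, lower-casing the input before parsing. The `raw` vector is made up, and whether lower-casing is appropriate depends on your data.

```r
library(readr)

raw <- c("yes", " yes ", "Yes", "NO")   # hypothetical messy input
allowed <- c("yes", "no")

# Case mismatches fail to match and become NA (with a warning).
strict <- parse_factor(raw, levels = allowed)
is.na(strict)

# One possible fix: normalise case before parsing.
cleaned <- parse_factor(tolower(raw), levels = allowed)
is.na(cleaned)
```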
The third is treating the warning as fatal. A parsing failure does not stop your script. It returns NA for the bad value and keeps going. If you need failures to halt the pipeline, inspect problems() and act on it yourself.
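If parsing failures should stop the pipeline, one approach is a small wrapper around problems(). The helper `stop_on_problems()` below is hypothetical and not part of readr; it is a sketch of the idea rather than a recommended implementation.

```r
library(readr)

responses <- c("yes", "mabye")   # hypothetical input with one typo
checked <- parse_factor(responses, levels = c("yes", "no", "maybe"))

# Hypothetical helper (not part of readr): stop if any value failed to parse.
stop_on_problems <- function(parsed) {
  issues <- problems(parsed)
  if (nrow(issues) > 0) {
    stop(nrow(issues), " value(s) failed to parse; first bad value: ", issues$actual[1])
  }
  parsed
}

# tryCatch() is used here only so the example runs to completion.
result <- tryCatch(stop_on_problems(checked), error = function(e) conditionMessage(e))
result
```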
A failed parse leaves NA values with only a console warning to show for it. In a non-interactive script that warning is easy to miss, so always check problems() when the level set matters.
Try it yourself
Try it: Use parse_factor() to convert c("hard", "easy", "easy", "medium") into an ordered factor with levels easy < medium < hard. Save the result to ex_difficulty.
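One possible solution, sketched with the values and variable name given in the exercise:

```r
library(readr)

# Levels are written from lowest to highest rank.
ex_difficulty <- parse_factor(
  c("hard", "easy", "easy", "medium"),
  levels = c("easy", "medium", "hard"),
  ordered = TRUE
)
ex_difficulty
```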
Explanation: Passing levels in the order easy, medium, hard fixes the ranking, and ordered = TRUE records it so comparison operators work on the result.
Related readr functions
The parse_factor() function belongs to readr's family of vector parsers, each tuned to a different kind of text:
- parse_number(): extracts a number from messy text such as prices and percentages.
- parse_date(): turns date text into proper Date values.
- col_types: the argument that assigns a parser, including col_factor(), to each column on import.
- read_csv(): the reader where col_factor() applies parse_factor() while the file loads.
The official readr parse_factor reference documents every argument and the locale system in full.
FAQ
What does parse_factor() do in R?
parse_factor() is a readr function that converts a character vector into a factor. You can supply a set of valid levels, and the function checks every value against them. Values that match become factor entries; values that do not become NA and are recorded as parsing problems. For example, parse_factor(c("a", "b"), levels = c("a", "b")) returns a clean two-level factor. It is the standard readr tool for turning categorical text into a validated factor.
What is the difference between parse_factor() and factor()?
Both create a factor, but only parse_factor() validates the input. When a value is not in the supplied levels, factor() converts it to NA silently, while parse_factor() returns NA and a warning that names the failing row. So a typo in your data passes unnoticed through factor() but is flagged by parse_factor(). Use factor() for trusted data and parse_factor() when stray values must be caught.
Why does parse_factor() return NA?
parse_factor() returns NA when a value does not appear in the levels you supplied, or when the value is listed in the na argument. Matching is exact, so a case difference such as "Yes" against a level of "yes" also produces NA. When this happens readr records a parsing problem; call problems() on the result to see which values failed and why.
How do I create an ordered factor with parse_factor()?
Pass ordered = TRUE along with the levels argument in rank order. For example, parse_factor(x, levels = c("low", "medium", "high"), ordered = TRUE) creates an ordered factor where low < medium < high. The order you write the levels in becomes the ranking, so comparison operators like > and < return meaningful results on the parsed values.
Can parse_factor() be used inside read_csv()?
Yes, indirectly. Inside a read_csv() call, set the column type with col_factor(levels = ...) in the col_types argument. col_factor() is the column specification that runs parse_factor() on that column as the file loads, so the column arrives already validated against your level set without a separate parsing step.