R Type Coercion: Why Your Numeric Columns Silently Turn Into Characters
When you mix types in R, the most flexible type wins. One character value in a numeric vector turns everything into text. Understanding this coercion hierarchy prevents the most common data import bugs in R.
You read a CSV file, run mean(df$price), and R returns NA with a warning. You check class(df$price) and discover it's character, even though every value looks numeric. A single "N/A" or "$" or comma in one cell turned the entire column into text. This tutorial explains why, and how to fix it.
The Coercion Hierarchy
R's type system has a strict hierarchy. When you combine values of different types, everything gets converted to the most flexible type:
logical → integer → numeric → complex → character
Each type to the right can represent everything to its left, but not vice versa. A character can represent "TRUE" or "42", but a number can't represent "hello."
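As a quick check of the hierarchy, the sketch below combines adjacent pairs of types with c() and reports the class of each result. The names combos and result_classes are illustrative only.

```r
# Combine two values of different types and see which type wins
combos <- list(
  logical_integer   = c(TRUE, 2L),
  integer_numeric   = c(2L, 3.5),
  numeric_complex   = c(3.5, 1+2i),
  complex_character = c(1+2i, "hello")
)

# class() of each combined vector: the type further right in the hierarchy
result_classes <- sapply(combos, class)
print(result_classes)
```

Each pair ends up with the class of whichever member sits further to the right in the hierarchy.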
# Numbers to logical in conditions
# 0 is FALSE, anything else is TRUE
if (1) cat("1 is TRUE\n")
if (0) cat("0 is TRUE\n") else cat("0 is FALSE\n")
if (-5) cat("-5 is TRUE (any nonzero is TRUE)\n")
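Coercion also runs in the opposite direction from the conditions above: in arithmetic, logical values are promoted to numbers (TRUE becomes 1, FALSE becomes 0). A small sketch with a made-up vector called flags:

```r
flags <- c(TRUE, FALSE, TRUE, TRUE)

total_true <- sum(flags)   # each TRUE counts as 1
share_true <- mean(flags)  # proportion of TRUE values

cat("TRUE + 1:", TRUE + 1, "\n")
cat("Number of TRUE values:", total_true, "\n")
cat("Share of TRUE values:", share_true, "\n")
```

This is why sum() and mean() on a logical vector are a common way to count or measure how often a condition holds.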
Explicit Coercion (Manual Conversion)
When you need to convert types intentionally, use the as.*() functions:
# as.integer truncates, doesn't round!
cat("as.integer(3.9):", as.integer(3.9), "\n") # 3, not 4!
cat("as.integer(3.1):", as.integer(3.1), "\n") # 3
cat("Use round() first if you want rounding:", as.integer(round(3.9)), "\n") # 4
# as.numeric on non-numeric text → NA with warning
result <- suppressWarnings(as.numeric(c("10", "twenty", "30")))
cat("Mixed conversion:", result, "\n") # 10, NA, 30
# Factor to numeric — the classic trap!
f <- factor(c("10", "20", "30"))
cat("Factor labels:", as.character(f), "\n") # cat() on the factor itself would print the level codes
cat("Wrong (gives level codes):", as.numeric(f), "\n") # 1, 2, 3!
cat("Right:", as.numeric(as.character(f)), "\n") # 10, 20, 30
The factor trap: as.numeric(factor_var) gives you the internal level codes (1, 2, 3...), not the values you see! Always convert to character first: as.numeric(as.character(factor_var)).
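When as.numeric() produces NA, it helps to know which inputs caused it. The following is a minimal sketch, using a made-up vector raw, that separates values that failed to convert from values that were already missing:

```r
raw <- c("10", "twenty", "30", NA)

# Suppress the "NAs introduced by coercion" warning; we inspect the result ourselves
converted <- suppressWarnings(as.numeric(raw))

# A value failed to convert if it was not NA before but is NA afterwards
failed <- is.na(converted) & !is.na(raw)
cat("Values that could not be converted:", raw[failed], "\n")
```

Checking raw[failed] before overwriting a column shows exactly which strings need cleaning.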
The CSV Import Problem
This is the most common real-world coercion issue. When R reads a CSV, it guesses column types based on the data. One non-numeric value forces the entire column to character:
# Simulate a CSV with problems
csv_data <- data.frame(
product = c("Widget", "Gadget", "Doohickey", "Thingamajig", "Whatsit"),
price = c("19.99", "N/A", "12.50", "8.75", "$25.00"), # "N/A" and "$" break it
quantity = c("100", "50", "75", "1,200", "80"), # Comma in "1,200" breaks it
stringsAsFactors = FALSE
)
cat("Column types:\n")
str(csv_data)
cat("\nPrice is character because of 'N/A' and '$25.00'\n")
cat("Quantity is character because of '1,200'\n")
# readr::read_csv() avoids some of these problems:
# by default it guesses types from the first 1000 rows (guess_max), treats "" and "NA"
# as missing (extend this with the na argument), and never converts strings to factors
# Base R read.csv with explicit NA handling:
# df <- read.csv("file.csv", na.strings = c("N/A", "NA", "", "null", "#N/A"))
# You can also specify column types explicitly:
# library(readr)
# df <- read_csv("file.csv", col_types = cols(
# price = col_double(),
# quantity = col_integer(),
# name = col_character()
# ))
cat("Best practices for CSV import:\n")
cat("1. Use readr::read_csv() instead of base read.csv()\n")
cat("2. Specify na.strings (or na in readr) for common NA representations\n")
cat("3. Check str() immediately after import\n")
cat("4. Fix types before analysis, not during\n")
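To make the import problem and its fix concrete, here is a sketch that parses a small in-memory CSV with base read.csv() and then cleans the affected columns. The data, the names csv_text, raw_df and clean_df, and the choice of characters to strip are all assumptions for illustration.

```r
# A small in-memory CSV (hypothetical data) so the example runs without a file
csv_text <- 'product,price,quantity
Widget,19.99,100
Gadget,N/A,50
Whatsit,$25.00,"1,200"'

raw_df <- read.csv(text = csv_text, na.strings = c("N/A", "NA", ""),
                   stringsAsFactors = FALSE)
str(raw_df)  # price and quantity are still character because of "$" and ","

# Strip currency symbols and thousands separators, then convert
clean_df <- raw_df
clean_df$price    <- as.numeric(gsub("[$,]", "", raw_df$price))
clean_df$quantity <- as.integer(gsub(",", "", raw_df$quantity))
str(clean_df)
```

Note that na.strings only handles whole-cell markers such as "N/A"; symbols embedded in a value still have to be removed before conversion.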
Type Checking: Diagnosing Problems
When something goes wrong, these functions help you figure out what happened:
# A data frame with type problems
df <- data.frame(
x = c("1", "2", "3"), # Looks numeric but it's character
y = factor(c("A", "B", "A")), # Factor, not character
z = c(TRUE, FALSE, TRUE), # Logical
w = c(1L, 2L, 3L), # Integer
stringsAsFactors = FALSE # Keeps x as character on R versions before 4.0
)
# str() shows everything at once
str(df)
# Check specific columns
cat("\nClass of each column:\n")
print(sapply(df, class))
# is.* functions for specific checks
cat("\nis.numeric(df$x):", is.numeric(df$x), "\n")
cat("is.character(df$x):", is.character(df$x), "\n")
cat("is.factor(df$y):", is.factor(df$y), "\n")
Quick diagnostic function
# A handy function to diagnose type problems in a data frame
type_check <- function(df) {
data.frame(
column = names(df),
type = sapply(df, function(x) class(x)[1]), # first class only, so multi-class columns (e.g. dates) don't break the table
example = sapply(df, function(x) paste(head(x, 3), collapse = ", ")),
n_na = sapply(df, function(x) sum(is.na(x))),
row.names = NULL
)
}
# Test it
test_df <- data.frame(
name = c("Alice", "Bob", NA),
score = c("88", "N/A", "75"),
active = c(TRUE, FALSE, TRUE),
stringsAsFactors = FALSE
)
print(type_check(test_df))
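# Comparing strings that look like numbers: both sides are text, so R compares character by character
# (in a mixed comparison such as 2 > "10", R converts the number to a string first!)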
cat("'2' > '10':", "2" > "10", "\n") # TRUE! String comparison: "2" > "1"
cat("2 > 10:", 2 > 10, "\n") # FALSE (numeric comparison)
# This trips people up with sorted data:
x <- c("1", "2", "10", "20", "3")
cat("String sort:", sort(x), "\n") # "1", "10", "2", "20", "3"
cat("Numeric sort:", sort(as.numeric(x)), "\n") # 1, 2, 3, 10, 20
Warning: String comparison is lexicographic (character by character), not numerical. "2" > "10" is TRUE because "2" sorts after "1", the first character of "10". Always convert to numeric before numerical comparison.
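The same rule applies when only one side is a string: the number is converted to text and the comparison is done on strings. A short sketch:

```r
# The number 2 becomes "2" before comparing, so this is "2" < "10"
cat('2 < "10":', 2 < "10", "\n")

# 10 becomes "10", so the two strings match
cat('10 == "10":', 10 == "10", "\n")

# Converting the string side makes the comparison numeric again
cat('2 < as.numeric("10"):', 2 < as.numeric("10"), "\n")
```

Converting the character side explicitly is the safer habit, since it makes the intended comparison visible in the code.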
Coercion in data frame operations
# Adding a new column with mixed types
df <- data.frame(a = 1:3, b = 4:6)
# This works — numeric stays numeric
df$c <- c(7.5, 8.5, 9.5)
cat("Column c type:", class(df$c), "\n")
# This coerces the whole column to character
df$d <- c(1, "two", 3)
cat("Column d type:", class(df$d), "\n")
cat("Column d values:", df$d, "\n")
str(df)
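Coercion also happens when you modify a single cell of an existing column. The sketch below uses two made-up data frames, scores and scores2, to contrast assigning a string with assigning NA:

```r
scores <- data.frame(id = 1:3, value = c(10.5, 20.1, 30.7))
cat("Before:", class(scores$value), "\n")

# Assigning one string into a numeric column coerces every value in that column
scores$value[2] <- "missing"
cat("After assigning a string:", class(scores$value), "\n")

# Using NA instead keeps the column numeric
scores2 <- data.frame(id = 1:3, value = c(10.5, 20.1, 30.7))
scores2$value[2] <- NA
cat("After assigning NA:", class(scores2$value), "\n")
```

Marking unknown values with NA rather than a text placeholder keeps the column usable for arithmetic.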
Practice Exercises
Exercise 1: Predict the Type
# Exercise: Predict the type of each result, then verify
# Write your prediction as a comment, then uncomment the cat() line
a <- c(1, 2, 3)
b <- c(1L, 2L, 3L)
d <- c(TRUE, 1, "hello")
e <- c(1L, 3.14)
f <- c(FALSE, 0L)
# cat("a:", class(a), "\n") # Prediction: ?
# cat("b:", class(b), "\n") # Prediction: ?
# cat("d:", class(d), "\n") # Prediction: ?
# cat("e:", class(e), "\n") # Prediction: ?
# cat("f:", class(f), "\n") # Prediction: ?
# Solution
a <- c(1, 2, 3) # numeric (default for bare numbers)
b <- c(1L, 2L, 3L) # integer (L suffix)
d <- c(TRUE, 1, "hello") # character (string wins over everything)
e <- c(1L, 3.14) # numeric (double beats integer)
f <- c(FALSE, 0L) # integer (integer beats logical)
cat("a:", class(a), "\n") # numeric
cat("b:", class(b), "\n") # integer
cat("d:", class(d), "\n") # character
cat("e:", class(e), "\n") # numeric
cat("f:", class(f), "\n") # integer
Explanation: Remember the hierarchy: logical → integer → numeric → character. The most flexible type always wins. d has all three types and character wins. f has logical and integer, so integer wins.
Exercise 2: Fix the Broken Data
# Exercise: This data has type problems. Fix all columns to their
# correct types and calculate the total revenue.
sales <- data.frame(
product = c("Widget", "Gadget", "Doohickey"),
price = c("$12.50", "$8.99", "$15.00"),
qty = c("100", "2,500", "50"),
taxable = c("yes", "no", "yes"),
stringsAsFactors = FALSE
)
# Target types: product=character, price=numeric, qty=integer, taxable=logical
# Then calculate: revenue = price * qty
# Write your code below:
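One possible solution is sketched here. It is an assumption about the intended approach rather than the original author's code, and the gsub() patterns reflect only the characters that appear in this data.

```r
# One possible solution (a sketch)
sales <- data.frame(
  product = c("Widget", "Gadget", "Doohickey"),
  price = c("$12.50", "$8.99", "$15.00"),
  qty = c("100", "2,500", "50"),
  taxable = c("yes", "no", "yes"),
  stringsAsFactors = FALSE
)

# Remove "$" and "," before converting to numbers
sales$price   <- as.numeric(gsub("[$,]", "", sales$price))
sales$qty     <- as.integer(gsub(",", "", sales$qty))

# Comparing against "yes" yields TRUE/FALSE
sales$taxable <- sales$taxable == "yes"

sales$revenue <- sales$price * sales$qty
str(sales)
cat("Total revenue:", sum(sales$revenue), "\n")
```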
Explanation: gsub() removes unwanted characters before conversion. == "yes" converts the string to a logical comparison result (TRUE/FALSE). This is a pattern you'll use every time you clean imported data.
Exercise 3: The Factor Trap
# Exercise: A survey recorded satisfaction on a 1-5 scale.
# Due to CSV import, the scores became factors.
# Convert them to proper integers and compute the average.
satisfaction <- factor(c("4", "5", "3", "5", "2", "4", "5", "3", "4", "1"))
# WARNING: as.numeric(satisfaction) gives WRONG results!
# Show the wrong result, then the correct one.
# Write your code below:
# Solution
satisfaction <- factor(c("4", "5", "3", "5", "2", "4", "5", "3", "4", "1"))
# The WRONG way (gives internal level codes, not actual values)
wrong <- as.numeric(satisfaction)
cat("WRONG (level codes):", wrong, "\n")
cat("WRONG mean:", mean(wrong), "\n\n")
# The RIGHT way (convert to character first, then numeric)
right <- as.numeric(as.character(satisfaction))
cat("RIGHT (actual values):", right, "\n")
cat("RIGHT mean:", mean(right), "\n\n")
# Alternative: use levels() directly
also_right <- as.numeric(levels(satisfaction))[satisfaction]
cat("Also right:", also_right, "\n")
cat("Mean:", mean(also_right), "\n")
Explanation: Factors store integers internally (level codes: 1, 2, 3...) and display text labels. as.numeric() gives you the internal codes. You must go through as.character() first to get the actual text, then convert to numeric. This is one of R's most infamous gotchas.
Summary
| Conversion | Function | Gotcha |
|---|---|---|
| To numeric | as.numeric(x) | Non-numeric text becomes NA (with a warning, not an error) |
| To integer | as.integer(x) | Truncates, doesn't round |
| To character | as.character(x) | Always works |
| To logical | as.logical(x) | Among strings, only forms like "TRUE"/"true"/"T" and "FALSE"/"false"/"F" work; anything else (e.g. "yes") becomes NA |
| Factor to numeric | as.numeric(as.character(x)) | Direct as.numeric() gives level codes! |
| Check type | class(x), is.numeric(x) | is.numeric() is TRUE for both double and integer |
The coercion hierarchy: logical → integer → numeric → complex → character
The golden rules:
Character always wins in mixed vectors
Check str() immediately after importing data
Clean strings (gsub) before converting types
Never use as.numeric() directly on factors
FAQ
Why does R coerce silently instead of throwing an error?
R was designed for interactive data analysis where flexibility matters more than strictness. Automatic coercion means TRUE + 1 just works (giving 2) rather than requiring explicit conversion. The trade-off is that it can hide bugs — which is why str() after import is essential.
How do I prevent coercion when creating vectors?
You can't — c() always coerces to a single type. If you need mixed types, use a list() instead of c().
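A minimal sketch of the difference, using the illustrative names mixed_vector and mixed_list:

```r
mixed_vector <- c(1, "two", TRUE)     # everything becomes character
mixed_list   <- list(1, "two", TRUE)  # each element keeps its own type

print(class(mixed_vector))
print(sapply(mixed_list, class))
```

The list keeps the number, the string, and the logical as they are, at the cost of losing vectorized arithmetic on the whole container.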
Why does read.csv turn strings into factors?
Historical reasons. In older R (pre-4.0), read.csv() had stringsAsFactors = TRUE by default. Since R 4.0, the default changed to FALSE. If you're using R 4.0+, this shouldn't be an issue. If it is, add stringsAsFactors = FALSE or switch to readr::read_csv().
How do I convert an entire data frame's column types at once?
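There is no single switch for detecting implicit coercion across a data frame, but str() shows the current types, and comparing sapply(df, class) before and after an operation reveals any changes. The readr package also prints its column type guesses during import. To re-guess the type of every column in one call, base R offers type.convert(df, as.is = TRUE).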
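For the conversion itself, one option in recent versions of base R is type.convert(), which re-guesses the type of each column from its text. This is a sketch with a made-up data frame survey; the na.strings values are an assumption about how missing data is marked.

```r
survey <- data.frame(
  id    = c("1", "2", "3"),
  score = c("88.5", "N/A", "75"),
  name  = c("Alice", "Bob", "Cara"),
  stringsAsFactors = FALSE
)

# as.is = TRUE keeps non-numeric text as character instead of making factors
converted <- type.convert(survey, na.strings = c("NA", "N/A"), as.is = TRUE)
print(sapply(converted, class))
```

Columns containing stray symbols such as "$" or "," still stay character, so string cleaning has to happen before this step.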
What's Next?
Understanding coercion prevents the most frustrating R bugs. Related topics:
R Attributes — metadata that coercion can preserve or destroy
R Factors — the data type built on top of coercion
Data Wrangling with dplyr — type-safe data transformation