Synthetic Data Generation in R: Protect Privacy While Testing Models
Synthetic data looks and behaves like real data but contains no actual individuals. It lets you share datasets, test models, and develop code without exposing sensitive information — all while preserving the statistical relationships your analysis depends on.
You need to share a dataset with a collaborator, but it contains patient records. You want to test a machine learning pipeline, but the real data is locked behind an NDA. You're teaching a class and need realistic examples without privacy risks. Synthetic data solves all of these problems.
The synthpop package uses classification and regression trees to model the conditional distributions in your data. Here's the concept implemented manually.
# Simplified CART-based synthesis concept
set.seed(42)

# Original data
original <- data.frame(
  age = sample(20:70, 200, replace = TRUE),
  income = NA,
  education = sample(c("HS", "BA", "MA", "PhD"), 200, replace = TRUE,
                     prob = c(0.3, 0.4, 0.2, 0.1))
)

# Income depends on age and education (realistic relationships)
edu_bonus <- c(HS = 0, BA = 15000, MA = 25000, PhD = 35000)
original$income <- 20000 + 500 * original$age +
  edu_bonus[original$education] +
  rnorm(200, 0, 10000)

cat("=== Original Data Summary ===\n")
cat(sprintf("N = %d\n", nrow(original)))
cat(sprintf("Mean income: $%.0f\n", mean(original$income)))
cat("Income by education:\n")
print(round(tapply(original$income, original$education, mean)))

# Synthesize: model each variable conditionally
# Variable 1: age (sample from the empirical distribution)
synth_age <- sample(original$age, 200, replace = TRUE)

# Variable 2: education (conditional on age -- use original proportions per age group)
age_group <- cut(synth_age, c(0, 35, 50, 80), labels = c("young", "mid", "senior"))
orig_group <- cut(original$age, c(0, 35, 50, 80), labels = c("young", "mid", "senior"))
synth_edu <- character(200)
for (ag in levels(age_group)) {
  mask_synth <- age_group == ag
  probs <- prop.table(table(original$education[orig_group == ag]))
  synth_edu[mask_synth] <- sample(names(probs), sum(mask_synth),
                                  replace = TRUE, prob = probs)
}

# Variable 3: income (conditional on age and education)
model <- lm(income ~ age + education, data = original)
synth_income <- predict(model, data.frame(age = synth_age, education = synth_edu)) +
  rnorm(200, 0, summary(model)$sigma)

synthetic <- data.frame(age = synth_age, income = synth_income, education = synth_edu)

cat("\n=== Synthetic Data Summary ===\n")
cat(sprintf("N = %d\n", nrow(synthetic)))
cat(sprintf("Mean income: $%.0f\n", mean(synthetic$income)))
cat("Income by education:\n")
print(round(tapply(synthetic$income, synthetic$education, mean)))
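In practice you would let synthpop do this conditional modeling for you. A minimal sketch, assuming the synthpop package is installed: `syn()` uses CART models by default, returns its synthetic copy in the `$syn` element, and ships a `compare()` utility for side-by-side distribution checks.

```r
# Sketch using the synthpop package (assumes it is installed from CRAN)
library(synthpop)

sds <- syn(original, seed = 42)  # CART-based synthesis by default
head(sds$syn)                    # the synthetic data frame
compare(sds, original)           # built-in utility comparison
```

The package handles the visit sequence, factor levels, and missing values that the manual version above glosses over.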
Method 3: Bootstrap-Based Synthesis
Simple and effective for many use cases: sample with replacement and add noise.
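The code for this method is not shown here; a minimal sketch consistent with the utility and privacy checks below (which use the mtcars columns mpg, hp, and wt, and objects named `real` and `synth`) might look like this:

```r
# Bootstrap-based synthesis sketch (assumed setup for the checks that follow)
set.seed(42)
real <- mtcars[, c("mpg", "hp", "wt")]

# 1. Resample rows with replacement
synth <- real[sample(nrow(real), replace = TRUE), ]

# 2. Add Gaussian noise scaled to each column's spread (here 10% of its SD)
for (col in names(synth)) {
  synth[[col]] <- synth[[col]] + rnorm(nrow(synth), 0, 0.1 * sd(real[[col]]))
}
rownames(synth) <- NULL  # drop row names, which would reveal the source rows

summary(synth)
```

Dropping the row names matters here: resampled mtcars rows carry their car names, which would trivially link each synthetic record back to a real one.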
Measuring Utility: Is the Synthetic Data Good Enough?
Synthetic data is useful only if it preserves the statistical properties you need. Here's how to measure that.
# Utility comparison
set.seed(42)
# Fit the same model on real and synthetic data
real_model <- lm(mpg ~ hp + wt, data = mtcars)
synth_model <- lm(mpg ~ hp + wt, data = synth)
cat("=== Utility Comparison ===\n")
cat("\nReal data model coefficients:\n")
print(round(coef(real_model), 4))
cat("\nSynthetic data model coefficients:\n")
print(round(coef(synth_model), 4))
# Compare distributions
cat("\n--- Distribution Comparison ---\n")
for (var in c("mpg", "hp", "wt")) {
  ks <- ks.test(real[[var]], synth[[var]])
  cat(sprintf("%s: KS statistic = %.3f, p = %.3f %s\n",
              var, ks$statistic, ks$p.value,
              ifelse(ks$p.value > 0.05, "(similar)", "(different!)")))
}
Privacy Check: Is the Synthetic Data Safe?
# Check for identity disclosure risk
# If a synthetic record exactly matches a real record, that's a risk
check_disclosure <- function(real, synth, tolerance = 0.01) {
  # Per-column ranges, computed once rather than inside the loop
  ranges <- apply(real, 2, function(x) diff(range(x)))
  matches <- 0
  for (i in seq_len(nrow(synth))) {
    for (j in seq_len(nrow(real))) {
      rel_diffs <- abs(as.numeric(synth[i, ]) - as.numeric(real[j, ])) / ranges
      if (all(rel_diffs < tolerance)) {
        matches <- matches + 1
        break  # one close match is enough for this record
      }
    }
  }
  matches
}
cat("=== Privacy Risk Assessment ===\n")
# Use numeric columns only
real_num <- mtcars[, c("mpg","hp","wt")]
synth_num <- synth[, c("mpg","hp","wt")]
n_matches <- check_disclosure(real_num, synth_num, tolerance = 0.05)
cat(sprintf("Records within 5%% of a real record: %d of %d (%.1f%%)\n",
            n_matches, nrow(synth_num), 100 * n_matches / nrow(synth_num)))
cat("\nGuidelines:\n")
cat(" < 5% close matches: Low risk\n")
cat(" 5-20% close matches: Moderate risk, add more noise\n")
cat(" > 20% close matches: High risk, regenerate with more perturbation\n")
Can synthetic data completely replace real data for model training? For testing and development, yes. For final model training, usually no — synthetic data may miss rare patterns, outliers, or subtle interactions. Use synthetic data for development and privacy-safe sharing, but validate final models on real data.
How much noise should I add? Enough to prevent re-identification but not so much that utility is destroyed. Start with 5-10% of the variable's standard deviation and check both utility metrics (KS test, coefficient comparison) and privacy metrics (close-match percentage). Adjust iteratively.
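That heuristic is easy to wrap in a small helper; a sketch, where the name add_noise and the 5% default are illustrative rather than from any library:

```r
# Illustrative helper: add Gaussian noise proportional to a variable's SD
add_noise <- function(x, frac = 0.05) {
  x + rnorm(length(x), mean = 0, sd = frac * sd(x))
}

set.seed(42)
noisy_mpg <- add_noise(mtcars$mpg, frac = 0.10)  # 10% of SD
sd(noisy_mpg - mtcars$mpg)                       # roughly 0.1 * sd(mtcars$mpg)
```

Tune `frac` upward until the close-match percentage drops into the low-risk band, then confirm the KS tests still pass.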
Is synthpop the only R package for synthetic data? No. Other options include simPop (for complex survey data), simstudy (for clinical trial simulation), fabricatr (for social science), and mice (imputation-based synthesis). Choose based on your data type and use case.
What's Next
Data Ethics in R — The broader ethical framework for data analysis
Data Privacy in R — Anonymization and differential privacy techniques