Data Privacy in R: Anonymization, Differential Privacy & GDPR Compliance
Data privacy is not optional — it's a legal requirement in most jurisdictions and an ethical imperative everywhere. This guide teaches you practical techniques for protecting sensitive data in R, from simple anonymization to formal differential privacy guarantees.
You have a dataset with personal information. Before you can analyze it, share it, or publish results from it, you need to protect the individuals in it. Simply removing names and IDs is not enough — research shows that 87% of Americans can be uniquely identified by just zip code, birth date, and gender. This guide gives you the tools and techniques to do privacy right.
The Privacy Landscape
Key Regulations
| Regulation | Region | Key Requirements |
| --- | --- | --- |
| GDPR | EU/EEA | Lawful basis, data minimization, right to erasure, DPIAs |
| CCPA/CPRA | California | Right to know, delete, opt out of sale |
| HIPAA | US (healthcare) | De-identification standard, safe harbor |
| FERPA | US (education) | Consent for education records disclosure |
| PIPEDA | Canada | Consent, limiting collection, accuracy |
Types of Identifiers
| Category | Examples | Risk Level |
| --- | --- | --- |
| Direct identifiers | Name, SSN, email, phone | Very high — remove always |
| Quasi-identifiers | Age, zip code, gender, job title | High — can re-identify in combination |
| Sensitive attributes | Diagnosis, salary, religion | High — must protect |
| Non-sensitive attributes | Purchase count, click count | Low — usually safe |
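The risk of quasi-identifiers is easy to measure: count how many records are unique on their quasi-identifier combination. A minimal sketch, using a toy data frame with hypothetical columns age, zipcode, and gender:

```r
# Fraction of records uniquely identified by a quasi-identifier combination
quasi_id_uniqueness <- function(data, quasi_ids) {
  combo <- do.call(paste, data[quasi_ids])  # one string per record
  sizes <- table(combo)                     # group sizes per combination
  sum(sizes == 1) / nrow(data)              # share of singleton records
}

df <- data.frame(
  age     = c(28, 28, 45, 52, 31),
  zipcode = c("02139", "02139", "02139", "02140", "02138"),
  gender  = c("F", "F", "M", "M", "F")
)
quasi_id_uniqueness(df, c("age", "zipcode", "gender"))  # 0.6
```

Three of the five toy records are unique on {age, zipcode, gender}, so a linkage attack with an outside dataset could re-identify them.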
Anonymization Techniques in R
1. Suppression and Generalization
The simplest techniques: remove identifying fields or make them less specific.
# Original data with PII
original <- data.frame(
name = c("Alice Smith", "Bob Jones", "Carol Lee", "Dave Kim", "Eve Brown"),
age = c(28, 34, 45, 52, 31),
zipcode = c("02139", "02138", "02139", "02140", "02138"),
salary = c(65000, 82000, 71000, 95000, 58000),
diagnosis = c("Flu", "Diabetes", "Flu", "Cancer", "Diabetes")
)
cat("=== Original Data ===\n")
print(original)
# Anonymized: suppress direct IDs, generalize quasi-IDs
anonymized <- data.frame(
id = paste0("P", 1:5), # Pseudonymous ID
age_group = cut(original$age, breaks = c(0,30,40,50,60), # Generalize age
labels = c("0-30","31-40","41-50","51-60")),
zipcode_3 = substr(original$zipcode, 1, 3), # Generalize zip
salary_band = cut(original$salary, breaks = c(0,60000,80000,100000),
labels = c("<60K","60-80K","80-100K")),
diagnosis = original$diagnosis # Keep for analysis
)
cat("\n=== Anonymized Data ===\n")
print(anonymized)
2. K-Anonymity
A dataset is k-anonymous if every combination of quasi-identifiers appears at least k times. This means no individual can be distinguished from at least k-1 others.
# Check k-anonymity
check_k_anonymity <- function(data, quasi_ids) {
groups <- do.call(paste, data[quasi_ids])
group_sizes <- table(groups)
k <- min(group_sizes)
cat("Quasi-identifiers:", paste(quasi_ids, collapse = ", "), "\n")
cat("Group size distribution:\n")
print(table(group_sizes))
cat("k-anonymity level:", k, "\n")
return(k)
}
# Test with the anonymized data
cat("=== K-Anonymity Check ===\n")
k <- check_k_anonymity(anonymized, c("age_group", "zipcode_3"))
cat("\nIf k = 1, someone could be uniquely identified.\n")
cat("Goal: k >= 5 for most applications.\n")
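One way to reach a target k is suppression: drop every record whose quasi-identifier group has fewer than k members. This is a minimal sketch, not the only strategy (further generalization often loses less data):

```r
# Drop records whose quasi-identifier combination occurs fewer than k times
enforce_k_anonymity <- function(data, quasi_ids, k) {
  combo <- do.call(paste, data[quasi_ids])
  sizes <- table(combo)
  keep <- combo %in% names(sizes)[sizes >= k]  # only groups of size >= k survive
  data[keep, , drop = FALSE]
}

df <- data.frame(
  age_group = c("0-30", "0-30", "41-50", "51-60"),
  zipcode_3 = c("021", "021", "021", "021")
)
result <- enforce_k_anonymity(df, c("age_group", "zipcode_3"), k = 2)
nrow(result)  # 2: only the two "0-30"/"021" records remain
```

The trade-off is visible even on this toy example: half the records are suppressed to reach k = 2.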
3. L-Diversity
K-anonymity isn't enough if everyone in a group has the same sensitive attribute. L-diversity requires that each group has at least L distinct values of the sensitive attribute.
# Check l-diversity
check_l_diversity <- function(data, quasi_ids, sensitive) {
groups <- do.call(paste, data[quasi_ids])
unique_groups <- unique(groups)
min_l <- Inf
for (g in unique_groups) {
mask <- groups == g
l <- length(unique(data[mask, sensitive]))
min_l <- min(min_l, l)
}
cat("Sensitive attribute:", sensitive, "\n")
cat("l-diversity level:", min_l, "\n")
return(min_l)
}
cat("=== L-Diversity Check ===\n")
l <- check_l_diversity(anonymized, c("age_group", "zipcode_3"), "diagnosis")
cat("\nIf l = 1, all people in a group share the same diagnosis\n")
cat("(attacker knows the diagnosis even without identifying the person).\n")
4. Data Perturbation
Add controlled noise to numeric values to prevent exact identification while preserving statistical properties.
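A minimal perturbation sketch: add zero-mean Gaussian noise to each salary so no exact value survives, while aggregate statistics stay close. The noise standard deviation (2000 here) is an arbitrary choice for illustration:

```r
set.seed(1)
salaries <- c(52000, 61000, 48000, 73000, 55000)
# Zero-mean noise: sd controls the privacy/utility trade-off
perturbed <- salaries + rnorm(length(salaries), mean = 0, sd = 2000)
mean(salaries)   # true mean
mean(perturbed)  # close to the true mean, but no individual value is exact
```

Unlike differential privacy below, ad hoc noise addition gives no formal guarantee; averaging repeated releases of the same record can wash the noise out.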
5. Differential Privacy
Differential privacy provides a mathematical guarantee: the output of a query changes very little whether or not any single individual's data is included.
# Laplace noise sampler (inverse-CDF method)
laplace_noise <- function(n, scale) {
u <- runif(n) - 0.5
-scale * sign(u) * log(1 - 2 * abs(u))
}
# Differential privacy: noisy counting
dp_count <- function(true_count, epsilon) {
# Count queries have sensitivity 1, so Laplace scale = 1/epsilon
noise <- laplace_noise(1, 1/epsilon)
max(0, round(true_count + noise)) # Clamp to non-negative
}
dp_mean <- function(values, epsilon, lower, upper) {
# Clip values to known range
clipped <- pmin(pmax(values, lower), upper)
sensitivity <- (upper - lower) / length(values)
noise <- laplace_noise(1, sensitivity / epsilon)
mean(clipped) + noise
}
set.seed(42)
true_data <- c(52000, 61000, 48000, 73000, 55000, 67000, 59000, 82000, 45000, 64000)
cat("=== Differential Privacy Demo ===\n")
cat("True count:", length(true_data), "\n")
cat("True mean salary:", mean(true_data), "\n\n")
for (eps in c(0.1, 0.5, 1.0, 2.0)) {
set.seed(42)
dp_m <- dp_mean(true_data, eps, 30000, 100000)
cat(sprintf("epsilon = %.1f: DP mean = %.0f (error: %.0f)\n",
eps, dp_m, abs(dp_m - mean(true_data))))
}
cat("\nSmaller epsilon = more privacy but more noise.\n")
cat("epsilon = 1.0 is a common practical choice.\n")
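The epsilon/noise trade-off can be checked empirically: Laplace noise with scale 1/epsilon has expected absolute value 1/epsilon, so halving epsilon doubles the typical error. A self-contained simulation sketch (the sampler is repeated here so the snippet runs on its own):

```r
# Inverse-CDF sampler for Laplace(0, scale)
laplace_noise <- function(n, scale) {
  u <- runif(n) - 0.5
  -scale * sign(u) * log(1 - 2 * abs(u))
}

# Average absolute noise added to a sensitivity-1 query at a given epsilon
mean_abs_error <- function(epsilon, n = 10000) {
  mean(abs(laplace_noise(n, 1 / epsilon)))  # expected value is 1/epsilon
}

set.seed(123)
mean_abs_error(1.0)  # roughly 1
mean_abs_error(0.5)  # roughly 2
```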
GDPR Compliance Checklist for R Users
| # | Requirement | Implementation in R |
| --- | --- | --- |
| 1 | Lawful basis for processing | Document in data provenance metadata |
| 2 | Data minimization | Remove unneeded columns before analysis |
| 3 | Purpose limitation | Separate datasets per project |
| 4 | Accuracy | Validate data, document cleaning steps |
| 5 | Storage limitation | Delete raw PII after anonymization |
| 6 | Integrity & confidentiality | Encrypt files, restrict access |
| 7 | Right to erasure | Ability to remove individual records |
| 8 | Data Protection Impact Assessment | Document risks before processing |
| 9 | Record of processing activities | Log all data operations |
| 10 | Pseudonymization where feasible | Replace IDs, use digest::digest() for hashing |
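The right to erasure (row 7 of the checklist) can be as simple as a helper that removes every record belonging to one data subject. A sketch, assuming a hypothetical `id` column holding pseudonymous IDs:

```r
# Remove all records belonging to one data subject
erase_subject <- function(data, subject_id) {
  data[data$id != subject_id, , drop = FALSE]
}

records <- data.frame(
  id    = c("P001", "P002", "P001", "P003"),
  value = c(10, 20, 30, 40)
)
records <- erase_subject(records, "P001")
nrow(records)  # 2 records remain, none for P001
```

In practice erasure must also reach backups, logs, and any derived datasets that still let the subject be singled out.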
# Pseudonymization with hashing
pseudonymize <- function(id, salt = "my_secret_salt_2026") {
# Derive a pseudonymous ID from a salted hash of the input
vapply(paste0(salt, id), function(x) {
# Simple hash for demonstration (use digest::digest in practice)
chars <- utf8ToInt(x)
hash_val <- sum(chars * seq_along(chars)) %% 1000000
sprintf("P%06d", hash_val)
}, character(1), USE.NAMES = FALSE)
}
ids <- c("alice@example.com", "bob@example.com", "carol@example.com")
pseudo_ids <- pseudonymize(ids)
cat("=== Pseudonymization ===\n")
for (i in seq_along(ids)) {
cat(sprintf(" %s -> %s\n", ids[i], pseudo_ids[i]))
}
cat("\nIn practice, use: digest::digest(id, algo='sha256')\n")
cat("Keep the salt secret and separate from the data.\n")
Exercises
Exercise 1: Anonymization
Given a dataset of patients with name, age, zipcode, gender, and disease, apply suppression and generalization to achieve 2-anonymity.
Exercise 2: Privacy Budget
You have a privacy budget of epsilon = 1.0. You want to release the mean and standard deviation of a salary column. How should you split the budget?
cat("=== Answer ===\n")
cat("Split epsilon between queries: each gets epsilon/2 = 0.5\n")
cat("By the sequential composition theorem of differential privacy,\n")
cat("the epsilons of separate queries add up to the total budget.\n\n")
cat("More queries = more noise per query (for the same total privacy).\n")
cat("If you need 4 statistics, each gets epsilon/4 = 0.25.\n")
cat("Plan your queries carefully — every query costs privacy budget.\n")
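Budget splitting can be sketched with the Laplace mechanism. This example releases a mean and a count (a count is simpler to calibrate than a standard deviation) under one total budget; the even split and the sampler are assumptions of the sketch:

```r
# Inverse-CDF sampler for Laplace(0, scale), repeated so the snippet is self-contained
laplace_noise <- function(n, scale) {
  u <- runif(n) - 0.5
  -scale * sign(u) * log(1 - 2 * abs(u))
}

# Release mean and count of clipped values under a total epsilon budget
dp_release <- function(values, epsilon_total, lower, upper) {
  eps <- epsilon_total / 2                      # half the budget per query
  clipped <- pmin(pmax(values, lower), upper)   # bound each value's influence
  mean_sens <- (upper - lower) / length(values) # sensitivity of the mean
  list(
    mean  = mean(clipped) + laplace_noise(1, mean_sens / eps),
    count = max(0, round(length(values) + laplace_noise(1, 1 / eps)))
  )
}

set.seed(42)
dp_release(c(52000, 61000, 48000, 73000, 55000), 1.0, 30000, 100000)
```

Each release spends half the budget, so both answers are noisier than a single query at epsilon = 1.0 would be.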
Summary
| Technique | Protection Level | Utility Loss | Complexity |
| --- | --- | --- | --- |
| Suppression | Basic | Low (lose columns) | Very easy |
| Generalization | Medium | Medium (lose precision) | Easy |
| K-anonymity | Medium | Medium | Moderate |
| L-diversity | Medium-high | Medium-high | Moderate |
| Perturbation | Medium | Low-medium | Moderate |
| Differential privacy | High (mathematical guarantee) | Higher | Complex |
FAQ
Is removing names and IDs enough for anonymization? No. Research by Latanya Sweeney showed that 87% of Americans can be uniquely identified by {zip code, birth date, gender}. You need to address quasi-identifiers too — through generalization, suppression, or formal privacy techniques.
What epsilon value should I use for differential privacy? There's no universal answer. Values between 0.1 and 1.0 are considered strong privacy. Values between 1.0 and 10.0 are moderate. Apple uses epsilon around 2-8 for telemetry data. The US Census used epsilon = 19.61 for the 2020 Census. Consider your threat model and utility needs.
Can anonymized data be re-identified? Potentially, yes. Netflix "anonymized" viewing data was re-identified by cross-referencing with IMDb reviews. AOL search logs were re-identified through unique query patterns. True anonymization is hard — differential privacy provides the strongest guarantees.