Bias in Data and Models: Find It With R Before Your Results Mislead Anyone

Bias is a systematic error that pushes your numbers, and the decisions made from them, in the wrong direction. This guide shows how to detect bias across three layers: the sample you collected, the way it was measured, and the model you trained on it. All examples run on base R plus dplyr and ggplot2, so you can audit any analysis without installing specialised fairness packages.

What does bias actually mean in data analysis?

Most bias bugs hide in plain sight. A model that looks accurate overall while quietly failing one group. A survey whose results don't match the population it claims to describe. A metric that means slightly different things for different people. Before defining the three flavours of bias, look at what biased data actually does to a result you might be tempted to trust.

The block below builds a tiny synthetic loan dataset where two groups, call them A and B, have the same true creditworthiness on average. Watch what happens to the observed approval rates anyway.

Simulate biased approval rates:

```r
library(dplyr)
library(ggplot2)

set.seed(2026)
n <- 1000
group_vec <- sample(c("A", "B"), n, replace = TRUE)

loans <- tibble(
  group = group_vec,
  score = rnorm(n, mean = 650, sd = 50),  # same true creditworthiness
  approved = rbinom(n, 1, prob = ifelse(group_vec == "A", 0.70, 0.45))
)

loans |>
  group_by(group) |>
  summarise(
    n = n(),
    mean_score = round(mean(score), 1),
    approval_pct = round(mean(approved) * 100, 1)
  )
#> # A tibble: 2 × 4
#>   group     n mean_score approval_pct
#>   <chr> <int>      <dbl>        <dbl>
#> 1 A       498      650.6         69.7
#> 2 B       502      650.4         44.6
```

The two groups have nearly identical mean scores, 650.6 versus 650.4, yet group A is approved 70% of the time and group B only 45%. A 25-point gap from a process that, on the underlying number, should be the same. That is what bias looks like in a single table: a measurable distance between what you reported and what should be true.

Three different culprits can produce a gap like this. We'll meet each in turn:

  • Sampling bias: the wrong people ended up in your dataset.
  • Measurement bias: the right people ended up in your dataset, but their values were recorded differently.
  • Algorithmic bias: the people and the values are fine, but a model learned to treat them differently.
Key Insight
Bias is not a moral verdict; it is a measurable distance. Once you can put a number on the gap between what your data says and what is actually true, you can debug it the same way you debug any other software bug.

Try it: Re-run the loan generator with set.seed(99) and confirm the approval gap stays roughly the same size. The point is to show that the gap is a property of the process, not a quirk of one random draw.

Exercise: compute the approval-rate gap:

```r
# Try it: change the seed and re-measure the gap
set.seed(99)
g <- sample(c("A", "B"), 1000, replace = TRUE)
ex_loans <- tibble(
  group = g,
  score = rnorm(1000, 650, 50),
  approved = rbinom(1000, 1, prob = ifelse(g == "A", 0.70, 0.45))
)

# Compute the approval-rate gap (group A minus group B):
# your code here
#> Expected: a value near 0.25
```
Solution:

```r
gap <- ex_loans |>
  group_by(group) |>
  summarise(rate = mean(approved)) |>
  summarise(diff = rate[group == "A"] - rate[group == "B"]) |>
  pull(diff)

round(gap, 2)
#> [1] 0.25
```

Explanation: The gap is generated by the data-creation process, not by the random seed. Different seeds give slightly different numbers, but the systematic 25-point gap is built into how the synthetic outcomes were drawn.

How do you detect sampling bias in R?

Sampling bias means some members of the target population were more likely to end up in your dataset than others. A satisfaction survey that only reaches customers who finished checkout misses everyone who abandoned the cart. A clinical trial run only at urban hospitals misses rural patients. The numbers from a biased sample look perfectly normal; they just describe the wrong population.

Bias audit workflow

Figure 1: A bias audit moves left-to-right through three checks before the model ever ships.

The simplest detection trick is to compare your sample's group proportions to a known reference: the true population, a census, or a prior wave of the same survey. If the gap is bigger than chance allows, a chi-squared test will flag it.

Chi-squared test for sampling bias:

```r
# Known true population: 50% group A, 50% group B
set.seed(7)
population_share <- c(A = 0.50, B = 0.50)

# A sampling process that systematically over-collects group A
sample_draw <- sample(
  c("A", "B"),
  size = 400,
  replace = TRUE,
  prob = c(0.70, 0.30)  # the bias is right here
)

observed <- table(sample_draw)
observed
#> sample_draw
#>   A   B
#> 285 115

chisq.test(x = observed, p = population_share)
#>
#> 	Chi-squared test for given probabilities
#>
#> data:  observed
#> X-squared = 72.25, df = 1, p-value < 2.2e-16
```

A p-value below 0.05 means the gap between your sample and the population is too big to blame on luck. Here it is essentially zero: group A is wildly over-represented. In a real audit, you would now ask why: maybe the recruitment ad ran on a platform group A uses more, or the consent form was only translated into one language. The chi-squared test does not tell you the cause; it tells you to stop and look.

Statistical significance is one lens. The next is the 80% rule (also called the four-fifths rule), borrowed from US employment law: any group whose representation ratio falls below 0.8 is considered substantially under-represented.

Representation ratio with the 80 percent flag:

```r
rep_table <- tibble(
  group = names(observed),
  sample_share = as.numeric(observed) / sum(observed),
  population_share = as.numeric(population_share)
) |>
  mutate(
    rep_ratio = round(sample_share / population_share, 2),
    flag_80 = rep_ratio < 0.8
  )
rep_table
#> # A tibble: 2 × 5
#>   group sample_share population_share rep_ratio flag_80
#>   <chr>        <dbl>            <dbl>     <dbl> <lgl>
#> 1 A            0.712              0.5      1.42 FALSE
#> 2 B            0.288              0.5      0.58 TRUE
```

Group B's representation ratio is 0.58, well below the 0.8 threshold: it is substantially under-sampled. This per-group view is more actionable than a single p-value because it points directly at the group you need to recruit more of.

Tip
When you don't know the true population, use a prior wave or an authoritative reference. Census tables, voter rolls, prior survey waves, and administrative records all give you something to compare against. Without a reference, "sampling bias" is not measurable; it is just a worry.

Try it: Write a small function that takes a vector of group labels and a named vector of expected population shares, then returns the chi-squared p-value. This is the audit primitive you'll reach for whenever a fresh dataset arrives.

Exercise: wrap the chi-squared test into a function:

```r
# Try it: write ex_chi_test()
ex_chi_test <- function(group_vec, expected_share) {
  # your code here
}

# Test:
ex_chi_test(sample_draw, c(A = 0.5, B = 0.5))
#> Expected: a p-value far below 0.05
```
Solution:

```r
ex_chi_test <- function(group_vec, expected_share) {
  obs <- table(group_vec)[names(expected_share)]
  chisq.test(x = obs, p = expected_share)$p.value
}

ex_chi_test(sample_draw, c(A = 0.5, B = 0.5))
#> [1] 1.91e-17
```

Explanation: The function reorders the observed counts to match the expected-share vector, runs chisq.test(), and returns just the p-value so you can use it inside a pipeline.
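Detection is only half the job for sampling bias; the summary table at the end of this guide mentions post-stratification as a fix, and it is simple enough to sketch here. The idea: weight each row by population share divided by sample share, so weighted summaries match the reference population. The block below is a minimal illustration on synthetic data; the outcome variable `y` and the helper names are invented for this sketch, not part of the tutorial's datasets.

```r
library(dplyr)

set.seed(7)
# Biased sample: group A over-collected, as in the chi-squared example
g <- sample(c("A", "B"), 400, replace = TRUE, prob = c(0.70, 0.30))
# A hypothetical outcome that happens to differ by group
svy <- tibble(group = g, y = ifelse(g == "A", rnorm(400, 10), rnorm(400, 20)))

# Post-stratification weight per group: population share / sample share
pop_share <- c(A = 0.5, B = 0.5)
weights_tbl <- svy |>
  count(group) |>
  mutate(w = pop_share[group] / (n / sum(n)))

svy_w <- left_join(svy, weights_tbl[, c("group", "w")], by = "group")

# The naive mean is dragged toward over-sampled group A;
# the weighted mean lands near the true population mean of ~15
c(naive = mean(svy_w$y), weighted = weighted.mean(svy_w$y, svy_w$w))
```

Post-stratification does not fix a sample that is missing a group entirely; it can only re-balance groups that were at least partially observed.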

How do you spot measurement bias in your data?

Measurement bias is sneakier than sampling bias because the right people are in the dataset, but the instrument records different values for the same underlying truth across groups. A self-reported height is on average a couple of centimetres taller than a calibrated one. A self-reported income skews lower for high earners and higher for low earners. A pulse oximeter calibrated on light skin reads systematically wrong on dark skin. The summary statistics look healthy; the meaning of each number is not.

The block below simulates a realistic case: group A is measured by a calibrated tape (no error), group B by self-report (a 3 cm upward bias plus more noise). The true heights are identical between the two groups by construction.

Inject measurement bias into heights:

```r
set.seed(11)
n <- 600
heights <- tibble(
  group = rep(c("A", "B"), each = n / 2),
  true_height = rnorm(n, mean = 170, sd = 8)
) |>
  mutate(
    measured = ifelse(
      group == "A",
      true_height + rnorm(n, 0, 0.3),      # calibrated tape
      true_height + 3 + rnorm(n, 0, 1.5)   # self-report: +3cm bias
    )
  )

heights |>
  group_by(group) |>
  summarise(
    mean_true = round(mean(true_height), 2),
    mean_measured = round(mean(measured), 2),
    bias = round(mean(measured - true_height), 2)
  )
#> # A tibble: 2 × 4
#>   group mean_true mean_measured  bias
#>   <chr>     <dbl>         <dbl> <dbl>
#> 1 A        169.92        169.92  0.00
#> 2 B        170.07        173.05  2.98
```

Group B's measured mean is about 3 cm higher than its true mean, exactly the bias we built in. Crucially, if you only had the measured column, which is all you would ever see in a real dataset, you would conclude that group B is taller than group A and never know that the conclusion is an instrument artefact.

The fix needs a gold-standard subsample: a small set of cases where both the cheap and the expensive measurements exist. From that subsample you estimate the bias, then subtract it from the rest of group B.

Calibrate with a gold-standard subsample:

```r
# Gold-standard subsample: 30 group-B cases with both measurements
gold <- heights |>
  filter(group == "B") |>
  slice_head(n = 30) |>
  mutate(error = measured - true_height)

correction <- mean(gold$error)
round(correction, 2)
#> [1] 2.94

heights_corrected <- heights |>
  mutate(measured_fixed = ifelse(group == "B", measured - correction, measured))

heights_corrected |>
  group_by(group) |>
  summarise(mean_fixed = round(mean(measured_fixed), 2))
#> # A tibble: 2 × 2
#>   group mean_fixed
#>   <chr>      <dbl>
#> 1 A         169.92
#> 2 B         170.11
```

After applying the correction, both groups land near the true mean of 170 cm. The technique generalises: any time you suspect the instrument differs across groups, set aside a calibration subset, estimate the bias, and apply it.
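Because the estimate-then-subtract step recurs in every audit, it is worth wrapping once. The helper below, `calibrate_group()`, is a hypothetical convenience function written for this guide, not from any package; it assumes columns named `group`, `measured`, and `truth`, and a gold-standard subsample that contains both measurements.

```r
library(dplyr)

# Hypothetical helper: estimate one group's bias from a gold-standard
# subsample (rows with both `measured` and `truth`), then subtract it
calibrate_group <- function(data, gold, target_group) {
  correction <- gold |>
    filter(group == target_group) |>
    summarise(bias = mean(measured - truth)) |>
    pull(bias)
  data |>
    mutate(measured_fixed = ifelse(group == target_group,
                                   measured - correction,
                                   measured))
}

# Toy check: group B measured with a +3 bias
set.seed(42)
d <- tibble(group = rep(c("A", "B"), each = 100),
            truth = rnorm(200, 170, 8)) |>
  mutate(measured = truth + ifelse(group == "B", 3, 0) + rnorm(200, 0, 0.5))

fixed <- calibrate_group(d, gold = d[d$group == "B", ][1:25, ], "B")
fixed |>
  group_by(group) |>
  summarise(residual_bias = round(mean(measured_fixed - truth), 2))
```

The residual bias for both groups should land near zero; how near depends on the size of the gold-standard subsample, since the correction itself is an estimate with its own sampling error.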

Warning
Measurement bias is invisible to summary statistics. Group means look fine even when the meaning of the number differs by group. The only defence is a calibration step where you compare your instrument against ground truth in a known subsample.

Try it: Given the heights dataframe with true_height and measured columns, compute the per-group mean error and identify which group is biased.

Exercise: per-group mean error:

```r
# Try it: compute per-group mean error
ex_err <- heights |>
  # your code here

# Expected output: a tibble where group B has a mean error near 3
```
Solution:

```r
ex_err <- heights |>
  group_by(group) |>
  summarise(mean_error = round(mean(measured - true_height), 2))
ex_err
#> # A tibble: 2 × 2
#>   group mean_error
#>   <chr>      <dbl>
#> 1 A           0.00
#> 2 B           2.98
```

Explanation: Subtracting true_height from measured gives the per-row error. Averaging by group reveals the systematic bias: group A is unbiased, while group B carries a +3 cm shift.

How do you measure algorithmic bias in a model's predictions?

Algorithmic bias is the third layer. Even if your sample is representative and your measurements are clean, a model trained on that data can still produce systematically different errors for different groups. This happens when the historical outcomes the model learns from already encode unequal treatment, or when a feature acts as a proxy for the protected attribute.
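The proxy mechanism is worth seeing concretely before measuring it. The sketch below is synthetic: `postcode_score` is an invented stand-in for any feature that tracks group membership. The model never sees `group`, yet its predictions split by group anyway, because the proxy smuggles the group signal in.

```r
library(dplyr)

set.seed(3)
n_p <- 2000
grp <- sample(c("A", "B"), n_p, replace = TRUE)
# Invented proxy feature: correlated with group membership
postcode_score <- rnorm(n_p, mean = ifelse(grp == "A", 1, -1), sd = 1)
# Historical outcomes that already encode unequal treatment by group
outcome <- rbinom(n_p, 1, prob = ifelse(grp == "A", 0.7, 0.4))

d_proxy <- tibble(group = grp, postcode_score = postcode_score, outcome = outcome)

# The model never sees `group`...
m_proxy <- glm(outcome ~ postcode_score, data = d_proxy, family = binomial)

# ...yet its positive rates still split sharply by group, via the proxy
d_proxy |>
  mutate(pred = as.integer(predict(m_proxy, type = "response") > 0.5)) |>
  group_by(group) |>
  summarise(positive_rate = round(mean(pred), 2))
```

Dropping the protected attribute from the feature list, sometimes called "fairness through unawareness", therefore guarantees nothing on its own.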

Fairness metrics from a confusion matrix

Figure 2: Three fairness metrics derived from one per-group confusion matrix.

To put numbers on it, fit a logistic regression to the biased loan data from earlier and look at the predictions group by group.

Score-only logistic model:

```r
model <- glm(approved ~ score, data = loans, family = binomial)

loans2 <- loans |>
  mutate(
    prob = predict(model, type = "response"),
    predicted = as.integer(prob > 0.5)
  )

loans2 |>
  group_by(group) |>
  summarise(positive_rate = round(mean(predicted), 3))
#> # A tibble: 2 × 2
#>   group positive_rate
#>   <chr>         <dbl>
#> 1 A             0.582
#> 2 B             0.582
```

The model is score-only, so its raw positive rate is the same in both groups, a useful sanity check. The bias only shows up when you compare predictions against the actual approval outcomes, which is what a confusion matrix does.

Group-wise confusion matrices:

```r
cm <- loans2 |>
  group_by(group) |>
  summarise(
    TP = sum(predicted == 1 & approved == 1),
    FP = sum(predicted == 1 & approved == 0),
    FN = sum(predicted == 0 & approved == 1),
    TN = sum(predicted == 0 & approved == 0)
  )
cm
#> # A tibble: 2 × 5
#>   group    TP    FP    FN    TN
#>   <chr> <int> <int> <int> <int>
#> 1 A       213    77   134    74
#> 2 B       138   154    86   124
```

Three fairness metrics fall out of these four numbers per group. Each one captures a different definition of "fair," and they all conflict with each other when base rates differ, so you must pick the one that matches the harm you are trying to prevent.

Demographic parity asks whether the model's positive rate is the same across groups:

$$\text{DP}_g = \frac{TP_g + FP_g}{TP_g + FP_g + FN_g + TN_g}$$

Equal opportunity asks whether the true positive rate (the share of qualified applicants who get approved) is the same:

$$\text{EO}_g = \frac{TP_g}{TP_g + FN_g}$$

Predictive equality asks whether the false positive rate (the share of unqualified applicants who incorrectly get approved) is the same:

$$\text{PE}_g = \frac{FP_g}{FP_g + TN_g}$$

In each formula, $g$ indexes the protected group, and the four counts come from that group's confusion matrix. If formulas aren't your preferred way to learn, skip to the code; the table it prints says the same thing.

Three fairness metrics side by side:

```r
fair <- cm |>
  mutate(
    dem_parity = round((TP + FP) / (TP + FP + FN + TN), 3),
    equal_opp_TPR = round(TP / (TP + FN), 3),
    pred_eq_FPR = round(FP / (FP + TN), 3)
  )
fair
#> # A tibble: 2 × 8
#>   group    TP    FP    FN    TN dem_parity equal_opp_TPR pred_eq_FPR
#>   <chr> <int> <int> <int> <int>      <dbl>         <dbl>       <dbl>
#> 1 A       213    77   134    74      0.582         0.614       0.510
#> 2 B       138   154    86   124      0.582         0.616       0.554
```

Demographic parity is roughly equal at 0.58, and equal opportunity is roughly equal at 0.61: through those two lenses the score-only model treats both groups the same. Predictive equality is the one that diverges: group B's false positive rate is about four points higher (0.554 versus 0.510), meaning the model wrongly approves more unqualified group B applicants. That is a real harm to the applicants whose loans go bad.

A picture makes the comparison faster:

Plot the fairness metrics by group:

```r
fair_long <- fair |>
  select(group, dem_parity, equal_opp_TPR, pred_eq_FPR) |>
  tidyr::pivot_longer(-group, names_to = "metric", values_to = "value")

ggplot(fair_long, aes(x = metric, y = value, fill = group)) +
  geom_col(position = "dodge") +
  geom_hline(yintercept = 0.5, linetype = "dashed", colour = "grey40") +
  scale_fill_manual(values = c("#7B66B0", "#C39BD3")) +
  labs(
    title = "Three fairness metrics, side by side",
    subtitle = "Bars at the same height = parity; gaps = bias",
    x = NULL, y = NULL
  ) +
  theme_minimal()
```
Key Insight
No single model can satisfy all fairness metrics at once when base rates differ between groups. This is a mathematical impossibility, not an engineering failure. Pick the metric that matches the harm: equal opportunity when missing a qualified person is the worst outcome, predictive equality when wrongly accepting someone is.

Try it: Change the classification threshold from 0.5 to 0.6 and re-compute the demographic parity. Predict the direction of the change before you run the code.

Exercise: re-threshold and re-measure parity:

```r
# Try it: re-threshold predictions and recompute demographic parity
ex_thresh <- loans2 |>
  mutate(predicted = as.integer(prob > 0.6))

# Compute demographic parity per group:
# your code here
```
Solution:

```r
ex_thresh |>
  group_by(group) |>
  summarise(dem_parity = round(mean(predicted), 3))
#> # A tibble: 2 × 2
#>   group dem_parity
#>   <chr>      <dbl>
#> 1 A          0.402
#> 2 B          0.402
```

Explanation: Raising the threshold from 0.5 to 0.6 makes the model stricter, so the positive rate drops by roughly the same amount in both groups. Demographic parity stays balanced because the model applies one threshold to everyone; as the next section shows, though, a single shared threshold also locks in whatever error-rate differences already exist between the groups.

What can you do once you find bias?

Detection without mitigation is just a status report. Once you know which layer carries the bias (sampling, measurement, or algorithm), there are three practical levers that work without rebuilding your entire pipeline.

The first lever is reweighting: assign higher importance to under-represented samples when fitting the model. In glm() this is the weights argument. The second is threshold adjustment: pick group-specific decision thresholds that equalise a chosen fairness metric. The third is feature surgery: drop or transform variables that act as proxies for the protected attribute, even if the protected attribute itself is not in the model.
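Feature surgery needs a way to find the proxies first. One quick heuristic, sketched below on synthetic data with invented feature names (`zip_score`, `income`) and an invented helper (`proxy_strength()`): fit a model that predicts the protected attribute from each candidate feature alone, and treat a large McFadden pseudo-R-squared as a proxy flag. This is a screening aid, not a formal test.

```r
library(dplyr)

set.seed(8)
n_s <- 1000
grp_s <- sample(c("A", "B"), n_s, replace = TRUE)
d_surgery <- tibble(
  group = grp_s,
  zip_score = rnorm(n_s, ifelse(grp_s == "A", 1, -1), 1),  # invented proxy
  income = rnorm(n_s, 50, 10)                              # unrelated to group
)

# Proxy screen: how well does each feature alone predict the protected
# attribute? A large McFadden pseudo-R^2 flags a proxy candidate.
proxy_strength <- function(feature) {
  m <- glm(I(group == "A") ~ feature, data = d_surgery, family = binomial)
  round(1 - m$deviance / m$null.deviance, 2)
}

c(zip_score = proxy_strength(d_surgery$zip_score),
  income = proxy_strength(d_surgery$income))
# zip_score scores high and is a candidate to drop or decorrelate;
# income carries essentially no group signal
```

Dropping a flagged feature costs predictive power if it also carries legitimate signal, which is why decorrelating (regressing the feature on group and keeping the residual) is sometimes preferred over outright removal.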

The block below applies threshold adjustment to the loans model. The goal is to shrink the false positive rate gap between the groups, currently about four percentage points, to under two.

Per-group threshold adjustment:

```r
# Find a group-B threshold that lowers its FPR to match group A's
loans3 <- loans2 |>
  mutate(
    threshold = ifelse(group == "B", 0.55, 0.50),
    predicted = as.integer(prob > threshold)
  )

loans3 |>
  group_by(group) |>
  summarise(
    FPR = round(sum(predicted == 1 & approved == 0) / sum(approved == 0), 3),
    TPR = round(sum(predicted == 1 & approved == 1) / sum(approved == 1), 3)
  )
#> # A tibble: 2 × 3
#>   group   FPR   TPR
#>   <chr> <dbl> <dbl>
#> 1 A     0.510 0.614
#> 2 B     0.493 0.580
```

Group B's false positive rate is now 0.493, slightly below group A's 0.510. The gap is closed, but at a cost: group B's true positive rate also dropped from 0.616 to 0.580, meaning a few more qualified group-B applicants get rejected. This is the central trade-off in fairness work: every mitigation moves something somewhere.

Note
Mitigation always trades something away. Document which fairness metric you optimised for, which one you sacrificed, and who bears the cost. A bias audit that buries the trade-off is just a different kind of bias.

Try it: Fit the loan model with reweighting, give group B observations twice the weight of group A, and compute the new demographic parity ratio.

Exercise: weighted model refit:

```r
# Try it: refit with weights
ex_weighted <- loans |>
  mutate(w = ifelse(group == "B", 2, 1))

# Fit a weighted glm and compute the per-group positive rate at threshold 0.5:
# your code here
```
Solution:

```r
ex_model <- glm(approved ~ score, data = ex_weighted,
                family = binomial, weights = w)

ex_weighted |>
  mutate(predicted = as.integer(predict(ex_model, type = "response") > 0.5)) |>
  group_by(group) |>
  summarise(positive_rate = round(mean(predicted), 3))
#> # A tibble: 2 × 2
#>   group positive_rate
#>   <chr>         <dbl>
#> 1 A             0.582
#> 2 B             0.582
```

Explanation: Because the model only uses score, reweighting on group does not change the score-to-prediction mapping. Reweighting moves the needle only when the model has features whose relationship to the outcome differs across groups; try adding a noisy second predictor and watch the result change.
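That suggestion is easy to act on. The sketch below builds a fresh toy dataset (invented names, separate from the tutorial's loans data) where a second predictor's effect on approval differs by group. Now, unlike the score-only case above, the reweighted fit produces visibly different positive rates than the unweighted one.

```r
library(dplyr)

set.seed(5)
n_t <- 1000
grp_t <- sample(c("A", "B"), n_t, replace = TRUE)
tenure <- rnorm(n_t, 5, 2)
# The tenure -> approval relationship differs by group, which is exactly
# the condition under which reweighting has something to change
p_true <- plogis(-2 + ifelse(grp_t == "A", 0.6, 0.2) * tenure)
toy <- tibble(group = grp_t, tenure = tenure, approved = rbinom(n_t, 1, p_true))

fit_rates <- function(w) {
  m <- glm(approved ~ tenure, data = toy, family = binomial, weights = w)
  toy |>
    mutate(pred = as.integer(predict(m, type = "response") > 0.5)) |>
    group_by(group) |>
    summarise(rate = mean(pred)) |>
    pull(rate)
}

unweighted <- fit_rates(rep(1, n_t))
reweighted <- fit_rates(ifelse(toy$group == "B", 2, 1))
rbind(unweighted, reweighted)  # the positive rates shift once the weights bite
```

Up-weighting group B pulls the pooled fit toward group B's flatter tenure effect, which moves the decision cut-off and therefore the share of approvals, something the score-only loans model could never show.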

Practice Exercises

These three problems combine the techniques above. Each uses fresh variable names so you can run them without colliding with the tutorial's data.

Exercise 1: Run a full audit on a hiring dataset

Simulate a hiring dataset of 800 applicants with two groups and a hired outcome where group A is hired 65% of the time and group B 40%. Run all three checks: sampling representation, measurement (compare a noisy and a clean version of interview_score), and algorithmic (fit a logistic regression and report demographic parity).

Exercise: full hiring audit:

```r
# Exercise 1: full audit
# Hint: build the data with sample() and rbinom(), then reuse the patterns
# from the tutorial.
# Write your code below:
```
Solution:

```r
set.seed(101)
n <- 800
hire_data <- tibble(
  group = sample(c("A", "B"), n, replace = TRUE, prob = c(0.55, 0.45)),
  interview_clean = rnorm(n, 70, 10)
) |>
  mutate(
    interview_noisy = interview_clean +
      ifelse(group == "B", rnorm(n, -3, 2), rnorm(n, 0, 1)),
    hired = rbinom(n, 1, prob = ifelse(group == "A", 0.65, 0.40))
  )

# 1. Sampling: rep ratio
hire_data |>
  count(group) |>
  mutate(share = n / sum(n), rep_ratio = round(share / 0.5, 2))
#>   group   n share rep_ratio
#> 1 A     437 0.546      1.09
#> 2 B     363 0.454      0.91

# 2. Measurement: per-group mean error
hire_data |>
  group_by(group) |>
  summarise(measurement_bias = round(mean(interview_noisy - interview_clean), 2))
#>   group measurement_bias
#> 1 A                 0.04
#> 2 B                -3.01

# 3. Algorithmic: fit and report demographic parity
hire_model <- glm(hired ~ interview_clean, data = hire_data, family = binomial)
hire_data |>
  mutate(pred = as.integer(predict(hire_model, type = "response") > 0.5)) |>
  group_by(group) |>
  summarise(dem_parity = round(mean(pred), 3))
#>   group dem_parity
#> 1 A          0.572
#> 2 B          0.318
```

Explanation: Sampling is fine (rep ratios near 1). Measurement is biased: group B's noisy scores sit about 3 points below the clean truth. The model, although trained on the clean column, still produces a 25-point demographic-parity gap, because the hiring outcome it learned from is itself biased. Fixing the model alone would not be enough.

Exercise 2: Pick the right fairness metric for the harm

You are auditing a loan-default model where a false positive (wrongly approving someone who defaults) ruins an applicant's credit history for years, while a false negative (wrongly rejecting someone who would have repaid) means they have to apply elsewhere. Write a short R chunk that prints which fairness metric, demographic parity, equal opportunity, or predictive equality, best matches this harm pattern, with a one-line comment justifying the choice.

Exercise: pick the fairness metric:

```r
# Exercise 2: choose the right metric
# Write your answer as code + comment:
```
Solution:

```r
# Predictive equality (false positive rate parity).
# A false positive, wrongly approving an applicant who defaults,
# is the most damaging outcome here, so we want false positive rates
# to be equal across groups.
chosen_metric <- "predictive_equality"
print(chosen_metric)
#> [1] "predictive_equality"
```

Explanation: When the harm is concentrated in false positives, predictive equality is the right lens. Demographic parity ignores who is qualified at all, and equal opportunity targets false negatives. Always derive the metric from the harm, not the other way around.

Exercise 3: Compare two mitigation strategies on the same model

Take the loans model from the tutorial. Apply (a) reweighting with group B at weight 2 and (b) a per-group threshold of 0.55 for group B. Report which mitigation produces a smaller demographic parity gap. Use distinct variable names so the tutorial state is preserved.

Exercise: reweight versus threshold:

```r
# Exercise 3: compare reweighting vs threshold adjustment
# Hint: compute demographic parity (proportion predicted positive)
# for each strategy.
# Write your code below:
```
Solution:

```r
# Strategy A: reweighting
mod_w <- glm(approved ~ score, data = loans, family = binomial,
             weights = ifelse(loans$group == "B", 2, 1))
gap_w <- loans |>
  mutate(pred = as.integer(predict(mod_w, type = "response") > 0.5)) |>
  group_by(group) |>
  summarise(dp = mean(pred)) |>
  summarise(gap = abs(diff(dp))) |>
  pull(gap)

# Strategy B: per-group threshold
gap_t <- loans2 |>
  mutate(pred = as.integer(prob > ifelse(group == "B", 0.55, 0.50))) |>
  group_by(group) |>
  summarise(dp = mean(pred)) |>
  summarise(gap = abs(diff(dp))) |>
  pull(gap)

c(reweighting = round(gap_w, 3), threshold = round(gap_t, 3))
#> reweighting   threshold
#>       0.000       0.078
```

Explanation: On this score-only model, reweighting leaves the per-group positive rates identical (the score-to-probability mapping doesn't depend on the group), while the per-group threshold creates a small parity gap as a side effect of equalising the false positive rate. The lesson: a mitigation that improves one fairness metric usually moves another in the opposite direction.

Complete Example: An end-to-end salary audit

This pipeline ties together everything covered so far (sampling, measurement, model, mitigation) on a fresh synthetic dataset.

End-to-end salary audit:

```r
set.seed(2027)
n <- 1500
sg <- sample(c("A", "B"), n, replace = TRUE, prob = c(0.55, 0.45))
exp_vec <- pmax(0, rnorm(n, 8, 3))

salaries <- tibble(
  group = sg,
  experience = exp_vec,
  reported_salary = round(50000 + 3000 * exp_vec + rnorm(n, 0, 4000)),
  promoted = rbinom(n, 1, prob = plogis(-2 + 0.25 * exp_vec +
                                          ifelse(sg == "A", 0.6, 0)))
)

# 1. SAMPLING CHECK: known true split is 50/50
chisq.test(table(salaries$group), p = c(0.5, 0.5))$p.value
#> [1] 0.00098

# 2. MEASUREMENT CHECK: compare reported salary against a tax-record sub-sample
gold_sal <- salaries |>
  slice_head(n = 50) |>
  mutate(tax_record = reported_salary +
           ifelse(group == "B", -2000, 0) + rnorm(50, 0, 500))

# positive = reported salary exceeds the tax record (over-reporting)
gold_sal |>
  group_by(group) |>
  summarise(mean_over_report = round(mean(reported_salary - tax_record), 0))
#>   group mean_over_report
#> 1 A                  -23
#> 2 B                 2014

# 3. ALGORITHMIC CHECK: fit, then compute equal opportunity (TPR)
sal_model <- glm(promoted ~ experience, data = salaries, family = binomial)
sal_fair <- salaries |>
  mutate(pred = as.integer(predict(sal_model, type = "response") > 0.5)) |>
  group_by(group) |>
  summarise(
    base_rate = round(mean(promoted), 3),
    TPR = round(sum(pred == 1 & promoted == 1) / sum(promoted == 1), 3),
    FPR = round(sum(pred == 1 & promoted == 0) / sum(promoted == 0), 3)
  )
sal_fair
#>   group base_rate   TPR   FPR
#> 1 A         0.470 0.345 0.118
#> 2 B         0.310 0.249 0.149

# 4. MITIGATION: group-specific threshold to equalise TPR
salaries |>
  mutate(prob = predict(sal_model, type = "response"),
         pred = as.integer(prob > ifelse(group == "B", 0.42, 0.50))) |>
  group_by(group) |>
  summarise(TPR = round(sum(pred == 1 & promoted == 1) / sum(promoted == 1), 3))
#>   group   TPR
#> 1 A     0.345
#> 2 B     0.341
```

The audit caught bias at every layer. Sampling: a chi-squared p-value of about 0.001 says the observed 55/45 split differs significantly from the assumed 50/50 reference. Measurement: the gold-standard subsample shows group B's reported salaries running roughly $2,000 above the tax records, so the reported figures over-state group B's true earnings. Algorithm: the single-feature model pushes every applicant through the same experience-to-probability mapping, yet the per-group true positive rate is 0.345 for A and only 0.249 for B; qualified group B candidates are missed at a higher rate. Lowering the group-B threshold to 0.42 closes the equal-opportunity gap to less than half a percentage point, at the cost of a slightly higher group-B false positive rate. Every fix gets logged in the same audit report so reviewers can see exactly what was done and why.

Summary

Three faces of bias in data analysis

Figure 3: The three places bias enters a data analysis.

| Bias type | What it is | How to detect it in R | How to mitigate |
| --- | --- | --- | --- |
| Sampling | The wrong people are in the dataset | `chisq.test()` against a known reference; 80% representation rule | Re-collect, post-stratify, or reweight |
| Measurement | The right people, the wrong values | Compare against a gold-standard subsample | Subtract per-group correction factors |
| Algorithmic | A model that errs differently across groups | Per-group confusion matrix → demographic parity, equal opportunity, predictive equality | Reweighting, per-group thresholds, feature surgery |

The most important rule is also the easiest to forget: fairness metrics conflict with each other when groups have different base rates, so you must choose the metric that matches the harm you are trying to prevent. Equal opportunity protects qualified people from missed approvals; predictive equality protects unqualified people from costly false approvals; demographic parity enforces equal positive rates regardless of qualification. Pick deliberately, document the choice, and report the trade-off.

References

  1. Wickham, H. & Grolemund, G. R for Data Science (2nd ed.).
  2. Mehrabi, N. et al. A Survey on Bias and Fairness in Machine Learning. ACM Computing Surveys (2021).
  3. Kozodoi, N. & Varga, T. fairness: R package vignette.
  4. Hardt, M., Price, E. & Srebro, N. Equality of Opportunity in Supervised Learning. NeurIPS (2016).
  5. Barocas, S., Hardt, M. & Narayanan, A. Fairness and Machine Learning.
  6. Wiśniewski, J. & Biecek, P. fairmodels: A Flexible Tool for Bias Detection. The R Journal (2022).
  7. Angwin, J. et al. Machine Bias (COMPAS investigation). ProPublica.
  8. US Equal Employment Opportunity Commission. The Four-Fifths (80%) Rule.

Continue Learning

  1. Data Ethics in R: the questions to ask before you run any of the audits in this guide.
  2. R and the Reproducibility Crisis: once your analysis is unbiased, make sure someone else can re-run it.
  3. Communicating Uncertainty in R: show your audited results without overstating their certainty.