Reproducibility Crisis: What It Means for R Users & How to Help Fix It
More than half of published research findings may be false or unreproducible. The reproducibility crisis is a systemic failure in how science is done — and R programmers are uniquely positioned to help fix it.
In 2015, the Open Science Collaboration attempted to replicate 100 psychology studies. Only 36% of replications produced significant results, compared to 97% of the originals. Similar replication failures have been found in cancer biology, economics, and social science. This isn't just an academic problem — it wastes billions in research funding and erodes public trust in science.
What Went Wrong
The Garden of Forking Paths
Every analysis involves dozens of small decisions: how to clean the data, which outliers to exclude, which covariates to include, which subgroups to analyze. Each decision creates a "fork" — and researchers (often unconsciously) follow the path that leads to significant results.
# Demonstration: The garden of forking paths
set.seed(42)
n <- 50

# Generate completely random data (no true effect)
df <- data.frame(
  x = rnorm(n),
  y = rnorm(n),
  group = sample(c("A", "B"), n, replace = TRUE),
  age = sample(20:60, n, replace = TRUE),
  outlier_flag = sample(c(FALSE, FALSE, FALSE, FALSE, TRUE), n, replace = TRUE)
)

# Fork 1: Full data vs remove outliers
p1 <- cor.test(df$x, df$y)$p.value
p2 <- cor.test(df$x[!df$outlier_flag], df$y[!df$outlier_flag])$p.value

# Fork 2: All subjects vs subgroup
p3 <- cor.test(df$x[df$group == "A"], df$y[df$group == "A"])$p.value
p4 <- cor.test(df$x[df$age < 40], df$y[df$age < 40])$p.value

# Fork 3: Control for age
p5 <- summary(lm(y ~ x + age, data = df))$coefficients["x", "Pr(>|t|)"]

cat("=== Garden of Forking Paths ===\n")
cat("Same data, different analysis choices:\n\n")
cat(sprintf("Full data correlation:   p = %.4f\n", p1))
cat(sprintf("Remove flagged outliers: p = %.4f\n", p2))
cat(sprintf("Group A only:            p = %.4f\n", p3))
cat(sprintf("Age < 40 only:           p = %.4f\n", p4))
cat(sprintf("Control for age:         p = %.4f\n", p5))
cat(sprintf("\nSmallest p-value: %.4f\n", min(p1, p2, p3, p4, p5)))
cat("A researcher could 'choose' any of these paths.\n")
cat("With enough forks, you'll find p < 0.05 even in random noise.\n")
P-Hacking in Practice
P-hacking encompasses many practices that inflate false-positive rates:
| Practice | Problem | Better Alternative |
|---|---|---|
| Running many tests, reporting significant ones | Multiplicity inflation | Pre-register primary analysis |
| Removing outliers until p < 0.05 | Data-dependent decisions | Pre-specify outlier rules |
| Adding covariates to "improve" the model | Specification searching | Pre-specify the model |
| Stopping data collection when p < 0.05 | Optional stopping | Pre-specify sample size |
| Dichotomizing continuous variables | Loss of information, inflated Type I error | Keep variables continuous |
| HARKing (Hypothesizing After Results are Known) | Post-hoc presented as a priori | Pre-registration |
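The first row of the table is easy to quantify. If each of m independent tests has a 5% false positive rate, the chance of at least one false positive grows quickly with m (a back-of-the-envelope sketch, not tied to any dataset):

```r
# Family-wise error rate for m independent tests at alpha = 0.05:
# P(at least one false positive) = 1 - (1 - alpha)^m
alpha <- 0.05
m <- c(1, 5, 10, 20)
fwer <- 1 - (1 - alpha)^m
for (i in seq_along(m)) {
  cat(sprintf("%2d tests: %.1f%% chance of >= 1 false positive\n",
              m[i], fwer[i] * 100))
}
# With 20 tests, that chance is about 64%
```

This is why "run 20 tests, report the significant ones" is not a 5% error rate in disguise; it is closer to a coin flip.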
# How optional stopping inflates false positives
set.seed(123)

# Simulate: check for significance every 10 observations
# (True effect = 0, no real difference)
n_sims <- 1000
false_positives <- 0

for (sim in 1:n_sims) {
  x <- c()
  y <- c()
  found_sig <- FALSE
  for (batch in 1:10) {
    x <- c(x, rnorm(10))
    y <- c(y, rnorm(10))
    if (length(x) >= 20) {
      p <- t.test(x, y)$p.value
      if (p < 0.05) {
        found_sig <- TRUE
        break
      }
    }
  }
  if (found_sig) false_positives <- false_positives + 1
}

cat("=== Optional Stopping Inflation ===\n")
cat(sprintf("False positive rate with optional stopping: %.1f%%\n",
            false_positives / n_sims * 100))
cat("Expected rate (no peeking): 5.0%\n")
cat("Peeking after every batch pushes the false positive rate well above 5%.\n")
How R Tools Help Fix It
R has an exceptional ecosystem for reproducible research. Here are the key tools:
1. renv: Lock Package Versions
Different package versions can produce different results. renv creates a snapshot of your exact package versions.
# renv workflow (conceptual - not run in WebR)
cat("=== renv Workflow ===\n")
cat("1. renv::init() # Initialize project-local library\n")
cat("2. install.packages() # Install needed packages\n")
cat("3. renv::snapshot() # Lock current versions to renv.lock\n")
cat("4. # Share renv.lock with collaborators\n")
cat("5. renv::restore() # Collaborator installs exact same versions\n\n")
cat("Your renv.lock file captures:\n")
cat("- R version\n")
cat("- Every package name, version, and source\n")
cat("- Repository URLs\n")
cat("- Dependency tree\n")
2. targets: Reproducible Pipelines
The targets package defines your analysis as a directed acyclic graph (DAG). It tracks which steps need re-running when inputs change.
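A pipeline lives in a file named _targets.R at the project root. A minimal conceptual sketch (not run here; the file data.csv and the helper fit_model() are hypothetical placeholders):

```r
# _targets.R -- a minimal targets pipeline sketch (conceptual, not run in WebR)
library(targets)

# Hypothetical model-fitting helper, for illustration only
fit_model <- function(d) lm(y ~ x, data = d)

list(
  # format = "file" makes targets watch the file's contents for changes
  tar_target(raw_file, "data.csv", format = "file"),
  tar_target(raw_data, read.csv(raw_file)),
  tar_target(model, fit_model(raw_data)),
  tar_target(model_summary, summary(model))
)
```

Running targets::tar_make() builds the whole graph; on the next run, only targets whose upstream inputs changed are recomputed, so stale results cannot silently survive a data update.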
3. set.seed(): Make Randomness Reproducible
Any stochastic step (sampling, simulation, bootstrapping, cross-validation) gives different results on every run unless you fix the random seed.
# Without set.seed(): results change every run
cat("=== Importance of set.seed() ===\n")
cat("Without set.seed():\n")
cat(" Run 1:", mean(rnorm(100)), "\n")
cat(" Run 2:", mean(rnorm(100)), "\n")
cat("\nWith set.seed(42):\n")
set.seed(42); cat(" Run 1:", mean(rnorm(100)), "\n")
set.seed(42); cat(" Run 2:", mean(rnorm(100)), "\n")
cat("\nSame seed = same results. Always set.seed() before randomization.\n")
4. Docker: Freeze the Entire Environment
cat("=== Docker for R (conceptual) ===\n")
cat("A Dockerfile captures EVERYTHING:\n")
cat("- Operating system version\n")
cat("- R version\n")
cat("- System libraries\n")
cat("- R packages and versions\n")
cat("- Your code and data\n\n")
cat("Example Dockerfile:\n")
cat(" FROM rocker/r-ver:4.4.0\n")
cat(" RUN install2.r renv\n")
cat(" COPY renv.lock renv.lock\n")
cat(" RUN R -e 'renv::restore()'\n")
cat(" COPY analysis.R analysis.R\n")
cat(" CMD Rscript analysis.R\n")
The Reproducibility Checklist
| Level | What to Do | Tool |
|---|---|---|
| Minimal | Set random seeds, comment code | set.seed(), # comments |
| Basic | Version control, share code | Git + GitHub |
| Good | Lock package versions | renv |
| Better | Automated pipeline | targets or make |
| Best | Container + registered plan | Docker + OSF pre-registration |
Exercises
Exercise 1: Multiple Comparison Correction
You run 20 hypothesis tests. Three are significant at p < 0.05. After correction, how many remain significant?
# 20 tests, 3 "significant" raw p-values
p_raw <- c(0.001, 0.023, 0.041, rep(0.3, 17))
cat("=== Multiple Comparison Correction ===\n")
cat("Raw p-values (significant ones):\n")
cat(sprintf(" Test 1: p = %.3f\n", p_raw[1]))
cat(sprintf(" Test 2: p = %.3f\n", p_raw[2]))
cat(sprintf(" Test 3: p = %.3f\n", p_raw[3]))
# Bonferroni (most conservative)
p_bonf <- p.adjust(p_raw, method = "bonferroni")
cat(sprintf("\nBonferroni corrected: %d significant\n", sum(p_bonf < 0.05)))
# Benjamini-Hochberg (FDR, less conservative)
p_bh <- p.adjust(p_raw, method = "BH")
cat(sprintf("BH (FDR) corrected: %d significant\n", sum(p_bh < 0.05)))
cat("\nAlways correct for multiple comparisons!\n")
Exercise 2: Reproducibility Audit
Check if this analysis is reproducible. What's missing?
cat("=== Audit This Analysis ===\n")
cat("result <- lm(y ~ x + z, data = read.csv('data.csv'))\n")
cat("summary(result)\n\n")
cat("Missing for reproducibility:\n")
cat("1. No set.seed() (if any randomization upstream)\n")
cat("2. No package versions recorded (no renv.lock)\n")
cat("3. No data provenance (where did data.csv come from?)\n")
cat("4. No session info (R version, OS)\n")
cat("5. No documentation of data cleaning steps\n")
cat("6. No version control (is this in Git?)\n")
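Point 4 on that list is the cheapest to fix: one call at the end of every script records the environment the results were produced in.

```r
# Record the computational environment at the end of an analysis script
si <- sessionInfo()
cat("R version:", si$R.version$version.string, "\n")
cat("Platform: ", si$platform, "\n")

# print(si) additionally lists every attached package and its version;
# paste that output at the bottom of your report or log file
print(si)
```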
Summary
| Problem | Root Cause | Solution |
|---|---|---|
| P-hacking | Flexible analysis + significance pressure | Pre-registration |
| Non-reproducible code | Missing dependencies/versions | renv + Docker |
| Garden of forking paths | Undocumented analysis decisions | targets pipeline |
| Optional stopping | Peeking at results during collection | Pre-specify sample size |
| HARKing | Post-hoc hypotheses | Registered reports |
| Software rot | Packages change over time | renv::snapshot() |
FAQ
Does reproducibility mean the same results every time? Computational reproducibility means running the same code on the same data produces identical output. Replicability means a new study with new data reaches similar conclusions. Both matter, but reproducibility is the minimum bar.
Isn't pre-registration too rigid? What if I find something unexpected? Pre-registration doesn't prevent exploratory analysis — it just requires you to label it as exploratory. You can (and should) explore your data. The key is being transparent about what was planned versus what was discovered.
Do I need Docker for every analysis? No. For most analyses, renv + Git is sufficient. Use Docker when you need to guarantee reproducibility across different operating systems, or for high-stakes analyses (clinical trials, regulatory submissions, published papers).