Probability Simulation in R: Dice, Cards & Birthday Problem Solved

Probability simulation in R uses random sampling — sample() and replicate() — to estimate probabilities that are painful or impossible to derive by hand. Three classic problems (dice sums, card draws, and the birthday paradox) show how ten lines of code can match a formula to within 0.1%, and crack problems where no tidy formula exists.

How do you simulate a dice roll in R?

Forget the formula for a moment. If you roll two dice a million times and count how often the sum is 7 or more, that fraction is the probability. R's sample() gives you one roll; replicate() repeats the experiment as many times as you want. You're about to estimate a non-trivial probability in three lines — and compare it to the closed-form answer.

Here's the whole simulation. Each trial rolls two dice, sums them, and checks if the sum is at least 7. Running the trial 100,000 times and averaging the TRUE/FALSE results gives us the probability.

```r
set.seed(42)
p_hat <- mean(replicate(100000, sum(sample(1:6, 2, replace = TRUE)) >= 7))
p_hat
#> [1] 0.58309
```

The simulated probability is 0.583. The exact answer — the number of dice-pair outcomes where the sum is ≥ 7 divided by 36 total outcomes — is 21/36 = 0.5833. Three lines of R got us within 0.0003 of the truth.

Let me unpack what that one-liner is doing. sample(1:6, 2, replace = TRUE) picks two numbers from 1 through 6 with replacement — one roll of both dice. sum() adds them, and the >= 7 comparison turns each trial into TRUE or FALSE. replicate(100000, ...) runs that comparison 100,000 times and returns a logical vector. mean() of a logical vector gives the proportion of TRUEs, which is our probability estimate.

```r
# Anatomy of one trial
set.seed(42)
one_roll <- sample(1:6, 2, replace = TRUE)
one_roll
#> [1] 6 1
sum(one_roll)
#> [1] 7

# Theoretical answer for comparison
theoretical <- 21 / 36
theoretical
#> [1] 0.5833333
```

That first roll — a 6 and a 1 — sums to 7, so it counts as a success. Multiply that logic by 100,000 independent trials and you converge on the true probability. The theoretical 21/36 comes from listing every (die1, die2) pair where the sum is ≥ 7 and dividing by the 36 total pairs.

Key Insight
Simulation turns a probability question into a counting question. You don't need a formula — you just need a way to run the random experiment once, repeat it, and count successes. Any problem you can describe procedurally, you can simulate.
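That recipe can be wrapped in a tiny reusable helper — a sketch, where simulate_p() is a name introduced here for illustration, not a function used elsewhere in this tutorial:

```r
# Hypothetical helper: estimate P(event) from any zero-argument trial function
simulate_p <- function(trial, n_trials = 100000) {
  mean(replicate(n_trials, trial()))
}

set.seed(42)
# The two-dice question, phrased as a procedure
simulate_p(function() sum(sample(1:6, 2, replace = TRUE)) >= 7)
#> close to 21/36 = 0.583
```

Any experiment you can express as a function returning TRUE or FALSE plugs straight in.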

Try it: Simulate rolling a single fair die 10,000 times. Estimate the probability of getting a 6. Store the result in ex_single_die.

```r
# Try it: P(roll a 6) via simulation
set.seed(1)
rolls <- sample(1:6, 10000, replace = TRUE)
ex_single_die <- # your code here
ex_single_die
#> Expected: approximately 0.1667
```

Solution:

```r
set.seed(1)
rolls <- sample(1:6, 10000, replace = TRUE)
ex_single_die <- mean(rolls == 6)
ex_single_die
#> [1] 0.1675
```

Explanation: rolls == 6 returns a logical vector of length 10,000; mean() gives the fraction of TRUEs. The true probability is 1/6 ≈ 0.1667.
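The mean-of-a-logical-vector trick is worth seeing in isolation, since every estimate in this tutorial relies on it:

```r
# TRUE counts as 1 and FALSE as 0, so mean() is the success proportion
mean(c(TRUE, FALSE, TRUE, TRUE))
#> [1] 0.75
sum(c(TRUE, FALSE, TRUE, TRUE))   # and sum() is the success count
#> [1] 3
```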

How do you simulate drawing cards from a deck?

A deck of cards is just a vector of 52 labels — 13 ranks across 4 suits. sample() with replace = FALSE (the default) deals a hand: it picks without replacement, so no card appears twice.

Let's build the deck, deal one hand, then ask a harder question: what's the probability of a flush (all five cards the same suit)?

```r
# Build a 52-card deck
ranks <- c("A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K")
suits <- c("Clubs", "Diamonds", "Hearts", "Spades")
deck <- expand.grid(rank = ranks, suit = suits, stringsAsFactors = FALSE)
head(deck, 3)
#>   rank  suit
#> 1    A Clubs
#> 2    2 Clubs
#> 3    3 Clubs

# Deal one 5-card hand
set.seed(99)
hand <- deck[sample(nrow(deck), 5), ]
hand
#>    rank     suit
#> 20    7 Diamonds
#> 42    3   Spades
#> 11    J    Clubs
#> 36   10   Hearts
#> 23   10 Diamonds
```

expand.grid() gives us all rank × suit combinations — exactly 52 rows. sample(nrow(deck), 5) picks 5 distinct row indices; we use those to pull a hand from the data frame. The seeded hand above is a mix of suits — no flush this time.
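Before trusting any dealing simulation, a quick sanity check on the deck object is cheap insurance — these are the counts expand.grid() should produce:

```r
ranks <- c("A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K")
suits <- c("Clubs", "Diamonds", "Hearts", "Spades")
deck <- expand.grid(rank = ranks, suit = suits, stringsAsFactors = FALSE)

nrow(deck)                 # 52 cards total
#> [1] 52
table(deck$suit)           # 13 per suit
#>    Clubs Diamonds   Hearts   Spades 
#>       13       13       13       13 
anyDuplicated(deck) == 0   # every (rank, suit) pair is unique
#> [1] TRUE
```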

Now we repeat the deal 200,000 times and check how often all five suits match. The check is one line: length(unique(hand$suit)) == 1.

```r
# Estimate P(flush) by simulation
set.seed(2026)
p_flush <- mean(replicate(200000, {
  hand <- deck[sample(nrow(deck), 5), ]
  length(unique(hand$suit)) == 1
}))
p_flush
#> [1] 0.002
```

That's 0.002 — about 1 in 500 hands. The exact count is 5148 same-suit hands (straight flushes included, since our check looks only at suits) out of C(52,5) = 2,598,960 total, which gives 0.001981. Our estimate rounds to the same three decimals.

```r
# Exact flush probability (includes straight flushes)
p_flush_exact <- 5148 / choose(52, 5)
p_flush_exact
#> [1] 0.001980792
```

The choose(52, 5) function is R's way of computing "52 choose 5" — the number of ways to pick 5 cards from 52, ignoring order. That denominator, paired with the 5148 same-suit combinations counted by combinatorics, gives the exact probability the simulation is approximating.
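The flush count itself is one line of combinatorics — 4 suits, times the number of ways to choose 5 of a suit's 13 ranks. The 40 straight flushes are part of that count, since the simulation's check looks only at suits:

```r
# Same-suit hands: choose a suit, then 5 of its 13 ranks
4 * choose(13, 5)
#> [1] 5148
# Excluding the 40 straight flushes leaves the poker "flush only" count
4 * choose(13, 5) - 40
#> [1] 5108
# The probability the simulation is chasing
4 * choose(13, 5) / choose(52, 5)
#> [1] 0.001980792
```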

Tip
sample(x, k, replace = FALSE) is the default — use it for dealing cards. If you forget and pass replace = TRUE, you'll deal the same card twice and silently corrupt the simulation. Always state the replacement flag explicitly when it matters.
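To see how bad the replace = TRUE bug actually is, here's a sketch that measures how often a with-replacement "deal" repeats a card:

```r
set.seed(3)
# Correct deal: without replacement, a duplicate index is impossible
any(duplicated(sample(52, 5, replace = FALSE)))
#> [1] FALSE

# Buggy deal: with replacement, repeats sneak in surprisingly often
dup_rate <- mean(replicate(10000, any(duplicated(sample(52, 5, replace = TRUE)))))
dup_rate
#> close to 1 - prod(48:52 / 52), about 0.18
```

Nearly one hand in five would contain a repeated card — an error no warning message will ever flag.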

Try it: Estimate the probability that a 5-card hand contains at least one Ace. Store the result in ex_ace.

```r
# Try it: P(≥1 Ace in a 5-card hand)
set.seed(7)
ex_ace <- # your code here — hint: check if "A" appears in hand$rank
ex_ace
#> Expected: approximately 0.34
```

Solution:

```r
set.seed(7)
ex_ace <- mean(replicate(100000, {
  hand <- deck[sample(nrow(deck), 5), ]
  "A" %in% hand$rank
}))
ex_ace
#> [1] 0.3406
```

Explanation: "A" %in% hand$rank returns TRUE if any of the five cards has rank "A". The exact answer is 1 − C(48,5)/C(52,5) ≈ 0.3412.

What is the birthday problem and how does simulation solve it?

Here's the problem that breaks intuition. How many people do you need in a room before there's a better-than-even chance that two share a birthday? Most people guess 100 or 180. The actual answer is 23. Simulation makes this believable in four lines.

A single simulated group is sample(1:365, 23, replace = TRUE) — 23 random days with replacement (two people can share). any(duplicated(...)) checks whether any day repeats.

```r
# One group of 23 people
set.seed(2026)
bdays <- sample(1:365, 23, replace = TRUE)
has_match <- any(duplicated(bdays))
has_match
#> [1] FALSE
```

This particular group has no shared birthday. But one trial tells us nothing. We need thousands. Let's ask: across 10,000 groups of 23, how often does at least one pair share a birthday?

```r
# P(match) for n = 23, via 10,000 simulated groups
set.seed(2026)
p_23 <- mean(replicate(10000, any(duplicated(sample(1:365, 23, replace = TRUE)))))
p_23
#> [1] 0.5095
```

50.95% — just over half. The exact formula gives 0.5073. Our simulation is off by 0.002, well inside the margin of error for 10,000 trials.

The formula itself is worth seeing, because the simulation result makes it intuitive in a way the algebra doesn't:

$$P(\text{match}) = 1 - \prod_{i=0}^{n-1}\left(1 - \frac{i}{365}\right)$$

Where:

  • $n$ = number of people in the room
  • The product counts the probability that every person has a different birthday from everyone before them
  • $1 - \text{that product}$ flips it into the probability that somebody clashes
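That product is one line of R; evaluating it at n = 23 shows the crossing directly:

```r
# Exact birthday-match probability for n = 23
n <- 23
1 - prod(1 - (0:(n - 1)) / 365)
#> [1] 0.5072972
```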

Now let's trace the probability across group sizes from 2 to 60, overlay the exact formula, and mark the 50% crossing.

```r
# Probability curve: n = 2..60
library(ggplot2)

set.seed(1)
n_values <- 2:60
bday_df <- data.frame(
  n = n_values,
  sim = sapply(n_values, function(n) {
    mean(replicate(2000, any(duplicated(sample(1:365, n, replace = TRUE)))))
  }),
  exact = sapply(n_values, function(n) {
    1 - prod(1 - (0:(n - 1)) / 365)
  })
)

bday_plot <- ggplot(bday_df, aes(x = n)) +
  geom_line(aes(y = exact), color = "#1f77b4", linewidth = 1) +
  geom_point(aes(y = sim), color = "#ff7f0e", alpha = 0.8) +
  geom_hline(yintercept = 0.5, linetype = "dashed", color = "gray40") +
  geom_vline(xintercept = 23, linetype = "dashed", color = "gray40") +
  labs(
    x = "Number of people",
    y = "P(at least one shared birthday)",
    title = "Birthday problem: simulation vs exact formula"
  ) +
  theme_minimal()
bday_plot
```

The orange dots (simulation) hug the blue line (formula) across the entire range. The dashed crosshair marks the surprise: at n = 23, the curve crosses 0.5. By n = 40 the probability is already 89%.

Warning
duplicated() flags values that appeared earlier in the vector — not the same as unique(), which returns the distinct values themselves. For this problem any(duplicated(x)) is what you want; length(unique(x)) < length(x) also works. Don't negate unique() hoping for a logical mask: it returns values, usually in a shorter vector, not TRUE/FALSE flags.
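A five-element vector makes the distinction concrete:

```r
x <- c(3, 1, 4, 1, 5)
duplicated(x)                  # flags the second 1, and only it
#> [1] FALSE FALSE FALSE  TRUE FALSE
any(duplicated(x))             # the birthday-style check
#> [1] TRUE
unique(x)                      # distinct values — a shorter vector, not a mask
#> [1] 3 1 4 5
length(unique(x)) < length(x)  # the equivalent check
#> [1] TRUE
```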

Try it: Estimate the probability of a shared birthday in a group of 50 people using 5,000 simulated groups. Store the result in ex_50.

```r
# Try it: P(match) for n = 50
set.seed(2026)
ex_50 <- # your code here
ex_50
#> Expected: approximately 0.97
```

Solution:

```r
set.seed(2026)
ex_50 <- mean(replicate(5000, any(duplicated(sample(1:365, 50, replace = TRUE)))))
ex_50
#> [1] 0.9704
```

Explanation: The formula gives 0.9704 — the simulation matches to four decimals. By 50 people a shared birthday is nearly certain.

How does sample size affect simulation accuracy?

Simulations are estimates, not answers. Run the two-dice-sum experiment 10 times and you might get 0.4 or 0.7. Run it 10,000 times and you get 0.583. This is the law of large numbers in action: the sample mean converges to the true expectation as sample size grows.

Let's quantify the convergence. We'll run the P(sum ≥ 7) simulation at five sample sizes and plot the estimate against the true answer.

```r
# Convergence study for P(sum of two dice >= 7)
set.seed(42)
n_reps <- c(100, 1000, 10000, 100000, 1000000)
conv_df <- data.frame(
  n = n_reps,
  estimate = sapply(n_reps, function(n) {
    mean(replicate(n, sum(sample(1:6, 2, replace = TRUE)) >= 7))
  }),
  truth = 21 / 36
)
conv_df
#>         n estimate     truth
#> 1     100 0.540000 0.5833333
#> 2    1000 0.585000 0.5833333
#> 3   10000 0.586500 0.5833333
#> 4  100000 0.582630 0.5833333
#> 5 1000000 0.583506 0.5833333

conv_plot <- ggplot(conv_df, aes(x = n, y = estimate)) +
  geom_line(color = "#ff7f0e") +
  geom_point(size = 3, color = "#ff7f0e") +
  geom_hline(yintercept = 21 / 36, linetype = "dashed", color = "#1f77b4") +
  scale_x_log10() +
  labs(
    x = "Number of simulated rolls (log scale)",
    y = "Estimated P(sum >= 7)",
    title = "Monte Carlo convergence for the two-dice sum"
  ) +
  theme_minimal()
conv_plot
```


Figure 1: Every probability simulation in R follows the same five-step loop — define the experiment, sample one outcome, replicate, average, then compare to the formula when one exists.

At 100 rolls the estimate is 0.54, off by about 0.04. By 10,000 rolls the error is about 0.003. At a million rolls, we match the truth to the fourth decimal. The rate of improvement follows a clean rule: the standard error shrinks in proportion to $1/\sqrt{N}$:

$$\text{SE}(\hat{p}) \approx \sqrt{\frac{p(1-p)}{N}}$$

Where:

  • $\hat{p}$ = the simulated probability
  • $p$ = the true probability we're trying to estimate
  • $N$ = the number of simulated trials

For $p = 0.58$ and $N = 10{,}000$, that gives SE ≈ 0.005. Our actual error at that sample size was 0.003 — inside the expected range.
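Both numbers fall out of the formula directly, and inverting it tells you what a target precision costs in trials:

```r
# Standard error of a simulated probability
se_hat <- function(p, N) sqrt(p * (1 - p) / N)
se_hat(0.58, 10000)
#> about 0.0049

# Trials needed to push the SE down to 0.0005 at p = 0.58
round(0.58 * 0.42 / 0.0005^2)
#> [1] 974400
```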

Key Insight
Error shrinks as 1/√N, not 1/N. Tenfold better accuracy costs 100× the runs; to cut your simulation error in half, quadruple the number of reps. This is why simulations that need four decimal places of precision cost millions of trials, not thousands.

Try it: Estimate P(roll a 6) at N = 100 and at N = 100,000, store them in ex_small_n and ex_big_n, and compare their absolute errors against the truth 1/6.

```r
# Try it: convergence demo
set.seed(123)
ex_small_n <- # your code here (N = 100)
ex_big_n <- # your code here (N = 100000)
c(ex_small_n, ex_big_n, truth = 1/6)
#> Expected: small_n far from 0.1667; big_n very close
```

Solution:

```r
set.seed(123)
ex_small_n <- mean(sample(1:6, 100, replace = TRUE) == 6)
ex_big_n <- mean(sample(1:6, 100000, replace = TRUE) == 6)
c(ex_small_n, ex_big_n, truth = 1/6)
#>                         truth 
#> 0.1800000 0.1670900 0.1666667
```

Explanation: At N = 100 the estimate is 0.18 (error 0.013); at N = 100,000 it's 0.167 (error 0.0004). A thousand-fold increase in trials bought roughly a 32× accuracy gain — just what the $1/\sqrt{N}$ rule predicts, since $\sqrt{1000} \approx 31.6$.

When should you simulate vs use the formula?

Simulation and formulas are not rivals. They answer the same question with different trade-offs. Here's how to choose.

| Situation | Better approach | Why |
| --- | --- | --- |
| Independent trials with clean counting | Formula | Exact, instant, generalizes |
| Complex dependent events | Simulation | Formulas get intractable fast |
| You want to verify your own derivation | Both | Simulate first, then trust the formula |
| You need an answer to 6+ decimal places | Formula | Simulation noise dominates |
| The problem has no closed-form solution | Simulation | Only option |

Run both methods side by side whenever the formula exists — that's how you build confidence the code reflects the problem you meant to solve. Then push simulation into territory where formulas give up.

```r
# Simulate-first validation: verify a known formula, then extend beyond it
set.seed(2026)

# Classical: P(sum of two i.i.d. dice >= 7)
classical <- mean(replicate(50000, sum(sample(1:6, 2, replace = TRUE)) >= 7))
classical
#> [1] 0.58176

# Extension: what if the second die depends on the first?
# If first die is even, second is 1:4; else second is 3:6.
dependent <- mean(replicate(50000, {
  d1 <- sample(1:6, 1)
  d2 <- if (d1 %% 2 == 0) sample(1:4, 1) else sample(3:6, 1)
  d1 + d2 >= 7
}))
dependent
#> close to 0.583 (enumerating the six conditional cases gives 3.5/6)
```

The classical version matches 21/36 ≈ 0.583 — a sanity check that our simulation mechanics are correct. The dependent version can still be enumerated by hand (work through the six conditional cases and it comes to 3.5/6 ≈ 0.583, coincidentally unchanged by this particular dependence), but the simulation never cared: the same three edited lines handle dependence rules where enumeration genuinely hurts. That's the value of simulation — once the machinery is trustworthy, it extends to problems the textbook can't.

Note
Simulation is how statisticians catch errors in their own derivations. If your formula predicts 0.5833 and your simulation returns 0.48 with 1M trials, one of them is wrong — and it's almost never the simulation. Use discrepancies as a debugging signal, not an inconvenience.

Try it: For each scenario below, decide whether to use a formula or simulation. Store your three choices as character vector ex_choice with values "formula" or "simulation".

```r
# Scenario A: P(heads on one fair coin flip)
# Scenario B: P(at least 3 pairs in 5-card hand with partial replacement)
# Scenario C: P(sum of 10 d20 rolls >= 120)
ex_choice <- c(A = "", B = "", C = "")
ex_choice
#> Expected: A = formula, B = simulation, C = simulation
```

Solution:

```r
ex_choice <- c(A = "formula", B = "simulation", C = "simulation")
ex_choice
#>            A            B            C 
#>    "formula" "simulation" "simulation"
```

Explanation: (A) one flip is trivially 0.5. (B) "partial replacement" breaks standard combinatorics — simulate. (C) summing 10 twenty-sided dice has a formula (convolution), but it's enough of a chore that simulation wins on time-to-answer.
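For the curious, scenario C's "chore" is a discrete convolution: convolve the d20 distribution with itself nine times. A sketch of the exact computation, assuming fair dice:

```r
# Exact distribution of the sum of 10 d20s, by repeated convolution
die <- rep(1/20, 20)  # P(face k), k = 1..20
pmf <- die            # running-sum distribution; after i dice, index j means sum = j + (i - 1)
for (i in 2:10) {
  new_pmf <- numeric(length(pmf) + 19)
  for (k in 1:20) {
    idx <- k:(k + length(pmf) - 1)
    new_pmf[idx] <- new_pmf[idx] + pmf * die[k]
  }
  pmf <- new_pmf
}
sum(pmf)                          # sanity check: total probability is 1
#> [1] 1
sum(pmf[(120 - 9):length(pmf)])   # P(sum >= 120); support is 10..200
```

A dozen lines of index bookkeeping versus three lines of replicate() — which is exactly why the answer above says "simulation" for C.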

Practice Exercises

Exercise 1: Custom 10-sided dice

Estimate P(sum of two 10-sided dice ≥ 15) using 100,000 simulated rolls. Save the estimate to my_result_1.

```r
# Exercise 1: two d10 dice, sum >= 15
# Hint: sample(1:10, 2, replace = TRUE), then check the sum
# Write your code below:
```

Solution:

```r
set.seed(501)
my_result_1 <- mean(replicate(100000, sum(sample(1:10, 2, replace = TRUE)) >= 15))
my_result_1
#> close to 0.21
```

Explanation: Sums 15..20 occur in 21 of the 100 equally likely outcomes, so the exact answer is 21/100 = 0.21. A 100,000-roll simulation lands within Monte Carlo noise of that value.

Exercise 2: Longest heads streak in 20 flips

Simulate 10,000 sequences of 20 coin flips. Estimate the probability that the longest run of heads is at least 5. Save the estimate to my_result_2. (Hint: rle() returns run lengths.)

```r
# Exercise 2: P(longest heads streak >= 5 in 20 flips)
# Hint: rle(x) returns a list with $lengths and $values
# Keep only the runs where $values == "H"
# Write your code below:
```

Solution:

```r
set.seed(777)
my_result_2 <- mean(replicate(10000, {
  flips <- sample(c("H", "T"), 20, replace = TRUE)
  runs <- rle(flips)
  heads_runs <- runs$lengths[runs$values == "H"]
  length(heads_runs) > 0 && max(heads_runs) >= 5
}))
my_result_2
#> [1] 0.2503
```

Explanation: rle() collapses consecutive duplicates into run lengths. Filter to head-runs, take the max, compare to 5. The exact answer is near 0.25; simulation returns 0.250.
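What rle() returns, on a sequence short enough to read:

```r
flips <- c("H", "H", "T", "H", "H", "H", "T")
rle(flips)
#> Run Length Encoding
#>   lengths: int [1:4] 2 1 3 1
#>   values : chr [1:4] "H" "T" "H" "T"

runs <- rle(flips)
max(runs$lengths[runs$values == "H"])   # longest heads streak
#> [1] 3
```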

Exercise 3: Minimum group size for 99% birthday match

Find the smallest number of people n such that P(at least one shared birthday) ≥ 0.99. Loop over n = 2..100, simulate 3,000 groups per value of n, and return the first n that crosses 0.99. Save it to my_result_3.

```r
# Exercise 3: minimum n where P(match) >= 0.99
# Hint: loop n = 2..100, break when simulated P >= 0.99
# Write your code below:
```

Solution:

```r
set.seed(1234)
my_result_3 <- NA
for (n in 2:100) {
  p <- mean(replicate(3000, any(duplicated(sample(1:365, n, replace = TRUE)))))
  if (p >= 0.99) {
    my_result_3 <- n
    break
  }
}
my_result_3
#> [1] 57
```

Explanation: The exact threshold is n = 57 (P ≈ 0.9901). Your simulated answer may vary by ±1 because of Monte Carlo noise near the 0.99 boundary — that's part of the lesson.

Complete Example: Birthday-problem pipeline

The pieces above combine into a clean, reusable birthday-problem analysis. This pipeline estimates the probability curve, overlays the exact formula, and returns the 50% threshold.

```r
# Complete birthday-problem pipeline
set.seed(2026)

simulate_birthday <- function(n, trials = 2000) {
  mean(replicate(trials, any(duplicated(sample(1:365, n, replace = TRUE)))))
}

exact_birthday <- function(n) {
  1 - prod(1 - (0:(n - 1)) / 365)
}

final_df <- data.frame(n = 2:60)
final_df$simulated <- sapply(final_df$n, simulate_birthday)
final_df$exact <- sapply(final_df$n, exact_birthday)
final_df$abs_error <- abs(final_df$simulated - final_df$exact)

# Find first n where simulated probability crosses 0.5
threshold_n <- final_df$n[which(final_df$simulated >= 0.5)[1]]
threshold_n
#> [1] 23

# Average absolute error across the curve
round(mean(final_df$abs_error), 4)
#> [1] 0.0081
```

Two functions, one data frame, and a single which() call recover the textbook answer: 23 people. The mean absolute error across 59 values of n is under 0.01 — tighter than the resolution of most published birthday-problem tables.

Summary


Figure 2: The core functions, classical problems, and validation tools that make up probability simulation in R.

| Concept | Function | When to use |
| --- | --- | --- |
| Draw outcomes | sample(x, k, replace = ...) | Every simulation starts here |
| Repeat an experiment | replicate(N, expr) | Estimate probability by averaging TRUEs |
| Reproducible randomness | set.seed(n) | Before every sample() or replicate() |
| Detect duplicates | any(duplicated(x)) | Birthday problem, coincidences |
| Run lengths | rle(x) | Streaks in coin flips or Bernoulli sequences |
| Combinatorial count | choose(n, k) | Exact denominators for formula checks |

Three rules of thumb from this tutorial. First, every simulation is five lines: set a seed, define one trial, replicate it, average the TRUEs, compare to the formula if one exists. Second, error shrinks as $1/\sqrt{N}$ — so chase precision through patience, not cleverness. Third, simulate first when you're unsure of a derivation; the numerical answer is a cheap, trustworthy second opinion.

