Hypothesis Testing Exercises in R: 15 Practice Problems with Solutions

These 15 hypothesis testing exercises in R take you from a single-group t-test to ANOVA and non-parametric alternatives, each with a scaffold, a hint, and a collapsible reveal so you can check your answer right after writing it. Every problem uses built-in R datasets, so you can run the code without loading a single file.

How does every hypothesis test in R follow the same pattern?

Every hypothesis test in R follows the same four steps: state the null hypothesis, pick the test that matches your data, run it with one line of code, and read the p-value against your significance level. The fastest way to lock the pattern in is to see it once end-to-end on a familiar dataset, then solve the same problem yourself. The payoff block below does exactly that on mtcars.

R: One-sample t-test payoff on mtcars mpg
# H0: mean mpg equals 20. H1: mean mpg is not 20.
mpg_test <- t.test(mtcars$mpg, mu = 20)
mpg_test
#>
#>  One Sample t-test
#>
#> data: mtcars$mpg
#> t = 0.08506, df = 31, p-value = 0.9328
#> alternative hypothesis: true mean is not equal to 20
#> 95 percent confidence interval:
#>  17.91768 22.26357
#> sample estimates:
#> mean of x
#>  20.09062

The p-value is 0.93, far above the usual 0.05 cutoff, so the data gives no reason to reject the claim that the average mpg equals 20. The 95% confidence interval [17.92, 22.26] contains 20, which is the same conclusion in a different format. Every single-group test you will meet in this post reads exactly like this.

You do not have to stare at the whole printout, just pull out the two numbers that matter.

R: Extract p-value and confidence interval
# The test object is a list; you can access any piece by name.
mpg_test$p.value
#> [1] 0.9327644
mpg_test$conf.int
#> [1] 17.91768 22.26357
#> attr(,"conf.level")
#> [1] 0.95

Knowing how to pull out p.value and conf.int is what makes t.test() composable. You can drop those numbers into a report, a dplyr summary, or a larger simulation without ever copy-pasting from the console.

Key Insight
Every *.test() function in R returns a list. The list always carries p.value, statistic, and usually conf.int and estimate. Once you learn the shape, every test feels like the same tool with a different name.
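
The shared shape is easy to see on the payoff test from above. A minimal sketch (the column names in the one-row summary are just illustrative):

```r
# Every *.test() result is an "htest" object: a list with the same named slots.
mpg_test <- t.test(mtcars$mpg, mu = 20)
class(mpg_test)
#> [1] "htest"
names(mpg_test)  # includes "statistic", "p.value", "conf.int", "estimate"

# Collect the numbers that matter into a one-row summary.
data.frame(
  t_stat  = unname(mpg_test$statistic),
  p_value = mpg_test$p.value,
  ci_low  = mpg_test$conf.int[1],
  ci_high = mpg_test$conf.int[2]
)
```

The same pattern works unchanged on prop.test(), chisq.test(), or wilcox.test() results, minus whatever slots a given test does not fill in.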

Try it: Run a one-sample t-test on iris$Sepal.Length against the null value mu = 5.8. Save the result to ex1_test and print its p-value only. Decide at alpha = 0.05 whether you reject the null.

R: Your turn: one-sample t-test on iris Sepal.Length
# Try it: one-sample t-test on iris Sepal.Length vs 5.8
ex1_test <- t.test(___, mu = ___)
ex1_test$p.value
#> Expected: large, around 0.52; decide at alpha = 0.05.

Click to reveal solution
R: Exercise 1 solution
ex1_test <- t.test(iris$Sepal.Length, mu = 5.8)
ex1_test$p.value
#> [1] 0.5225

Explanation: The sample mean 5.843 is close to 5.8, so the t-statistic is small (about 0.64) and the p-value of roughly 0.52 sits far above 0.05. We do not reject H0; the population mean could plausibly be 5.8.

How do you run a one-sample t-test in R?

A one-sample t-test asks whether a sample's mean matches a fixed target. The target is the null value you pass as mu. If you have a directional hunch, say the mean is greater than the target, swap the default two-sided test for alternative = "greater", which tests H1: true mean > target. One-sided tests deliver smaller p-values when the direction matches, but they are only honest if you committed to the direction before looking at the data.

R: One-sided t-test on airquality Temp
# H0: mean Temp <= 76. H1: mean Temp > 76.
temp_test <- t.test(airquality$Temp, mu = 76, alternative = "greater")
temp_test
#>
#>  One Sample t-test
#>
#> data: airquality$Temp
#> t = 1.8686, df = 152, p-value = 0.03178
#> alternative hypothesis: true mean is greater than 76
#> 95 percent confidence interval:
#>  76.17199 Inf
#> sample estimates:
#> mean of x
#>  77.88235

The one-sided p-value 0.032 falls below 0.05, so the data supports the claim that the long-run mean temperature exceeds 76°F. Notice the confidence interval has Inf as its upper limit; that is the trade-off for a one-sided test.

Tip
State the hypothesis before running the test. Deciding "greater" after seeing the sample mean is p-hacking. Pick the direction from subject knowledge, then run the one-line test.
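
One consequence worth seeing once: when the sample mean lies on the side you hypothesized, the one-sided p-value is exactly half the two-sided one. A quick check on the Temp example:

```r
# Same data, same null value; only the alternative changes.
two_sided <- t.test(airquality$Temp, mu = 76)
one_sided <- t.test(airquality$Temp, mu = 76, alternative = "greater")
two_sided$p.value / 2  # halving the two-sided p ...
one_sided$p.value      # ... reproduces the one-sided p exactly
```

That built-in halving is exactly the discount you are not allowed to take after peeking at the data.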

Try it: Run a one-sided t-test on ChickWeight$weight with mu = 120 and alternative = "greater". Store the result in ex2_test and print the p-value.

R: Your turn: one-sided t-test on ChickWeight
# Try it: is mean ChickWeight weight greater than 120?
ex2_test <- t.test(___, mu = ___, alternative = "___")
ex2_test$p.value
#> Expected: around 0.27; above 0.05.

Click to reveal solution
R: Exercise 2 solution
ex2_test <- t.test(ChickWeight$weight, mu = 120, alternative = "greater")
ex2_test$p.value
#> [1] 0.2694

Explanation: The sample mean 121.8 exceeds 120, but chick weights vary enormously (sd around 71), so even with 578 observations the one-sided p-value lands near 0.27 and the data cannot confirm a mean above 120.

How do you compare means across two independent groups?

When you want to compare the means of two unrelated groups, the go-to test is the two-sample t-test. R's t.test() accepts a formula outcome ~ group plus the data frame, which is cleaner than passing two vectors. Before reading the p-value, decide whether the two groups share a common variance, because that choice changes which version of the t-test you run.

R: Variance check before t-test
# var.test compares variances. H0: variances are equal.
am_var <- var.test(mpg ~ am, data = mtcars)
am_var$p.value
#> [1] 0.0669

The p-value 0.067 is above 0.05, so there is no strong evidence the two groups have different variances. You could assume equal variances and run a pooled t-test, but Welch's t-test (the R default) is safer when you are unsure.

R: Welch two-sample t-test on mtcars by am
# H0: mean mpg is equal across transmission types.
am_test <- t.test(mpg ~ am, data = mtcars)
am_test
#>
#>  Welch Two Sample t-test
#>
#> data: mpg by am
#> t = -3.7671, df = 18.332, p-value = 0.001374
#> alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
#> 95 percent confidence interval:
#>  -11.280194  -3.209684
#> sample estimates:
#> mean in group 0 mean in group 1
#>        17.14737        24.39231

The p-value 0.0014 is small, and the 95% CI for the mean difference does not cross zero, so manual cars have significantly higher mpg than automatics in this dataset. The negative interval just means group 0 (auto) minus group 1 (manual) is negative.

Warning
t.test() defaults to Welch's test, not the classical pooled t-test. Welch's allows unequal variances and gives you non-integer degrees of freedom; that 18.332 is not a bug. To force the pooled version, set var.equal = TRUE, but only do so when var.test() backs you up.
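
A quick sketch of what the flag changes on the mtcars split; only the degrees of freedom (and hence the p-value) shift:

```r
# Welch (default) vs pooled t-test on the same two groups.
welch  <- t.test(mpg ~ am, data = mtcars)
pooled <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)
unname(welch$parameter)   # 18.332: fractional Welch-Satterthwaite df
unname(pooled$parameter)  # 30: the classical n1 + n2 - 2
```

With 19 automatic and 13 manual cars, the pooled df is 19 + 13 - 2 = 30.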

Try it: Run a Welch two-sample t-test on ToothGrowth$len by supp. Save to ex3_test. Which supplement gives longer teeth on average?

R: Your turn: two-sample t-test on ToothGrowth
# Try it: does len differ between VC and OJ?
ex3_test <- t.test(___ ~ ___, data = ToothGrowth)
ex3_test
#> Expected: p-value around 0.06; borderline significance.

Click to reveal solution
R: Exercise 3 solution
ex3_test <- t.test(len ~ supp, data = ToothGrowth)
ex3_test$p.value
#> [1] 0.06063
ex3_test$estimate
#> mean in group OJ mean in group VC
#>         20.66333         16.96333

Explanation: OJ guinea pigs grow slightly longer teeth on average, but p = 0.061 sits just above 0.05. At the standard cutoff the difference is not significant, though the effect size is large enough to be of practical interest.

Try it: Compare Sepal.Width between setosa and versicolor in iris. First run var.test(), then choose whether to set var.equal = TRUE or not, and run t.test().

R: Your turn: variance check then t-test
# Try it: subset to setosa + versicolor
sub <- subset(iris, Species %in% c("setosa", "versicolor"))
ex4_var <- var.test(Sepal.Width ~ Species, data = ___)
ex4_test <- t.test(Sepal.Width ~ Species, data = sub, var.equal = ___)
ex4_var$p.value
ex4_test$p.value
#> Expected: var p-value well above 0.05; t-test p-value extremely small.

Click to reveal solution
R: Exercise 4 solution
sub <- subset(iris, Species %in% c("setosa", "versicolor"))
ex4_var <- var.test(Sepal.Width ~ Species, data = sub)
ex4_var$p.value
#> [1] 0.2614
ex4_test <- t.test(Sepal.Width ~ Species, data = sub, var.equal = TRUE)
ex4_test$p.value
#> [1] 2.484e-15

Explanation: Variances look comparable (the var.test p-value is well above 0.05), so the pooled t-test is justified, and it returns a tiny p-value: setosa and versicolor differ strongly on sepal width.

How do you test paired samples in R?

Paired samples occur when each observation in one group has a natural partner in the other, for example before-and-after measurements on the same person. Treating paired data as independent inflates the standard error and hides real effects. The fix is one extra argument: paired = TRUE inside t.test().

R: Paired t-test on sleep dataset
# sleep has 10 subjects, each measured on two drugs.
# H0: drug 1 and drug 2 produce the same mean sleep effect.
sleep_test <- t.test(extra ~ group, data = sleep, paired = TRUE)
sleep_test
#>
#>  Paired t-test
#>
#> data: extra by group
#> t = -4.0621, df = 9, p-value = 0.002833
#> alternative hypothesis: true mean difference is not equal to 0
#> 95 percent confidence interval:
#>  -2.4598858 -0.7001142
#> sample estimates:
#> mean difference
#>           -1.58

The paired test reports p = 0.003, strong evidence the two drugs differ. If you re-run without paired = TRUE, you would get p = 0.079, a false miss caused by ignoring the pairing.

Note
A paired t-test is just a one-sample t-test on the differences. R computes extra[group==1] - extra[group==2] per subject and tests whether that mean difference is zero, which is why the printed mean difference is -1.58. If you already have the differences in hand, t.test(diffs, mu = 0) gives the same answer.
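
A minimal sketch of that equivalence, using the vector interface so the differencing is explicit:

```r
# Paired t-test vs one-sample t-test on per-subject differences.
g1 <- sleep$extra[sleep$group == 1]  # drug 1, subjects 1-10
g2 <- sleep$extra[sleep$group == 2]  # drug 2, same subjects
paired_way <- t.test(g1, g2, paired = TRUE)
diff_way   <- t.test(g1 - g2, mu = 0)
paired_way$p.value  # 0.002833, matching the formula version above
diff_way$p.value    # identical
```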

Try it: Run the same sleep test twice, once with paired = TRUE and once without, and compare the p-values.

R: Your turn: paired vs independent on sleep
# Try it: compare paired and independent p-values
ex5_paired <- t.test(extra ~ group, data = sleep, paired = ___)
ex5_indep <- t.test(extra ~ group, data = sleep, paired = ___)
ex5_paired$p.value
ex5_indep$p.value
#> Expected: paired is smaller (~0.003), independent is bigger (~0.079).

Click to reveal solution
R: Exercise 5 solution
ex5_paired <- t.test(extra ~ group, data = sleep, paired = TRUE)
ex5_indep <- t.test(extra ~ group, data = sleep, paired = FALSE)
ex5_paired$p.value
#> [1] 0.002833
ex5_indep$p.value
#> [1] 0.07939

Explanation: Pairing removes person-to-person variability, leaving only within-subject drug effects. That lowers the standard error, which raises the absolute t-statistic and crushes the p-value.

Try it: Simulate weight before and after a diet: before <- rnorm(30, 80, 10) and after <- before - rnorm(30, 2, 1). Test whether the mean difference is zero.

R: Your turn: simulated before/after weight
set.seed(10)
before <- rnorm(30, mean = 80, sd = 10)
after <- before - rnorm(30, mean = 2, sd = 1)
ex6_test <- t.test(___, ___, paired = ___)
ex6_test$p.value
#> Expected: extremely small, well below 0.05.

Click to reveal solution
R: Exercise 6 solution
set.seed(10)
before <- rnorm(30, mean = 80, sd = 10)
after <- before - rnorm(30, mean = 2, sd = 1)
ex6_test <- t.test(before, after, paired = TRUE)
ex6_test$p.value
#> [1] 9.112e-12

Explanation: By construction after is about 2 kg lighter than before with tiny noise, so the paired t-test sees a near-deterministic drop and rejects H0 decisively.

How do you test proportions in R?

When your outcome is a count of successes out of a total, prop.test() is the one-line answer. It takes two arguments in the single-proportion case: the number of successes and the total sample size. You pass the null value with p =, and R applies a continuity correction by default.

R: One-proportion test
# 35 successes out of 100. H0: p equals 0.3.
prop1_test <- prop.test(x = 35, n = 100, p = 0.30)
prop1_test
#>
#>  1-sample proportions test with continuity correction
#>
#> data: 35 out of 100, null probability 0.3
#> X-squared = 0.96429, df = 1, p-value = 0.3261
#> alternative hypothesis: true p is not equal to 0.3
#> 95 percent confidence interval:
#>  0.2595931 0.4501487
#> sample estimates:
#>    p
#> 0.35

The p-value 0.33 is well above 0.05, and the 95% CI [0.26, 0.45] contains 0.30, so you cannot reject the null that the true proportion is 0.30.

Tip
Pass counts and totals as integers, never as fractions. prop.test(0.35, 1) is a silent logic error. The function expects raw counts so it can compute standard errors correctly.
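
In practice the counts usually start life as raw observations. A short sketch, with a made-up logical vector of conversion flags, of how to get from data to prop.test() inputs:

```r
# Hypothetical per-visitor conversion flags; sum() and length() give the counts.
set.seed(1)
converted <- runif(500) < 0.33
visits_test <- prop.test(x = sum(converted), n = length(converted), p = 0.30)
visits_test$p.value
```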

Try it: Test whether 58 successes out of 100 is significantly different from p = 0.5.

R: Your turn: one-proportion test
# Try it: 58 of 100, H0: p equals 0.5
ex7_test <- prop.test(x = ___, n = ___, p = ___)
ex7_test$p.value
#> Expected: around 0.13, not significant.

Click to reveal solution
R: Exercise 7 solution
ex7_test <- prop.test(x = 58, n = 100, p = 0.5)
ex7_test$p.value
#> [1] 0.1336

Explanation: 58/100 is three successes above the null of 50/100. With only 100 trials that is within the range of chance variation, so the p-value stays above 0.05 and you do not reject H0.

Try it: Campaign A gets 120 conversions out of 2000 visitors, campaign B gets 180 out of 2500. Use a two-proportion test to decide whether the conversion rates differ.

R: Your turn: two-proportion test
# Successes and totals as length-2 vectors
successes <- c(120, 180)
totals <- c(2000, 2500)
ex8_test <- prop.test(x = ___, n = ___)
ex8_test$p.value
ex8_test$estimate
#> Expected: p-value around 0.12; not significant at 0.05.

Click to reveal solution
R: Exercise 8 solution
successes <- c(120, 180)
totals <- c(2000, 2500)
ex8_test <- prop.test(x = successes, n = totals)
ex8_test$p.value
#> [1] 0.1227
ex8_test$estimate
#> prop 1 prop 2
#>  0.060  0.072

Explanation: Campaign B converts at 7.2% vs A's 6.0%. That gap looks meaningful, but with these sample sizes it is still within chance variation (p around 0.12), so you would want more data before calling a winner.

How do you run a chi-square test in R?

Chi-square tests work on counts, not means. They come in two common flavours: goodness-of-fit, which checks whether observed counts match expected proportions for one categorical variable, and the test of independence, which checks whether two categorical variables are associated. R handles both with chisq.test(); the difference is whether you pass one vector plus p = (goodness-of-fit) or a contingency table (independence).

R: Chi-square goodness-of-fit on dice rolls
# Roll a die 60 times. H0: each face has probability 1/6.
set.seed(5)
rolls <- sample(1:6, size = 60, replace = TRUE)
observed <- table(rolls)
observed
#> rolls
#>  1  2  3  4  5  6
#>  9 13  9  5 12 12
dice_test <- chisq.test(observed, p = rep(1/6, 6))
dice_test$p.value
#> [1] 0.4963

The p-value 0.50 is nowhere near significant, so the die looks fair. The observed counts bounce around the expected 10 per face, which is exactly what chance variation produces.
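
The test object makes that interpretation concrete: alongside the p-value it carries the expected counts and the standardized residuals.

```r
# Expected counts and per-face residuals from the same goodness-of-fit object.
set.seed(5)
rolls <- sample(1:6, size = 60, replace = TRUE)
dice_test <- chisq.test(table(rolls), p = rep(1/6, 6))
dice_test$expected   # 10 per face under H0
dice_test$residuals  # (observed - expected) / sqrt(expected), per face
```

Faces with residuals beyond roughly 2 in absolute value are the ones dragging the statistic up; here none get close.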

R: Chi-square test of independence on HairEyeColor
# Sum across the Sex dimension to get a 2-way table.
hec_tab <- margin.table(HairEyeColor, c(1, 2))
hec_test <- chisq.test(hec_tab)
hec_test
#>
#>  Pearson's Chi-squared test
#>
#> data: hec_tab
#> X-squared = 138.29, df = 9, p-value < 2.2e-16

The p-value is effectively zero: hair colour and eye colour are strongly associated in this sample, which matches human genetics intuition.

Warning
Chi-square breaks down when expected counts fall below 5. R prints a warning when that happens. For small tables, switch to fisher.test(), which gives exact p-values without the large-sample approximation.
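
A minimal sketch on a made-up sparse 2x2 table: chisq.test() complains because an expected count falls below 5, while fisher.test() gives an exact answer:

```r
# Hypothetical 2x2 table with only 13 observations.
small_tab <- matrix(c(3, 1, 2, 7), nrow = 2)
suppressWarnings(chisq.test(small_tab))$expected  # smallest expected count ~1.5
fisher.test(small_tab)$p.value                    # exact, no approximation
```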

Try it: A coin is flipped 200 times and lands heads 115 times. Test whether the coin is fair.

R: Your turn: goodness-of-fit on a coin
# Try it: 115 heads, 85 tails; H0: p equals 0.5 each
observed_coin <- c(heads = ___, tails = ___)
ex9_test <- chisq.test(observed_coin, p = c(___, ___))
ex9_test$p.value
#> Expected: around 0.034; just under 0.05.

Click to reveal solution
R: Exercise 9 solution
observed_coin <- c(heads = 115, tails = 85)
ex9_test <- chisq.test(observed_coin, p = c(0.5, 0.5))
ex9_test$p.value
#> [1] 0.03389

Explanation: 115 heads out of 200 is 0.575, far enough from 0.5 with this sample size to give a p-value of 0.034. At alpha = 0.05 we reject the null that the coin is fair.

Try it: Use the Titanic dataset to test whether survival depends on passenger class. Collapse over sex and age first to build a 2-way table.

R: Your turn: Titanic class vs survival
# Build a Class x Survived table from the 4-way Titanic array
ex10_tab <- margin.table(Titanic, c(___, ___))
ex10_tab
ex10_test <- chisq.test(ex10_tab)
ex10_test$p.value
#> Expected: effectively zero; class strongly predicts survival.

Click to reveal solution
R: Exercise 10 solution
ex10_tab <- margin.table(Titanic, c(1, 4))
ex10_tab
#>       Survived
#> Class   No Yes
#>   1st  122 203
#>   2nd  167 118
#>   3rd  528 178
#>   Crew 673 212
ex10_test <- chisq.test(ex10_tab)
ex10_test$p.value
#> [1] 1.723e-40

Explanation: The p-value is astronomically small. First-class passengers survived at much higher rates than third-class or crew, so class and survival are not independent.

How do you run a one-way ANOVA in R?

When you want to compare the means of three or more groups, running many t-tests inflates the false positive rate. One-way ANOVA tests a single hypothesis across all groups: "every group has the same population mean." R's aov() accepts a formula outcome ~ group, and summary() prints the ANOVA table with the F-statistic and its p-value.

R: One-way ANOVA on iris Sepal.Length by Species
# H0: the three species have equal mean Sepal.Length.
iris_aov <- aov(Sepal.Length ~ Species, data = iris)
summary(iris_aov)
#>              Df Sum Sq Mean Sq F value Pr(>F)
#> Species       2  63.21  31.606   119.3 <2e-16 ***
#> Residuals   147  38.96   0.265
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F-statistic of 119.3 with p < 2e-16 tells you at least one species differs from the others on sepal length. What ANOVA cannot tell you is which pair differs; that is the job of a post-hoc test.

R: Tukey HSD post-hoc test after ANOVA
# TukeyHSD compares every pair with adjusted p-values.
iris_tukey <- TukeyHSD(iris_aov)
iris_tukey
#>   Tukey multiple comparisons of means
#>     95% family-wise confidence level
#>
#> Fit: aov(formula = Sepal.Length ~ Species, data = iris)
#>
#> $Species
#>                       diff       lwr       upr p adj
#> versicolor-setosa    0.930 0.6862273 1.1737727     0
#> virginica-setosa     1.582 1.3382273 1.8257727     0
#> virginica-versicolor 0.652 0.4082273 0.8957727     0

All three pairwise comparisons sit well below 0.05, so every species differs from every other on sepal length. The diff column gives the direction: virginica has the longest sepals, setosa the shortest.

Key Insight
ANOVA answers "is there any difference?" but Tukey HSD answers "which pairs differ?" Running ANOVA without a post-hoc when it turns significant leaves you with half an answer. Tukey's adjusted p-values protect against the false positives you would get from running many unadjusted t-tests.

Try it: Run a one-way ANOVA on the PlantGrowth dataset. Are there differences in weight across the three group levels (ctrl, trt1, trt2)?

R: Your turn: ANOVA on PlantGrowth
# Try it: aov(weight ~ group, data = PlantGrowth)
ex11_aov <- aov(___ ~ ___, data = PlantGrowth)
summary(ex11_aov)
#> Expected: p-value around 0.016, marginally significant.

Click to reveal solution
R: Exercise 11 solution
ex11_aov <- aov(weight ~ group, data = PlantGrowth)
summary(ex11_aov)
#>             Df Sum Sq Mean Sq F value Pr(>F)
#> group        2  3.766  1.8832   4.846 0.0159 *
#> Residuals   27 10.492  0.3886

Explanation: F = 4.85 with p = 0.016 says at least one treatment differs from the others at alpha = 0.05. You do not yet know which; that is what the next exercise cracks open.

Try it: Run TukeyHSD() on ex11_aov. Which pair of groups differs significantly?

R: Your turn: Tukey HSD on PlantGrowth
ex12_tukey <- TukeyHSD(___)
ex12_tukey
#> Expected: trt2-trt1 pair has the smallest adjusted p-value.

Click to reveal solution
R: Exercise 12 solution
ex12_tukey <- TukeyHSD(ex11_aov)
ex12_tukey
#>   Tukey multiple comparisons of means
#>     95% family-wise confidence level
#>
#> Fit: aov(formula = weight ~ group, data = PlantGrowth)
#>
#> $group
#>             diff        lwr       upr     p adj
#> trt1-ctrl -0.371 -1.0622161 0.3202161 0.3908711
#> trt2-ctrl  0.494 -0.1972161 1.1852161 0.1979960
#> trt2-trt1  0.865  0.1737839 1.5562161 0.0120064

Explanation: Only the trt2-trt1 comparison crosses the 0.05 line. Neither treatment on its own differs from the control, but trt2 beats trt1, which explains where the ANOVA signal came from.

Practice Exercises

These three capstone exercises combine concepts from across the post. No scaffolds, just the problem and a hint, then your turn to put the pieces together.

Exercise 13: Non-parametric alternative to the t-test

The Wilcoxon rank-sum test (also called Mann-Whitney U) replaces the two-sample t-test when the data are not normal. Use it on mtcars$mpg split by am, then compare the p-value to the t-test result you saw earlier.

R: Capstone 13: Wilcoxon rank-sum on mtcars
# Exercise: wilcox.test(mpg ~ am, data = mtcars)
# Hint: the formula interface works exactly like t.test.
# Write your code below:

Click to reveal solution
R: Capstone 13 solution
ex13_wilcox <- wilcox.test(mpg ~ am, data = mtcars)
ex13_wilcox$p.value
#> [1] 0.001871

Explanation: The Wilcoxon p-value 0.0019 leads to the same conclusion as the t-test (p = 0.0014) from the two-group section: manual cars are more fuel-efficient. When both tests agree you have a robust finding; the signal is not an artefact of a normality assumption.

Exercise 14: Kruskal-Wallis as a non-parametric ANOVA

The Kruskal-Wallis test generalises Wilcoxon to three or more groups. Run it on PlantGrowth and compare the p-value to the ANOVA you ran in Exercise 11.

R: Capstone 14: Kruskal-Wallis on PlantGrowth
# Exercise: kruskal.test(weight ~ group, data = PlantGrowth)
# Hint: same formula interface as aov().
# Write your code below:

Click to reveal solution
R: Capstone 14 solution
ex14_kw <- kruskal.test(weight ~ group, data = PlantGrowth)
ex14_kw$p.value
#> [1] 0.01842

Explanation: Kruskal-Wallis returns p = 0.018, close to the ANOVA's p = 0.016. When parametric and non-parametric tests agree you have extra confidence that the finding is not a consequence of the normality assumption.

Exercise 15: Pick the right test, check assumptions, report

You want to test whether ozone concentrations differ across months in airquality. Use the Shapiro-Wilk normality test to check each month's ozone values, pick a parametric or non-parametric test accordingly, run it, and state a one-sentence conclusion.

R: Capstone 15: full workflow on airquality Ozone
# 1) Check normality of Ozone within each Month with shapiro.test()
# 2) If any month is non-normal, use kruskal.test() instead of aov()
# 3) Extract and report the p-value
# Hints:
#   split(airquality$Ozone, airquality$Month) splits ozone by month.
#   lapply() iterates shapiro.test over each month.
# Write your code below:

Click to reveal solution
R: Capstone 15 solution
# Step 1: normality per month (drop NAs, some months are sparse)
ozone_by_month <- split(airquality$Ozone, airquality$Month)
ex15_shapiro <- lapply(ozone_by_month, function(x) shapiro.test(na.omit(x))$p.value)
ex15_shapiro
#> $`5`
#> [1] 0.00179
#> $`6`
#> [1] 0.00356
#> $`7`
#> [1] 0.0983
#> $`8`
#> [1] 0.00766
#> $`9`
#> [1] 0.00261

# Step 2: most months fail normality, choose Kruskal-Wallis
ex15_kw <- kruskal.test(Ozone ~ Month, data = airquality)
ex15_kw$p.value
#> [1] 6.901e-06

Explanation: Four of the five months fail the Shapiro-Wilk normality test at alpha = 0.05, so the Kruskal-Wallis non-parametric route is the honest choice. Its tiny p-value says ozone concentrations differ strongly across months, a real seasonal pattern, not random noise.

Complete Example: Picking and running the right test end-to-end

Now watch the full workflow unfold on a realistic problem. A pharma team wants to know whether a new blood pressure drug lowers systolic BP more than a placebo. They randomly assign 50 patients to each arm and measure BP after 8 weeks. The right test is the two-sample Welch t-test, but only after you rule out non-normality with Shapiro-Wilk.

R: Pharma trial end-to-end workflow
# 1) Simulate two arms
set.seed(42)
pharma_placebo <- rnorm(50, mean = 140, sd = 12)  # placebo arm
pharma_drug <- rnorm(50, mean = 134, sd = 12)     # drug arm, 6 mmHg lower

# 2) Check normality per arm
shapiro.test(pharma_placebo)$p.value
#> [1] 0.3822
shapiro.test(pharma_drug)$p.value
#> [1] 0.9031

# 3) Both p-values > 0.05, normality OK, use Welch's t-test
pharma_test <- t.test(pharma_placebo, pharma_drug, alternative = "greater")

# 4) Extract and report
pharma_test$p.value
#> [1] 0.01297
pharma_test$conf.int
#> [1] 1.310176      Inf

The one-sided p-value 0.013 gives you grounds to reject the null at alpha = 0.05: the drug lowers blood pressure more than the placebo. The one-sided 95% confidence interval tells you the true mean difference is at least 1.3 mmHg. Written as a conclusion sentence: "Patients on the new drug showed a statistically significant reduction in systolic BP versus placebo (Welch's t(97.9) = 2.28, one-sided p = 0.013)."

Summary


Figure 1: How to pick the right hypothesis test for your data.

Situation Function Default assumption
One group vs a fixed value t.test(x, mu = ...) Approx. normal
Two independent groups, means t.test(y ~ g, data = ...) Welch's, unequal var
Paired measurements t.test(..., paired = TRUE) Normal differences
One proportion prop.test(x, n, p = ...) Counts, not rates
Two proportions prop.test(c(x1, x2), c(n1, n2)) Counts per group
Counts vs expected chisq.test(obs, p = ...) Expected >= 5 per cell
Two-way categorical chisq.test(table) Expected >= 5 per cell
3+ group means aov(y ~ g); TukeyHSD() Normal + equal var
Non-normal, two groups wilcox.test(y ~ g) Similar shape
Non-normal, 3+ groups kruskal.test(y ~ g) Similar shape
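
The two-group rows of that table can be wired into a tiny helper. A sketch, where the function name and the 0.05 normality cutoff are arbitrary illustrative choices:

```r
# Hypothetical dispatcher: Shapiro-Wilk per group decides t.test vs wilcox.test.
choose_two_group_test <- function(y, g) {
  normal_ok <- all(tapply(y, g, function(v) shapiro.test(na.omit(v))$p.value) > 0.05)
  if (normal_ok) t.test(y ~ g) else wilcox.test(y ~ g)
}
res <- suppressWarnings(choose_two_group_test(mtcars$mpg, mtcars$am))
res$p.value  # well below 0.05 whichever branch runs
```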

Three takeaways to carry forward:

  • Every test returns a list. Access p.value, conf.int, and estimate directly instead of parsing console output.
  • Check assumptions before you trust the p-value. Shapiro-Wilk for normality, var.test() for equal variance, expected-count checks for chi-square.
  • State H0 before you run anything. Decide on one-sided vs two-sided up front, and pick alpha with your team, not after seeing the p-value.

