Scenario: Split iris 70/30 into training and test sets reproducibly.
Difficulty: Beginner
Your turn (R):
set.seed(1)
# your code here

Solution (R):
set.seed(1)
n <- nrow(iris)
idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- iris[idx, ]; test <- iris[-idx, ]
c(train = nrow(train), test = nrow(test))
Explanation: sample without replacement gives a random index. set.seed makes it reproducible.
Exercise 1.2: Stratified split
Scenario: Same split but ensure class proportions are preserved per fold.
Difficulty: Intermediate
Your turn (R):
set.seed(1)
# your code here

Solution (R):
set.seed(1)
idx <- caret::createDataPartition(iris$Species, p = 0.7, list = FALSE)
train <- iris[idx, ]; test <- iris[-idx, ]
prop.table(table(train$Species))
Explanation: createDataPartition stratifies on the y argument. Critical for imbalanced classes; random split could leave some classes absent in a fold.
Explanation: kNN assigns the class by majority vote among the k nearest neighbors; vote shares give the class probabilities. Critical preprocessing: scale features so distances are comparable.
Exercise 1.4: First decision tree
Scenario: Fit a decision tree for iris Species using all 4 predictors.
Difficulty: Intermediate
Your turn (R):
ex_1_4 <- # your code here
print(ex_1_4)

Solution (R):
ex_1_4 <- rpart::rpart(Species ~ ., data = iris)
print(ex_1_4)
Explanation: rpart uses CART. Default stops based on cp (complexity). Inspect splits via the printed tree or rpart.plot::rpart.plot.
Exercise 1.5: Compute accuracy
Scenario: Given predictions and truth, compute accuracy.
Difficulty: Beginner
Your turn (R):
truth <- c("a","b","a","c","b","a")
pred <- c("a","b","b","c","b","c")
ex_1_5 <- # your code here
ex_1_5

Solution (R):
ex_1_5 <- mean(pred == truth)
ex_1_5
Explanation: Mean of TRUE/FALSE comparisons gives the proportion correct.
Exercise 1.6: Confusion matrix
Scenario: Build a confusion matrix from the same predictions.
Difficulty: Intermediate
Your turn (R):
truth <- factor(c("a","b","a","c","b","a"))
pred <- factor(c("a","b","b","c","b","c"))
ex_1_6 <- # your code here
ex_1_6

Solution (R):
truth <- factor(c("a","b","a","c","b","a"))
pred <- factor(c("a","b","b","c","b","c"))
ex_1_6 <- table(truth = truth, pred = pred)
ex_1_6
Explanation: table cross-tabulates truth vs pred. Diagonal = correct. caret::confusionMatrix adds per-class metrics.
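The caret helper mentioned above can be sketched as follows (assuming the caret package is installed); confusionMatrix() expects factors sharing the same levels:

```r
truth <- factor(c("a","b","a","c","b","a"))
pred  <- factor(c("a","b","b","c","b","c"))
# Returns overall accuracy plus per-class sensitivity, specificity, etc.
caret::confusionMatrix(data = pred, reference = truth)
```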
Exercise 1.7: Per-class precision from confusion matrix
Scenario: From the confusion matrix in 1.6, compute precision per class.
Difficulty: Intermediate
Your turn (R):
cm <- table(truth = factor(c("a","b","a","c","b","a")),
            pred = factor(c("a","b","b","c","b","c")))
ex_1_7 <- # your code here
ex_1_7

Solution (R):
cm <- table(truth = factor(c("a","b","a","c","b","a")),
            pred = factor(c("a","b","b","c","b","c")))
ex_1_7 <- diag(cm) / colSums(cm)
ex_1_7
Explanation: Precision = TP / (TP + FP) per class = diag / column sum. Recall = diag / row sum. F1 = 2PR / (P+R).
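All three formulas can be computed directly from the 1.6 matrix in base R:

```r
cm <- table(truth = factor(c("a","b","a","c","b","a")),
            pred  = factor(c("a","b","b","c","b","c")))
precision <- diag(cm) / colSums(cm)  # TP / (TP + FP), per predicted class
recall    <- diag(cm) / rowSums(cm)  # TP / (TP + FN), per true class
f1        <- 2 * precision * recall / (precision + recall)
round(rbind(precision, recall, f1), 2)
```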
Exercise 1.8: Save and load a model
Scenario: Save an rpart model to disk and load it back.
Difficulty: Intermediate
Your turn (R):
fit <- rpart::rpart(Species ~ ., data = iris)
# your code here

Solution (R):
fit <- rpart::rpart(Species ~ ., data = iris)
saveRDS(fit, "model.rds")
loaded <- readRDS("model.rds")
identical(predict(fit), predict(loaded))
Explanation: saveRDS for one R object; readRDS to restore. Standard for serializing models.
Section 2. Classification (10 problems)
Exercise 2.1: Logistic regression
Scenario: Predict am (0/1) from mpg in mtcars.
Difficulty: Intermediate
Your turn (R):
ex_2_1 <- # your code here
coef(ex_2_1)

Solution (R):
ex_2_1 <- glm(am ~ mpg, data = mtcars, family = binomial)
coef(ex_2_1)
Explanation: glm with family = binomial fits logistic regression. Coefficient is on log-odds scale; exp() gives odds ratio.
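Exponentiating moves the coefficient from the log-odds scale to the odds scale (base R only):

```r
fit <- glm(am ~ mpg, data = mtcars, family = binomial)
# Each extra mpg multiplies the odds of am = 1 by this factor
exp(coef(fit))["mpg"]
```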
Exercise 2.2: Predict probabilities
Scenario: Get predicted probabilities from the model in 2.1.
Difficulty: Intermediate
Your turn (R):
fit <- glm(am ~ mpg, data = mtcars, family = binomial)
ex_2_2 <- # your code here
head(ex_2_2)

Solution (R):
fit <- glm(am ~ mpg, data = mtcars, family = binomial)
ex_2_2 <- predict(fit, type = "response")
head(ex_2_2)
Explanation: type = "response" returns probabilities; type = "link" returns log-odds; default is "link" for glm.
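The link/response relationship can be checked directly: the inverse logit (plogis) of the link-scale predictions reproduces the response-scale probabilities.

```r
fit <- glm(am ~ mpg, data = mtcars, family = binomial)
lp <- predict(fit)                     # default type = "link": log-odds
pr <- predict(fit, type = "response")  # probabilities in [0, 1]
all.equal(plogis(lp), pr)              # inverse logit maps one onto the other
```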
Exercise 2.3: Threshold to classify
Scenario: Convert probabilities to 0/1 predictions at threshold 0.5.
Difficulty: Beginner
Your turn (R):
fit <- glm(am ~ mpg, data = mtcars, family = binomial)
prob <- predict(fit, type = "response")
ex_2_3 <- # your code here
table(ex_2_3, mtcars$am)

Solution (R):
fit <- glm(am ~ mpg, data = mtcars, family = binomial)
prob <- predict(fit, type = "response")
ex_2_3 <- as.integer(prob > 0.5)
table(ex_2_3, mtcars$am)
Explanation: Default threshold 0.5; tune for imbalanced classes or asymmetric costs.
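Threshold tuning can be sketched by sweeping candidate cutoffs and comparing accuracy; in practice you would optimize a cost-weighted metric on held-out data rather than training-set accuracy:

```r
fit <- glm(am ~ mpg, data = mtcars, family = binomial)
prob <- predict(fit, type = "response")
thresholds <- c(0.3, 0.5, 0.7)
# Accuracy at each candidate threshold
acc <- sapply(thresholds, function(t) mean(as.integer(prob > t) == mtcars$am))
setNames(acc, thresholds)
```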
Exercise 2.4: ROC curve and AUC
Scenario: ROC curve and AUC for the logistic model.
Difficulty: Advanced
Your turn (R):
fit <- glm(am ~ mpg + hp, data = mtcars, family = binomial)
prob <- predict(fit, type = "response")
ex_2_4 <- # your code here
ex_2_4

Solution (R):
fit <- glm(am ~ mpg + hp, data = mtcars, family = binomial)
prob <- predict(fit, type = "response")
roc_obj <- pROC::roc(mtcars$am, prob)
ex_2_4 <- pROC::auc(roc_obj)
ex_2_4
Explanation: ROC plots TPR vs FPR across thresholds. AUC is the area: 1 = perfect, 0.5 = random. Threshold-independent metric.
Exercise 2.5: Random forest classifier
Scenario: Random forest for iris Species.
Difficulty: Intermediate
Your turn (R):
set.seed(1)
ex_2_5 <- # your code here
ex_2_5

Solution (R):
set.seed(1)
ex_2_5 <- randomForest::randomForest(Species ~ ., data = iris)
ex_2_5
Explanation: Bootstrap aggregating with random feature subsets per split. ntree default 500; mtry sqrt(p) for classification. Out-of-bag error printed.
Explanation: Replicates minority rows to match majority count. Alternatives: downSample (cut majority), SMOTE (synthetic), class weights in the model.
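A minimal upSample sketch, assuming the caret package; the imbalanced subset built here is hypothetical:

```r
# Hypothetical imbalance: keep only 5 setosa rows
imb <- iris[c(1:5, 51:150), ]
table(imb$Species)     # setosa 5, versicolor 50, virginica 50
set.seed(1)
up <- caret::upSample(x = imb[, 1:4], y = imb$Species, yname = "Species")
table(up$Species)      # minority classes replicated up to the majority count
```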
Exercise 2.9: SVM with caret
Scenario: Train an SVM with linear kernel on iris.
Difficulty: Advanced
Your turn (R):
set.seed(1)
ex_2_9 <- # your code here
ex_2_9

Solution (R):
set.seed(1)
ex_2_9 <- caret::train(Species ~ ., data = iris, method = "svmLinear")
ex_2_9
Explanation: caret::train wraps many ML methods with consistent interface. Auto-tunes when method allows. tidymodels is the modern equivalent.
Exercise 2.10: Gradient boosting
Scenario: Fit a gradient-boosted classifier (gbm).
Difficulty: Advanced
Your turn (R):
ir <- iris
ir$Species_int <- as.integer(ir$Species == "virginica")
set.seed(1)
ex_2_10 <- # your code here
ex_2_10

Solution (R):
ir <- iris
ir$Species_int <- as.integer(ir$Species == "virginica")
set.seed(1)
ex_2_10 <- gbm::gbm(Species_int ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                    data = ir, distribution = "bernoulli", n.trees = 500,
                    interaction.depth = 3, verbose = FALSE)
ex_2_10
Explanation: GBM trains many shallow trees, each correcting predecessors. interaction.depth controls tree complexity. xgboost::xgboost is the production-grade alternative.
Section 3. Regression ML (8 problems)
Exercise 3.1: Regression tree
Scenario: Predict mpg from mtcars predictors with rpart.
Difficulty: Intermediate
Your turn (R):
ex_3_1 <- # your code here
ex_3_1

Solution (R):
ex_3_1 <- rpart::rpart(mpg ~ ., data = mtcars)
ex_3_1
Explanation: caret::resamples bundles CV results from multiple models for direct comparison. Always evaluate models on the same resamples (same seed and fold indices) for a fair comparison.
Exercise 4.8: Early stopping
Scenario: xgboost with early stopping on a validation fold.
Difficulty: Advanced
Your turn (R):
set.seed(1)
n <- nrow(mtcars)
idx <- sample(seq_len(n), 22)
x_tr <- as.matrix(mtcars[idx, -1])
y_tr <- mtcars$mpg[idx]
x_te <- as.matrix(mtcars[-idx, -1])
y_te <- mtcars$mpg[-idx]
# your code here
Explanation: Watch test loss per round; stop when no improvement for early_stopping_rounds. Prevents overfitting and saves compute.
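One possible solution sketch using the classic xgboost R interface (the watchlist argument is from the 1.x API; newer releases rename it evals):

```r
set.seed(1)
idx <- sample(seq_len(nrow(mtcars)), 22)
dtr <- xgboost::xgb.DMatrix(as.matrix(mtcars[idx, -1]), label = mtcars$mpg[idx])
dte <- xgboost::xgb.DMatrix(as.matrix(mtcars[-idx, -1]), label = mtcars$mpg[-idx])
fit <- xgboost::xgb.train(
  params = list(objective = "reg:squarederror", eta = 0.1, max_depth = 3),
  data = dtr,
  nrounds = 500,
  watchlist = list(train = dtr, test = dte),  # losses monitored each round
  early_stopping_rounds = 20,                 # stop if test RMSE stops improving
  verbose = 0)
fit$best_iteration
```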
Section 5. Feature engineering (8 problems)
Exercise 5.1: One-hot encode a factor
Scenario: Convert factor to dummy variables.
Difficulty: Beginner
Your turn (R):
df <- tibble(color = c("red","blue","green","red","blue"))
ex_5_1 <- # your code here
ex_5_1

Solution (R):
df <- tibble(color = c("red","blue","green","red","blue"))
ex_5_1 <- model.matrix(~ color - 1, data = df)
ex_5_1
Explanation: model.matrix expands factors to dummy columns. -1 removes the intercept so all levels appear. lm/glm do this automatically; for ML methods you may need to do it manually.
Exercise 5.2: Min-max scale numeric features
Scenario: Rescale all numeric columns of iris to [0, 1].
Explanation: cut creates bins from breaks. include.lowest brings the minimum value into the first bin. ntile() is the dplyr alternative for equal-frequency bins.
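The cut/ntile distinction can be illustrated on a small vector (base R, with ntile noted in a comment):

```r
x <- c(1, 3, 5, 7, 9, 11)
# Bins from explicit breaks; include.lowest pulls x = 1 into the first bin
bins <- cut(x, breaks = c(1, 4, 8, 12), include.lowest = TRUE)
table(bins)
# dplyr::ntile(x, 3) would instead give equal-frequency bins: 1 1 2 2 3 3
```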
Exercise 5.6: Interaction features
Scenario: Add a wt:hp interaction column manually.
Explanation: recipes is the tidymodels preprocessing engine. Each step is named and parameterizable. prep() learns from data; bake() applies. Essential for production workflows.
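The prep()/bake() cycle described above can be sketched as follows (assuming the recipes package):

```r
library(recipes)
rec <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors()) |>
  prep()                        # learns per-column means and sds
baked <- bake(rec, new_data = mtcars)
round(colMeans(baked), 3)      # predictors centered at 0; mpg (the outcome) untouched
```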
Section 6. End-to-end ML pipelines (8 problems)
Exercise 6.1: Iris classifier mini-pipeline
Scenario: Split, scale, fit kNN, evaluate.
Difficulty: Intermediate
Your turn (R):
set.seed(1)
# your code here

Solution (R):
set.seed(1)
idx <- caret::createDataPartition(iris$Species, p = 0.7, list = FALSE)
tr <- iris[idx, ]; te <- iris[-idx, ]
# Scale based on training only
mu <- sapply(tr[, 1:4], mean); sg <- sapply(tr[, 1:4], sd)
tr_x <- sweep(sweep(tr[, 1:4], 2, mu), 2, sg, "/")
te_x <- sweep(sweep(te[, 1:4], 2, mu), 2, sg, "/")
pred <- class::knn(tr_x, te_x, tr$Species, k = 5)
mean(pred == te$Species)
Explanation: Critical: fit scaling on train only, apply same to test. sweep applies a row/column operation.
Exercise 6.2: Cross-validated tidymodels workflow
Scenario: tidymodels workflow with recipe + lm + 5-fold CV.
Difficulty: Advanced
Your turn (R):
library(tidymodels)
set.seed(1)
# your code here

Solution (R):
library(tidymodels)
set.seed(1)
split <- initial_split(mtcars, prop = 0.7)
tr <- training(split); te <- testing(split)
rec <- recipe(mpg ~ ., data = tr) |>
  step_normalize(all_numeric_predictors())
mod <- linear_reg() |> set_engine("lm")
wf <- workflow() |> add_recipe(rec) |> add_model(mod)
cv_folds <- vfold_cv(tr, v = 5)
results <- fit_resamples(wf, cv_folds, metrics = metric_set(rmse, rsq))
collect_metrics(results)
Exercise 6.3: Variable importance plot
Scenario: Random forest, then plot variable importance.
Difficulty: Intermediate
Your turn (R):
set.seed(1)
fit <- randomForest::randomForest(mpg ~ ., data = mtcars, importance = TRUE)
# your code here

Solution (R):
set.seed(1)
fit <- randomForest::randomForest(mpg ~ ., data = mtcars, importance = TRUE)
imp <- as.data.frame(randomForest::importance(fit))
imp$var <- rownames(imp)
ggplot(imp, aes(x = reorder(var, `%IncMSE`), y = `%IncMSE`)) +
  geom_col() +
  coord_flip() +
  labs(title = "RF variable importance", x = NULL)
Explanation: importance = TRUE enables permutation importance. %IncMSE measures how much out-of-bag MSE rises when a feature's values are permuted.
Exercise 6.4: Confusion matrix viz
Scenario: Visualize confusion matrix as a heatmap.
Difficulty: Advanced
Your turn (R):
set.seed(1)
fit <- randomForest::randomForest(Species ~ ., data = iris)
pred <- predict(fit)
cm <- table(truth = iris$Species, pred = pred)
# your code here

Solution (R):
set.seed(1)
fit <- randomForest::randomForest(Species ~ ., data = iris)
pred <- predict(fit)
cm <- table(truth = iris$Species, pred = pred)
cm_df <- as.data.frame(cm)
ggplot(cm_df, aes(pred, truth, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = Freq), color = "white", size = 5) +
  scale_fill_gradient(low = "lightblue", high = "navy")
Explanation: Tiles plus text labels make the matrix much easier to scan than a raw table.
Exercise 6.5: Calibration check for probabilities
Scenario: Plot predicted probability vs actual rate in deciles.
Difficulty: Advanced
Your turn (R):
set.seed(1)
fit <- glm(am ~ mpg + hp + wt, data = mtcars, family = binomial)
prob <- predict(fit, type = "response")
df <- tibble(prob = prob, actual = mtcars$am)
# your code here

Solution (R):
set.seed(1)
fit <- glm(am ~ mpg + hp + wt, data = mtcars, family = binomial)
prob <- predict(fit, type = "response")
df <- tibble(prob = prob, actual = mtcars$am)
calib <- df |>
  mutate(decile = ntile(prob, 10)) |>
  group_by(decile) |>
  summarise(mean_pred = mean(prob), actual_rate = mean(actual))
ggplot(calib, aes(mean_pred, actual_rate)) +
  geom_point() +
  geom_abline(slope = 1, linetype = "dashed")
Explanation: Well-calibrated model has predicted probability close to actual frequency. Off-diagonal points reveal miscalibration.
Exercise 6.6: Compare models with ROC curves
Scenario: Plot ROC curves for two models on the same axes.
Difficulty: Advanced
Your turn (R):
fit1 <- glm(am ~ mpg, data = mtcars, family = binomial)
fit2 <- glm(am ~ mpg + hp + wt, data = mtcars, family = binomial)
# your code here

Solution (R):
fit1 <- glm(am ~ mpg, data = mtcars, family = binomial)
fit2 <- glm(am ~ mpg + hp + wt, data = mtcars, family = binomial)
r1 <- pROC::roc(mtcars$am, predict(fit1, type = "response"))
r2 <- pROC::roc(mtcars$am, predict(fit2, type = "response"))
plot(r1, col = "blue")
plot(r2, col = "red", add = TRUE)
legend("bottomright", legend = c("Simple", "Full"), col = c("blue", "red"), lty = 1)
Explanation: ROC curves on same plot reveal which model dominates across thresholds.
Exercise 6.7: SHAP-style local explanation (DALEX)
Scenario: Explain a single prediction with feature attribution.
Difficulty: Advanced
Your turn (R):
set.seed(1)
# your code here

Solution (R):
set.seed(1)
fit <- randomForest::randomForest(mpg ~ wt + hp, data = mtcars)
# DALEX/iml provide model-agnostic SHAP-like explanations
explainer <- DALEX::explain(fit, data = mtcars[, c("wt", "hp")], y = mtcars$mpg, verbose = FALSE)
single_obs <- mtcars["Mazda RX4", c("wt", "hp")]
DALEX::predict_parts(explainer, single_obs)
Explanation: DALEX offers model-agnostic local explanations: feature contribution to a single prediction. iml is an alternative.
Exercise 6.8: Save and serve a tidymodels workflow