Web Scraping Exercises in R: 15 Practice Problems

Fifteen practice problems on web scraping in R with rvest, covering reading HTML, CSS selectors, tables, attributes, and polite scraping. Each solution is hidden until you expand it.

library(rvest)
library(dplyr)
library(stringr)
library(purrr)
library(readr)
library(robotstxt)

Exercise 1: Read an HTML page

Difficulty: Beginner.

Show solution
page <- read_html("https://example.com")
page

Exercise 2: Extract page title

Difficulty: Beginner.

Show solution
read_html("https://example.com") |> html_element("title") |> html_text()

Exercise 3: Extract h1

Difficulty: Beginner.

Show solution
read_html("https://example.com") |> html_element("h1") |> html_text()

Exercise 4: All paragraphs

Difficulty: Intermediate.

Show solution
read_html("https://example.com") |> html_elements("p") |> html_text()

Exercise 5: Links (a href)

Difficulty: Intermediate.

Show solution
read_html("https://example.com") |>
  html_elements("a") |> html_attr("href")
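
Scraped href values are often relative, so a common follow-up step is resolving them against the page URL. The following is a minimal sketch, assuming an inline page built with rvest's minimal_html() and a hypothetical base URL of https://example.com/blog/ (neither comes from the exercise itself):

```r
library(rvest)
library(xml2)

# Hypothetical inline page, so the example runs without a network call
page <- minimal_html('
  <a href="/about">About</a>
  <a href="page2.html">Next</a>
  <a href="https://www.r-project.org/">R</a>
')

# Assumed URL the page was fetched from
base_url <- "https://example.com/blog/"

hrefs <- page |> html_elements("a") |> html_attr("href")

# url_absolute() resolves relative links and leaves absolute ones unchanged
links <- url_absolute(hrefs, base_url)
links
```

Resolving links before storing them keeps later steps, such as looping over the collected URLs, independent of the page each link came from.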

Exercise 6: CSS selector

Difficulty: Intermediate.

Show solution
# Class selector (url is a placeholder for the page you are scraping)
# read_html(url) |> html_elements(".btn-primary") |> html_text()
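
To try the class selector without a live site, one option is to build a small page in memory. This is a sketch under that assumption: the markup and link labels below are invented for illustration.

```r
library(rvest)

# Hypothetical inline page; the class names mirror the selector in the exercise
page <- minimal_html('
  <a class="btn btn-primary" href="/signup">Sign up</a>
  <a class="btn" href="/about">About</a>
  <a class="btn btn-primary" href="/login">Log in</a>
')

# ".btn-primary" matches any element whose class list contains btn-primary
labels <- page |> html_elements(".btn-primary") |> html_text2()
labels
```

The selector matches on a single class token, so elements that also carry other classes (here, btn) are still selected.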

Exercise 7: XPath

Difficulty: Advanced.

Show solution
# @class='post' matches only when the class attribute is exactly "post" (url is a placeholder).
# read_html(url) |> html_elements(xpath = "//div[@class='post']") |> html_text()
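
One thing worth checking with XPath class tests is how they behave when an element has several classes. The sketch below uses an invented inline page to compare an exact attribute match with a token-based match; the concat/normalize-space idiom is a common XPath 1.0 workaround, not something the exercise prescribes.

```r
library(rvest)

# Hypothetical inline page with three differently classed divs
page <- minimal_html('
  <div class="post">First</div>
  <div class="post featured">Second</div>
  <div class="postscript">Third</div>
')

# Exact match: the whole class attribute must equal "post"
exact <- page |>
  html_elements(xpath = "//div[@class='post']") |>
  html_text2()

# Token match: "post" must appear as one whitespace-separated class name
token_xpath <- "//div[contains(concat(' ', normalize-space(@class), ' '), ' post ')]"
token <- page |>
  html_elements(xpath = token_xpath) |>
  html_text2()

exact
token
```

The exact form skips the div with two classes, while the token form picks it up and still ignores postscript. The CSS selector div.post behaves like the token form.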

Exercise 8: Extract a table

Difficulty: Intermediate.

Show solution
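# Conceptual example for a Wikipedia page (url is a placeholder).
# html_table() on a whole document returns a list of tibbles, one per <table>.
# The `_[[1]]` placeholder form needs R >= 4.3; purrr::pluck(1) also works on older versions.
# read_html(url) |> html_table() |> pluck(1)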
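
A runnable variant can select one table element first and then convert it. This is a minimal sketch on an invented inline table; the column names and values are placeholders, not real data.

```r
library(rvest)

# Hypothetical inline page with a single table (values are illustrative only)
page <- minimal_html('
  <table>
    <tr><th>item</th><th>price</th></tr>
    <tr><td>apple</td><td>1.5</td></tr>
    <tr><td>pear</td><td>2.25</td></tr>
  </table>
')

# html_element() picks the first matching <table>;
# html_table() then returns one tibble rather than a list
tbl <- page |> html_element("table") |> html_table()
tbl
```

Selecting the table with a CSS selector first tends to be more robust than indexing into the list of all tables, because the index changes whenever the page gains or loses a table.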

Exercise 9: Attribute value

Difficulty: Intermediate.

Show solution
read_html("https://example.com") |>
  html_element("a") |> html_attr("href")

Exercise 10: Loop pages with map_dfr

Difficulty: Advanced.

Show solution
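urls <- paste0("https://example.com/page", 1:3)
# These URLs are placeholders, so the call stays commented out.
# map_dfr() is superseded as of purrr 1.0.0; map() |> list_rbind() is the current equivalent.
# purrr::map_dfr(urls, ~ tibble(url = .x, title = read_html(.x) |> html_element("title") |> html_text()))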
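
The same loop can be sketched offline with the newer purrr idiom of mapping and then binding rows with list_rbind(). The example assumes purrr 1.0.0 or later, and the named list of in-memory pages is a hypothetical stand-in for documents that read_html() would normally download.

```r
library(rvest)
library(purrr)
library(tibble)

# Hypothetical stand-ins for downloaded pages, keyed by their (placeholder) URL
pages <- list(
  "https://example.com/page1" = minimal_html("<p>one</p>", title = "Page one"),
  "https://example.com/page2" = minimal_html("<p>two</p>", title = "Page two")
)

# Build a one-row tibble per page, then bind all rows into a single tibble
titles <- imap(pages, \(doc, url) {
  tibble(
    url   = url,
    title = doc |> html_element("title") |> html_text()
  )
}) |>
  list_rbind()

titles
```

With real URLs, the body of the function would call read_html(url) instead of receiving a ready-made document.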

Exercise 11: Polite throttle

Difficulty: Advanced.

Show solution
# bow() checks robots.txt and sets a crawl delay; scrape() fetches the page while honoring it.
# library(polite); session <- bow("https://example.com"); scrape(session)
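
When polite is not available, a manual throttle is one alternative. The sketch below swaps in purrr::slowly() rather than polite, and fetch_page() is a hypothetical stand-in that does no network I/O, so only the pacing is demonstrated.

```r
library(purrr)

# Hypothetical stand-in for a function that would download one page
fetch_page <- function(url) paste("fetched", url)

# slowly() wraps the function so successive calls are spaced out.
# 0.2 s keeps the demo fast; a real scraper would usually wait a second or more.
fetch_slowly <- slowly(fetch_page, rate = rate_delay(0.2))

urls <- paste0("https://example.com/page", 1:3)

start <- Sys.time()
results <- map_chr(urls, fetch_slowly)
elapsed <- as.numeric(difftime(Sys.time(), start, units = "secs"))

results
elapsed
```

Unlike polite, this approach does not read robots.txt, so the permission check has to be done separately.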

Exercise 12: Robots.txt

Difficulty: Intermediate.

Show solution
# robotstxt::paths_allowed("https://example.com/page")
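
The permission check can also be tried offline by parsing a robots.txt string directly. This is a sketch that assumes the robotstxt() constructor accepts a text argument and exposes a check() method, as described in the package documentation; the rules and paths are invented for illustration.

```r
library(robotstxt)

# Hypothetical robots.txt content: everything is allowed except /private/
rules <- "User-agent: *\nDisallow: /private/\n"

# Parse the rules from text instead of downloading them from a domain
rt <- robotstxt(text = rules)

paths <- c("/blog/", "/private/data.html")

# Check each path for the generic "*" user agent, one path per call
allowed <- vapply(
  paths,
  \(p) as.logical(rt$check(paths = p, bot = "*")),
  logical(1)
)
allowed
```

For a live site, paths_allowed() from the exercise does the download and the check in one call.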

Exercise 13: Form fill with session

Difficulty: Advanced.

Show solution
# s <- session("https://example.com/login")
# f <- s |> html_form() |> pluck(1) |> html_form_set(user = "x", pass = "y")
# session_submit(s, f)
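
The fill step can be inspected without a live login page by parsing a form from inline HTML. In this sketch the form markup, action URL, and the field names user and pass are all hypothetical; a real form defines its own field names, and html_form_set() rejects names the form does not contain.

```r
library(rvest)

# Hypothetical login form; nothing is submitted in this example
page <- minimal_html('
  <form action="https://example.com/login" method="post">
    <input type="text" name="user">
    <input type="password" name="pass">
    <input type="submit" name="submit" value="Log in">
  </form>
')

# html_form() returns a list with one entry per <form> on the page
form <- html_form(page)[[1]]

# Set field values by name; the result is a modified copy of the form
filled <- html_form_set(form, user = "x", pass = "y")

filled$fields$user$value
filled$fields$pass$value
```

With a real site, the filled form would then be passed to session_submit() together with the session object.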

Exercise 14: Clean extracted text

Difficulty: Intermediate.

Show solution
texts <- c("  Hello\nworld ", " 100\n  ")
str_squish(texts)

Exercise 15: Save scraped data

Difficulty: Intermediate.

Show solution
df <- tibble(title = c("a","b"), url = c("u1","u2"))
readr::write_csv(df, "out.csv")

What to do next

  • API-Calls-Exercises (coming soon): get structured data through APIs instead of scraping.
  • stringr-Exercises (available now): clean up scraped text.