Web Scraping Exercises in R: 15 Practice Problems
Fifteen practice problems on web scraping in R with rvest, covering reading HTML, CSS selectors, tables, attributes, and polite scraping. Solutions are hidden by default.
library(rvest)
library(dplyr)
library(stringr)
library(purrr)
library(readr)
library(robotstxt)
Exercise 1: Read an HTML page
Difficulty: Beginner.
Show solution
page <- read_html("https://example.com")
page
Exercise 2: Extract page title
Difficulty: Beginner.
Show solution
read_html("https://example.com") |> html_element("title") |> html_text()
Exercise 3: Extract h1
Difficulty: Beginner.
Show solution
read_html("https://example.com") |> html_element("h1") |> html_text()
Exercise 4: All paragraphs
Difficulty: Intermediate.
Show solution
read_html("https://example.com") |> html_elements("p") |> html_text()
Exercise 5: Links (a href)
Difficulty: Intermediate.
Show solution
read_html("https://example.com") |>
  html_elements("a") |>
  html_attr("href")
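The href values come back exactly as written in the page, so relative links usually need resolving before they can be followed. The following is a minimal offline sketch, assuming inline HTML built with rvest's minimal_html() in place of a downloaded page and a hypothetical base URL; xml2::url_absolute() does the resolving.

```r
library(rvest)

# Inline HTML stands in for a downloaded page (an assumption, so this runs offline).
page <- minimal_html('
  <a href="/about">About</a>
  <a href="https://example.org/docs">Docs</a>
  <a>No href here</a>
')

# html_attr() returns NA for elements that lack the attribute.
hrefs <- page |> html_elements("a") |> html_attr("href")

# Drop missing values, then resolve relative paths against the base URL.
base_url <- "https://example.com"  # hypothetical base URL
links <- xml2::url_absolute(hrefs[!is.na(hrefs)], base_url)
links
```

When scraping a real page, the URL passed to read_html() is the natural choice for base_url.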
Exercise 6: CSS selector
Difficulty: Intermediate.
Show solution
# Class selector (url is a placeholder for the page you are scraping)
# read_html(url) |> html_elements(".btn-primary") |> html_text()
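Since the solution above is commented out, here is a small runnable sketch of the same idea. It assumes made-up inline HTML (the class names, ids, and hrefs are illustrative only) and shows three common selector types: class, descendant, and attribute.

```r
library(rvest)

# Made-up markup, parsed in memory so no network access is needed.
page <- minimal_html('
  <div id="nav"><a class="btn btn-primary" href="/buy">Buy</a></div>
  <a class="btn" href="/info">Info</a>
  <a class="btn btn-primary" href="/join">Join</a>
')

# Class selector: every element carrying the btn-primary class.
primary <- page |> html_elements(".btn-primary") |> html_text2()

# Descendant selector: anchors inside the element with id "nav".
nav_links <- page |> html_elements("#nav a") |> html_attr("href")

# Attribute selector: anchors whose href starts with "/j".
join <- page |> html_elements("a[href^='/j']") |> html_text2()

primary
```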
Exercise 7: XPath
Difficulty: Advanced.
Show solution
# XPath predicate on the class attribute (url is a placeholder for the page you are scraping)
# read_html(url) |> html_elements(xpath = "//div[@class='post']") |> html_text()
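One detail worth seeing in action: the predicate @class='post' compares the whole attribute string, so an element with class "post featured" is not matched. The sketch below, on made-up inline HTML, contrasts that with a token-style match built from the standard XPath 1.0 functions contains(), concat(), and normalize-space().

```r
library(rvest)

# Made-up markup with one element that carries two classes.
page <- minimal_html('
  <div class="post">First</div>
  <div class="post featured">Second</div>
  <div class="sidebar">Other</div>
')

# Exact match on the full attribute value.
exact <- page |>
  html_elements(xpath = "//div[@class='post']") |>
  html_text2()

# Token match: pad the class list with spaces and look for " post ".
token_xpath <- "//div[contains(concat(' ', normalize-space(@class), ' '), ' post ')]"
token <- page |>
  html_elements(xpath = token_xpath) |>
  html_text2()

list(exact = exact, token = token)
```

The CSS selector "div.post" behaves like the token match, which is one reason CSS is often the shorter option for class lookups.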
Exercise 8: Extract a table
Difficulty: Intermediate.
Show solution
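# Wikipedia tables example (conceptual; url is a placeholder):
# read_html(url) |> html_element("table") |> html_table()  # first table on the page
# Equivalent on R >= 4.3: read_html(url) |> html_table() |> _[[1]]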
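To try table extraction without a network call, the sketch below parses a made-up inline table (the item names and quantities are illustrative, not real data). It relies on html_table() returning a list with one tibble per table when given a whole document.

```r
library(rvest)

# A tiny made-up table; the <th> row becomes the column names.
page <- minimal_html('
  <table>
    <tr><th>item</th><th>qty</th></tr>
    <tr><td>apple</td><td>3</td></tr>
    <tr><td>pear</td><td>12</td></tr>
  </table>
')

# On a document, html_table() returns a list of tibbles.
tables <- page |> html_table()

# Pick the first (and here only) table.
items <- tables[[1]]
items
```

By default html_table() converts column types, so qty comes back numeric rather than character.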
Exercise 9: Attribute value
Difficulty: Intermediate.
Show solution
read_html("https://example.com") |>
  html_element("a") |>
  html_attr("href")
Exercise 10: Loop pages with map_dfr
Difficulty: Advanced.
Show solution
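urls <- paste0("https://example.com/page", 1:3)
# map_dfr() still works but is superseded as of purrr 1.0.0; map() |> list_rbind() is the current idiom.
# purrr::map_dfr(urls, ~ tibble(url = .x, title = read_html(.x) |> html_element("title") |> html_text()))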
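The same loop can be practiced offline. This sketch assumes a named vector of inline HTML strings in place of real URLs, a hypothetical helper extract_title(), and purrr 1.0.0 or later for list_rbind().

```r
library(rvest)
library(purrr)
library(tibble)

# Inline HTML strings stand in for downloaded pages (an assumption).
pages <- c(
  page1 = "<title>First</title><h1>One</h1>",
  page2 = "<title>Second</title><h1>Two</h1>"
)

# Hypothetical helper: parse one page and return a one-row tibble.
extract_title <- function(html, id) {
  doc <- read_html(html)
  tibble(
    id = id,
    title = doc |> html_element("title") |> html_text()
  )
}

# imap() passes each value and its name; list_rbind() stacks the rows.
titles <- imap(pages, extract_title) |> list_rbind()
titles
```

With real URLs, read_html() would receive the URL instead of an HTML string, and wrapping the helper in purrr::possibly() is one way to keep a single failed page from stopping the whole run.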
Exercise 11: Polite throttle
Difficulty: Advanced.
Show solution
# library(polite); session <- bow("https://example.com"); scrape(session)
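The polite package handles delays and robots.txt checks for you. For comparison, here is a minimal hand-rolled throttle in base R. The helper map_with_delay() and the stand-in fake_fetch() are hypothetical names introduced for this sketch, not part of polite or rvest.

```r
# Hypothetical helper: apply f to each element, pausing between calls.
map_with_delay <- function(x, f, delay = 1) {
  out <- vector("list", length(x))
  for (i in seq_along(x)) {
    if (i > 1) Sys.sleep(delay)  # no pause before the first request
    out[[i]] <- f(x[[i]])
  }
  out
}

urls <- paste0("https://example.com/page", 1:3)

# Stand-in for read_html() so the sketch runs offline.
fake_fetch <- function(url) nchar(url)

started <- Sys.time()
results <- map_with_delay(urls, fake_fetch, delay = 0.2)
elapsed <- as.numeric(difftime(Sys.time(), started, units = "secs"))
elapsed
```

Three items with a 0.2 second delay mean two pauses, so the run should take a little over 0.4 seconds. For real scraping, a delay of several seconds is a more considerate default.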
Exercise 12: Robots.txt
Difficulty: Intermediate.
Show solution
# robotstxt::paths_allowed("https://example.com/page")
Exercise 13: Form fill with session
Difficulty: Advanced.
Show solution
# s <- session("https://example.com/login")
# f <- s |> html_form() |> pluck(1) |> html_form_set(user = "x", pass = "y")
# session_submit(s, f)
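The form-filling step can be tried without a live login page. This sketch assumes rvest 1.0 or later, a made-up inline form, and a hypothetical base URL. It stops before submission, since session_submit() needs a real session.

```r
library(rvest)

# Made-up login form, parsed in memory.
page <- minimal_html('
  <form action="/login" method="post">
    <input type="text" name="user">
    <input type="password" name="pass">
    <input type="submit" name="go" value="Log in">
  </form>
')

# html_form() on a document returns a list of forms; take the first.
# base_url is a hypothetical site root used to resolve the relative action.
form <- html_form(page, base_url = "https://example.com")[[1]]

# Field names must match the name attributes in the form.
filled <- html_form_set(form, user = "x", pass = "y")

filled$fields$user$value
```

Printing filled shows each field with its current value, which is a quick way to confirm the names before submitting.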
Exercise 14: Clean extracted text
Difficulty: Intermediate.
Show solution
texts <- c(" Hello\nworld ", " 100\n ")
str_squish(texts)
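Cleaning often continues past whitespace. As a small extension, the sketch below (with made-up strings) squishes the text, pulls out a label with str_remove(), and converts the numeric part with readr::parse_number().

```r
library(stringr)
library(readr)

# Made-up scraped strings with stray whitespace and currency formatting.
raw <- c("  Price:\n  $1,200  ", "\tPrice: $85\n")

# Collapse internal whitespace runs and trim both ends.
clean <- str_squish(raw)

# Keep the text before the colon as a label.
label <- str_remove(clean, ":.*$")

# Extract the first number, ignoring "$" and the "," grouping mark.
price <- parse_number(clean)

data.frame(label, price)
```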
Exercise 15: Save scraped data
Difficulty: Intermediate.
Show solution
df <- tibble(title = c("a","b"), url = c("u1","u2"))
readr::write_csv(df, "out.csv")
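A quick way to confirm the file was written as intended is a round trip: write, read back, and compare. This sketch writes to a temporary file instead of out.csv so it leaves nothing behind in the working directory.

```r
library(tibble)
library(readr)

df <- tibble(title = c("a", "b"), url = c("u1", "u2"))

# tempfile() gives a throwaway path; swap in a real path to keep the file.
path <- tempfile(fileext = ".csv")
write_csv(df, path)

# show_col_types = FALSE silences the column specification message.
back <- read_csv(path, show_col_types = FALSE)
back
```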
What to do next
- API-Calls-Exercises (coming): structured data via APIs.
- stringr-Exercises (shipped): clean up scraped text.