Web Scraping in R with rvest: Extract Any Table or Text in 10 Minutes
rvest is an R package that reads HTML pages, selects elements with CSS selectors, and extracts text, tables, and attributes into clean data frames — letting you turn any website into structured data.
Introduction
The data you need often lives on a website, not in a CSV file. Maybe it is a table of country statistics on Wikipedia, a list of product prices on an e-commerce site, or research results published in HTML. Web scraping is the technique that extracts that data programmatically so you can analyse it in R.
The rvest package is the tidyverse's tool for web scraping. It builds on xml2 (which wraps the libxml2 C library) for fast HTML parsing and provides a small set of intuitive functions: read_html() to fetch a page, html_elements() to select nodes, and extraction helpers like html_text2(), html_table(), and html_attr() to pull out the data you need.
In this tutorial, you will learn how to read HTML into R, target specific elements with CSS selectors, extract text and tables, handle pagination across multiple pages, manage cookies with sessions, and scrape politely using the robotstxt and polite packages. By the end, you will have a complete scraping workflow from URL to clean data frame.
To follow along, run install.packages("rvest") and execute the examples locally.
Figure 1: The rvest scraping pipeline: from URL to clean data frame.
What is web scraping and when should you use it?
Web scraping means writing code that downloads a web page's HTML and extracts specific data from it. Instead of copying and pasting by hand, your R script reads the page, finds the elements you care about, and returns them as vectors or data frames.
You should consider web scraping when three conditions are met. First, the data is publicly available on a website. Second, there is no API or downloadable file that provides the same data in a structured format. Third, the website's terms of service and robots.txt file do not prohibit automated access.
Before you scrape any site, check two things. Open https://example.com/robots.txt in your browser to see which paths are allowed or disallowed for bots. Then read the site's Terms of Service — some explicitly forbid scraping even on public pages.
Web scraping is appropriate for academic research on public data, monitoring publicly posted prices or statistics, collecting data for personal projects, and aggregating information that has no API. It is not appropriate for scraping behind login walls without permission, collecting personal data, or circumventing paywalls.
How do you read an HTML page into R with rvest?
Every rvest workflow starts with read_html(). This function takes a URL (or a local file path) and returns a parsed HTML document that you can query with other rvest functions.
Let's install and load rvest, then read a Wikipedia page.
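A minimal sketch (the article URL is just a convenient, stable test page):

```r
# install.packages("rvest")  # run once
library(rvest)

# Fetch and parse the live page into an XML document
page <- read_html("https://en.wikipedia.org/wiki/R_(programming_language)")
page
#> {html_document}
#> <html ...>
```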
The page object is an XML document stored in memory. It contains the entire DOM tree of the web page. You will never work with this object directly — instead, you pass it to html_elements() to select the parts you need.
You can also read HTML from a local file or from a character string. This is useful for testing your selectors without hitting a live server repeatedly.
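For instance, read_html() accepts an inline HTML string, which makes selector experiments instant and offline:

```r
library(rvest)

html <- '
  <html><body>
    <h1 id="main">Hello</h1>
    <p class="info">First paragraph</p>
  </body></html>'

page <- read_html(html)
page |> html_element("h1") |> html_text2()
#> [1] "Hello"
```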
Save a copy of the page to disk, read it with read_html(), verify your CSS selectors work, then switch to the live URL. This saves time and avoids hitting the server unnecessarily.

How do CSS selectors target HTML elements?
CSS selectors are patterns that identify specific HTML elements on a page. They are the same selectors used in web design to apply styles. In rvest, you pass a CSS selector string to html_elements() to pick out exactly the elements you want.
HTML elements have three main identifiers you can target. A tag name like p, h1, table, or a identifies the element type. A class (prefixed with .) groups elements that share a style or purpose. An id (prefixed with #) uniquely identifies one specific element on the page.

Figure 2: Five types of CSS selectors used to target HTML elements.
Here is a quick reference for the most common selector patterns.
| Selector | What It Matches | Example |
|---|---|---|
| p | All <p> elements | All paragraphs |
| .info | Elements with class="info" | <div class="info"> |
| #main | The element with id="main" | <div id="main"> |
| div > p | <p> that is a direct child of <div> | Nested paragraphs |
| a[href] | <a> elements that have an href attribute | All links |
| table.wikitable | <table> with class="wikitable" | Wikipedia tables |
Let's use these selectors on our Wikipedia page to extract different elements.
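A sketch of each selector type in action (the selectors assume Wikipedia's current markup, which can change):

```r
library(rvest)

page <- read_html("https://en.wikipedia.org/wiki/R_(programming_language)")

page |> html_element("h1")                  # by tag: the page title
page |> html_elements("table.wikitable")    # by tag + class: data tables
page |> html_element("#History")            # by id: the History section anchor
page |> html_elements("a[href]")            # by attribute: every real link
```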
The html_elements() function always returns a list of matching nodes (a nodeset). If no elements match, you get an empty nodeset rather than an error. Use html_element() (singular) when you expect exactly one match — it returns NA instead of an empty set when nothing matches.
How do you extract text, tables, and attributes from HTML?
Once you have selected HTML elements with html_elements(), you need to pull out the actual data. The rvest package provides three main extraction functions: html_text2() for text content, html_table() for tables, and html_attr() for attributes like links and image sources.

Figure 3: How rvest functions map to the HTML DOM tree.
Extracting text with html_text2()
The html_text2() function extracts the visible text from HTML elements. It mimics how a browser renders text — collapsing whitespace, respecting line breaks, and stripping HTML tags.
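A small illustration with an inline snippet — note how the two functions treat the <br> tag and the extra spaces:

```r
library(rvest)

html <- read_html("<p>Hello   <br>world!</p>")

html |> html_element("p") |> html_text()
#> [1] "Hello   world!"

html |> html_element("p") |> html_text2()
#> [1] "Hello\nworld!"
```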
html_text() preserves raw whitespace from the HTML source, which often includes invisible tabs and newlines, while html_text2() returns clean, browser-like text. Prefer html_text2() unless you specifically need the raw whitespace for parsing.

Extracting tables with html_table()
Many websites display data in HTML <table> elements. The html_table() function converts these directly into R data frames — no manual parsing needed.
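For example (the page and its table structure are assumptions about Wikipedia's current layout):

```r
library(rvest)

url <- "https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)"

tables <- read_html(url) |>
  html_elements("table.wikitable") |>
  html_table()                      # one tibble per matched <table>

pop <- tables[[1]]
head(pop)
```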
The html_table() function returns a tibble by default. Since rvest 1.0 it automatically fills cells that are missing because of merged cells or irregular rows (the older fill = TRUE argument is deprecated and no longer needed).
Extracting attributes with html_attr()
HTML elements carry metadata in their attributes. Links have href, images have src, and many elements have class or id. The html_attr() function extracts a single attribute by name.
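A sketch that pairs each link's text with its href attribute:

```r
library(rvest)

page  <- read_html("https://en.wikipedia.org/wiki/R_(programming_language)")
links <- page |> html_elements("a[href]")

link_df <- data.frame(
  text = links |> html_text2(),
  href = links |> html_attr("href")   # one attribute, by name
)
head(link_df)
```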
Use html_attrs() (plural) to get all attributes of an element as a named character vector. This is useful when you are exploring a page and do not know which attributes are available.
How do you scrape multiple pages with pagination?
Real-world scraping often involves multiple pages. A paginated website might show 20 results per page, with URLs like page=1, page=2, and so on. You handle this by building a vector of URLs and looping through them.
The key pattern is: construct the URLs, loop with a delay between requests, extract data from each page, and combine the results.
The tryCatch() function is essential for production scraping. Wrap each read_html() call in it so that a single failed page does not crash your entire loop.
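The full pattern might look like this — quotes.toscrape.com is a practice site built for scraping, and .quote .text is its real selector:

```r
library(rvest)

urls <- paste0("https://quotes.toscrape.com/page/", 1:3, "/")
all_quotes <- character()

for (url in urls) {
  page <- tryCatch(read_html(url), error = function(e) NULL)
  if (is.null(page)) next                 # a failed page skips, not crashes

  quotes <- page |> html_elements(".quote .text") |> html_text2()
  all_quotes <- c(all_quotes, quotes)

  Sys.sleep(2)                            # polite delay between requests
}

length(all_quotes)
```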
How do sessions help with login-protected and multi-step scraping?
A regular read_html() call is stateless — it fetches one page and forgets everything. If the website sets cookies, requires login, or expects you to navigate through a sequence of pages, you need a session.
The session() function in rvest creates a persistent browser-like session that stores cookies, tracks your navigation history, and maintains headers between requests.
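A sketch of navigating within one stateful session (again using the practice site):

```r
library(rvest)

sess <- session("https://quotes.toscrape.com/")   # cookies persist in sess
sess <- session_jump_to(sess, "/page/2/")         # navigate, keeping state
sess                                              # prints the current URL
session_history(sess)                             # both pages visited so far
```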
Sessions also handle form submission. If a page has a search box or login form, you can fill it in and submit it programmatically.
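For instance, on the practice site's login page (the credentials here are dummy values for illustration):

```r
library(rvest)

sess <- session("https://quotes.toscrape.com/login")

form <- sess |> html_element("form") |> html_form()
form <- html_form_set(form, username = "demo", password = "demo")

sess <- session_submit(sess, form)   # the session now carries the login cookie
```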
How do you scrape politely with robotstxt and rate limiting?
Responsible scraping goes beyond just adding Sys.sleep(). The R ecosystem provides two packages that make polite scraping easy: robotstxt for checking permissions and polite for automated rate-limited scraping.
The robotstxt package reads a website's robots.txt file and tells you whether your scraper is allowed to access a specific path.
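For example:

```r
library(robotstxt)

# Full-URL form: checks the path against the site's robots.txt
paths_allowed("https://en.wikipedia.org/wiki/R_(programming_language)")

# Or check several paths against one domain at once
paths_allowed(paths = c("/wiki/", "/w/"), domain = "en.wikipedia.org")
```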
The polite package wraps rvest and automatically respects robots.txt, enforces rate limits, and caches responses. The workflow has two steps: bow() to introduce yourself, and scrape() to fetch the page.
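A sketch of the two-step workflow (the user agent string is a placeholder — identify yourself honestly):

```r
library(polite)

host <- bow("https://en.wikipedia.org/",
            user_agent = "my-research-scraper (me@example.com)")
host   # prints the crawl delay and whether scraping is permitted

page <- host |>
  nod("wiki/R_(programming_language)") |>   # adjust the path within the bow
  scrape()                                  # waits the crawl delay, caches

page |> rvest::html_element("h1") |> rvest::html_text2()
```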
With bow() + scrape(), the package reads the site's crawl delay, waits the required time between requests, and caches responses so repeated scrapes of the same URL do not hit the server again.

Here is a summary of best practices for polite scraping.
| Practice | How to Implement |
|---|---|
| Check robots.txt | robotstxt::paths_allowed(url) |
| Rate limit requests | Sys.sleep(2) or use the polite package |
| Set a user agent | polite::bow(url, user_agent = "...") |
| Cache responses | polite does this automatically |
| Handle errors gracefully | tryCatch() around read_html() |
| Respect Terms of Service | Read manually before scraping |
Common Mistakes and How to Fix Them
Mistake 1: Using html_text() instead of html_text2()
❌ Wrong:
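A sketch of the anti-pattern (the Wikipedia URL is just an example):

```r
library(rvest)

page <- read_html("https://en.wikipedia.org/wiki/R_(programming_language)")
page |> html_elements("p") |> html_text()
# Output keeps the source's literal tabs, newlines, and runs of spaces
```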
Why it is wrong: html_text() preserves raw HTML whitespace including tabs, newlines, and extra spaces. The output is messy and hard to work with.
✅ Correct:
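The same extraction with the browser-like function:

```r
library(rvest)

page <- read_html("https://en.wikipedia.org/wiki/R_(programming_language)")
page |> html_elements("p") |> html_text2()   # whitespace collapsed cleanly
```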
Mistake 2: Forgetting Sys.sleep() in a scraping loop
❌ Wrong:
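For instance (using the practice site's URL scheme):

```r
library(rvest)

urls  <- paste0("https://quotes.toscrape.com/page/", 1:10, "/")
pages <- lapply(urls, read_html)   # 10 back-to-back requests, no delay
```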
Why it is wrong: Rapid-fire requests can overwhelm the server, trigger rate-limiting defences, or get your IP address banned permanently.
✅ Correct:
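The same loop with a pause before each request:

```r
library(rvest)

urls  <- paste0("https://quotes.toscrape.com/page/", 1:10, "/")
pages <- lapply(urls, function(url) {
  Sys.sleep(2)       # wait 2 seconds before every fetch
  read_html(url)
})
```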
Mistake 3: Expecting rvest to render JavaScript
❌ Wrong:
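A sketch of the failure mode (the URL and selector are hypothetical):

```r
library(rvest)

# A single-page app: the server returns only an empty HTML shell
page <- read_html("https://example.com/spa-dashboard")
page |> html_elements(".data-row")   # empty nodeset -- content is JS-rendered
```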
Why it is wrong: rvest downloads raw HTML and does not execute JavaScript. Single-page applications (SPAs) built with React, Vue, or Angular render their content with JavaScript, so the HTML source has no data in it.
✅ Correct:
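One workaround sketch using the chromote package to render the page in headless Chrome first (URL and selector still hypothetical):

```r
library(chromote)
library(rvest)

b <- ChromoteSession$new()
b$Page$navigate("https://example.com/spa-dashboard")
b$Page$loadEventFired()                     # wait until the page has loaded

html <- b$Runtime$evaluate(
  "document.documentElement.outerHTML"      # grab the rendered DOM
)$result$value

page <- read_html(html)                     # now rvest sees the real content
page |> html_elements(".data-row")
```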
Mistake 4: Using html_element() when you need html_elements()
❌ Wrong:
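For instance:

```r
library(rvest)

page <- read_html("https://en.wikipedia.org/wiki/R_(programming_language)")
page |> html_element("p") |> html_text2()    # returns ONLY the first paragraph
```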
Why it is wrong: html_element() (singular) returns only the first matching element. If you need all paragraphs, you get just one and miss the rest.
✅ Correct:
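Use the plural form to get them all:

```r
library(rvest)

page <- read_html("https://en.wikipedia.org/wiki/R_(programming_language)")
page |> html_elements("p") |> html_text2()   # every paragraph on the page
```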
Mistake 5: Hardcoding CSS selectors that change
❌ Wrong:
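A sketch of the fragile pattern (".css-1x2y3z4" stands in for an auto-generated class name; the URL is hypothetical):

```r
library(rvest)

page <- read_html("https://example.com/products")
page |> html_elements(".css-1x2y3z4") |> html_text2()   # breaks on redeploy
```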
Why it is wrong: Auto-generated class names (common in React/Next.js sites) change with every deployment. Your scraper breaks silently next week.
✅ Correct:
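Prefer stable hooks — semantic tags, ids, and data-* attributes (the data-testid attribute here is hypothetical):

```r
library(rvest)

page <- read_html("https://example.com/products")
page |> html_elements("[data-testid='price']") |> html_text2()
```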
Practice Exercises
Exercise 1: Extract a page title
Read the Wikipedia page for "Data science" and extract just the page title (the <h1> element text).
Click to reveal solution
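One possible solution:

```r
library(rvest)

title <- read_html("https://en.wikipedia.org/wiki/Data_science") |>
  html_element("h1") |>    # the single page heading
  html_text2()

title
#> [1] "Data science"
```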
Explanation: html_element() (singular) returns the first match. Since there is only one <h1> on the page, this is perfect. html_text2() extracts the visible text.
Exercise 2: Scrape a table and inspect it
Read any Wikipedia page that contains a table (e.g., "List of countries by population"). Extract the first table and print its column names and first 5 rows.
Click to reveal solution
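One possible solution (table structure is an assumption about the page's current layout):

```r
library(rvest)

url <- "https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)"

tbl <- read_html(url) |>
  html_elements("table.wikitable") |>
  html_table()

pop <- tbl[[1]]    # first table on the page
names(pop)         # column names
head(pop, 5)       # first 5 rows
```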
Explanation: Wikipedia tables use the wikitable class. html_table() returns a list of data frames, so [[1]] picks the first table.
Exercise 3: Extract and filter links
Read the R programming Wikipedia page. Extract all links (<a> tags), get their href attributes, and filter to only external links that start with http.
Click to reveal solution
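One possible solution:

```r
library(rvest)

links <- read_html("https://en.wikipedia.org/wiki/R_(programming_language)") |>
  html_elements("a") |>
  html_attr("href")

# Keep only absolute external links
external <- links[!is.na(links) & grepl("^https?://", links)]
head(external)
```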
Explanation: html_attr("href") pulls the URL from each <a> tag. The grepl() call filters to links starting with http:// or https://, excluding internal wiki paths.
Exercise 4: Scrape multiple pages
The site https://quotes.toscrape.com/ has 10 pages of quotes at /page/1/ through /page/10/. Scrape pages 1-3, extract all quote texts, and combine them into a single character vector. Add a 2-second delay between requests.
Click to reveal solution
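One possible solution:

```r
library(rvest)

all_quotes <- character()

for (i in 1:3) {
  url <- paste0("https://quotes.toscrape.com/page/", i, "/")

  quotes <- read_html(url) |>
    html_elements(".quote .text") |>   # each quote's text element
    html_text2()

  all_quotes <- c(all_quotes, quotes)
  Sys.sleep(2)                         # polite delay between requests
}

length(all_quotes)
```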
Explanation: The loop iterates through 3 URLs, extracts quotes from each using the .quote .text CSS selector, appends them to a vector, and waits 2 seconds between requests.
Putting It All Together
Let's build a complete scraping workflow from start to finish. We will scrape a table of R packages from CRAN's task view page, clean the data, and produce a summary.
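One possible version, using the WebTechnologies task view (the URL and the link pattern are assumptions about CRAN's current page layout):

```r
library(rvest)
library(tibble)

url  <- "https://cran.r-project.org/web/views/WebTechnologies.html"
page <- read_html(url)

# Package entries are links whose href points into CRAN's /packages/ tree
pkg_links <- page |> html_elements("a[href*='/packages/']")

packages <- tibble(
  package = pkg_links |> html_text2(),
  url     = pkg_links |> html_attr("href")
)

# De-duplicate and summarise
packages <- packages[!duplicated(packages$package), ]
nrow(packages)     # how many packages the task view lists
head(packages)
```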
This complete example demonstrates the full workflow: load rvest, read a page, select elements with CSS selectors, extract text, and organise the results into a clean tibble. You can adapt this pattern to scrape any website — just change the URL and CSS selectors.
Summary
Here is a quick reference for the rvest functions covered in this tutorial.
| Function | Purpose | Example |
|---|---|---|
| read_html(url) | Fetch and parse an HTML page | read_html("https://...") |
| html_elements(css) | Select all matching elements | html_elements("table.wikitable") |
| html_element(css) | Select the first matching element | html_element("h1") |
| html_text2() | Extract visible text (clean) | html_text2() |
| html_table() | Convert <table> to a data frame | html_table() |
| html_attr(name) | Extract one attribute | html_attr("href") |
| html_attrs() | Extract all attributes | html_attrs() |
| session(url) | Start a stateful session | session("https://...") |
| session_jump_to(sess, url) | Navigate within a session | session_jump_to(sess, "/page/2") |
| session_submit(sess, form) | Submit an HTML form | session_submit(sess, form) |
Key takeaways:
- rvest reads HTML with read_html() and selects elements with CSS selectors via html_elements()
- Use html_text2() for text, html_table() for tables, and html_attr() for attributes
- For pagination, loop through URLs with Sys.sleep() between requests
- For cookies and navigation, use session() instead of read_html()
- Always check robots.txt with the robotstxt package before scraping
- The polite package automates rate limiting, robots.txt checking, and caching
FAQ
Can rvest scrape JavaScript-rendered pages?
No. rvest downloads raw HTML and does not execute JavaScript. For single-page applications built with React, Vue, or Angular, use the chromote package (headless Chrome) or RSelenium to render the page first, then pass the HTML to rvest for extraction.
What is the difference between html_element() and html_elements()?
html_elements() (plural) returns all matching nodes as a nodeset. html_element() (singular) returns only the first match. Use the plural form when scraping lists or tables. Use the singular form when you know there is only one match, like a page title.
Is web scraping legal?
It depends on the jurisdiction and the specific website. Scraping publicly available data is generally legal in most countries, but you must respect the website's Terms of Service and robots.txt directives. Never scrape personal data, copyrighted content behind paywalls, or data you agreed not to collect. When in doubt, consult a legal professional.
How fast can I scrape with rvest?
rvest itself is fast — parsing an HTML page takes milliseconds. The bottleneck is network latency and the rate limits you should impose. A typical polite scraping rate is 1 request every 2-5 seconds. The polite package reads the site's preferred crawl delay and enforces it automatically.
What are alternatives to rvest for web scraping in R?
For API-based data collection, use httr2. For JavaScript-heavy sites, use chromote or RSelenium. For large-scale scraping with built-in politeness, use the polite package (which wraps rvest). For XML documents, use the xml2 package directly. Python users who switch to R will find rvest comparable to BeautifulSoup.
References
- Wickham, H. — rvest: Easily Harvest (Scrape) Web Pages. CRAN. Link
- Wickham, H. & Grolemund, G. — R for Data Science, 2nd Edition. Chapter 24: Web Scraping. Link
- rvest package — Web scraping 101 vignette. Tidyverse. Link
- Meissner, P. — robotstxt: A robots.txt Parser and Webbot/Spider/Crawler Permissions Checker. CRAN. Link
- Perepolkin, D. — polite: Be Nice on the Web. CRAN. Link
- Cantino, A. — SelectorGadget: CSS Selector Generation Tool. Link
- W3Schools — CSS Selectors Reference. Link
- Wickham, H. — httr2: Perform HTTP Requests and Process the Responses. CRAN. Link
What's Next?
Now that you can scrape data from websites, explore these related tutorials:
- REST APIs in R with httr2 — When data is available through an API, use httr2 instead of scraping HTML. Learn GET, POST, authentication, and pagination.
- DBI in R: Connect to Any Database — Store your scraped data in a database for efficient querying and long-term storage.
- dplyr filter & select — Clean and transform your scraped data frames with the most-used dplyr verbs.