R Data Frames: Every Operation You'll Need, With 10 Real Examples

A data frame is R's equivalent of a spreadsheet — a table where each column is a vector and each row is an observation. Most real-world R work involves data frames, and mastering them unlocks everything from data cleaning to statistical modeling.

This tutorial covers the essential data frame operations through 10 real-world examples. Later code blocks reuse variables defined in earlier ones, so run them in order.

What Is a Data Frame?

A data frame is a list of vectors of equal length, arranged as columns. Each column can have a different type (numeric, character, logical), but all values within a column must be the same type.

# Create a simple data frame
employees <- data.frame(
  name = c("Alice", "Bob", "Carol", "Dave", "Eve"),
  department = c("Sales", "Engineering", "Sales", "HR", "Engineering"),
  salary = c(55000, 72000, 58000, 48000, 75000),
  years = c(3, 7, 5, 2, 8),
  stringsAsFactors = FALSE
)
employees

Think of it as a spreadsheet: columns are variables, rows are records.
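The "list of equal-length vectors" definition can be checked directly. The sketch below uses a small made-up data frame (the name `demo` and its columns are illustrative, not part of the tutorial's data) and shows that R rejects columns whose lengths cannot be reconciled.

```r
# A small, made-up data frame for illustration
demo <- data.frame(
  id    = 1:3,
  label = c("a", "b", "c"),
  flag  = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

is.list(demo)                      # a data frame is a list under the hood
length(demo)                       # list length = number of columns
col_types <- sapply(demo, class)   # one class per column
col_types

# Columns of length 3 and 2 cannot be combined, so data.frame() raises an error
bad <- tryCatch(
  data.frame(x = 1:3, y = 1:2),
  error = function(e) "error: differing number of rows"
)
bad
```

Because each column is a separate vector, the type rule applies per column, not per table.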

Example 1: Explore a Built-In Dataset

R ships with dozens of built-in datasets. The mtcars dataset, extracted from the 1974 Motor Trend magazine, has specs for 32 cars (1973-74 models).

# First 6 rows
head(mtcars)

# Last 3 rows
tail(mtcars, 3)

# Dimensions: rows x columns
cat("Dimensions:", dim(mtcars), "\n")

# Column names
cat("Columns:", names(mtcars), "\n")

# Structure: the single most useful inspection function
str(mtcars)

# Summary statistics for every column
summary(mtcars)

The str() function is your best friend. It shows every column's type, the first few values, and the dimensions — all in one call.

Example 2: Access Columns

There are three common ways to access a column. The $ syntax is the most readable and the one you will see most often.

# Method 1: $ notation (most common)
cat("MPG values:", head(mtcars$mpg, 8), "\n\n")

# Method 2: [[ ]] with column name
cat("HP values:", head(mtcars[["hp"]], 8), "\n\n")

# Method 3: [ , ] with column number
cat("First column:", head(mtcars[, 1], 8), "\n")

Use [[ ]] when the column name is stored in a variable:

# Useful when column name is in a variable
col_name <- "hp"
cat("Mean of", col_name, ":", mean(mtcars[[col_name]]), "\n")
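The access methods differ in what they return, which matters when the result is passed to another function. The sketch below compares them on mtcars; the variable names are illustrative only.

```r
one_col_df  <- mtcars["hp"]                  # single bracket with a name: still a data frame
one_col_vec <- mtcars[["hp"]]                # double bracket: a plain numeric vector
dropped     <- mtcars[, "hp"]                # [row, col] with one column: simplified to a vector
kept        <- mtcars[, "hp", drop = FALSE]  # drop = FALSE keeps the data frame shape

class(one_col_df)
class(one_col_vec)
class(dropped)
class(kept)
```

Passing drop = FALSE is a reasonable default inside functions, where a result that silently turns into a vector can break later code.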

Example 3: Access Rows and Subsets

Use [row, column] syntax. Leave a side blank to get all rows or all columns.

# Single row
mtcars[1, ]

# Multiple rows
mtcars[1:3, ]

# Specific rows and columns
mtcars[1:5, c("mpg", "hp", "wt")]
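The same [row, column] syntax also accepts row names and negative indices. A brief sketch on mtcars, with illustrative variable names:

```r
# Select a row by its row name (mtcars uses car models as row names)
rx4 <- mtcars["Mazda RX4", c("mpg", "hp")]
rx4

# Negative indices drop rows: remove the first 29, leaving the last 3
last3 <- mtcars[-(1:29), ]
rownames(last3)
```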

Example 4: Filter Rows by Condition

Filtering is one of the most common operations on data frames. Put a logical condition in the row position of [ ].

# Cars with mpg > 25
efficient <- mtcars[mtcars$mpg > 25, ]
cat("Efficient cars:", nrow(efficient), "\n")
efficient[, c("mpg", "hp", "wt")]

# Cars with 6 cylinders AND automatic transmission (am = 0)
subset_cars <- mtcars[mtcars$cyl == 6 & mtcars$am == 0, ]
subset_cars[, c("mpg", "cyl", "am")]

# Cleaner alternative: subset() function
cat("\nUsing subset():\n")
subset(mtcars, mpg > 25, select = c(mpg, hp, wt))
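The two filtering styles behave differently when the condition column has missing values. mtcars has none, so the sketch below uses a small made-up data frame (`scores` is hypothetical) to show the difference.

```r
scores <- data.frame(
  name  = c("Ann", "Ben", "Cy", "Dee"),
  score = c(90, NA, 70, 85),
  stringsAsFactors = FALSE
)

# Logical indexing returns an all-NA row wherever the condition is NA
with_na <- scores[scores$score > 80, ]
nrow(with_na)

# which() keeps only positions where the condition is TRUE
via_which <- scores[which(scores$score > 80), ]

# subset() also treats NA conditions as FALSE
via_subset <- subset(scores, score > 80)
via_subset
```

If the data may contain NAs, wrapping the condition in which() or using subset() avoids the phantom rows.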

Example 5: Add and Remove Columns

# Work with a copy of the first 6 rows
cars <- head(mtcars)

# Add a new column: km per liter
cars$kpl <- round(cars$mpg * 0.425, 1)
cars[, c("mpg", "kpl")]

# Remove a column by setting it to NULL
cars <- head(mtcars)
cars$qsec <- NULL
cat("Columns after removing qsec:", names(cars), "\n\n")

# Select only the columns you want
cars_small <- cars[, c("mpg", "cyl", "hp", "wt")]
cars_small
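When several columns change at once, base R's transform() and setdiff() can shorten the code. This is a sketch of one alternative, not the only idiom; `cars6`, `hp_per_wt`, and `drop_cols` are names invented for the example.

```r
cars6 <- head(mtcars)

# transform() adds or changes columns and returns a new data frame;
# column names can be used directly inside the call
cars6 <- transform(
  cars6,
  kpl       = round(mpg * 0.425, 1),
  hp_per_wt = round(hp / wt, 1)
)

# Drop several columns at once by name
drop_cols <- c("qsec", "vs")
cars6 <- cars6[, setdiff(names(cars6), drop_cols)]
names(cars6)
```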

Example 6: Add and Remove Rows

# Create a data frame
team <- data.frame(
  name = c("Alice", "Bob"),
  score = c(92, 85),
  stringsAsFactors = FALSE
)
cat("Original:\n")
print(team)

# Add a row with rbind()
new_member <- data.frame(name = "Carol", score = 88, stringsAsFactors = FALSE)
team <- rbind(team, new_member)
cat("\nAfter adding Carol:\n")
print(team)

# Remove row 2
team <- team[-2, ]
cat("\nAfter removing row 2:\n")
print(team)
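Two details of rbind() are worth seeing in action: it matches data frame columns by name rather than position, and it fails when the names differ. The sketch below uses made-up single-row data frames (`a`, `b`, `pieces` are illustrative names).

```r
a <- data.frame(name = "Alice", score = 92, stringsAsFactors = FALSE)
b <- data.frame(score = 85, name = "Bob", stringsAsFactors = FALSE)  # columns in a different order

# rbind() lines the columns up by name
ab <- rbind(a, b)
ab

# A column name that does not match is an error
mismatch <- tryCatch(
  rbind(a, data.frame(name = "Carol", points = 88, stringsAsFactors = FALSE)),
  error = function(e) "error"
)
mismatch

# Combine a whole list of data frames in one call
pieces <- list(a, b, data.frame(name = "Carol", score = 88, stringsAsFactors = FALSE))
all_rows <- do.call(rbind, pieces)
all_rows
```

Collecting rows in a list and binding once with do.call() is generally preferable to calling rbind() repeatedly inside a loop, since each call copies the growing data frame.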

Example 7: Sort Data

# Sort by mpg (ascending)
sorted <- mtcars[order(mtcars$mpg), ]
cat("Lowest MPG cars:\n")
head(sorted[, c("mpg", "hp")], 5)

# Sort by mpg (descending) with minus sign
sorted_desc <- mtcars[order(-mtcars$mpg), ]
cat("Highest MPG cars:\n")
head(sorted_desc[, c("mpg", "hp")], 5)

# Sort by two columns: cyl ascending, then mpg descending
cat("\nSorted by cyl then mpg:\n")
sorted2 <- mtcars[order(mtcars$cyl, -mtcars$mpg), ]
head(sorted2[, c("cyl", "mpg", "hp")], 10)

The order() function returns the row positions in sorted order. The minus sign reverses the sort for numeric columns.
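The minus sign does not work on character columns. Two base R options are sketched below on a made-up data frame (`people` is hypothetical): the decreasing argument reverses the whole sort, and xtfrm() converts strings to numeric ranks so that a minus sign can be applied to one key only.

```r
people <- data.frame(
  name = c("Carol", "Alice", "Bob", "Dave"),
  dept = c("Sales", "HR", "Sales", "HR"),
  stringsAsFactors = FALSE
)

# Whole sort in descending order
by_name_desc <- people[order(people$name, decreasing = TRUE), ]
by_name_desc

# Mixed directions: dept ascending, name descending within each dept
mixed <- people[order(people$dept, -xtfrm(people$name)), ]
mixed
```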

Example 8: Create Summary Statistics

# Basic stats for one column
cat("Mean MPG:", mean(mtcars$mpg), "\n")
cat("Median MPG:", median(mtcars$mpg), "\n")
cat("Std Dev:", round(sd(mtcars$mpg), 2), "\n")
cat("Range:", range(mtcars$mpg), "\n\n")

# Apply a function to multiple columns at once
cat("Column means:\n")
sapply(mtcars[, c("mpg", "hp", "wt")], mean)

# Group-level statistics with aggregate()
cat("Average MPG by cylinder count:\n")
aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# Multiple columns at once
cat("\nMultiple stats by cylinder:\n")
aggregate(cbind(mpg, hp, wt) ~ cyl, data = mtcars, FUN = mean)

The aggregate() function splits the data by groups and applies a function to each group. The formula mpg ~ cyl means "mpg grouped by cyl."
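To make the split-and-apply idea concrete, the sketch below reproduces the grouped mean by hand with split() and sapply(), then extends the formula to two grouping columns. Variable names are illustrative.

```r
# Manual version: split mpg into groups by cyl, then average each group
groups <- split(mtcars$mpg, mtcars$cyl)
manual <- sapply(groups, mean)
manual

# aggregate() gives the same numbers in data frame form
agg <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
agg

# Group by two variables: one row per cyl/am combination present in the data
by_two <- aggregate(mpg ~ cyl + am, data = mtcars, FUN = mean)
by_two
```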

Example 9: Merge Two Data Frames

# Customer info
customers <- data.frame(
  id = c(1, 2, 3, 4),
  name = c("Alice", "Bob", "Carol", "Dave"),
  stringsAsFactors = FALSE
)

# Order info
orders <- data.frame(
  id = c(2, 3, 3, 5),
  product = c("Widget", "Gadget", "Widget", "Gizmo"),
  amount = c(29.99, 49.99, 29.99, 19.99),
  stringsAsFactors = FALSE
)

# Inner join (only matching rows)
cat("Inner join:\n")
merge(customers, orders, by = "id")

# Recreate the data
customers <- data.frame(
  id = c(1, 2, 3, 4),
  name = c("Alice", "Bob", "Carol", "Dave"),
  stringsAsFactors = FALSE
)
orders <- data.frame(
  id = c(2, 3, 3, 5),
  product = c("Widget", "Gadget", "Widget", "Gizmo"),
  amount = c(29.99, 49.99, 29.99, 19.99),
  stringsAsFactors = FALSE
)

# Left join (keep all customers)
cat("Left join:\n")
print(merge(customers, orders, by = "id", all.x = TRUE))

# Full join (keep everything)
cat("\nFull join:\n")
print(merge(customers, orders, by = "id", all = TRUE))

Merge types at a glance:

Argument	Join Type	Keeps
(default)	Inner	Only matching rows
`all.x = TRUE`	Left	All rows from first table
`all.y = TRUE`	Right	All rows from second table
`all = TRUE`	Full	All rows from both tables
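In practice the key column often has a different name in each table. merge() handles this with the by.x and by.y arguments. The sketch below uses made-up tables (`customers2`, `orders2`, and their column names are hypothetical) and a left join.

```r
customers2 <- data.frame(
  cust_id = c(1, 2, 3),
  name = c("Alice", "Bob", "Carol"),
  stringsAsFactors = FALSE
)
orders2 <- data.frame(
  customer = c(2, 3, 3),
  product = c("Widget", "Gadget", "Widget"),
  stringsAsFactors = FALSE
)

# Left join on differently named keys; the result keeps the first table's key name
joined <- merge(customers2, orders2,
                by.x = "cust_id", by.y = "customer", all.x = TRUE)
joined

# Customers without orders have NA in the order columns
no_orders <- joined$name[is.na(joined$product)]
no_orders
```

Note that a customer with two orders appears twice in the result, so a join can return more rows than either input.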

Example 10: Work with the Iris Dataset

The iris dataset has measurements for 150 flowers across 3 species. Let's combine everything we've learned.

# Explore the dataset
str(iris)
cat("\n")
head(iris)

# How many of each species?
cat("Species counts:\n")
print(table(iris$Species))

# Average measurements by species
cat("\nMeans by species:\n")
aggregate(. ~ Species, data = iris, FUN = mean)

# Which flower has the longest petal?
longest <- iris[which.max(iris$Petal.Length), ]
cat("Longest petal:\n")
print(longest)

# Filter: setosa flowers with sepal width > 3.5
wide_setosa <- subset(iris, Species == "setosa" & Sepal.Width > 3.5)
cat("\nWide setosa flowers:", nrow(wide_setosa), "\n")
head(wide_setosa)

# Add a new column: petal area estimate
iris$Petal.Area <- round(iris$Petal.Length * iris$Petal.Width, 2)

# Average petal area by species
cat("Average petal area by species:\n")
aggregate(Petal.Area ~ Species, data = iris, FUN = mean)

Quick Reference: Essential Data Frame Functions

Function	Purpose	Example
`data.frame()`	Create a data frame	`data.frame(x = 1:3, y = c("a","b","c"))`
`head()` / `tail()`	First/last rows	`head(df, 10)`
`str()`	Structure overview	`str(df)`
`summary()`	Column statistics	`summary(df)`
`dim()`	Rows x columns	`dim(df)`
`nrow()` / `ncol()`	Row/column count	`nrow(df)`
`names()`	Column names	`names(df)`
`subset()`	Filter rows/columns	`subset(df, x > 5)`
`order()`	Sort rows	`df[order(df$x), ]`
`merge()`	Join two data frames	`merge(df1, df2, by = "id")`
`aggregate()`	Group summaries	`aggregate(y ~ x, data = df, FUN = mean)`
`rbind()` / `cbind()`	Add rows/columns	`rbind(df1, df2)`
`sapply()`	Apply function to columns	`sapply(df, mean)`

Practice Exercises

Exercise 1: Build and Inspect

Create a data frame with 5 products (name, price, quantity). Print its structure and summary.

Solution:

products <- data.frame(
  name = c("Laptop", "Mouse", "Keyboard", "Monitor", "Headset"),
  price = c(999, 29, 79, 349, 89),
  quantity = c(10, 50, 30, 15, 25),
  stringsAsFactors = FALSE
)
str(products)
summary(products)

Exercise 2: Filter and Aggregate

Using mtcars, find the average horsepower of cars with more than 20 mpg.

Solution:

efficient <- mtcars[mtcars$mpg > 20, ]
cat("Number of efficient cars:", nrow(efficient), "\n")
cat("Average horsepower:", round(mean(efficient$hp), 1), "\n")

Exercise 3: Sort and Rank

Sort the iris dataset by Petal.Length in descending order and show the top 5 flowers.

Solution:

sorted_iris <- iris[order(-iris$Petal.Length), ]
head(sorted_iris[, c("Species", "Petal.Length", "Petal.Width")], 5)

Exercise 4: Merge Challenge

Create two data frames — one with student names and majors, another with student names and GPAs. Merge them and find the highest GPA per major.

Solution:

students <- data.frame(
  name = c("Alice", "Bob", "Carol", "Dave", "Eve"),
  major = c("CS", "Math", "CS", "Math", "CS"),
  stringsAsFactors = FALSE
)
grades <- data.frame(
  name = c("Alice", "Bob", "Carol", "Dave", "Eve"),
  gpa = c(3.8, 3.5, 3.9, 3.7, 3.6),
  stringsAsFactors = FALSE
)

combined <- merge(students, grades, by = "name")
aggregate(gpa ~ major, data = combined, FUN = max)

FAQ

What is the difference between a data frame and a matrix?

A data frame can have columns of different types (numeric, character, logical). A matrix must have all values of the same type. Use data frames for real-world datasets. Use matrices for mathematical operations.
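A quick way to see the difference is to build both from the same mixed inputs. In the sketch below (names `m` and `d` are illustrative), the matrix coerces everything to character, while the data frame keeps each column's type.

```r
# Matrix: the integer ids are coerced to character to match the labels
m <- cbind(id = 1:2, label = c("a", "b"))
typeof(m)

# Data frame: each column keeps its own type
d <- data.frame(id = 1:2, label = c("a", "b"), stringsAsFactors = FALSE)
sapply(d, class)
```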

What is a tibble?

A tibble is a modern version of the data frame from the tidyverse. It prints more neatly, never converts strings to factors, and never changes column names. Create one with tibble::tibble() or convert with tibble::as_tibble().

How do I handle large data frames that are slow?

For data frames with millions of rows, consider the data.table package, which is much faster for grouping, filtering, and joining. Alternatively, use dplyr from the tidyverse for a clean syntax with good performance.

How do I save a data frame to a CSV file?

Use write.csv(df, "filename.csv", row.names = FALSE). Setting row.names = FALSE stops R from writing the row names as an extra first column. To read the file back, use read.csv("filename.csv").
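A round trip through a temporary file shows both calls together. This is a minimal sketch: `out_df` and `path` are made-up names, and tempfile() is used only so the example does not leave a file behind.

```r
out_df <- data.frame(
  name = c("Alice", "Bob"),
  score = c(92, 85),
  stringsAsFactors = FALSE
)

path <- tempfile(fileext = ".csv")        # throwaway file location
write.csv(out_df, path, row.names = FALSE)

back <- read.csv(path, stringsAsFactors = FALSE)
back
sapply(back, class)                       # column types are re-inferred on read

unlink(path)                              # clean up the temporary file
```

Because read.csv() infers types from the text, a column may come back with a different type than it had before writing (for example, whole numbers read as integer), so it is worth checking with str() after loading.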

How do I convert a matrix to a data frame?

Use as.data.frame(my_matrix). Column names will be V1, V2, etc. unless the matrix had column names.
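The naming behavior can be seen with a small matrix. The sketch below uses a made-up matrix `m2`, first without and then with column names.

```r
m2 <- matrix(1:6, nrow = 2)

# No column names on the matrix: default names are generated
unnamed_cols <- names(as.data.frame(m2))
unnamed_cols

# With column names set, the data frame inherits them
colnames(m2) <- c("a", "b", "c")
named_cols <- names(as.data.frame(m2))
named_cols
```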

Conclusion

Data frames are where most R work happens. You now know how to create them, inspect them with str() and summary(), access rows and columns, filter with conditions, add and remove columns and rows, sort, aggregate by groups, and merge multiple tables. These operations cover most everyday data manipulation.

For more powerful data wrangling, explore dplyr — it provides a cleaner syntax for the same operations you learned here, plus powerful tools like group_by() and mutate().

r-statistics.co by Selva Prabhakaran
