r-statistics.co by Selva Prabhakaran



R Data Frames: Every Operation You'll Need, With 10 Real Examples

A data frame is R's equivalent of a spreadsheet — a table where each column is a vector and each row is an observation. Most real-world R work involves data frames, and mastering them unlocks everything from data cleaning to statistical modeling.

This tutorial covers every essential data frame operation with 10 real-world examples. Every code block is interactive — click Run to execute, edit to experiment, and Reset to restore the original. Variables persist across blocks, so run them in order.

What Is a Data Frame?

A data frame is a list of vectors of equal length, arranged as columns. Each column can have a different type (numeric, character, logical), but all values within a column must be the same type.

R
# Create a simple data frame
employees <- data.frame(
  name = c("Alice", "Bob", "Carol", "Dave", "Eve"),
  department = c("Sales", "Engineering", "Sales", "HR", "Engineering"),
  salary = c(55000, 72000, 58000, 48000, 75000),
  years = c(3, 7, 5, 2, 8),
  stringsAsFactors = FALSE
)
employees

    

Think of it as a spreadsheet: columns are variables, rows are records.
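
To make the "list of equal-length vectors" idea concrete, here is a minimal sketch. The `staff` data frame and the `mixed` vector are hypothetical names made up for this illustration, not part of the tutorial's data.

```r
# Hypothetical example data (not from the tutorial)
staff <- data.frame(
  name = c("Ana", "Ben", "Cy"),
  salary = c(50000, 62000, 58000),
  remote = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

is.list(staff)        # TRUE: a data frame is a list under the hood
length(staff)         # 3: one list element per column
sapply(staff, class)  # each column has its own single type

# Within one vector (and so within one column), mixed values are
# coerced to a common type
mixed <- c(1, "two", TRUE)
class(mixed)          # "character"
```

Because each column is an ordinary vector, anything you know about vectors (indexing, vectorized arithmetic, coercion) carries over to data frame columns.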

Example 1: Explore a Built-In Dataset

R ships with dozens of built-in datasets. The mtcars dataset has specs for 32 cars (1973-74 models), taken from the 1974 Motor Trend US magazine.

R
# First 6 rows
head(mtcars)

# Last 3 rows
tail(mtcars, 3)

# Dimensions: rows x columns
cat("Dimensions:", dim(mtcars), "\n")

# Column names
cat("Columns:", names(mtcars), "\n")

    
R
# Structure: the single most useful inspection function
str(mtcars)

    
R
# Summary statistics for every column
summary(mtcars)

    

The str() function is your best friend. It shows every column's type, the first few values, and the dimensions — all in one call.

Example 2: Access Columns

There are three common ways to access a column. The $ syntax is the most readable and the most widely used.

R
# Method 1: $ notation (most common)
cat("MPG values:", head(mtcars$mpg, 8), "\n\n")

# Method 2: [[ ]] with column name
cat("HP values:", head(mtcars[["hp"]], 8), "\n\n")

# Method 3: [ , ] with column number
cat("First column:", head(mtcars[, 1], 8), "\n")

    

Use [[ ]] when the column name is stored in a variable:

R
# Useful when column name is in a variable
col_name <- "hp"
cat("Mean of", col_name, ":", mean(mtcars[[col_name]]), "\n")
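
One detail worth seeing side by side is what each accessor returns. The sketch below uses only base R and mtcars; the variable names (`one_col_df`, `one_col_vec`, `kept`) are made up for illustration.

```r
# Single brackets with just a name keep the data frame wrapper
one_col_df <- mtcars["mpg"]
class(one_col_df)    # "data.frame"

# Double brackets (like $) extract the underlying vector
one_col_vec <- mtcars[["mpg"]]
class(one_col_vec)   # "numeric"

# [ , "col"] drops to a vector by default;
# drop = FALSE keeps a one-column data frame
kept <- mtcars[, "mpg", drop = FALSE]
class(kept)          # "data.frame"
```

Reach for the vector forms when you want to compute on the values, and for the data frame forms when the result feeds into code that expects a table.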

    

Example 3: Access Rows and Subsets

Use [row, column] syntax. Leave a side blank to get all rows or all columns.

R
# Single row
mtcars[1, ]

# Multiple rows
mtcars[1:3, ]

# Specific rows and columns
mtcars[1:5, c("mpg", "hp", "wt")]
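
The same [row, column] syntax also accepts row names and negative positions. A small sketch, with `by_name` and `last_two` as illustrative names only:

```r
# Select a row by its row name and pick two columns
by_name <- mtcars["Mazda RX4", c("mpg", "cyl")]
by_name

# Negative positions exclude rows: drop the first 30, keep the rest
last_two <- mtcars[-(1:30), ]
nrow(last_two)   # 2
```

Note the parentheses in -(1:30). Without them, -1:30 would mix negative and positive positions, which R rejects.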

    

Example 4: Filter Rows by Condition

Filtering is the most common operation on data frames. Use logical conditions inside [ ].

R
# Cars with mpg > 25
efficient <- mtcars[mtcars$mpg > 25, ]
cat("Efficient cars:", nrow(efficient), "\n")
efficient[, c("mpg", "hp", "wt")]

    
R
# Cars with 6 cylinders AND automatic transmission (am = 0)
subset_cars <- mtcars[mtcars$cyl == 6 & mtcars$am == 0, ]
subset_cars[, c("mpg", "cyl", "am")]

# Cleaner alternative: subset() function
cat("\nUsing subset():\n")
subset(mtcars, mpg > 25, select = c(mpg, hp, wt))
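
mtcars has no missing values, so the filters above behave identically. With real data that contains NA, the bracket form and subset() can differ. Here is a minimal sketch on a made-up `scores` data frame (hypothetical data, not from the tutorial):

```r
# Hypothetical data with one missing score
scores <- data.frame(id = 1:4, score = c(90, NA, 72, 85))

# A logical condition on NA yields NA, and [ ] turns that into an all-NA row
with_na <- scores[scores$score > 80, ]
nrow(with_na)    # 3: two real matches plus one NA row

# which() keeps only positions where the condition is TRUE
clean_which <- scores[which(scores$score > 80), ]

# subset() treats NA in the condition as FALSE
clean_subset <- subset(scores, score > 80)

clean_which$id   # 1 4
clean_subset$id  # 1 4
```

If your filtered result has unexpected rows full of NA, a missing value in the filter column is a likely cause.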

    

Example 5: Add and Remove Columns

R
# Work with a copy of the first 6 rows
cars <- head(mtcars)

# Add a new column: km per liter
cars$kpl <- round(cars$mpg * 0.425, 1)
cars[, c("mpg", "kpl")]

    
R
# Remove a column by setting it to NULL
cars <- head(mtcars)
cars$qsec <- NULL
cat("Columns after removing qsec:", names(cars), "\n\n")

# Select only the columns you want
cars_small <- cars[, c("mpg", "cyl", "hp", "wt")]
cars_small
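
Two base R variations on the same idea, shown as a sketch: transform() adds a column without repeating the data frame name, and setdiff() drops several columns by name in one step. The names `cars2`, `wt_kg`, and `drop_cols` are illustrative; the conversion assumes wt is in units of 1000 lb, as the mtcars documentation states.

```r
cars2 <- head(mtcars)

# Add a column with transform(): weight in kg (wt is in 1000 lb)
cars2 <- transform(cars2, wt_kg = round(wt * 453.592))

# Drop several columns by name
drop_cols <- c("qsec", "vs")
cars2 <- cars2[, setdiff(names(cars2), drop_cols)]

names(cars2)
```

Dropping by name with setdiff() is usually safer than dropping by position, because it keeps working if the column order changes.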

    

Example 6: Add and Remove Rows

R
# Create a data frame
team <- data.frame(
  name = c("Alice", "Bob"),
  score = c(92, 85),
  stringsAsFactors = FALSE
)
cat("Original:\n")
print(team)

# Add a row with rbind()
new_member <- data.frame(name = "Carol", score = 88, stringsAsFactors = FALSE)
team <- rbind(team, new_member)
cat("\nAfter adding Carol:\n")
print(team)

# Remove row 2
team <- team[-2, ]
cat("\nAfter removing row 2:\n")
print(team)
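
Two behaviors of rbind() and row removal are easy to miss, so here is a small sketch on a separate, made-up `team2` data frame: rbind() matches data frame columns by name (not position) and fails when names differ, and row names keep their gaps after a row is removed unless you reset them.

```r
team2 <- data.frame(name = c("Alice", "Bob"), score = c(92, 85),
                    stringsAsFactors = FALSE)

# Columns are matched by name, so a different column order is fine
newbie <- data.frame(score = 77, name = "Dan", stringsAsFactors = FALSE)
team2 <- rbind(team2, newbie)

# After removing row 1 the row names are "2" and "3"; reset them to 1..n
team2 <- team2[-1, ]
rownames(team2) <- NULL
team2

# Mismatched column names raise an error; catch it to show the outcome
result <- tryCatch(
  rbind(team2, data.frame(name = "Eve", points = 70)),
  error = function(e) "error"
)
result   # "error"
```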

    

Example 7: Sort Data

R
# Sort by mpg (ascending)
sorted <- mtcars[order(mtcars$mpg), ]
cat("Lowest MPG cars:\n")
head(sorted[, c("mpg", "hp")], 5)

    
R
# Sort by mpg (descending) with minus sign
sorted_desc <- mtcars[order(-mtcars$mpg), ]
cat("Highest MPG cars:\n")
head(sorted_desc[, c("mpg", "hp")], 5)

# Sort by two columns: cyl ascending, then mpg descending
cat("\nSorted by cyl then mpg:\n")
sorted2 <- mtcars[order(mtcars$cyl, -mtcars$mpg), ]
head(sorted2[, c("cyl", "mpg", "hp")], 10)

    

The order() function returns the row positions in sorted order. The minus sign reverses the sort for numeric columns.
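
Since the minus sign only works on numeric columns, character columns need a different approach. The sketch below, on a made-up `items` data frame, shows two base R options: decreasing = TRUE for a single key, and -xtfrm() when you mix directions across keys.

```r
# Hypothetical data
items <- data.frame(name = c("b", "a", "c", "a"), n = c(2, 5, 1, 3),
                    stringsAsFactors = FALSE)

# order() returns row positions, not sorted values
order(items$n)   # 3 1 4 2

# Descending sort on a character column
by_name_desc <- items[order(items$name, decreasing = TRUE), ]

# Mixed directions: name descending, then n ascending.
# xtfrm() converts the character key to sortable numbers, so minus works.
mixed_sort <- items[order(-xtfrm(items$name), items$n), ]
mixed_sort
```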

Example 8: Create Summary Statistics

R
# Basic stats for one column
cat("Mean MPG:", mean(mtcars$mpg), "\n")
cat("Median MPG:", median(mtcars$mpg), "\n")
cat("Std Dev:", round(sd(mtcars$mpg), 2), "\n")
cat("Range:", range(mtcars$mpg), "\n\n")

# Apply a function to multiple columns at once
cat("Column means:\n")
sapply(mtcars[, c("mpg", "hp", "wt")], mean)

    
R
# Group-level statistics with aggregate()
cat("Average MPG by cylinder count:\n")
aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# Multiple columns at once
cat("\nMultiple stats by cylinder:\n")
aggregate(cbind(mpg, hp, wt) ~ cyl, data = mtcars, FUN = mean)

    

The aggregate() function splits the data by groups and applies a function to each group. The formula mpg ~ cyl means "mpg grouped by cyl."
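
As a rough illustration of the formula interface, you can add a second grouping variable on the right-hand side, and you can cross-check a one-variable result with tapply(). The object names here (`by_cyl`, `by_cyl_am`, `check`) are illustrative only.

```r
# One grouping variable: a data frame with one row per cyl value
by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# Two grouping variables: one row per cyl/am combination
by_cyl_am <- aggregate(mpg ~ cyl + am, data = mtcars, FUN = mean)
by_cyl_am

# tapply() computes the same group means but returns a named array
check <- tapply(mtcars$mpg, mtcars$cyl, mean)
all.equal(by_cyl$mpg, as.vector(check))   # TRUE
```

aggregate() is convenient when you want a data frame back for further filtering or merging, while tapply() is handy for a quick look at one statistic.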

Example 9: Merge Two Data Frames

R
# Customer info
customers <- data.frame(
  id = c(1, 2, 3, 4),
  name = c("Alice", "Bob", "Carol", "Dave"),
  stringsAsFactors = FALSE
)

# Order info
orders <- data.frame(
  id = c(2, 3, 3, 5),
  product = c("Widget", "Gadget", "Widget", "Gizmo"),
  amount = c(29.99, 49.99, 29.99, 19.99),
  stringsAsFactors = FALSE
)

# Inner join (only matching rows)
cat("Inner join:\n")
merge(customers, orders, by = "id")

    
R
# Recreate the data
customers <- data.frame(
  id = c(1, 2, 3, 4),
  name = c("Alice", "Bob", "Carol", "Dave"),
  stringsAsFactors = FALSE
)
orders <- data.frame(
  id = c(2, 3, 3, 5),
  product = c("Widget", "Gadget", "Widget", "Gizmo"),
  amount = c(29.99, 49.99, 29.99, 19.99),
  stringsAsFactors = FALSE
)

# Left join (keep all customers)
cat("Left join:\n")
print(merge(customers, orders, by = "id", all.x = TRUE))

# Full join (keep everything)
cat("\nFull join:\n")
print(merge(customers, orders, by = "id", all = TRUE))

    

Merge types at a glance:

| Argument | Join Type | Keeps |
| --- | --- | --- |
| (default) | Inner | Only matching rows |
| all.x = TRUE | Left | All rows from first table |
| all.y = TRUE | Right | All rows from second table |
| all = TRUE | Full | All rows from both tables |
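
The right join is the one type not run above. The sketch below shows it on small made-up tables (`customers2` and `orders2` are hypothetical), and also uses by.x and by.y for the common case where the key column has a different name in each table.

```r
# Hypothetical tables whose key columns have different names
customers2 <- data.frame(id = c(1, 2, 3),
                         name = c("Alice", "Bob", "Carol"),
                         stringsAsFactors = FALSE)
orders2 <- data.frame(customer_id = c(2, 3, 5),
                      amount = c(29.99, 49.99, 19.99))

# Right join: keep every order, even when no customer matches
right <- merge(customers2, orders2,
               by.x = "id", by.y = "customer_id",
               all.y = TRUE)
right
```

The key column in the result takes its name from by.x, and the order with no matching customer (id 5) gets NA in the name column.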

Example 10: Work with the Iris Dataset

The iris dataset has measurements for 150 flowers across 3 species. Let's combine everything we've learned.

R
# Explore the dataset
str(iris)
cat("\n")
head(iris)

    
R
# How many of each species?
cat("Species counts:\n")
print(table(iris$Species))

# Average measurements by species
cat("\nMeans by species:\n")
aggregate(. ~ Species, data = iris, FUN = mean)

    
R
# Which flower has the longest petal?
longest <- iris[which.max(iris$Petal.Length), ]
cat("Longest petal:\n")
print(longest)

# Filter: setosa flowers with sepal width > 3.5
wide_setosa <- subset(iris, Species == "setosa" & Sepal.Width > 3.5)
cat("\nWide setosa flowers:", nrow(wide_setosa), "\n")
head(wide_setosa)

    
R
# Add a new column: petal area estimate
iris$Petal.Area <- round(iris$Petal.Length * iris$Petal.Width, 2)

# Average petal area by species
cat("Average petal area by species:\n")
aggregate(Petal.Area ~ Species, data = iris, FUN = mean)

    

Quick Reference: Essential Data Frame Functions

| Function | Purpose | Example |
| --- | --- | --- |
| data.frame() | Create a data frame | data.frame(x = 1:3, y = c("a","b","c")) |
| head() / tail() | First/last rows | head(df, 10) |
| str() | Structure overview | str(df) |
| summary() | Column statistics | summary(df) |
| dim() | Rows x columns | dim(df) |
| nrow() / ncol() | Row/column count | nrow(df) |
| names() | Column names | names(df) |
| subset() | Filter rows/columns | subset(df, x > 5) |
| order() | Sort rows | df[order(df$x), ] |
| merge() | Join two data frames | merge(df1, df2, by = "id") |
| aggregate() | Group summaries | aggregate(y ~ x, data = df, FUN = mean) |
| rbind() / cbind() | Add rows/columns | rbind(df1, df2) |
| sapply() | Apply function to columns | sapply(df, mean) |

Practice Exercises

Exercise 1: Build and Inspect

Create a data frame with 5 products (name, price, quantity). Print its structure and summary.

R
# Your code here

    
Solution:
products <- data.frame(
  name = c("Laptop", "Mouse", "Keyboard", "Monitor", "Headset"),
  price = c(999, 29, 79, 349, 89),
  quantity = c(10, 50, 30, 15, 25),
  stringsAsFactors = FALSE
)
str(products)
summary(products)

Exercise 2: Filter and Aggregate

Using mtcars, find the average horsepower of cars with more than 20 mpg.

R
# Your code here

    
Solution:
efficient <- mtcars[mtcars$mpg > 20, ]
cat("Number of efficient cars:", nrow(efficient), "\n")
cat("Average horsepower:", round(mean(efficient$hp), 1), "\n")

Exercise 3: Sort and Rank

Sort the iris dataset by Petal.Length in descending order and show the top 5 flowers.

R
# Your code here

    
Solution:
sorted_iris <- iris[order(-iris$Petal.Length), ]
head(sorted_iris[, c("Species", "Petal.Length", "Petal.Width")], 5)

Exercise 4: Merge Challenge

Create two data frames — one with student names and majors, another with student names and GPAs. Merge them and find the highest GPA per major.

R
# Your code here

    
Solution:
students <- data.frame(
  name = c("Alice", "Bob", "Carol", "Dave", "Eve"),
  major = c("CS", "Math", "CS", "Math", "CS"),
  stringsAsFactors = FALSE
)
grades <- data.frame(
  name = c("Alice", "Bob", "Carol", "Dave", "Eve"),
  gpa = c(3.8, 3.5, 3.9, 3.7, 3.6),
  stringsAsFactors = FALSE
)

combined <- merge(students, grades, by = "name")
aggregate(gpa ~ major, data = combined, FUN = max)

FAQ

What is the difference between a data frame and a matrix?

A data frame can have columns of different types (numeric, character, logical). A matrix must have all values of the same type. Use data frames for real-world datasets. Use matrices for mathematical operations.
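
A quick way to see the difference is to build both from the same values. This is a minimal sketch; `m_mixed`, `df_mixed`, and `num_matrix` are names chosen for illustration.

```r
# A matrix holds one type, so mixing numbers and text coerces all to character
m_mixed <- cbind(id = 1:3, label = c("a", "b", "c"))
typeof(m_mixed)            # "character"

# A data frame keeps each column's own type
df_mixed <- data.frame(id = 1:3, label = c("a", "b", "c"),
                       stringsAsFactors = FALSE)
sapply(df_mixed, class)

# All-numeric columns convert cleanly to a matrix for math operations
num_matrix <- as.matrix(mtcars[, c("mpg", "hp")])
is.numeric(num_matrix)     # TRUE
```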

What is a tibble?

A tibble is a modern version of the data frame from the tidyverse. It prints more neatly, never converts strings to factors, and never changes column names. Create one with tibble::tibble() or convert with tibble::as_tibble().

How do I handle large data frames that are slow?

For data frames with millions of rows, consider the data.table package, which is much faster for grouping, filtering, and joining. Alternatively, use dplyr from the tidyverse for a clean syntax with good performance.

How do I save a data frame to a CSV file?

Use write.csv(df, "filename.csv", row.names = FALSE). Setting row.names = FALSE stops R from writing the row names as an extra first column. To read the file back, use read.csv("filename.csv").
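
Here is a small round-trip sketch that writes to a temporary file, so it does not leave anything behind in your working directory. The `cities` data frame and the `path` and `back` variables are made up for this example.

```r
# Hypothetical data
cities <- data.frame(id = 1:3, city = c("Oslo", "Lima", "Pune"),
                     stringsAsFactors = FALSE)

# Write to a temporary file instead of the working directory
path <- tempfile(fileext = ".csv")
write.csv(cities, path, row.names = FALSE)

# Read it back and compare with the original
back <- read.csv(path, stringsAsFactors = FALSE)
names(back)               # "id" "city"
all.equal(cities, back)   # TRUE

# Clean up the temporary file
unlink(path)
```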

How do I convert a matrix to a data frame?

Use as.data.frame(my_matrix). Column names will be V1, V2, etc. unless the matrix had column names.
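
A short sketch of both cases, using a made-up matrix `m`:

```r
# A matrix without column names gets V1, V2, ... after conversion
m <- matrix(1:6, nrow = 2)
unnamed <- as.data.frame(m)
names(unnamed)   # "V1" "V2" "V3"

# If the matrix has column names, they carry over
colnames(m) <- c("a", "b", "c")
named <- as.data.frame(m)
names(named)     # "a" "b" "c"
```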

Conclusion

Data frames are where most R work happens. You now know how to create them, inspect them with str() and summary(), access rows and columns, filter with conditions, add and remove columns and rows, sort, aggregate by groups, and merge multiple tables. These operations cover most everyday data manipulation.

For more powerful data wrangling, explore dplyr — it provides a cleaner syntax for the same operations you learned here, plus powerful tools like group_by() and mutate().