R Data Frames: Every Operation You'll Need, With 10 Real Examples

A data frame in R is a table of rows and columns, like a spreadsheet or a database table, where each column can be a different type. It's the single most important data structure for real analysis work, and it powers everything from lm() to ggplot2.

What is an R data frame and how do you build one?

Under the hood, a data frame is just a list of equal-length vectors, with each vector being one column. That's why columns can be different types (numeric, character, logical) but a column itself must be one type. Let's build one and look at it.

R: Build employees with four columns
employees <- data.frame(
  name   = c("Alice", "Bob", "Carol", "David", "Eve"),
  age    = c(29, 42, 31, 55, 38),
  salary = c(65000, 82000, 70000, 95000, 78000),
  remote = c(TRUE, FALSE, TRUE, FALSE, TRUE)
)
employees
#>    name age salary remote
#> 1 Alice  29  65000   TRUE
#> 2   Bob  42  82000  FALSE
#> 3 Carol  31  70000   TRUE
#> 4 David  55  95000  FALSE
#> 5   Eve  38  78000   TRUE

  

Five rows, four columns, three types (character, numeric, and logical). That's a typical data frame in a realistic shape. You'll build many of these from scratch for demos, tests, and quick prototypes, and many more from read.csv() when you load real data.
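The "list of equal-length vectors" idea can be checked directly. Here is a minimal sketch on a made-up `staff` table (a hypothetical example, separate from `employees`):

```r
# Hypothetical table used only for this illustration
staff <- data.frame(
  name   = c("Ann", "Ben", "Cy"),
  age    = c(30, 45, 28),
  remote = c(TRUE, FALSE, TRUE)
)

is.list(staff)                       # TRUE: a data frame is a list underneath
length(staff)                        # 3: length() counts columns, not rows
col_classes <- sapply(staff, class)  # one class per column
col_classes
lengths(staff)                       # every column has the same length (3)
```

Because the data frame is a list, list tools such as sapply() and lapply() iterate over its columns.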

Anatomy of an R data frame

Figure 1: A data frame is a list of equal-length column vectors. Each column has a type; each row spans all columns.

Tip
Older R (before 4.0) used stringsAsFactors = TRUE by default, which silently turned character columns into factors. That default is gone, but you'll still see tutorials setting stringsAsFactors = FALSE explicitly. Doing so is harmless but no longer required.
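To see the difference for yourself, here is a small sketch that assumes R 4.0 or later; the two-city tables are made up for the illustration:

```r
# Default in R >= 4.0: character columns stay character
keep_chr <- data.frame(city = c("Oslo", "Lima"))
class(keep_chr$city)     # "character" on R >= 4.0

# Opting in to the old behaviour converts them to factors
as_fct <- data.frame(city = c("Oslo", "Lima"), stringsAsFactors = TRUE)
class(as_fct$city)       # "factor"
levels(as_fct$city)      # factor levels are sorted: "Lima" "Oslo"
```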

Try it: Build a data frame with three cities, their populations, and whether they're coastal.

R: Exercise: Build ex_cities
# your turn
ex_cities <- data.frame(
  city       = c("___", "___", "___"),
  population = c(___, ___, ___),
  coastal    = c(___, ___, ___)
)

  
R: Build ex_cities solution
ex_cities <- data.frame(
  city       = c("Mumbai", "Bengaluru", "Chennai"),
  population = c(20411000, 12765000, 11324000),
  coastal    = c(TRUE, FALSE, TRUE)
)
ex_cities
#>        city population coastal
#> 1    Mumbai   20411000    TRUE
#> 2 Bengaluru   12765000   FALSE
#> 3   Chennai   11324000    TRUE

  

Each argument to data.frame() becomes one column, and the three input vectors all have length 3, so the rows line up automatically. Because each column can hold a different type (character, numeric, logical), you get a small mixed-type table in one call.
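What happens when the lengths don't match? A short sketch with made-up vectors: data.frame() recycles a shorter vector only when it divides the longest one evenly, and otherwise stops with an error.

```r
# Length 2 fits into length 4 exactly, so `flag` is recycled
ok <- data.frame(id = 1:4, flag = c(TRUE, FALSE))
ok$flag                  # TRUE FALSE TRUE FALSE

# Length 2 does not fit into length 3: data.frame() raises an error,
# captured here as a message string so the script keeps running
bad <- tryCatch(
  data.frame(id = 1:3, flag = c(TRUE, FALSE)),
  error = function(e) conditionMessage(e)
)
bad
```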

How do you inspect a data frame?

Before you touch a new data frame, you want a quick profile: how big is it, what types are the columns, what's in the first few rows? R gives you a handful of inspection functions that together answer every "what am I looking at?" question.

R: Profile a data frame with str
dim(employees)
#> [1] 5 4
nrow(employees)
#> [1] 5
ncol(employees)
#> [1] 4
str(employees)
#> 'data.frame': 5 obs. of 4 variables:
#>  $ name  : chr "Alice" "Bob" "Carol" "David" ...
#>  $ age   : num 29 42 31 55 38
#>  $ salary: num 65000 82000 70000 95000 78000
#>  $ remote: logi TRUE FALSE TRUE FALSE TRUE
head(employees, 3)
#>    name age salary remote
#> 1 Alice  29  65000   TRUE
#> 2   Bob  42  82000  FALSE
#> 3 Carol  31  70000   TRUE
summary(employees$salary)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>   65000   70000   78000   78000   82000   95000

  

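str() is the workhorse: it reports the shape, column names, types, and a preview of the values, one line per column. summary() gives the minimum, quartiles, median, mean, and maximum for numeric columns and per-level counts for factors; for a character column it reports only length, class, and mode.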
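How summary() treats a column depends on its type. A quick sketch on a hypothetical `pets` table shows the three cases side by side:

```r
# Made-up data for the illustration
pets <- data.frame(
  kind   = c("cat", "dog", "cat", "cat"),
  weight = c(4.1, 12.5, 3.8, 5.0)
)

summary(pets$weight)                      # min, quartiles, median, mean, max
summary(pets$kind)                        # character: only Length / Class / Mode
kind_counts <- summary(factor(pets$kind)) # factor: one count per level
kind_counts
table(pets$kind)                          # counts without converting to factor
```

If you want counts for a character column, table() is usually the shortest route.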

Try it: Print the names of all columns in employees using names() or colnames().

R: Exercise: List column names
# one line
names(employees)

  
R: List column names solution
names(employees)
#> [1] "name"   "age"    "salary" "remote"

  

names() returns the column names of a data frame as a character vector, which is the same thing colnames() gives you. Prefer names() in base R because it's shorter and works on lists too.

How do you select columns?

There are four ways to pull a column out of a data frame, and you'll see all of them in real code. Let's work through them, because the differences matter, especially when one returns a vector and another returns a one-column data frame.

R: Four ways to select a column
employees$salary
#> [1] 65000 82000 70000 95000 78000
employees[["salary"]]
#> [1] 65000 82000 70000 95000 78000
employees[, "salary"]
#> [1] 65000 82000 70000 95000 78000
employees[, c("name", "salary")]
#>    name salary
#> 1 Alice  65000
#> 2   Bob  82000
#> 3 Carol  70000
#> 4 David  95000
#> 5   Eve  78000

  

The first three return the column as a plain vector. The fourth, selecting multiple columns, returns a smaller data frame. That asymmetry is a classic R gotcha: df[, "x"] is a vector, but df[, c("x", "y")] is a data frame.

Key Insight
Use df$col for quick interactive work, df[["col"]] when the column name is in a variable, and df[, cols] (with a character vector) when you want a sub-data-frame with multiple columns.
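The vector-versus-data-frame distinction is easy to test. The sketch below uses a made-up `scores` table and a column name stored in a variable; drop = FALSE is the base-R argument that keeps a single selected column as a data frame.

```r
# Hypothetical table for the illustration
scores <- data.frame(id = 1:3, math = c(90, 72, 85), art = c(70, 88, 95))

col <- "math"                    # column name held in a variable
scores[[col]]                    # vector
scores[, col]                    # also a vector (a single column is dropped)
scores[, col, drop = FALSE]      # stays a one-column data frame
scores[col]                      # list-style indexing: also a data frame

is.data.frame(scores[, col])                 # FALSE
is.data.frame(scores[, col, drop = FALSE])   # TRUE
```

Reaching for drop = FALSE is a common choice inside functions, where the number of selected columns isn't known in advance.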

Try it: Pull just the age column out of employees as a vector, then compute its mean.

R: Exercise: Mean of the age column
ex_mean_age <- mean(employees$___)
ex_mean_age

  
R: Mean of the age column solution
ex_mean_age <- mean(employees$age)
ex_mean_age
#> [1] 39

  

employees$age returns the age column as a plain numeric vector, which is exactly what mean() expects. The mean of c(29, 42, 31, 55, 38) is 195/5 = 39.

How do you filter rows?

Filtering rows is the operation you'll do most often. In base R, you index with a logical vector in the row position: df[condition, ]. The comma is critical: an empty column position means "all columns." Leave it out and R treats the logical vector as a column selector instead.

R: Filter rows with a logical index
employees[employees$age > 35, ]
#>    name age salary remote
#> 2   Bob  42  82000  FALSE
#> 4 David  55  95000  FALSE
#> 5   Eve  38  78000   TRUE
employees[employees$remote == TRUE, ]
#>    name age salary remote
#> 1 Alice  29  65000   TRUE
#> 3 Carol  31  70000   TRUE
#> 5   Eve  38  78000   TRUE
employees[employees$salary > 70000 & employees$age < 50, ]
#>   name age salary remote
#> 2  Bob  42  82000  FALSE
#> 5  Eve  38  78000   TRUE

  

Combine conditions with & (and) and | (or). Never use && or || for filtering: those are scalar operators for single TRUE/FALSE values, and since R 4.3 they raise an error when given a longer vector.

Warning
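df[df$x > 5] (without the comma) treats the logical vector as a column selector, not a row filter. Depending on where the TRUEs fall, it either silently returns the wrong columns or fails with an "undefined columns selected" error. Always include the comma: df[df$x > 5, ].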
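Here is the comma mistake in miniature, on a made-up three-by-three `readings` table chosen so that the no-comma form runs without an error and quietly returns the wrong thing:

```r
# Hypothetical table: 3 rows, 3 columns
readings <- data.frame(
  x = c(2, 8, 9),
  y = c("a", "b", "c"),
  z = c(TRUE, FALSE, TRUE)
)

keep <- readings$x > 5       # FALSE TRUE TRUE

readings[keep, ]             # rows 2 and 3, all columns
readings[keep]               # no comma: columns 2 and 3 (y and z), all rows

dim(readings[keep, ])        # 2 3
dim(readings[keep])          # 3 2
```

Checking dim() after a filter is a cheap way to catch this: the column count should not change.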

Try it: Filter employees to rows where salary is below 75000.

R: Exercise: Filter by salary threshold
# include the comma!
employees[employees$salary < 75000, ]

  
R: Filter by salary threshold solution
employees[employees$salary < 75000, ]
#>    name age salary remote
#> 1 Alice  29  65000   TRUE
#> 3 Carol  31  70000   TRUE

  

employees$salary < 75000 is a logical vector aligned with the rows, and using it in the row position of [ , ] keeps only the rows that match. Remember the comma after the condition: it tells R "all columns" and is the difference between filtering rows and accidentally picking columns.

How do you add or modify columns?

Adding a column uses the same syntax as selecting one: just assign into it. This works whether the column already exists (modify) or doesn't (create). The new column can be a constant, a vector, or a vectorized computation on existing columns.

R: Add bonus and seniority columns
employees$bonus <- employees$salary * 0.10
employees
#>    name age salary remote bonus
#> 1 Alice  29  65000   TRUE  6500
#> 2   Bob  42  82000  FALSE  8200
#> 3 Carol  31  70000   TRUE  7000
#> 4 David  55  95000  FALSE  9500
#> 5   Eve  38  78000   TRUE  7800
employees$seniority <- ifelse(employees$age >= 40, "senior", "junior")
employees$seniority
#> [1] "junior" "senior" "junior" "senior" "junior"
employees$salary <- employees$salary + employees$bonus
employees$salary
#> [1]  71500  90200  77000 104500  85800

  

ifelse() is the vectorized counterpart to the scalar if/else: it applies the condition element by element across the vector. To drop a column, assign NULL: employees$bonus <- NULL.
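The four moves (create, derive with ifelse(), overwrite, drop) fit in a few lines. This sketch uses a made-up `orders` table so it doesn't disturb `employees`:

```r
# Hypothetical table for the illustration
orders <- data.frame(
  item  = c("pen", "ink", "pad"),
  price = c(2, 15, 6),
  qty   = c(10, 1, 4)
)

orders$total <- orders$price * orders$qty                    # create: 20 15 24
orders$size  <- ifelse(orders$total >= 20, "big", "small")   # element-wise test
orders$price <- orders$price * 1.1                           # overwrite in place
orders$qty   <- NULL                                         # drop the column

names(orders)    # "item" "price" "total" "size"
```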

Try it: Add a column high_earner that is TRUE when salary is above 80000.

R: Exercise: Flag high earners
employees$high_earner <- employees$salary > ___
employees

  
R: Flag high earners solution
employees$high_earner <- employees$salary > 80000
employees
#>    name age salary remote high_earner
#> 1 Alice  29  65000   TRUE       FALSE
#> 2   Bob  42  82000  FALSE        TRUE
#> 3 Carol  31  70000   TRUE       FALSE
#> 4 David  55  95000  FALSE        TRUE
#> 5   Eve  38  78000   TRUE       FALSE

  

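Assigning into a new name, employees$high_earner, appends a column because the name doesn't exist yet. The right-hand side is a vectorized comparison, so R compares every salary with 80000 and stores the logical result as the new column. The solution output here, like the other exercise solutions, is shown against the original employees table; if you have already run the bonus example above, salary includes the bonus, so Eve (85800) is flagged TRUE as well and the bonus and seniority columns also appear.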

How do you sort and rank rows?

Sorting a data frame means reordering its rows by one or more columns. Base R uses order(), which returns the positions that would put a vector in sorted order. You then use those positions as a row index.

R: Sort by salary ascending and descending
employees[order(employees$salary), ]
#>    name age salary remote bonus seniority
#> 1 Alice  29  71500   TRUE  6500    junior
#> 3 Carol  31  77000   TRUE  7000    junior
#> 5   Eve  38  85800   TRUE  7800    junior
#> 2   Bob  42  90200  FALSE  8200    senior
#> 4 David  55 104500  FALSE  9500    senior
employees[order(-employees$salary), ]
#>    name age salary remote bonus seniority
#> 4 David  55 104500  FALSE  9500    senior
#> 2   Bob  42  90200  FALSE  8200    senior
#> 5   Eve  38  85800   TRUE  7800    junior
#> 3 Carol  31  77000   TRUE  7000    junior
#> 1 Alice  29  71500   TRUE  6500    junior
employees[order(employees$seniority, -employees$salary), ]
#>    name age salary remote bonus seniority
#> 5   Eve  38  85800   TRUE  7800    junior
#> 3 Carol  31  77000   TRUE  7000    junior
#> 1 Alice  29  71500   TRUE  6500    junior
#> 4 David  55 104500  FALSE  9500    senior
#> 2   Bob  42  90200  FALSE  8200    senior

  

A minus sign in front of a numeric column reverses that column's sort order. The minus trick does not work on character or factor columns; for those, use order(col, decreasing = TRUE) to reverse every sort key at once, or -xtfrm(col) to reverse a single key.
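Descending order on a text column is worth seeing once. The sketch below uses a made-up `teams` table; xtfrm() is the base-R helper that maps a vector to numbers that sort the same way, which is what makes the minus sign usable.

```r
# Hypothetical table for the illustration
teams <- data.frame(
  team  = c("red", "blue", "red", "blue"),
  score = c(10, 30, 20, 40)
)

# decreasing = TRUE applies to every key passed to order()
all_desc <- teams[order(teams$team, decreasing = TRUE), ]
all_desc$team            # "red" "red" "blue" "blue"

# Mixed directions: team descending, score ascending within each team
mixed <- teams[order(-xtfrm(teams$team), teams$score), ]
mixed$score              # 10 20 30 40
```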

Try it: Sort employees by age ascending.

R: Exercise: Sort employees by age
employees[order(employees$___), ]

  
R: Sort employees by age solution
employees[order(employees$age), ]
#>    name age salary remote
#> 1 Alice  29  65000   TRUE
#> 3 Carol  31  70000   TRUE
#> 5   Eve  38  78000   TRUE
#> 2   Bob  42  82000  FALSE
#> 4 David  55  95000  FALSE

  

order(employees$age) returns the row positions that would sort age ascending, c(1, 3, 5, 2, 4), and indexing with them reorders the data frame. Notice the row numbers on the left: they carry over from the original, which is how you can tell sorting rearranged rows rather than rebuilding the data frame.
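It helps to look at order() on a bare vector before using it as a row index. This sketch rebuilds just the name and age columns in a separate, hypothetical `people` table and also shows how to renumber the rows when the carried-over labels are not wanted:

```r
ages <- c(29, 42, 31, 55, 38)      # same values as the age column
order(ages)                         # 1 3 5 2 4: positions, not values
ages[order(ages)]                   # 29 31 38 42 55, same as sort(ages)

people <- data.frame(
  name = c("Alice", "Bob", "Carol", "David", "Eve"),
  age  = ages
)
by_age <- people[order(people$age), ]
old_labels <- rownames(by_age)      # "1" "3" "5" "2" "4": original labels

rownames(by_age) <- NULL            # renumber rows 1..n
rownames(by_age)                    # "1" "2" "3" "4" "5"
```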

How do you summarize by group with aggregate()?

The "split-apply-combine" pattern (group rows by a column, compute something on each group, combine the results) is the core of most analyses. Base R's aggregate() does it in one call.

R: Aggregate salary by seniority
aggregate(salary ~ seniority, data = employees, FUN = mean)
#>   seniority salary
#> 1    junior  78100
#> 2    senior  97350
aggregate(cbind(salary, age) ~ seniority, data = employees, FUN = mean)
#>   seniority salary      age
#> 1    junior  78100 32.66667
#> 2    senior  97350 48.50000
aggregate(salary ~ remote, data = employees,
          FUN = function(x) c(mean = mean(x), n = length(x)))
#>   remote salary.mean salary.n
#> 1  FALSE       97350        2
#> 2   TRUE       78100        3

  

The ~ is formula syntax: "compute this on the left, grouped by that on the right." You can supply any function: sum, median, sd, or a custom one. For heavier aggregation work you'll eventually reach for dplyr::group_by() + summarise(), but aggregate() is perfect when you want zero dependencies.
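The same formula interface extends to other summary functions and to more than one grouping column. A small sketch on a made-up `trips` table:

```r
# Hypothetical table for the illustration
trips <- data.frame(
  mode = c("bike", "bike", "bus", "bus", "bus"),
  peak = c(TRUE, FALSE, TRUE, TRUE, FALSE),
  mins = c(20, 15, 35, 45, 30)
)

# Any function that reduces a vector to one value works as FUN
med <- aggregate(mins ~ mode, data = trips, FUN = median)
med

# Two grouping columns: one output row per combination that occurs
by_both <- aggregate(mins ~ mode + peak, data = trips, FUN = mean)
by_both
```

Combinations that never occur in the data simply don't appear in the result.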

Note
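With the formula interface, aggregate() drops any row that has NA in a variable used in the formula, including the grouping column (the default is na.action = na.omit). If you need to keep rows with a missing group, convert NA to a sentinel string first.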
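A short sketch of the NA behaviour and the sentinel workaround, using a made-up `sales` table; the label "(missing)" is an arbitrary choice for this example:

```r
# Hypothetical table with one missing group label
sales <- data.frame(
  region = c("north", "south", NA, "north"),
  amount = c(100, 200, 50, 300)
)

# Default: the row with NA in `region` is dropped before grouping
dropped <- aggregate(amount ~ region, data = sales, FUN = sum)
dropped

# Relabel NA with a sentinel value first to keep that row
sales$region[is.na(sales$region)] <- "(missing)"
kept <- aggregate(amount ~ region, data = sales, FUN = sum)
kept
```

Comparing sum(dropped$amount) with sum(kept$amount) shows how much was silently excluded the first time.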

Try it: Compute the maximum salary grouped by remote.

R: Exercise: Max salary by remote
aggregate(salary ~ remote, data = employees, FUN = ___)

  
R: Max salary by remote solution
aggregate(salary ~ remote, data = employees, FUN = max)
#>   remote salary
#> 1  FALSE  95000
#> 2   TRUE  78000

  

The formula salary ~ remote says "compute on salary, split by remote", and passing max as FUN applies it to each group. In the original table, David (non-remote) has the highest overall salary at 95000, while Eve tops the remote group at 78000.

Practice Exercises

Exercise 1: Top earners by department

Build a data frame of 8 employees across 3 departments, then return the top earner in each.

R: Exercise: Top earner per department
df <- data.frame(
  name = c("A", "B", "C", "D", "E", "F", "G", "H"),
  dept = c("eng", "eng", "sales", "sales", "sales", "hr", "hr", "eng"),
  pay  = c(90, 110, 75, 82, 68, 72, 78, 120)
)
# return top earner per dept

  
R: Top earner per department solution
# split df by dept, keep the highest-paid row of each piece, then stack the pieces
top <- do.call(rbind, by(df, df$dept, function(g) g[which.max(g$pay), ]))
top
#>       name  dept pay
#> eng      H   eng 120
#> hr       G    hr  78
#> sales    D sales  82

  

Exercise 2: Filter and summarize

From mtcars, return the mean mpg for 4-cylinder cars weighing under 2.5 (wt is recorded in thousands of pounds).

R: Mean mpg for light 4-cyl cars
mean(mtcars[mtcars$cyl == 4 & mtcars$wt < 2.5, "mpg"])
#> [1] 28.0875

  

Exercise 3: Add a categorical column

Add a column mpg_class to mtcars: "low" if mpg < 18, "mid" if 18-25, "high" if > 25. Then count the rows in each class.

R: Bin mpg into classes with cut
mtcars$mpg_class <- cut(mtcars$mpg,
                        breaks = c(-Inf, 18, 25, Inf),
                        labels = c("low", "mid", "high"))
table(mtcars$mpg_class)
#>
#>  low  mid high
#>   13   13    6
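One detail worth checking with cut() is which side of each interval is closed. By default intervals are closed on the right, so a value of exactly 18 lands in "low". The sketch below uses made-up values near the breaks (no car in mtcars has mpg of exactly 18 or 25, so the counts are unaffected either way):

```r
edge <- c(17.9, 18, 18.1, 25, 25.1)    # hypothetical values near the breaks
brks <- c(-Inf, 18, 25, Inf)
labs <- c("low", "mid", "high")

# Default right = TRUE: intervals are (-Inf, 18], (18, 25], (25, Inf]
right_closed <- as.character(cut(edge, breaks = brks, labels = labs))
right_closed    # "low" "low" "mid" "mid" "high"

# right = FALSE: intervals are [-Inf, 18), [18, 25), [25, Inf)
left_closed <- as.character(cut(edge, breaks = brks, labels = labs, right = FALSE))
left_closed     # "low" "mid" "mid" "high" "high"
```

Pick the variant that matches how your class boundaries are worded, and test it on the boundary values themselves.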

  

Putting It All Together

A complete mini-analysis on iris: load it, inspect it, filter, add a derived column, summarize by species, and sort.

R: End-to-end iris analysis pipeline
data(iris)
str(iris)
#> 'data.frame': 150 obs. of 5 variables:
#>  $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#>  $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#>  $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#>  $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.2 0.2 0.2 0.1 ...
#>  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
big <- iris[iris$Sepal.Length > 6, ]
nrow(big)
#> [1] 61
big$petal_ratio <- big$Petal.Length / big$Petal.Width
aggregate(petal_ratio ~ Species, data = big, FUN = mean)
#>      Species petal_ratio
#> 1 versicolor    3.242105
#> 2  virginica    2.782927
# prints the three rows of big with the largest petal_ratio
head(big[order(-big$petal_ratio), c("Species", "petal_ratio")], 3)

  

Six operations in a handful of lines, and every single one is a base-R idiom you'll reach for daily.

Summary

| Operation | Syntax |
|---|---|
| Create | data.frame(col1 = ..., col2 = ...) |
| Inspect | str(), dim(), head(), summary() |
| Select column | df$col, df[["col"]], df[, "col"] |
| Filter rows | df[df$col > 5, ] (don't forget the comma) |
| Add column | df$new <- ... |
| Drop column | df$col <- NULL |
| Sort | df[order(df$col), ] |
| Group summary | aggregate(y ~ g, data = df, FUN = mean) |
