base tapply() in R: Apply Functions to Grouped Vector Subsets
The tapply() function in base R applies a function to subsets of a vector defined by one or more grouping factors. It returns an array of per-group results, making it the base R workhorse for grouped summaries without loading any package.
tapply(mtcars$mpg, mtcars$cyl, mean)                       # mean mpg per cylinder count
tapply(mtcars$mpg, mtcars$cyl, sum)                        # sum within group
tapply(mtcars$mpg, list(mtcars$cyl, mtcars$am), mean)      # two grouping factors
tapply(mtcars$mpg, mtcars$cyl, length)                     # count per group
tapply(mtcars$mpg, mtcars$cyl, function(x) max(x)-min(x))  # custom function
tapply(x, g, mean, na.rm = TRUE)                           # pass args to FUN
tapply(x, g, mean, default = 0)                            # fill empty groups
Need explanation? Read on for examples and pitfalls.
What tapply() does in one sentence
tapply() is a grouped function applier for vectors. You pass a vector of values, one or more parallel factors that label each value's group, and a function. It returns the result of running that function on every subset of values that share a group label.
Conceptually, tapply() uses split() to break the vector by factor, then applies your function to each chunk, then arranges the answers into an array keyed by factor level. This mental model explains every other behavior in this guide.
Syntax
Five arguments matter in everyday use. X is the vector to summarize, usually numeric or character. INDEX is a factor (or list of factors) the same length as X that labels each element's group. FUN is the function applied to each subset. Extra ... arguments are forwarded to FUN (this is how you pass na.rm = TRUE). With simplify = TRUE, the default, scalar results come back as an array; simplify = FALSE always returns a list with one element per group.
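As a rough sketch of the call shape, the snippet below spells out the main arguments by name and checks the split-then-apply mental model against a manual version. The variable names (by_cyl, manual, as_list) are illustrative only.

```r
# Named-argument form of a typical call (argument names as in ?tapply)
by_cyl <- tapply(X = mtcars$mpg, INDEX = mtcars$cyl, FUN = mean)

# Roughly the same computation done by hand: split, then apply
manual <- sapply(split(mtcars$mpg, mtcars$cyl), mean)

# Same values and same group names; by_cyl additionally carries a dim attribute
all.equal(as.vector(by_cyl), unname(manual))
identical(names(by_cyl), names(manual))

# simplify = FALSE keeps each group's result as a list element
as_list <- tapply(mtcars$mpg, mtcars$cyl, mean, simplify = FALSE)
is.list(as_list)
```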
Five common patterns
These five recipes cover the bulk of everyday tapply() calls. Each example uses the built-in mtcars dataset so you can copy and run them as-is.
1. Mean of a numeric column by one factor
The output is a one-dimensional numeric array that prints and indexes like a named vector. Names come from the unique factor levels in INDEX. Reading order matches levels(factor(mtcars$cyl)), which is sorted ascending.
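A minimal version of this recipe, with the result stored in an illustrative variable called mpg_by_cyl:

```r
mpg_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)
mpg_by_cyl
#>        4        6        8 
#> 26.66364 19.74286 15.10000 

names(mpg_by_cyl)     # group labels, taken from the factor levels
mpg_by_cyl["6"]       # pull one group's value by name
```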
2. Multiple summary statistics with a custom function
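Any function that takes a vector and returns a scalar works. If FUN returns several values per group, tapply() does not build a matrix for you: it keeps each group's result as an element of a list array. Bind those elements together yourself, for example with do.call(rbind, ...), when you want one row per group.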
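The sketch below shows both cases: a scalar-valued custom function, and a vector-valued one whose per-group results are then bound into a matrix by hand. The names mpg_range, stats and stats_mat are illustrative.

```r
# Scalar-valued custom function: spread of mpg within each cylinder group
mpg_range <- tapply(mtcars$mpg, mtcars$cyl, function(x) max(x) - min(x))

# Vector-valued function: each group's result is stored as a list element
stats <- tapply(mtcars$mpg, mtcars$cyl,
                function(x) c(mean = mean(x), sd = sd(x), n = length(x)))
is.list(stats)                     # TRUE

# Bind the per-group vectors into a matrix with one row per group
stats_mat <- do.call(rbind, stats)
dim(stats_mat)                     # 3 groups by 3 statistics
```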
3. Two grouping factors
Pass INDEX as a list. Rows are the first factor's levels (cyl), columns the second (am). In mtcars, am is coded 0 for automatic and 1 for manual, so cell [1,2] is the mean mpg for 4-cylinder cars with a manual transmission.
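A short sketch of the two-factor call. Naming the list elements (cyl, am) is an optional addition here; it labels the dimensions of the result.

```r
mpg_cyl_am <- tapply(mtcars$mpg, list(cyl = mtcars$cyl, am = mtcars$am), mean)

dim(mpg_cyl_am)            # 3 rows (cyl) by 2 columns (am)
mpg_cyl_am["4", "1"]       # 4 cylinders, am = 1: 28.075
mpg_cyl_am[1, 2]           # same cell, addressed by position
```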
4. Count of values per group
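length returns the size of each subset. For a single factor this gives the same counts as table(mtcars$cyl), and tapply() extends naturally to multi-factor counts. One difference: combinations with no observations show up as NA in tapply() but as 0 in table().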
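For illustration, here is the count recipe next to table(), plus a two-factor count (n_by_cyl and n_cyl_gear are illustrative names):

```r
n_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, length)
n_by_cyl
#>  4  6  8 
#> 11  7 14 

# Same counts as table() for one factor
all(n_by_cyl == table(mtcars$cyl))

# Two-factor counts; mtcars has no 8-cylinder car with 4 gears, so that cell is NA
n_cyl_gear <- tapply(mtcars$mpg, list(mtcars$cyl, mtcars$gear), length)
n_cyl_gear
```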
5. Pass extra arguments to FUN
Any argument after FUN, other than tapply()'s own default and simplify, is forwarded to FUN. This is the cleanest way to skip missing values without writing an anonymous function.
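A small sketch with made-up data (x and g are hypothetical toy vectors, since mtcars has no missing values):

```r
# Hypothetical toy data with one missing value in group "a"
x <- c(10, NA, 30, 40, 50, 60)
g <- c("a", "a", "a", "b", "b", "b")

with_na <- tapply(x, g, mean)                  # group "a" is NA
no_na   <- tapply(x, g, mean, na.rm = TRUE)    # a = 20, b = 50

# Any other named argument of FUN can be passed the same way
tapply(x, g, quantile, probs = 0.5, na.rm = TRUE)
```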
Reach for tapply() first, then graduate. For a single vector and 1-2 factors, tapply() is shorter than aggregate() and ships with base R. Switch to aggregate() or dplyr once you need to summarize multiple columns at once or chain further operations.
tapply vs aggregate vs by vs ave
Four base R functions handle grouped operations, each with a niche. Knowing which one returns what shape saves debugging time later.
| Function | Input | Output | Use when |
|---|---|---|---|
| tapply(x, g, fun) | Vector + factor(s) | Array (named, possibly multi-dim) | One column, one or more factors |
| aggregate(df, by, fun) | Data frame + factor list | Data frame | Multiple columns at once |
| by(df, g, fun) | Data frame + factor | List indexed by group | Per-group operation on whole subframe |
| ave(x, g, FUN) | Vector + factor | Vector of same length as input | Add group statistic back as a column |
tapply() is for collapsing a vector into a smaller array. ave() is the mirror: it returns a result the same length as the input, repeating the group statistic for every row. Pick tapply() when you want a summary table and ave() when you want to attach the group mean to each original row.
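The contrast can be sketched as follows. The copy named cars and the column cyl_mean_mpg are illustrative; note that ave() needs FUN passed by name, because its unnamed arguments are treated as grouping factors.

```r
# tapply(): one value per group
group_means <- tapply(mtcars$mpg, mtcars$cyl, mean)
length(group_means)                       # 3

# ave(): one value per row, repeating the group mean
cars <- mtcars                            # work on a copy
cars$cyl_mean_mpg <- ave(cars$mpg, cars$cyl, FUN = mean)
head(cars[, c("mpg", "cyl", "cyl_mean_mpg")], 3)
```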
Common pitfalls
Empty factor levels yield NA cells. If INDEX is a factor with levels that have zero observations in X, the corresponding cell is NA (or whatever you pass as default). This trips up code that assumes the output has no missing values.
Set default = 0 (available since R 3.4.0) to fill empty groups with a sensible value instead of NA.
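A minimal illustration with a made-up factor that has an unused level (g and x are hypothetical):

```r
# Hypothetical factor with a level ("c") that never occurs in the data
g <- factor(c("a", "a", "b"), levels = c("a", "b", "c"))
x <- c(1, 2, 3)

with_gap <- tapply(x, g, sum)                # a = 3, b = 3, c = NA
filled   <- tapply(x, g, sum, default = 0)   # a = 3, b = 3, c = 0
```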
Function output shape must be consistent across groups. tapply() only simplifies to an atomic array when FUN returns a single value for every group. If FUN returns a 2-element vector for some groups and a 3-element vector for others, or more than one value for any group, you get a list with one element per group instead. Return exactly one value from FUN for every subset, or use simplify = FALSE and post-process the list explicitly.
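To see the fallback in action, the sketch below uses made-up data where head(v, 2) yields one value for one group and two for the other (x, g, ragged and as_list are hypothetical names):

```r
# Hypothetical data: group "a" has one value, group "b" has four
x <- c(1, 2, 3, 4, 5)
g <- c("a", "b", "b", "b", "b")

ragged <- tapply(x, g, function(v) head(v, 2))
is.list(ragged)            # TRUE: results were not simplified
lengths(ragged)            # a = 1, b = 2

# Asking for a list up front makes the return type predictable
as_list <- tapply(x, g, function(v) head(v, 2), simplify = FALSE)
vapply(as_list, length, integer(1))
```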
Character INDEX gets coerced to factor. R sorts character factor levels alphabetically by default, which may not match your intended display order. Wrap with factor(g, levels = c("low", "mid", "high")) to control ordering.
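A short sketch of the ordering fix, using made-up scores (score, level and ordered_level are hypothetical):

```r
# Hypothetical ratings with a natural low < mid < high order
score <- c(3, 9, 5, 8, 2, 6)
level <- c("low", "high", "mid", "high", "low", "mid")

alpha <- tapply(score, level, mean)
names(alpha)                 # alphabetical: "high" "low" "mid"

ordered_level <- factor(level, levels = c("low", "mid", "high"))
ordered <- tapply(score, ordered_level, mean)
names(ordered)               # "low" "mid" "high"
```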
Try it yourself
Try it: Use tapply() to compute the median horsepower (hp) of mtcars grouped by the number of gears (gear). Save the result to ex_hp_by_gear.
Solution
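One possible solution, consistent with the explanation that follows:

```r
ex_hp_by_gear <- tapply(mtcars$hp, mtcars$gear, median)
ex_hp_by_gear
#>   3   4   5 
#> 180  94 175 
```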
Explanation: tapply() splits mtcars$hp by the values in mtcars$gear, then applies median to each subset. The output is a named numeric vector with one entry per unique gear count.
Related apply functions
The apply family covers different input shapes. Each has a niche; mixing them up is the most common base R confusion.
- apply(m, MARGIN, FUN): matrix or array, by row (1) or column (2).
- lapply(lst, FUN): list or vector, always returns a list.
- sapply(lst, FUN): list or vector, simplifies result to a vector or matrix.
- vapply(lst, FUN, FUN.VALUE): like sapply() but type-safe via a template.
- mapply(FUN, ..., MoreArgs): multivariate; passes parallel elements from multiple arguments.
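A minimal sketch of each sibling on small made-up inputs (m and lst are illustrative names):

```r
m <- matrix(1:6, nrow = 2)                           # 2 x 3 matrix
row_sums <- apply(m, 1, sum)                         # 9 12
col_sums <- apply(m, 2, sum)                         # 3 7 11

lst <- list(a = 1:3, b = 4:6)
as_list <- lapply(lst, mean)                         # list: a = 2, b = 5
as_vec  <- sapply(lst, mean)                         # named numeric vector
checked <- vapply(lst, mean, FUN.VALUE = numeric(1)) # same, with a type check

pair_sum <- mapply(function(x, y) x + y, 1:3, 4:6)   # 5 7 9
```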
Use tapply() when your input is a single vector and the grouping comes from a parallel factor. Everything else in the family ignores grouping factors entirely. For a comparison of those siblings, see base sapply() and base lapply(). The official reference is the base R apply family documentation.
FAQ
What is tapply() used for in R?
tapply() computes a function over subsets of a vector defined by one or more grouping factors. It is the base R answer to "give me a summary statistic per group" without loading dplyr or data.table. Typical uses include mean values per category, counts per level, range or standard deviation by bucket, and any custom function that takes a vector and returns a scalar.
What is the difference between apply and tapply?
apply() operates on matrix or array rows and columns, treating each row or column as the input to your function. tapply() operates on a single vector and partitions it by a factor, applying your function to each partition. apply() cares about the geometry of a 2D object; tapply() cares about group membership in a 1D object.
Can tapply() handle more than one grouping variable?
Yes. Pass a list of two or more factors as the INDEX argument. The result is a multidimensional array with one dimension per factor. Two factors give you a matrix, three factors give a 3D array, and so on. Cells with no observations return NA or the value you pass to default.
How does tapply() differ from aggregate()?
aggregate() works on data frames and summarizes multiple columns at once, returning a data frame. tapply() works on a single vector and returns an array. If you have one column to summarize, tapply() is shorter and typically faster. If you have many columns or want a tidy data frame output for downstream operations, use aggregate() or dplyr::summarise().
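The difference in output shape can be sketched like this (one_col and agg are illustrative names):

```r
# One column: tapply() returns a compact named array
one_col <- tapply(mtcars$mpg, mtcars$cyl, mean)

# Several columns: aggregate() returns a data frame with one row per group
agg <- aggregate(cbind(mpg, hp, wt) ~ cyl, data = mtcars, FUN = mean)
class(agg)     # "data.frame"
names(agg)     # "cyl" "mpg" "hp" "wt"
```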
Why does tapply() return NA for some groups?
This usually happens when your INDEX is a factor with levels that have no observations in your data. Those cells get NA by default. Pass default = 0 (or any other sentinel value) to fill them. Alternatively, drop empty levels first with droplevels() before calling tapply(). NA can also appear when a group contains missing values and FUN is called without na.rm = TRUE.
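A sketch of the droplevels() route, using a subset of mtcars in which the 8-cylinder level is left over but unused (the variable names are illustrative):

```r
# Keep only 4- and 6-cylinder cars; the factor still remembers level "8"
cyl_f   <- factor(mtcars$cyl)
keep    <- cyl_f != "8"
mpg_sub <- mtcars$mpg[keep]
cyl_sub <- cyl_f[keep]

with_empty <- tapply(mpg_sub, cyl_sub, mean)               # "8" appears as NA
dropped    <- tapply(mpg_sub, droplevels(cyl_sub), mean)   # only "4" and "6"
```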