dplyr filter() in R: Subset Rows by Condition
The filter() function in dplyr keeps rows that satisfy a logical condition and drops the rest. You can combine multiple conditions with &, |, !, or by listing them as separate arguments (treated as AND).
    filter(df, mpg > 20)                 # single condition
    filter(df, mpg > 20, cyl == 4)       # AND
    filter(df, mpg > 20 | cyl == 4)      # OR
    filter(df, cyl %in% c(4, 6))         # set membership
    filter(df, between(hp, 100, 200))    # range
    filter(df, !is.na(x), x > 5)         # NA-safe
    filter(df, x == max(x), .by = grp)   # by group
Need explanation? Read on for examples and pitfalls.
What filter() does in one sentence
filter() is a row subsetter. You hand it a data frame and one or more logical conditions, and it returns the rows where every condition evaluates to TRUE. Conditions can be simple comparisons (mpg > 20), set membership (cyl %in% c(4, 6)), range checks (between(hp, 100, 200)), or compound expressions joined with & and |.
Unlike base R df[df$x > 5, ], filter() understands the data frame implicitly: you write column names as bare expressions, no $ or quoting. This is why it slots cleanly into a pipeline.
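As a rough illustration of the difference, the sketch below runs the same subset both ways on the built-in mtcars data. The threshold of 30 and the extra arrange() step are arbitrary choices for the example, and the native pipe assumes R 4.1 or later.

```r
library(dplyr)

# Base R: the data frame name is repeated inside the brackets
base_result <- mtcars[mtcars$mpg > 30, ]

# dplyr: bare column names, and the call chains with other verbs
pipe_result <- mtcars |>
  filter(mpg > 30) |>
  arrange(desc(mpg))

nrow(base_result)
nrow(pipe_result)
```

Both calls keep the same rows here; the pipeline version simply adds a sort on top.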
Syntax
filter() takes a data frame plus one or more logical expressions. Multiple expressions are combined with AND by default. Use &, |, and ! to express more complex logic explicitly.
The full signature is:
filter(.data, ..., .by = NULL, .preserve = FALSE)
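.data is the data frame. The ... argument takes one or more logical expressions. The optional .by argument (added in dplyr 1.1.0) lets you group on the fly without group_by(). The .preserve argument only matters for grouped input: when FALSE (the default), the grouping structure is recalculated from the rows that remain. The return value has the same columns as the input, but only the rows where all conditions are TRUE.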
filter(mtcars, mpg > 20, cyl == 4) is identical to filter(mtcars, mpg > 20 & cyl == 4). Pick whichever reads cleaner; comma form is more idiomatic in pipelines.
Seven common patterns
1. Filter by a single condition
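A minimal sketch of this pattern, using the built-in mtcars data and the mpg > 20 condition from the summary at the top of the page:

```r
library(dplyr)

# Keep only the cars that do better than 20 miles per gallon
efficient <- filter(mtcars, mpg > 20)

nrow(efficient)        # number of rows kept
range(efficient$mpg)   # every remaining value is above 20
```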
2. Combine conditions with AND
The comma form (mpg > 20, cyl == 4) is shorthand for &. Both keep rows where both conditions are TRUE.
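The equivalence can be checked directly. This sketch runs both forms on mtcars and compares the results:

```r
library(dplyr)

# Comma form: each argument is a separate condition, combined with AND
comma_form <- filter(mtcars, mpg > 20, cyl == 4)

# Explicit form: a single condition built with &
amp_form <- filter(mtcars, mpg > 20 & cyl == 4)

# Compare the two results
identical(comma_form, amp_form)
```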
3. Combine conditions with OR
For OR you must use | explicitly. There is no comma shorthand for OR.
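A short sketch on mtcars follows. The thresholds (mpg > 25, hp > 200) and the am == 1 condition are arbitrary values picked for illustration.

```r
library(dplyr)

# Keep cars that are either very efficient or very powerful
either <- filter(mtcars, mpg > 25 | hp > 200)

# OR and AND can be mixed: the OR goes in one argument,
# and the comma adds an AND on top of it
either_manual <- filter(mtcars, mpg > 25 | hp > 200, am == 1)

nrow(either)
nrow(either_manual)
```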
4. Membership tests with %in%
%in% checks whether a value belongs to a set. It is the readable way to express "x is one of these N values" without writing a long chain of ==s connected by |.
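The sketch below compares %in% with the equivalent chain of == tests on mtcars, and shows one way to negate the membership test:

```r
library(dplyr)

# Membership test: cyl is one of 4 or 6
four_or_six <- filter(mtcars, cyl %in% c(4, 6))

# The same filter written as a chain of == joined by |
long_form <- filter(mtcars, cyl == 4 | cyl == 6)
identical(four_or_six, long_form)

# Prefix with ! to keep everything outside the set
the_rest <- filter(mtcars, !cyl %in% c(4, 6))
unique(the_rest$cyl)
```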
5. Range checks with between()
between(x, lo, hi) is an inclusive range check. It is equivalent to x >= lo & x <= hi but reads cleaner; any speed difference between the two depends on the dplyr version and is rarely the reason to choose one over the other.
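A minimal sketch showing that both endpoints are included and that the two spellings select the same rows of mtcars:

```r
library(dplyr)

# Both endpoints count as inside the range
between(c(99, 100, 200, 201), 100, 200)   # FALSE TRUE TRUE FALSE

# Range filter with between() and with explicit comparisons
mid_power <- filter(mtcars, between(hp, 100, 200))
manual <- filter(mtcars, hp >= 100 & hp <= 200)

identical(mid_power, manual)
```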
6. Filter NA-safely
NA in a condition propagates: NA > 0 returns NA, not TRUE or FALSE. filter() keeps only rows where the condition is TRUE, so rows where it evaluates to NA are dropped by default. If you want to make that explicit, combine the condition with !is.na().
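The following sketch uses a made-up four-row data frame (scores is a hypothetical name, not part of any package) to show the implicit and explicit forms side by side:

```r
library(dplyr)

# Hypothetical data with one missing value
scores <- data.frame(id = 1:4, x = c(3, NA, 8, 10))

# The raw comparison yields NA for the missing value
scores$x > 5   # FALSE NA TRUE TRUE

# Implicit: the NA row is dropped because its condition is not TRUE
implicit <- filter(scores, x > 5)

# Explicit: same rows, but the intent is visible to the reader
explicit <- filter(scores, !is.na(x), x > 5)

implicit$id   # 3 4
explicit$id   # 3 4
```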
7. Filter within groups using .by
The .by argument groups on the fly for the duration of the call. The result is automatically ungrouped, unlike group_by() |> filter().
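A sketch of both approaches on mtcars, keeping the most efficient car within each cyl group. It assumes dplyr 1.1.0 or later, where .by is available.

```r
library(dplyr)

# Per-call grouping: the result comes back ungrouped
best_by <- filter(mtcars, mpg == max(mpg), .by = cyl)
is_grouped_df(best_by)        # FALSE

# Persistent grouping: the result is still grouped by cyl
best_grouped <- mtcars |>
  group_by(cyl) |>
  filter(mpg == max(mpg))
is_grouped_df(best_grouped)   # TRUE

# Drop the grouping explicitly once it is no longer needed
best_ungrouped <- ungroup(best_grouped)
```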
.by and group_by() produce the same result for grouped filters, but .by does not leave the data grouped afterwards. Use .by for one-off grouped operations to avoid surprising downstream behavior; use group_by() when subsequent verbs in the pipeline also need the grouping.
filter() vs base R row subsetting
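filter() reads as English; base R bracket subsetting reads as algebra. On data without missing values both return the same rows, so the choice is mostly style and pipeline ergonomics. They differ on NA: bracket subsetting returns an all-NA row wherever the condition is NA, whereas filter() drops those rows.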
| Task | dplyr | Base R |
|---|---|---|
| Single condition | `filter(df, x > 5)` | `df[df$x > 5, ]` |
| AND | `filter(df, x > 5, y < 10)` | `df[df$x > 5 & df$y < 10, ]` |
| OR | `filter(df, x > 5 \| y < 10)` | `df[df$x > 5 \| df$y < 10, ]` |
| Membership | `filter(df, x %in% c(1,2,3))` | `df[df$x %in% c(1,2,3), ]` |
| Range | `filter(df, between(x, 1, 10))` | `df[df$x >= 1 & df$x <= 10, ]` |
| NA-safe | `filter(df, !is.na(x), x > 5)` | `df[!is.na(df$x) & df$x > 5, ]` |
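The NA-safe row matters more in base R than in dplyr. The sketch below uses a made-up three-row data frame (df and its values are hypothetical) to show how the single-condition forms behave when x contains a missing value:

```r
library(dplyr)

# Hypothetical data: one value below the cutoff, one missing, one above
df <- data.frame(x = c(2, NA, 9))

# Base R: an NA in the logical index produces an all-NA row
base_rows <- df[df$x > 5, , drop = FALSE]
nrow(base_rows)    # 2

# dplyr: rows where the condition is NA are dropped
dplyr_rows <- filter(df, x > 5)
nrow(dplyr_rows)   # 1

# Base R with the NA guard from the table matches filter()
guarded <- df[!is.na(df$x) & df$x > 5, , drop = FALSE]
nrow(guarded)      # 1
```

The drop = FALSE argument is only there because df has a single column; without it, base R would return a plain vector.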
When to use which:
- Use filter() inside any pipeline that uses other dplyr verbs.
- Use base R [, ] for one-line scripts with no other tidyverse code, or when squeezing the last drop of speed for very large in-memory data.
Common pitfalls
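Pitfall 1: using = instead of ==. filter(mtcars, cyl = 4) fails because cyl = 4 is parsed as a named argument, not a comparison. dplyr stops with an error saying it detected a named input and asks whether you meant cyl == 4. Use == for equality. This is the single most common dplyr mistake.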
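A small sketch of the mistake and the fix. The tryCatch() wrapper is only there so the example can run to the end without stopping at the error.

```r
library(dplyr)

# Wrong: a single = makes cyl a named argument, so filter() raises an error
wrong <- tryCatch(
  filter(mtcars, cyl = 4),
  error = function(e) "filter() raised an error"
)
wrong

# Right: == compares each value of cyl with 4
right <- filter(mtcars, cyl == 4)
unique(right$cyl)   # 4
```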
Pitfall 2: NA in conditions silently drops rows. A row where x is NA will never satisfy x > 5 (the comparison returns NA, which filter() treats as not matching). If you want NA rows kept, use is.na(x) | x > 5. If you want them excluded explicitly, write !is.na(x), x > 5.
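To make the two options concrete, here is a sketch on a made-up data frame (readings is a hypothetical name) with two missing values:

```r
library(dplyr)

# Hypothetical data with missing values in rows 2 and 4
readings <- data.frame(id = 1:4, x = c(3, NA, 8, NA))

# Default behaviour: NA rows vanish without any warning
dropped <- filter(readings, x > 5)
dropped$id   # 3

# Keep the NA rows alongside the matches
kept <- filter(readings, is.na(x) | x > 5)
kept$id      # 2 3 4
```

The second form works because TRUE | NA evaluates to TRUE, so the is.na(x) test rescues the rows the comparison alone would lose.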
Confusing filter() with select() is the most common dplyr error after = vs ==. filter() picks rows by condition; select() picks columns by name. If you write filter(mtcars, mpg, cyl) (no condition), dplyr errors because mpg and cyl are not logical vectors. If you wanted the columns, use select().
Pitfall 3: splitting conditions across separate filter() calls when one call would do. filter(df, a > 0, b > 0) and filter(df, a > 0) |> filter(b > 0) produce the same result, but the single call evaluates the conditions in one pass and can be faster in some cases. For large data, prefer the comma form.
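The sketch below checks only that the two forms agree on mtcars; it does not measure speed, and the conditions (mpg > 20, hp > 100) are arbitrary.

```r
library(dplyr)

# One call, two conditions
single_call <- filter(mtcars, mpg > 20, hp > 100)

# Two chained calls, one condition each
chained <- mtcars |>
  filter(mpg > 20) |>
  filter(hp > 100)

# Same rows either way
isTRUE(all.equal(single_call, chained))
```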
Try it yourself
Try it: Filter mtcars to keep only cars with cyl == 4 AND mpg > 25. Save the result to ex_filtered and print the row count.
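One possible solution, written with the comma form:

```r
library(dplyr)

# Keep 4-cylinder cars that do better than 25 miles per gallon
ex_filtered <- filter(mtcars, cyl == 4, mpg > 25)

# Print the row count
print(nrow(ex_filtered))   # 6
```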
Explanation: Comma-separated conditions inside filter() combine with AND. The result keeps only rows satisfying both. Equivalent to filter(mtcars, cyl == 4 & mpg > 25).
Related dplyr functions
After mastering filter(), look at:
- slice(), slice_head(), slice_tail(), slice_min(), slice_max(): row selection by position or sorted value
- distinct(): remove duplicate rows
- arrange(): sort rows (does not subset)
- between(), if_any(), if_all(): helpers for compound row conditions
- dplyr::filter() vs stats::filter(): the latter is for time-series filtering, unrelated. The stats package is attached in every R session, so a bare filter() means stats::filter() until dplyr is attached. Write dplyr::filter() explicitly whenever there is doubt about which one is in scope.
FAQ
How do I filter for multiple conditions in dplyr?
List them comma-separated for AND: filter(df, x > 5, y < 10). Use | for OR: filter(df, x > 5 | y < 10). Negate with !: filter(df, !(x > 5)). Combine freely with parentheses: filter(df, (x > 5 & y < 10) | z == "A").
What is the difference between filter() and subset() in R?
subset() is a base R function with similar behavior, but it uses non-standard evaluation in ways that differ subtly from dplyr. The R documentation for subset() describes it as a convenience function intended for interactive use and recommends standard subsetting with [ for programming. filter() is the modern, predictable replacement.
How do I filter rows where a column is NA?
Use is.na() inside filter(): filter(df, is.na(x)) returns rows where x is NA. To EXCLUDE NA rows, use filter(df, !is.na(x)). The drop_na() function from tidyr does this for multiple columns at once.
Can I filter using a regular expression?
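Yes, with base R grepl() or str_detect() from the stringr package inside the condition: filter(df, grepl("pattern", x)) or filter(df, str_detect(x, "pattern")). Note that the argument order differs: grepl() takes the pattern first, str_detect() takes the string first. Both return logical vectors that filter() accepts.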
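A sketch using grepl(), which needs no extra package. Because filter() cannot see row names, the example first copies the mtcars row names into a model column; car_models and the "^Merc" pattern are choices made for this illustration.

```r
library(dplyr)

# Build a data frame with the car names as a regular column
car_models <- data.frame(model = rownames(mtcars), mpg = mtcars$mpg)

# Keep rows whose model name starts with "Merc"
mercs <- filter(car_models, grepl("^Merc", model))

mercs$model
```

With stringr attached, the same condition would be written as str_detect(model, "^Merc").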
How do I filter the top N rows by a column?
Use slice_max() or slice_min(), not filter(). slice_max(df, mpg, n = 5) returns the 5 rows with the highest mpg, or more if rows tie for the last place; set with_ties = FALSE to get exactly 5. filter() is for arbitrary conditions, not for ranking-based selection.
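A short sketch on mtcars, where two cars share the third-highest mpg value, so the handling of ties is visible:

```r
library(dplyr)

# Default: ties for the last place are all kept
top3_ties <- slice_max(mtcars, mpg, n = 3)
nrow(top3_ties)     # 4

# with_ties = FALSE returns exactly n rows
top3_strict <- slice_max(mtcars, mpg, n = 3, with_ties = FALSE)
nrow(top3_strict)   # 3

# slice_min() works the same way from the other end
bottom3 <- slice_min(mtcars, mpg, n = 3, with_ties = FALSE)
```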