dplyr filter() and select(): Subset Rows & Columns with Precision
filter() keeps rows that match a condition. select() picks columns by name, position, or pattern. Together they're the most-used dplyr verbs — you'll call them in almost every analysis.
Base R subsetting with df[df$col > 5, c("a", "b")] works but gets ugly fast, especially with multiple conditions. dplyr's filter and select are readable, pipe-friendly, and come with powerful helper functions that save typing.
filter(): Keep Rows by Condition
Start with a simple comparison. filter() takes a data frame and one or more logical conditions. It returns only the rows where every condition is TRUE.
Multiple Conditions (AND)
Separate conditions with commas — they're combined with AND (all must be true).
OR Conditions
Use | for OR — rows matching either condition are kept.
%in% for Multiple Values
When checking against several values, %in% is cleaner than chaining |.
Negation with !
between() for Ranges
Filtering with NA
filter() automatically drops rows where the condition evaluates to NA. If you want to keep NAs, be explicit.
String Filtering with grepl / str_detect
select(): Pick Columns
By Name
By Range
Exclude Columns with -
Select Helpers
These functions match columns by name patterns — no need to type every column.
Select by Type with where()
Rename While Selecting
Reorder: Move Columns to Front
Combining filter and select
The real power comes from chaining filter and select in a pipeline.
filter() vs subset() vs base R [
| Method | Syntax | Drops NA? | Pipe-friendly? |
|---|---|---|---|
filter() |
filter(df, x > 5) |
Yes | Yes |
subset() |
subset(df, x > 5) |
Yes | Awkward |
[ |
df[df$x > 5, ] |
No (keeps NA rows) | No |
Use
filter()in tidyverse workflows. Use[only when you need to preserve NA rows or avoid dependencies.
Practice Exercises
Exercise 1: Complex Filter
Find all 4-cylinder cars with above-average mpg and manual transmission (am == 1).
Click to reveal solution
```rExercise 2: Select by Pattern and Type
From iris, select Species plus all columns containing "Width". Then select only numeric columns and compute their means.
Click to reveal solution
```rExercise 3: Negated Filter with NA
From this dataset, keep rows where score is NOT NA and grade is NOT "F".
Click to reveal solution
```rSummary
| Function | Purpose | Example | ||
|---|---|---|---|---|
filter(condition) |
Keep matching rows | filter(mpg > 20) |
||
filter(c1, c2) |
AND conditions | filter(cyl == 4, am == 1) |
||
| `filter(c1 \ | c2)` | OR conditions | `filter(cyl == 4 \ | cyl == 6)` |
filter(x %in% vals) |
Match set | filter(cyl %in% c(4, 6)) |
||
select(cols) |
Pick columns | select(mpg, hp) |
||
select(-cols) |
Drop columns | select(-c(vs, am)) |
||
starts_with("x") |
Name prefix | select(starts_with("Sepal")) |
||
ends_with("x") |
Name suffix | select(ends_with("Width")) |
||
contains("x") |
Name contains | select(contains("ar")) |
||
where(fn) |
By column type | select(where(is.numeric)) |
||
everything() |
All remaining | select(id, everything()) |
FAQ
What's the difference between filter() and subset()?
Functionally similar for basic use. filter() is faster, works with grouped data frames, supports tidy evaluation, and integrates with the pipe. Always prefer filter() in tidyverse code.
Can filter() use regular expressions?
Not directly. Use grepl("pattern", col) or stringr::str_detect(col, "pattern") inside filter: filter(str_detect(name, "^A")).
How do I select columns by position number?
select(1, 3, 5) or select(1:5). But name-based selection is safer — column positions can change if the data source is updated.
Does filter() modify the original data frame?
No. Like all dplyr verbs, filter() returns a new data frame. The original is unchanged. Assign the result to save it: result <- df |> filter(...).
What's Next?
- dplyr mutate & rename — create and modify columns
- dplyr group_by & summarise — aggregate data by group
- R Joins — combine multiple data frames