dplyr filter() and select(): Subset Exactly the Data You Need, Every Time
In dplyr, filter() keeps the rows that match a condition, and select() keeps the columns you name. Together they're the first two verbs you'll reach for in any data analysis, and they replace half a dozen clunky base-R patterns with two clean, composable calls.
How do you filter rows with dplyr::filter()?
filter() takes a data frame and one or more conditions. It keeps the rows where every condition evaluates to TRUE. You refer to columns by their bare names, no $, no quotes.
Compare that to base R: mtcars[mtcars$mpg > 25, ]. The dplyr version is shorter, and it composes cleanly with the pipe. The real wins show up with multiple conditions.
Commas between conditions mean "and." Every row must satisfy all of them. This is the single most readable way to write multi-condition filters in R.
filter(), dplyr uses tidy evaluation: column names are bare identifiers. That's why filter(mtcars, mpg > 25) works but filter(mtcars, "mpg" > 25) doesn't, the second is comparing a string to a number.Try it: Filter mtcars to rows where gear == 4 and carb == 4.
Click to reveal solution
Each comma-separated condition is ANDed together, so only rows with both gear == 4 and carb == 4 survive. Both column names resolve by tidy evaluation inside filter(), which is why you don't need mtcars$gear, dplyr scopes the names to the data frame on the left of the pipe.
How do you combine filter conditions with &, |, and !?
Commas mean "and", that's the common case. For "or" and "not," use the standard logical operators: | for or, ! for not.
The %in% operator is your friend for "is this value one of these values" checks, cleaner than chaining == with |.
&& or || inside filter(). Those are scalar operators, they return one value even if you give them vectors, and they'll silently filter wrong. Always use the vectorized & and |.Try it: Filter to rows where am == 1 OR gear == 5.
Click to reveal solution
| is the vectorised OR, each row is kept if either side is TRUE. Because every manual car (am == 1) also has gear of 4 or 5, the OR mostly matches the same rows as am == 1, giving 13 cars. Don't swap | for ||: || is the scalar shortcut used in if statements and it'll only look at the first row.
How do you handle NA in filters?
NA propagates through comparisons: NA > 5 is NA, not FALSE. filter() drops rows where the condition is NA, safer than base R, which sometimes returns mystery NA rows. But you still need is.na() to explicitly select missing rows.
!is.na(x) is the idiomatic "keep the non-missing rows" filter. You'll write it in nearly every analysis that loads real data.
Try it: On starwars, keep only rows where mass is not NA.
Click to reveal solution
!is.na(mass) is a logical vector that's TRUE wherever mass has a value, and filter() keeps exactly those rows. Of the 87 characters in starwars, 59 have a recorded mass, the other 28 drop out. This pattern is the cleanest way to guarantee downstream numeric work won't choke on NAs.
How do you pick columns with select()?
select() keeps (or drops) columns by name. The simplest form lists the columns you want:
A minus sign drops columns instead:
Ranges work with :, the colon operator picks every column between two names inclusive:
Try it: Select just mpg, wt, and hp from mtcars.
Click to reveal solution
select() keeps columns in the order you list them, so you can reorder while subsetting, the output shows mpg first, then wt, then hp, even though the original data frame has wt after hp. Row names (Mazda RX4, ...) travel with the rows.
What are the column selection helpers (starts_with, ends_with, contains)?
Typing column names gets old fast when tables have 50+ columns. dplyr's tidyselect helpers let you pick by pattern.
And where() lets you pick by column type:
select(where(is.numeric)) plus summarise(across(everything(), mean)) is how you compute summary stats on every numeric column in one line. Memorize this pattern, you'll use it constantly.Try it: From iris, select all columns whose name starts with "Petal".
Click to reveal solution
starts_with("Petal") matches any column name that begins with Petal, so Petal.Length and Petal.Width are kept and the three Sepal/Species columns are dropped. tidyselect helpers like this keep working as new columns are added, so you don't have to update your select call when the schema grows.
How do you rename columns with select() and rename()?
select() can rename columns inline: the syntax is new_name = old_name. If you only want to rename without dropping anything, use rename().
rename() keeps every other column; select() drops anything you didn't list.
Try it: Rename wt to weight in mtcars.
Click to reveal solution
rename() takes new_name = old_name pairs and changes the name in place, every other column stays put. Unlike select(weight = wt), you don't have to list every column you want to keep, which is exactly what you want when you only need to retitle one thing.
How do you combine filter() and select() in a pipeline?
The pair composes naturally. Filter first (reduce rows), then select (reduce columns), or vice versa, it doesn't affect the result but it can affect memory for huge tables.
Three verbs, four lines, and you've answered "which 4-cylinder cars have the best mileage, showing just the relevant columns." That's dplyr at its best.
filter() |> select() is the conventional one, keep it even when either order works. Consistency makes pipelines faster to read for the next person (or future you).Try it: From mtcars, filter to gear == 4, then keep only mpg, hp, and wt.
Click to reveal solution
The pipe hands the filtered data frame to select(), so the column pick operates only on the 12 four-gear cars, not the full 32 rows. Doing filter() before select() is the conventional order because it usually shrinks the row count first, which is cheaper and matches how you'd describe the query in English.
Practice Exercises
Exercise 1: Top fuel-efficient manuals
From mtcars, return the names and mpg of manual-transmission cars (am == 1) with mpg above the overall median.
Show solution
Exercise 2: Select by type and pattern
From iris, select every numeric column whose name contains "Length".
Show solution
Exercise 3: Exclude and rename
From mtcars, drop the vs, am, and carb columns, then rename mpg to miles_per_gallon.
Show solution
Putting It All Together
A complete mini-analysis on starwars: find the tallest human characters with known homeworlds, keeping only the columns we care about.
Three filters (species, no NA height, no NA homeworld), a column selection, a sort, a top-5. Every line reads top-to-bottom as one thought.
Summary
| Verb | What it does | Key pattern | ||
|---|---|---|---|---|
filter(cond, cond, ...) |
Keep rows where all conditions are TRUE |
filter(df, x > 5, y == "a") |
||
| `filter(cond \ | cond)` | Keep rows matching any condition | `filter(df, x > 5 \ | y == "a")` |
filter(!is.na(x)) |
Drop missing values in a column | Essential cleanup idiom | ||
select(a, b, c) |
Keep named columns | Explicit, clearest form | ||
select(-x, -y) |
Drop named columns | Inverse selection | ||
select(starts_with("x")) |
Keep by name pattern | Also ends_with, contains |
||
select(where(is.numeric)) |
Keep by type | Combine with across() later |
||
rename(new = old) |
Rename without dropping | Use when selecting would drop too much |
References
- dplyr package documentation
- R for Data Science, Data Transformation
- tidyselect helpers reference
- dplyr cheat sheet (RStudio)
- Tidyverse Style Guide
Continue Learning
- dplyr mutate(): Create New Columns, the natural next verb after filter/select.
- dplyr group_by() + summarise(), aggregate filtered data.
- R Pipe Operator: %>% vs |>, the glue that connects dplyr verbs.
Further Reading
- data.table vs dplyr in R: Head-to-Head Performance Benchmark: Which Is Right for You?
- dplyr Exercises: 15 Data Manipulation Practice Problems (With Solutions)
- dplyr across(): Apply the Same Function to Multiple Columns at Once
- data.table Exercises: 12 High-Performance Data Manipulation Problems, Solved Step-by-Step
- dplyr filter() & select() Exercises: 12 Practice Problems, Solved Step-by-Step)