dplyr filter() and select(): Subset Exactly the Data You Need, Every Time

In dplyr, filter() keeps the rows that match a condition, and select() keeps the columns you name. Together they're the first two verbs you'll reach for in any data analysis, and they replace half a dozen clunky base-R patterns with two clean, composable calls.

How do you filter rows with dplyr::filter()?

filter() takes a data frame and one or more conditions. It keeps the rows where every condition evaluates to TRUE. You refer to columns by their bare names, no $, no quotes.

RFilter mtcars rows by mpg
library(dplyr) filter(mtcars, mpg > 25) #> mpg cyl disp hp drat wt qsec vs am gear carb #> Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 #> Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 #> Toyota Corolla 33.9 4 71.1 79 4.22 1.835 19.90 1 1 4 1 #> Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 #> Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 #> Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2

  

Compare that to base R: mtcars[mtcars$mpg > 25, ]. The dplyr version is shorter, and it composes cleanly with the pipe. The real wins show up with multiple conditions.

RFilter with two comma conditions
mtcars |> filter(mpg > 20, cyl == 4) #> mpg cyl disp hp drat wt qsec vs am gear carb #> Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 #> Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 #> Merc 230 22.8 4 140.8 95 3.92 22.90 0 0 4 2 #> Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 #> ...

  

Commas between conditions mean "and." Every row must satisfy all of them. This is the single most readable way to write multi-condition filters in R.

Key Insight
Inside filter(), dplyr uses tidy evaluation: column names are bare identifiers. That's why filter(mtcars, mpg > 25) works but filter(mtcars, "mpg" > 25) doesn't, the second is comparing a string to a number.

Try it: Filter mtcars to rows where gear == 4 and carb == 4.

RExercise: Filter by gear and carb
library(dplyr) mtcars |> filter(gear == ___, carb == ___)

  
Click to reveal solution
RFilter by gear and carb solution
library(dplyr) mtcars |> filter(gear == 4, carb == 4) #> mpg cyl disp hp drat wt qsec vs am gear carb #> Mazda RX4 21.0 6 160.0 110 3.90 2.62 16.46 0 1 4 4 #> Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.88 17.02 0 1 4 4 #> Merc 280 19.2 6 167.6 123 3.92 3.44 18.30 1 0 4 4 #> Merc 280C 17.8 6 167.6 123 3.92 3.44 18.90 1 0 4 4 #> Volvo 142E 21.4 4 121.0 109 4.11 2.78 18.60 1 1 4 2

  

Each comma-separated condition is ANDed together, so only rows with both gear == 4 and carb == 4 survive. Both column names resolve by tidy evaluation inside filter(), which is why you don't need mtcars$gear, dplyr scopes the names to the data frame on the left of the pipe.

How do you combine filter conditions with &, |, and !?

Commas mean "and", that's the common case. For "or" and "not," use the standard logical operators: | for or, ! for not.

ROR, NOT, and in combinations
# OR, either condition matches mtcars |> filter(cyl == 8 | mpg > 30) # NOT, exclude a group mtcars |> filter(!(cyl == 8)) # Complex, (cyl 4 or 6) AND high mpg mtcars |> filter(cyl %in% c(4, 6), mpg > 25)

  

The %in% operator is your friend for "is this value one of these values" checks, cleaner than chaining == with |.

Rbetween and strdetect filters
# Between, numeric range mtcars |> filter(between(hp, 100, 150)) # String matching with stringr::str_detect() library(stringr) starwars |> filter(str_detect(name, "Luke")) #> # A tibble: 1 x 14 #> name height mass ... #> 1 Luke Skywalker 172 77 ...

  
Warning
Don't use && or || inside filter(). Those are scalar operators, they return one value even if you give them vectors, and they'll silently filter wrong. Always use the vectorized & and |.

Try it: Filter to rows where am == 1 OR gear == 5.

RExercise: am or gear filter
mtcars |> filter(am == 1 | gear == ___)

  
Click to reveal solution
Ram or gear filter solution
library(dplyr) mtcars |> filter(am == 1 | gear == 5) |> nrow() #> [1] 13

  

| is the vectorised OR, each row is kept if either side is TRUE. Because every manual car (am == 1) also has gear of 4 or 5, the OR mostly matches the same rows as am == 1, giving 13 cars. Don't swap | for ||: || is the scalar shortcut used in if statements and it'll only look at the first row.

How do you handle NA in filters?

NA propagates through comparisons: NA > 5 is NA, not FALSE. filter() drops rows where the condition is NA, safer than base R, which sometimes returns mystery NA rows. But you still need is.na() to explicitly select missing rows.

RFilter around NA safely
df <- tibble(x = c(1, 2, NA, 4, NA)) df |> filter(x > 2) #> # A tibble: 1 x 1 #> x #> <dbl> #> 1 4 df |> filter(is.na(x)) #> # A tibble: 2 x 1 #> x #> <dbl> #> 1 NA #> 2 NA df |> filter(!is.na(x)) #> # A tibble: 3 x 1 #> x #> <dbl> #> 1 1 #> 2 2 #> 3 4

  

!is.na(x) is the idiomatic "keep the non-missing rows" filter. You'll write it in nearly every analysis that loads real data.

Try it: On starwars, keep only rows where mass is not NA.

RExercise: Drop NA mass rows
starwars |> filter(!is.na(___))

  
Click to reveal solution
RDrop NA mass solution
library(dplyr) starwars |> filter(!is.na(mass)) |> nrow() #> [1] 59

  

!is.na(mass) is a logical vector that's TRUE wherever mass has a value, and filter() keeps exactly those rows. Of the 87 characters in starwars, 59 have a recorded mass, the other 28 drop out. This pattern is the cleanest way to guarantee downstream numeric work won't choke on NAs.

How do you pick columns with select()?

select() keeps (or drops) columns by name. The simplest form lists the columns you want:

RPick columns with select
mtcars |> select(mpg, cyl, hp) |> head(3) #> mpg cyl hp #> Mazda RX4 21 6 110 #> Mazda RX4 Wag 21 6 110 #> Datsun 710 22.8 4 93

  

A minus sign drops columns instead:

RDrop columns with minus sign
mtcars |> select(-vs, -am, -gear, -carb) |> head(3) #> mpg cyl disp hp drat wt qsec #> Mazda RX4 21 6 160 110 3.90 2.620 16.46 #> Mazda RX4 Wag 21 6 160 110 3.90 2.875 17.02 #> Datsun 710 22.8 4 108 93 3.85 2.320 18.61

  

Ranges work with :, the colon operator picks every column between two names inclusive:

RSelect column ranges with colon
mtcars |> select(mpg:drat) |> head(3) #> mpg cyl disp hp drat #> Mazda RX4 21 6 160 110 3.90 #> Mazda RX4 Wag 21 6 160 110 3.90 #> Datsun 710 22.8 4 108 93 3.85

  

Try it: Select just mpg, wt, and hp from mtcars.

RExercise: Pick three columns
mtcars |> select(___, ___, ___) |> head()

  
Click to reveal solution
RPick three columns solution
library(dplyr) mtcars |> select(mpg, wt, hp) |> head() #> mpg wt hp #> Mazda RX4 21.0 2.620 110 #> Mazda RX4 Wag 21.0 2.875 110 #> Datsun 710 22.8 2.320 93 #> Hornet 4 Drive 21.4 3.215 110 #> Hornet Sportabout 18.7 3.440 175 #> Valiant 18.1 3.460 105

  

select() keeps columns in the order you list them, so you can reorder while subsetting, the output shows mpg first, then wt, then hp, even though the original data frame has wt after hp. Row names (Mazda RX4, ...) travel with the rows.

What are the column selection helpers (starts_with, ends_with, contains)?

Typing column names gets old fast when tables have 50+ columns. dplyr's tidyselect helpers let you pick by pattern.

Rstartswith, endswith, contains helpers
iris |> select(starts_with("Sepal")) |> head(3) #> Sepal.Length Sepal.Width #> 1 5.1 3.5 #> 2 4.9 3.0 #> 3 4.7 3.2 iris |> select(ends_with("Width")) |> head(3) #> Sepal.Width Petal.Width #> 1 3.5 0.2 #> 2 3.0 0.2 #> 3 3.2 0.2 iris |> select(contains("eng")) |> head(3) #> Sepal.Length Petal.Length #> 1 5.1 1.4 #> 2 4.9 1.4 #> 3 4.7 1.3

  

And where() lets you pick by column type:

Rselect with where helper
iris |> select(where(is.numeric)) |> head(3) #> Sepal.Length Sepal.Width Petal.Length Petal.Width #> 1 5.1 3.5 1.4 0.2 #> 2 4.9 3.0 1.4 0.2 #> 3 4.7 3.2 1.3 0.2

  
Tip
select(where(is.numeric)) plus summarise(across(everything(), mean)) is how you compute summary stats on every numeric column in one line. Memorize this pattern, you'll use it constantly.

Try it: From iris, select all columns whose name starts with "Petal".

RExercise: startswith variant
iris |> select(starts_with("___")) |> head()

  
Click to reveal solution
Rstartswith Petal solution
library(dplyr) iris |> select(starts_with("Petal")) |> head() #> Petal.Length Petal.Width #> 1 1.4 0.2 #> 2 1.4 0.2 #> 3 1.3 0.2 #> 4 1.5 0.2 #> 5 1.4 0.2 #> 6 1.7 0.4

  

starts_with("Petal") matches any column name that begins with Petal, so Petal.Length and Petal.Width are kept and the three Sepal/Species columns are dropped. tidyselect helpers like this keep working as new columns are added, so you don't have to update your select call when the schema grows.

How do you rename columns with select() and rename()?

select() can rename columns inline: the syntax is new_name = old_name. If you only want to rename without dropping anything, use rename().

RInline rename in select and rename
mtcars |> select(miles_per_gallon = mpg, cylinders = cyl, horsepower = hp) |> head(3) #> miles_per_gallon cylinders horsepower #> Mazda RX4 21.0 6 110 #> Mazda RX4 Wag 21.0 6 110 #> Datsun 710 22.8 4 93 mtcars |> rename(miles_per_gallon = mpg) |> head(3) |> colnames() #> [1] "miles_per_gallon" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" #> [10] "gear" "carb"

  

rename() keeps every other column; select() drops anything you didn't list.

Try it: Rename wt to weight in mtcars.

RExercise: Rename wt to weight
mtcars |> rename(weight = ___) |> head()

  
Click to reveal solution
RRename wt to weight solution
library(dplyr) mtcars |> rename(weight = wt) |> head() #> mpg cyl disp hp drat weight qsec vs am gear carb #> Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 #> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 #> Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 #> ...

  

rename() takes new_name = old_name pairs and changes the name in place, every other column stays put. Unlike select(weight = wt), you don't have to list every column you want to keep, which is exactly what you want when you only need to retitle one thing.

How do you combine filter() and select() in a pipeline?

The pair composes naturally. Filter first (reduce rows), then select (reduce columns), or vice versa, it doesn't affect the result but it can affect memory for huge tables.

Rfilter then select then arrange
mtcars |> filter(cyl == 4, mpg > 25) |> select(mpg, wt, hp, gear) |> arrange(desc(mpg)) #> mpg wt hp gear #> Toyota Corolla 33.9 1.835 79 4 #> Fiat 128 32.4 2.200 66 4 #> Honda Civic 30.4 1.615 52 4 #> Lotus Europa 30.4 1.513 113 5 #> Fiat X1-9 27.3 1.935 66 4 #> Porsche 914-2 26.0 2.140 91 5

  

Three verbs, four lines, and you've answered "which 4-cylinder cars have the best mileage, showing just the relevant columns." That's dplyr at its best.

Note
The order filter() |> select() is the conventional one, keep it even when either order works. Consistency makes pipelines faster to read for the next person (or future you).

Try it: From mtcars, filter to gear == 4, then keep only mpg, hp, and wt.

RExercise: Gear filter plus columns
mtcars |> filter(gear == ___) |> select(mpg, hp, ___)

  
Click to reveal solution
RGear filter plus columns solution
library(dplyr) mtcars |> filter(gear == 4) |> select(mpg, hp, wt) #> mpg hp wt #> Mazda RX4 21.0 110 2.620 #> Mazda RX4 Wag 21.0 110 2.875 #> Datsun 710 22.8 93 2.320 #> Fiat 128 32.4 66 2.200 #> Honda Civic 30.4 52 1.615 #> ...

  

The pipe hands the filtered data frame to select(), so the column pick operates only on the 12 four-gear cars, not the full 32 rows. Doing filter() before select() is the conventional order because it usually shrinks the row count first, which is cheaper and matches how you'd describe the query in English.

Practice Exercises

Exercise 1: Top fuel-efficient manuals

From mtcars, return the names and mpg of manual-transmission cars (am == 1) with mpg above the overall median.

Show solution
RAbove-median manual cars solution
library(dplyr) med <- median(mtcars$mpg) mtcars |> tibble::rownames_to_column("model") |> filter(am == 1, mpg > med) |> select(model, mpg) |> arrange(desc(mpg))

  

Exercise 2: Select by type and pattern

From iris, select every numeric column whose name contains "Length".

Show solution
RNumeric Length columns solution
iris |> select(where(is.numeric) & contains("Length")) #> Sepal.Length Petal.Length #> 1 5.1 1.4 #> ...

  

Exercise 3: Exclude and rename

From mtcars, drop the vs, am, and carb columns, then rename mpg to miles_per_gallon.

Show solution
RDrop and rename solution
mtcars |> select(-vs, -am, -carb) |> rename(miles_per_gallon = mpg) |> head()

  

Putting It All Together

A complete mini-analysis on starwars: find the tallest human characters with known homeworlds, keeping only the columns we care about.

REnd-to-end tallest humans pipeline
library(dplyr) starwars |> filter( species == "Human", !is.na(height), !is.na(homeworld) ) |> select(name, height, mass, homeworld, gender) |> arrange(desc(height)) |> head(5) #> # A tibble: 5 x 5 #> name height mass homeworld gender #> <chr> <int> <dbl> <chr> <chr> #> 1 Darth Vader 202 136 Tatooine masculine #> 2 Qui-Gon Jinn 193 89 unknown masculine #> 3 Dooku 193 80 Serenno masculine #> 4 Bail Prestor Organa 191 NA Alderaan masculine #> 5 Anakin Skywalker 188 84 Tatooine masculine

  

Three filters (species, no NA height, no NA homeworld), a column selection, a sort, a top-5. Every line reads top-to-bottom as one thought.

Summary

Verb What it does Key pattern
filter(cond, cond, ...) Keep rows where all conditions are TRUE filter(df, x > 5, y == "a")
`filter(cond \ cond)` Keep rows matching any condition `filter(df, x > 5 \ y == "a")`
filter(!is.na(x)) Drop missing values in a column Essential cleanup idiom
select(a, b, c) Keep named columns Explicit, clearest form
select(-x, -y) Drop named columns Inverse selection
select(starts_with("x")) Keep by name pattern Also ends_with, contains
select(where(is.numeric)) Keep by type Combine with across() later
rename(new = old) Rename without dropping Use when selecting would drop too much

References

  1. dplyr package documentation
  2. R for Data Science, Data Transformation
  3. tidyselect helpers reference
  4. dplyr cheat sheet (RStudio)
  5. Tidyverse Style Guide

Continue Learning

{% endraw %}