dplyr filter() and select(): Subset Exactly the Data You Need, Every Time

In dplyr, filter() keeps the rows that match a condition, and select() keeps the columns you name. Together they're the first two verbs you'll reach for in any data analysis, and they replace half a dozen clunky base-R patterns with two clean, composable calls.

By Selva Prabhakaran · Published May 13, 2026 · Last updated May 13, 2026

How do you filter rows with dplyr::filter()?

filter() takes a data frame and one or more conditions. It keeps the rows where every condition evaluates to TRUE. You refer to columns by their bare names, no $, no quotes.

RFilter mtcars rows by mpg

library(dplyr) library(tibble) filter(mtcars, mpg > 25) #> mpg cyl disp hp drat wt qsec vs am gear carb #> Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 #> Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 #> Toyota Corolla 33.9 4 71.1 79 4.22 1.835 19.90 1 1 4 1 #> Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 #> Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 #> Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2

Compare that to base R: mtcars[mtcars$mpg > 25, ]. The dplyr version is shorter, and it composes cleanly with the pipe. The real wins show up with multiple conditions.

RFilter with two comma conditions

mtcars |> filter(mpg > 20, cyl == 4) #> mpg cyl disp hp drat wt qsec vs am gear carb #> Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 #> Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 #> Merc 230 22.8 4 140.8 95 3.92 22.90 0 0 4 2 #> Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 #> ...

Commas between conditions mean "and." Every row must satisfy all of them. This is the single most readable way to write multi-condition filters in R.

Key Insight

Inside filter(), dplyr uses tidy evaluation: column names are bare identifiers. That's why filter(mtcars, mpg > 25) works but filter(mtcars, "mpg" > 25) doesn't, the second is comparing a string to a number.

Try it: Filter mtcars to rows where gear == 4 and carb == 4.

RExercise: Filter by gear and carb

library(dplyr) mtcars |> filter(gear == ___, carb == ___)

Click to reveal solution

RFilter by gear and carb solution

library(dplyr) mtcars |> filter(gear == 4, carb == 4) #> mpg cyl disp hp drat wt qsec vs am gear carb #> Mazda RX4 21.0 6 160.0 110 3.90 2.62 16.46 0 1 4 4 #> Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.88 17.02 0 1 4 4 #> Merc 280 19.2 6 167.6 123 3.92 3.44 18.30 1 0 4 4 #> Merc 280C 17.8 6 167.6 123 3.92 3.44 18.90 1 0 4 4 #> Volvo 142E 21.4 4 121.0 109 4.11 2.78 18.60 1 1 4 2

Each comma-separated condition is ANDed together, so only rows with both gear == 4 and carb == 4 survive. Both column names resolve by tidy evaluation inside filter(), which is why you don't need mtcars$gear, dplyr scopes the names to the data frame on the left of the pipe.

How do you combine filter conditions with `&`, `|`, and `!`?

Commas mean "and", that's the common case. For "or" and "not," use the standard logical operators: | for or, ! for not.

ROR, NOT, and in combinations

# OR, either condition matches mtcars |> filter(cyl == 8 | mpg > 30) # NOT, exclude a group mtcars |> filter(!(cyl == 8)) # Complex, (cyl 4 or 6) AND high mpg mtcars |> filter(cyl %in% c(4, 6), mpg > 25)

The %in% operator is your friend for "is this value one of these values" checks, cleaner than chaining == with |.

Rbetween and strdetect filters

# Between, numeric range mtcars |> filter(between(hp, 100, 150)) # String matching with stringr::str_detect() library(stringr) starwars |> filter(str_detect(name, "Luke")) #> # A tibble: 1 x 14 #> name height mass ... #> 1 Luke Skywalker 172 77 ...

Warning

Don't use && or || inside filter(). Those are scalar operators, they return one value even if you give them vectors, and they'll silently filter wrong. Always use the vectorized & and |.

Try it: Filter to rows where am == 1 OR gear == 5.

RExercise: am or gear filter

mtcars |> filter(am == 1 | gear == ___)

Click to reveal solution

Ram or gear filter solution

library(dplyr) mtcars |> filter(am == 1 | gear == 5) |> nrow() #> [1] 13

| is the vectorised OR, each row is kept if either side is TRUE. Because every manual car (am == 1) also has gear of 4 or 5, the OR mostly matches the same rows as am == 1, giving 13 cars. Don't swap | for ||: || is the scalar shortcut used in if statements and it'll only look at the first row.

How do you handle NA in filters?

NA propagates through comparisons: NA > 5 is NA, not FALSE. filter() drops rows where the condition is NA, safer than base R, which sometimes returns mystery NA rows. But you still need is.na() to explicitly select missing rows.

RFilter around NA safely

df <- tibble(x = c(1, 2, NA, 4, NA)) df |> filter(x > 2) #> # A tibble: 1 x 1 #> x #> <dbl> #> 1 4 df |> filter(is.na(x)) #> # A tibble: 2 x 1 #> x #> <dbl> #> 1 NA #> 2 NA df |> filter(!is.na(x)) #> # A tibble: 3 x 1 #> x #> <dbl> #> 1 1 #> 2 2 #> 3 4

!is.na(x) is the idiomatic "keep the non-missing rows" filter. You'll write it in nearly every analysis that loads real data.

Try it: On starwars, keep only rows where mass is not NA.

RExercise: Drop NA mass rows

starwars |> filter(!is.na(___))

Click to reveal solution

RDrop NA mass solution

library(dplyr) starwars |> filter(!is.na(mass)) |> nrow() #> [1] 59

!is.na(mass) is a logical vector that's TRUE wherever mass has a value, and filter() keeps exactly those rows. Of the 87 characters in starwars, 59 have a recorded mass, the other 28 drop out. This pattern is the cleanest way to guarantee downstream numeric work won't choke on NAs.

How do you pick columns with select()?

select() keeps (or drops) columns by name. The simplest form lists the columns you want:

RPick columns with select

mtcars |> select(mpg, cyl, hp) |> head(3) #> mpg cyl hp #> Mazda RX4 21 6 110 #> Mazda RX4 Wag 21 6 110 #> Datsun 710 22.8 4 93

A minus sign drops columns instead:

RDrop columns with minus sign

mtcars |> select(-vs, -am, -gear, -carb) |> head(3) #> mpg cyl disp hp drat wt qsec #> Mazda RX4 21 6 160 110 3.90 2.620 16.46 #> Mazda RX4 Wag 21 6 160 110 3.90 2.875 17.02 #> Datsun 710 22.8 4 108 93 3.85 2.320 18.61

Ranges work with :, the colon operator picks every column between two names inclusive:

RSelect column ranges with colon

mtcars |> select(mpg:drat) |> head(3) #> mpg cyl disp hp drat #> Mazda RX4 21 6 160 110 3.90 #> Mazda RX4 Wag 21 6 160 110 3.90 #> Datsun 710 22.8 4 108 93 3.85

Try it: Select just mpg, wt, and hp from mtcars.

RExercise: Pick three columns

mtcars |> select(___, ___, ___) |> head()

Click to reveal solution

RPick three columns solution

library(dplyr) mtcars |> select(mpg, wt, hp) |> head() #> mpg wt hp #> Mazda RX4 21.0 2.620 110 #> Mazda RX4 Wag 21.0 2.875 110 #> Datsun 710 22.8 2.320 93 #> Hornet 4 Drive 21.4 3.215 110 #> Hornet Sportabout 18.7 3.440 175 #> Valiant 18.1 3.460 105

select() keeps columns in the order you list them, so you can reorder while subsetting, the output shows mpg first, then wt, then hp, even though the original data frame has wt after hp. Row names (Mazda RX4, ...) travel with the rows.

What are the column selection helpers (starts_with, ends_with, contains)?

Typing column names gets old fast when tables have 50+ columns. dplyr's tidyselect helpers let you pick by pattern.

Rstartswith, endswith, contains helpers

iris |> select(starts_with("Sepal")) |> head(3) #> Sepal.Length Sepal.Width #> 1 5.1 3.5 #> 2 4.9 3.0 #> 3 4.7 3.2 iris |> select(ends_with("Width")) |> head(3) #> Sepal.Width Petal.Width #> 1 3.5 0.2 #> 2 3.0 0.2 #> 3 3.2 0.2 iris |> select(contains("eng")) |> head(3) #> Sepal.Length Petal.Length #> 1 5.1 1.4 #> 2 4.9 1.4 #> 3 4.7 1.3

And where() lets you pick by column type:

Rselect with where helper

iris |> select(where(is.numeric)) |> head(3) #> Sepal.Length Sepal.Width Petal.Length Petal.Width #> 1 5.1 3.5 1.4 0.2 #> 2 4.9 3.0 1.4 0.2 #> 3 4.7 3.2 1.3 0.2

Tip

select(where(is.numeric)) plus summarise(across(everything(), mean)) is how you compute summary stats on every numeric column in one line. Memorize this pattern, you'll use it constantly.

Try it: From iris, select all columns whose name starts with "Petal".

RExercise: startswith variant

iris |> select(starts_with("___")) |> head()

Click to reveal solution

Rstartswith Petal solution

library(dplyr) iris |> select(starts_with("Petal")) |> head() #> Petal.Length Petal.Width #> 1 1.4 0.2 #> 2 1.4 0.2 #> 3 1.3 0.2 #> 4 1.5 0.2 #> 5 1.4 0.2 #> 6 1.7 0.4

starts_with("Petal") matches any column name that begins with Petal, so Petal.Length and Petal.Width are kept and the three Sepal/Species columns are dropped. tidyselect helpers like this keep working as new columns are added, so you don't have to update your select call when the schema grows.

How do you rename columns with select() and rename()?

select() can rename columns inline: the syntax is new_name = old_name. If you only want to rename without dropping anything, use rename().

RInline rename in select and rename

mtcars |> select(miles_per_gallon = mpg, cylinders = cyl, horsepower = hp) |> head(3) #> miles_per_gallon cylinders horsepower #> Mazda RX4 21.0 6 110 #> Mazda RX4 Wag 21.0 6 110 #> Datsun 710 22.8 4 93 mtcars |> rename(miles_per_gallon = mpg) |> head(3) |> colnames() #> [1] "miles_per_gallon" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" #> [10] "gear" "carb"

rename() keeps every other column; select() drops anything you didn't list.

Try it: Rename wt to weight in mtcars.

RExercise: Rename wt to weight

mtcars |> rename(weight = ___) |> head()

Click to reveal solution

RRename wt to weight solution

library(dplyr) mtcars |> rename(weight = wt) |> head() #> mpg cyl disp hp drat weight qsec vs am gear carb #> Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 #> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 #> Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 #> ...

rename() takes new_name = old_name pairs and changes the name in place, every other column stays put. Unlike select(weight = wt), you don't have to list every column you want to keep, which is exactly what you want when you only need to retitle one thing.

How do you combine filter() and select() in a pipeline?

The pair composes naturally. Filter first (reduce rows), then select (reduce columns), or vice versa, it doesn't affect the result but it can affect memory for huge tables.

Rfilter then select then arrange

mtcars |> filter(cyl == 4, mpg > 25) |> select(mpg, wt, hp, gear) |> arrange(desc(mpg)) #> mpg wt hp gear #> Toyota Corolla 33.9 1.835 79 4 #> Fiat 128 32.4 2.200 66 4 #> Honda Civic 30.4 1.615 52 4 #> Lotus Europa 30.4 1.513 113 5 #> Fiat X1-9 27.3 1.935 66 4 #> Porsche 914-2 26.0 2.140 91 5

Three verbs, four lines, and you've answered "which 4-cylinder cars have the best mileage, showing just the relevant columns." That's dplyr at its best.

Note

The order filter() |> select() is the conventional one, keep it even when either order works. Consistency makes pipelines faster to read for the next person (or future you).

Try it: From mtcars, filter to gear == 4, then keep only mpg, hp, and wt.

RExercise: Gear filter plus columns

mtcars |> filter(gear == ___) |> select(mpg, hp, ___)

Click to reveal solution

RGear filter plus columns solution

library(dplyr) mtcars |> filter(gear == 4) |> select(mpg, hp, wt) #> mpg hp wt #> Mazda RX4 21.0 110 2.620 #> Mazda RX4 Wag 21.0 110 2.875 #> Datsun 710 22.8 93 2.320 #> Fiat 128 32.4 66 2.200 #> Honda Civic 30.4 52 1.615 #> ...

The pipe hands the filtered data frame to select(), so the column pick operates only on the 12 four-gear cars, not the full 32 rows. Doing filter() before select() is the conventional order because it usually shrinks the row count first, which is cheaper and matches how you'd describe the query in English.

Practice Exercises

Exercise 1: Top fuel-efficient manuals

From mtcars, return the names and mpg of manual-transmission cars (am == 1) with mpg above the overall median.

Show solution

RAbove-median manual cars solution

library(dplyr) med <- median(mtcars$mpg) mtcars |> tibble::rownames_to_column("model") |> filter(am == 1, mpg > med) |> select(model, mpg) |> arrange(desc(mpg))

Exercise 2: Select by type and pattern

From iris, select every numeric column whose name contains "Length".

Show solution

RNumeric Length columns solution

iris |> select(where(is.numeric) & contains("Length")) #> Sepal.Length Petal.Length #> 1 5.1 1.4 #> ...

Exercise 3: Exclude and rename

From mtcars, drop the vs, am, and carb columns, then rename mpg to miles_per_gallon.

Show solution

RDrop and rename solution

mtcars |> select(-vs, -am, -carb) |> rename(miles_per_gallon = mpg) |> head()

Putting It All Together

A complete mini-analysis on starwars: find the tallest human characters with known homeworlds, keeping only the columns we care about.

REnd-to-end tallest humans pipeline

library(dplyr) starwars |> filter( species == "Human", !is.na(height), !is.na(homeworld) ) |> select(name, height, mass, homeworld, gender) |> arrange(desc(height)) |> head(5) #> # A tibble: 5 x 5 #> name height mass homeworld gender #> <chr> <int> <dbl> <chr> <chr> #> 1 Darth Vader 202 136 Tatooine masculine #> 2 Qui-Gon Jinn 193 89 unknown masculine #> 3 Dooku 193 80 Serenno masculine #> 4 Bail Prestor Organa 191 NA Alderaan masculine #> 5 Anakin Skywalker 188 84 Tatooine masculine

Three filters (species, no NA height, no NA homeworld), a column selection, a sort, a top-5. Every line reads top-to-bottom as one thought.

Summary

Verb	What it does	Key pattern
`filter(cond, cond, ...)`	Keep rows where all conditions are `TRUE`	`filter(df, x > 5, y == "a")`
`filter(cond \	cond)`	Keep rows matching any condition	`filter(df, x > 5 \	y == "a")`
`filter(!is.na(x))`	Drop missing values in a column	Essential cleanup idiom
`select(a, b, c)`	Keep named columns	Explicit, clearest form
`select(-x, -y)`	Drop named columns	Inverse selection
`select(starts_with("x"))`	Keep by name pattern	Also `ends_with`, `contains`
`select(where(is.numeric))`	Keep by type	Combine with `across()` later
`rename(new = old)`	Rename without dropping	Use when selecting would drop too much

References

Continue Learning

dplyr mutate(): Create New Columns, the natural next verb after filter/select.
dplyr group_by() + summarise(), aggregate filtered data.
R Pipe Operator: %>% vs |>, the glue that connects dplyr verbs.

Navigate

Tidyverse packages

Deep dives

Wrangling & EDA

Statistics

Machine Learning

Time Series

By Industry

Reporting & Apps

Levels

dplyr filter() and select(): Subset Exactly the Data You Need, Every Time

How do you filter rows with dplyr::filter()?

How do you combine filter conditions with `&`, `|`, and `!`?

How do you handle NA in filters?

How do you pick columns with select()?

What are the column selection helpers (starts_with, ends_with, contains)?

How do you rename columns with select() and rename()?

How do you combine filter() and select() in a pipeline?

Practice Exercises

Exercise 1: Top fuel-efficient manuals

Exercise 2: Select by type and pattern

Exercise 3: Exclude and rename

Putting It All Together

Summary

References

Continue Learning

Further Reading

Navigate

Tidyverse packages

Deep dives

Wrangling & EDA

Statistics

Machine Learning

Time Series

By Industry

Reporting & Apps

Levels

dplyr filter() and select(): Subset Exactly the Data You Need, Every Time

How do you filter rows with dplyr::filter()?

How do you combine filter conditions with &, |, and !?

How do you handle NA in filters?

How do you pick columns with select()?

What are the column selection helpers (starts_with, ends_with, contains)?

How do you rename columns with select() and rename()?

How do you combine filter() and select() in a pipeline?

Practice Exercises

Exercise 1: Top fuel-efficient manuals

Exercise 2: Select by type and pattern

Exercise 3: Exclude and rename

Putting It All Together

Summary

References

Continue Learning

Further Reading

Related Tutorials

How do you combine filter conditions with `&`, `|`, and `!`?