Missing Values in R: Detect, Count, Remove, and Impute NA — Complete Playbook
Missing values in R are represented by NA, a special logical constant that silently propagates through arithmetic and comparisons. You handle them with is.na(), complete.cases(), na.omit(), and imputation methods like median replacement or mice.
A single NA can turn a clean mean() into NA and mask a real problem across your entire analysis. This post teaches you to detect, count, remove, and impute missing values with runnable examples, plus when to pick each approach.
Introduction
Real data is messy. Respondents skip questions. Sensors drop readings. Joins leave gaps. R marks every such hole with NA, and the value spreads through calculations unless you handle it. If you skip this handling, your summary statistics lie quietly, and your models either crash or learn from biased data.
R actually has four typed NAs: NA (logical), NA_integer_, NA_real_, and NA_character_. You rarely need to pick one by hand because R coerces between them. What matters is knowing how NA behaves and what to do about it.

Figure 1: NA silently propagates through arithmetic, comparison, and aggregation.
In this tutorial you will detect missing values, count them per column, remove them two different ways, and fill them in when removal would waste data. Every code block runs in your browser, and variables carry over between blocks like a notebook. Click Run on the first block, then work top to bottom.
What does NA mean in R, and why does it spread?
NA means "we do not know this value." Because the value is unknown, any calculation that touches it is also unknown. This is why 5 + NA returns NA instead of a guess. It is also why NA == NA returns NA, not TRUE. Two unknowns might be the same or might not be, so R refuses to commit.
This propagation is a feature, not a bug. It forces you to notice missingness instead of burying it inside a sum. Most aggregation functions accept na.rm = TRUE to opt out, but the default is strict on purpose.
Let's load the built-in airquality dataset, which contains real missing values in its Ozone and Solar.R columns, and watch NA propagate.
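The runnable block for this step would look something like the following sketch (the helper name `aq` is our choice, not fixed by the text):

```r
# airquality ships with base R: 153 daily readings from New York, 1973
aq <- airquality

mean(aq$Ozone)                # NA, because 37 days are missing
mean(aq$Ozone, na.rm = TRUE)  # ~42.1 after dropping the NAs

5 + NA    # NA
NA == NA  # NA
```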
The first mean() call returns NA because Ozone contains missing days. Adding na.rm = TRUE drops the NAs before averaging and gives you a real number. Notice that without na.rm, R does not warn you, it just returns NA. If you piped that NA into a plot or a model, you would spend an hour debugging a silent failure.
When you add na.rm = TRUE, you are making a modeling decision. Do it knowingly.
Here is how NA compares to R's other special values:
| Value | Means | Example |
|---|---|---|
| NA | Unknown / missing | Missing survey answer |
| NULL | Does not exist | An empty list element |
| NaN | Not a number | 0 / 0 |
| Inf / -Inf | Infinity | 1 / 0 |
NULL vanishes from the vector entirely because it represents absence of the slot, not an unknown value. NaN is technically a missing number, so is.na(NaN) returns TRUE. is.na(NULL) returns an empty logical, not FALSE, because there is no element to test.
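A quick sketch of those three behaviors:

```r
c(1, NA, NULL, NaN)  # length 3: NULL leaves no slot behind
is.na(NaN)           # TRUE, NaN counts as a missing number
is.na(NULL)          # logical(0), there is no element to test
```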
How do you detect NA values in R?
The rule of thumb is simple: never use == to test for NA. Use is.na(). The == operator compares two known values, and NA is not a known value, so x == NA always returns NA, never TRUE.
is.na() returns a logical vector the same shape as its input. TRUE marks a missing position, FALSE marks a present one. For a quick yes/no check across a whole object, use anyNA(), which short-circuits as soon as it finds one NA.
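A short sketch of both checks (`x` and `aq` are illustrative names):

```r
aq <- airquality
x <- c(10, NA, 30)

x == NA         # NA NA NA, never TRUE
is.na(x)        # FALSE TRUE FALSE
anyNA(aq)       # TRUE, there is an NA somewhere in the frame
anyNA(aq$Wind)  # FALSE, Wind is complete
```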
The x == NA call fails silently, returning a vector of NAs that looks plausible but tells you nothing. The is.na(x) call returns the correct mask. On the full data frame, anyNA(aq) tells you NAs exist somewhere, and anyNA(aq$Wind) confirms the Wind column has none.
How do you count missing values per column?
For data frames, a per-column NA count is almost always more useful than a total. It tells you which columns need attention. The pattern is colSums(is.na(df)), which adds up the TRUE values in each column's NA mask.
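Applied to airquality, the pattern looks like this:

```r
colSums(is.na(airquality))
#   Ozone Solar.R    Wind    Temp   Month     Day
#      37       7       0       0       0       0
```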
Ozone has 37 missing days out of 153 rows, about 24% of the column. Solar.R is missing 7 days, under 5%. Wind, Temp, Month, and Day are complete. That single table tells you Ozone is the problem column and Solar.R is light enough to drop or fill with little cost.
You can do the same job in dplyr, which fits nicely into a tidyverse pipeline:
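A sketch of the dplyr version (assumes the dplyr package is installed):

```r
library(dplyr)

airquality %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
# One row: Ozone 37, Solar.R 7, all other columns 0
```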
The across(everything(), ...) call applies the NA-counting function to every column. The result is a one-row data frame, which you can pivot longer if you need a tall report. Either form works; pick the one that matches the rest of your code.
How do you remove rows with NA values?
Three tools handle removal, and they differ in how much control they give you.
na.omit(df) drops every row that has at least one NA anywhere. It is the blunt instrument. complete.cases(df) returns a logical mask, which you can combine with other filters. tidyr::drop_na() lets you name specific columns, so you only drop rows where those columns are missing.
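The three tools, sketched side by side (assumes the tidyr package is installed):

```r
library(tidyr)
aq <- airquality

nrow(na.omit(aq))               # 111: any NA anywhere drops the row
nrow(aq[complete.cases(aq), ])  # 111: same rows, but via a reusable mask
nrow(drop_na(aq, Ozone))        # 116: only Ozone-missing rows dropped
```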
na.omit() dropped 42 rows because it requires every column to be present. drop_na(Ozone) only dropped the 37 Ozone-missing rows, keeping 5 rows that have a Solar.R NA but a present Ozone. The two give you different datasets and different downstream conclusions.
When should you impute instead of dropping?
Dropping rows is cheap to implement but expensive in information. Each dropped row takes all its other columns with it. If you are losing more than a few percent of your data, impute instead. Imputation fills NAs with plausible values so the rest of the row stays in your analysis.
The right imputation method depends on why the values are missing. Statisticians group missingness into three mechanisms.

Figure 2: Three missingness mechanisms: MCAR, MAR, and MNAR.
MCAR (Missing Completely At Random) means the missingness does not depend on anything. A sensor fails randomly. You can drop or impute safely. MAR (Missing At Random) means missingness depends on observed data. Richer respondents hide income more often, but you can see who is rich. You can impute using the other variables. MNAR (Missing Not At Random) means missingness depends on the missing value itself. People with very high income hide it. No standard imputation is correct, and you usually need to model the missingness process.
Use a simple rule of thumb for which strategy to apply:

Figure 3: Decision tree for choosing drop, impute, or flag.
For most analyses, a median imputation is a reasonable starting point for numeric columns. It is robust to outliers and preserves the central tendency of the column. Here is how you apply it in dplyr:
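A sketch of the median fill (`aq_imputed` is an illustrative name; assumes dplyr):

```r
library(dplyr)

aq_imputed <- airquality %>%
  mutate(Ozone = ifelse(is.na(Ozone),
                        median(Ozone, na.rm = TRUE),  # 31.5
                        Ozone))

mean(airquality$Ozone, na.rm = TRUE)  # ~42.1 before
mean(aq_imputed$Ozone)                # ~39.6 after
```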
The Ozone mean dropped from 42.1 to about 39.6 after imputation because Ozone is right-skewed, so its median (31.5) sits well below its mean. This is the trade-off of median imputation: it preserves the row count but shrinks variance and can bias summaries. For a quick analysis it is fine. For a published model, consider multiple imputation.
The mice package generates several plausible datasets, fits your model to each, and pools the results to honor the uncertainty introduced by imputation. It is not pre-compiled for the in-browser code runner used on this page, so run it in your local R session.
Complete Example: Cleaning airquality end-to-end
Let's apply the full playbook. We audit the data, decide on a strategy, and produce a cleaned frame ready for modeling.
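One way to sketch the full pipeline (`aq_clean` is an illustrative name; assumes dplyr and tidyr):

```r
library(dplyr)
library(tidyr)

aq_clean <- airquality %>%
  # Solar.R is under 5% missing: cheap to drop
  drop_na(Solar.R) %>%
  # Ozone is ~24% missing: impute with the median instead
  mutate(Ozone = ifelse(is.na(Ozone),
                        median(Ozone, na.rm = TRUE),
                        Ozone))

nrow(aq_clean)   # 146
anyNA(aq_clean)  # FALSE
```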
The pipeline produced a frame with 146 rows (7 dropped for Solar.R) and zero NAs. Ozone kept all its 146 rows via imputation rather than losing another 35 to removal. The decisions match the rates: light missingness got dropped, heavy missingness got filled. Always document which columns you imputed and with what method, so a reviewer can challenge the choices.
Common Mistakes and How to Fix Them
Mistake 1: Using x == NA instead of is.na(x)
❌ Wrong:
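A sketch of the anti-pattern, on a small example vector:

```r
x <- c(10, NA, 30)
x == NA  # NA NA NA, no TRUE anywhere
```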
Why it is wrong: x == NA compares to an unknown value, so every comparison returns NA. You get no match and no error.
✅ Correct:
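The same test done properly:

```r
x <- c(10, NA, 30)
is.na(x)         # FALSE TRUE FALSE
which(is.na(x))  # 2
```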
Mistake 2: Calling mean() or sum() without na.rm on data with NAs
❌ Wrong:
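On airquality, the failure looks like this:

```r
mean(airquality$Ozone)  # NA, no warning
```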
Why it is wrong: A single NA turns the result into NA. If you pipe this downstream, everything breaks silently.
✅ Correct:
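The deliberate opt-out:

```r
mean(airquality$Ozone, na.rm = TRUE)  # ~42.1
```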
Mistake 3: Using na.omit() on the whole frame when only one column has NAs
❌ Wrong:
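The blunt version, sketched:

```r
nrow(na.omit(airquality))  # 111: Solar.R gaps cost 5 extra rows
```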
Why it is wrong: You lose rows that were only missing the sparse columns, even though most of their values were present.
✅ Correct:
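The targeted version (assumes tidyr):

```r
library(tidyr)
nrow(drop_na(airquality, Ozone))  # 116
```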
Mistake 4: Treating the character string "NA" as missing
❌ Wrong:
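A sketch of the trap, on an illustrative character vector:

```r
x <- c("12", "NA", "7")
sum(is.na(x))  # 0: "NA" here is a four-character word, not a missing value
```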
Why it is wrong: The string "NA" is a four-character word, not the missing sentinel. is.na() correctly says there are zero missing values.
✅ Correct:
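One fix is to swap the word for the real sentinel; the file name below is illustrative:

```r
x <- c("12", "NA", "7")
x[x == "NA"] <- NA  # replace the string with the real sentinel
as.numeric(x)       # 12 NA 7

# Better: catch it at the source when reading files
# read.csv("data.csv", na.strings = c("NA", ""))
```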
Mistake 5: Imputing before splitting into train and test
❌ Wrong:
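A sketch of the leak (the row split is illustrative, not a recommended splitting scheme):

```r
aq <- airquality
# Median computed on ALL rows, including future test rows
aq$Ozone[is.na(aq$Ozone)] <- median(aq$Ozone, na.rm = TRUE)
train <- aq[1:100, ]
test  <- aq[101:153, ]
```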
Why it is wrong: The median of the combined set uses test rows. Your model sees test information at training time, inflating your reported performance.
✅ Correct: Split first, compute the median on the train set only, then apply that same value to both sets.
Practice Exercises
Exercise 1: Count total missing values in airquality
Write one line that returns the total count of NA values across the whole airquality data frame. Save it to my_total_na.
Click to reveal solution
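One possible solution:

```r
my_total_na <- sum(is.na(airquality))
my_total_na  # 44
```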
Explanation: is.na(airquality) returns a logical matrix of the same shape. sum() treats TRUE as 1, so summing gives the total count.
Exercise 2: Keep only rows that are complete in Ozone and Solar.R
Filter airquality to rows where both Ozone and Solar.R are present. Save the result to my_rows. It should have 111 rows.
Click to reveal solution
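One possible solution (assumes tidyr):

```r
library(tidyr)
my_rows <- drop_na(airquality, Ozone, Solar.R)
nrow(my_rows)  # 111
```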
Explanation: drop_na() with specific column arguments only drops rows where those named columns are NA. Wind, Temp, Month, Day are complete already, so this matches na.omit() here.
Exercise 3: Fill Ozone with its monthly median
For each month, replace missing Ozone values with the median Ozone of that month. Save the result to my_ozone_fill. Verify it has no NAs in Ozone.
Click to reveal solution
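One possible solution (assumes dplyr):

```r
library(dplyr)

my_ozone_fill <- airquality %>%
  group_by(Month) %>%
  mutate(Ozone = ifelse(is.na(Ozone),
                        median(Ozone, na.rm = TRUE),  # per-month median
                        Ozone)) %>%
  ungroup()

sum(is.na(my_ozone_fill$Ozone))  # 0
```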
Explanation: group_by(Month) makes the median computed within each month. Hot months get the hot-month median, cool months get the cool-month median. This is a stronger imputation than a single global median because it preserves monthly seasonality.
Summary
| Task | Function | Returns |
|---|---|---|
| Detect NA | is.na(x) | Logical vector, same shape as x |
| Any NA present? | anyNA(x) | Single TRUE/FALSE |
| Count NAs per column | colSums(is.na(df)) | Named numeric vector |
| Drop all NA rows | na.omit(df) | Data frame, rows with any NA removed |
| Mask complete rows | complete.cases(df) | Logical vector of rows |
| Drop rows NA in chosen cols | tidyr::drop_na(df, col1, col2) | Data frame |
| Simple imputation | ifelse(is.na(x), median(x, na.rm = TRUE), x) | Vector with NAs filled |
| Multiple imputation | mice::mice(df) (local R) | Multiply imputed dataset |
The headline rule: detect with is.na(), count with colSums(is.na(df)), and choose drop vs impute by the missingness rate and mechanism.
FAQ
What is the difference between NA and NULL in R?
NA means a value is unknown. NULL means the value does not exist at all. NA keeps its slot in a vector, NULL vanishes. Use NA for missing survey answers, NULL to remove a list element.
Why does NA == NA return NA and not TRUE?
Because NA means "unknown." Two unknowns might or might not be equal, so R refuses to claim they are equal. Always use is.na() to test for missingness instead of ==.
Should I always use na.rm = TRUE?
No. Use it knowingly. Passing na.rm = TRUE silences missingness, which can hide data-quality problems. Always audit NA rates first, then decide whether removing them is safe for your question.
Is median imputation ever a bad idea?
Yes. It shrinks variance and can bias relationships between variables because every imputed row gets the same value. For predictive models that care about uncertainty or correlation structure, use multiple imputation (mice) instead.
Can I replace NA with 0?
Only if 0 means "none" in your domain, such as "zero items sold." If 0 is a real measurement, replacing NA with 0 invents data and shifts your mean. Use the median, a group median, or mice instead.
References
- R Core Team. An Introduction to R — Missing Values section. Link
- UCLA OARC Statistical Consulting. How does R handle missing values? Link
- van Buuren, S. mice: Multivariate Imputation by Chained Equations (package documentation). Link
- van Buuren, S. Flexible Imputation of Missing Data, 2nd ed. Link
- Wickham, H. & Grolemund, G. R for Data Science (2e) — Chapter on missing values. Link
- tidyr reference: drop_na() documentation. Link
- dplyr reference: across() documentation. Link
What's Next?
- dplyr mutate & rename — Build on the mutate(across(...)) pattern used here for imputation.
- pivot_longer and pivot_wider — Reshape the NA audit table into a long format for reporting.
- R Joins — Joins are a common source of NAs; handling them well starts with this playbook.