dplyr lag() in R: Look at the Previous Row's Value
The lag() function in dplyr returns the value from the row N positions BEFORE the current row, padding with NA at the start. It is the mirror of lead() and the standard tool for "previous-row" comparisons.
lag(x)                                        # previous value (n = 1)
lag(x, n = 2)                                 # 2 rows back
lag(x, default = 0)                           # fill start with 0 instead of NA
lag(x, order_by = ts)                         # respect timestamp order
df |> mutate(prev_val = lag(value))
df |> group_by(g) |> mutate(prev_val = lag(value))
diff(x)                                       # quick first-differences (length n-1)
Need explanation? Read on for examples and pitfalls.
What lag() does in one sentence
lag(x, n = 1, default = NA) returns a vector where each position holds the value n rows BEFORE the current position; the first n positions are filled with default. It is the natural complement to lead().
The most common use case: time-series differencing, "change from previous period" calculations.
Syntax
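lag(x, n = 1, default = NA, order_by = NULL). n is the number of positions to look back; default is the value used to fill the first n positions; order_by is an optional vector that defines the ordering when it differs from row position. In dplyr 1.1.0 and later the documented signature shows default = NULL, which still pads with NA.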
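The arguments can be tried on a bare vector before moving to data frames. This is a minimal sketch; the vectors x, vals, and ts are made up for illustration.

```r
library(dplyr)

x <- c(10, 20, 30, 40)

one_back <- lag(x)               # NA 10 20 30
two_back <- lag(x, n = 2)        # NA NA 10 20
zero_pad <- lag(x, default = 0)  #  0 10 20 30

# order_by: the values are stored out of time order, and ts (a hypothetical
# timestamp vector) says what the real order is.
vals <- c(10, 30, 20)
ts   <- c(1, 3, 2)
by_ts <- lag(vals, order_by = ts)  # NA 20 10
```

With order_by, each result stays aligned with its original position: the element with ts = 3 receives the value recorded at ts = 2.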
x - lag(x) is the canonical first-difference idiom in dplyr. It computes "change from previous row" while keeping the data frame's length intact (the first row gets NA).
Five common patterns
1. Previous-row value
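A minimal sketch of this pattern, using a made-up sales tibble (the column names month and revenue are illustrative, not from any real dataset):

```r
library(dplyr)

sales <- tibble(
  month   = 1:4,
  revenue = c(100, 120, 90, 150)
)

# Add the previous row's revenue next to the current one.
with_prev <- sales |>
  mutate(prev_revenue = lag(revenue))

with_prev$prev_revenue  # NA 100 120 90
```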
2. First-difference (period-over-period change)
The first row has NA because there is no previous period.
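As an illustration, assuming a small hypothetical sales tibble already sorted by month:

```r
library(dplyr)

sales <- tibble(
  month   = 1:4,
  revenue = c(100, 120, 90, 150)
)

# Current value minus previous value; the row count stays at 4.
diffs <- sales |>
  mutate(change = revenue - lag(revenue))

diffs$change  # NA 20 -30 60
```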
3. Percentage change
Standard finance / sales metric.
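One way to write it, sketched on the same kind of made-up sales data (the multiplication by 100 is a presentation choice; drop it to get a proportion):

```r
library(dplyr)

sales <- tibble(
  month   = 1:4,
  revenue = c(100, 120, 90, 150)
)

# (current - previous) / previous, expressed as a percentage.
pct <- sales |>
  mutate(pct_change = (revenue - lag(revenue)) / lag(revenue) * 100)

round(pct$pct_change, 2)  # NA 20 -25 66.67
```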
4. Per-group lag
Each user's first row has NA because there is no previous row within the group.
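A short sketch with a hypothetical visits tibble (user, visit, and spend are invented names):

```r
library(dplyr)

visits <- tibble(
  user  = c("a", "a", "b", "b", "b"),
  visit = c(1, 2, 1, 2, 3),
  spend = c(20, 35, 5, 15, 25)
)

# lag() restarts inside each user's rows.
by_user <- visits |>
  arrange(user, visit) |>
  group_by(user) |>
  mutate(prev_spend = lag(spend)) |>
  ungroup()

by_user$prev_spend  # NA 20 NA 5 15
```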
5. Lag with a custom default
Useful when you want the first row's "previous" to be a baseline instead of NA.
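For example, assuming an account that starts from a baseline of 0 (the balance column is made up for illustration):

```r
library(dplyr)

account <- tibble(
  day     = 1:3,
  balance = c(50, 80, 65)
)

# Treat the balance before day 1 as 0 rather than unknown.
with_baseline <- account |>
  mutate(
    prev_balance = lag(balance, default = 0),
    change       = balance - prev_balance
  )

with_baseline$change  # 50 30 -15
```

The choice of baseline changes the meaning of the first row: with default = 0 the first change is the full opening balance, not "no change".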
arrange() the data BEFORE using lag. lag is purely positional. "Previous row" only makes sense if rows are sorted by time (or another meaningful order). Without sorting, lag returns whatever happens to be the previous physical row, which may be meaningless.
lag() vs lead() vs diff()
Three approaches to "change between rows" in R, with data.table's shift() listed for comparison.
| Function | Output length | Best for |
|---|---|---|
| lag(x) | Same as x | dplyr pipelines; per-group |
| lead(x) | Same as x | "Next row" comparisons |
| diff(x) | n-1 | Quick differencing; not pipeline-friendly |
| data.table::shift(x) | Same as x | Very fast for big data |
When to use which:
- lag for dplyr pipelines and per-group differencing.
- lead for forward-looking comparisons.
- diff for one-shot vector operations (loses one element).
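The length difference is easiest to see side by side. A small sketch on an arbitrary vector:

```r
library(dplyr)

x <- c(5, 8, 6, 10)

prev_x  <- lag(x)      # NA  5  8  6   (length 4)
next_x  <- lead(x)     #  8  6 10 NA   (length 4)
base_d  <- diff(x)     #  3 -2  4      (length 3)
dplyr_d <- x - lag(x)  # NA  3 -2  4   (length 4)
```

Because diff() drops an element, its result cannot be assigned directly as a column inside mutate() without padding, which is why x - lag(x) is the usual choice in a pipeline.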
A practical workflow
The "period-over-period change" pattern is the most common lag use case.
For each symbol's chronological prices, compute daily return. Without lag, this would require a self-join.
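A sketch of that workflow, using invented symbols and closing prices (AAA, BBB, and the close column are placeholders, not real market data):

```r
library(dplyr)

prices <- tibble(
  symbol = c("AAA", "BBB", "AAA", "BBB", "AAA", "BBB"),
  date   = as.Date(c("2024-01-02", "2024-01-02", "2024-01-03",
                     "2024-01-03", "2024-01-04", "2024-01-04")),
  close  = c(100, 50, 110, 45, 99, 54)
)

returns <- prices |>
  arrange(symbol, date) |>   # chronological order within each symbol
  group_by(symbol) |>        # lag() must not cross from one symbol to the next
  mutate(daily_return = close / lag(close) - 1) |>
  ungroup()

# AAA: NA, 0.10, -0.10   BBB: NA, -0.10, 0.20
```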
For multi-period changes:
n = 7 and n = 30 give weekly and monthly changes.
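A minimal sketch, assuming exactly one row per calendar day with no gaps (otherwise n = 7 means "seven rows back", not "one week back"). The daily tibble is synthetic:

```r
library(dplyr)

daily <- tibble(
  day   = seq(as.Date("2024-01-01"), by = "day", length.out = 40),
  value = seq(2, 80, by = 2)   # rises by 2 every day
)

changes <- daily |>
  arrange(day) |>
  mutate(
    change_7d  = value - lag(value, n = 7),   # NA for the first 7 rows
    change_30d = value - lag(value, n = 30)   # NA for the first 30 rows
  )
```

With this synthetic series the weekly change is 14 and the monthly change is 60 wherever they are defined.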
Common pitfalls
Pitfall 1: forgetting to arrange. lag is positional. Without arrange(date_col), "previous row" is whatever happened to be loaded first.
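The effect is easy to reproduce on three rows stored out of order. In this sketch the readings tibble and its columns are made up:

```r
library(dplyr)

readings <- tibble(
  day   = c(3, 1, 2),
  value = c(30, 10, 20)
)

# Positional lag on unsorted rows: day 1 appears to follow day 3.
unsorted <- readings |> mutate(prev = lag(value))
unsorted$prev  # NA 30 10

# Sort first, then lag.
sorted <- readings |> arrange(day) |> mutate(prev = lag(value))
sorted$prev    # NA 10 20

# Alternative: leave the rows where they are and pass the ordering to lag().
in_place <- readings |> mutate(prev = lag(value, order_by = day))
in_place$prev  # 20 NA 10
```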
Pitfall 2: per-group surprise. On grouped tibbles, lag resets at each group's start, so the first row of every group has NA. This is usually what you want, but it can introduce unexpected NAs if the data is still grouped from an earlier step.
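Comparing the grouped and ungrouped results on the same rows makes the reset visible. A sketch with an invented orders tibble:

```r
library(dplyr)

orders <- tibble(
  region  = c("N", "N", "S", "S"),
  quarter = c(1, 2, 1, 2),
  total   = c(10, 12, 7, 9)
)

compared <- orders |>
  mutate(prev_ungrouped = lag(total)) |>   # crosses from region N into region S
  group_by(region) |>
  mutate(prev_grouped = lag(total)) |>     # restarts at each region
  ungroup()

compared$prev_ungrouped  # NA 10 12  7
compared$prev_grouped    # NA 10 NA  7
```

Row 3 is the one to watch: ungrouped, the first S row inherits the last N value; grouped, it gets NA.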
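Pitfall 3: NA propagation. The default fill is NA, which propagates through arithmetic, so x - lag(x) returns NA at the first row. Filter the NA rows downstream or set default if you need a numeric result; note that default = 0 makes the first difference equal to x[1] itself.
Why lag matters for time-series in dplyr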
Without lag, computing changes across rows requires self-joins or manual indexing, both of which break the dplyr pipeline. lag turns "change since yesterday" or "compare to previous quarter" into a single mutate call. For per-group computation (each user, each symbol, each region), pair lag with group_by and arrange. The combination shows up throughout financial, marketing, and operational analytics. Once you internalize the pattern, time-series transforms in R feel as natural as SQL window functions.
Try it yourself
Try it: Compute the day-over-day percentage change in mtcars$mpg (treating row order as time order). Save to ex_pct.
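One possible solution, sketched under the exercise's own assumption that row order stands in for time order. Storing the result as a plain numeric vector is a choice made here; the exercise does not say whether ex_pct should be a vector or a column.

```r
library(dplyr)

mpg <- mtcars$mpg

# Percentage change from the previous row's mpg.
ex_pct <- (mpg - lag(mpg)) / lag(mpg) * 100

head(round(ex_pct, 2), 3)  # NA 0.00 8.57
```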
Explanation: The first row is NA because it has no previous row. Each subsequent row shows the percentage change from the previous mpg value.
Related dplyr functions
After mastering lag, look at:
- lead(): next row's value (mirror)
- first(), last(), nth(): pick specific positions
- cumsum(), cummean(), etc.: cumulative aggregates
- group_by(): per-group window operations
- arrange(): sort before lag/lead
- slider::slide_dbl(): rolling-window operations
For multi-period lags or rolling differences, slider::slide_dbl() generalizes the pattern.
FAQ
What does lag do in dplyr?
lag(x, n = 1) returns a vector where each position holds the value n rows BEFORE the current position. The first n positions are filled with NA (or default).
What is the difference between lag and lead in dplyr?
lag(x) looks at the PREVIOUS row; lead(x) looks at the NEXT row. They are mirror operations: lag pads with NA at the start, lead pads with NA at the end.
How do I compute first-differences with lag?
x - lag(x) gives the change from the previous row to the current. The first position is NA because there is no previous row.
Why does my lag result have NAs at the start?
Because there is no row before the first position. lag fills those slots with NA unless you supply default. Set default = 0 or another value to avoid NA.
How do I lag within groups?
df |> group_by(g) |> mutate(prev = lag(x)). group_by makes lag reset at each group boundary; the first row per group has NA.