dplyr cume_dist() in R: Empirical Cumulative Distribution
The cume_dist() function in dplyr returns the empirical cumulative distribution: the proportion of values less than or equal to each value, in the range (0, 1].
cume_dist(c(10, 20, 30, 40))                      # 0.25, 0.5, 0.75, 1
cume_dist(desc(x))                                # reverse direction
df |> mutate(cd = cume_dist(score))
df |> group_by(g) |> mutate(cd = cume_dist(score))
percent_rank(c(10, 20, 30, 40))                   # 0, 0.33, 0.67, 1 (different)
ecdf(x)(x)                                        # base R equivalent
Need explanation? Read on for examples and pitfalls.
What cume_dist() does in one sentence
cume_dist(x) returns count(values <= x_i) / n for each element. This is the empirical cumulative distribution function (ECDF) at each observation. Values range over (0, 1]; the maximum value always gets 1.
It is the dplyr / pipeline-friendly version of ecdf(x)(x).
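To make the definition concrete, here is a minimal sketch that computes the proportion by hand and compares it with cume_dist() and ecdf(). The vector x is made-up illustration data, and manual is a hypothetical helper name.

```r
library(dplyr)

# Illustration data (assumed, not from the article)
x <- c(10, 20, 20, 40)
n <- length(x)

# For each element, count the values less than or equal to it, then divide by n
manual <- sapply(x, function(xi) sum(x <= xi) / n)

manual          # 0.25 0.75 0.75 1.00
cume_dist(x)    # same values
ecdf(x)(x)      # same values from base R
```

All three agree on this input, including the tied 20s, which both count three values at or below them.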
Syntax
cume_dist(x). NAs stay NA. Output is in (0, 1].
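A small sketch of the NA behavior, on a made-up vector. It assumes a recent dplyr release, where the denominator is the number of non-missing values rather than the full length.

```r
library(dplyr)

# Illustration data with one missing value (assumed)
x_na <- c(5, NA, 15, 10)

cd_na <- cume_dist(x_na)
cd_na
# The NA position stays NA; the other three values are ranked among
# the 3 non-missing observations, so they come out as 1/3, 1, and 2/3.
```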
cume_dist(x) answers "what fraction of all values are AT MOST this one?". Useful for percentile reasoning ("this customer is in the top 5%") and ECDF plots.
Five common patterns
1. Cumulative distribution
With no ties at the minimum, the smallest value gets 1/n; the largest always gets 1.
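A minimal sketch of this pattern, using a hypothetical scores table (the column names student and score are assumptions for illustration):

```r
library(dplyr)

# Hypothetical data: five students, no tied scores
scores <- tibble(
  student = c("a", "b", "c", "d", "e"),
  score   = c(55, 70, 85, 62, 91)
)

# Add the cumulative distribution of score as a new column
scores_cd <- scores |>
  mutate(cd = cume_dist(score))

scores_cd
# cd is 0.2 for 55 (the smallest) and 1.0 for 91 (the largest)
```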
2. Identify top-percentile observations
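One way to sketch this, assuming a hypothetical customers table with 30 evenly spaced spend values:

```r
library(dplyr)

# Hypothetical data: 30 customers with spend 100, 200, ..., 3000
customers <- tibble(
  id    = 1:30,
  spend = seq(100, 3000, by = 100)
)

# Keep customers whose spend is at or above the 95th percentile
top_spenders <- customers |>
  filter(cume_dist(spend) >= 0.95)

top_spenders
# Two rows remain: spend 2900 and 3000 (ranks 29 and 30 of 30)
```

Note that the comparison is inclusive. When n * 0.95 is a whole number, >= keeps one more row than a strict top 5% would, so > may be the better choice if the cutoff must be exclusive.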
3. Per-group cumulative distribution
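A short sketch of the grouped form, on a hypothetical sales table (region and amount are assumed column names):

```r
library(dplyr)

# Hypothetical data: three sales in "east", two in "west"
sales <- tibble(
  region = c("east", "east", "east", "west", "west"),
  amount = c(100, 300, 200, 50, 400)
)

# cume_dist() is evaluated separately inside each region
sales_cd <- sales |>
  group_by(region) |>
  mutate(cd = cume_dist(amount)) |>
  ungroup()

sales_cd
# east: 1/3, 1, 2/3   west: 0.5, 1
```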
4. Compare to percent_rank
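percent_rank = (rank - 1) / (n - 1). cume_dist = rank / n. With ties, percent_rank uses the minimum rank and cume_dist the maximum. percent_rank spans [0, 1] while cume_dist spans (0, 1], so only percent_rank can return 0.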
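The two formulas can be compared side by side on the four-value vector from the top of the article. The column names cd and pr are chosen here for illustration.

```r
library(dplyr)

x <- c(10, 20, 30, 40)

# One row per value, with both relative-position measures
cmp <- tibble(
  x  = x,
  cd = cume_dist(x),     # rank / n
  pr = percent_rank(x)   # (rank - 1) / (n - 1)
)

cmp
# cd: 0.25, 0.50, 0.75, 1.00
# pr: 0.00, 0.33, 0.67, 1.00 (rounded)
```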
5. Build a quantile-binned column
For direct binning, ntile(value, 4) is more concise.
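A possible sketch of the binning pattern: cut the cume_dist() values at quartile boundaries. The data and the labels Q1 to Q4 are assumptions for illustration; base R's cut() uses right-closed intervals by default, which matches the (0, 1] range.

```r
library(dplyr)

# Hypothetical data: eight distinct values
df <- tibble(value = c(3, 8, 1, 9, 5, 7, 2, 6))

binned <- df |>
  mutate(
    cd = cume_dist(value),
    # Intervals are (0, 0.25], (0.25, 0.5], (0.5, 0.75], (0.75, 1]
    quartile = cut(
      cd,
      breaks = c(0, 0.25, 0.5, 0.75, 1),
      labels = c("Q1", "Q2", "Q3", "Q4")
    )
  )

binned
```

On this tie-free input the bins line up with ntile(value, 4). With ties or group sizes not divisible by 4, the two approaches can differ, because ntile() forces near-equal bin sizes while cume_dist() keeps tied values together.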
cume_dist(x) is the empirical CDF evaluated at each observation. It answers "of all values, what proportion are at most this one?". The smallest value gets 1/n; the largest gets exactly 1. This makes it the natural choice for percentile reasoning.
cume_dist() vs percent_rank() vs ntile() vs ecdf()
Four functions for relative-position computation in R.
| Function | Output range | Formula | Best for |
|---|---|---|---|
| cume_dist(x) | (0, 1] | rank / n | ECDF; "at most this fraction" |
| percent_rank(x) | [0, 1] | (rank - 1) / (n - 1) | "Strictly below this fraction" |
| ntile(x, n) | 1 to n | bin index | Quantile binning |
| ecdf(x)(x) | (0, 1] | Same as cume_dist | Base R equivalent |
When to use which:
- cume_dist inside dplyr pipelines.
- ecdf(x)(x) in base R; identical result.
- percent_rank if your convention puts min at 0.
- ntile for direct quantile bins.
A practical workflow
The "percentile flag" pattern is the cume_dist sweet spot.
Adds boolean flags for top and bottom percentiles. Useful for outlier detection and stratified analysis.
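A minimal sketch of the percentile-flag pattern, assuming a hypothetical orders table and 10% cutoffs at each end (the thresholds and column names are illustrative choices, not taken from the article):

```r
library(dplyr)

# Hypothetical data: 12 orders with distinct amounts
orders <- tibble(
  order_id = 1:12,
  amount   = c(12, 45, 7, 88, 23, 150, 31, 5, 64, 19, 99, 40)
)

flagged <- orders |>
  mutate(
    cd        = cume_dist(amount),
    is_top    = cd >= 0.9,   # at or above the 90th percentile
    is_bottom = cd <= 0.1    # at or below the 10th percentile
  )

flagged |> filter(is_top | is_bottom)
# Top flags: amounts 150 and 99; bottom flag: amount 5
```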
For per-segment percentiles:
Each segment's cumulative distribution scales independently.
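The difference between overall and per-segment percentiles can be sketched as follows. The customers table and its segment and revenue columns are hypothetical.

```r
library(dplyr)

# Hypothetical data: two segments on very different revenue scales
customers <- tibble(
  segment = rep(c("retail", "wholesale"), each = 4),
  revenue = c(10, 40, 20, 30, 1000, 4000, 2000, 3000)
)

by_segment <- customers |>
  mutate(cd_overall = cume_dist(revenue)) |>   # ranked against all 8 rows
  group_by(segment) |>
  mutate(
    cd_segment     = cume_dist(revenue),       # ranked within the segment
    top_in_segment = cd_segment > 0.75
  ) |>
  ungroup()

by_segment
# The retail customer with revenue 40 has cd_overall = 0.5
# but cd_segment = 1, so it is flagged as top in its own segment.
```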
Common pitfalls
Pitfall 1: confusing with percent_rank. Both return values in [0, 1] but with different formulas. cume_dist is rank / n (min = 1/n, max = 1). percent_rank is (rank - 1) / (n - 1) (min = 0, max = 1). Pick based on convention.
Pitfall 2: tied values share cume_dist. Three tied values all get the same value (the max rank among them, divided by n).
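A quick illustration of the tie behavior, on a made-up vector with three tied values:

```r
library(dplyr)

# Illustration data: the value 2 appears three times
x_ties <- c(1, 2, 2, 2, 5)

cd_ties <- cume_dist(x_ties)
cd_ties
# 0.2 0.8 0.8 0.8 1.0
# Four of the five values are <= 2, so every 2 gets 4/5

# Equivalent base R view: maximum rank among ties, divided by n
rank(x_ties, ties.method = "max") / length(x_ties)
```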
cume_dist is NOT a kernel-density estimate. It is the EMPIRICAL distribution at observed points only. For smooth density estimation, use density(x) (base R) or geom_density (ggplot2).
Try it yourself
Try it: Mark cars in mtcars as "high mpg" if their cume_dist for mpg is >= 0.75. Save to ex_high_mpg.
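One possible solution, sketched with a logical flag column. The column name high_mpg is an assumption; the exercise only fixes the name ex_high_mpg.

```r
library(dplyr)

# Flag cars whose mpg sits at or above the 0.75 mark of the cumulative distribution
ex_high_mpg <- mtcars |>
  mutate(high_mpg = cume_dist(mpg) >= 0.75)

# Number of flagged cars
sum(ex_high_mpg$high_mpg)
```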
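Explanation: cume_dist(mpg) returns, for each row, the fraction of cars whose mpg is at or below that row's value. The >= 0.75 condition keeps cars ranked 24th or higher out of 32. Because the boundary is inclusive, that is 9 cars (mpg of 22.8 or more), one more than an exact 25% share of 8.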
Related dplyr functions
After mastering cume_dist, look at:
- percent_rank(): alternative formula (different convention)
- ntile(): bin into n equal-count groups
- min_rank() / dense_rank(): integer rank
- quantile(): explicit percentile values
- ecdf(): base R empirical CDF
- ggplot2::stat_ecdf(): visualize ECDF
For percentile-based analysis, cume_dist + filter is the cleanest dplyr idiom.
FAQ
What does cume_dist do in dplyr?
cume_dist(x) returns the empirical cumulative distribution at each value: the proportion of all values that are less than or equal to that value. Output is in (0, 1].
What is the difference between cume_dist and percent_rank?
Different formulas. cume_dist = rank / n. percent_rank = (rank - 1) / (n - 1). percent_rank ranges over [0, 1] and cume_dist over (0, 1], so the same rank position gets a different value under each.
How do I find the top 5% by cume_dist?
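filter(cume_dist(x) >= 0.95) keeps rows at or above the 95th percentile, which is roughly the top 5%. The boundary is inclusive, so use > 0.95 if the cutoff value itself should be excluded.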
Does cume_dist match the ECDF function?
Yes. For non-missing data, cume_dist(x) returns the same values as ecdf(x)(x) in base R. dplyr's version is a window function, so it integrates with mutate, filter, and group_by.
How do I get cume_dist within groups?
df |> group_by(g) |> mutate(cd = cume_dist(value)). Each group has its own cumulative distribution.