furrr Package in R: Parallel purrr with future Backend
The furrr package gives every purrr mapping function a parallel twin: swap map() for future_map(), set a plan(), and your code runs across multiple CPU cores with no other changes.
How do you convert a purrr workflow to furrr?
If your purrr pipeline already works correctly, you're one function swap away from running it in parallel. Let's load furrr and see the difference immediately.
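The original code isn't shown here, so this is a minimal reconstruction of the swap, with the squaring function taken from the output description that follows:

```r
library(furrr)
library(future)

# Two background R sessions handle the work
plan(multisession, workers = 2)

# Identical syntax to purrr::map(), just a different function name
future_map(1:5, ~ .x^2 + 10)
# returns a list: 11, 14, 19, 26, 35
```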
Each element of 1:5 was squared and had 10 added to it. With plan(multisession), these five computations were spread across two worker processes. The output is identical to what purrr::map() would return: a list of results, one per input.
The plan(multisession) call works anywhere but may fall back to sequential execution in constrained environments. Results are identical either way; only the timing differs when you run this on your own machine. The naming convention is simple: take any purrr function and add future_ in front. Here's a type-specific variant that returns a numeric vector instead of a list.
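A sketch of the type-specific variant, using the same toy computation as above:

```r
# _dbl collects the results into a numeric vector instead of a list
future_map_dbl(1:5, ~ .x^2 + 10)
#> [1] 11 14 19 26 35
```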
The results match exactly. Every purrr variant has a furrr counterpart: future_map_chr(), future_map_lgl(), future_map_int(), future_map_dfr(), future_imap(), future_map2(), future_pmap(), and future_walk(). The suffix rules are the same as in purrr: _dbl means "return a double vector," _chr means "return a character vector," and so on.
Try it: Use future_map_chr() to paste "Item-" before each element of c("A", "B", "C").
Click to reveal solution
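One possible solution:

```r
future_map_chr(c("A", "B", "C"), ~ paste0("Item-", .x))
#> [1] "Item-A" "Item-B" "Item-C"
```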
Explanation: paste0() concatenates "Item-" with each element. future_map_chr() enforces that every iteration returns a single character string, collecting them into one character vector.
What does plan() do and which backend should you choose?
The plan() function from the future package tells R how to execute futures, the units of work that furrr creates behind the scenes. Without a plan, everything runs sequentially. With one, your iterations spread across cores.
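A minimal setup sketch, following the worker-count advice described below:

```r
library(future)

n_cores <- availableCores()                 # logical cores on this machine
plan(multisession, workers = n_cores - 1)   # leave one core free
```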
availableCores() detects how many logical CPU cores your machine has. Setting workers = n_cores - 1 reserves one core so your computer stays responsive while R crunches data in the background.
When you're done with parallel work, reset to sequential to release worker processes and free memory.
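Resetting is a one-liner:

```r
plan(sequential)  # shut down workers, return to single-session execution
```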
Calling plan(sequential) after a parallel section releases the worker R sessions. Leaving workers idle wastes memory, especially if each one loaded large datasets. Here's how the three main plans compare:
| Plan | Platform | How it works | Overhead | Best for |
|---|---|---|---|---|
| sequential | All | Same R session, no parallelism | None | Debugging, small tasks |
| multisession | All (Win/Mac/Linux) | Spawns new R sessions via sockets | Medium (data copied) | Cross-platform production code |
| multicore | Linux/Mac only | Forks the current process | Low (shared memory) | Servers and scripts (not RStudio GUI) |
Most users should stick with multisession. It works everywhere, handles packages and globals safely, and the overhead is only noticeable for very lightweight operations.
Try it: Write code that checks how many cores you have, then sets a plan using exactly half of them (rounded down).
Click to reveal solution
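One possible solution:

```r
n_cores <- availableCores()
plan(multisession, workers = floor(n_cores / 2))
```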
Explanation: floor() rounds down so you always get a whole number of workers. The plan() call immediately reconfigures how furrr dispatches work.
How do future_map2() and future_pmap() handle multiple inputs?
When your function needs two inputs per iteration, use future_map2(). When it needs three or more, use future_pmap(). These mirror purrr's map2() and pmap() exactly.
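A sketch of the two-input case; the first two (w, v) pairs come from the description below, and the third is illustrative:

```r
weights <- c(0.3, 0.5, 0.2)
values  <- c(100, 200, 300)

# One result per (w, v) pair
future_map2_dbl(weights, values, \(w, v) w * v)
#> [1]  30 100  60
```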
future_map2_dbl() walks down both vectors in lockstep: the first iteration gets w = 0.3 and v = 100, the second gets w = 0.5 and v = 200, and so on. The _dbl suffix guarantees a numeric vector back.
For three or more inputs, pass them as a data frame (or named list) to future_pmap().
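A hypothetical parameter grid, with column names chosen to match rnorm()'s argument names:

```r
params <- data.frame(
  n    = c(5, 10, 15),
  mean = c(0, 50, 100),
  sd   = c(1, 5, 10)
)

# Each row becomes one call: rnorm(n = ..., mean = ..., sd = ...)
sims <- future_pmap(params, rnorm,
                    .options = furrr_options(seed = TRUE))
lengths(sims)
#> [1]  5 10 15
```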
Each row of the data frame becomes one function call. The column names must match the function's argument names. This is especially powerful for parameter sweeps and simulation grids.
Try it: Use future_map2_chr() to paste first and last names together from two vectors.
Click to reveal solution
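One possible solution (the names are illustrative):

```r
first <- c("Ada", "Grace")
last  <- c("Lovelace", "Hopper")

future_map2_chr(first, last, ~ paste(.x, .y))
#> [1] "Ada Lovelace" "Grace Hopper"
```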
Explanation: paste() joins two strings with a space by default. future_map2_chr() walks both vectors in parallel and collects the character results.
How do you control seeds, globals, and chunking with furrr_options()?
When your parallel code involves randomness, you need reproducible results. When it references large objects from your environment, you need to control what gets shipped to workers. furrr_options() handles both.
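A sketch of a seeded parallel draw; the mean-of-normals computation is illustrative:

```r
plan(multisession, workers = 2)

future_map_dbl(
  1:4,
  ~ mean(rnorm(100)),
  .options = furrr_options(seed = 123)
)
# Produces the same four numbers with workers = 2 or workers = 4
```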
With seed = 123, furrr generates a parallel-safe RNG stream (L'Ecuyer-CMRG) that produces identical results regardless of how many workers you use. Change workers = 2 to workers = 4 and the numbers stay the same.
Whenever you call rnorm(), sample(), or any other random function inside future_map(), always pass furrr_options(seed = <integer>); otherwise your analysis won't be reproducible. You can also control which variables get sent to workers with the globals argument.
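A sketch of explicit globals, using the my_lookup object mentioned below:

```r
my_lookup <- c(a = 1, b = 2, c = 3)

future_map_dbl(
  c("a", "b", "c"),
  ~ my_lookup[[.x]] * 10,
  .options = furrr_options(globals = "my_lookup")  # ship only this object
)
#> [1] 10 20 30
```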
By default, furrr auto-detects globals (variables your function references from the parent environment). The explicit globals = "my_lookup" is useful when auto-detection picks up too much, say a 2 GB data frame your function doesn't actually need.
Here's a quick reference for all furrr_options() parameters:
| Parameter | Type | Default | Purpose |
|---|---|---|---|
| seed | Integer/TRUE/FALSE | FALSE | Reproducible parallel RNG. Set to an integer for fixed results. |
| globals | TRUE/char vec/list | TRUE (auto) | Which variables to send to workers. |
| packages | Character vector | NULL | Packages to attach on each worker. |
| scheduling | Integer/Inf | 1 | Futures per worker. 1 = balanced. Inf = one per element. |
| chunk_size | Integer/NULL | NULL | Elements per chunk. Overrides scheduling. |
| stdout | Logical | TRUE | Relay cat()/print() output from workers. |
| conditions | Character | All | Which conditions (messages/warnings) to relay. |
Try it: Use future_map_dbl() with furrr_options(seed = 42) to draw one random uniform value (runif(1)) per iteration across 5 iterations. Verify you get the same result if you run it twice.
Click to reveal solution
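One possible solution:

```r
draws1 <- future_map_dbl(1:5, ~ runif(1),
                         .options = furrr_options(seed = 42))
draws2 <- future_map_dbl(1:5, ~ runif(1),
                         .options = furrr_options(seed = 42))

identical(draws1, draws2)
#> [1] TRUE
```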
Explanation: furrr_options(seed = 42) creates a parallel-safe L'Ecuyer-CMRG RNG stream. Each iteration gets its own sub-stream, so results are deterministic regardless of worker count.
When is furrr slower than purrr and how do you avoid the overhead trap?
Parallel processing isn't free. Every plan(multisession) call spawns separate R sessions, and every future_map() call serializes your data, ships it to workers, runs the function, and ships results back. For lightweight operations, this overhead dwarfs the computation.
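A benchmark sketch of the lightweight case; exact timings, and the size of the slowdown, will vary by machine:

```r
library(furrr)
plan(multisession, workers = 2)

x <- 1:1000

system.time(purrr::map_dbl(x, sqrt))   # sequential: near-instant
system.time(future_map_dbl(x, sqrt))   # parallel: pays startup + transfer cost
```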
The parallel version is ~17x slower for this trivial task. Spawning workers and moving 1000 integers back and forth costs more than the computation itself.
Now compare with a heavier task where each iteration does real work.
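A sketch of the heavier bootstrap task described below; the resampling details are illustrative:

```r
boot_mean <- function(i) {
  x <- rnorm(10000)                # simulate 10,000 observations
  mean(sample(x, replace = TRUE))  # one bootstrap resample mean
}

opts <- furrr_options(seed = TRUE)  # parallel-safe RNG on the workers

system.time(purrr::map_dbl(1:20, boot_mean))                   # sequential
system.time(future_map_dbl(1:20, boot_mean, .options = opts))  # parallel
```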
With 20 bootstrap resamples of 10,000 observations each, furrr starts pulling ahead. The heavier the per-iteration work, the bigger the win.
Always benchmark with system.time() and compare elapsed times. If the parallel version is slower, your iterations are too lightweight; keep them sequential. Here are the rules of thumb for when parallelizing pays off:
- Each iteration takes more than ~100 milliseconds: model fitting, simulation, resampling, heavy I/O
- You have more than ~10 iterations: too few iterations can't amortize the startup cost
- Data per iteration is small to medium: serializing a 1 GB data frame to each worker kills the speed gain
- The function has no side effects: parallel workers can't reliably write to the same file or modify shared state
Try it: Predict which task benefits more from parallelization: (A) computing sqrt() on 500 numbers, or (B) running lm() on 50 subsets of 1000 rows. No code needed, just reason about iteration weight.
Click to reveal solution
Explanation: Task B benefits more. The overhead of spawning workers and moving data dwarfs sqrt(), but lm() does enough computation per call that 50 iterations across 4 workers finish meaningfully faster.
Practice Exercises
Exercise 1: Summarise multiple data frames in parallel
You have a list of 5 data frames, each containing numeric columns. Use future_map() to compute column means for each data frame, then combine the results into a single summary data frame.
Click to reveal solution
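One possible solution; the list of data frames is simulated for illustration:

```r
library(furrr)
plan(multisession, workers = 2)

# Five data frames with numeric columns
dfs <- replicate(
  5,
  data.frame(x = rnorm(20), y = runif(20)),
  simplify = FALSE
)

summary_df <- future_map_dfr(
  seq_along(dfs),
  \(i) data.frame(df_id = i, t(colMeans(dfs[[i]])))
)
summary_df   # one row per data frame: df_id, x, y
```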
Explanation: future_map_dfr() runs the function on each index in parallel and row-binds the resulting data frames. Each iteration extracts one data frame from the list, computes means, and returns a one-row summary.
Exercise 2: Reproducible Monte Carlo simulation with future_pmap()
Build a parameter grid with 4 scenarios (varying n, mean, and sd). Use future_pmap_dfr() with furrr_options(seed = 100) to draw n random normal values per scenario, compute the sample mean and standard deviation, and return a results data frame. Run the pipeline twice and verify the results are identical.
Click to reveal solution
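One possible solution; the grid values are illustrative:

```r
library(furrr)
plan(multisession, workers = 2)

my_grid <- data.frame(
  n    = c(50, 100, 200, 400),
  mean = c(0, 0, 5, 5),
  sd   = c(1, 2, 1, 2)
)

run_sim <- function() {
  future_pmap_dfr(
    my_grid,
    \(n, mean, sd) {
      x <- rnorm(n, mean, sd)
      # base::/stats:: prefixes avoid confusion with the shadowing parameters
      data.frame(n = n, mean = mean, sd = sd,
                 sample_mean = base::mean(x),
                 sample_sd   = stats::sd(x))
    },
    .options = furrr_options(seed = 100)
  )
}

identical(run_sim(), run_sim())
#> [1] TRUE
```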
Explanation: future_pmap_dfr() iterates over each row of my_grid, passing n, mean, and sd as named arguments. The seed = 100 option ensures L'Ecuyer-CMRG streams are identical across runs, making the random draws reproducible even across different worker counts.
Complete Example
Let's put everything together: a parameter-sweep simulation that runs across cores, uses seed control for reproducibility, and summarizes results with dplyr.
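A sketch of such a pipeline, under assumptions matching the summary below (true mean 5, sample sizes 10/100/1000, 50 replicates each):

```r
library(furrr)
library(dplyr)
plan(multisession, workers = 4)

# Parameter grid: 50 replicates at each sample size
grid <- expand.grid(n = c(10, 100, 1000), rep = 1:50)

results <- future_pmap_dfr(
  grid,
  \(n, rep) data.frame(n = n, est = mean(rnorm(n, mean = 5))),
  .options = furrr_options(seed = 42)  # same numbers on every run
)

results |>
  group_by(n) |>
  summarise(mean_est = mean(est), se_est = sd(est), .groups = "drop")

plan(sequential)  # release the workers
```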
The simulation confirms what statistics predicts: the standard error of the mean decreases as sample size grows (roughly proportional to 1/sqrt(n)). With n = 1000, estimates cluster tightly around the true mean. The furrr_options(seed = 42) ensures anyone running this code gets the same numbers.
Summary
| Concept | Key takeaway |
|---|---|
| Core idea | furrr = purrr + parallel execution via future |
| Function naming | future_ prefix: map() → future_map(), map2() → future_map2(), etc. |
| Setting up parallelism | plan(multisession, workers = N), works on all platforms |
| Resetting | plan(sequential), releases workers and memory |
| Multiple inputs | future_map2() for 2 inputs, future_pmap() for 3+ |
| Reproducibility | furrr_options(seed = N) for deterministic random results |
| Globals control | furrr_options(globals = ...) to limit data shipped to workers |
| When to use | Each iteration takes >100ms, >10 iterations, moderate data size |
| When NOT to use | Lightweight math, tiny vectors, giant objects per iteration |
| Always do | Benchmark with system.time() before committing to parallel |
References
- Vaughan, D., furrr: Apply Mapping Functions in Parallel using Futures. Official package site.
- Bengtsson, H., future: Unified Parallel and Distributed Processing in R for Everyone.
- Wickham, H., purrr: Functional Programming Tools.
- furrr CRAN Reference Manual.
- Bengtsson, H., "A Future for R: A Comprehensive Overview." The R Journal (2021).
- Wickham, H. & Grolemund, G., R for Data Science, 2nd Edition. O'Reilly (2023). Chapter 26: Iteration.
- Dancho, M., "Tidy Parallel Processing in R with furrr." Business Science (2021).
- furrr GitHub repository: source code, issues, and development version.
Continue Learning
- purrr map() in R: Every Variant Explained (the parent tutorial covering map(), map2(), imap(), and pmap() in depth)
- R Anonymous Functions: The \(x) Syntax (the compact lambda syntax used inside map and future_map calls)
- Functional Programming in R (the broader landscape of FP concepts including closures, higher-order functions, and function factories)