Heatmap in R: Build and Customize with ggplot2 geom_tile()
A heatmap encodes a numeric matrix as a grid of colored tiles — rows on one axis, columns on the other, and fill color encoding the value at each cell. In ggplot2, geom_tile() builds heatmaps with the same grammar as every other chart type.
Introduction
Heatmaps are the right tool when you have a two-dimensional grid of values and you want readers to spot patterns — which cells are high, which are low, and where the extremes cluster. Common applications include correlation matrices (which variables move together?), time-by-category grids (which months had the highest sales in each region?), and gene expression matrices in bioinformatics.
The ggplot2 approach requires your data in long (tidy) format: one row per cell, with columns for the row identifier, the column identifier, and the fill value. If your data starts as a wide matrix (rows = observations, columns = variables), you need to reshape it first — and tidyr::pivot_longer() handles that in one line.
In this tutorial you will learn:
- How geom_tile() builds a heatmap from long-format data
- How to reshape wide data into long format with pivot_longer()
- How to choose sequential vs. diverging color scales
- How to add numeric labels inside each tile
- How to clean up the theme for a polished final chart
How Does geom_tile() Build a Heatmap?
geom_tile() draws one rectangle per row of data, centered at that row's (x, y) position and filled according to the fill aesthetic. If your data has exactly one row for each (x, y) pair, you get a complete grid with no gaps.
Let's start with a direct demonstration using the airquality dataset — monthly averages of ozone, temperature, and wind, restructured as a grid:
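A minimal sketch of this demonstration, assuming the monthly means are computed with aggregate() and reshaped with tidyr::pivot_longer(). The object names air_means and air_long and the column names Variable and Mean are illustrative choices, not fixed by ggplot2:

```r
library(ggplot2)
library(tidyr)

# Monthly means of three airquality variables; Ozone contains NAs,
# so na.rm = TRUE is passed through to mean().
air_means <- aggregate(
  airquality[c("Ozone", "Temp", "Wind")],
  by = list(Month = airquality$Month),
  FUN = mean, na.rm = TRUE
)

# Long format: one row per (Month, Variable) cell.
air_long <- pivot_longer(
  air_means,
  cols = -Month,
  names_to = "Variable",
  values_to = "Mean"
)

p_air <- ggplot(air_long, aes(x = factor(Month), y = Variable, fill = Mean)) +
  geom_tile(color = "white", linewidth = 0.5) +
  labs(x = "Month", y = NULL, fill = "Mean")
p_air
```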
Each tile's color encodes the value at that (Month, Variable) intersection. The default continuous scale runs from dark blue (low) to light blue (high), so higher values appear lighter; we'll improve on it shortly.
color = "white" and linewidth = 0.5 add thin white borders between tiles — making the grid structure visible and preventing adjacent colors from blending visually.
KEY INSIGHT: geom_tile() expects your data in long format: one row per cell. A wide table spreads the cell values across many columns, so there is no single column to map to fill and none holding the column identifier. Always reshape to long format first.
Try it: Remove color = "white" from geom_tile(). How does the heatmap look without tile borders?
How Do You Reshape Wide Data to Long Format?
Most real-world data starts wide — each variable is its own column, each row is an observation. A correlation matrix is a classic example: the row and column names are the same set of variables.
pivot_longer() from tidyr converts wide to long with three key arguments: cols (which columns to pivot), names_to (the new column that will hold the old column names), and values_to (the new column that will hold the values).
as.data.frame(as.table(mat)) converts a matrix to a three-column data frame (Var1, Var2, Freq) automatically, a shortcut that avoids pivot_longer() when the data is already a matrix, such as a correlation matrix. For wide data frames (e.g., a month × region sales grid), use:
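As an illustration, the sketch below reshapes a small month × region grid. The sales_wide data frame and its numbers are hypothetical, made up purely to show the call:

```r
library(tidyr)

# Hypothetical wide data: one row per month, one column per region.
sales_wide <- data.frame(
  month = c("Jan", "Feb", "Mar"),
  North = c(120, 135, 150),
  South = c(90, 110, 95),
  West  = c(60, 75, 80)
)

sales_long <- pivot_longer(
  sales_wide,
  cols      = -month,    # pivot every column except the identifier
  names_to  = "region",  # old column names go here
  values_to = "sales"    # cell values go here
)

sales_long
```

Each of the 3 × 3 cells becomes one row, so sales_long has 9 rows and the columns month, region, and sales.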
TIP: For a correlation matrix specifically, as.data.frame(as.table(cor(df))) is the fastest path to a three-column long format. For any other wide matrix, pivot_longer() is the standard tool.
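A short sketch of the matrix route, using the built-in mtcars data. Renaming the third column from the default Freq to Correlation is an illustrative choice:

```r
cor_mat <- cor(mtcars)

# as.table() keeps the dimnames; as.data.frame() then unrolls the table
# into one row per cell with columns Var1, Var2, Freq.
cor_long <- as.data.frame(as.table(cor_mat))
names(cor_long)[3] <- "Correlation"

head(cor_long, 3)
```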
Try it: Compute the correlation matrix of just the numeric columns in iris (exclude Species). Convert it to long format using as.data.frame(as.table(cor(...))).
How Do You Choose the Right Color Scale for a Heatmap?
Color scale choice is critical for heatmaps. The wrong scale can hide patterns or create false impressions of direction.
Use a sequential scale when your values run in one direction (all positive, or all negative) with no meaningful midpoint:
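For example, daily temperatures in airquality are all positive and have no natural midpoint. A minimal sketch, assuming the viridis "plasma" option as the sequential palette (any sequential palette would serve):

```r
library(ggplot2)

# One tile per (Day, Month); months with 30 days leave day 31 empty.
p_seq <- ggplot(airquality, aes(x = Day, y = factor(Month), fill = Temp)) +
  geom_tile(color = "white", linewidth = 0.3) +
  scale_fill_viridis_c(option = "plasma") +
  labs(x = "Day of month", y = "Month", fill = "Temp (deg F)")
p_seq
```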
Use a diverging scale when your values have a meaningful midpoint — most commonly zero. Correlations range from -1 to +1 with 0 as the neutral midpoint:
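A minimal sketch with the mtcars correlation matrix. The blue and red hex codes are illustrative; any pair of contrasting hues around a white midpoint works:

```r
library(ggplot2)

cor_long <- as.data.frame(as.table(cor(mtcars)))
names(cor_long) <- c("Var1", "Var2", "Correlation")

p_div <- ggplot(cor_long, aes(x = Var1, y = Var2, fill = Correlation)) +
  geom_tile(color = "white", linewidth = 0.5) +
  scale_fill_gradient2(
    low = "#2166ac", mid = "white", high = "#b2182b",
    midpoint = 0, limits = c(-1, 1)
  )
p_div
```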
midpoint = 0 centers the white color exactly at zero. limits = c(-1, 1) fixes the color scale and legend to the full correlation range. Without it, ggplot2 trains the limits on the data's actual range, so the legend stops at the observed minimum and maximum and the same color can stand for different values in different plots.
WARNING: Never use a sequential (single-direction) color scale for correlations or any data with a meaningful zero. A sequential scale from white to blue has no visual break at zero: -0.1 and +0.1 get nearly the same shade, and readers cannot tell from color alone where negative values end and positive ones begin.
Try it: Change the low and high colors in scale_fill_gradient2() to "#1a9850" (green) and "#d73027" (red). Does the correlation matrix still read clearly?
How Do You Add Text Labels Inside Heatmap Tiles?
When your grid is small enough (typically under 10×10 cells), printing the exact value inside each tile lets readers get precise numbers without estimating from the color scale.
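A sketch of a labeled correlation heatmap, again using mtcars. The 0.5 threshold for switching label color and the label size are judgment calls to adjust for your own palette:

```r
library(ggplot2)

cor_long <- as.data.frame(as.table(cor(mtcars)))
names(cor_long) <- c("Var1", "Var2", "Correlation")

p_lab <- ggplot(cor_long, aes(x = Var1, y = Var2, fill = Correlation)) +
  geom_tile(color = "white", linewidth = 0.5) +
  geom_text(
    aes(label = sprintf("%.2f", Correlation),
        color = abs(Correlation) > 0.5),  # logical: TRUE on strong tiles
    size = 2.8
  ) +
  scale_color_manual(values = c("TRUE" = "white", "FALSE" = "grey20"),
                     guide = "none") +
  scale_fill_gradient2(low = "#2166ac", mid = "white", high = "#b2182b",
                       midpoint = 0, limits = c(-1, 1))
p_lab
```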
The color = abs(Correlation) > 0.5 trick switches label color from dark grey (on pale tiles) to white (on strongly colored tiles) — ensuring labels are always readable regardless of tile intensity. scale_color_manual() with guide = "none" maps TRUE/FALSE to "white"/"grey20" without adding a legend.
sprintf("%.2f", Correlation) formats each number to exactly 2 decimal places — consistent with correlation coefficient conventions.
TIP: For large heatmaps (20×20+), text labels become too small to read and clutter the chart. Switch to a clean tile-only heatmap with a well-chosen color scale, relying on interactive tooltips (via plotly::ggplotly()) when readers need exact values.
Try it: Change size = 2.8 to size = 4. Do the labels fit inside the tiles, or do they overflow?
Common Mistakes and How to Fix Them
Mistake 1: Passing wide-format data directly to geom_tile()
❌ ggplot(wide_matrix, aes(x = ?, y = ?, fill = ?)) — a wide matrix doesn't have separate row/column/value columns for ggplot2 to use.
✅ Convert to long format first: pivot_longer() for wide data frames, or as.data.frame(as.table(mat)) for matrices.
Mistake 2: Using a sequential color scale for diverging data
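❌ Using scale_fill_viridis_c() on a correlation matrix. The palette runs continuously from dark purple to yellow with no neutral color at zero, so a weak negative correlation (-0.05) and a weak positive one (0.05) look almost identical, hiding the sign difference.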
✅ Use scale_fill_gradient2(low, mid, high, midpoint = 0) for any data centered at zero. Always set limits to be symmetric: limits = c(-1, 1) for correlations.
Mistake 3: Forgetting limits on the diverging scale
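❌ Without limits = c(-1, 1), ggplot2 sets the scale limits to the data's actual range. scale_fill_gradient2() still keeps white at the midpoint, but the legend covers only the observed range, so colors are not comparable between plots. With scale_fill_gradientn() or scale_fill_distiller(), which have no midpoint argument, the neutral color lands at the center of the data range: if your values run from 0 to 0.8, white appears at 0.4, not 0, making moderate positive correlations look neutral.
✅ Always set limits = c(-max_abs, max_abs) for diverging scales so the scale is symmetric around zero and identical across plots.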
Mistake 4: Grid lines showing through tile borders
❌ theme_minimal() includes a grid by default. The grid lines are drawn behind the tiles, but they remain visible in the padding around the tile grid and wherever a tile is missing, adding lines that compete with the color = "white" tile borders.
✅ Add theme(panel.grid = element_blank()) to remove the grid entirely. The tile borders are enough structure.
Mistake 5: Unordered axes hiding patterns
❌ Default alphabetical variable ordering on both axes makes it hard to see whether correlated variables cluster together.
✅ Reorder axes by hierarchical clustering: hclust(dist(cor_mat)) gives a dendrogram order that groups similar variables together. Apply with scale_x_discrete(limits = ordered_vars).
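A sketch of that reordering on the mtcars correlation matrix. Clustering on dist(cor_mat) follows the text; other distance choices, such as as.dist(1 - cor_mat), are also common. The fill colors are illustrative:

```r
library(ggplot2)

cor_mat <- cor(mtcars)

# Leaf order of the dendrogram, translated to variable names.
ordered_vars <- colnames(cor_mat)[hclust(dist(cor_mat))$order]

cor_long <- as.data.frame(as.table(cor_mat))
names(cor_long) <- c("Var1", "Var2", "Correlation")

p_ord <- ggplot(cor_long, aes(x = Var1, y = Var2, fill = Correlation)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "#2166ac", mid = "white", high = "#b2182b",
                       midpoint = 0, limits = c(-1, 1)) +
  scale_x_discrete(limits = ordered_vars) +  # same order on both axes
  scale_y_discrete(limits = ordered_vars)
p_ord
```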
Practice Exercises
Exercise 1: Monthly airline passenger heatmap
Using the AirPassengers time series, convert to a data frame with month and year columns. Create a heatmap of passenger count by month (y) and year (x) using a sequential viridis palette. Is there a clear seasonal pattern?
Exercise 2: Labeled iris correlation heatmap
Compute the correlation matrix for all four numeric columns of iris. Convert to long format and create a labeled heatmap with:
- Diverging color scale (blue-white-red)
- Correlation values as text inside each tile
- Rotated x-axis labels
- No grid lines
Complete Example
A complete, publication-ready correlation heatmap of mtcars with value labels, clean theme, and title:
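One way to assemble the pieces from the sections above is sketched below. The hex colors, the 0.5 label threshold, the title text, and the use of coord_fixed() for square tiles are illustrative choices:

```r
library(ggplot2)

cor_long <- as.data.frame(as.table(cor(mtcars)))
names(cor_long) <- c("Var1", "Var2", "Correlation")

p_final <- ggplot(cor_long, aes(x = Var1, y = Var2, fill = Correlation)) +
  geom_tile(color = "white", linewidth = 0.5) +
  geom_text(aes(label = sprintf("%.2f", Correlation),
                color = abs(Correlation) > 0.5),
            size = 2.8) +
  scale_color_manual(values = c("TRUE" = "white", "FALSE" = "grey20"),
                     guide = "none") +
  scale_fill_gradient2(low = "#2166ac", mid = "white", high = "#b2182b",
                       midpoint = 0, limits = c(-1, 1)) +
  scale_x_discrete(position = "top") +
  coord_fixed() +  # square tiles
  labs(title = "Correlation matrix of mtcars",
       x = NULL, y = NULL, fill = "r") +
  theme_minimal() +
  theme(panel.grid = element_blank(),
        # the axis sits on top, so style the .top variant of the text
        axis.text.x.top = element_text(angle = 45, hjust = 0))
p_final
```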
scale_x_discrete(position = "top") moves the x-axis labels to the top of the chart, a common convention for correlation matrices that mirrors how a printed matrix carries its column headers above the values.
Summary
| Task | Code |
|---|---|
| Basic heatmap | geom_tile(aes(fill = value)) |
| Tile borders | geom_tile(color = "white", linewidth = 0.5) |
| Sequential color | scale_fill_viridis_c(option = "plasma") |
| Diverging color | scale_fill_gradient2(low, mid, high, midpoint = 0, limits = c(-1,1)) |
| Text labels | geom_text(aes(label = sprintf("%.2f", value))) |
| Rotate x labels | theme(axis.text.x = element_text(angle = 45, hjust = 1)) |
| Remove grid | theme(panel.grid = element_blank()) |
| Labels at top | scale_x_discrete(position = "top") |
| Wide to long | pivot_longer(df, cols = -id_col, names_to = "var", values_to = "val") |
| Matrix to long | as.data.frame(as.table(cor_matrix)) |
Key rules:
- Long format (one row per cell) is required — reshape wide data first
- Sequential scale for all-positive/all-negative data; diverging scale for data centered at zero
- Always set limits = c(-max, max) on diverging scales to keep the midpoint at zero
- Remove the panel grid with theme(panel.grid = element_blank()); tile borders are sufficient structure
FAQ
What is the difference between geom_tile() and geom_raster()?
Both draw rectangular tiles. geom_tile() accepts width and height aesthetics — tiles can be different sizes. geom_raster() assumes all tiles are the same size (faster for large grids). For standard heatmaps with uniform tile size, geom_raster() is slightly faster; for variable-size tiles, use geom_tile().
How do I reorder rows and columns by clustering?
Compute hierarchical clustering: ord <- hclust(dist(cor_matrix))$order. Then reorder the factor levels: factor(var, levels = colnames(cor_matrix)[ord]). Apply to both Var1 and Var2 to group correlated variables together.
My heatmap has missing tiles — why?
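Missing tiles appear when your long-format data lacks some (x, y) combinations entirely. Either your source data is incomplete, or the pivot_longer() call skipped some columns. Rows whose value is NA are still drawn, in grey by default, so is.na(value) will not find absent combinations. Count the rows per (x, y) pair instead, and use tidyr::complete() to add the missing combinations explicitly before plotting.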
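One way to make absent cells explicit is tidyr::complete(), sketched here on a hypothetical three-row sales table. The data and the grey90 fill for empty cells are illustrative:

```r
library(ggplot2)
library(tidyr)

# Hypothetical long data with one missing combination: Feb / South.
sales_long <- data.frame(
  month  = c("Jan", "Jan", "Feb"),
  region = c("North", "South", "North"),
  sales  = c(120, 90, 135)
)

# complete() adds a row for every month x region pair; new rows get NA.
sales_full <- complete(sales_long, month, region)

p_na <- ggplot(sales_full, aes(x = month, y = region, fill = sales)) +
  geom_tile(color = "white", linewidth = 0.5) +
  scale_fill_viridis_c(na.value = "grey90")  # color for the filled-in cell
p_na
```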
How do I add a dendrogram to the heatmap?
geom_tile() alone can't add dendrograms. Use the pheatmap package (pheatmap()) or ComplexHeatmap from Bioconductor for heatmaps with row and column dendrograms. These are dedicated heatmap packages with built-in clustering and annotation.
How do I make a symmetric correlation matrix show only the lower triangle?
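Var1 and Var2 are factors after as.data.frame(as.table()), and factors cannot be compared with >= directly, so compare their integer codes. Filter the long data before plotting: cor_long[as.integer(cor_long$Var1) >= as.integer(cor_long$Var2), ]. This keeps one triangle plus the diagonal; use > instead of >= to drop the diagonal as well. Either way, roughly half of the tiles are removed.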
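A sketch of the filter on the mtcars correlations. Because as.data.frame(as.table()) returns Var1 and Var2 as factors, this version compares their integer codes; the column name Correlation is an illustrative choice:

```r
library(ggplot2)

cor_long <- as.data.frame(as.table(cor(mtcars)))
names(cor_long) <- c("Var1", "Var2", "Correlation")

# Keep cells whose Var1 level index is at or beyond the Var2 level index.
keep <- as.integer(cor_long$Var1) >= as.integer(cor_long$Var2)
cor_tri <- cor_long[keep, ]

nrow(cor_long)  # 121 cells in the full 11 x 11 grid
nrow(cor_tri)   # 66 cells: 55 off-diagonal plus 11 on the diagonal

p_tri <- ggplot(cor_tri, aes(x = Var1, y = Var2, fill = Correlation)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "#2166ac", mid = "white", high = "#b2182b",
                       midpoint = 0, limits = c(-1, 1))
p_tri
```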
What's Next?
- ggplot2 Scatter Plots: the parent tutorial on geom_point() for exploring relationships between two continuous variables.
- R Color Theory: choosing sequential, diverging, and qualitative palettes with ColorBrewer and viridis.
- Bubble Chart in R: add a size dimension to scatter plots, extending two-variable exploration to three.