Pair Plots in R: GGally ggpairs() for Multivariate Exploration
A pair plot displays every pairwise relationship in a dataset on a single grid — scatter plots below the diagonal, correlation coefficients above, and distributions along the diagonal — so you can spot multivariate patterns without writing a separate plot for each combination.
What Does a Pair Plot Show and Why Do You Need One?
When you have four or more numeric variables, checking them two at a time with individual scatter plots gets tedious fast. A pair plot arranges every combination into one matrix: scatter plots, correlations, and density curves in a single function call. Patterns that would take a dozen individual plots to find jump out immediately.
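The call described here can be sketched as follows. This is a minimal example that assumes the built-in iris data and its four numeric columns; the object name pm is only illustrative.

```r
library(ggplot2)
library(GGally)

# Pair plot of the four numeric iris measurements (columns 1 to 4).
# A 4 x 4 grid means 16 panels in total.
pm <- ggpairs(iris, columns = 1:4)
pm
```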
That single line of code produced 16 panels. The diagonal shows how each variable is distributed on its own. The lower triangle gives you scatter plots — the raw shape of each relationship. The upper triangle condenses each relationship into a correlation coefficient so you can compare strengths at a glance.
Here's how to read each region. The diagonal tells you whether a variable is roughly normal, skewed, or bimodal. The lower scatter plots reveal linearity, clusters, and outliers. The upper correlations quantify the direction and strength: values near +1 or -1 mean a strong linear relationship, while values near 0 mean little to no linear pattern.
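To check what the upper triangle reports, you can compute a couple of the coefficients directly with base R. The sketch below assumes the same iris data; the variable names r_petals and r_sepal_petal are illustrative.

```r
# Pearson correlations for two of the iris pairs shown in the upper triangle
r_petals      <- cor(iris$Petal.Length, iris$Petal.Width)
r_sepal_petal <- cor(iris$Sepal.Width, iris$Petal.Length)

round(c(petals = r_petals, sepal_petal = r_sepal_petal), 2)
# petals = 0.96, sepal_petal = -0.43
```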
The first pair (Petal.Length and Petal.Width) has a correlation of 0.96 — nearly a straight line. The second pair (Sepal.Width and Petal.Length) is -0.43, a moderate negative relationship. The pair plot showed both of these instantly, without you having to check each pair individually.
Try it: Create a pair plot of mtcars using columns mpg, disp, hp, and wt. Which pair has the strongest correlation?
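One possible solution, sketched with the column names given in the exercise. The cor() call at the end is an extra cross-check and not part of the plot itself.

```r
library(ggplot2)
library(GGally)

vars <- c("mpg", "disp", "hp", "wt")

# Pair plot restricted to the four requested columns
pm <- ggpairs(mtcars, columns = vars)
pm

# Numeric cross-check of the values printed in the upper triangle
cm <- round(cor(mtcars[, vars]), 3)
cm
```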
Explanation: The disp (engine displacement) and wt (weight) pair shows the highest positive correlation at 0.888 — heavier cars tend to have larger engines. The strongest negative correlation is mpg vs wt at -0.868.
How Do You Customize Which Variables Appear?
Not every column belongs in a pair plot. Identifier columns, date fields, or highly correlated duplicates just add noise. The columns argument lets you pick exactly which variables to display — either by position or by name.
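Both selection styles might look like this. The choice of three iris columns is an assumption made for illustration; any subset works the same way.

```r
library(ggplot2)
library(GGally)

# Select by name: self-documenting and robust to column reordering
keep <- c("Sepal.Length", "Petal.Length", "Petal.Width")
pm_by_name <- ggpairs(iris, columns = keep)

# Select by position: the same three columns, quicker to type
pm_by_pos <- ggpairs(iris, columns = c(1, 3, 4))

pm_by_name
```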
Using column names is clearer than positions, especially when you share code with collaborators. But numeric indices work just as well when you're exploring interactively.
Notice how the 3x3 matrix is easier to read than the 4x4 we started with. Each cell gets more space, and the scatter plots are large enough to spot individual outliers.
Try it: Create a pair plot of airquality with just Ozone, Solar.R, Wind, and Temp. Which pair shows the clearest relationship?
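A possible solution is sketched below. airquality contains missing values, so expect warnings about removed rows when the plot renders; the cross-check here uses complete cases, which is an assumption about how the correlations were computed.

```r
library(ggplot2)
library(GGally)

aq_vars <- c("Ozone", "Solar.R", "Wind", "Temp")

# Rows with NA are dropped panel by panel, with a warning
pm <- ggpairs(airquality, columns = aq_vars)
pm

# Cross-check on rows that are complete for all four variables
aq_cor <- cor(na.omit(airquality[, aq_vars]))
round(aq_cor, 3)
```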
Explanation: Ozone and Temp have the strongest correlation (0.699). Hotter days produce more ozone — a well-known atmospheric chemistry relationship. Wind shows a negative correlation with Ozone because windy days disperse pollutants.
How Do You Add Color by Group to Reveal Hidden Patterns?
The real power of pair plots kicks in when you map a categorical variable to color. Suddenly, what looked like one blob of data separates into distinct clusters — and relationships that seemed weak in the full data might be strong within each group.
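A minimal sketch of the color mapping, assuming the iris data with Species as the grouping variable:

```r
library(ggplot2)
library(GGally)

# Map the categorical column to color; every panel is then drawn per group
pm <- ggpairs(iris, columns = 1:4, mapping = aes(color = Species))
pm
```

The mapped column does not need to be listed in columns. It only has to exist in the data frame.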
Without color, the Sepal.Width density on the diagonal looked bimodal and confusing. With color, you can see that setosa has wider sepals than the other two species — the "two humps" were really two species overlapping.
The lower scatter plots are even more revealing. Petal.Length vs Petal.Width looked like a single strong line before. With color, you see three tight clusters arranged along that line. Each species occupies its own region of the measurement space.
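The same idea applies to mtcars once a grouping column is converted to a factor. The sketch below assumes cyl as the grouping variable and the four columns used earlier; cars is an illustrative copy so that mtcars itself stays unchanged.

```r
library(ggplot2)
library(GGally)

cars <- mtcars
cars$cyl <- factor(cars$cyl)  # numeric codes become discrete groups

pm <- ggpairs(
  cars,
  columns = c("mpg", "disp", "hp", "wt"),
  mapping = aes(color = cyl)
)
pm
```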
Now you can see that 4-cylinder cars cluster in the high-mpg, low-weight corner while 8-cylinder cars fill the opposite corner. The overall negative correlation between mpg and weight is partly driven by this grouping.
Try it: Color the mtcars pair plot by gear (as factor). Which relationship looks most different when grouped vs ungrouped?
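One way to approach this exercise, assuming the same four mtcars columns as before:

```r
library(ggplot2)
library(GGally)

cars <- mtcars
cars$gear <- factor(cars$gear)  # 3, 4 and 5 gears become three groups

pm <- ggpairs(
  cars,
  columns = c("mpg", "disp", "hp", "wt"),
  mapping = aes(color = gear)
)
pm
```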
Explanation: Gear count is a rough proxy for transmission type and car purpose. The grouping reveals that much of the overall mpg-wt correlation comes from the difference between car categories, not just physics.
How Do You Control Upper, Lower, and Diagonal Panels?
The default panels are a great starting point, but you can swap any of them out. The upper, lower, and diag arguments each accept a named list with keys for continuous, combo, and discrete — matching the variable-type combination in that cell.
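As a sketch of how these arguments fit together, the call below swaps the lower panels for loess-smoothed scatter plots and the diagonal for histograms. The particular panel choices are an assumption for illustration.

```r
library(ggplot2)
library(GGally)

pm <- ggpairs(
  iris,
  columns = 1:4,
  upper = list(continuous = "cor"),           # correlation coefficients
  lower = list(continuous = "smooth_loess"),  # scatter plot + loess curve
  diag  = list(continuous = "barDiag")        # histograms
)
pm
```

Keys you leave out of a list keep their defaults, so each list only needs the entries you want to change.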
The loess smooth lines in the lower panels make non-linear trends easier to spot. And histograms on the diagonal give you a concrete sense of bin counts rather than the abstract density curve.
Here are the most useful panel options for continuous variables:
| Panel position | Option | What it shows |
|---|---|---|
| Upper/Lower | "points" | Raw scatter plot |
| Upper/Lower | "smooth" | Scatter + fitted line (linear fit by default; change it with wrap()) |
| Upper/Lower | "smooth_loess" | Scatter + loess curve |
| Upper/Lower | "cor" | Correlation coefficient |
| Upper/Lower | "density" | 2D density contours |
| Upper/Lower | "blank" | Empty (hides the panel) |
| Diagonal | "densityDiag" | Density curve (default) |
| Diagonal | "barDiag" | Histogram |
| Diagonal | "blankDiag" | Empty |
To pass extra arguments to a panel function, use wrap(). This is how you control things like smoothing method, point transparency, or color.
The wrap() function is your gateway to fine-grained control. The first argument is the panel function name as a string, and everything after that gets passed through to that function at render time.
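A hedged sketch of wrap() in use: the lower panels get a linear fit with small, semi-transparent points, and the upper triangle is blanked out. The specific alpha and size values are illustrative.

```r
library(ggplot2)
library(GGally)

pm <- ggpairs(
  iris,
  columns = 1:4,
  # method is read by the smooth panel; alpha and size style its points
  lower = list(continuous = wrap("smooth", method = "lm", alpha = 0.4, size = 1)),
  upper = list(continuous = "blank")
)
pm
```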
upper = list(continuous = "blank") removes the upper triangle entirely. This speeds up rendering and reduces visual clutter when you only want scatter plots.
Try it: Create a pair plot where the lower triangle shows 2D density contours ("density") and the diagonal shows histograms ("barDiag"). Use iris columns 1:4.
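A possible solution, using the two option names given in the exercise:

```r
library(ggplot2)
library(GGally)

pm <- ggpairs(
  iris,
  columns = 1:4,
  lower = list(continuous = "density"),  # 2D density contours
  diag  = list(continuous = "barDiag")   # histograms
)
pm
```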
Explanation: Density contours work like topographic maps — each ring encloses a region of equal data density. They're especially useful when you have overlapping points that scatter plots can't resolve.
How Do You Handle Mixed Variable Types (Numeric + Categorical)?
Real datasets almost always have a mix of numeric and categorical columns. When ggpairs encounters this mix, it automatically chooses "combo" plots — visualizations designed for one numeric and one categorical variable.
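The simplest way to see combo panels is to pass a data frame that contains both kinds of columns. The sketch below assumes iris, where Species is the only factor.

```r
library(ggplot2)
library(GGally)

# All five columns: four numeric measurements plus the Species factor
pm <- ggpairs(iris)
pm
```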
The combo panels are the interesting ones. When a numeric variable meets a categorical one, ggpairs shows boxplots (upper) or faceted histograms (lower) by default. These immediately tell you whether groups differ — for instance, you can see that setosa petals are dramatically shorter than versicolor or virginica petals.
You can customize these combo panels just like you customize continuous panels.
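For instance, the combo key can be set independently for the upper and lower triangles. The choice of dot plots and faceted densities below is an assumption made for illustration.

```r
library(ggplot2)
library(GGally)

pm <- ggpairs(
  iris,
  upper = list(combo = "dot"),          # dot plots, one facet per group
  lower = list(combo = "facetdensity")  # density curves, one facet per group
)
pm
```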
Here are the combo panel options:
| Option | What it shows |
|---|---|
| "box" | Faceted boxplots |
| "box_no_facet" | Overlapping boxplots (default upper) |
| "dot" | Faceted dot plots |
| "dot_no_facet" | Overlapping dot plots |
| "facethist" | Faceted histograms |
| "facetdensity" | Faceted density plots |
| "denstrip" | Density strip plots |
| "blank" | Empty |
If a numeric column has only a few distinct values (for example, mtcars$cyl with only 3), consider converting it to a factor with factor() so ggpairs treats it as categorical and uses combo panels instead of continuous ones.
Try it: Convert mtcars$cyl to a factor, then create a ggpairs plot with mpg, hp, wt, and cyl. What combo plots appear for the cyl column?
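One way to set this up, working on a copy (here called cars, an illustrative name) so the original mtcars is left untouched:

```r
library(ggplot2)
library(GGally)

cars <- mtcars[, c("mpg", "hp", "wt", "cyl")]
cars$cyl <- factor(cars$cyl)  # now treated as categorical

# cyl meets each numeric column in a combo panel
pm <- ggpairs(cars)
pm
```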
Explanation: Converting cyl to a factor triggers combo panels wherever cyl meets a numeric variable. The boxplots clearly show that 8-cylinder cars are heavier and less fuel-efficient, while 4-cylinder cars are the lightest and most economical.
How Do You Style and Theme Your Pair Plot?
A pair plot built for exploration might look fine in RStudio, but presentations and reports need a cleaner look. Since ggpairs returns a ggplot-compatible object, you can add themes, adjust fonts, and modify text sizes just like any other ggplot.
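A minimal theming sketch, assuming theme_minimal() and a 7pt axis text size as example choices:

```r
library(ggplot2)
library(GGally)

pm <- ggpairs(iris, columns = 1:4) +
  theme_minimal() +
  theme(axis.text = element_text(size = 7))  # smaller tick labels
pm
```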
The axis.text size is the most common adjustment. With 4+ variables, the default tick labels often overlap. Dropping them to 7-8pt keeps everything readable without sacrificing information.
You can also add a title using standard ggplot2 syntax.
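The sketch below adds a title with labs() and combines it with theme_bw(), a grey strip background, and smaller correlation text. The title string and the size value are illustrative assumptions; ggpairs() also accepts a title argument if you prefer to set it in the call.

```r
library(ggplot2)
library(GGally)

pm <- ggpairs(
  iris,
  columns = 1:4,
  upper = list(continuous = wrap("cor", size = 3)),  # smaller correlation text
  lower = list(continuous = "smooth")
) +
  labs(title = "Iris measurements") +
  theme_bw() +
  theme(strip.background = element_rect(fill = "grey90"))
pm
```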
The combination of theme_bw() and a grey strip background produces a publication-ready look. The thinner regression lines and smaller correlation text keep the visual weight balanced.
For presentations, set strip.text to at least size 10 for readability from the back of the room.
Try it: Apply theme_classic() to a pair plot of iris (columns 1:4) and set the strip text to size 10 with bold face.
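A possible solution:

```r
library(ggplot2)
library(GGally)

pm <- ggpairs(iris, columns = 1:4) +
  theme_classic() +
  theme(strip.text = element_text(size = 10, face = "bold"))
pm
```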
Explanation: theme_classic() removes background grid lines entirely, giving a clean look suited to publications and posters. The bold strip text ensures variable names stay readable even in the tight matrix layout.
Practice Exercises
Exercise 1: Diamonds Pair Plot with Color Grouping
Sample 200 rows from diamonds (use set.seed(42)), create a pair plot of price, carat, depth, and table colored by cut. Which variable is most strongly associated with price?
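One way to solve this is sketched below. The exact rows drawn depend on the R version's sampling algorithm, so individual numbers may differ slightly from the explanation; diamonds_sample is an illustrative name.

```r
library(ggplot2)  # provides the diamonds data
library(GGally)

set.seed(42)
diamonds_sample <- diamonds[sample(nrow(diamonds), 200), ]

pm <- ggpairs(
  diamonds_sample,
  columns = c("price", "carat", "depth", "table"),
  mapping = aes(color = cut)
)
pm

# Overall price-carat correlation in the sample
r_price_carat <- cor(diamonds_sample$price, diamonds_sample$carat)
r_price_carat
```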
Explanation: Carat is by far the strongest predictor of price (r ≈ 0.92). Depth and table have almost no linear relationship with price. The color grouping shows that Ideal and Premium cuts span the full price range — cut quality alone doesn't determine price.
Exercise 2: Customized mtcars Pair Plot
Build a pair plot of mtcars with mpg, disp, hp, wt colored by gear (as factor). Customize: lower = smooth with lm method, upper = correlation, diagonal = density. Add theme_minimal(). Report the strongest and weakest correlations.
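A sketch of one solution. The correlation matrix at the end is computed on the ungrouped data to identify the strongest and weakest pairs.

```r
library(ggplot2)
library(GGally)

cars <- mtcars
cars$gear <- factor(cars$gear)
vars <- c("mpg", "disp", "hp", "wt")

pm <- ggpairs(
  cars,
  columns = vars,
  mapping = aes(color = gear),
  lower = list(continuous = wrap("smooth", method = "lm")),
  upper = list(continuous = "cor"),
  diag  = list(continuous = "densityDiag")
) +
  theme_minimal()
pm

# Ungrouped correlations: disp-wt is strongest, hp-wt is weakest
cm <- round(cor(mtcars[, vars]), 3)
cm
```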
Explanation: With wrap("smooth", method = "lm"), the lower panels show linear regression lines instead of loess curves. The gear coloring reveals that 3-gear cars (mostly heavy automatics) drive the strong disp-wt correlation, while 4-gear and 5-gear cars show more variation.
Exercise 3: Airquality Deep Dive with wrap()
Create a pair plot of airquality (complete cases only) with Ozone, Solar.R, Wind, and Temp. Use wrap() to make the lower scatter points semi-transparent (alpha = 0.4) and sized small (size = 1.5). Set the upper panel to show correlations. Which pair has the strongest relationship, and does Wind affect it?
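One possible solution, with complete cases selected before plotting. The name aq is illustrative.

```r
library(ggplot2)
library(GGally)

# Keep only rows with no missing values in the four variables
aq <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])

pm <- ggpairs(
  aq,
  lower = list(continuous = wrap("points", alpha = 0.4, size = 1.5)),
  upper = list(continuous = "cor")
)
pm

r_ozone_temp <- cor(aq$Ozone, aq$Temp)
r_ozone_temp
```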
Explanation: Ozone and Temperature have the strongest relationship (r ≈ 0.70). Wind has a moderating effect — high-wind days tend to be cooler and have lower ozone. The semi-transparent points make it easy to see where data concentrates, especially in the Ozone-Temp panel where most points cluster at lower ozone levels.
Complete Example
Let's bring everything together with a real-world analysis. We'll use the msleep dataset from ggplot2 — mammalian sleep data — to explore how body size, brain size, and sleep patterns relate across different dietary groups.
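The analysis might be sketched as follows. The column selection, the log10 transformation, the label text, and the theme are assumptions chosen to match the description; the helper columns log_bodywt and log_brainwt are illustrative names.

```r
library(ggplot2)  # provides the msleep data
library(GGally)

# Keep the columns of interest and drop rows with missing values
sleep <- na.omit(msleep[, c("sleep_total", "bodywt", "brainwt", "vore")])

# Log-transform the heavily right-skewed weight columns
sleep$log_bodywt  <- log10(sleep$bodywt)
sleep$log_brainwt <- log10(sleep$brainwt)
sleep$vore <- factor(sleep$vore)

pm <- ggpairs(
  sleep,
  columns = c("sleep_total", "log_bodywt", "log_brainwt"),
  mapping = aes(color = vore),
  columnLabels = c("Sleep (hours)", "log10 body weight", "log10 brain weight")
) +
  theme_bw()
pm
```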
This single plot reveals the core story: larger animals sleep less, and diet is the hidden grouping variable. Herbivores (green) are the biggest and sleep the least — they need to spend more time eating low-calorie food. Insectivores (blue) are small and sleep the most. The columnLabels argument gave us clean axis labels instead of variable names, and the log transformation spread out the body/brain weight values that would otherwise be compressed by a few outliers (elephants).
Summary
| Task | Code | When to use |
|---|---|---|
| Basic pair plot | ggpairs(df, columns = 1:4) | First look at multivariate data |
| Color by group | aes(color = group_var) | Suspect hidden subgroups |
| Select columns | columns = c("a", "b", "c") | Focus on key variables (4-7 max) |
| Custom panels | upper = list(continuous = "cor") | Replace default visualizations |
| Pass parameters | wrap("smooth", method = "lm") | Fine-tune panel functions |
| Mixed types | Include factor columns in data | Numeric + categorical together |
| Clean theme | + theme_bw() | Reports and presentations |
| Custom labels | columnLabels = c("Label1", ...) | Replace variable names with readable text |
Continue Learning
- Bivariate EDA in R — The parent guide covering scatter plots, grouped boxplots, mosaic plots, and correlation tests for two-variable analysis.
- Correlation Matrix Plot in R — When you need just the numeric correlations as a heatmap, without the scatter plots and density curves.
- Exploratory Data Analysis in R — The 7-step EDA framework that shows where pair plots fit in a complete analysis workflow.