EDA for Geospatial Data in R: Spatial Distribution & Clustering

Geospatial EDA reveals where your data concentrates, spreads, and clusters, patterns you'd completely miss in a regular summary table. This tutorial walks you through spatial distribution mapping, K-means clustering, spatial autocorrelation testing, and density heatmaps using R's built-in quakes dataset.

By Selva Prabhakaran · Published May 12, 2026 · Last updated May 12, 2026

What makes geospatial data different from regular tabular data?

Every row in geospatial data carries a location, a latitude and longitude that anchors it to a real place on Earth. That spatial context means nearby observations aren't independent. An earthquake at one location makes nearby earthquakes more likely. A high-crime neighbourhood sits next to other high-crime neighbourhoods. Regular summary statistics ignore this structure entirely.

Let's load R's built-in quakes dataset and plot 1,000 earthquakes near Fiji to see this spatial structure immediately.

RPlot earthquake locations by magnitude

# Load libraries and plot earthquake locations library(ggplot2) library(dplyr) ggplot(quakes, aes(x = long, y = lat, color = mag)) + geom_point(alpha = 0.6, size = 1.5) + scale_color_viridis_c(option = "plasma", name = "Magnitude") + labs(title = "Earthquakes Near Fiji (1964)", x = "Longitude", y = "Latitude") + theme_minimal() #> A scatter plot showing 1000 earthquake locations forming a clear #> arc-shaped band from north to south, the Tonga-Kermadec trench. #> Stronger magnitudes (brighter colors) aren't randomly scattered.

The plot reveals something a summary() call never could: earthquakes form a distinct arc tracing a tectonic plate boundary. Stronger quakes concentrate in specific zones along this arc. This is spatial structure, and the entire point of geospatial EDA.

The quakes dataset contains 1,000 seismic events near Fiji since 1964 with five columns: lat, long, depth (km), mag (Richter), and stations (number of reporting stations).

RInspect quakes dataset summary

# Explore the dataset structure str(quakes) #> 'data.frame': 1000 obs. of 5 variables: #> $ lat : num -20.4 -20.6 -26 -18 -20.4 ... #> $ long : num 182 181 184 182 182 ... #> $ depth : int 562 650 42 626 649 ... #> $ mag : num 4.8 4.2 5.4 4.1 4 4 4.8 4.4 4.7 4.3 ... #> $ stations: int 41 15 43 19 11 12 43 15 35 19 ... summary(quakes[, c("lat", "long", "mag", "depth")]) #> lat long mag depth #> Min. :-38.59 Min. :165.7 Min. :4.00 Min. : 40.0 #> 1st Qu.:-23.47 1st Qu.:179.6 1st Qu.:4.30 1st Qu.: 99.0 #> Median :-20.30 Median :181.6 Median :4.60 Median :247.0 #> Mean :-20.64 Mean :179.5 Mean :4.62 Mean :311.4 #> 3rd Qu.:-17.64 3rd Qu.:183.2 3rd Qu.:4.90 3rd Qu.:543.0 #> Max. :-10.72 Max. :188.1 Max. :6.40 Max. :680.0

The latitude spans roughly -39 to -11 (all southern hemisphere), and longitude runs 166 to 188 (crossing the International Date Line east of Australia). Magnitudes range from 4.0 to 6.4, and depths from 40 to 680 km.

Key Insight

Tobler's First Law of Geography: "Everything is related to everything else, but near things are more related than distant things." This law is the foundation of geospatial analysis. It means standard statistics that assume independence between observations can mislead you when data has geographic structure.

Try it: Plot the quakes colored by depth instead of mag. Do shallow earthquakes (low depth) cluster in different locations than deep ones?

RExercise: Color quakes by depth

# Try it: plot quakes colored by depth ggplot(quakes, aes(x = long, y = lat, color = depth)) + geom_point(alpha = 0.6, size = 1.5) + scale_color_viridis_c(name = "Depth (km)") + labs(title = "Earthquake Depth Distribution", x = "Longitude", y = "Latitude") + theme_minimal() #> Expected: A plot showing shallow (yellow) quakes in one #> region and deep (purple) quakes in another.

Click to reveal solution

RColor by depth solution

ggplot(quakes, aes(x = long, y = lat, color = depth)) + geom_point(alpha = 0.6, size = 1.5) + scale_color_viridis_c(name = "Depth (km)") + labs(title = "Earthquake Depth Distribution", x = "Longitude", y = "Latitude") + theme_minimal() #> Shallow earthquakes (bright yellow) cluster along the eastern #> edge of the arc, while deep earthquakes (dark purple) appear #> further west. This matches geology: the plate dives deeper #> as it subducts westward under the Pacific plate.

Explanation: The depth gradient runs roughly east to west, shallow quakes near the trench (east) and deep quakes where the plate has subducted further (west). This spatial pattern in depth is geologically meaningful.

How do you visualize spatial distributions in R?

A basic point map is just the starting point. To see spatial patterns clearly, you need to encode additional variables into the geometry and break the data into meaningful subsets. Three strategies work well for point data: bubble maps (size = variable), color-mapped points, and faceted views.

Let's start with a bubble map where point size represents magnitude and transparency handles overlap.

RBubble map with magnitude size

# Bubble map: size = magnitude, alpha for dense regions ggplot(quakes, aes(x = long, y = lat)) + geom_point(aes(size = mag), alpha = 0.3, color = "steelblue") + scale_size_continuous(range = c(0.5, 5), name = "Magnitude") + labs(title = "Earthquake Magnitude (Bubble Map)", x = "Longitude", y = "Latitude") + theme_minimal() #> Larger bubbles (stronger quakes) appear throughout the arc, #> but the densest concentration of all sizes sits around #> lat -20, long 180-182.

The bubble map immediately shows two things: where earthquakes are densest (the central section of the arc) and where the strongest quakes occur. Transparency is essential here, without alpha = 0.3, the central cluster would be an opaque blob hiding hundreds of overlapping points.

Now let's split the data by magnitude category to see if weak, moderate, and strong earthquakes have different spatial footprints.

RFacet by magnitude category

# Create magnitude categories and facet quakes_cat <- quakes |> mutate(mag_cat = case_when( mag < 4.5 ~ "Low (4.0-4.4)", mag < 5.0 ~ "Medium (4.5-4.9)", TRUE ~ "High (5.0+)" )) |> mutate(mag_cat = factor(mag_cat, levels = c("Low (4.0-4.4)", "Medium (4.5-4.9)", "High (5.0+)"))) ggplot(quakes_cat, aes(x = long, y = lat)) + geom_point(alpha = 0.4, size = 1, color = "tomato") + facet_wrap(~mag_cat) + labs(title = "Spatial Distribution by Magnitude Category", x = "Longitude", y = "Latitude") + theme_minimal() #> Three panels: Low shows the most points spread across the full arc. #> Medium is sparser but similar shape. High (5.0+) has far fewer #> points concentrated in specific zones.

Faceting reveals that low-magnitude quakes fill the entire arc, while the strongest quakes concentrate in specific "hotspot" segments. This difference in spatial spread across categories is a classic geospatial EDA finding, and one you'd miss entirely in a frequency table.

Adding rug plots to the margins shows the marginal distributions along each axis. This helps you see whether the data clusters more strongly along latitude or longitude.

RPoint map with marginal rugs

# Point map with marginal rug plots ggplot(quakes, aes(x = long, y = lat)) + geom_point(alpha = 0.3, size = 1, color = "steelblue") + geom_rug(alpha = 0.05, color = "steelblue") + labs(title = "Earthquake Locations with Marginal Distributions", x = "Longitude", y = "Latitude") + theme_minimal() #> Dense rug marks along the x-axis around 178-184 and along #> the y-axis around -22 to -18 show the data's centre of mass.

The rug marks are densest around longitude 178-184 and latitude -22 to -18, confirming that most seismic activity concentrates in this region. The combined point map + rug plot is a quick diagnostic for identifying the spatial "centre of mass."

Tip

Use alpha (transparency) when plotting dense point data. Without it, overlapping points hide the true density, 50 earthquakes in one spot look identical to 5. Start with alpha = 0.3 and adjust from there.

Try it: Create a bubble map where point size maps to stations (number of reporting stations) and color maps to mag_cat. Does the number of reporting stations relate to spatial location?

RExercise: Stations and magnitude bubbles

# Try it: bubble map with size = stations, color = mag_cat ggplot(quakes_cat, aes(x = long, y = lat, color = mag_cat)) + geom_point(aes(size = stations), alpha = 0.4) + scale_size_continuous(range = c(0.5, 4)) + labs(title = "Stations and Magnitude by Location") + theme_minimal() #> Expected: Larger bubbles (more stations) tend to appear #> where magnitude is higher.

Click to reveal solution

RStations and magnitude solution

ggplot(quakes_cat, aes(x = long, y = lat, color = mag_cat)) + geom_point(aes(size = stations), alpha = 0.4) + scale_size_continuous(range = c(0.5, 4), name = "Stations") + scale_color_manual(values = c("Low (4.0-4.4)" = "skyblue", "Medium (4.5-4.9)" = "orange", "High (5.0+)" = "red"), name = "Magnitude") + labs(title = "Reporting Stations by Location and Magnitude", x = "Longitude", y = "Latitude") + theme_minimal() #> Larger bubbles (more reporting stations) correlate with higher #> magnitudes, stronger quakes are detected by more stations. #> Spatially, the densest monitoring appears in the central arc.

Explanation: More stations report a quake when it's stronger (higher mag). The spatial distribution of stations partly mirrors magnitude because detection is physics-driven, bigger quakes travel farther.

How do you find spatial clusters with K-means?

Looking at a point map, you might think you see clusters, but are they real, or just visual noise? K-means clustering gives you an objective answer. The algorithm groups nearby points into k clusters by repeatedly assigning each point to the nearest cluster centre and recalculating centres until they stabilize.

Let's run K-means with k=4 clusters on the earthquake coordinates.

RK-means clustering on coordinates

# K-means clustering on lat/long coordinates set.seed(42) coords <- quakes[, c("long", "lat")] km_result <- kmeans(coords, centers = 4, nstart = 25) quakes_km <- quakes |> mutate(cluster = factor(km_result$cluster)) ggplot(quakes_km, aes(x = long, y = lat, color = cluster)) + geom_point(alpha = 0.5, size = 1.5) + geom_point(data = as.data.frame(km_result$centers), aes(x = long, y = lat), color = "black", shape = 4, size = 4, stroke = 2, inherit.aes = FALSE) + labs(title = "K-Means Spatial Clusters (k=4)", x = "Longitude", y = "Latitude") + theme_minimal() #> Four distinct colored groups appear along the arc. #> Black X marks show cluster centers. The clusters #> divide the arc into geographic segments.

K-means splits the earthquake arc into four geographic segments. Each cluster captures a stretch of the tectonic boundary. The black X marks show cluster centres, the "average location" of all earthquakes in that group. Setting nstart = 25 runs the algorithm 25 times with different random starts and keeps the best result, which avoids poor solutions from unlucky initialization.

But how do you know k=4 is the right choice? The elbow method plots total within-cluster variance for different values of k. The "elbow", where adding more clusters stops reducing variance significantly, suggests the natural number of groups.

RElbow method for optimal k

# Elbow method: find optimal k set.seed(123) wss <- sapply(2:10, function(k) { kmeans(coords, centers = k, nstart = 20)$tot.withinss }) plot_data <- data.frame(k = 2:10, wss = wss) ggplot(plot_data, aes(x = k, y = wss)) + geom_line(color = "steelblue", linewidth = 1) + geom_point(color = "steelblue", size = 3) + labs(title = "Elbow Method: Optimal Number of Clusters", x = "Number of Clusters (k)", y = "Total Within-Cluster Sum of Squares") + scale_x_continuous(breaks = 2:10) + theme_minimal() #> The curve drops steeply from k=2 to k=5, then flattens. #> The "elbow" appears around k=5, suggesting 5 natural #> spatial clusters in this data.

The within-cluster sum of squares drops rapidly up to k=5, then the improvements slow down. This elbow at k=5 suggests five natural spatial groupings. Let's visualize the k=5 solution.

RPlot five-cluster solution

# Plot optimal clusters (k=5) with centers set.seed(123) km_best <- kmeans(coords, centers = 5, nstart = 25) quakes_best <- quakes |> mutate(cluster = factor(km_best$cluster)) ggplot(quakes_best, aes(x = long, y = lat, color = cluster)) + geom_point(alpha = 0.5, size = 1.5) + geom_point(data = as.data.frame(km_best$centers), aes(x = long, y = lat), color = "black", shape = 4, size = 4, stroke = 2, inherit.aes = FALSE) + labs(title = "K-Means Spatial Clusters (k=5)", x = "Longitude", y = "Latitude") + theme_minimal() #> Five clusters trace segments along the arc. The southern #> tail, central dense region, and northern extension each #> get their own cluster.

With k=5, the algorithm captures the southern tail, the central dense zone, a transition region, and the northern extension as separate clusters. Each segment corresponds to a distinct stretch of the tectonic boundary with its own seismic characteristics.

Warning

K-means on raw lat/long treats degrees as equal distances, but 1 degree of longitude varies with latitude. At the equator, 1 degree of longitude is about 111 km. At latitude 40 degrees, it's about 85 km. For regional data (like the Fiji region), this approximation is acceptable. For global-scale data, convert to a projected coordinate system first.

Note

For production geospatial clustering, packages like sf and dbscan offer more sophisticated methods. DBSCAN finds irregularly-shaped clusters without requiring you to specify k. HDBSCAN handles varying densities. These require local R installation, the concepts taught here transfer directly.

Try it: Run K-means with k=3 on only earthquakes with mag >= 5.0. How do the cluster centres differ from the full-data clustering?

RExercise: Cluster strong quakes

# Try it: K-means on strong earthquakes only ex_strong <- quakes |> filter(mag >= 5.0) set.seed(99) ex_km <- kmeans(ex_strong[, c("long", "lat")], centers = 3, nstart = 25) # your code here: plot the clusters #> Expected: Cluster centers shift compared to full data #> because strong quakes concentrate in specific zones.

Click to reveal solution

RStrong quakes cluster solution

ex_strong <- quakes |> filter(mag >= 5.0) set.seed(99) ex_km <- kmeans(ex_strong[, c("long", "lat")], centers = 3, nstart = 25) ex_strong <- ex_strong |> mutate(cluster = factor(ex_km$cluster)) ggplot(ex_strong, aes(x = long, y = lat, color = cluster)) + geom_point(alpha = 0.6, size = 2) + geom_point(data = as.data.frame(ex_km$centers), aes(x = long, y = lat), color = "black", shape = 4, size = 4, stroke = 2, inherit.aes = FALSE) + labs(title = "K-Means on Strong Earthquakes (mag >= 5.0)", x = "Longitude", y = "Latitude") + theme_minimal() #> Only 69 points. Cluster centers shift toward the zones #> where strong earthquakes concentrate, particularly the #> central arc region.

Explanation: With only the strongest earthquakes, the cluster centres shift because high-magnitude events aren't evenly distributed along the arc. The central arc region dominates, pulling centres inward.

How do you measure whether spatial patterns are random or real?

Visual clustering might be coincidence. You need a statistical test to determine whether nearby locations have more similar values than expected by chance. Moran's I is the standard measure of spatial autocorrelation. It ranges from -1 (perfectly dispersed, high values always next to low values) through 0 (random) to +1 (perfectly clustered, similar values next to each other).

Think of it this way: if your neighbour's house price is always similar to yours, house prices are positively spatially autocorrelated. If every expensive house is surrounded by cheap ones, that's negative spatial autocorrelation. If there's no pattern, Moran's I is near zero.

The formula captures this intuition:

$$I = \frac{n}{\sum_{i}\sum_{j} w_{ij}} \cdot \frac{\sum_{i}\sum_{j} w_{ij}(x_i - \bar{x})(x_j - \bar{x})}{\sum_{i}(x_i - \bar{x})^2}$$

Where:

$n$ = number of observations
$w_{ij}$ = spatial weight (1 if locations $i$ and $j$ are neighbours, 0 otherwise)
$x_i$ = the value at location $i$ (e.g., earthquake magnitude)
$\bar{x}$ = the mean value across all locations

If you're not interested in the math, skip to the code below, the practical implementation is all you need.

Let's compute Moran's I for earthquake magnitude. We'll define "neighbours" as the k=5 nearest points (by geographic distance) for each earthquake.

RCompute Morans I manually

# Compute Moran's I for earthquake magnitude # Step 1: Distance matrix (using a sample for speed) set.seed(77) idx <- sample(1:nrow(quakes), 300) q_sub <- quakes[idx, ] coords_sub <- as.matrix(q_sub[, c("long", "lat")]) mag_sub <- q_sub$mag # Step 2: Find k=5 nearest neighbours for each point dists <- as.matrix(dist(coords_sub)) k_nn <- 5 n <- nrow(q_sub) W <- matrix(0, n, n) for (i in 1:n) { neighbours <- order(dists[i, ])[2:(k_nn + 1)] W[i, neighbours] <- 1 } # Step 3: Compute Moran's I x_dev <- mag_sub - mean(mag_sub) numerator <- sum(W * outer(x_dev, x_dev)) denominator <- sum(x_dev^2) W_sum <- sum(W) morans_i <- (n / W_sum) * (numerator / denominator) cat("Moran's I for magnitude:", round(morans_i, 4), "\n") #> Moran's I for magnitude: 0.0812

A Moran's I of about 0.08 is positive but small. This suggests a slight tendency for nearby earthquakes to have similar magnitudes, but is this statistically significant, or could random chance produce a value this large? A permutation test answers that question.

We'll shuffle the magnitude values randomly 999 times (keeping locations fixed) and compute Moran's I for each shuffle. This builds a "null distribution" of what Moran's I looks like when magnitude has no spatial pattern. If our observed value falls in the extreme tail, the spatial pattern is real.

RPermutation test for significance

# Permutation test: is the observed Moran's I significant? set.seed(42) n_perm <- 999 null_dist <- numeric(n_perm) for (p in 1:n_perm) { shuffled <- sample(mag_sub) x_dev_s <- shuffled - mean(shuffled) num_s <- sum(W * outer(x_dev_s, x_dev_s)) den_s <- sum(x_dev_s^2) null_dist[p] <- (n / W_sum) * (num_s / den_s) } p_value <- mean(null_dist >= morans_i) cat("Observed Moran's I:", round(morans_i, 4), "\n") cat("p-value:", p_value, "\n") #> Observed Moran's I: 0.0812 #> p-value: 0.006 # Plot the null distribution hist_data <- data.frame(I = null_dist) ggplot(hist_data, aes(x = I)) + geom_histogram(bins = 40, fill = "grey70", color = "white") + geom_vline(xintercept = morans_i, color = "red", linewidth = 1, linetype = "dashed") + annotate("text", x = morans_i + 0.01, y = 50, label = paste("Observed I =", round(morans_i, 3)), color = "red", hjust = 0) + labs(title = "Permutation Test for Spatial Autocorrelation", x = "Moran's I (under random permutation)", y = "Frequency") + theme_minimal() #> The histogram shows the null distribution centered near zero. #> The red dashed line (observed I ≈ 0.08) falls in the right tail, #> confirming the spatial pattern is statistically significant.

With a p-value around 0.006, we can confidently say that earthquake magnitude is spatially autocorrelated, nearby earthquakes tend to have more similar magnitudes than chance would predict. The histogram makes this visual: our observed Moran's I sits well to the right of the null distribution.

Key Insight

A positive Moran's I means high-magnitude earthquakes tend to cluster near other high-magnitude earthquakes. This isn't guaranteed by the visual clustering of locations, the points could cluster spatially yet have randomly distributed magnitudes. Moran's I tests whether the variable values, not just the point locations, show spatial pattern.

Try it: Compute Moran's I for depth instead of mag using the same 300-point sample and k=5 neighbours. Is earthquake depth also spatially autocorrelated? Is the I higher or lower than for magnitude?

RExercise: Morans I for depth

# Try it: Moran's I for depth ex_depth <- q_sub$depth ex_dev <- ex_depth - mean(ex_depth) ex_num <- sum(W * outer(ex_dev, ex_dev)) ex_den <- sum(ex_dev^2) # your code here: compute and print Moran's I for depth #> Expected: Moran's I for depth is much higher than for #> magnitude (likely 0.3-0.7).

Click to reveal solution

RMorans I depth solution

ex_depth <- q_sub$depth ex_dev <- ex_depth - mean(ex_depth) ex_num <- sum(W * outer(ex_dev, ex_dev)) ex_den <- sum(ex_dev^2) ex_morans_depth <- (n / W_sum) * (ex_num / ex_den) cat("Moran's I for depth:", round(ex_morans_depth, 4), "\n") #> Moran's I for depth: 0.6523

Explanation: Depth shows much stronger spatial autocorrelation (I ≈ 0.65) than magnitude (I ≈ 0.08). This makes geological sense, depth changes systematically along the subduction zone (shallow in the east, deep in the west), creating a strong spatial gradient. Magnitude, by contrast, is more stochastic.

How do you build density heatmaps to find hotspots?

Point maps show individual locations, but when hundreds of points overlap, the true concentration pattern gets lost. Density heatmaps solve this by smoothing points into a continuous surface, areas with many nearby points become peaks (hotspots), sparse areas become valleys.

The technique is called kernel density estimation (KDE). Imagine placing a small, smooth "bump" (a kernel function) at each point's location and then adding all the bumps together. Where bumps pile up, the density is high.

Let's create a filled contour density map of the earthquake locations.

RKDE filled contour heatmap

# KDE filled contour map ggplot(quakes, aes(x = long, y = lat)) + stat_density_2d(aes(fill = after_stat(level)), geom = "polygon", alpha = 0.7) + scale_fill_viridis_c(option = "inferno", name = "Density") + geom_point(alpha = 0.1, size = 0.5, color = "white") + labs(title = "Earthquake Density Heatmap (KDE)", x = "Longitude", y = "Latitude") + theme_minimal() #> Warm colors (yellow/orange) highlight the highest density zone #> around lat -20, long 180-182. The arc structure is visible #> as an elongated density ridge.

The hottest zone (brightest colors) sits around latitude -20, longitude 180-182, the densest concentration of seismic activity. The elongated shape follows the tectonic arc. The faint white dots show individual earthquakes layered underneath for reference.

An alternative approach is a gridded heatmap, divide the map into rectangular bins and count earthquakes per bin. This is cruder than KDE but can reveal local concentrations that smooth density misses.

Rgeombin2d gridded heatmap

# Gridded heatmap with geom_bin2d ggplot(quakes, aes(x = long, y = lat)) + geom_bin2d(bins = 30) + scale_fill_viridis_c(option = "magma", name = "Count") + labs(title = "Earthquake Count per Grid Cell", x = "Longitude", y = "Latitude") + theme_minimal() #> A grid of colored rectangles. The brightest cells contain #> 30-40+ earthquakes each, concentrated in the central arc. #> Many cells are empty (no quakes in that region).

The gridded heatmap shows the raw counts per cell, some cells contain 30+ earthquakes while most of the map is empty. This gives a more granular view than KDE, but the pattern depends heavily on bin size. Smaller bins show more local detail; larger bins show broader trends.

The smoothness of the KDE depends on the bandwidth parameter (h). Smaller bandwidth captures local detail but is noisier; larger bandwidth reveals broad patterns but may wash out local hotspots. Let's compare two bandwidth settings side by side.

RCompare KDE bandwidths

# Compare KDE bandwidths p1_data <- ggplot(quakes, aes(x = long, y = lat)) + stat_density_2d_filled(h = c(1, 1), alpha = 0.8) + scale_fill_viridis_d(option = "plasma") + labs(title = "Narrow bandwidth (h=1)", x = "Longitude", y = "Latitude") + theme_minimal() + theme(legend.position = "none") p2_data <- ggplot(quakes, aes(x = long, y = lat)) + stat_density_2d_filled(h = c(3, 3), alpha = 0.8) + scale_fill_viridis_d(option = "plasma") + labs(title = "Wide bandwidth (h=3)", x = "Longitude", y = "Latitude") + theme_minimal() + theme(legend.position = "none") # Display side by side gridExtra::grid.arrange(p1_data, p2_data, ncol = 2) #> Left panel (h=1): Multiple distinct hotspots visible along the arc, #> with tighter contours around each concentration. #> Right panel (h=3): A single smooth blob, local detail is lost but #> the overall concentration zone is clear.

The narrow bandwidth (h=1) reveals multiple distinct hotspots along the arc, while the wide bandwidth (h=3) smears everything into one broad peak. For exploratory analysis, try both, narrow bandwidths help identify specific hotspots, wide bandwidths confirm the general trend.

Tip

The h parameter in stat_density_2d controls the bandwidth (smoothing amount). Smaller h gives more detail but more noise. Larger h gives smoother results but loses local hotspots. Start with the default (automatic bandwidth selection) and then experiment with h = c(value, value) to tune the resolution.

Try it: Create a density heatmap for only earthquakes with mag >= 4.5. How do the hotspot locations compare to the all-earthquakes heatmap?

RExercise: Density of strong quakes

# Try it: density heatmap for mag >= 4.5 ex_strong_quakes <- quakes |> filter(mag >= 4.5) # your code here: create stat_density_2d plot #> Expected: Hotspots shift slightly, stronger quakes #> concentrate in a narrower band than all quakes.

Click to reveal solution

RDensity of strong quakes solution

ex_strong_quakes <- quakes |> filter(mag >= 4.5) ggplot(ex_strong_quakes, aes(x = long, y = lat)) + stat_density_2d(aes(fill = after_stat(level)), geom = "polygon", alpha = 0.7) + scale_fill_viridis_c(option = "inferno", name = "Density") + geom_point(alpha = 0.2, size = 0.5, color = "white") + labs(title = "Density Heatmap: Earthquakes with mag >= 4.5", x = "Longitude", y = "Latitude") + theme_minimal() #> The hotspot is tighter and shifted slightly compared to the #> full dataset. Strong earthquakes concentrate in fewer zones #> along the arc.

Explanation: Filtering to stronger earthquakes narrows the hotspot. The peak density zone remains in the central arc but covers a smaller area, confirming that the strongest seismic activity is more geographically concentrated.

Practice Exercises

Exercise 1: Cluster Profile Table

Cluster the quakes into 5 groups using K-means on long and lat, then compute the mean magnitude, mean depth, count, and standard deviation of magnitude for each cluster. Which cluster has the highest average magnitude? Display the result as a summary table.

RExercise 1: Cluster profile summary

# Exercise: cluster profile summary # Hint: after kmeans(), add cluster labels to quakes, # then group_by(cluster) and summarise() set.seed(123) # Write your code below:

Click to reveal solution

RCluster profile summary solution

set.seed(123) my_km <- kmeans(quakes[, c("long", "lat")], centers = 5, nstart = 25) my_clustered <- quakes |> mutate(cluster = factor(my_km$cluster)) my_summary <- my_clustered |> group_by(cluster) |> summarise( count = n(), mean_mag = round(mean(mag), 2), sd_mag = round(sd(mag), 2), mean_depth = round(mean(depth), 0), .groups = "drop" ) |> arrange(desc(mean_mag)) print(my_summary) #> cluster count mean_mag sd_mag mean_depth #> 1 3 168 4.69 0.53 167 #> 2 5 178 4.66 0.52 258 #> 3 1 236 4.63 0.47 400 #> 4 2 210 4.58 0.47 345 #> 5 4 208 4.56 0.46 269

Explanation: The clusters differ in mean magnitude, depth, and count. The cluster with the highest mean magnitude also tends to have shallower earthquakes, which aligns with the geology of subduction zones, shallow earthquakes near the trench are often stronger.

Exercise 2: Moran's I Function

Write a reusable function my_morans_i(coords, values, k) that takes a coordinate matrix, a values vector, and k (number of neighbours), and returns Moran's I. Test it on the full quakes dataset's magnitude with k = 3, 5, and 10. How does Moran's I change as k increases?

RExercise 2: Reusable Morans I

# Exercise: reusable Moran's I function # Hint: follow the manual computation pattern from earlier, # wrapping it in a function my_morans_i <- function(coords, values, k) { # Write your code here } # Test with different k values:

Click to reveal solution

RReusable Morans I solution

my_morans_i <- function(coords, values, k) { d <- as.matrix(dist(coords)) n <- length(values) W <- matrix(0, n, n) for (i in 1:n) { nbs <- order(d[i, ])[2:(k + 1)] W[i, nbs] <- 1 } x_dev <- values - mean(values) num <- sum(W * outer(x_dev, x_dev)) den <- sum(x_dev^2) (n / sum(W)) * (num / den) } # Test on a sample (full dataset is slow for dist()) set.seed(55) my_idx <- sample(1:nrow(quakes), 200) my_coords <- as.matrix(quakes[my_idx, c("long", "lat")]) my_vals <- quakes$mag[my_idx] cat("k=3:", round(my_morans_i(my_coords, my_vals, 3), 4), "\n") cat("k=5:", round(my_morans_i(my_coords, my_vals, 5), 4), "\n") cat("k=10:", round(my_morans_i(my_coords, my_vals, 10), 4), "\n") #> k=3: 0.1045 #> k=5: 0.0793 #> k=10: 0.0614

Explanation: Moran's I decreases as k increases because larger neighbourhoods dilute the local spatial signal. With k=3, only the very nearest neighbours contribute, capturing fine-grained local similarity. With k=10, the neighbourhood is broader and includes less-similar points, reducing the autocorrelation measure.

Exercise 3: Multi-Panel Spatial Dashboard

Create a 2x2 panel display with four ggplots: (a) point map colored by magnitude, (b) K-means clusters (k=5), (c) density heatmap, (d) depth vs magnitude scatter colored by cluster. Use gridExtra::grid.arrange() to combine them.

RExercise 3: Four-panel dashboard

# Exercise: 4-panel spatial dashboard # Hint: create 4 separate ggplot objects (p1, p2, p3, p4), # then combine with gridExtra::grid.arrange(p1, p2, p3, p4, ncol = 2) # Write your code below:

Click to reveal solution

RFour-panel dashboard solution

set.seed(123) my_km5 <- kmeans(quakes[, c("long", "lat")], centers = 5, nstart = 25) my_quakes5 <- quakes |> mutate(cluster = factor(my_km5$cluster)) p1 <- ggplot(quakes, aes(x = long, y = lat, color = mag)) + geom_point(alpha = 0.5, size = 1) + scale_color_viridis_c(option = "plasma") + labs(title = "(a) Magnitude Map") + theme_minimal() + theme(legend.position = "none") p2 <- ggplot(my_quakes5, aes(x = long, y = lat, color = cluster)) + geom_point(alpha = 0.5, size = 1) + labs(title = "(b) K-Means Clusters (k=5)") + theme_minimal() + theme(legend.position = "none") p3 <- ggplot(quakes, aes(x = long, y = lat)) + stat_density_2d(aes(fill = after_stat(level)), geom = "polygon", alpha = 0.7) + scale_fill_viridis_c(option = "inferno") + labs(title = "(c) Density Heatmap") + theme_minimal() + theme(legend.position = "none") p4 <- ggplot(my_quakes5, aes(x = depth, y = mag, color = cluster)) + geom_point(alpha = 0.4, size = 1) + labs(title = "(d) Depth vs Mag by Cluster") + theme_minimal() + theme(legend.position = "none") gridExtra::grid.arrange(p1, p2, p3, p4, ncol = 2) #> A 2x2 dashboard showing all four perspectives of the data. #> Panel (d) reveals that different spatial clusters occupy #> different depth ranges, a non-obvious insight from the #> spatial clustering.

Explanation: The dashboard ties together all the techniques from this tutorial. Panel (d) is especially revealing, it shows that spatial clusters correspond to different depth ranges, connecting the geographic clustering back to the underlying geological structure.

Putting It All Together

Let's run through a complete geospatial EDA pipeline from start to finish. We'll load the data, visualize the spatial distribution, cluster the locations, test for spatial autocorrelation, and identify density hotspots, all in one cohesive workflow.

RPipeline step 1: load and inspect

# === Complete Geospatial EDA Pipeline === # 1. Load and inspect cat("=== Dataset Overview ===\n") cat("Rows:", nrow(quakes), "| Columns:", ncol(quakes), "\n") cat("Lat range:", range(quakes$lat), "\n") cat("Long range:", range(quakes$long), "\n") cat("Mag range:", range(quakes$mag), "\n\n") #> === Dataset Overview === #> Rows: 1000 | Columns: 5 #> Lat range: -38.59 -10.72 #> Long range: 165.67 188.13 #> Mag range: 4 6.4 # 2. Spatial distribution cat("=== Spatial Distribution ===\n") ggplot(quakes, aes(x = long, y = lat, color = mag)) + geom_point(alpha = 0.5, size = 1.5) + scale_color_viridis_c(option = "plasma", name = "Magnitude") + labs(title = "Step 1: Earthquake Spatial Distribution", x = "Longitude", y = "Latitude") + theme_minimal() #> Arc-shaped distribution along the Tonga-Kermadec trench.

RPipeline step 2: cluster and summarise

# 3. K-means clustering set.seed(42) pipeline_km <- kmeans(quakes[, c("long", "lat")], centers = 5, nstart = 25) quakes_pipeline <- quakes |> mutate(cluster = factor(pipeline_km$cluster)) cat("=== Cluster Summary ===\n") quakes_pipeline |> group_by(cluster) |> summarise(n = n(), mean_mag = round(mean(mag), 2), mean_depth = round(mean(depth), 0), .groups = "drop") |> print() #> cluster n mean_mag mean_depth #> 1 1 236 4.63 400 #> 2 2 210 4.58 345 #> 3 3 168 4.69 167 #> 4 4 208 4.56 269 #> 5 5 178 4.66 258 ggplot(quakes_pipeline, aes(x = long, y = lat, color = cluster)) + geom_point(alpha = 0.5, size = 1.5) + labs(title = "Step 2: K-Means Spatial Clusters (k=5)", x = "Longitude", y = "Latitude") + theme_minimal() #> Five color-coded clusters segmenting the arc.

RPipeline step 3: spatial autocorrelation

# 4. Spatial autocorrelation (on a sample for speed) set.seed(77) samp_idx <- sample(1:nrow(quakes), 200) samp_coords <- as.matrix(quakes[samp_idx, c("long", "lat")]) samp_mag <- quakes$mag[samp_idx] samp_d <- as.matrix(dist(samp_coords)) n_s <- length(samp_mag) W_s <- matrix(0, n_s, n_s) for (i in 1:n_s) { nbs <- order(samp_d[i, ])[2:6] W_s[i, nbs] <- 1 } dev_s <- samp_mag - mean(samp_mag) I_obs <- (n_s / sum(W_s)) * (sum(W_s * outer(dev_s, dev_s)) / sum(dev_s^2)) set.seed(42) I_null <- replicate(999, { sh <- sample(samp_mag) d_sh <- sh - mean(sh) (n_s / sum(W_s)) * (sum(W_s * outer(d_sh, d_sh)) / sum(d_sh^2)) }) cat("\n=== Spatial Autocorrelation ===\n") cat("Moran's I (magnitude):", round(I_obs, 4), "\n") cat("p-value:", mean(I_null >= I_obs), "\n") #> === Spatial Autocorrelation === #> Moran's I (magnitude): 0.0834 #> p-value: 0.008

RPipeline step 4: density hotspots

# 5. Density heatmap cat("=== Density Hotspots ===\n") ggplot(quakes, aes(x = long, y = lat)) + stat_density_2d(aes(fill = after_stat(level)), geom = "polygon", alpha = 0.7) + scale_fill_viridis_c(option = "inferno", name = "Density") + geom_point(alpha = 0.1, size = 0.3, color = "white") + labs(title = "Step 3: Earthquake Density Hotspots", x = "Longitude", y = "Latitude") + theme_minimal() #> The central arc (lat -22 to -18, long 178-184) shows #> the highest earthquake density. cat("\n=== Pipeline Complete ===\n") cat("Key findings:\n") cat("- 1000 earthquakes form an arc along the Tonga-Kermadec trench\n") cat("- 5 spatial clusters identified via K-means\n") cat("- Magnitude shows weak but significant spatial autocorrelation\n") cat(" (Moran's I ≈ 0.08, p < 0.01)\n") cat("- Primary hotspot: lat -22 to -18, long 178-184\n") #> === Pipeline Complete === #> Key findings: #> - 1000 earthquakes form an arc along the Tonga-Kermadec trench #> - 5 spatial clusters identified via K-means #> - Magnitude shows weak but significant spatial autocorrelation #> (Moran's I ≈ 0.08, p < 0.01) #> - Primary hotspot: lat -22 to -18, long 178-184

This pipeline demonstrates the four pillars of geospatial EDA: visualize the spatial distribution, identify clusters, test for spatial autocorrelation, and map density hotspots. Each technique answers a different question about the same data, and together they build a complete picture of the spatial structure.

Summary

Technique	What it reveals	Key R function	When to use it
Point map	Raw spatial distribution	`ggplot2 + geom_point()`	First look at any spatial data
Bubble map	Variable values at locations	`geom_point(aes(size = var))`	Show magnitude/intensity on a map
Faceted map	Distribution by category	`facet_wrap()`	Compare spatial patterns across groups
K-means	Geographic clusters	`kmeans()`	Segment locations into regions
Elbow method	Optimal cluster count	`sapply()` + `kmeans()`	Choose k before committing to a clustering
Moran's I	Spatial autocorrelation strength	Manual computation with `dist()`	Test if a variable has spatial pattern
Permutation test	Significance of Moran's I	`sample()` + loop	Determine if spatial pattern is real
KDE heatmap	Density hotspots	`stat_density_2d()`	Find concentration zones in dense data
Gridded heatmap	Count per area	`geom_bin2d()`	Quick binned density view

The most important takeaway: always check for spatial structure before applying standard statistical methods. If your data is spatially autocorrelated, methods that assume independence (like standard regression) may give misleading results.

References

Bivand, R.S., Pebesma, E., Gómez-Rubio, V., Applied Spatial Data Analysis with R, 2nd Ed. Springer (2013). Link
Moran, P.A.P., "Notes on continuous stochastic phenomena." Biometrika 37(1-2), 17-23 (1950). Link
Tobler, W., "A Computer Movie Simulating Urban Growth in the Detroit Region." Economic Geography 46(sup1), 234-240 (1970). Link
Anselin, L., "Local Indicators of Spatial Association, LISA." Geographical Analysis 27(2), 93-115 (1995). Link
Wickham, H., ggplot2: Elegant Graphics for Data Analysis, 3rd Ed. Springer (2024). Link
R Core Team, quakes dataset documentation. Link
rspatial.org, Introduction to Spatial Analysis in R. Link
Pebesma, E., "Simple Features for R: Standardized Support for Spatial Vector Data." The R Journal 10(1), 439-446 (2018). Link

Continue Learning

Univariate EDA in R, Master the fundamentals of single-variable EDA before layering on spatial analysis.
Choropleth Maps in R, Learn how to create region-shaded maps with sf and ggplot2 for polygon-based spatial data.
Bivariate EDA in R, Explore relationships between two variables, the non-spatial companion to this spatial tutorial.

Navigate

Tidyverse packages

Deep dives

Wrangling & EDA

Statistics

Machine Learning

Time Series

By Industry

Reporting & Apps

Levels

EDA for Geospatial Data in R: Spatial Distribution & Clustering

What makes geospatial data different from regular tabular data?

How do you visualize spatial distributions in R?

How do you find spatial clusters with K-means?

How do you measure whether spatial patterns are random or real?

How do you build density heatmaps to find hotspots?

Practice Exercises

Exercise 1: Cluster Profile Table

Exercise 2: Moran's I Function

Exercise 3: Multi-Panel Spatial Dashboard

Putting It All Together

Summary

References

Continue Learning

Navigate

Tidyverse packages

Deep dives

Wrangling & EDA

Statistics

Machine Learning

Time Series

By Industry

Reporting & Apps

Levels

EDA for Geospatial Data in R: Spatial Distribution & Clustering

What makes geospatial data different from regular tabular data?

How do you visualize spatial distributions in R?

How do you find spatial clusters with K-means?

How do you measure whether spatial patterns are random or real?

How do you build density heatmaps to find hotspots?

Practice Exercises

Exercise 1: Cluster Profile Table

Exercise 2: Moran's I Function

Exercise 3: Multi-Panel Spatial Dashboard

Putting It All Together

Summary

References

Continue Learning

Related Tutorials