r-statistics.co by Selva Prabhakaran


Loess Regression is the most common method used to smoothen a volatile time series. It is a non-parametric methods where least squares regression is performed in localized subsets, which makes it a suitable candidate for smoothing any numerical vector.

Introduction

Loess short for Local Regression is a non-parametric approach that fits multiple regressions in local neighborhood. This can be particularly resourceful, if you know that your X variables are bound within a range.

Loess regression can be applied using the loess() on a numerical vector to smoothen it and to predict the Y locally (i.e, within the trained values of Xs). The size of the neighborhood can be controlled using the span argument, which ranges between 0 to 1. It controls the degree of smoothing. So, the greater the value of span, more smooth is the fitted curve.

The predictor variable can just be indices from 1 to number of observations in the absence of explanatory variables. If other explanatory variables are available, they can be used as well (maximum of 4).

Example

For this example we will try to locally regress and smooth the median duration of unemployment based on the economics dataset from ggplot2 package. We consider only the first 80 rows for this analysis, so it is easier to observe the degree of smoothing in the graphs below.

data(economics, package="ggplot2")  # load data
economics$index <- 1:nrow(economics)  # create index variable
economics <- economics[1:80, ]  # retail 80rows for better graphical understanding
loessMod10 <- loess(uempmed ~ index, data=economics, span=0.10) # 10% smoothing span
loessMod25 <- loess(uempmed ~ index, data=economics, span=0.25) # 25% smoothing span
loessMod50 <- loess(uempmed ~ index, data=economics, span=0.50) # 50% smoothing span

Predict Loess

# get smoothed output
smoothed10 <- predict(loessMod10) 
smoothed25 <- predict(loessMod25) 
smoothed50 <- predict(loessMod50) 

From above plot, you would notice that as the span increases, the smoothing of the curve also increases.

Code for Plot
# Plot it
plot(economics$uempmed, x=economics$date, type="l", main="Loess Smoothing and Prediction", xlab="Date", ylab="Unemployment (Median)")
lines(smoothed10, x=economics$date, col="red")
lines(smoothed25, x=economics$date, col="green")
lines(smoothed50, x=economics$date, col="blue")

Finding the optimal smoothing span

As the smoothing span changes, the accuracy of the fitted curve also changes. If your intent is to minimize the error, the optim() can be used to find that value of span, that minimizes the Sum of Squared Errors (SSE). For this case, it is graphically intuitive that lower SSE will likely be achieved at lower values of span, but for more challenging cases, optimizing span could help.

To implement optim(), we define the function that computes the SSE. An error handling mechanism is needed to address very low values of span and cases where the non-numerics are produced. The simulated annealing method (SANN) is implemented here to find the span that gives minimal SSE. The par argument specifies the first value of the span at which optim() will begin the search.

# define function that returns the SSE
calcSSE <- function(x){
  loessMod <- try(loess(uempmed ~ index, data=economics, span=x), silent=T)
  res <- try(loessMod$residuals, silent=T)
  if(class(res)!="try-error"){
    if((sum(res, na.rm=T) > 0)){
      sse <- sum(res^2)  
    }
  }else{
    sse <- 99999
  }
  return(sse)
}

# Run optim to find span that gives min SSE, starting at 0.5
optim(par=c(0.5), calcSSE, method="SANN")
#> $par
#> [1] 0.05433545
#> 
#> $value
#> [1] 3.85753e-28
#> 
#> $counts
#> function gradient 
#>    10000       NA 
#> 
#> $convergence
#> [1] 0
#> 
#> $message
#> NULL

For this case, the best value of span turns out to be 0.05433 and the minimum SSE achieved is 3.85e-28.