[RsR] Seasonal adjustment; What's a good `blackbox' robust lm() function?

Sun Jun 8 17:14:27 CEST 2008

On Sun, Jun 8, 2008 at 10:19 AM, Ajay Shah <ajayshah using mayin.org> wrote:
> There is no support for x12arima or Tramo/Seats in R. I am doing a

Check out:

http://finzi.psych.upenn.edu/R/Rhelp02a/archive/40254.html

Also more below.

> poor man's seasonal adjustment procedure: For monthly data, I do a
> regression with no intercept and 12 dummy variables (i.e. a month
> factor), and then calculating the residual.
>
> If I did this using lm() the procedure is always feasible as long as
> there are atleast 12 data points.
>
> When I started using this, I noticed that quite a few time series have
> extreme values and these were making the function generate wonky
> values. So I thought I should switch to a robust lm(). I was surprised
> to find that MASS:rlm() does not always converge. For quite a few
> series I find that it breaks. I wonder what the gurus on this list
> think. I can come back with concrete numerical examples of time-series
> where MASS:rlm() does not work, if that's interesting.
>
> More generally, if you have any suggestions on how to do seasonal
> adjustment, they would be most interesting.

Noting that months(chron.object) and quarters(chron.object)
gives an ordered factor, here are a few approaches to the coefs:

library(chron)
library(quantreg)
rq(cpi ~ months(as.chron(date)) -1, data = n, method = "fn")

or what is basically the same, the median

tapply(n$cpi, list(month = months(as.chron(n$date))), median, na.rm = TRUE)

or a trimmed mean:

tapply(n$cpi, list(month = months(as.chron(n$date))), mean, trim =
0.1, na.rm = TRUE)

Note that we can simplify the adjustment function using ave. The
following subtracts out
monthly trimmed means from each data point:

ave(n$cpi, months(as.chron(n$date)), FUN = function(x) x - mean(x,
trim = 0.1, na.rm = TRUE))

>
> My code is:
>
> # A vector of data "x" and a vector of dates "dates" is supplied.
> # You must choose whether you want to focus on monthly or quarterly
> # seasonality. The function will then do the work of mapping from
> # "dates" into the relevant vector of dummy variables.
> #
> # x is assumed to be in levels.
> # A dummy variable regression is done.
> # NA values for x are correctly handled, but NA values for the
> # dates are not tolerated.
> # It is typical for you to send in c(NA,diff(log(levels))) in which case
> #  the dummy variables regression is additive seasonality in log
> #  differences.
> # You get back a vector which has the same length as what came in.
> # In addition, you also get back the results of the dummy variable
> # regression.
> library(MASS)
>
> date2month <- function(dates) {
>  factor(as.POSIXlt(dates)$mon, levels=0:11,
>         labels=c("Jan", "Feb", "Mar", "Apr",
>           "May", "Jun", "Jul", "Aug",
>           "Sep", "Oct", "Nov", "Dec"))
> }
> date2quarter <- function(dates) {
>  factor(c(1,1,1,2,2,2,3,3,3,4,4,4)[as.numeric(date2month(dates))],
>         levels=1:4, labels=c("Q1","Q2","Q3","Q4"))
> }
>
> sa.dummyvariableregression <- function(x, dates, seasonality="monthly") {
>  d <- data.frame(x=x, date=dates)
>  if (seasonality == "monthly") {
>    d$pfactor <- date2month(d$date)
>  } else {
>    d$pfactor <- date2quarter(d$date)
>  }
>  non.na.locations <- !is.na(d$x)
>  d <- d[non.na.locations,]
>  # At this point, we're holding the clean non-NA data.
>  # Dummy variable regression: do rlm() if you can but use lm() otherwise.
>  m <- rlm(x ~ -1 + pfactor, data=d)
>  if (! m$converged) {
>     cat("rlm failed to converge; switching to lm().\n")
>     m <- lm(x ~ -1 + pfactor, data=d)
>  }
>
>  # Extract residuals
>  sa <- m$residuals               # m$residuals is one short
>  sa <- sa + mean(coef(m))        # Intercepts have soaked up the mean
>  # Place back into the original layout with NAs in the right places
>  sa.full <- rep(NA, length(non.na.locations))
>  sa.full[non.na.locations] <- sa
>
>  # Send back the goodies.
>  list(sa=sa.full, m=m)
> }
>
> Here's some data to experiment with this. It's a data frame "n"
> containing n$date and n$cpi.
>
> n <- structure(list(date = structure(c(7395, 7425, 7456, 7486, 7517,
>    7548, 7578, 7609, 7639, 7670, 7701, 7729, 7760, 7790, 7821, 7851,
>    7882, 7913, 7943, 7974, 8004, 8035, 8066, 8095, 8126, 8156, 8187,
>    8217, 8248, 8279, 8309, 8340, 8370, 8401, 8432, 8460, 8491, 8521,
>    8552, 8582, 8613, 8644, 8674, 8705, 8735, 8766, 8797, 8825, 8856,
>    8886, 8917, 8947, 8978, 9009, 9039, 9070, 9100, 9131, 9162, 9190,
>    9221, 9251, 9282, 9312, 9343, 9374, 9404, 9435, 9465, 9496, 9527,
>    9556, 9587, 9617, 9648, 9678, 9709, 9740, 9770, 9801, 9831, 9862,
>    9893, 9921, 9952, 9982, 10013, 10043, 10074, 10105, 10135, 10166,
>    10196, 10227, 10258, 10286, 10317, 10347, 10378, 10408, 10439,
>    10470, 10500, 10531, 10561, 10592, 10623, 10651, 10682, 10712,
>    10743, 10773, 10804, 10835, 10865, 10896, 10926, 10957, 10988,
>    11017, 11048, 11078, 11109, 11139, 11170, 11201, 11231, 11262,
>    11292, 11323, 11354, 11382, 11413, 11443, 11474, 11504, 11535,
>    11566, 11596, 11627, 11657, 11688, 11719, 11747, 11778, 11808,
>    11839, 11869, 11900, 11931, 11961, 11992, 12022, 12053, 12084,
>    12112, 12143, 12173, 12204, 12234, 12265, 12296, 12326, 12357,
>    12387, 12418, 12449, 12478, 12509, 12539, 12570, 12600, 12631,
>    12662, 12692, 12723, 12753, 12784, 12815, 12843, 12874, 12904,
>    12935, 12965, 12996, 13027, 13057, 13088, 13118, 13149, 13180,
>    13208, 13239, 13269, 13300, 13330, 13361, 13392, 13422, 13453,
>    13483, 13514, 13545, 13573, 13604, 13634, 13665, 13695, 13726,
>    13757, 13787, 13818, 13848, 13879, 13910, 13939, 13970), class = "Date"),
>        cpi = c(38.88, 39.31, 39.96, 40.82, 41.04, 41.25, 42.12,
>        42.76, 42.98, 43.63, 43.63, 43.41, 43.63, 44.06, 45.14, 46.22,
>        46.87, 47.73, 48.16, 48.6, 48.6, 49.24, 49.46, 49.46, 49.89,
>        50.54, 50.97, 52.27, 52.27, 52.48, 52.7, 52.7, 52.48, 52.05,
>        52.27, 52.48, 52.92, 53.13, 54, 54.64, 55.29, 55.94, 56.59,
>        57.24, 57.02, 56.8, 57.24, 57.67, 58.1, 58.75, 59.83, 60.69,
>        61.34, 62.2, 62.42, 62.85, 62.42, 62.42, 62.85, 63.28, 63.71,
>        64.79, 66.09, 67.6, 68.03, 68.47, 68.9, 69.33, 68.47, 68.03,
>        68.25, 68.9, 69.98, 70.84, 71.92, 73.22, 74.08, 74.3, 74.73,
>        75.38, 75.59, 75.59, 75.59, 75.81, 76.46, 76.03, 76.67, 77.32,
>        77.54, 77.97, 78.83, 79.05, 80.35, 82.94, 82.51, 82.07, 82.72,
>        84.02, 86.18, 88.77, 89.2, 90.71, 93.52, 94.6, 92.66, 90.71,
>        89.63, 89.42, 89.63, 90.5, 90.71, 91.58, 92.01, 92.66, 94.38,
>        94.6, 93.09, 93.09, 92.87, 93.74, 94.6, 95.03, 95.46, 96.11,
>        95.68, 95.9, 96.98, 97.19, 96.33, 96.11, 95.68, 96.11, 96.76,
>        97.41, 98.7, 100, 100.65, 100.43, 101.08, 101.94, 101.3,
>        100.86, 100.65, 101.08, 101.3, 101.94, 102.81, 103.89, 104.54,
>        104.75, 105.18, 105.62, 104.54, 104.32, 104.54, 105.18, 106.48,
>        106.7, 107.34, 108.21, 107.78, 107.78, 108.64, 108.86, 108.42,
>        108.86, 108.86, 108.86, 108.86, 109.72, 110.58, 111.66, 112.74,
>        112.96, 113.61, 113.39, 112.53, 113.61, 113.39, 113.39, 114.25,
>        113.82, 114.25, 116.2, 116.63, 117.06, 118.36, 119.44, 118.79,
>        119, 119, 119, 120, 121, 123, 124, 124, 125, 127, 127, 127,
>        127, 128, 127, 128, 129, 130, 132, 133, 133, 134, 134, 134,
>        134, 135, 137, 138)), .Names = c("date", "cpi"), row.names = 4:220, class = "data.frame", na.action = structure(c(1L,
>    2L, 3L, 221L), .Names = c("1", "2", "3", "221"), class = "omit"))
>
> --
> Ajay Shah                                      http://www.mayin.org/ajayshah
> ajayshah using mayin.org                             http://ajayshahblog.blogspot.com
> <*(:-? - wizard who doesn't know the answer.
>
> _______________________________________________
> R-SIG-Robust using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-robust
>