[R-sig-finance] How can I do this better? (Filling in last traded price for NA)

john.gavin at ubs.com john.gavin at ubs.com
Mon Sep 13 17:57:41 CEST 2004


Hi Ajay,

You will probably get other suggestions
along the following lines,
which use 'rle' and 'rep' to speed things up:

fillIn2 <- function(x)
{ bef <- x # keep a copy for display purposes only.
  xRle <- rle(is.na(x))
  # get indices where each NA seq starts (low) and stops (upp)
  upp <- (sumX <- cumsum(xRle$lengths))[xRle$values]
  low <- sumX[which(xRle$values)-1]+1
  # special case: NA at start _only_, i.e. c(NA, ..., NA, notNA, ..., notNA)
  if (length(low) == 0) return(cbind(before = x , after = x))
  # special case: NA at start and elsewhere
  if (length(upp) == length(low)+1) upp <- upp[-1]
  # The critical bit is the 'rep' on the RHS.
  # On the LHS, don't replace NAs at the start, if any.
  ind <- low[1]-1
  x[ind + which(is.na(x[-seq(ind)]))] <- x[rep(low-1, upp-low+1)]
  cbind(before = bef , after = x) # show off before and after effect
}
set.seed(123)
x <- 1:10
x[sample(length(x), floor(length(x)/2))] <- NA
fillIn2(x)

should produce

> fillIn2(x)
      before after
 [1,]      1     1
 [2,]      2     2
 [3,]     NA     2
 [4,]     NA     2
 [5,]      5     5
 [6,]     NA     5
 [7,]     NA     5
 [8,]     NA     5
 [9,]      9     9
[10,]     10    10
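
To see what fillIn2 is doing internally, here is the run-length
bookkeeping for this x (NA runs at positions 3:4 and 6:8, as in the
'before' column above); just an illustration, with the values written
out in the comments:

xRle <- rle(is.na(x))
xRle$lengths         # 2 2 1 3 2
xRle$values          # FALSE TRUE FALSE TRUE FALSE
cumsum(xRle$lengths) # 2 4 5 8 10
# hence upp = c(4, 8) (the ends of the NA runs), low = c(3, 6) (their starts),
# and rep(low-1, upp-low+1) repeats index 2 twice and index 5 three times,
# i.e. the RHS of the assignment is c(x[2], x[2], x[5], x[5], x[5]).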

The code seems clunky, with its special cases,
so it is probably not optimal.
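
For what it is worth, a more compact idiom avoids the run bookkeeping
and the special cases altogether, by indexing the vector of non-NA
values with a cumulative count. A sketch only ('fillIn3' is a name
made up here, and it is not part of the timings below):

fillIn3 <- function(x)
{ # position of the most recent non-NA value; 0 before the first observation
  idx <- cumsum(!is.na(x))
  # prepend an NA so that a position of 0 (leading NAs) maps back to NA
  after <- c(NA, x[!is.na(x)])[idx + 1]
  cbind(before = x, after = after)
}
fillIn3(x) # should agree with fillIn2(x) above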

However, even with the special cases, fillIn2 is faster than, say, using 'mapply':

fillIn <- function(x)
{ bef <- x
  xRle <- rle(is.na(x))
  upp <- cumsum(xRle$lengths)[xRle$values]
  low <- cumsum(xRle$lengths)[which(xRle$values)-1]+1
  if (length(upp) == length(low)+1) upp <- upp[-1]
  # fill each NA run x[l:u] with the value just before it, x[l-1]
  mapply(function(l, u) x[l:u] <<- x[l-1], low, upp)
  cbind(before = bef , after = x) # show off before and after effect
}
fillIn(x)

Some simulations to compare timings,
based on vectors of varying lengths with 50% of the elements set to NA:

simFillIn <- function(n, method = c("rep", "mapply"))
{ aa <- rpois(n, 5)
  aa[sample(seq(n), floor(n * .5))] <- NA
  method <- match.arg(method)
  ansTime <- system.time(ans <- 
    switch(method,
      mapply = fillIn(aa),
      rep = fillIn2(aa), 
      stop("wrong method")
  )) # switch system.time
  list(time = ansTime) # only the timing is returned, not 'ans' itself
}
ans <- lapply(c(2e4, 1e4, 1e3, 1e2, 1e1), simFillIn, method = "mapply")
lapply(ans, "[[", "time")
ans <- lapply(c(2e4, 1e4, 1e3, 1e2, 1e1), simFillIn, method = "rep")
lapply(ans, "[[", "time")

fillIn (with 'mapply') seems at least 10 times slower
than fillIn2 (with 'rep').
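
To come back to the original question, applying the fill column by
column to the merged matrix ('massive' in your code below) should work
along these lines. This is an untested sketch on a plain numeric
matrix; 'fillInColumns' is just a name used here, and pulling the data
out of the 'its' object and re-attaching the dates is left aside:

fillInColumns <- function(m)
  apply(m, 2, function(col) fillIn2(col)[, "after"])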

Regards,

John.

John Gavin <john.gavin at ubs.com>,
Quantitative Risk Models and Statistics,
UBS Investment Bank, 6th floor, 
100 Liverpool St., London EC2M 2RH, UK.
Phone +44 (0) 207 567 4289
Fax   +44 (0) 207 568 5352


Ajay Shah wrote:

>I have 3 different daily time-series. Using union() in the "its"
>package, I can make a long matrix, where rows are created when even
>one of the three time-series is observed:
>
>massive <- union(nifty.its, union(inrusd.its, infosys.its))
>
>Now in this, I want to replace NA values for prices by the
>most-recently observed price. I can do this painfully --
>
>for (i in 2:nrow(massive)) {
>  for (j in 1:3) {
>    if (is.na(massive[i,j])) {
>      massive[i,j] = massive[i-1,j]
>    }
>  }
>}
>
>But this is horribly slow. Is there a more clever way?



