[R-sig-finance] How can I do this better? (Filling in last traded price for NA)

Tue Sep 14 05:31:12 CEST 2004

How about doing this in C? With the example data (190MB), the run time drops
from 11+ secs (variable) to consistently under 1 second. Even when LOCF stops
working, the C method carries on, since it requires no working memory. Some
more work would be required to make it 'safe'. The matrix is filled in place,
so if the original is still needed, take a copy first (or
save(,compress=TRUE) it out).

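#include <R.h>   /* ISNA */

/* EXPORT is assumed to be defined by the build, e.g. as
   __declspec(dllexport) on Windows or as nothing elsewhere. */
/* Fill NAs in place, column by column (column-major storage), with the
   last non-NA value above. Leading NAs in a column are left as NA.
   Not 'safe': assumes *rows >= 1. */
EXPORT void fillna (double *ans, int *rows, int *cols)
{
   int r=0, c=0;
   double last;
   for (c=0; c<*cols; c++) {
      last = *ans++;                     /* first value of the column */
      for (r=1; r<*rows; r++) {
         if (!ISNA(*ans)) last = *ans;   /* remember latest non-NA */
         *ans++ = last;                  /* overwrite (no-op unless NA) */
      }
   }
}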

fill.na.byref = function(m)
{
   if (!is.matrix(m) || storage.mode(m)!="double") {
      stop("input must be a matrix, storage mode double")
   }
   invisible(.C("fillna", m, as.integer(nrow(m)), as.integer(ncol(m)),
                DUP=FALSE, NAOK=TRUE))
}
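
As a rough idea of the extra work needed to make this 'safe', here is a
minimal standalone sketch. It is not the function timed in this post: the
name fillna_checked is made up, and it treats any NaN as missing via isnan()
(an assumption, since ISNA() is only available through the R headers). It
guards against a NULL pointer and negative dimensions, and it never reads
the first element of an empty column.

```c
#include <math.h>    /* isnan */
#include <stddef.h>  /* size_t, NULL */

/*
 * Hypothetical standalone variant of fillna (not the posted code).
 * Fills missing values in place, column by column, in a column-major
 * rows x cols matrix. Assumption: "missing" means any NaN.
 * Returns the number of cells overwritten, or -1 on invalid input.
 */
long fillna_checked(double *m, int rows, int cols)
{
    long filled = 0;
    if (m == NULL || rows < 0 || cols < 0) return -1;
    for (int c = 0; c < cols; c++) {
        double *col = m + (size_t)c * (size_t)rows;  /* start of column c */
        int have_last = 0;     /* set once a non-missing value is seen */
        double last = 0.0;
        for (int r = 0; r < rows; r++) {
            if (!isnan(col[r])) {
                last = col[r];
                have_last = 1;
            } else if (have_last) {
                col[r] = last; /* carry the last observation forward */
                filled++;
            }
        }
    }
    return filled;
}
```

As in the original, missing values at the top of a column are left alone,
since there is nothing to carry forward yet.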

For example:

> M = matrix(as.double(sample(100)), nrow=10)
> M[sample(100,50)] = NA
> M
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]   88   NA   NA   NA   27   57   NA   NA   NA    99
 [2,]   72   NA   83   NA   24   54   20   32   NA    NA
 [3,]   60   85   33   NA   NA   61    2   NA   50    NA
 [4,]   91   NA    8    5   NA   51   NA   39   NA    45
 [5,]   NA   93   21   NA   NA   48   NA   69   12    56
 [6,]   NA   NA   10   NA   NA   NA   14   53   NA    NA
 [7,]   NA   15   95   NA   43   NA   34   NA   75    90
 [8,]   NA   NA   NA   NA   37   NA   19   NA    7    96
 [9,]   81   NA   NA   NA   NA   89   36   NA   NA    87
[10,]   NA   77   NA   11   NA   18   NA   28   74    NA
> fill.na.byref(M)
> M
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]   88   NA   NA   NA   27   57   NA   NA   NA    99
 [2,]   72   NA   83   NA   24   54   20   32   NA    99
 [3,]   60   85   33   NA   24   61    2   32   50    99
 [4,]   91   85    8    5   24   51    2   39   50    45
 [5,]   91   93   21    5   24   48    2   69   12    56
 [6,]   91   93   10    5   24   48   14   53   12    56
 [7,]   91   15   95    5   43   48   34   53   75    90
 [8,]   91   15   95    5   37   48   19   53    7    96
 [9,]   81   15   95    5   37   89   36   53    7    87
[10,]   81   77   95   11   37   18   36   28   74    87
> 

With some 'large' data (190MB), LOCF works once (11 secs), then runs out of
memory:

> M = matrix(as.double(sample(5000*5000)), nrow=5000)
> object.size(M)/1024^2
[1] 190.7350
> M[sample(5000*5000, 5000)] = NA
> system.time(filled <<- apply(M, 2, LOCF))[3]
[1] 11.33
> system.time(filled <<- apply(M, 2, LOCF))[3]
Error: cannot allocate vector of size 195312 Kb
In addition: Warning message: 
Reached total allocation of 1200Mb: see help(memory.size) 
Timing stopped at: 10.28 0.25 10.67 NA NA 
> gc()
            used  (Mb) gc trigger  (Mb)
Ncells    427770  11.5     741108  19.8
Vcells 100105874 763.8  125499641 957.5
> system.time(filled <<- apply(M, 2, LOCF))[3]
Error: cannot allocate vector of size 195312 Kb
In addition: Warning message: 
Reached total allocation of 1200Mb: see help(memory.size) 
Timing stopped at: 0 0 0 NA NA 
> 
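
The sizes in the transcript above follow from 8 bytes per double. A small C
sketch of the arithmetic (the helper name matrix_bytes is made up, and it
ignores R's small per-object header):

```c
#include <stddef.h>  /* size_t */

/* Bytes needed for the data of a rows x cols matrix of doubles.
 * Illustrative helper only; assumes no overflow of size_t. */
size_t matrix_bytes(size_t rows, size_t cols)
{
    return rows * cols * sizeof(double);
}
```

With 8-byte doubles, matrix_bytes(5000, 5000) is 200,000,000 bytes, i.e.
195312.5 Kb or about 190.73 MB, which matches both the object.size() figure
and the size of the allocation that fails. Every further full copy of the
matrix needs another block of that size.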

Continuing with the same session, trying the C function:

> identical(filled, M)
[1] FALSE
> system.time(fill.na.byref(M))[3]
[1] 0.92
> identical(filled, M)
[1] TRUE

Running it several times (it shouldn't matter that the NAs are already
filled, since the same work has to be done):

> system.time(fill.na.byref(M))[3]
[1] 0.9
> system.time(fill.na.byref(M))[3]
[1] 0.89
> system.time(fill.na.byref(M))[3]
[1] 0.87
> system.time(fill.na.byref(M))[3]
[1] 0.89
> system.time(fill.na.byref(M))[3]
[1] 0.89
> 

Since no working memory is required (afaik), the garbage collector isn't
involved and we get consistent, fast timings.

It's possible there is something wrong with my setup/config which means the
LOCF method takes longer and fails. If anyone can point me in the right
direction (changing vcell options?) I'm happy to try that.

-----Original Message-----
From: Gabor Grothendieck [mailto:ggrothendieck at myway.com] 
Sent: 14 September 2004 02:51
To: edd at debian.org; ajayshah at mayin.org
Cc: r-sig-finance at stat.math.ethz.ch
Subject: Re: [R-sig-finance] How can I do this better? (Filling in last
traded price for NA)

Dirk,

I was not aware that locf was in "its" but was aware of Tony's solution as
we had discussed both it and a forerunner of the solution in my last post at
that time.  See the thread beginning with:

https://stat.ethz.ch/pipermail/r-help/2003-November/040603.html

The two solutions are the same except for the inner portion which calculates
the indices of the LOCF of a logical vector.  Simplifying slightly:

most.recent.1 <- function(L) {
	if (length(L) > 1) L[1] <- TRUE
	w <- which(c(L,T))
	rep(w[-length(w)], diff(w))
}

most.recent.2 <- function(L) {
	c(NA, which(L))[cumsum(L)+1]
}

so the key operations are which, rep and diff in #1 and which, [ and cumsum
in #2.  This suggests they are about equal in speed and, in fact, some
timings I did fluctuated from run to run but in general they seemed to run
at about the same speed with #1 running faster sometimes and #2 running
faster other times (even though the same input was used on every run).
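
For comparison with the C approach earlier in this thread, the same index
calculation can be sketched as a single pass in C. This is only an
illustration, not code from either post: the function name
most_recent_index is made up, and using 0 to mean "no TRUE seen yet" is an
assumption of the sketch (#2 above pads with NA and #1 forces L[1] to TRUE
for that case).

```c
#include <stddef.h>  /* size_t */

/*
 * For each position i, write to out[i] the 1-based index of the most
 * recent nonzero flag at or before i, or 0 if there is none yet.
 * 1-based indices are used to match R's convention.
 */
void most_recent_index(const int *flags, size_t n, size_t *out)
{
    size_t last = 0;                  /* 0 = no TRUE seen so far */
    for (size_t i = 0; i < n; i++) {
        if (flags[i]) last = i + 1;   /* remember latest TRUE position */
        out[i] = last;
    }
}
```

For the flags TRUE, FALSE, TRUE, FALSE this gives 1, 1, 3, 3, the same
indices the R functions are meant to produce.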

_______________________________________________
R-sig-finance at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-finance