[R-sig-finance] How can I do this better? (Filling in last traded price for NA)

Tue Sep 14 12:11:52 CEST 2004

Yes, that fixes the memory problem for me. It's running in a consistent time
now, about 10 times slower than the C method.

I'm wondering why you think 190MB is large? memory.limit() states 1200MB so
the data is 1/6 of the memory available to R. The box has 2GB main memory.

The 'ref' package looks good. It is intended for the challenges you refer
to, isn't it?

-----Original Message-----
From: Gabor Grothendieck [mailto:ggrothendieck at myway.com] 
Sent: 14 September 2004 09:56
To: mdowle at concordiafunds.com; ggrothendieck at myway.com; edd at debian.org;
ajayshah at mayin.org
Cc: r-sig-finance at stat.math.ethz.ch
Subject: RE: [R-sig-finance] How can I do this better? (Filling in last
traded price for NA)

The memory error is due to apply rather than LOCF itself.
You can do it largely in place in R, thereby eliminating the
memory errors, by replacing the apply with a for loop:

system.time({ for(j in 1:ncol(M)) M[,j] <<- LOCF(M[,j]) })

Of course, if your data really is this large you are not only going to
have challenges here but it will also be problematic doing any
subsequent analyses that are not trivial.
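
A minimal sketch of the column-by-column idea above, on a toy matrix. The LOCF function itself is not shown in this message, so `locf_vec` below is a hypothetical stand-in, not the original definition. The point is only that each pass of the loop needs a temporary one column long, whereas apply assembles a complete second result.

```r
# Hypothetical stand-in for the thread's LOCF(); not the original definition.
locf_vec <- function(x) {
  ok  <- !is.na(x)                          # TRUE where a value was observed
  idx <- c(NA, which(ok))[cumsum(ok) + 1]   # index of the latest observed value
  x[idx]                                    # leading NAs stay NA
}

# Toy data: two columns of four rows, with gaps.
M <- matrix(c(1, NA, 3, NA,
              NA, 5, NA, NA), nrow = 4)

# Fill one column at a time; each temporary is only one column long.
for (j in seq_len(ncol(M))) M[, j] <- locf_vec(M[, j])

M
```

At top level a plain `<-` is enough here; `<<-` would only be needed if the loop ran inside a function and M lived in an enclosing environment.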

Date:   	Tue, 14 Sep 2004 04:31:12 +0100
From:   	Matthew Dowle <mdowle at concordiafunds.com>
To:   	'ggrothendieck at myway.com' <ggrothendieck at myway.com>,
<edd at debian.org>, <ajayshah at mayin.org>
Cc:   	<r-sig-finance at stat.math.ethz.ch>
Subject:   	RE: [R-sig-finance] How can I do this better? (Filling in
last traded price for NA)

How about this in C? With the example data (190MB), 11+ secs (variable) is
reduced to under 1 second consistently. Even when LOCF stops working, the C
method carries on consistently since it requires no working memory. There
would be some more work required to make it 'safe'. If a copy is required, a
copy can be taken first (or save(,compress=TRUE)'d out).

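EXPORT void fillna (double *ans, int *rows, int *cols)
{
    /* Fill NAs in place, column by column (R matrices are column-major). */
    int r=0, c=0;
    double last;
    for (c=0; c<*cols; c++) {
        last = *ans++;    /* first element of the column; may itself be NA */
        for (r=1; r<*rows; r++) {
            if (!ISNA(*ans)) last = *ans;    /* remember the latest non-NA value */
            *ans++ = last;                   /* overwrite with the last value seen */
        }
    }
}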

fill.na.byref = function(m)
{
    if (!is.matrix(m) || storage.mode(m)!="double") {
        stop("input must be a matrix, storage mode double")
    }
    invisible(.C("fillna", m, as.integer(nrow(m)), as.integer(ncol(m)),
                 DUP=FALSE, NAOK=TRUE))
}
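
As a cross-check on the C routine, the same single pass can be sketched in plain R. `fill_na_ref` is a hypothetical name, and unlike the C version it returns a filled copy rather than modifying its argument, so it is only useful for verifying results on small inputs and is likely to be much slower.

```r
# Hypothetical reference implementation of the same column-wise pass.
# Note: is.na() is TRUE for NaN as well as NA, whereas ISNA() in C tests NA only.
fill_na_ref <- function(m) {
  stopifnot(is.matrix(m))
  for (j in seq_len(ncol(m))) {
    last <- NA                                  # nothing seen yet in this column
    for (i in seq_len(nrow(m))) {
      if (!is.na(m[i, j])) last <- m[i, j]      # remember the latest observed value
      m[i, j] <- last                           # carry it forward
    }
  }
  m
}

m <- matrix(c(NA, 2, NA,
              4, 5, NA), nrow = 3)
filled_ref <- fill_na_ref(m)
filled_ref
```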

For example:

> M = matrix(as.double(sample(100)), nrow=10)
> M[sample(100,50)] = NA
> M
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]   88   NA   NA   NA   27   57   NA   NA   NA    99
 [2,]   72   NA   83   NA   24   54   20   32   NA    NA
 [3,]   60   85   33   NA   NA   61    2   NA   50    NA
 [4,]   91   NA    8    5   NA   51   NA   39   NA    45
 [5,]   NA   93   21   NA   NA   48   NA   69   12    56
 [6,]   NA   NA   10   NA   NA   NA   14   53   NA    NA
 [7,]   NA   15   95   NA   43   NA   34   NA   75    90
 [8,]   NA   NA   NA   NA   37   NA   19   NA    7    96
 [9,]   81   NA   NA   NA   NA   89   36   NA   NA    87
[10,]   NA   77   NA   11   NA   18   NA   28   74    NA
> fill.na.byref(M)
> M
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]   88   NA   NA   NA   27   57   NA   NA   NA    99
 [2,]   72   NA   83   NA   24   54   20   32   NA    99
 [3,]   60   85   33   NA   24   61    2   32   50    99
 [4,]   91   85    8    5   24   51    2   39   50    45
 [5,]   91   93   21    5   24   48    2   69   12    56
 [6,]   91   93   10    5   24   48   14   53   12    56
 [7,]   91   15   95    5   43   48   34   53   75    90
 [8,]   91   15   95    5   37   48   19   53    7    96
 [9,]   81   15   95    5   37   89   36   53    7    87
[10,]   81   77   95   11   37   18   36   28   74    87
>

With some 'large' data (190MB), LOCF works once (11 secs), then runs out of
memory:

> M = matrix(as.double(sample(5000*5000)), nrow=5000) 
> object.size(M)/1024^2
[1] 190.7350
> M[sample(5000*5000, 5000)] = NA
> system.time(filled <<- apply(M, 2, LOCF))[3]
[1] 11.33
> system.time(filled <<- apply(M, 2, LOCF))[3]
Error: cannot allocate vector of size 195312 Kb
In addition: Warning message:
Reached total allocation of 1200Mb: see help(memory.size)
Timing stopped at: 10.28 0.25 10.67 NA NA
> gc()
            used  (Mb) gc trigger  (Mb)
Ncells    427770  11.5     741108  19.8
Vcells 100105874 763.8  125499641 957.5
> system.time(filled <<- apply(M, 2, LOCF))[3]
Error: cannot allocate vector of size 195312 Kb
In addition: Warning message:
Reached total allocation of 1200Mb: see help(memory.size)
Timing stopped at: 0 0 0 NA NA
>

Continuing with the same session, trying the C function:

> identical(filled, M)
[1] FALSE
> system.time(fill.na.byref(M))[3]
[1] 0.92
> identical(filled, M)
[1] TRUE

Running several times (it shouldn't matter that the NAs are already filled,
since the same work has to be done):

> system.time(fill.na.byref(M))[3]
[1] 0.9
> system.time(fill.na.byref(M))[3]
[1] 0.89
> system.time(fill.na.byref(M))[3]
[1] 0.87
> system.time(fill.na.byref(M))[3]
[1] 0.89
> system.time(fill.na.byref(M))[3]
[1] 0.89
>

Since no working memory is required (afaik), the garbage collector isn't
involved and we get consistent, fast timings.

It's possible there is something wrong with my setup/config which means the
LOCF method takes longer and fails. If anyone can point me in the right
direction (changing vcell options?) I'm happy to try it.

-----Original Message-----
From: Gabor Grothendieck [mailto:ggrothendieck at myway.com]
Sent: 14 September 2004 02:51
To: edd at debian.org; ajayshah at mayin.org
Cc: r-sig-finance at stat.math.ethz.ch
Subject: Re: [R-sig-finance] How can I do this better? (Filling in last
traded price for NA)

Dirk,

I was not aware that locf was in "its" but was aware of Tony's solution as
we had discussed both it and a forerunner of the solution in my last post at
that time. See the thread beginning with:

https://stat.ethz.ch/pipermail/r-help/2003-November/040603.html

The two solutions are the same except for the inner portion which calculates
the indices of the LOCF of a logical vector. Simplifying slightly:

most.recent.1 <- function(L) {
     # treat the first element as observed so every position has an index
     if (length(L) > 0) L[1] <- TRUE
     w <- which(c(L,TRUE))
     rep(w[-length(w)], diff(w))
}

most.recent.2 <- function(L) {
     # NA is prepended to the indices, so leading FALSEs map to NA
     c(NA,which(L))[cumsum(L)+1]
}

so the key operations are which, rep and diff in #1 and which, [ and cumsum
in #2. This suggests they are about equal in speed and, in fact, some
timings I did fluctuated from run to run but in general they seemed to run
at about the same speed with #1 running faster sometimes and #2 running
faster other times (even though the same input was used on every run).
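
To make the comparison concrete, here is a small sketch that runs both index computations on one vector and checks that they give the same fill. The names `idx_rep` and `idx_cumsum` are illustrative only, and `idx_cumsum` is written as `c(NA, which(L))[cumsum(L) + 1]`, which is assumed here to be the intended form of #2.

```r
# Approach #1: which / rep / diff (the first element is treated as observed).
idx_rep <- function(L) {
  if (length(L) > 0) L[1] <- TRUE
  w <- which(c(L, TRUE))            # observed positions plus an end sentinel
  rep(w[-length(w)], diff(w))       # repeat each position up to the next one
}

# Approach #2: which / [ / cumsum (leading gaps map to NA).
idx_cumsum <- function(L) c(NA, which(L))[cumsum(L) + 1]

x <- c(NA, 10, NA, NA, 20, NA)
L <- !is.na(x)                      # TRUE where a price was observed

filled_1 <- x[idx_rep(L)]
filled_2 <- x[idx_cumsum(L)]
identical(filled_1, filled_2)
```

The index vectors differ only for leading gaps (position 1 in the first case, NA in the second); since the value at those positions is NA anyway, the filled results agree.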