[R-sig-finance] How can I do this better? (Filling in last traded price for NA)
Matthew Dowle
mdowle at concordiafunds.com
Tue Sep 14 05:31:12 CEST 2004
How about this in C? With the example data (190MB), the run time drops from
11+ secs (variable) to consistently under 1 second. Even when LOCF fails for
lack of memory, the C method carries on, since it requires no working memory.
Some more work would be needed to make it 'safe'. If a copy is required, one
can be taken first (or written out with save(,compress=TRUE)).
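#include <R.h>   /* provides ISNA */

/* EXPORT is assumed to be defined by the build (e.g. a DLL export macro) */
EXPORT void fillna (double *ans, int *rows, int *cols)
{
    int r, c;
    double last;
    if (*rows < 1) return;                  /* nothing to fill; avoids reading past the data */
    for (c=0; c<*cols; c++) {
        last = *ans++;                      /* first value of the column, possibly NA */
        for (r=1; r<*rows; r++) {
            if (!ISNA(*ans)) last = *ans;   /* remember the latest observed value */
            *ans++ = last;                  /* overwrite in place */
        }
    }
}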
fill.na.byref = function(m)
{
    if (!is.matrix(m) || storage.mode(m)!="double") {
        stop("input must be a matrix, storage mode double")
    }
    invisible(.C("fillna", m, as.integer(nrow(m)), as.integer(ncol(m)),
                 DUP=FALSE, NAOK=TRUE))
}
For example:
> M = matrix(as.double(sample(100)), nrow=10)
> M[sample(100,50)] = NA
> M
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]   88   NA   NA   NA   27   57   NA   NA   NA    99
 [2,]   72   NA   83   NA   24   54   20   32   NA    NA
 [3,]   60   85   33   NA   NA   61    2   NA   50    NA
 [4,]   91   NA    8    5   NA   51   NA   39   NA    45
 [5,]   NA   93   21   NA   NA   48   NA   69   12    56
 [6,]   NA   NA   10   NA   NA   NA   14   53   NA    NA
 [7,]   NA   15   95   NA   43   NA   34   NA   75    90
 [8,]   NA   NA   NA   NA   37   NA   19   NA    7    96
 [9,]   81   NA   NA   NA   NA   89   36   NA   NA    87
[10,]   NA   77   NA   11   NA   18   NA   28   74    NA
> fill.na.byref(M)
> M
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]   88   NA   NA   NA   27   57   NA   NA   NA    99
 [2,]   72   NA   83   NA   24   54   20   32   NA    99
 [3,]   60   85   33   NA   24   61    2   32   50    99
 [4,]   91   85    8    5   24   51    2   39   50    45
 [5,]   91   93   21    5   24   48    2   69   12    56
 [6,]   91   93   10    5   24   48   14   53   12    56
 [7,]   91   15   95    5   43   48   34   53   75    90
 [8,]   91   15   95    5   37   48   19   53    7    96
 [9,]   81   15   95    5   37   89   36   53    7    87
[10,]   81   77   95   11   37   18   36   28   74    87
>
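For anyone who wants to try the fill logic outside R, here is a minimal
standalone sketch of the same column-wise idea. It is not the fillna routine
above: the name locf_inplace is hypothetical, and isnan() from <math.h> stands
in for R's ISNA(), so it treats every NaN as missing. As in the example output,
leading missing values in a column are left as they are.

```c
#include <math.h>
#include <stddef.h>

/* In-place last-observation-carried-forward on a column-major matrix.
   Hypothetical helper: isnan() is an assumption standing in for R's ISNA(). */
void locf_inplace(double *x, size_t rows, size_t cols)
{
    for (size_t c = 0; c < cols; c++) {
        double *col = x + c * rows;              /* start of column c */
        for (size_t r = 1; r < rows; r++) {
            /* col[r - 1] has already been filled, so it holds the last
               observed value (or NaN if nothing has been observed yet) */
            if (isnan(col[r])) col[r] = col[r - 1];
        }
    }
}
```

The only state is a pointer and two counters, which is why no extra memory
is needed regardless of the matrix size.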
With some 'large' data (190MB), LOCF works once (11 secs) then runs out of
memory :
> M = matrix(as.double(sample(5000*5000)), nrow=5000)
> object.size(M)/1024^2
[1] 190.7350
> M[sample(5000*5000, 5000)] = NA
> system.time(filled <<- apply(M, 2, LOCF))[3]
[1] 11.33
> system.time(filled <<- apply(M, 2, LOCF))[3]
Error: cannot allocate vector of size 195312 Kb
In addition: Warning message:
Reached total allocation of 1200Mb: see help(memory.size)
Timing stopped at: 10.28 0.25 10.67 NA NA
> gc()
            used  (Mb) gc trigger  (Mb)
Ncells    427770  11.5     741108  19.8
Vcells 100105874 763.8  125499641 957.5
> system.time(filled <<- apply(M, 2, LOCF))[3]
Error: cannot allocate vector of size 195312 Kb
In addition: Warning message:
Reached total allocation of 1200Mb: see help(memory.size)
Timing stopped at: 0 0 0 NA NA
>
Continuing with the same session, trying the C function :
> identical(filled, M)
[1] FALSE
> system.time(fill.na.byref(M))[3]
[1] 0.92
> identical(filled, M)
[1] TRUE
Running several times (shouldn't matter that the NAs are already filled
since the same work has to be done) :
> system.time(fill.na.byref(M))[3]
[1] 0.9
> system.time(fill.na.byref(M))[3]
[1] 0.89
> system.time(fill.na.byref(M))[3]
[1] 0.87
> system.time(fill.na.byref(M))[3]
[1] 0.89
> system.time(fill.na.byref(M))[3]
[1] 0.89
>
Since no working memory is required (afaik), the garbage collector isn't
involved and we get consistently fast timings.
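To make the memory point concrete, here is a rough sketch of what a copying
variant might look like in plain C. The name locf_copy, its return codes, and
the use of isnan() in place of R's ISNA() are all assumptions for
illustration. The caller has to supply a second buffer the same size as the
input (another 190MB for the 5000 x 5000 example), which is exactly the
working memory the by-reference version avoids. It also includes the kind of
argument checks a 'safe' version would need.

```c
#include <math.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical copying variant: fills into dst and leaves src untouched.
   Both buffers are column-major with rows * cols doubles.
   Returns 0 on success, -1 on invalid arguments. */
int locf_copy(const double *src, double *dst, size_t rows, size_t cols)
{
    if (src == NULL || dst == NULL) return -1;
    /* reject dimensions whose byte size would overflow size_t */
    if (rows != 0 && cols > SIZE_MAX / sizeof(double) / rows) return -1;

    for (size_t c = 0; c < cols; c++) {
        const double *s = src + c * rows;
        double *d = dst + c * rows;
        double last = NAN;                   /* nothing observed yet in this column */
        for (size_t r = 0; r < rows; r++) {
            if (!isnan(s[r])) last = s[r];   /* remember the latest observed value */
            d[r] = last;
        }
    }
    return 0;
}
```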
It's possible there is something wrong with my setup/config that makes the
LOCF method take longer and fail. If anyone can point me in the right
direction (changing vcell options?), I'm happy to try it.
-----Original Message-----
From: Gabor Grothendieck [mailto:ggrothendieck at myway.com]
Sent: 14 September 2004 02:51
To: edd at debian.org; ajayshah at mayin.org
Cc: r-sig-finance at stat.math.ethz.ch
Subject: Re: [R-sig-finance] How can I do this better? (Filling in last
traded price for NA)
Dirk,
I was not aware that locf was in "its" but was aware of Tony's solution as
we had discussed both it and a forerunner of the solution in my last post at
that time. See the thread beginning with:
https://stat.ethz.ch/pipermail/r-help/2003-November/040603.html
The two solutions are the same except for the inner portion which calculates
the indices of the LOCF of a logical vector. Simplifying slightly:
most.recent.1 <- function(L) {
    if (length(L) > 1) L[1] <- TRUE
    w <- which(c(L,TRUE))
    rep(w[-length(w)], diff(w))
}
most.recent.2 <- function(L) {
    c(NA,which(L))[cumsum(L)+1]
}
so the key operations are which, rep and diff in #1 and which, [ and cumsum
in #2. This suggests they are about equal in speed and, in fact, some
timings I did fluctuated from run to run but in general they seemed to run
at about the same speed with #1 running faster sometimes and #2 running
faster other times (even though the same input was used on every run).
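For comparison with the C approach discussed in this thread, the index
calculation can also be sketched as a single pass in C. This is only an
illustration: the name most_recent is hypothetical, flags are plain ints
rather than R logicals, and 0 is used where the R versions would give NA
(no TRUE seen yet).

```c
#include <stddef.h>

/* For each position i, writes the 1-based index of the most recent nonzero
   flag at or before i, or 0 when no flag has been seen yet.
   Hypothetical helper; idx must have room for n entries. */
void most_recent(const int *flag, size_t n, size_t *idx)
{
    size_t last = 0;                     /* 0 means "nothing observed so far" */
    for (size_t i = 0; i < n; i++) {
        if (flag[i]) last = i + 1;       /* convert to a 1-based index, as in R */
        idx[i] = last;
    }
}
```

Indexing the data vector with the resulting positions then gives the filled
values, which is the step the outer portion of both R solutions performs.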