[R-sig-finance] SUMMARY: Copying previous observation for NA values (LOCF)
Ajay Shah
ajayshah at mayin.org
Mon Sep 27 07:24:31 CEST 2004
A few weeks ago, I asked about situations like:
> small
           p.nifty p.inrusd p.infosys
1996-11-06  866.57    35.72     21.25
1996-11-07  874.89    35.73     21.56
1996-11-08  884.64    35.70     21.69
1996-11-10  880.36       NA     21.78
1996-11-11  891.27       NA     21.41
1996-11-13  896.84    35.73     21.72
1996-11-14  889.39    35.81     21.81
These happen a lot in finance, where we want to carry the last traded
price (LTP) forward to the next day whenever that day's price is NA.
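A minimal base-R sketch of the idea on a single column, assuming a
plain numeric vector rather than an its object. The helper name
locf_vec is hypothetical and not part of any package. Leading NAs,
which have no earlier observation to copy, are left alone.

```r
# Hypothetical helper: carry the last non-NA value forward in a vector.
locf_vec <- function(x) {
  ok   <- !is.na(x)
  # For each position, the index of the most recent non-NA value
  # (0 if no value has been observed yet).
  idx  <- cummax(seq_along(x) * ok)
  fill <- !ok & idx > 0          # NAs that do have a predecessor
  x[fill] <- x[idx[fill]]
  x
}

# The p.inrusd column from the example above.
p.inrusd <- c(35.72, 35.73, 35.70, NA, NA, 35.73, 35.81)
locf_vec(p.inrusd)
```

With this input the two NAs on 1996-11-10 and 1996-11-11 both become
35.70, the last price seen on 1996-11-08.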
There was a series of extremely valuable responses. See
https://stat.ethz.ch/pipermail/r-sig-finance/2004q3/thread.html and
look for "How can I do this better? (Filling in last traded price for
NA)". I did some reading and thinking in putting them together.
Here's a quick summary. I have the slowest machine in the world: an
IBM X20 notebook with a 500 MHz Celeron, running Linux. My problem
size was 2178 rows by 3 columns. For this, I have clear timings for 4
alternatives:
------------------------------------------------------------
Version                                       Time (seconds)
------------------------------------------------------------
My dumb loops solution                               250
Patrick Burns' 1st solution                           13.71
Patrick Burns' 2nd solution                            1.16
The locf() function of ITS (pointed out by Dirk)       0.32
------------------------------------------------------------
locf() rocks! It's great and all of us should use it. ITS is
documented in a strange way: when you say library(help=its) you don't
get a list of the functions, so locf() is hidden inside ?itsInterp. It
shouldn't be tucked away like this. Makes me wonder what else about
its is unknown to me!
There were also nice solutions proposed by Gabor, Richard Pugh, and
John Gavin (look at this thread in the mailing list archives). As far
as I could see, they were not written with ITS objects in mind. As an
example, Richard Pugh's retain() function clearly works for his
example --
> apply(vv, 2, retain)
     ts(v1) ts(v2, start = 2)
[1,]      1                NA
[2,]      1                 5
[3,]      1                 5
[4,]      2                 3
[5,]      3                 2
[6,]      3                 2
[7,]      5                 2
[8,]      5                 1
but not when fed an ITS object --
> apply(small, 3, retain)
Error in apply(small, 3, retain) : subscript out of bounds
It would be nice if we could somehow have a generic locf() method
which worked for data frames, matrices and ITS objects.
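One way such a generic could look, as a rough sketch only: an S3
generic with methods for vectors, matrices and data frames. The name
locf_any and all three methods are hypothetical (chosen so as not to
clash with the locf() in its). A method for ITS objects is
deliberately left out; it would presumably just hand over to the
package's own locf(), but I have not tested that.

```r
# Hypothetical S3 generic, not part of its or any other package.
locf_any <- function(x, ...) UseMethod("locf_any")

# Vectors: replace each NA by the most recent non-NA value before it.
locf_any.default <- function(x, ...) {
  ok   <- !is.na(x)
  idx  <- cummax(seq_along(x) * ok)   # index of last observed value
  fill <- !ok & idx > 0               # leading NAs stay NA
  x[fill] <- x[idx[fill]]
  x
}

# Matrices: work column by column, keeping dim and dimnames.
locf_any.matrix <- function(x, ...) {
  x[] <- apply(x, 2, locf_any.default)
  x
}

# Data frames: dispatch again on every column.
locf_any.data.frame <- function(x, ...) {
  x[] <- lapply(x, locf_any)
  x
}

m  <- matrix(c(1, NA, 3, NA, NA, 6), ncol = 2)
df <- data.frame(a = c(1, NA, 3), b = c(NA, 2, NA))
locf_any(m)
locf_any(df)
```

Dispatching on "matrix" relies on the implicit class that R gives to
any object with a two-element dim attribute, so a bare numeric matrix
reaches locf_any.matrix without needing an explicit class.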
The solutions on the mailing list that don't work with ITS still have
their role in the many situations where one's data is not an ITS
object, and it's great that all this knowledge is now in the mailing
list archive, where Google can find it.
Finally, Matthew Dowle had a solution in C which I didn't experiment
with, since I don't yet know how to marry C and R.
R code that puts together the above 4 solutions and compares their
performance is given below. It uses 3 data files. You can pick up
these data files at /home/ajayshah/public_html/datafiles.tar.bz2
but I won't leave them there indefinitely.
------------------------------------------------------------ snip snip
library(its)
infosys.its <- its(readcsvIts(filename="infosys.text", header=FALSE, sep=",",
                              col.names=c("date", "p.infosys"),
                              informat=its.format("%m/%d/%Y"),
                              outformat=its.format("%Y-%m-%d")))
inrusd.its <- its(readcsvIts(filename="inrusd.text", header=FALSE, sep="|",
                             col.names=c("date", "p.inrusd"),
                             informat=its.format("%d %b %Y"),
                             outformat=its.format("%Y-%m-%d")))
nifty.its <- its(readcsvIts(filename="nifty.text", header=FALSE, sep="|",
                            col.names=c("date", "p.nifty"),
                            informat=its.format("%d %b %Y"),
                            outformat=its.format("%Y-%m-%d")))

massive <- union(nifty.its, union(inrusd.its, infosys.its))
small <- massive[215:230,]
# column 1 is missing until 165. it starts from 166.
# column 2 is missing until 11, it starts from 12.
# column 3 is there from 1 onwards.
# Dumbest solution - my starting point --
loops.solution <- function(X) {
  for (i in 2:nrow(X)) {
    for (j in 1:ncol(X)) {
      if (is.na(X[i,j])) {
        X[i,j] <- X[i-1,j]
      }
    }
  }
  return(X)
}
# First solution proposed by Patrick Burns --
pburns <- function(X) {
  mass.na <- is.na(X)
  for (i in 2:nrow(X)) {
    for (j in 1:ncol(X)) {
      if (mass.na[i,j]) {
        X[i,j] <- X[i-1, j]
      }
    }
  }
  return(X)
}
# Second solution proposed by Patrick Burns --
pburns.columnatatime <- function(X) {
  subfun.miss.use <- function(x) {
    missing <- which(is.na(x))    # Makes a vector of indexes of missing data
    return(missing[missing != seq(along=missing)])  # Don't understand this.
  }
  for (j in 1:ncol(X)) {
    while (length(this.mis <- subfun.miss.use(X[, j]))) {
      X[this.mis, j] <- X[this.mis-1, j]
    }   # Does this correctly handle situations with row=1?
  }     # For those, we shouldn't copy from row=0.
  return(X)
}
# If you want to recreate it --
system.time(S0 <- loops.solution(massive)) # 249.15 0.19 258.87 0.00 0.00
# save(S0, file="quickly.rda")
# If you want to just read it in to save time --
# load("quickly.rda")
system.time(S1 <- pburns(massive)) # 13.71 0.00 14.00 0.00 0.00
which(S0!=S1)
system.time(S2 <- pburns.columnatatime(massive)) # 1.16 0.00 1.16 0.00 0.00
which(S0!=S2)
# The 3rd solution is to use the locf() function from its, as suggested
# by Dirk. "It's in its but its documentation isn't". :-)
# "locf" = "last observation carried forward".
system.time(S3 <- locf(massive)) # 0.32 0.00 0.32 0.00 0.00
which(S0!=S3)
------------------------------------------------------------ snip snip
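On the two questions left in the comments of pburns.columnatatime():
as far as I can tell, the filter missing != seq(along=missing) is what
protects row 1. The sketch below replays the inner loop on a toy
vector in plain base R. The data and the names fill_column and na.pos
are made up for illustration. A run of NAs at the very top sits at
positions 1..k, so its k-th entry equals k and is dropped; any NA
after a real observation has a position larger than its rank among
the NAs and is kept. So this.mis never contains 1 and this.mis-1 is
never 0.

```r
# Hypothetical stand-alone replay of the inner while loop for one column.
fill_column <- function(x) {
  passes <- 0
  repeat {
    na.pos   <- which(is.na(x))                   # positions of all NAs
    this.mis <- na.pos[na.pos != seq(along = na.pos)]  # drop leading run
    if (length(this.mis) == 0) break
    x[this.mis] <- x[this.mis - 1]                # copy from the row above
    passes <- passes + 1
  }
  list(filled = x, passes = passes)
}

x <- c(NA, NA, 5, NA, NA, 7, NA)
which(is.na(x))        # 1 2 4 5 7; ranks are 1 2 3 4 5, so 4 5 7 are kept
res <- fill_column(x)
res$filled             # NA NA 5 5 5 7 7
res$passes             # 2
```

Each pass moves values down by one row only, so a run of n consecutive
NAs needs n passes. That is still far fewer iterations than visiting
every cell, which fits the timings above.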
--
Ajay Shah                                   Consultant
ajayshah at mayin.org                        Department of Economic Affairs
http://www.mayin.org/ajayshah               Ministry of Finance, New Delhi