[R] R Newbie, please help!

Fri Jun 4 08:40:07 CEST 2010

Hey Jeff,

I have a few ideas.  Each has different requirements, so to help
you choose, I benchmarked them.

###START###

##Basic data
> test <- data.frame(totret=rnorm(10^7), id=rep(1:10^4, each=10^3), time=rep(c(1, rep(0, 999)), 10^4))

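##Option 1: probably the most general, but also the slowest by far.
##The idea is to do the calculation separately for each stock/ID and
##prepend an NA with c().  Note that by() returns the groups in sorted
##ID order, so the rows must already be sorted by ID to line up.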

> system.time(test[,"dailyreturns"] <- unlist(by(test[,"totret"], test[,"id"], function(x) {c(NA, x[-1]/x[-length(x)])})), gcFirst=TRUE)
   user  system elapsed
  49.11    0.42   49.86

##Option 2: Assumes that you have the same number of measurements for
##each stock/ID, so you can just assign an NA every nth row.
##This is fairly fast.

> system.time(test[-1,"dailyreturns"] <- test[-1,"totret"]/test[-nrow(test),"totret"], gcFirst=TRUE)
   user  system elapsed
   1.11    0.21    1.31
> system.time(test[seq(1, 10^7, by=10^3),"dailyreturns"] <- NA, gcFirst=TRUE)
   user  system elapsed
   0.39    0.04    0.42

##Option 3: Assumes that you have some variable ("time" in my little
##test data) that indicates when each stock/ID has its first
##measurement.  In the example, the first measurement gets a 1 and
##subsequent ones a 0.  So we just assign NA in "dailyreturns" every
##time the "time" column has a 1.  Again, a big assumption, but fairly
##quick.

> system.time(test[-1,"dailyreturns"] <- test[-1,"totret"]/test[-nrow(test),"totret"], gcFirst=TRUE)
   user  system elapsed
   1.06    0.17    1.25
> system.time(test[which(test[,"time"]==1),"dailyreturns"] <- NA, gcFirst=TRUE)
   user  system elapsed
   0.46    0.09    0.55

###END###
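
To make it easier to check what the options compute, here is a small
sketch on a made-up six-row data frame (two IDs, three rows each).  The
name "small" and the values are invented for illustration, chosen so
the ratios are easy to verify by hand.  For the comparison I used ave()
rather than by(); that is a substitution on my part, not one of the
benchmarked options.

```r
## Toy data: 2 stocks, 3 observations each (values are invented)
small <- data.frame(totret = c(1, 2, 4, 10, 5, 20),
                    id     = rep(1:2, each = 3))

## Shifted division: each row divided by the row before it
small[-1, "dailyreturns"] <- small[-1, "totret"] / small[-nrow(small), "totret"]

## The first row of each ID has no earlier row for the same stock,
## so overwrite it with NA (every 3rd row here, as in Option 2)
small[seq(1, nrow(small), by = 3), "dailyreturns"] <- NA

## Per-ID calculation for comparison, in the spirit of Option 1;
## ave() applies the function within each ID and keeps row order
per_id <- ave(small[, "totret"], small[, "id"],
              FUN = function(x) c(NA, x[-1] / x[-length(x)]))

print(small)
```

Both routes should put NA in rows 1 and 4 and the within-stock ratios
everywhere else.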

I really feel like there should be a faster way that is also more
general, but it is late and I am not coming up with any better ideas
at the moment.  Perhaps somehow finding the first instance of a
stock/ID?  Anyway, this was simulated on 10 million rows, so maybe
by() works plenty fast for you.
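
One way to sketch the "first instance of a stock/ID" idea is with
duplicated(), which flags every repeat of a value, so its negation
marks the first row of each ID.  This is untested on the full 10
million rows and not benchmarked; it assumes the rows for each ID are
contiguous and already in time order.  The names "test2" and
"first_row" are made up for the example, and the IDs deliberately have
unequal numbers of rows.

```r
## Small stand-in data; IDs 1, 2, 3 have 3, 2 and 4 rows (values invented)
test2 <- data.frame(totret = c(1, 2, 4, 10, 5, 3, 6, 12, 24),
                    id     = c(1, 1, 1, 2, 2, 3, 3, 3, 3))

## Shifted division over the whole column, as in Options 2 and 3
test2[-1, "dailyreturns"] <- test2[-1, "totret"] / test2[-nrow(test2), "totret"]

## !duplicated(id) is TRUE exactly at the first occurrence of each ID
first_row <- !duplicated(test2[, "id"])

## Blank out the returns that would otherwise span two different stocks
test2[first_row, "dailyreturns"] <- NA

print(test2)
```

Unlike Option 2, this does not need every ID to have the same number of
rows, and unlike Option 3, it does not need a separate indicator
column.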

Josh

On Thu, Jun 3, 2010 at 10:20 PM, Jeff08 <jefferyding at gmail.com> wrote:
>
> Hey Josh,
>
> Thanks for the quick response!
>
> I guess I have to switch from the Java mindset to the matrix/vector mindset
> of R.
>
> Your code worked very well, but I just have one problem:
>
> Essentially I have a time series of stock A, followed by a time series of
> stock B, etc.
> So there are break points in the data (the points where it switches stocks
> have incorrect returns, and should be NA at t=0 for each stock)
>
> Is there an easy way to account for this in R? What I was thinking of is if
> there is a way to make a filter rule. Such as if the ID of the row matches
> Stock A, then perform this.
>
>>>"Hello Jeff,
>
> Try this:
>
> test <- data.frame(totret=rnorm(10^7)) #create some sample data
> test[-1,"dailyreturn"] <- test[-1,"totret"]/test[-nrow(test),"totret"]
>
> The general idea is to take the column "totret" excluding the first row,
> divided by "totret" excluding the last row.  This gives in effect t+1
> (since t is now shorter)/t
>
> I assigned the result to a new column "dailyreturn".  For 10^7 rows,
> it took 1.92 seconds on my system."

-- 
Joshua Wiley
Senior in Psychology
University of California, Riverside
http://www.joshuawiley.com/