[R] Still help needed on embeded regression

Wed May 3 05:57:18 CEST 2006

For my perticular purpose (run regression on trailing
60 number b on 60 a), I have writen a short function
for it because no available function (runmean,
rollmean) can take care of NA as I wanted.  I
basically want the regression ignore the NA row.  I
have read a little bit source code of "runmean" and
developed my code.  I basically put zero for every NA
row, as well as keep tracking the real number of
regression for the least square fit (dummy).   The
whole program takes less than a about 0.2 seconds for
the whole 144,000 data. I was amazed when I think
about the 2+ hours for my original code.  My friend
with SAS needs about 1 minutes, he used to beat me by
100 times.  Now I beat him by 100 times!

Here is the code. I am wondering if it will be useful
for others.  Actually I think this kind of calculation
is fairly common at least in finance for things like
"stock beta".  Does it worht to wrap it and put it
somewhere on web?  Thanks. 

window.sum<-function(x,k){
na.row<-is.na(x);x[na.row]<-0;
temp<-c(sum(x[1:k]),diff(x,k)); #print(temp)
cumsum(temp)
}

##################################
# function to calculate linear regression of y on x
for the lag k,
# such as stock beta for the trailing k months. 
# It will return a list of two vectors with length as
length(x)-k+1;
# ( intercept,coefficency)

lag.regression<-function(x,y,k){
l<-length(x);

###########################
# correctly taken care of the NA value, we basically
set all value in a row as 0
# if one of them is NA.  By that we we find the
correct answer for regression 
#  with this row basically takes no effect. 
na.row<-is.na(x)|is.na(y);
x[na.row]<-0; y[na.row]<-0;
dummy<-rep(1,l); dummy[na.row]<-0;
###############################

mxy<-window.sum(x*y,k);mx<-window.sum(x,k);
mxx<-window.sum(x*x,k);my<-window.sum(y,k);
num<-window.sum(dummy,k);
b<-(num*mxy-mx*my)/(num*mxx-mx*mx);
a<-(my-b*mx)/num;

#########
#this take care of the case when all k ys or xs are 0
and yield a none sense result
na.row1<-is.na(a)|is.na(b);
a[na.row1]<-NA; b[na.row1]<-NA;
###############

list(a,b)
}

x<-RET-riskfree; y<-sprtrn-riskfree;
s<-lag.regression(y,x,60)

--- Gabor Grothendieck <ggrothendieck at gmail.com>
wrote:

> Try
> 
> 	runmean2 <- function(x, k) # k must be even
> 		(coredata(runmean(x, k-1)) * (k-1) +
> 			coredata(lag(x, -k/2, na.pad = TRUE)))/k
> 
> Also, in your code use matrices or vectors instead
> of data frames
> to avoid any overhead in using data frames.
> 
> On 5/2/06, Guojun Zhu <shmilylemon at yahoo.com> wrote:
> > Sorry to bother you guys again. This is great. 
> But
> > this is for 61 number and the second case will
> change
> > 60 to 61.  "run*" only accept odd number window.
> How
> > to get around it with 60?  Any suggestion? Thanks.
> >
> > --- Gabor Grothendieck <ggrothendieck at gmail.com>
> > wrote:
> >
> > > Using runmean from caTools the first one below
> does
> > > it in under 1 second but will not handle NAs. 
> The
> > > second one takes under 15 seconds and handles
> > > them by replacing them with linear
> approximations.
> > > Note that k must be odd.
> > >
> > > # 1
> > >
> > > library(caTools)
> > > set.seed(1)
> > > system.time({
> > >       y <- rnorm(140001)
> > >       x <- as.numeric(seq(y))
> > >       k <- 61
> > >       Mxy <- runmean(x * y, k)
> > >       Mxx <- runmean(x * x, k)
> > >       Mx <- runmean(x, k)
> > >       My <- runmean(y, k)
> > >       b <- (Mxy - Mx * My) / (Mxx - Mx * Mx)
> > >       a <- My - b * Mx
> > > })
> > >
> > > # 2
> > >
> > > library(caTools)
> > > library(zoo)
> > > set.seed(1)
> > > system.time({
> > >       y <- rnorm(140000)
> > >       x <- as.numeric(seq(y))
> > >       x[100:200] <- NA
> > >       x <- na.approx(zoo(x))
> > >       y <- zoo(y)
> > >       k <- 60
> > >       Mxy <- runmean(x * y, k)
> > >       Mxx <- runmean(x * x, k)
> > >       Mx <- runmean(x, k)
> > >       My <- runmean(y, k)
> > >       b <- (Mxy - Mx * My) / (Mxx - Mx * Mx)
> > >       a <- My - b * Mx
> > > })
> > >
> > >
> > > On 5/1/06, Guojun Zhu <shmilylemon at yahoo.com>
> wrote:
> > > > I basically has a long data.frame a.  but I
> only
> > > need
> > > > three columns x,y. Let us say the index of row
> is
> > > t.
> > > > I need to produce new column s_t as the linear
> > > > regression coefficient of
> (x_(t-60),...x_(t-1)) on
> > > > (y_(t-60),...,y_(t-1)). The data is about
> 140,000
> > > > rows.  I wrote a simple code on this which is
> > > super
> > > > slow, it takes more than 2 hours on a 2.8Ghz
> Intel
> > > Duo
> > > > Core.  My friend use SAS and his code needs
> only
> > > > couple of minutes.  I know there must be some
> more
> > > > efficient way to write it.  Can anyone help me
> on
> > > > this?  Here is the code.
> > > >
> > > > Also one line produce a complete NA temp$y and
> lm
> > > > function failed on that.  How to make it just
> > > produce
> > > > a NA instead and keep runing?
> > > >
> > > > attach(return)
> > > > betat=rep(NA,length(RET))
> > > > for (i in 61:length(RET)){cat(i," ");
> > > > if (year[[i]]>=1995){
> > > >
> > > >
> > >
> >
>
temp<-data.frame(y=RET[(i-60):(i-1)]-riskfree[(i-60):(i-1)],x=sprtrn[(i-60):(i-1)]-riskfree[(i-60):(i-1)])
> > > >
> > > >
> > >
> >
>
betat[[i]]<-lm(y~x+1,na.action=na.exclude,temp)[[1]][[2]]
> > > >  #if (i%%100==0)
> > > > cat(i," ");
> > > >
> > > >
> > > >
> > >
> >
>
return$vol.cap[[i]]=mean(VOL[(i-12):(i-1)],na.rm=TRUE)/return$cap[[i]]
> > > > }
> > > > }
> > > >
> > > > ______________________________________________
> > > > R-help at stat.math.ethz.ch mailing list
> > > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > > PLEASE do read the posting guide!
> > > http://www.R-project.org/posting-guide.html
> > > >
> > >
> >
> >
> > __________________________________________________
> > Do You Yahoo!?

> protection around
> > http://mail.yahoo.com
> >
>