[R] regression on large file

Barry Rowlingson b.rowlingson at lancaster.ac.uk
Wed Oct 28 12:07:27 CET 2009


On Wed, Oct 28, 2009 at 11:50 AM, Georg Ehret <georgehret at gmail.com> wrote:
> Dear R community,
>   I have a fairly large file with variables in rows. Every variable
> (thousands) needs to be regressed on a reference variable. The file is too
> big to load into R (or R gets too slow having done it) and I do now read in
> line by line with "scan" (see below) and write the results to out. Although
> improved, this is still very slow... Can someone please help me and suggest
> how I can make this faster?
>
> Thank you and best regards, Georg.
> *******************************************
> Georg Ehret, Johns Hopkins U, Baltimore MD, USA
>
>
> for (i in 16:nmax){
>        line<-scan(file=paste(file),nlines=1,skip=(i-1),what="integer",sep=",")
>        d<-as.numeric(line[-1])
>        name<-line[1]
>        modela <- lm(s1~a+a2+b+s+M+W)
>        modelb <- lm(s2~a+a2+b+s+M+W+d)
>        modelc <- lm(s3~a+2+b+s+M+W+d+d*s)
>        p_main <- anova(modela,modelb)$P[2]
>        p_main_i <- anova(modela,modelc)$P[2]
>        p_i <- anova(modelb,modelc)$P[2]
>        cat(c(name,p_main,p_main_i,p_i),file=paste("out",".txt",sep=""),append=T)
>        cat("\n",file=paste("out",".txt",sep=""),append=T)
> }

 Normally you shouldn't try to optimise something until you know where
the time is going. It could be that fitting your three linear models
takes most of the time, in which case there's no point optimising the
input/output...
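
 One rough way to see how much the model fitting costs on its own is
to wrap it in system.time(). This is only a sketch: the data below are
made-up stand-ins for the variables in your session (s1, a, a2, b, s,
M, W, d), and I use the same response in both models so that the
anova() comparison is between nested fits.

```r
# Hypothetical stand-in data; the real variables live in the poster's session.
set.seed(1)
n  <- 200
a  <- rnorm(n); a2 <- a^2; b <- rnorm(n); s <- rnorm(n)
M  <- rnorm(n); W  <- rnorm(n); d <- rnorm(n)
s1 <- rnorm(n)

# Time 100 repetitions of two model fits plus their anova comparison.
fit_time <- system.time(
  for (k in 1:100) {
    modela <- lm(s1 ~ a + a2 + b + s + M + W)
    modelb <- lm(s1 ~ a + a2 + b + s + M + W + d)
    p_main <- anova(modela, modelb)[2, "Pr(>F)"]
  }
)
print(fit_time["elapsed"])
```

 If the elapsed time per iteration here is small compared with what one
pass of your real loop takes, the fitting is not the problem.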

 But I reckon (and this is a guess) the time is going on scan() having
to skip from the start of the file on every call: to read line i it
first has to read past the i-1 lines before it. You can confirm this by
commenting out everything inside the loop except for the
line<-scan(...) line. If it still takes ages then we've found the
bottleneck.
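
 The check could look something like the sketch below. The file is a
small invented one (2000 rows, a name followed by 50 numbers), and
time_read() is just a helper name I made up for this example. It runs
only the scan() step, once for rows near the top of the file and once
for rows near the end.

```r
# Hypothetical test file: 2000 rows, each a name followed by 50 numbers.
tmp <- tempfile(fileext = ".csv")
set.seed(1)
rows <- vapply(1:2000, function(i)
  paste(c(paste0("var", i), round(rnorm(50), 3)), collapse = ","),
  character(1))
writeLines(rows, tmp)

# Time only the read step, using skip= the same way as the original loop.
time_read <- function(idx) {
  system.time(
    for (i in idx) {
      line <- scan(file = tmp, nlines = 1, skip = i - 1,
                   what = "character", sep = ",", quiet = TRUE)
    }
  )[["elapsed"]]
}

early <- time_read(1:100)       # rows near the top: little to skip
late  <- time_read(1901:2000)   # rows near the end: much more to skip
print(c(early = early, late = late))
unlink(tmp)
```

 If the later rows take noticeably longer than the early ones, the
skipping is what costs you, and it will get worse the further into the
file you go.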

 So, what you then do to fix that is get R to read from a
connection - this is an object that you can read from sequentially
without having to skip from the start every time. There are examples in
help(connections) that will get you going.
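
 A minimal sketch of the idea, on a tiny invented file: open the
connection once, throw away the rows you don't want, then read one row
per iteration. first_row and the sum() are placeholders of mine (you
start at row 16 and would fit your models where the sum() is). I use
readLines() plus strsplit() here; scan() also accepts an open
connection if you prefer to keep it.

```r
# Hypothetical three-row file standing in for the real data.
tmp <- tempfile(fileext = ".csv")
writeLines(c("x1,1,2,3", "x2,4,5,6", "x3,7,8,9"), tmp)

con <- file(tmp, open = "r")   # open once; reads continue where the last stopped
first_row <- 2                 # placeholder for the first row to process

# Skip the leading rows a single time instead of on every iteration.
if (first_row > 1) invisible(readLines(con, n = first_row - 1))

results <- list()
repeat {
  txt <- readLines(con, n = 1)
  if (length(txt) == 0) break            # end of file reached
  line <- strsplit(txt, ",", fixed = TRUE)[[1]]
  name <- line[1]
  d    <- as.numeric(line[-1])
  results[[name]] <- sum(d)              # placeholder for the model fitting
}
close(con)
unlink(tmp)
```

 The same trick works for output: open "out.txt" once with
file("out.txt", open = "w"), cat() to that connection inside the loop,
and close it at the end.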


Barry



