[R] aggregate with cumsum

William Dunlap wdunlap at tibco.com
Mon Oct 18 17:05:52 CEST 2010



Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com  

> -----Original Message-----
> From: r-help-bounces at r-project.org 
> [mailto:r-help-bounces at r-project.org] On Behalf Of Gabor Grothendieck
> Sent: Monday, October 18, 2010 7:03 AM
> To: Bond, Stephen
> Cc: r-help at r-project.org
> Subject: Re: [R] aggregate with cumsum
> 
> On Mon, Oct 18, 2010 at 9:55 AM, Bond, Stephen <Stephen.Bond at cibc.com> wrote:
> > Gabor,
> >
> > You are suggesting some very advanced usage that I do not understand,
> > but it seems this is not what I meant when I said loop.
> > I have a df with 47k rows and each of these is fed to a 'predict'
> > which will output about 62 rows, so the number of groups is very
> > large and I implied that I would go through the 47k x 62 rows with
> >
> > For (jj in (set of 47k values)) # tmp.df=big.df[big.df$group==jj,] to subset
> >                                # and then sum
> >
> > Which is very slow. I discovered that even creating the dataset is
> > super slow as I use write.table
> >
> > The clogging comes from
> >
> > write.table(tmp,"predcom.csv",row.names=FALSE,col.names=FALSE,append=TRUE,sep=',')
> >
> > Can anybody suggest a faster way of appending to a text file??

Writing the output to a file instead of keeping it in an
R object almost never gives you more speed.  Writing numbers
to a text file and later reading them back with read.table or
the like can also lose precision, because only a limited number
of digits is written.  Use one of the R functions Gabor and
others have suggested.
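
As a rough sketch of what staying inside R can look like (only one
possible approach, not necessarily the one suggested earlier in the
thread), grouped sums and grouped cumulative sums can be computed
without subsetting the data frame once per group.  The names 'big.df'
and 'group' follow the quoted message; the column name 'value' and
the sizes are made up for illustration.

```r
# Hypothetical data: 'value' is an assumed column name, and 200 groups
# of 62 rows stand in for the roughly 47k groups in the real problem.
set.seed(1)
big.df <- data.frame(group = rep(1:200, each = 62),
                     value = runif(200 * 62))

# Slow pattern: scan the whole data frame once per group.
slow <- sapply(unique(big.df$group),
               function(jj) sum(big.df$value[big.df$group == jj]))

# Faster: rowsum() computes all the group sums in one call.
fast <- rowsum(big.df$value, big.df$group)[, 1]

# Cumulative sum within each group, same length as the input.
big.df$cum <- ave(big.df$value, big.df$group, FUN = cumsum)
```

Both approaches give the same sums; the difference is that the loop
does one full comparison of the 'group' column per group.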

If you really want to append many times to one file, things will
go much faster if you open the file once before all the writing
and close it when you are done, instead of opening and
closing it implicitly for each write.  E.g., on my Windows XP
laptop opening the file once gives a roughly 320:1 speedup
in elapsed time:

  > tfile1 <- tempfile()
  > system.time(for(i in 1:1e4)cat(i, file=tfile1, append=TRUE))
     user  system elapsed 
     1.84    4.30   79.86 

  > tfile2 <- tempfile()
  > ofile <- file(tfile2, open="a") # open in append mode
  > system.time(for(i in 1:1e4)cat(i, file=ofile))
     user  system elapsed 
     0.18    0.07    0.25 
  > close(ofile)

and there is no difference in what the output files contain:

  > identical(readLines(tfile1), readLines(tfile2))
  [1] TRUE
  Warning messages:
  1: In readLines(tfile1) :
    incomplete final line found on 'C:\DOCUME~1\wdunlap\LOCALS~1\Temp\Rtmpdy7MQ0\file41bb5af1'
  2: In readLines(tfile2) :
    incomplete final line found on 'C:\DOCUME~1\wdunlap\LOCALS~1\Temp\Rtmpdy7MQ0\file1eb26e9'
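
The same open-once idea should carry over to write.table(), since its
'file' argument accepts an open connection as well as a file name.
A minimal sketch, assuming a small made-up data frame 'tmp' in place
of the real predict() output and a temporary file in place of
"predcom.csv":

```r
# Hypothetical stand-in for one chunk of predict() output.
tmp <- data.frame(id = 1:3, pred = c(0.25, 0.5, 0.75))

out <- tempfile(fileext = ".csv")
con <- file(out, open = "w")   # open once; open = "a" would append to an existing file
for (i in 1:5) {
  # Each call writes to the already-open connection and leaves it open.
  write.table(tmp, con, row.names = FALSE, col.names = FALSE, sep = ",")
}
close(con)                     # close once, after all the writing

res <- read.csv(out, header = FALSE)
```

With a connection there is no need for append=TRUE; successive writes
simply continue where the previous one stopped.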

write.table() has a lot of additional overhead beyond
opening and closing files.  Using cat() on an open connection
is the fastest of the options shown here.
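
For illustration only, here is one way rows of a data frame might be
written with cat() on an open connection.  The data frame 'tmp' and
its columns are made up.  The rows are built with paste() first
because, as far as I know, paste() converts numbers with up to 15
significant digits, whereas cat() applied directly to numbers follows
options("digits"), which is 7 by default.

```r
# Hypothetical chunk of output; 'id' and 'pred' are assumed names.
tmp <- data.frame(id = 1:3, pred = c(1/3, 0.5, 2/3))

out <- tempfile(fileext = ".csv")
con <- file(out, open = "w")   # open once
for (i in 1:5) {
  # One comma-separated string per row, each ending in a newline.
  rows <- paste(tmp$id, ",", tmp$pred, "\n", sep = "")
  cat(rows, file = con, sep = "")
}
close(con)                     # close once

res <- read.csv(out, header = FALSE)
```

This gives up the quoting and formatting that write.table() does for
you, so it only makes sense when the columns are simple.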

> >
> > All comments are appreciated.
> 
> If the problem is to sum each row of a matrix then rowSums can do that
> without a loop.
> 
> -- 
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
> 


