[R] Incremental ReadLines

Mon Nov 2 20:47:52 CET 2009

Hi Gene,

Rather than using R to parse this file, have you considered using either 
grep or sed to pre-process the file and then read it in?

It looks like you just want lines starting with numbers, so something like

grep '^[0-9]\+' thefile.csv > otherfile.csv

should be much faster, and then you can just read in otherfile.csv using 
read.csv().

Best,

Jim

Gene Leynes wrote:
> I've been trying to figure out how to read in a large file for a few days
> now, and after extensive research I'm still not sure what to do.
> 
> I have a large comma delimited text file that contains 59 fields in each
> record.
> There is also a header every 121 records
> 
> This function works well for smallish records
> getcsv=function(fname){
>     ff=file(description = fname)
>     x <- readLines(ff)
>     closeAllConnections()
>     x <- x[x != ""]          # REMOVE BLANKS
>     x=x[grep("^[-0-9]", x)]  # REMOVE ALL TEXT
> 
>     spl=strsplit(x,',')      # THIS PART IS SLOW, BUT MANAGABLE
> 
> xx=t(sapply(1:length(spl),function(temp)as.vector(na.omit(as.numeric(spl[[temp]])))))
>     return(xx)
> }
> It's not elegant, but it works.
> For 121,000 records it completes in 2.3 seconds
> For 121,000*5 records it completes in 63 seconds
> For 121,000*10 records it doesn't complete
> 
> When I try other methods to read the file in chunks (using scan), the
> process breaks down because I have to start at the beginning of the file on
> every iteration.
> For example:
> fnn=function(n,col){
>     a=122*(n-1)+2
>     xx=scan(fname,skip=a-1,nlines=121,sep=',',quiet=TRUE,what=character(0))
>     xx=xx[xx!='']
>     xx=matrix(xx,ncol=49,byrow=TRUE)
>     xx[,col]
> }
> system.time(sapply(1:10,fnn,c=26))     # 0.31 Seconds
> system.time(sapply(91:90,fnn,c=26))    # 1.09 Seconds
> system.time(sapply(901:910,fnn,c=26))  # 5.78 Seconds
> 
> Even though I'm only getting the 26th column for 10 sets of records, it
> takes a lot longer the further into the file I go.
> 
> How can I tell scan to pick up where it left off, without it starting at the
> beginning??  There must be a good example somewhere.
> 
> I have done a lot of research (in fact, thank you to Michael J. Crawley and
> others for your help thus far)
> 
> Thanks,
> 
> Gene
> 
> 	[[alternative HTML version deleted]]
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826