[R] proper use of textConnection

Peter Dalgaard p.dalgaard at biostat.ku.dk
Sun Oct 12 17:49:25 CEST 2008


Dennis Fisher wrote:
> Colleagues,
>
> Using R 2.7.0 on OS X, I am having trouble understanding the command  
> textConnection.  My situation is as follows:
> 1.  I am trying to read a lengthy file (45000 lines) that has headers  
> ~ every 1000 lines.  read.table (or its variants) fails because of the  
> recurrent headers.
> 2.  My present approach is the following:
> 	a.  use readLines to read the file, save as an array
> 	b.  use grep to find the recurrent headers (not including the first  
> set)
> 	c.  delete the recurrent headers from the array
> 	d.  write the array to a temp file
> 	e.  read the temp file using read.table
> 	f.   delete the temp file
> 3.  My understanding is that textConnection might enable me to replace  
> steps d-f with a single step akin to  
> read.table(textConnection(array)).  This appears to work but it is  
> very slow.  I executed code on successively larger chunks of the array:
> for (Each in 1000 * 1:45)
> 	{
> 	cat("N lines =", Each, "\t", date(), "\n")
> 	A <- read.table(textConnection(Z[1:Each]), header=TRUE)
> 	}
> yielding:
> N lines = 1000 	 Sun Oct 12 07:09:48 2008
> N lines = 2000 	 Sun Oct 12 07:09:48 2008
> N lines = 3000 	 Sun Oct 12 07:09:48 2008
> N lines = 4000 	 Sun Oct 12 07:09:50 2008
> N lines = 5000 	 Sun Oct 12 07:09:52 2008
> N lines = 6000 	 Sun Oct 12 07:09:56 2008
> N lines = 7000 	 Sun Oct 12 07:10:01 2008
> N lines = 8000 	 Sun Oct 12 07:10:09 2008
> N lines = 9000 	 Sun Oct 12 07:10:18 2008
> N lines = 10000 	 Sun Oct 12 07:10:31 2008
> N lines = 11000 	 Sun Oct 12 07:10:46 2008
> N lines = 12000 	 Sun Oct 12 07:11:04 2008
> N lines = 13000 	 Sun Oct 12 07:11:25 2008
> N lines = 14000 	 Sun Oct 12 07:11:51 2008
> N lines = 15000 	 Sun Oct 12 07:12:20 2008
> N lines = 16000 	 Sun Oct 12 07:12:54 2008
> N lines = 17000 	 Sun Oct 12 07:13:32 2008
> N lines = 18000 	 Sun Oct 12 07:14:16 2008
> N lines = 19000 	 Sun Oct 12 07:15:04 2008
> N lines = 20000 	 Sun Oct 12 07:15:58 2008
> N lines = 21000 	 Sun Oct 12 07:16:58 2008
> N lines = 22000 	 Sun Oct 12 07:18:04 2008
> N lines = 23000 	 Sun Oct 12 07:19:17 2008
> N lines = 24000 	 Sun Oct 12 07:20:36 2008
> N lines = 25000 	 Sun Oct 12 07:22:02 2008
> N lines = 26000 	 Sun Oct 12 07:23:36 2008
>
> Any clever ideas will be greatly appreciated.
>   
So you are taking about 1.5 minutes to read a 26000-line part of the 
file? It's a bit hard to tell whether that is a lot or a little if you 
don't tell us what those lines contain...  If you're exceeding the 
amount of available RAM, that could be causing problems.

You're not closing the earlier connections though, so

A <- read.table(con <- textConnection(Z[1:Each]), header=TRUE)
close(con)

might help.
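
One way to make sure the connection is always closed, sketched here under
assumptions: the helper name read_lines_as_table and the three-line Z are
made up for illustration and are not from the original post. A text
connection is created already open, so read.table() does not close it for
you; wrapping the call in a function with on.exit() closes it even if the
read fails.

```r
# Hypothetical stand-in for the poster's character vector Z
Z <- c("x y", "1 2", "3 4")

# Hypothetical helper: read a character vector as if it were a file
read_lines_as_table <- function(lines, ...) {
  con <- textConnection(lines)
  on.exit(close(con))   # runs when the function exits, even on error
  read.table(con, ...)
}

A <- read_lines_as_table(Z, header = TRUE)
```

Inside the loop from the original post, the call would become something
like read_lines_as_table(Z[1:Each], header = TRUE).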

Also notice that the usual tricks for speeding up read.table() still 
apply (use colClasses, e.g.).
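
A minimal sketch of those tricks combined with the header removal described
in the question. The data, the column names and the column types are
invented for illustration, and it assumes that every recurrent header line
is identical to the first line; your file may need a grep() pattern instead.

```r
# Hypothetical data: the header recurs in the middle of the "file"
Z <- c("id value", "1 0.5", "2 1.5", "id value", "3 2.5")

header <- Z[1]
keep <- c(TRUE, Z[-1] != header)   # keep the first header, drop the repeats

con <- textConnection(Z[keep])
A <- read.table(con, header = TRUE,
                colClasses = c("integer", "numeric"),  # skip type guessing
                nrows = sum(keep) - 1,                 # data rows, header excluded
                comment.char = "")                     # no comment scanning
close(con)
```

Giving colClasses saves read.table() from having to work out the type of
each column, and nrows lets it allocate the result in one go.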

> Dennis
>
>
> Dennis Fisher MD
> P < (The "P Less Than" Company)
> Phone: 1-866-PLessThan (1-866-753-7784)
> Fax: 1-415-564-2220
> www.PLessThan.com
>


-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907
