[R] proper use of textConnection
Peter Dalgaard
p.dalgaard at biostat.ku.dk
Sun Oct 12 17:49:25 CEST 2008
Dennis Fisher wrote:
> Colleagues,
>
> Using R 2.7.0 on OS X, I am having trouble understanding the command
> textConnection. My situation is as follows:
> 1. I am trying to read a lengthy file (45000 lines) that has headers
> ~ every 1000 lines. read.table (or its variants) fails because of the
> recurrent headers.
> 2. My present approach is the following:
> a. use readLines to read the file, save as an array
> b. use grep to find the recurrent headers (not including the first
> set)
> c. delete the recurrent headers from the array
> d. write the array to a temp file
> e. read the temp file using read.table
> f. delete the temp file
> 3. My understanding is that textConnection might enable me to replace
> steps d-f with a single step akin to
> read.table(textConnection(array)). This appears to work but it is
> very slow. I executed code on successively larger chunks of the array:
> for (Each in 1000 * 1:45)
> {
> cat("N lines =", Each, "\t", date(), "\n")
> A <- read.table(textConnection(Z[1:Each]), header=T)
> }
> yielding:
> N lines = 1000 Sun Oct 12 07:09:48 2008
> N lines = 2000 Sun Oct 12 07:09:48 2008
> N lines = 3000 Sun Oct 12 07:09:48 2008
> N lines = 4000 Sun Oct 12 07:09:50 2008
> N lines = 5000 Sun Oct 12 07:09:52 2008
> N lines = 6000 Sun Oct 12 07:09:56 2008
> N lines = 7000 Sun Oct 12 07:10:01 2008
> N lines = 8000 Sun Oct 12 07:10:09 2008
> N lines = 9000 Sun Oct 12 07:10:18 2008
> N lines = 10000 Sun Oct 12 07:10:31 2008
> N lines = 11000 Sun Oct 12 07:10:46 2008
> N lines = 12000 Sun Oct 12 07:11:04 2008
> N lines = 13000 Sun Oct 12 07:11:25 2008
> N lines = 14000 Sun Oct 12 07:11:51 2008
> N lines = 15000 Sun Oct 12 07:12:20 2008
> N lines = 16000 Sun Oct 12 07:12:54 2008
> N lines = 17000 Sun Oct 12 07:13:32 2008
> N lines = 18000 Sun Oct 12 07:14:16 2008
> N lines = 19000 Sun Oct 12 07:15:04 2008
> N lines = 20000 Sun Oct 12 07:15:58 2008
> N lines = 21000 Sun Oct 12 07:16:58 2008
> N lines = 22000 Sun Oct 12 07:18:04 2008
> N lines = 23000 Sun Oct 12 07:19:17 2008
> N lines = 24000 Sun Oct 12 07:20:36 2008
> N lines = 25000 Sun Oct 12 07:22:02 2008
> N lines = 26000 Sun Oct 12 07:23:36 2008
>
> Any clever ideas will be greatly appreciated.
>
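For reference, the workflow described above (steps a-c, then a text connection in place of steps d-f) can be sketched as follows. This is only a minimal illustration, not the poster's actual code: the vector Z is a small invented stand-in for the result of readLines(), and the header text "id value" is hypothetical.

```r
# Synthetic stand-in for readLines() output: a header line that recurs
# after every 3 data rows (names and values invented for illustration).
block <- c("id value", "1 0.5", "2 1.5", "3 2.5")
Z <- rep(block, times = 3)

# Steps b-c: locate every header line and drop all but the first.
# A logical mask is used because Z[-integer(0)] would return nothing
# if the header happened to occur only once.
hdr <- grep("^id value$", Z)
Z.clean <- Z[!(seq_along(Z) %in% hdr[-1])]

# Steps d-f replaced: read from an in-memory text connection,
# then close it explicitly.
con <- textConnection(Z.clean)
A <- read.table(con, header = TRUE)
close(con)
```

The pattern passed to grep() has to match the real header line exactly, of course; anchoring it with ^ and $ avoids dropping data rows that merely contain the same text.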
So you are taking about 1.5 minutes to read a 26000 line part of the
file? It's a bit hard to tell whether that is a lot or a little if you
don't tell us what those lines contain... If you're exceeding the
amount of available RAM, that could be causing problems.
You're not closing the earlier connections though, so
A <- read.table(con <- textConnection(Z[1:Each]), header=TRUE)
close(con)
might help.
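
Spelled out inside the loop, that would look roughly like the sketch below. The contents of Z here are invented (one header line plus 3000 small data rows), so the timings say nothing about the real file. The point is only that each connection is closed before the next one is created, since R allows a limited number of open connections.

```r
# Synthetic stand-in for Z: one header line plus 3000 data rows
# (contents invented for illustration).
Z <- c("x y", paste(1:3000, (1:3000) / 2))

for (Each in 1000 * 1:3) {
    con <- textConnection(Z[1:Each])
    A <- read.table(con, header = TRUE)
    close(con)   # release the connection before the next iteration
    cat("N lines =", Each, "\t", nrow(A), "rows read\n")
}
```

Whether this removes the slowdown depends on what is actually causing it, so it is worth timing both variants on the real data.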
Also notice that the usual tricks for speeding up read.table() still
apply (use colClasses, e.g.).
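
As a hedged illustration of those tricks: the column names and types below are assumptions, since the actual file layout was not given. colClasses, nrows and comment.char are all documented arguments of read.table().

```r
# Small invented example: integer id, numeric value, character label.
Z <- c("id value label", paste(1:5, (1:5) / 2, letters[1:5]))

con <- textConnection(Z)
A <- read.table(con, header = TRUE,
                colClasses = c("integer", "numeric", "character"),
                nrows = length(Z) - 1,   # upper bound on rows to read
                comment.char = "")       # do not scan for comments
close(con)
str(A)
```

Declaring colClasses up front spares read.table() from having to guess each column's type, which is usually where most of the savings come from.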
> Dennis
>
>
> Dennis Fisher MD
> P < (The "P Less Than" Company)
> Phone: 1-866-PLessThan (1-866-753-7784)
> Fax: 1-415-564-2220
> www.PLessThan.com
>
>
--
O__ ---- Peter Dalgaard Øster Farimagsgade 5, Entr.B
c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
(*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk) FAX: (+45) 35327907