[R] speeding read.table
Rui Barradas
ruipbarradas at sapo.pt
Thu Oct 18 17:35:41 CEST 2012
Hello,
Try the following. Read your file into 'x' with readLines; here a textConnection stands in for the file.
tc <- textConnection("
TABLE NO. 1
COL1 COL2 COL3 COL4 COL5 COL6 COL7 COL8 COL9 COL10 COL11 COL12
1.0010E+05 0.0000E+00 1.0000E+00 1.0000E+03 -1.0000E+00 1.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00
1.0010E+05 1.0001E+01 1.0000E+00 1.0000E+03 -1.0000E+00 1.0000E+00 2.2737E-14 -2.2737E-14 0.0000E+00 1.9281E-08 0.0000E+00 0.0000E+00
1.0010E+05 2.4000E+01 1.0000E+00 2.0000E+03 -1.0000E+00 1.0000E+00 5.7541E-15 -5.7541E-15 0.0000E+00 5.1115E-13 0.0000E+00 0.0000E+00
TABLE NO. 1
COL1 COL2 COL3 COL4 COL5 COL6 COL7 COL8 COL9 COL10 COL11 COL12
1.0010E+05 0.0000E+00 1.0000E+00 1.0000E+03 -1.0000E+00 1.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00
1.0010E+05 1.0001E+01 1.0000E+00 1.0000E+03 -1.0000E+00 1.0000E+00 2.2737E-14 -2.2737E-14 0.0000E+00 1.9281E-08 0.0000E+00 0.0000E+00
1.0010E+05 2.4000E+01 1.0000E+00 2.0000E+03 -1.0000E+00 1.0000E+00 5.7541E-15 -5.7541E-15 0.0000E+00 5.1115E-13 0.0000E+00 0.0000E+00
")
x <- readLines(tc)
close(tc)
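#------------------------ starts here
x <- x[ x != "" ]        # drop empty lines
i1 <- grep("TABLE", x)   # positions of the "TABLE NO." lines
i2 <- grep("COL", x)     # positions of the column header lines
y <- x[-c(i1, i2)]       # keep only the data lines
tc <- textConnection(y)
dat <- read.table(tc)
close(tc)
# take the column names from the first header line
cnames <- unlist(strsplit(x[2], " "))
names(dat) <- cnames[cnames != ""]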
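
The same steps can be wrapped in a function that works directly on a file, so nothing is written back to disk in between. This is only a sketch: the name read_stacked_tables is made up for illustration, and it assumes that every non-data line starts with "TABLE" or "COL" and that all data columns are numeric, as in the example above.

```r
# Hypothetical helper, not part of base R or of the original post.
read_stacked_tables <- function(path) {
  x <- readLines(path)
  x <- x[nzchar(trimws(x))]                        # drop blank lines
  # assumption: the repeated non-data lines start with TABLE or COL
  is_header <- grepl("^[[:space:]]*(TABLE|COL)", x)
  # column names come from the first COL line
  first_col_line <- x[grep("^[[:space:]]*COL", x)[1]]
  cnames <- strsplit(trimws(first_col_line), "[[:space:]]+")[[1]]
  # assumption: all columns are numeric, so type guessing can be skipped
  read.table(text = x[!is_header], col.names = cnames,
             colClasses = "numeric")
}

# Small self-contained demonstration with a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "TABLE NO. 1",
  "COL1 COL2 COL3",
  "1.0010E+05 0.0000E+00 -1.0000E+00",
  "1.0010E+05 1.0001E+01 2.2737E-14",
  "TABLE NO. 1",
  "COL1 COL2 COL3",
  "1.0010E+05 2.4000E+01 5.7541E-15"
), tmp)
dat <- read_stacked_tables(tmp)
unlink(tmp)
```

Passing colClasses is usually worthwhile on large files because read.table then does not have to infer the column types itself.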
Hope this helps,
Rui Barradas
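
If every data field is numeric, scan() may be worth trying in place of read.table, since it skips the per-column type detection and is often faster on large numeric input. A minimal sketch follows; 'y' and 'cnames' here are short stand-ins for the objects built in the code above, not the real data.

```r
# y stands in for the data lines left after dropping the TABLE/COL lines
y <- c(
  "1.0010E+05 0.0000E+00 1.0000E+00 1.0000E+03",
  "1.0010E+05 1.0001E+01 1.0000E+00 1.0000E+03",
  "1.0010E+05 2.4000E+01 1.0000E+00 2.0000E+03"
)
cnames <- c("COL1", "COL2", "COL3", "COL4")

# scan() reads every whitespace-separated field as a double
vals <- scan(text = y, what = numeric(), quiet = TRUE)

# byrow = TRUE restores the original row layout
m <- matrix(vals, ncol = length(cnames), byrow = TRUE,
            dimnames = list(NULL, cnames))
dat2 <- as.data.frame(m)
```

Whether this actually beats read.table on the 1 GB file would have to be timed, for example with system.time().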
On 18-10-2012 14:57, Fisher Dennis wrote:
> R 2.15.1
> OS X
>
> Colleagues,
>
> I am reading a 1 GB file into R using read.table. The file consists of 100 tables, each of which is headed by two lines of characters.
> The first of these lines is:
> TABLE NO. 1
> The second is a list of column headers.
>
> For example:
> TABLE NO. 1
> COL1 COL2 COL3 COL4 COL5 COL6 COL7 COL8 COL9 COL10 COL11 COL12
> 1.0010E+05 0.0000E+00 1.0000E+00 1.0000E+03 -1.0000E+00 1.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00
> 1.0010E+05 1.0001E+01 1.0000E+00 1.0000E+03 -1.0000E+00 1.0000E+00 2.2737E-14 -2.2737E-14 0.0000E+00 1.9281E-08 0.0000E+00 0.0000E+00
> 1.0010E+05 2.4000E+01 1.0000E+00 2.0000E+03 -1.0000E+00 1.0000E+00 5.7541E-15 -5.7541E-15 0.0000E+00 5.1115E-13 0.0000E+00 0.0000E+00
>
> Later something similar appears:
> TABLE NO. 1
> COL1 COL2 COL3 COL4 COL5 COL6 COL7 COL8 COL9 COL10 COL11 COL12
> 1.0010E+05 0.0000E+00 1.0000E+00 1.0000E+03 -1.0000E+00 1.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00
> 1.0010E+05 1.0001E+01 1.0000E+00 1.0000E+03 -1.0000E+00 1.0000E+00 2.2737E-14 -2.2737E-14 0.0000E+00 1.9281E-08 0.0000E+00 0.0000E+00
> 1.0010E+05 2.4000E+01 1.0000E+00 2.0000E+03 -1.0000E+00 1.0000E+00 5.7541E-15 -5.7541E-15 0.0000E+00 5.1115E-13 0.0000E+00 0.0000E+00
>
> I will use the term "problematic lines" to refer to the repeated occurrences of the two non-data lines
>
> read.table is not successful in reading the table because of these problematic lines (I get around the first "TABLE NO." line using the skip option)
>
> My work-around has been to:
> 1. read the table with readLines
> 2. remove the problematic lines
> 3. write the file to disk
> 4. read the file with read.table.
> However, this process is slow.
>
> I thought about using "comment.char" as a means of avoiding reading the problematic lines. However, comment.char does not accept ="[A-Z]"
>
> Are there any clever workarounds for this?
>
> Dennis
>
>
> Dennis Fisher MD
> P < (The "P Less Than" Company)
> Phone: 1-866-PLessThan (1-866-753-7784)
> Fax: 1-866-PLessThan (1-866-753-7784)
> www.PLessThan.com
>