[R] speeding read.table

Thu Oct 18 16:14:52 CEST 2012

Jason

Are you suggesting grep in R or grep in the system?  If the latter, this won't work because I need to implement the same procedure on Windows (sorry about not mentioning this), where grep does not exist.  If in R, the syntax is not obvious -- could you provide an example?

Dennis

Dennis Fisher MD
P < (The "P Less Than" Company)
Phone: 1-866-PLessThan (1-866-753-7784)
Fax: 1-866-PLessThan (1-866-753-7784)
www.PLessThan.com

On Oct 18, 2012, at 7:10 AM, Jason Edgecombe wrote:

> On 10/18/2012 09:57 AM, Fisher Dennis wrote:
>> R 2.15.1
>> OS X
>> 
>> Colleagues,
>> 
>> I am reading a 1 GB file into R using read.table.  The file consists of 100 tables, each of which is headed by two lines of characters.
>> The first of these lines is:
>> 	TABLE NO.  1
>> The second is a list of column headers.
>> 
>> For example:
>> TABLE NO.  1
>>  COL1        COL2        COL3        COL4        COL5        COL6        COL7        COL8        COL9        COL10       COL11       COL12
>>   1.0010E+05  0.0000E+00  1.0000E+00  1.0000E+03 -1.0000E+00  1.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00
>>   1.0010E+05  1.0001E+01  1.0000E+00  1.0000E+03 -1.0000E+00  1.0000E+00  2.2737E-14 -2.2737E-14  0.0000E+00  1.9281E-08  0.0000E+00  0.0000E+00
>>   1.0010E+05  2.4000E+01  1.0000E+00  2.0000E+03 -1.0000E+00  1.0000E+00  5.7541E-15 -5.7541E-15  0.0000E+00  5.1115E-13  0.0000E+00  0.0000E+00
>> 
>> Later something similar appears:
>> TABLE NO.  1
>>  COL1        COL2        COL3        COL4        COL5        COL6        COL7        COL8        COL9        COL10       COL11       COL12
>>   1.0010E+05  0.0000E+00  1.0000E+00  1.0000E+03 -1.0000E+00  1.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00
>>   1.0010E+05  1.0001E+01  1.0000E+00  1.0000E+03 -1.0000E+00  1.0000E+00  2.2737E-14 -2.2737E-14  0.0000E+00  1.9281E-08  0.0000E+00  0.0000E+00
>>   1.0010E+05  2.4000E+01  1.0000E+00  2.0000E+03 -1.0000E+00  1.0000E+00  5.7541E-15 -5.7541E-15  0.0000E+00  5.1115E-13  0.0000E+00  0.0000E+00
>> 
>> I will use the term "problematic lines" to refer to the repeated occurrences of the two non-data lines.
>> 
>> read.table is not successful in reading the table because of these problematic lines (I get around the first "TABLE NO." line using the skip option).
>> 
>> My workaround has been to:
>> 	1.  read the table with readLines
>> 	2.  remove the problematic lines
>> 	3.  write the file to disk
>> 	4.  read the file with read.table.
>> However, this process is slow.
>> 
>> I thought about using "comment.char" as a means of avoiding reading the problematic lines.  However, comment.char takes a single character, so it does not accept a pattern such as "[A-Z]".
>> 
>> Are there any clever workarounds for this?
>> 
> Create a pipe connection that reads from the grep command. grep can exclude the problematic lines. Use the pipe object as your connection in read.table.
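
A minimal sketch of both readings of the suggestion, offered as an illustration rather than as what either poster actually ran. The function names read_numeric_rows and read_numeric_rows_pipe are hypothetical. The sketch assumes that every problematic line starts with a letter (after optional blanks), that no data row does, and that all column-header lines are identical. The first function filters inside R with grepl and parses from memory via the text argument of read.table, so it needs no system grep and no temporary file. The second uses pipe() with the system grep and therefore only runs where grep is installed.

```r
# In-R variant (hypothetical helper): works on any platform.
read_numeric_rows <- function(path) {
  lines <- readLines(path)

  # TRUE for "TABLE NO." lines and column-header lines (first non-blank
  # character is a letter); data rows start with a digit or a minus sign.
  is_text <- grepl("^[[:space:]]*[A-Za-z]", lines)

  # Parse only the numeric rows, straight from memory.
  dat <- read.table(text = lines[!is_text], header = FALSE)

  # Take column names from the first header line that is not a "TABLE NO."
  # line (assumes the header is the same for every table).
  header <- lines[is_text & !grepl("^[[:space:]]*TABLE NO", lines)][1]
  if (!is.na(header)) {
    names(dat) <- scan(text = header, what = "", quiet = TRUE)
  }
  dat
}

# System-grep variant (hypothetical helper): requires grep on the PATH,
# so it will not run on a stock Windows installation.
read_numeric_rows_pipe <- function(path) {
  # -v inverts the match, so grep prints only the lines that do NOT
  # start with a letter.
  cmd <- paste("grep -v -E '^[[:space:]]*[A-Za-z]'", shQuote(path))
  read.table(pipe(cmd), header = FALSE)
}

# Usage, with a placeholder file name:
# dat <- read_numeric_rows("tables.txt")
```

For a file of about 1 GB, readLines still holds the whole file in memory; passing colClasses = "numeric" to read.table may reduce the parsing time further, though that has not been measured here.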