[R] speeding read.table
Rui Barradas
ruipbarradas at sapo.pt
Thu Oct 18 20:27:21 CEST 2012
Hello,
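The time is down by a factor of 4, but it still takes a couple of
minutes: about 2 minutes for a file of 380 MB / 3.6M lines. So maybe
system commands (awk, perhaps?) can do the job better. The function
below reads the file in chunks of 'lines' lines, drops the header lines
and appends the numbers to 'outfile'.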
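fun <- function(infile, outfile, lines = 10000L){
    # Drop the "TABLE NO." and column-header lines from a chunk.
    # A logical index is used because x[-integer(0)] returns an empty
    # vector, which would discard any chunk that has no header lines.
    remove <- function(x){
        x[ !grepl("TABLE|COL", x) ]
    }
    fin <- file(infile, open = "rt")
    on.exit(close(fin))
    while(TRUE){
        x <- readLines(fin, n = lines)
        # readLines returns character(0) at end of file, not an error
        if(length(x) == 0) return(invisible(NULL))
        y <- remove(x[ x != "" ])
        # a chunk may hold only blank or header lines; keep reading
        if(length(y) == 0) next
        lst <- lapply(strsplit(y, " "), function(.y)
            as.numeric(.y[ .y != "" ]))
        mat <- do.call(rbind, lst)
        # append = TRUE: delete an existing 'outfile' before re-running
        write.table(mat, outfile, append = TRUE, row.names = FALSE,
                    col.names = FALSE)
    }
}
fun("test", "clean")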
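
On the system-command idea: a minimal sketch, not timed on the large file, that lets grep drop the header lines and hands the rest straight to read.table through a pipe. The helper name read_clean and its 'pattern' argument are made up for illustration, and it assumes a Unix-like system with grep on the PATH and data rows that are purely numeric.

```r
# Hypothetical helper: grep removes the header lines, read.table reads the rest.
# Assumes grep is available on the PATH and that every data row is numeric.
read_clean <- function(infile, pattern = "TABLE|COL") {
    cmd <- paste("grep -v -E", shQuote(pattern), shQuote(infile))
    # read.table opens the pipe itself and closes it when it is done;
    # blank lines are skipped by default (blank.lines.skip = TRUE)
    read.table(pipe(cmd), header = FALSE, colClasses = "numeric")
}

# Small self-contained check on a temporary file
tmp <- tempfile()
writeLines(c("TABLE NO. 1",
             " COL1 COL2 COL3",
             " 1.0010E+05 0.0000E+00 1.0000E+00",
             "",
             "TABLE NO. 1",
             " COL1 COL2 COL3",
             " 1.0010E+05 1.0001E+01 -1.0000E+00"), tmp)
dat <- read_clean(tmp)
dim(dat)
```

Setting colClasses spares read.table from guessing the column types, which usually helps on large files, but whether this beats the chunked function above would have to be measured.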
Hope this helps,
Rui Barradas
On 18-10-2012 18:14, Rui Barradas wrote:
> Hello,
>
> The problem doesn't seem to be memory swaps. I've tried with a 380 MB
> file (3.6M lines) and it took around 8.5 minutes. I'll think of
> something else and write back.
>
> Rui Barradas
> On 18-10-2012 16:42, Fisher Dennis wrote:
>> Rui
>>
>> I tried something similar to this. To my surprise, it was quite slow
>> (it is still running after many minutes). I suspect that
>> textConnection is a slow process compared to actually reading from
>> the drive. It is possible that the problem is that the object is so
>> large that it is being swapped in and out of virtual memory --
>> however, this machine has 12 GB RAM so this seems unlikely.
>>
>> Dennis
>>
>> Dennis Fisher MD
>> P < (The "P Less Than" Company)
>> Phone: 1-866-PLessThan (1-866-753-7784)
>> Fax: 1-866-PLessThan (1-866-753-7784)
>> www.PLessThan.com
>>
>> On Oct 18, 2012, at 8:35 AM, Rui Barradas wrote:
>>
>>> Hello,
>>>
>>> Try the following, reading your file into 'x', using readLines.
>>>
>>>
>>>
>>> tc <- textConnection("
>>> TABLE NO. 1
>>> COL1 COL2 COL3 COL4 COL5 COL6
>>> COL7 COL8 COL9 COL10 COL11 COL12
>>> 1.0010E+05 0.0000E+00 1.0000E+00 1.0000E+03 -1.0000E+00
>>> 1.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00
>>> 0.0000E+00 0.0000E+00
>>> 1.0010E+05 1.0001E+01 1.0000E+00 1.0000E+03 -1.0000E+00
>>> 1.0000E+00 2.2737E-14 -2.2737E-14 0.0000E+00 1.9281E-08
>>> 0.0000E+00 0.0000E+00
>>> 1.0010E+05 2.4000E+01 1.0000E+00 2.0000E+03 -1.0000E+00
>>> 1.0000E+00 5.7541E-15 -5.7541E-15 0.0000E+00 5.1115E-13
>>> 0.0000E+00 0.0000E+00
>>>
>>> TABLE NO. 1
>>> COL1 COL2 COL3 COL4 COL5 COL6
>>> COL7 COL8 COL9 COL10 COL11 COL12
>>> 1.0010E+05 0.0000E+00 1.0000E+00 1.0000E+03 -1.0000E+00
>>> 1.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00
>>> 0.0000E+00 0.0000E+00
>>> 1.0010E+05 1.0001E+01 1.0000E+00 1.0000E+03 -1.0000E+00
>>> 1.0000E+00 2.2737E-14 -2.2737E-14 0.0000E+00 1.9281E-08
>>> 0.0000E+00 0.0000E+00
>>> 1.0010E+05 2.4000E+01 1.0000E+00 2.0000E+03 -1.0000E+00
>>> 1.0000E+00 5.7541E-15 -5.7541E-15 0.0000E+00 5.1115E-13
>>> 0.0000E+00 0.0000E+00
>>> ")
>>>
>>> x <- readLines(tc)
>>> close(tc)
>>>
>>> #------------------------ starts here
>>> x <- x[ x != "" ]
>>>
>>> i1 <- grep("TABLE", x)
>>> i2 <- grep("COL", x)
>>> y <- x[-c(i1, i2)]
>>>
>>> tc <- textConnection(y)
>>> dat <- read.table(tc)
>>> close(tc)
>>>
>>> cnames <- unlist(strsplit(x[2], " "))
>>> names(dat) <- cnames[cnames != ""]
>>>
>>>
>>> Hope this helps,
>>>
>>> Rui Barradas
>>> On 18-10-2012 14:57, Fisher Dennis wrote:
>>>> R 2.15.1
>>>> OS X
>>>>
>>>> Colleagues,
>>>>
>>>> I am reading a 1 GB file into R using read.table. The file
>>>> consists of 100 tables, each of which is headed by two lines of
>>>> characters.
>>>> The first of these lines is:
>>>> TABLE NO. 1
>>>> The second is a list of column headers.
>>>>
>>>> For example:
>>>> TABLE NO. 1
>>>> COL1 COL2 COL3 COL4 COL5 COL6
>>>> COL7 COL8 COL9 COL10 COL11 COL12
>>>> 1.0010E+05 0.0000E+00 1.0000E+00 1.0000E+03 -1.0000E+00
>>>> 1.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00
>>>> 0.0000E+00 0.0000E+00
>>>> 1.0010E+05 1.0001E+01 1.0000E+00 1.0000E+03 -1.0000E+00
>>>> 1.0000E+00 2.2737E-14 -2.2737E-14 0.0000E+00 1.9281E-08
>>>> 0.0000E+00 0.0000E+00
>>>> 1.0010E+05 2.4000E+01 1.0000E+00 2.0000E+03 -1.0000E+00
>>>> 1.0000E+00 5.7541E-15 -5.7541E-15 0.0000E+00 5.1115E-13
>>>> 0.0000E+00 0.0000E+00
>>>>
>>>> Later something similar appears:
>>>> TABLE NO. 1
>>>> COL1 COL2 COL3 COL4 COL5 COL6
>>>> COL7 COL8 COL9 COL10 COL11 COL12
>>>> 1.0010E+05 0.0000E+00 1.0000E+00 1.0000E+03 -1.0000E+00
>>>> 1.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00
>>>> 0.0000E+00 0.0000E+00
>>>> 1.0010E+05 1.0001E+01 1.0000E+00 1.0000E+03 -1.0000E+00
>>>> 1.0000E+00 2.2737E-14 -2.2737E-14 0.0000E+00 1.9281E-08
>>>> 0.0000E+00 0.0000E+00
>>>> 1.0010E+05 2.4000E+01 1.0000E+00 2.0000E+03 -1.0000E+00
>>>> 1.0000E+00 5.7541E-15 -5.7541E-15 0.0000E+00 5.1115E-13
>>>> 0.0000E+00 0.0000E+00
>>>>
>>>> I will use the term "problematic lines" to refer to the repeated
>>>> occurrences of the two non-data lines
>>>>
>>>> read.table is not successful in reading the table because of these
>>>> problematic lines (I get around the first "TABLE NO." line using
>>>> the skip option)
>>>>
>>>> My work-around has been to:
>>>> 1. read the table with readLines
>>>> 2. remove the problematic lines
>>>> 3. write the file to disk
>>>> 4. read the file with read.table.
>>>> However, this process is slow.
>>>>
>>>> I thought about using "comment.char" as a means of avoiding reading
>>>> the problematic lines. However, comment.char does not accept a
>>>> pattern such as "[A-Z]"
>>>>
>>>> Are there any clever workarounds for this?
>>>>
>>>> Dennis
>>>>
>>>>
>>>> Dennis Fisher MD
>>>> P < (The "P Less Than" Company)
>>>> Phone: 1-866-PLessThan (1-866-753-7784)
>>>> Fax: 1-866-PLessThan (1-866-753-7784)
>>>> www.PLessThan.com
>>>>
>