[R] read.delim very slow in reading files with lots of columns

Benilton Carvalho bcarvalh at jhsph.edu
Fri Sep 25 20:19:53 CEST 2009


It may be worth writing a script to transpose the data (in awk, it
takes 10 min on my laptop) and then reading in the transposed data:
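
If awk is not at hand, the transposition could also be sketched in R
itself. This is only a minimal illustration, not the awk script
mentioned above: transpose_delim() is a hypothetical helper name, the
whole file is assumed to fit in memory, and every line is assumed to
have the same number of tab-separated fields with no empty trailing
field. It is demonstrated on a tiny temporary file rather than on the
real data.

```r
# Hypothetical helper: transpose a tab-delimited file that fits in memory.
transpose_delim <- function(infile, outfile, sep = "\t") {
  lines  <- readLines(infile)
  # One character vector per input line (assumes no empty trailing field,
  # since strsplit() drops a trailing empty string)
  fields <- strsplit(lines, sep, fixed = TRUE)
  stopifnot(length(unique(lengths(fields))) == 1L)
  m <- do.call(rbind, fields)   # character matrix, one row per input line
  write.table(t(m), outfile, sep = sep, quote = FALSE,
              row.names = FALSE, col.names = FALSE)
  invisible(dim(m))
}

# Small demonstration: 3 rows x 4 columns, no header line
infile  <- tempfile(fileext = ".txt")
outfile <- tempfile(fileext = ".txt")
writeLines(c("1\t2\t3\t4", "5\t6\t7\t8", "9\t10\t11\t12"), infile)
transpose_delim(infile, outfile)

# The transposed file is long and narrow, which read.delim() handles well
x <- read.delim(outfile, header = FALSE, colClasses = "numeric")
x <- t(as.matrix(x))   # back to the original 3 x 4 orientation
```

Whether this would be quicker than awk on the full 3GB file was not
tested here; for a file of that size a streaming tool is probably the
safer choice.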


 > system.time({x <- read.delim("testTransposed.txt", header=FALSE,
 +     colClasses="numeric", nrows=700000); x <- t(x)})
    user  system elapsed
   4.958   0.412   5.477
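
As an alternative to transposing, scan() can read a purely numeric file
straight into a vector, which avoids building a data frame with 700000
columns. The following is a minimal sketch under stated assumptions:
read_wide_matrix() is a hypothetical helper name, the number of data
rows is assumed to be known in advance, and the file is assumed to
contain only numbers after an optional header line. It was not timed
against the file discussed in this thread.

```r
# Hypothetical helper: read a wide, purely numeric tab-delimited file
# directly into a matrix, without going through a data frame.
read_wide_matrix <- function(file, n_rows, header = TRUE, sep = "\t") {
  values <- scan(file, what = numeric(), sep = sep,
                 skip = if (header) 1L else 0L, quiet = TRUE)
  # scan() returns the values in file order (row by row), so fill by row
  matrix(values, nrow = n_rows, byrow = TRUE)
}

# Small demonstration: a header line followed by 2 rows x 3 columns
f <- tempfile(fileext = ".txt")
writeLines(c("a\tb\tc", "1\t2\t3", "4\t5\t6"), f)
m <- read_wide_matrix(f, n_rows = 2)
```

The column names are discarded in this sketch; if they are needed, the
header line could be read separately and assigned with colnames().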

b

On Sep 25, 2009, at 1:35 PM, Ping-Hsun Hsieh wrote:

> Thanks, Ben.
>
> The matrix is a purely numeric matrix (6x700000, 31 MB).
> I tried colClasses='numeric' as well as nrows=7 (one of these is the
> header line) on the matrix.
> I also tested it without setting the two options in read.delim().
>
> Here is the time spent on reading the matrix for each test.
>
>> system.time( tmp <- read.delim("test_data.txt"))
>     user    system   elapsed
> 50985.421    27.665 51013.384
>
>> system.time(tmp <- read.delim("test_data.txt", colClasses="numeric",
>>     nrows=7, comment.char=""))
>     user    system   elapsed
> 51301.563    60.491 51362.208
>
> It seems setting the options does not speed up the reading at all.
> Is it because of the header line? I will test it.
> Did I misunderstand something?
>
> One additional and interesting observation:
> The run with the options does save a lot of memory. It took ~150 MB,
> while the other took ~4 GB to read the matrix.
>
> I will try scan() and see if it helps.
>
> Thanks!
> Mike
>
>
> -----Original Message-----
> From: Benilton Carvalho [mailto:bcarvalh at jhsph.edu]
> Sent: Wednesday, September 23, 2009 4:56 PM
> To: Ping-Hsun Hsieh
> Cc: r-help at r-project.org
> Subject: Re: [R] read.delim very slow in reading files with lots of  
> columns
>
> use the 'colClasses' argument and you can also set 'nrows'.
>
> b
>
> On Sep 23, 2009, at 8:24 PM, Ping-Hsun Hsieh wrote:
>
>> Hi,
>>
>>
>>
>> I am trying to read a tab-delimited file into R (Ver. 2.8). The
>> machine I am using is 64bit Linux with 16 GB.
>>
>> The file is basically a matrix(~600x700000) and as large as 3GB.
>>
>>
>>
>> read.delim() ran extremely slowly (hours) even with a subset of
>> the file (31 MB with 6x700000).
>>
>> I monitored the memory usage and found that it consistently took
>> less than 1% of the 16 GB of memory.
>>
>> Does read.delim() have difficulty reading files with lots of columns?
>>
>> Any suggestions?
>>
>>
>>
>> Thanks,
>>
>> Mike
>



