[R] read.delim very slow in reading files with lots of columns

Charles C. Berry cberry at tajo.ucsd.edu
Fri Sep 25 19:17:21 CEST 2009


On Fri, 25 Sep 2009, Ping-Hsun Hsieh wrote:

> Thanks, Ben.
>
> The matrix is a pure numeric matrix (6x700000, 31 MB).
> I tried colClasses='numeric' as well as nrows=7 (one of these is the header line) on the matrix.
> I also tested it without setting the two options in read.delim().


A couple of things come to mind.

First, I have not read the internals of scan, but suspect that parsing a 
really long line may be slowing things down.

Since you are attempting to read in a numeric matrix, you can simply do a 
global replacement of your delimiter with a newline and use scan on 
the result. On unix-like systems, something like

 	tmp <- scan( pipe( 'tr "\t" "\n"  < test_data.txt' ) )

ought to help.
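
If the first line of the file is a header, a rough sketch along these 
lines rebuilds the matrix after scanning (the file name and the 700000 
column count are only stand-ins for yours):

 	## drop the header line, flatten to one field per line, then rebuild
 	## the matrix row by row (ncol = 700000 is an assumed value)
 	nc  <- 700000
 	tmp <- scan( pipe( 'tail -n +2 test_data.txt | tr "\t" "\n"' ) )
 	mat <- matrix( tmp, ncol = nc, byrow = TRUE )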

Second, once each line has been processed, the memory it occupies is 
spread over the full 32 MB (or 3.2 GB for the 600 by 700000 version) 
region of memory. I am guessing that this causes your cache to work hard 
to put it all in place.

If you really want the result to be a 600 by 700000 matrix, you might try 
to read it in smaller blocks using scan( pipe( "cut ... " ) ) to feed 
selected blocks of columns of your text file to R.
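
A rough sketch of reading one such block, again assuming a Unix-like 
system and a header line (the column range is only an example):

 	## pull columns 1-100000, drop the header, and scan that block;
 	## repeat with further -f ranges and cbind() the pieces as needed
 	blk <- scan( pipe( 'cut -f1-100000 test_data.txt | tail -n +2 | tr "\t" "\n"' ) )
 	blk <- matrix( blk, ncol = 100000, byrow = TRUE )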

HTH,

Chuck


>
> Here is the time spent on reading the matrix for each test.
>
>> system.time( tmp <- read.delim("test_data.txt"))
>     user    system   elapsed
> 50985.421    27.665 51013.384
>
>> system.time(tmp <- read.delim("test_data.txt",colClasses="numeric",nrows=7,comment.char=""))
>     user    system   elapsed
> 51301.563    60.491 51362.208
>
> It seems setting the options does not speed up the reading at all.
> Is it because of the header line? I will test it.
> Did I misunderstand something?
>
> One additional and interesting observation:
> The one with the options does save a lot of memory, though: it took ~150 MB, while the other took ~4 GB to read the matrix.
>
> I will try the scan() and see if it helps.
>
> Thanks!
> Mike
>
>
> -----Original Message-----
> From: Benilton Carvalho [mailto:bcarvalh at jhsph.edu]
> Sent: Wednesday, September 23, 2009 4:56 PM
> To: Ping-Hsun Hsieh
> Cc: r-help at r-project.org
> Subject: Re: [R] read.delim very slow in reading files with lots of columns
>
> use the 'colClasses' argument and you can also set 'nrows'.
>
> b
>
> On Sep 23, 2009, at 8:24 PM, Ping-Hsun Hsieh wrote:
>
>> Hi,
>>
>>
>>
>> I am trying to read a tab-delimited file into R (Ver. 2.8). The
>> machine I am using is 64-bit Linux with 16 GB of RAM.
>>
>> The file is basically a matrix (~600x700000) and is about 3 GB.
>>
>>
>>
>> The read.delim() ran extremely slowly (hours), even with a subset of
>> the file (31 MB, 6x700000).
>>
>> I monitored the memory usage and found it consistently used less
>> than 1% of the 16 GB of memory.
>>
>> Does read.delim() have difficulty reading files with lots of columns?
>>
>> Any suggestions?
>>
>>
>>
>> Thanks,
>>
>> Mike
>>
>>
>>
>>

Charles C. Berry                            (858) 534-2098
                                             Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901



