[R] read.delim very slow in reading files with lots of columns

Fri Sep 25 20:57:17 CEST 2009

Here is how long it took to read a file with 10 lines and 700,000
comma-separated columns per line:

> system.time(input <- scan("/tempxx.txt", what=0, sep=','))
Read 7000000 items
   user  system elapsed
  15.62    0.22   15.84
> object.size(input)
56000024 bytes
>

'scan' should be sufficient, and it avoids spending another 10 minutes in awk.
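
Since scan() returns one long vector rather than a matrix, here is a minimal
sketch of reshaping the result. It uses a small made-up file (10 lines, 5
columns) in place of the real one; 'path', 'n_row', 'n_col' and 'values' are
names invented for this illustration only.

```r
# Hypothetical stand-in for the real file: 10 lines, 5 comma-separated columns.
path  <- tempfile(fileext = ".txt")
n_row <- 10
n_col <- 5
values <- matrix(seq_len(n_row * n_col), nrow = n_row, byrow = TRUE)
write.table(values, path, sep = ",", row.names = FALSE, col.names = FALSE)

# scan() reads the values in file order (line by line) into one numeric vector.
input <- scan(path, what = 0, sep = ",", quiet = TRUE)

# Rebuild the matrix; byrow = TRUE because each line of the file is one row.
m <- matrix(input, ncol = n_col, byrow = TRUE)
```

If the file starts with a header line, scan() can pass over it with skip = 1;
for a tab-delimited file, use sep = "\t" instead.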

On Fri, Sep 25, 2009 at 1:17 PM, Charles C. Berry <cberry at tajo.ucsd.edu> wrote:
> On Fri, 25 Sep 2009, Ping-Hsun Hsieh wrote:
>
>> Thanks, Ben.
>>
>> The matrix is a pure numeric matrix (6x700000, 31 MB).
>> I tried colClasses='numeric' as well as nrows=7 (one of the lines is the
>> header line) on the matrix.
>> I also tested read.delim() without setting either option.
>
>
> A couple of things come to mind.
>
> First, I have not read the internals of scan, but suspect that parsing a
> really long line may be slowing things down.
>
> Since you are attempting to read in a numeric matrix, you can simply do a
> global replacement of your delimiter with a newline and use scan on the
> result. On unix-like systems, something like
>
>        tmp <- scan( pipe( 'tr "\t" "\n"  < test_data.txt' ) )
>
> ought to help.
>
> Second, the memory occupied by each line - once it has been processed - is
> spread over the full 32MB (or 3.2 GB for the 600 by 700000 version) region
> of memory. I am guessing that this is causing your cache to work hard to put
> it in place.
>
> If you really want the result to be a 600 by 700000 matrix, you might try to
> read it in smaller blocks using scan( pipe( "cut ... " ) ) to feed selected
> blocks of columns of your text file to R.
>
> HTH,
>
> Chuck
>
>
>>
>> Here is the time spent on reading the matrix for each test.
>>
>> > system.time(tmp <- read.delim("test_data.txt"))
>>      user    system   elapsed
>> 50985.421    27.665 51013.384
>>
>> > system.time(tmp <- read.delim("test_data.txt", colClasses="numeric", nrows=7, comment.char=""))
>>      user    system   elapsed
>> 51301.563    60.491 51362.208
>>
>> It seems setting the options does not speed up the reading at all.
>> Is it because of the header line? I will test it.
>> Did I misunderstand something?
>>
>> One additional and interesting observation:
>> The run with the options does save a lot of memory. It took ~150 MB, while
>> the other took ~4 GB to read the matrix.
>>
>> I will try the scan() and see if it helps.
>>
>> Thanks!
>> Mike
>>
>>
>> -----Original Message-----
>> From: Benilton Carvalho [mailto:bcarvalh at jhsph.edu]
>> Sent: Wednesday, September 23, 2009 4:56 PM
>> To: Ping-Hsun Hsieh
>> Cc: r-help at r-project.org
>> Subject: Re: [R] read.delim very slow in reading files with lots of
>> columns
>>
>> use the 'colClasses' argument and you can also set 'nrows'.
>>
>> b
>>
>> On Sep 23, 2009, at 8:24 PM, Ping-Hsun Hsieh wrote:
>>
>>> Hi,
>>>
>>>
>>>
>>> I am trying to read a tab-delimited file into R (Ver. 2.8). The
>>> machine I am using is 64-bit Linux with 16 GB of memory.
>>>
>>> The file is basically a matrix (~600x700000) and is about 3 GB in size.
>>>
>>>
>>>
>>> read.delim() ran extremely slowly (hours) even with a subset of
>>> the file (31 MB, 6x700000).
>>>
>>> I monitored the memory usage and found that it consistently used less
>>> than 1% of the 16 GB of memory.
>>>
>>> Does read.delim() have difficulty reading files with lots of columns?
>>>
>>> Any suggestions?
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Mike
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>
> Charles C. Berry                            (858) 534-2098
>                                             Dept of Family/Preventive Medicine
> E mailto:cberry at tajo.ucsd.edu             UC San Diego
> http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901

-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?