[R] read.table performance

Wed Dec 7 08:21:23 CET 2011

On Dec 6, 2011, at 22:33 , Gene Leynes wrote:

> Mark,
> 
> Thanks for your suggestions.
> 
> That's a good idea about the NULL columns; I didn't think of that.
> Surprisingly, it didn't have any effect on the time.

Hmm, I think you want "character" and "NULL" there (i.e., quoted). Did you fix both? 

>> read.table(whatever, as.is=TRUE, colClasses = c(rep(character,4),
>> rep(NULL,3696)).
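
To spell out why the quotes matter, here is a minimal sketch. The six-column temp file and its column names are made up for illustration. Unquoted, rep(character, 4) tries to replicate the function character() and fails, and rep(NULL, 3696) is simply NULL, which c() drops. Quoted, the vector has one entry per column and the "NULL" columns are skipped on input.

```r
# Hypothetical 6-column tab-separated file, standing in for the real one.
path <- tempfile(fileext = ".txt")
demo <- data.frame(id = c("a", "b"), grp = c("x", "y"),
                   v1 = 1:2, v2 = 3:4, v3 = 5:6, v4 = 7:8)
write.table(demo, path, sep = "\t", quote = FALSE, row.names = FALSE)

# rep(NULL, n) is NULL, so it adds nothing: this vector has length 1.
length(c("character", rep(NULL, 4)))

# Quoted strings: keep the first two columns as character, skip the rest.
classes <- c(rep("character", 2), rep("NULL", 4))
dat <- read.table(path, header = TRUE, sep = "\t", colClasses = classes)
str(dat)
```

For the real file the same pattern would be c(rep("character", 4), rep("NULL", 3696)), assuming the column count really is 3700.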

As a general matter, if you want people to dig into this, they need some stand-in for the file to play with. Would it be possible to set up a small R program that generates a data file which displays the issue? Everything I try seems to take about a second to read in.
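
A generator along these lines might serve as a starting point. It is only a sketch: the dimensions (250 rows by 3700 columns), the mix of numeric and character columns, and the column names are guesses, since the real file has not been shared.

```r
set.seed(1)
n_rows <- 250     # assumed; matches the 250-line file that was profiled
n_cols <- 3700    # "about 3700 columns"

# Assumed layout: every 10th column character, the rest numeric.
cols <- lapply(seq_len(n_cols), function(j) {
  if (j %% 10 == 0) sample(letters, n_rows, replace = TRUE)
  else round(rnorm(n_rows), 3)
})
names(cols) <- paste0("V", seq_len(n_cols))
fake <- as.data.frame(cols, stringsAsFactors = FALSE)

path <- tempfile(fileext = ".txt")
write.table(fake, path, sep = "\t", quote = FALSE, row.names = FALSE)

file.info(path)$size / 1024   # size in KB, to compare with the real file
timing <- system.time(dat <- read.table(path, header = TRUE, sep = "\t"))
timing
```

If this reads in about a second, the slowdown presumably depends on something specific to the real data (quoting, odd characters, very long fields) that the generator would then need to imitate.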

-pd

> 
> This problem was just a curiosity, I already did the import using Excel and
> VBA.  I was just going to illustrate the power and simplicity of R, but
> ironically it's been much slower and harder in R...
> The VBA was painful and messy, and took me over an hour to write; but at
> least it worked quickly and reliably.
> The R code was clean and only took me about 5 minutes to write, but the run
> time was prohibitively slow!
> 
> I profiled the code, but that offers little insight to me.
> 
> Profile results with 10 line file:
> 
>> summaryRprof("C:/Users/gene.leynes/Desktop/test.out")
> $by.self
>             self.time self.pct total.time total.pct
> scan             12.24    53.50      12.24     53.50
> read.table       10.58    46.24      22.88    100.00
> type.convert      0.04     0.17       0.04      0.17
> make.names        0.02     0.09       0.02      0.09
> 
> $by.total
>             total.time total.pct self.time self.pct
> read.table        22.88    100.00     10.58    46.24
> scan              12.24     53.50     12.24    53.50
> type.convert       0.04      0.17      0.04     0.17
> make.names         0.02      0.09      0.02     0.09
> 
> $sample.interval
> [1] 0.02
> 
> $sampling.time
> [1] 22.88
> 
> 
> Profile results with 250 line file:
> 
>> summaryRprof("C:/Users/gene.leynes/Desktop/test.out")
> $by.self
>             self.time self.pct total.time total.pct
> scan             23.88    68.15      23.88     68.15
> read.table       10.78    30.76      35.04    100.00
> type.convert      0.30     0.86       0.32      0.91
> character         0.02     0.06       0.02      0.06
> file              0.02     0.06       0.02      0.06
> lapply            0.02     0.06       0.02      0.06
> unlist            0.02     0.06       0.02      0.06
> 
> $by.total
>               total.time total.pct self.time self.pct
> read.table          35.04    100.00     10.78    30.76
> scan                23.88     68.15     23.88    68.15
> type.convert         0.32      0.91      0.30     0.86
> sapply               0.04      0.11      0.00     0.00
> character            0.02      0.06      0.02     0.06
> file                 0.02      0.06      0.02     0.06
> lapply               0.02      0.06      0.02     0.06
> unlist               0.02      0.06      0.02     0.06
> simplify2array       0.02      0.06      0.00     0.00
> 
> $sample.interval
> [1] 0.02
> 
> $sampling.time
> [1] 35.04
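
For anyone wanting to collect a comparable profile, the usual Rprof()/summaryRprof() pattern can be sketched as below. The small temp file is a made-up stand-in, and the one-second loop is only there so the sampler collects some samples on such a small input. In the two summaries above, read.table's self time barely moves (10.58 vs 10.78 s) between 10 and 250 lines while scan's grows, which hints at a cost tied to the number of columns rather than the number of rows.

```r
# Hypothetical small tab-separated file; not the data from this thread.
path <- tempfile(fileext = ".txt")
m <- matrix(round(runif(50 * 200), 3), nrow = 50)
write.table(m, path, sep = "\t", quote = FALSE, row.names = FALSE)

prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.02)   # same 0.02 s sampling interval as above
t0 <- proc.time()[["elapsed"]]
while (proc.time()[["elapsed"]] - t0 < 1) {
  dat <- read.table(path, header = TRUE, sep = "\t")
}
Rprof(NULL)                         # stop profiling

prof <- summaryRprof(prof_file)
head(prof$by.self)                  # time spent in each function itself
head(prof$by.total)                 # time including called functions
```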
> 
> On Tue, Dec 6, 2011 at 2:34 PM, Mark Leeds <markleeds2 at gmail.com> wrote:
> 
>> Hi Gene: maybe someone else will reply with some subtleties that I'm not
>> aware of. One other thing that might help: if you know which columns you
>> want, you can set the others to NULL through colClasses, and this should
>> speed things up also. For example, say you knew you only wanted the
>> first four columns and they were character. Then you could do:
>> 
>> read.table(whatever, as.is=TRUE, colClasses = c(rep(character,4),
>> rep(NULL,3696)).
>> 
>> Hopefully someone else will say something that does the trick. The
>> timings seem odd to me. Good luck.
>> 
>> On Tue, Dec 6, 2011 at 1:55 PM, Gene Leynes <gleynes at gmail.com> wrote:
>> 
>>> Mark,
>>> 
>>> Thank you for the reply
>>> 
>>> I neglected to mention that I had already set
>>> options(stringsAsFactors=FALSE)
>>> 
>>> I agree, skipping the factor determination can help performance.
>>> 
>>> The main reason that I wanted to use read.table is because it will
>>> correctly determine the column classes for me.  I don't really want to
>>> specify 3700 column classes!  (I'm not sure what they are anyway).
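
One common way to avoid typing out 3700 classes is to let read.table guess them from the first few rows and then pass the result back in as colClasses for the full read. A sketch on a made-up temp file follows. The choice of 5 rows is arbitrary, and the guess can be wrong if later rows differ (for example, a column that looks integer at the top but has decimals further down), so it is an assumption to check against the real data.

```r
# Hypothetical stand-in file: 100 rows, mixed character and numeric columns.
path <- tempfile(fileext = ".txt")
demo <- data.frame(id = sprintf("r%03d", 1:100),
                   x = rnorm(100), n = 1:100,
                   stringsAsFactors = FALSE)
write.table(demo, path, sep = "\t", quote = FALSE, row.names = FALSE)

# Step 1: read a handful of rows and record the classes read.table infers.
head_rows <- read.table(path, header = TRUE, sep = "\t", nrows = 5,
                        stringsAsFactors = FALSE)
classes <- unname(vapply(head_rows, function(col) class(col)[1], character(1)))

# Step 2: full read with the classes supplied, so nothing is guessed per column.
dat <- read.table(path, header = TRUE, sep = "\t", colClasses = classes)
str(dat)
```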
>>> 
>>> 
>>> On Tue, Dec 6, 2011 at 12:40 PM, Mark Leeds <markleeds2 at gmail.com> wrote:
>>> 
>>>> Hi Gene: Sometimes using colClasses in read.table can speed things up.
>>>> If you know ahead of time what your variables are and what you want them
>>>> to be, colClasses lets you specify character, numeric, etc., and that
>>>> often makes things faster. Others will have more to say.
>>>> 
>>>> Also, if most of your variables are character, R will try to convert
>>>> them into factors by default. If you use as.is = TRUE it won't
>>>> do this, and that might speed things up also.
>>>> 
>>>> 
>>>> Caveat: the above tidbits are just from experience. I don't know whether
>>>> they are set in stone or a hard and fast rule.
>>>> 
>>>> On Tue, Dec 6, 2011 at 1:15 PM, Gene Leynes <gleynes at gmail.com> wrote:
>>>> 
>>>>> ** Disclaimer: I'm looking for general suggestions **
>>>>> I'm sorry, but can't send out the file I'm using, so there is no
>>>>> reproducible example.
>>>>> 
>>>>> I'm using read.table and it's taking over 30 seconds to read a tiny
>>>>> file. The strange thing is that it takes roughly the same amount of
>>>>> time if the file is 100 times larger.
>>>>> 
>>>>> After re-reviewing the R Data Import/Export manual I think the best
>>>>> approach would be to use Python, or perhaps the readLines function, but
>>>>> I was hoping to understand why the simple read.table approach wasn't
>>>>> working as expected.
>>>>> 
>>>>> Some relevant facts:
>>>>> 
>>>>>  1. There are about 3700 columns.  Maybe this is the problem?  Still,
>>>>>     the file size is not very large.
>>>>>  2. The file encoding is ANSI, but I'm not specifying that in the
>>>>>     function.  Setting fileEncoding="ANSI" produces an "unsupported
>>>>>     conversion" error.
>>>>>  3. readLines imports the lines quickly.
>>>>>  4. scan imports the file quickly also.
>>>>> 
>>>>> 
>>>>> Obviously, scan and readLines would require more coding to identify
>>>>> columns, etc.
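
For reference, the extra coding need not be much if the file is plain tab-separated text. A rough sketch with readLines and strsplit, on a made-up temp file, is below. It assumes no quoted fields, no embedded tabs, and no empty trailing fields (strsplit drops a trailing empty string), none of which is known for the real data.

```r
# Hypothetical 3-column file written just for this example.
path <- tempfile(fileext = ".txt")
writeLines(c("id\tx\tn", "a\t1.5\t1", "b\t2.5\t2", "c\t3.5\t3"), path)

lines  <- readLines(path)
fields <- strsplit(lines, "\t", fixed = TRUE)

header <- fields[[1]]
body   <- do.call(rbind, fields[-1])   # character matrix, one row per line

dat <- as.data.frame(body, stringsAsFactors = FALSE)
names(dat) <- header
# Convert each column from character to its natural type; strings stay strings.
dat[] <- lapply(dat, type.convert, as.is = TRUE)
str(dat)
```

This does by hand roughly what read.table does internally, so comparing its run time with read.table on the same file may help show where the time goes.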
>>>>> 
>>>>> my code:
>>>>> system.time(dat <- read.table('C:/test.txt', nrows=-1, sep='\t',
>>>>> header=TRUE))
>>>>> 
>>>>> It's taking 33.4 seconds and the file size is only 315 kb!
>>>>> 
>>>>> Thanks
>>>>> 
>>>>> Gene
>>>>> 
>>>>>       [[alternative HTML version deleted]]
>>>>> 
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com