[R] read.table performance

peter dalgaard pdalgd at gmail.com
Wed Dec 7 23:11:10 CET 2011


On Dec 7, 2011, at 22:37, R. Michael Weylandt wrote:

> R 2.13.2 on Mac OS X 10.5.8 takes about 1.8s to read the file
> verbatim: system.time(read.table("test2.txt"))

About 2.3s with R 2.14 on a 1.86 GHz MacBook Air running OS X 10.6.8.

Gene, are you by any chance storing the file in a heavily virus-scanned system directory?

-pd

> Michael
> 
> 2011/12/7 Gene Leynes <gleynes at gmail.com>:
>> Peter,
>> 
>> You're quite right; it's nearly impossible to make progress without a
>> working example.
>> 
>> I created an ** extremely simplified ** example for distribution.  The real
>> data has numeric, character, and logical (boolean) classes.
>> 
>> The file still takes 25.08 seconds to read, despite its small size.
>> 
>> I neglected to mention that I'm using R 2.13.0 on a Windows 7 machine
>> (not that it should particularly matter for this type of data and these
>> functions).
>> 
>> ## The code:
>> options(stringsAsFactors=FALSE)
>> system.time(dat <- read.table('test2.txt', nrows=-1, sep='\t', header=TRUE))
>> str(dat, 0)
>> 
>> 
>> Thanks again!
>> 
>> 
>> 
>> On Wed, Dec 7, 2011 at 1:21 AM, peter dalgaard <pdalgd at gmail.com> wrote:
>> 
>>> 
>>> On Dec 6, 2011, at 22:33, Gene Leynes wrote:
>>> 
>>>> Mark,
>>>> 
>>>> Thanks for your suggestions.
>>>> 
>>>> That's a good idea about the NULL columns; I didn't think of that.
>>>> Surprisingly, it didn't have any effect on the time.
>>> 
>>> Hmm, I think you want "character" and "NULL" there (i.e., quoted). Did you
>>> fix both?
>>> 
>>>>> read.table(whatever, as.is=TRUE, colClasses = c(rep(character,4),
>>>>> rep(NULL,3696)).
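>>> 
>>> I.e., something like this -- untested, and the column counts are from
>>> Mark's example rather than your actual file:
>>> 
>>>    read.table("test2.txt", as.is=TRUE,
>>>               colClasses=c(rep("character", 4), rep("NULL", 3696)))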
>>> 
>>> As a general matter, if you want people to dig into this, they need some
>>> paraphrase of the file to play with. Would it be possible to set up a small
>>> R program that generates a data file which displays the issue? Everything I
>>> try seems to take about a second to read in.
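>>> 
>>> E.g., a rough sketch along these lines (adjust the dimensions to match
>>> your real file):
>>> 
>>>    m <- matrix(rnorm(250*3700), nrow=250)
>>>    write.table(m, 'test2.txt', sep='\t', quote=FALSE, row.names=FALSE)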
>>> 
>>> -pd
>>> 
>>>> 
>>>> This problem was just a curiosity; I already did the import using
>>>> Excel and VBA.  I was just going to illustrate the power and
>>>> simplicity of R, but ironically it's been much slower and harder in R...
>>>> The VBA was painful and messy, and took me over an hour to write; but
>>>> at least it worked quickly and reliably.
>>>> The R code was clean and only took me about 5 minutes to write, but
>>>> the run time was prohibitively slow!
>>>> 
>>>> I profiled the code, but that offers little insight to me.
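>>>> 
>>>> For anyone who wants to reproduce this, the profiles below were
>>>> gathered roughly like so (modulo the exact paths):
>>>> 
>>>>    Rprof("C:/Users/gene.leynes/Desktop/test.out")
>>>>    dat <- read.table('test2.txt', sep='\t', header=TRUE)
>>>>    Rprof(NULL)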
>>>> 
>>>> Profile results with a 10-line file:
>>>> 
>>>>> summaryRprof("C:/Users/gene.leynes/Desktop/test.out")
>>>> $by.self
>>>>             self.time self.pct total.time total.pct
>>>> scan             12.24    53.50      12.24     53.50
>>>> read.table       10.58    46.24      22.88    100.00
>>>> type.convert      0.04     0.17       0.04      0.17
>>>> make.names        0.02     0.09       0.02      0.09
>>>> 
>>>> $by.total
>>>>             total.time total.pct self.time self.pct
>>>> read.table        22.88    100.00     10.58    46.24
>>>> scan              12.24     53.50     12.24    53.50
>>>> type.convert       0.04      0.17      0.04     0.17
>>>> make.names         0.02      0.09      0.02     0.09
>>>> 
>>>> $sample.interval
>>>> [1] 0.02
>>>> 
>>>> $sampling.time
>>>> [1] 22.88
>>>> 
>>>> 
>>>> Profile results with a 250-line file:
>>>> 
>>>>> summaryRprof("C:/Users/gene.leynes/Desktop/test.out")
>>>> $by.self
>>>>             self.time self.pct total.time total.pct
>>>> scan             23.88    68.15      23.88     68.15
>>>> read.table       10.78    30.76      35.04    100.00
>>>> type.convert      0.30     0.86       0.32      0.91
>>>> character         0.02     0.06       0.02      0.06
>>>> file              0.02     0.06       0.02      0.06
>>>> lapply            0.02     0.06       0.02      0.06
>>>> unlist            0.02     0.06       0.02      0.06
>>>> 
>>>> $by.total
>>>>               total.time total.pct self.time self.pct
>>>> read.table          35.04    100.00     10.78    30.76
>>>> scan                23.88     68.15     23.88    68.15
>>>> type.convert         0.32      0.91      0.30     0.86
>>>> sapply               0.04      0.11      0.00     0.00
>>>> character            0.02      0.06      0.02     0.06
>>>> file                 0.02      0.06      0.02     0.06
>>>> lapply               0.02      0.06      0.02     0.06
>>>> unlist               0.02      0.06      0.02     0.06
>>>> simplify2array       0.02      0.06      0.00     0.00
>>>> 
>>>> $sample.interval
>>>> [1] 0.02
>>>> 
>>>> $sampling.time
>>>> [1] 35.04
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Tue, Dec 6, 2011 at 2:34 PM, Mark Leeds <markleeds2 at gmail.com> wrote:
>>>> 
>>>>> hi gene: maybe someone else will reply with some subtleties that I'm
>>>>> not aware of. one other thing that might help: if you know which
>>>>> columns you want, you can set the others to NULL through colClasses
>>>>> and this should speed things up also. For example, say you knew you
>>>>> only wanted the first four columns and they were character. then you
>>>>> could do,
>>>>> 
>>>>> read.table(whatever, as.is=TRUE, colClasses = c(rep(character,4),
>>>>> rep(NULL,3696)).
>>>>> 
>>>>> hopefully someone else will say something that does the trick. the
>>>>> difference in timings seems odd to me. good luck.
>>>>> 
>>>>> On Tue, Dec 6, 2011 at 1:55 PM, Gene Leynes <gleynes at gmail.com> wrote:
>>>>> 
>>>>>> Mark,
>>>>>> 
>>>>>> Thank you for the reply
>>>>>> 
>>>>>> I neglected to mention that I had already set
>>>>>> options(stringsAsFactors=FALSE)
>>>>>> 
>>>>>> I agree, skipping the factor determination can help performance.
>>>>>> 
>>>>>> The main reason that I wanted to use read.table is that it will
>>>>>> correctly determine the column classes for me.  I don't really want
>>>>>> to specify 3700 column classes!  (I'm not sure what they are anyway.)
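>>>>>> 
>>>>>> (One workaround I may try: let R infer the classes from a handful of
>>>>>> rows and then reuse them -- untested sketch:
>>>>>> 
>>>>>>    samp <- read.table('test2.txt', sep='\t', header=TRUE, nrows=5)
>>>>>>    dat  <- read.table('test2.txt', sep='\t', header=TRUE,
>>>>>>                       colClasses=sapply(samp, class))
>>>>>> 
>>>>>> which at least avoids typing 3700 classes by hand.)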
>>>>>> 
>>>>>> 
>>>>>> On Tue, Dec 6, 2011 at 12:40 PM, Mark Leeds <markleeds2 at gmail.com> wrote:
>>>>>> 
>>>>>>> Hi Gene: Sometimes using colClasses in read.table can speed things
>>>>>>> up. If you know what your variables are ahead of time and what you
>>>>>>> want them to be, this lets you be specific by giving "character",
>>>>>>> "numeric", etc., and often it makes things faster. others will have
>>>>>>> more to say.
>>>>>>> 
>>>>>>> also, if most of your variables are characters, R will try to
>>>>>>> convert them into factors by default. If you use as.is = TRUE it
>>>>>>> won't do this, and that might speed things up also.
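>>>>>>> 
>>>>>>> e.g., something of this shape -- the column types here are made up,
>>>>>>> just to show the call:
>>>>>>> 
>>>>>>>    dat <- read.table('C:/test.txt', sep='\t', header=TRUE,
>>>>>>>                      as.is=TRUE,
>>>>>>>                      colClasses=c('character', rep('numeric', 3699)))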
>>>>>>> 
>>>>>>> 
>>>>>>> Rejoinder: the above tidbits are just from experience. I don't know
>>>>>>> whether they're set in stone or hard and fast rules.
>>>>>>> 
>>>>>>> On Tue, Dec 6, 2011 at 1:15 PM, Gene Leynes <gleynes at gmail.com> wrote:
>>>>>>> 
>>>>>>>> ** Disclaimer: I'm looking for general suggestions **
>>>>>>>> I'm sorry, but I can't send out the file I'm using, so there is no
>>>>>>>> reproducible example.
>>>>>>>> 
>>>>>>>> I'm using read.table and it's taking over 30 seconds to read a tiny
>>>>>>>> file.  The strange thing is that it takes roughly the same amount
>>>>>>>> of time if the file is 100 times larger.
>>>>>>>> 
>>>>>>>> After re-reviewing the R Data Import/Export manual I think the best
>>>>>>>> approach would be to use Python, or perhaps the readLines function,
>>>>>>>> but I was hoping to understand why the simple read.table approach
>>>>>>>> wasn't working as expected.
>>>>>>>> 
>>>>>>>> Some relevant facts:
>>>>>>>> 
>>>>>>>>  1. There are about 3700 columns.  Maybe this is the problem?
>>>>>>>>     Still, the file size is not very large.
>>>>>>>>  2. The file encoding is ANSI, but I'm not specifying that in the
>>>>>>>>     function.  Setting fileEncoding="ANSI" produces an "unsupported
>>>>>>>>     conversion" error.  (See the note after this list.)
>>>>>>>>  3. readLines imports the lines quickly
>>>>>>>>  4. scan imports the file quickly also
>>>>>>>> 
>>>>>>>> Obviously, scan and readLines would require more coding to identify
>>>>>>>> columns, etc.
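>>>>>>>> 
>>>>>>>> (Roughly along these lines, I imagine -- an untested sketch in
>>>>>>>> which every column comes in as character:
>>>>>>>> 
>>>>>>>>    hdr <- strsplit(readLines('C:/test.txt', n=1), '\t')[[1]]
>>>>>>>>    dat <- scan('C:/test.txt', what=as.list(rep('', length(hdr))),
>>>>>>>>                sep='\t', skip=1)
>>>>>>>>    names(dat) <- hdr
>>>>>>>>    dat <- as.data.frame(dat, stringsAsFactors=FALSE)
>>>>>>>> 
>>>>>>>> leaving the type conversion still to be done by hand.)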
>>>>>>>> 
>>>>>>>> my code:
>>>>>>>> 
>>>>>>>>    system.time(dat <- read.table('C:/test.txt', nrows=-1, sep='\t',
>>>>>>>>                                  header=TRUE))
>>>>>>>> 
>>>>>>>> It's taking 33.4 seconds and the file size is only 315 KB!
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> 
>>>>>>>> Gene
>>>>>>>> 
>>> 
>>> --
>>> Peter Dalgaard, Professor,
>>> Center for Statistics, Copenhagen Business School
>>> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
>>> Phone: (+45)38153501
>>> Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
>>> 
>> 

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com


