[R] read.table performance

Gene Leynes gleynes at gmail.com
Thu Dec 8 06:10:51 CET 2011


No, it was just on my desktop (and on a network drive, and in a temp
folder on my C drive).

There have been some new policies put into place at work though, and
perhaps that includes more / some monitoring software, but I don't
know.

On Dec 7, 2011, at 4:11 PM, peter dalgaard <pdalgd at gmail.com> wrote:

>
> On Dec 7, 2011, at 22:37 , R. Michael Weylandt wrote:
>
>> R 2.13.2 on Mac OS X 10.5.8 takes about 1.8s to read the file
>> verbatim: system.time(read.table("test2.txt"))
>
> About 2.3s with 2.14 on a 1.86 GHz MacBook Air 10.6.8.
>
> Gene, are you by any chance storing the file in a heavily virus-scanned system directory?
>
> -pd
>
>> Michael
>>
>> 2011/12/7 Gene Leynes <gleynes at gmail.com>:
>>> Peter,
>>>
>>> You're quite right; it's nearly impossible to make progress without a
>>> working example.
>>>
>>> I created an ** extremely simplified ** example for distribution.  The real
>>> data has numeric, character, and boolean classes.
>>>
>>> The file still takes 25.08 seconds to read, despite its small size.
>>>
>>> I neglected to mention that I'm using R 2.13.0 and I'm on a Windows 7
>>> machine (not that it should particularly matter with this type of data /
>>> functions).
>>>
>>> ## The code:
>>> options(stringsAsFactors=FALSE)
>>> system.time(dat <- read.table('test2.txt', nrows=-1, sep='\t', header=TRUE))
>>> str(dat, 0)
>>>
>>>
>>> Thanks again!
>>>
>>>
>>>
>>> On Wed, Dec 7, 2011 at 1:21 AM, peter dalgaard <pdalgd at gmail.com> wrote:
>>>
>>>>
>>>> On Dec 6, 2011, at 22:33 , Gene Leynes wrote:
>>>>
>>>>> Mark,
>>>>>
>>>>> Thanks for your suggestions.
>>>>>
>>>>> That's a good idea about the NULL columns; I didn't think of that.
>>>>> Surprisingly, it didn't have any effect on the time.
>>>>
>>>> Hmm, I think you want "character" and "NULL" there (i.e., quoted). Did you
>>>> fix both?
>>>>
>>>>>> read.table(whatever, as.is=TRUE, colClasses = c(rep(character,4),
>>>>>> rep(NULL,3696)).
>>>>
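A minimal sketch of the quoted form suggested above, assuming a small hypothetical tab-delimited file (the temp file and the column names c1..c8 are made up for illustration). colClasses expects a character vector of class names, so "character" and "NULL" are passed as strings, and a column marked "NULL" is skipped.

```r
# Build a small hypothetical tab-delimited file with 8 columns.
tmp <- tempfile(fileext = ".txt")
demo <- as.data.frame(matrix(letters[1:24], nrow = 3), stringsAsFactors = FALSE)
names(demo) <- paste0("c", 1:8)
write.table(demo, tmp, sep = "\t", row.names = FALSE, quote = FALSE)

# Class names are quoted strings; "NULL" tells read.table to skip a column.
classes <- c(rep("character", 4), rep("NULL", 4))
dat <- read.table(tmp, header = TRUE, sep = "\t", colClasses = classes)

str(dat)  # only the first four columns are kept
```

For the real file the same pattern would use rep("NULL", 3696) for the columns to drop.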
>>>> As a general matter, if you want people to dig into this, they need some
>>>> paraphrase of the file to play with. Would it be possible to set up a small
>>>> R program that generates a data file which displays the issue? Everything I
>>>> try seems to take about a second to read in.
>>>>
>>>> -pd
>>>>
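A small generator of the kind requested here might look like the sketch below. Everything about the generated data is an assumption: 10 rows, 3700 columns cycling through character, logical, and numeric values, and column names V1..V3700. It is not the original file and may not reproduce the slowdown; it only gives others something comparable to time.

```r
set.seed(1)
n_rows <- 10
n_cols <- 3700

# Cycle through numeric, character, and logical columns (hypothetical layout).
make_col <- function(j) {
  switch(j %% 3 + 1,
         round(rnorm(n_rows), 3),
         sample(letters, n_rows, replace = TRUE),
         sample(c(TRUE, FALSE), n_rows, replace = TRUE))
}
cols <- lapply(seq_len(n_cols), make_col)
names(cols) <- paste0("V", seq_len(n_cols))
dat_out <- as.data.frame(cols, stringsAsFactors = FALSE)

# Write a tab-delimited file into the session's temp directory.
path <- file.path(tempdir(), "test2.txt")
write.table(dat_out, path, sep = "\t", row.names = FALSE, quote = FALSE)

# Time the same kind of call used in the thread.
timing <- system.time(
  dat <- read.table(path, nrows = -1, sep = "\t", header = TRUE,
                    stringsAsFactors = FALSE)
)
print(timing)
str(dat, 0)
```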
>>>>>
>>>>> This problem was just a curiosity; I already did the import using Excel
>>>>> and VBA.  I was just going to illustrate the power and simplicity of R,
>>>>> but ironically it's been much slower and harder in R...
>>>>> The VBA was painful and messy, and took me over an hour to write; but at
>>>>> least it worked quickly and reliably.
>>>>> The R code was clean and only took me about 5 minutes to write, but the
>>>>> run time was prohibitively slow!
>>>>>
>>>>> I profiled the code, but that offers little insight to me.
>>>>>
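For readers who want to produce a profile like the ones quoted in this message, a sketch using Rprof and summaryRprof follows. The input here is a hypothetical wide numeric file written to a temp location, not the file from the thread, and the read is repeated only so that the sampler has enough work to record.

```r
# Hypothetical wide input: 50 rows by 3000 numeric columns in a temp file.
tmp <- tempfile(fileext = ".txt")
wide <- as.data.frame(matrix(runif(50 * 3000), nrow = 50))
write.table(wide, tmp, sep = "\t", row.names = FALSE, quote = FALSE)

prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.01)   # start sampling the call stack
for (i in 1:10) {                   # repeat so the sampler sees enough work
  dat <- read.table(tmp, header = TRUE, sep = "\t")
}
Rprof(NULL)                         # stop profiling

# Summarise time per function, as in the $by.self and $by.total tables.
prof <- summaryRprof(prof_file)
print(head(prof$by.self))
```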
>>>>> Profile results with 10 line file:
>>>>>
>>>>>> summaryRprof("C:/Users/gene.leynes/Desktop/test.out")
>>>>> $by.self
>>>>>            self.time self.pct total.time total.pct
>>>>> scan             12.24    53.50      12.24     53.50
>>>>> read.table       10.58    46.24      22.88    100.00
>>>>> type.convert      0.04     0.17       0.04      0.17
>>>>> make.names        0.02     0.09       0.02      0.09
>>>>>
>>>>> $by.total
>>>>>            total.time total.pct self.time self.pct
>>>>> read.table        22.88    100.00     10.58    46.24
>>>>> scan              12.24     53.50     12.24    53.50
>>>>> type.convert       0.04      0.17      0.04     0.17
>>>>> make.names         0.02      0.09      0.02     0.09
>>>>>
>>>>> $sample.interval
>>>>> [1] 0.02
>>>>>
>>>>> $sampling.time
>>>>> [1] 22.88
>>>>>
>>>>>
>>>>> Profile results with 250 line file:
>>>>>
>>>>>> summaryRprof("C:/Users/gene.leynes/Desktop/test.out")
>>>>> $by.self
>>>>>            self.time self.pct total.time total.pct
>>>>> scan             23.88    68.15      23.88     68.15
>>>>> read.table       10.78    30.76      35.04    100.00
>>>>> type.convert      0.30     0.86       0.32      0.91
>>>>> character         0.02     0.06       0.02      0.06
>>>>> file              0.02     0.06       0.02      0.06
>>>>> lapply            0.02     0.06       0.02      0.06
>>>>> unlist            0.02     0.06       0.02      0.06
>>>>>
>>>>> $by.total
>>>>>              total.time total.pct self.time self.pct
>>>>> read.table          35.04    100.00     10.78    30.76
>>>>> scan                23.88     68.15     23.88    68.15
>>>>> type.convert         0.32      0.91      0.30     0.86
>>>>> sapply               0.04      0.11      0.00     0.00
>>>>> character            0.02      0.06      0.02     0.06
>>>>> file                 0.02      0.06      0.02     0.06
>>>>> lapply               0.02      0.06      0.02     0.06
>>>>> unlist               0.02      0.06      0.02     0.06
>>>>> simplify2array       0.02      0.06      0.00     0.00
>>>>>
>>>>> $sample.interval
>>>>> [1] 0.02
>>>>>
>>>>> $sampling.time
>>>>> [1] 35.04
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Dec 6, 2011 at 2:34 PM, Mark Leeds <markleeds2 at gmail.com> wrote:
>>>>>
>>>>>> Hi Gene: maybe someone else will reply with some subtleties that I'm not
>>>>>> aware of. One other thing that might help: if you know which columns you
>>>>>> want, you can set the others to NULL through colClasses, and this should
>>>>>> speed things up also. For example, say you knew you only wanted the first
>>>>>> four columns and they were character. Then you could do,
>>>>>>
>>>>>> read.table(whatever, as.is=TRUE, colClasses = c(rep(character,4),
>>>>>> rep(NULL,3696)).
>>>>>>
>>>>>> Hopefully someone else will say something that does the trick. The
>>>>>> difference in timings seems odd to me. Good luck.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Dec 6, 2011 at 1:55 PM, Gene Leynes <gleynes at gmail.com> wrote:
>>>>>>
>>>>>>> Mark,
>>>>>>>
>>>>>>> Thank you for the reply
>>>>>>>
>>>>>>> I neglected to mention that I had already set
>>>>>>> options(stringsAsFactors=FALSE)
>>>>>>>
>>>>>>> I agree, skipping the factor determination can help performance.
>>>>>>>
>>>>>>> The main reason that I wanted to use read.table is because it will
>>>>>>> correctly determine the column classes for me.  I don't really want to
>>>>>>> specify 3700 column classes!  (I'm not sure what they are anyway).
>>>>>>>
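One common way to avoid typing thousands of column classes by hand is to let read.table guess them from the first few rows and then reuse that guess for the full read. The sketch below assumes a small hypothetical file (columns id, x, flag); it also assumes the first rows are representative, which fails if, say, a column looks integer at the top and holds decimals or text further down.

```r
# Hypothetical sample file with character, numeric, and logical columns.
tmp <- tempfile(fileext = ".txt")
demo <- data.frame(id   = sprintf("row%03d", 1:100),
                   x    = seq(0.5, 50, by = 0.5),
                   flag = rep(c(TRUE, FALSE), 50),
                   stringsAsFactors = FALSE)
write.table(demo, tmp, sep = "\t", row.names = FALSE, quote = FALSE)

# Let read.table guess the classes from the first few rows only.
first_rows <- read.table(tmp, header = TRUE, sep = "\t", nrows = 5,
                         stringsAsFactors = FALSE)
classes <- vapply(first_rows, function(col) class(col)[1], character(1))

# Reuse the guessed classes for the full read.
dat <- read.table(tmp, header = TRUE, sep = "\t", colClasses = classes)
str(dat)
```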
>>>>>>>
>>>>>>> On Tue, Dec 6, 2011 at 12:40 PM, Mark Leeds <markleeds2 at gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Gene: Sometimes using colClasses in read.table can speed things up.
>>>>>>>> If you know what your variables are ahead of time and what you want
>>>>>>>> them to be, this allows you to be specific by specifying character or
>>>>>>>> numeric, etc., and often it makes things faster. Others will have more
>>>>>>>> to say.
>>>>>>>>
>>>>>>>> Also, if most of your variables are characters, R will try to convert
>>>>>>>> them into factors by default. If you use as.is = TRUE it won't do this,
>>>>>>>> and that might speed things up also.
>>>>>>>>
>>>>>>>>
>>>>>>>> Rejoinder: the above tidbits are just from experience. I don't know if
>>>>>>>> they're set in stone or a hard and fast rule.
>>>>>>>>
>>>>>>>> On Tue, Dec 6, 2011 at 1:15 PM, Gene Leynes <gleynes at gmail.com> wrote:
>>>>>>>>
>>>>>>>>> ** Disclaimer: I'm looking for general suggestions **
>>>>>>>>> I'm sorry, but can't send out the file I'm using, so there is no
>>>>>>>>> reproducible example.
>>>>>>>>>
>>>>>>>>> I'm using read.table and it's taking over 30 seconds to read a tiny
>>>>>>>>> file. The strange thing is that it takes roughly the same amount of
>>>>>>>>> time if the file is 100 times larger.
>>>>>>>>>
>>>>>>>>> After re-reviewing the R Data Import/Export manual I think the best
>>>>>>>>> approach would be to use Python, or perhaps the readLines function, but
>>>>>>>>> I was hoping to understand why the simple read.table approach wasn't
>>>>>>>>> working as expected.
>>>>>>>>>
>>>>>>>>> Some relevant facts:
>>>>>>>>>
>>>>>>>>> 1. There are about 3700 columns.  Maybe this is the problem?  Still,
>>>>>>>>>    the file size is not very large.
>>>>>>>>> 2. The file encoding is ANSI, but I'm not specifying that in the
>>>>>>>>>    function.  Setting fileEncoding="ANSI" produces an "unsupported
>>>>>>>>>    conversion" error.
>>>>>>>>> 3. readLines imports the lines quickly.
>>>>>>>>> 4. scan imports the file quickly also.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Obviously, scan and readLines would require more coding to identify
>>>>>>>>> columns, etc.
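To give a sense of the extra coding the readLines route involves, here is a minimal sketch on a tiny hypothetical file (columns name, value, flag are made up). It assumes tab-separated fields, a header row, no quoting, and no empty trailing fields, since strsplit drops those.

```r
# Tiny hypothetical tab-delimited file with a header row.
tmp <- tempfile(fileext = ".txt")
writeLines(c("name\tvalue\tflag",
             "a\t1.5\tTRUE",
             "b\t2.5\tFALSE"), tmp)

lines  <- readLines(tmp)
fields <- strsplit(lines, "\t", fixed = TRUE)   # split each line on tabs

header <- fields[[1]]
body   <- do.call(rbind, fields[-1])            # character matrix of the data rows

dat <- as.data.frame(body, stringsAsFactors = FALSE)
names(dat) <- header

# Convert each column from character to its most specific type.
dat[] <- lapply(dat, type.convert, as.is = TRUE)
str(dat)
```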
>>>>>>>>>
>>>>>>>>> my code:
>>>>>>>>> system.time(dat <- read.table('C:/test.txt', nrows=-1, sep='\t',
>>>>>>>>> header=TRUE))
>>>>>>>>>
>>>>>>>>> It's taking 33.4 seconds and the file size is only 315 kb!
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> Gene
>>>>>>>>>
>>>>>>>>> ______________________________________________
>>>>>>>>> R-help at r-project.org mailing list
>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>>>> PLEASE do read the posting guide
>>>>>>>>> http://www.R-project.org/posting-guide.html
>>>>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>>>>
>>>>
>>>> --
>>>> Peter Dalgaard, Professor,
>>>> Center for Statistics, Copenhagen Business School
>>>> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
>>>> Phone: (+45)38153501
>>>> Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com


