[R] read.table performance

Thu Dec 8 10:06:03 CET 2011

On 08/12/11 09:32, Petr PIKAL wrote:
> Hi
> 
>> system.time(dat<-read.table("test2.txt"))
>    user  system elapsed
>   32.38    0.00   32.40
> 
>> system.time(dat <- read.table('test2.txt', nrows=-1, sep='\t',
> header=TRUE))
>    user  system elapsed
>   32.30    0.03   32.36
> 
> Couldn't it be a Windows issue?

Likely - here on Linux I get:

> system.time(dat <- read.table('tmp/test2.txt', nrows=-1, sep='\t',
header=TRUE))
   user  system elapsed
  1.560   0.000   1.579
> sessionInfo()
R version 2.14.0 (2011-10-31)
Platform: i686-pc-linux-gnu (32-bit)

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
> version
               _
platform       i686-pc-linux-gnu
arch           i686
os             linux-gnu
system         i686, linux-gnu
status
major          2
minor          14.0
year           2011
month          10
day            31
svn rev        57496
language       R
version.string R version 2.14.0 (2011-10-31)
> 
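
For anyone who wants to repeat the timing without the original file, here is a minimal sketch. It is not Gene's test2.txt: the 7 x 3765 shape is taken from the dim(dat) output quoted below, but the element names, the random values, and the temporary path are assumptions made up for illustration.

```r
# Synthetic stand-in for test2.txt: 7 rows by 3765 tab-separated columns.
# The real file is not reproduced here, so names and values are made up.
set.seed(1)
n_rows <- 7
n_cols <- 3765

# Reuse a small pool of names so that the header contains duplicates.
col_names <- rep(c("Hydrogen", "Helium", "Lithium", "Beryllium", "Boron"),
                 length.out = n_cols)
values <- matrix(round(runif(n_rows * n_cols), 3), nrow = n_rows)

path <- tempfile(fileext = ".txt")
write.table(values, file = path, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = col_names)

# Same call as in the thread, pointed at the synthetic file.
timing <- system.time(
  dat <- read.table(path, nrows = -1, sep = "\t", header = TRUE)
)
print(timing)
print(dim(dat))
```

If this reads in a second or two on one machine and in half a minute on another, that would point at the platform or the R build rather than at the file contents.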

Cheers,

Rainer

>                _
> platform       i386-pc-mingw32
> arch           i386
> os             mingw32
> system         i386, mingw32
> status         Under development (unstable)
> major          2
> minor          14.0
> year           2011
> month          04
> day            27
> svn rev        55657
> language       R
> version.string R version 2.14.0 Under development (unstable) (2011-04-27 r55657)
>> 
> 
> 
>> dim(dat)
> [1]    7 3765
>> 
> 
> But from the dat file it seems to me that its structure is somehow
> weird.
> 
>> head(names(dat))
> [1] "X..Hydrogen" "Helium"      "Lithium"     "Beryllium"   "Boron"
>  [6] "Carbon"
>> tail(names(dat))
> [1] "Sulfur.32"    "Chlorine.32"  "Argon.32"     "Potassium.32" "Calcium.32"
> [6] "Scandium.32"
>> 
> 
> There is a row of names with repeating values. Maybe most of the
> time is spent checking the validity of the names.
> 
> Regards Petr
> 
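
One way to check the name-validity idea is to time the same read with and without check.names. The sketch below uses a made-up one-row file with a repeating header (the names, values, and path are assumptions, not the real data). Note that the Rprof output quoted further down attributes only about 0.02 s to make.names, so this may well not be where the time goes.

```r
# Small synthetic file with a repeating header (hypothetical data).
n_cols <- 3765
header <- rep(c("Hydrogen", "Helium", "Lithium"), length.out = n_cols)
path <- tempfile(fileext = ".txt")
writeLines(c(paste(header, collapse = "\t"),
             paste(rep("1.5", n_cols), collapse = "\t")), path)

# Default: names are passed through make.names(unique = TRUE).
t_checked <- system.time(
  checked <- read.table(path, sep = "\t", header = TRUE)
)

# check.names = FALSE keeps the header verbatim, duplicates included.
t_raw <- system.time(
  raw <- read.table(path, sep = "\t", header = TRUE, check.names = FALSE)
)

# Compare elapsed times and the resulting names.
print(c(checked = t_checked[["elapsed"]], raw = t_raw[["elapsed"]]))
print(head(names(checked), 4))
print(head(names(raw), 4))
```

If the two elapsed times are close, the name checking is not the bottleneck.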
> r-help-bounces at r-project.org wrote on 07.12.2011 23:11:10:
> 
>> peter dalgaard <pdalgd at gmail.com>
>> Sent by: r-help-bounces at r-project.org
>> 07.12.2011 23:11
>> 
>> To: "R. Michael Weylandt" <michael.weylandt at gmail.com>
>> Cc: r-help at r-project.org, Gene Leynes <gleynes at gmail.com>
>> Subject: Re: [R] read.table performance
>> 
>> 
>> On Dec 7, 2011, at 22:37 , R. Michael Weylandt wrote:
>> 
>>> R 2.13.2 on Mac OS X 10.5.8 takes about 1.8s to read the file 
>>> verbatim: system.time(read.table("test2.txt"))
>> 
>> About 2.3s with 2.14 on a 1.86 GHz MacBook Air 10.6.8.
>> 
>> Gene, are you by any chance storing the file in a heavily
>> virus-scanned system directory?
>> 
>> -pd
>> 
>>> Michael
>>> 
>>> 2011/12/7 Gene Leynes <gleynes at gmail.com>:
>>>> Peter,
>>>> 
>>>> You're quite right; it's nearly impossible to make progress
>>>> without a working example.
>>>> 
>>>> I created an ** extremely simplified ** example for
>>>> distribution. The real data has numeric, character, and
>>>> boolean classes.
>>>> 
>>>> The file still takes 25.08 seconds to read, despite its
>>>> small size.
>>>> 
>>>> I neglected to mention that I'm using R 2.13.0 and I'm on a
>>>> Windows 7 machine (not that it should particularly matter
>>>> with this type of data / functions).
>>>> 
>>>> ## The code:
>>>> options(stringsAsFactors=FALSE)
>>>> system.time(dat <- read.table('test2.txt', nrows=-1, sep='\t',
>>>>                               header=TRUE))
>>>> str(dat, 0)
>>>> 
>>>> 
>>>> Thanks again!
>>>> 
>>>> 
>>>> 
>>>> On Wed, Dec 7, 2011 at 1:21 AM, peter dalgaard
>>>> <pdalgd at gmail.com> wrote:
>>>> 
>>>>> 
>>>>> On Dec 6, 2011, at 22:33 , Gene Leynes wrote:
>>>>> 
>>>>>> Mark,
>>>>>> 
>>>>>> Thanks for your suggestions.
>>>>>> 
>>>>>> That's a good idea about the NULL columns; I didn't think
>>>>>> of that. Surprisingly, it didn't have any effect on the
>>>>>> time.
>>>>> 
>>>>> Hmm, I think you want "character" and "NULL" there (i.e.,
>>>>> quoted). Did you fix both?
>>>>> 
>>>>>>> read.table(whatever, as.is=TRUE, colClasses =
>>>>>>> c(rep(character,4), rep(NULL,3696)).
>>>>> 
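
To spell out the quoting point: a minimal sketch on a made-up ten-column file (the column names, values, and path are assumptions for illustration only). colClasses takes class names as strings, and the string "NULL" marks a column to be skipped.

```r
# Hypothetical file: 4 character columns followed by 6 columns to skip.
path <- tempfile(fileext = ".txt")
header <- c(paste0("id", 1:4), paste0("skip", 1:6))
rows <- c(paste(header, collapse = "\t"),
          paste(c("a", "b", "c", "d", 1:6), collapse = "\t"),
          paste(c("e", "f", "g", "h", 7:12), collapse = "\t"))
writeLines(rows, path)

# rep(character, 4) would try to replicate the function `character` and fail;
# colClasses expects class names as strings, with "NULL" marking skipped columns.
classes <- c(rep("character", 4), rep("NULL", 6))

dat <- read.table(path, sep = "\t", header = TRUE, colClasses = classes)
str(dat)
```

Only the four character columns come back; the six "NULL" columns are dropped at read time.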
>>>>> As a general matter, if you want people to dig into this,
>>>>> they need some paraphrase of the file to play with. Would it
>>>>> be possible to set up a small R program that generates a data
>>>>> file which displays the issue? Everything I try seems to take
>>>>> about a second to read in.
>>>>> 
>>>>> -pd
>>>>> 
>>>>>> 
>>>>>> This problem was just a curiosity; I already did the
>>>>>> import using Excel and VBA. I was just going to illustrate
>>>>>> the power and simplicity of R, but ironically it's been much
>>>>>> slower and harder in R... The VBA was painful and messy, and
>>>>>> took me over an hour to write; but at least it worked quickly
>>>>>> and reliably. The R code was clean and only took me about 5
>>>>>> minutes to write, but the run time was prohibitively slow!
>>>>>> 
>>>>>> I profiled the code, but that offers little insight to
>>>>>> me.
>>>>>> 
>>>>>> Profile results with 10 line file:
>>>>>> 
>>>>>>> summaryRprof("C:/Users/gene.leynes/Desktop/test.out")
>>>>>> $by.self
>>>>>>              self.time self.pct total.time total.pct
>>>>>> scan             12.24    53.50      12.24     53.50
>>>>>> read.table       10.58    46.24      22.88    100.00
>>>>>> type.convert      0.04     0.17       0.04      0.17
>>>>>> make.names        0.02     0.09       0.02      0.09
>>>>>> 
>>>>>> $by.total
>>>>>>              total.time total.pct self.time self.pct
>>>>>> read.table        22.88    100.00     10.58    46.24
>>>>>> scan              12.24     53.50     12.24    53.50
>>>>>> type.convert       0.04      0.17      0.04     0.17
>>>>>> make.names         0.02      0.09      0.02     0.09
>>>>>> 
>>>>>> $sample.interval
>>>>>> [1] 0.02
>>>>>> 
>>>>>> $sampling.time
>>>>>> [1] 22.88
>>>>>> 
>>>>>> 
>>>>>> Profile results with 250 line file:
>>>>>> 
>>>>>>> summaryRprof("C:/Users/gene.leynes/Desktop/test.out")
>>>>>> $by.self
>>>>>>              self.time self.pct total.time total.pct
>>>>>> scan             23.88    68.15      23.88     68.15
>>>>>> read.table       10.78    30.76      35.04    100.00
>>>>>> type.convert      0.30     0.86       0.32      0.91
>>>>>> character         0.02     0.06       0.02      0.06
>>>>>> file              0.02     0.06       0.02      0.06
>>>>>> lapply            0.02     0.06       0.02      0.06
>>>>>> unlist            0.02     0.06       0.02      0.06
>>>>>> 
>>>>>> $by.total
>>>>>>                total.time total.pct self.time self.pct
>>>>>> read.table          35.04    100.00     10.78    30.76
>>>>>> scan                23.88     68.15     23.88    68.15
>>>>>> type.convert         0.32      0.91      0.30     0.86
>>>>>> sapply               0.04      0.11      0.00     0.00
>>>>>> character            0.02      0.06      0.02     0.06
>>>>>> file                 0.02      0.06      0.02     0.06
>>>>>> lapply               0.02      0.06      0.02     0.06
>>>>>> unlist               0.02      0.06      0.02     0.06
>>>>>> simplify2array       0.02      0.06      0.00     0.00
>>>>>> 
>>>>>> $sample.interval
>>>>>> [1] 0.02
>>>>>> 
>>>>>> $sampling.time
>>>>>> [1] 35.04
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Tue, Dec 6, 2011 at 2:34 PM, Mark Leeds
>>>>>> <markleeds2 at gmail.com> wrote:
>>>>>> 
>>>>>>> Hi Gene: maybe someone else will reply with some
>>>>>>> subtleties that I'm not aware of. One other thing that might
>>>>>>> help: if you know which columns you want, you can set the
>>>>>>> others to NULL through colClasses, and this should speed
>>>>>>> things up also. For example, say you knew you only wanted
>>>>>>> the first four columns and they were character. Then you
>>>>>>> could do,
>>>>>>> 
>>>>>>> read.table(whatever, as.is=TRUE, colClasses =
>>>>>>> c(rep(character,4), rep(NULL,3696)).
>>>>>>> 
>>>>>>> Hopefully someone else will say something that does the
>>>>>>> trick. The difference in timings seems odd to me. Good
>>>>>>> luck.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, Dec 6, 2011 at 1:55 PM, Gene Leynes
>>>>>>> <gleynes at gmail.com> wrote:
>>>>>>> 
>>>>>>>> Mark,
>>>>>>>> 
>>>>>>>> Thank you for the reply
>>>>>>>> 
>>>>>>>> I neglected to mention that I had already set 
>>>>>>>> options(stringsAsFactors=FALSE)
>>>>>>>> 
>>>>>>>> I agree, skipping the factor determination can help
>>>>>>>> performance.
>>>>>>>> 
>>>>>>>> The main reason that I wanted to use read.table is
>>>>>>>> because it will correctly determine the column classes for
>>>>>>>> me. I don't really want to specify 3700 column classes!
>>>>>>>> (I'm not sure what they are anyway).
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Tue, Dec 6, 2011 at 12:40 PM, Mark Leeds
>>>>>>>> <markleeds2 at gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> Hi Gene: Sometimes using colClasses in read.table can
>>>>>>>>> speed things up. If you know what your variables are ahead
>>>>>>>>> of time and what you want them to be, this allows you to be
>>>>>>>>> specific by specifying character or numeric, etc., and
>>>>>>>>> often it makes things faster. Others will have more to say.
>>>>>>>>> 
>>>>>>>>> Also, if most of your variables are characters, R will try
>>>>>>>>> to convert them into factors by default. If you use
>>>>>>>>> as.is = TRUE it won't do this, and that might speed things
>>>>>>>>> up also.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Rejoinder: the above tidbits are just from experience. I
>>>>>>>>> don't know if they're set in stone or a hard and fast rule.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Tue, Dec 6, 2011 at 1:15 PM, Gene Leynes
>>>>>>>>> <gleynes at gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>>> ** Disclaimer: I'm looking for general suggestions **
>>>>>>>>>> I'm sorry, but I can't send out the file I'm using, so
>>>>>>>>>> there is no reproducible example.
>>>>>>>>>> 
>>>>>>>>>> I'm using read.table and it's taking over 30 seconds to
>>>>>>>>>> read a tiny file. The strange thing is that it takes
>>>>>>>>>> roughly the same amount of time if the file is 100 times
>>>>>>>>>> larger.
>>>>>>>>>> 
>>>>>>>>>> After re-reviewing the R Data Import/Export manual I think
>>>>>>>>>> the best approach would be to use Python, or perhaps the
>>>>>>>>>> readLines function, but I was hoping to understand why the
>>>>>>>>>> simple read.table approach wasn't working as expected.
>>>>>>>>>> 
>>>>>>>>>> Some relevant facts:
>>>>>>>>>> 
>>>>>>>>>> 1. There are about 3700 columns. Maybe this is the
>>>>>>>>>>    problem? Still, the file size is not very large.
>>>>>>>>>> 2. The file encoding is ANSI, but I'm not specifying that
>>>>>>>>>>    in the function. Setting fileEncoding="ANSI" produces
>>>>>>>>>>    an "unsupported conversion" error.
>>>>>>>>>> 3. readLines imports the lines quickly.
>>>>>>>>>> 4. scan imports the file quickly also.
>>>>>>>>>> 
>>>>>>>>>> Obviously, scan and readLines would require more coding to
>>>>>>>>>> identify columns, etc.
>>>>>>>>>> 
>>>>>>>>>> my code:
>>>>>>>>>> system.time(dat <- read.table('C:/test.txt', nrows=-1,
>>>>>>>>>>                               sep='\t', header=TRUE))
>>>>>>>>>> 
>>>>>>>>>> It's taking 33.4 seconds and the file size is only 315 kb!
>>>>>>>>>> 
>>>>>>>>>> Thanks
>>>>>>>>>> 
>>>>>>>>>> Gene
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> ______________________________________________ 
>>>>>>>>>> R-help at r-project.org mailing list 
>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help 
>>>>>>>>>> PLEASE do read the posting guide 
>>>>>>>>>> http://www.R-project.org/posting-guide.html and
>>>>>>>>>> provide commented, minimal, self-contained,
>>>>>>>>>> reproducible code.
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> --
>>>>> Peter Dalgaard, Professor,
>>>>> Center for Statistics, Copenhagen Business School
>>>>> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
>>>>> Phone: (+45)38153501
>>>>> Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>> 
> 

-- 
Rainer M. Krug, PhD (Conservation Ecology, SUN), MSc (Conservation
Biology, UCT), Dipl. Phys. (Germany)

Centre of Excellence for Invasion Biology
Stellenbosch University
South Africa

Tel :       +33 - (0)9 53 10 27 44
Cell:       +33 - (0)6 85 62 59 98
Fax :       +33 - (0)9 58 10 27 44

Fax (D):    +49 - (0)3 21 21 25 22 44

email:      Rainer at krugs.de

Skype:      RMkrug