[R] read.table performance

Fri Dec 9 13:55:07 CET 2011

> 
> By the way, here's my original session information.  (I can never remember
> the name of that command when I want it).  It's strange that Petr is 
> having the problem with 2.14.  It's relatively fast on my machine with R 2.14.

I am probably using a PC that is inferior by today's standards: Windows XP,
Intel 2.33 GHz, 2 GB memory.

Petr

> 
> > sessionInfo()
> R version 2.13.0 (2011-04-13)
> Platform: i386-pc-mingw32/i386 (32-bit)
> 
> locale:
> [1] LC_COLLATE=English_United States.1252 
> [2] LC_CTYPE=English_United States.1252   
> [3] LC_MONETARY=English_United States.1252
> [4] LC_NUMERIC=C                          
> [5] LC_TIME=English_United States.1252    
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base     

> > 
> 

> On Thu, Dec 8, 2011 at 3:06 AM, Rainer M Krug <r.m.krug at gmail.com> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> On 08/12/11 09:32, Petr PIKAL wrote:
> > Hi
> >
> >> system.time(dat<-read.table("test2.txt"))
> >    user  system elapsed
> >   32.38    0.00   32.40
> >
> >> system.time(dat <- read.table('test2.txt', nrows=-1, sep='\t',
> > header=TRUE))
> >    user  system elapsed
> >   32.30    0.03   32.36
> >
> > Couldn't it be a Windows issue?

> Likely - here on Linux I get:
> 
> > system.time(dat <- read.table('tmp/test2.txt', nrows=-1, sep='\t',
> header=TRUE))
>   user  system elapsed
>  1.560   0.000   1.579
> > sessionInfo()
> R version 2.14.0 (2011-10-31)
> Platform: i686-pc-linux-gnu (32-bit)
> 
> locale:
>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
>  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
>  [7] LC_PAPER=C                 LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> > version
>               _
> platform       i686-pc-linux-gnu
> arch           i686
> os             linux-gnu
> system         i686, linux-gnu
> status
> major          2
> minor          14.0
> year           2011
> month          10
> day            31
> svn rev        57496
> language       R
> version.string R version 2.14.0 (2011-10-31)
> >
> 
> 
> Cheers,
> 
> Rainer
> 
> >                _
> > platform       i386-pc-mingw32
> > arch           i386
> > os             mingw32
> > system         i386, mingw32
> > status         Under development (unstable)
> > major          2
> > minor          14.0
> > year           2011
> > month          04
> > day            27
> > svn rev        55657
> > language       R
> > version.string R version 2.14.0 Under development (unstable) (2011-04-27 r55657)
> >>
> >
> >
> >> dim(dat)
> > [1]    7 3765
> >>
> >
> > But from the dat file it seems to me that its structure is somehow
> > weird.
> >
> >> head(names(dat))
> > [1] "X..Hydrogen" "Helium"      "Lithium"     "Beryllium"   "Boron"
> >  [6] "Carbon"
> >> tail(names(dat))
> > [1] "Sulfur.32"    "Chlorine.32"  "Argon.32"     "Potassium.32" "Calcium.32"
> > [6] "Scandium.32"
> >>
> >
> > There is a row of names with repeating values. Maybe most of the
> > time is spent checking the validity of the names.
> >
> > Regards Petr
> >
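One way to test the guess about name checking is to read the same file with and without check.names and compare the timings. The sketch below is only an illustration: it builds a small stand-in file with repeated header values (the file, the element names and the sizes are made up, not taken from test2.txt), so any timing difference it shows says nothing definite about the real file.

```r
# Build a small tab-separated file whose header repeats the same names
# (hypothetical stand-in for test2.txt; the contents are invented).
tmp <- tempfile(fileext = ".txt")
element <- c("Hydrogen", "Helium", "Lithium")
header <- rep(element, times = 3)
values <- matrix(seq_len(2 * length(header)), nrow = 2)
writeLines(c(paste(header, collapse = "\t"),
             apply(values, 1, paste, collapse = "\t")),
           tmp)

# Default: names are passed through make.names() and made unique.
t_checked <- system.time(
  checked <- read.table(tmp, header = TRUE, sep = "\t")
)

# check.names = FALSE keeps the header exactly as it is in the file.
t_raw <- system.time(
  raw <- read.table(tmp, header = TRUE, sep = "\t", check.names = FALSE)
)

print(names(checked))
print(names(raw))
print(rbind(checked = t_checked, raw = t_raw))
```

If the two timings are about the same on the real file, name checking is probably not where the time goes.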
> > r-help-bounces at r-project.org wrote on 07.12.2011 23:11:10:
> >
> >> From: peter dalgaard <pdalgd at gmail.com>
> >> Sent by: r-help-bounces at r-project.org
> >> Date: 07.12.2011 23:11
> >> To: "R. Michael Weylandt" <michael.weylandt at gmail.com>
> >> Cc: r-help at r-project.org, Gene Leynes <gleynes at gmail.com>
> >> Subject: Re: [R] read.table performance
> >>
> >>
> >> On Dec 7, 2011, at 22:37 , R. Michael Weylandt wrote:
> >>
> >>> R 2.13.2 on Mac OS X 10.5.8 takes about 1.8s to read the file
> >>> verbatim: system.time(read.table("test2.txt"))
> >>
> >> About 2.3s with 2.14 on a 1.86 GHz MacBook Air 10.6.8.
> >>
> >> Gene, are you by any chance storing the file in a heavily
> >> virus-scanned system directory?
> >>
> >> -pd
> >>
> >>> Michael
> >>>
> >>> 2011/12/7 Gene Leynes <gleynes at gmail.com>:
> >>>> Peter,
> >>>>
> >>>> You're quite right; it's nearly impossible to make progress
> >>>> without a working example.
> >>>>
> >>>> I created an ** extremely simplified ** example for
> >>>> distribution. The real data has numeric, character, and boolean
> >>>> classes.
> >>>>
> >>>> The file still takes 25.08 seconds to read, despite its
> >>>> small size.
> >>>>
> >>>> I neglected to mention that I'm using R 2.13.0 and I'm on a
> >>>> Windows 7 machine (not that it should particularly matter
> >>>> with this type of data / functions).
> >>>>
> >>>> ## The code:
> >>>> options(stringsAsFactors=FALSE)
> >>>> system.time(dat <- read.table('test2.txt', nrows=-1, sep='\t',
> >>>>                               header=TRUE))
> >>>> str(dat, 0)
> >>>>
> >>>>
> >>>> Thanks again!
> >>>>
> >>>>
> >>>>
> >>>> On Wed, Dec 7, 2011 at 1:21 AM, peter dalgaard <pdalgd at gmail.com> wrote:
> >>>>
> >>>>>
> >>>>> On Dec 6, 2011, at 22:33 , Gene Leynes wrote:
> >>>>>
> >>>>>> Mark,
> >>>>>>
> >>>>>> Thanks for your suggestions.
> >>>>>>
> >>>>>> That's a good idea about the NULL columns; I didn't think
> >>>>>> of that. Surprisingly, it didn't have any effect on the
> >>>>>> time.
> >>>>>
> >>>>> Hmm, I think you want "character" and "NULL" there (i.e.,
> >>>>> quoted). Did you fix both?
> >>>>>
> >>>>>>> read.table(whatever, as.is=TRUE, colClasses =
> >>>>>>> c(rep(character,4), rep(NULL,3696)).
> >>>>>
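For reference, a minimal sketch of the quoted form, run on a tiny made-up file (four columns instead of 3700; the column names are invented for illustration). With colClasses, "character" reads a column as text and "NULL" skips it entirely.

```r
# Tiny stand-in file: two text columns followed by two numeric ones
# (hypothetical data, not the file discussed in this thread).
tmp <- tempfile(fileext = ".txt")
df <- data.frame(a = c("x", "y"), b = c("p", "q"), c = 1:2, d = 3:4)
write.table(df, tmp, sep = "\t", row.names = FALSE, quote = FALSE)

# colClasses takes class names as character strings:
# keep the first two columns as character, drop the remaining two.
dat <- read.table(tmp, header = TRUE, sep = "\t", as.is = TRUE,
                  colClasses = c(rep("character", 2), rep("NULL", 2)))

str(dat)
```

For the file in this thread the same pattern would presumably be c(rep("character", 4), rep("NULL", 3696)), assuming those counts match the actual number of columns.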
> >>>>> As a general matter, if you want people to dig into this,
> >>>>> they need some paraphrase of the file to play with. Would it be
> >>>>> possible to set up a small R program that generates a data file
> >>>>> which displays the issue? Everything I try seems to take about a
> >>>>> second to read in.
> >>>>>
> >>>>> -pd
> >>>>>
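A generator of the kind asked for here might look like the sketch below. It is an assumption-laden stand-in: it writes a few rows of random numbers across 3700 columns (the column names, row count and values are invented), so it reproduces only the shape of the problem file, not its contents, and may well read quickly.

```r
# Generate a wide, short tab-separated file and time read.table on it.
# All sizes and names are illustrative assumptions.
set.seed(1)
n_col <- 3700
n_row <- 7
m <- matrix(round(runif(n_row * n_col), 3), nrow = n_row)
colnames(m) <- paste0("V", seq_len(n_col))

tmp <- tempfile(fileext = ".txt")
write.table(m, tmp, sep = "\t", quote = FALSE, row.names = FALSE)

timing <- system.time(
  dat <- read.table(tmp, header = TRUE, sep = "\t")
)
print(timing)
print(dim(dat))
```

If this reads fast while the real file does not, the difference has to come from something in the real content (names, column types, encoding) rather than from the number of columns alone.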
> >>>>>>
> >>>>>> This problem was just a curiosity; I already did the
> >>>>>> import using Excel and VBA.  I was just going to illustrate
> >>>>>> the power and simplicity of R, but ironically it's been much
> >>>>>> slower and harder in R... The VBA was painful and messy, and
> >>>>>> took me over an hour to write; but at least it worked quickly
> >>>>>> and reliably. The R code was clean and only took me about 5
> >>>>>> minutes to write, but the run time was prohibitively slow!
> >>>>>>
> >>>>>> I profiled the code, but that offers little insight to
> >>>>>> me.
> >>>>>>
> >>>>>> Profile results with 10 line file:
> >>>>>>
> >>>>>>> summaryRprof("C:/Users/gene.leynes/Desktop/test.out")
> >>>>>> $by.self
> >>>>>>              self.time self.pct total.time total.pct
> >>>>>> scan             12.24    53.50      12.24     53.50
> >>>>>> read.table       10.58    46.24      22.88    100.00
> >>>>>> type.convert      0.04     0.17       0.04      0.17
> >>>>>> make.names        0.02     0.09       0.02      0.09
> >>>>>>
> >>>>>> $by.total
> >>>>>>              total.time total.pct self.time self.pct
> >>>>>> read.table        22.88    100.00     10.58    46.24
> >>>>>> scan              12.24     53.50     12.24    53.50
> >>>>>> type.convert       0.04      0.17      0.04     0.17
> >>>>>> make.names         0.02      0.09      0.02     0.09
> >>>>>>
> >>>>>> $sample.interval
> >>>>>> [1] 0.02
> >>>>>>
> >>>>>> $sampling.time
> >>>>>> [1] 22.88
> >>>>>>
> >>>>>>
> >>>>>> Profile results with 250 line file:
> >>>>>>
> >>>>>>> summaryRprof("C:/Users/gene.leynes/Desktop/test.out")
> >>>>>> $by.self
> >>>>>>              self.time self.pct total.time total.pct
> >>>>>> scan             23.88    68.15      23.88     68.15
> >>>>>> read.table       10.78    30.76      35.04    100.00
> >>>>>> type.convert      0.30     0.86       0.32      0.91
> >>>>>> character         0.02     0.06       0.02      0.06
> >>>>>> file              0.02     0.06       0.02      0.06
> >>>>>> lapply            0.02     0.06       0.02      0.06
> >>>>>> unlist            0.02     0.06       0.02      0.06
> >>>>>>
> >>>>>> $by.total
> >>>>>>                total.time total.pct self.time self.pct
> >>>>>> read.table          35.04    100.00     10.78    30.76
> >>>>>> scan                23.88     68.15     23.88    68.15
> >>>>>> type.convert         0.32      0.91      0.30     0.86
> >>>>>> sapply               0.04      0.11      0.00     0.00
> >>>>>> character            0.02      0.06      0.02     0.06
> >>>>>> file                 0.02      0.06      0.02     0.06
> >>>>>> lapply               0.02      0.06      0.02     0.06
> >>>>>> unlist               0.02      0.06      0.02     0.06
> >>>>>> simplify2array       0.02      0.06      0.00     0.00
> >>>>>>
> >>>>>> $sample.interval
> >>>>>> [1] 0.02
> >>>>>>
> >>>>>> $sampling.time
> >>>>>> [1] 35.04
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Dec 6, 2011 at 2:34 PM, Mark Leeds <markleeds2 at gmail.com> wrote:
> >>>>>>
> >>>>>>> hi gene: maybe someone else will reply with some subtleties that
> >>>>>>> I'm not aware of. one other thing that might help: if you know
> >>>>>>> which columns you want, you can set the others to NULL through
> >>>>>>> colClasses and this should speed things up also. For example, say
> >>>>>>> you knew you only wanted the first four columns and they were
> >>>>>>> character. then you could do,
> >>>>>>>
> >>>>>>> read.table(whatever, as.is=TRUE, colClasses =
> >>>>>>> c(rep(character,4), rep(NULL,3696)).
> >>>>>>>
> >>>>>>> hopefully someone else will say something that does the trick. it
> >>>>>>> seems odd to me as far as the difference in timings ? good luck.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Dec 6, 2011 at 1:55 PM, Gene Leynes <gleynes at gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Mark,
> >>>>>>>>
> >>>>>>>> Thank you for the reply
> >>>>>>>>
> >>>>>>>> I neglected to mention that I had already set
> >>>>>>>> options(stringsAsFactors=FALSE)
> >>>>>>>>
> >>>>>>>> I agree, skipping the factor determination can help
> >>>>>>>> performance.
> >>>>>>>>
> >>>>>>>> The main reason that I wanted to use read.table is
> >>>>>>>> because it will correctly determine the column classes for
> >>>>>>>> me.  I don't really want to specify 3700 column classes!
> >>>>>>>> (I'm not sure what they are anyway).
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Tue, Dec 6, 2011 at 12:40 PM, Mark Leeds <markleeds2 at gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> Hi Gene: Sometimes using colClasses in read.table
> >>>>>>>>> can speed things up. If you know what your variables are
> >>>>>>>>> ahead of time and what you want them to be, this allows you
> >>>>>>>>> to be specific by specifying character or numeric, etc., and
> >>>>>>>>> often it makes things faster. others will have more to say.
> >>>>>>>>>
> >>>>>>>>> also, if most of your variables are characters, R
> >>>>>>>>> will try to convert them into factors by default. If you use
> >>>>>>>>> as.is = TRUE it won't do this and that might speed things up
> >>>>>>>>> also.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Rejoinder:  above tidbits are just from
> >>>>>>>>> experience. I don't know if it's in stone or a hard and fast
> >>>>>>>>> rule.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Tue, Dec 6, 2011 at 1:15 PM, Gene Leynes <gleynes at gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> ** Disclaimer: I'm looking for general suggestions **
> >>>>>>>>>> I'm sorry, but can't send out the file I'm using, so
> >>>>>>>>>> there is no reproducible example.
> >>>>>>>>>>
> >>>>>>>>>> I'm using read.table and it's taking over 30
> >>>>>>>>>> seconds to read a tiny file. The strange thing is that it
> >>>>>>>>>> takes roughly the same amount of time if the file is 100
> >>>>>>>>>> times larger.
> >>>>>>>>>>
> >>>>>>>>>> After re-reviewing the data Import / Export
> >>>>>>>>>> manual I think the best approach would be to use Python, or
> >>>>>>>>>> perhaps the readLines function, but I was hoping to
> >>>>>>>>>> understand why the simple read.table approach wasn't
> >>>>>>>>>> working as expected.
> >>>>>>>>>>
> >>>>>>>>>> Some relevant facts:
> >>>>>>>>>>
> >>>>>>>>>> 1. There are about 3700 columns.  Maybe this is the
> >>>>>>>>>>    problem?  Still the file size is not very large.
> >>>>>>>>>> 2. The file encoding is ANSI, but I'm not specifying that
> >>>>>>>>>>    in the function.  Setting fileEncoding="ANSI" produces
> >>>>>>>>>>    an "unsupported conversion" error
> >>>>>>>>>> 3. readLines imports the lines quickly
> >>>>>>>>>> 4. scan imports the file quickly also
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Obviously, scan and readLines would require more
> >>>>>>>>>> coding to identify columns, etc.
> >>>>>>>>>>
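The extra coding mentioned for readLines is fairly small when the file is a plain tab-separated table. The sketch below is one possible approach, not a drop-in replacement for read.table: it assumes every line has the same number of fields, no quoting, and no trailing empty fields (strsplit drops those). The sample file and its column names are invented.

```r
# Small stand-in file (hypothetical contents).
tmp <- tempfile(fileext = ".txt")
writeLines(c("id\tname\tvalue",
             "1\talpha\t0.5",
             "2\tbeta\t1.5"),
           tmp)

# Read raw lines and split each one on tabs.
lines  <- readLines(tmp)
fields <- strsplit(lines, "\t", fixed = TRUE)

# First line is the header; the rest become a character matrix.
header <- fields[[1]]
body   <- do.call(rbind, fields[-1])

# Convert to a data frame, then let type.convert guess each column's class,
# roughly what read.table does after scanning.
dat <- as.data.frame(body, stringsAsFactors = FALSE)
names(dat) <- header
dat[] <- lapply(dat, type.convert, as.is = TRUE)

str(dat)
```

Timing each step separately (readLines, strsplit, type.convert) on the real file would show which stage is slow.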
> >>>>>>>>>> my code:
> >>>>>>>>>> system.time(dat <- read.table('C:/test.txt', nrows=-1,
> >>>>>>>>>>                               sep='\t', header=TRUE))
> >>>>>>>>>>
> >>>>>>>>>> It's taking 33.4 seconds and the file size is
> >>>>>>>>>> only 315 kb!
> >>>>>>>>>>
> >>>>>>>>>> Thanks
> >>>>>>>>>>
> >>>>>>>>>> Gene
> >>>>>>>>>>
> >>>>>>>>>> [[alternative HTML version deleted]]
> >>>>>>>>>>
> >>>>>>>>>> ______________________________________________
> >>>>>>>>>> R-help at r-project.org mailing list
> >>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>>>>>>>> PLEASE do read the posting guide
> >>>>>>>>>> http://www.R-project.org/posting-guide.html and
> >>>>>>>>>> provide commented, minimal, self-contained,
> >>>>>>>>>> reproducible
> > code.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>> -- Peter Dalgaard, Professor, Center for Statistics,
> >>>>> Copenhagen Business School Solbjerg Plads 3, 2000
> >>>>> Frederiksberg, Denmark Phone: (+45)38153501 Email:
> >>>>> pd.mes at cbs.dk  Priv: PDalgd at gmail.com
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>
> >>
> >
> 

> - --
> Rainer M. Krug, PhD (Conservation Ecology, SUN), MSc (Conservation
> Biology, UCT), Dipl. Phys. (Germany)
> 
> Centre of Excellence for Invasion Biology
> Stellenbosch University
> South Africa
> 
> Tel :       +33 - (0)9 53 10 27 44
> Cell:       +33 - (0)6 85 62 59 98
> Fax :       +33 - (0)9 58 10 27 44
> 
> Fax (D):    +49 - (0)3 21 21 25 22 44
> 
> email:      Rainer at krugs.de
> 
> Skype:      RMkrug
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.11 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
> 
> iEYEARECAAYFAk7gffsACgkQoYgNqgF2egpNpACeLbyAXB1pLGgyt7hAE7QAWe9i
> uV0An1Z8tvGw/1+40JM6YSe3aDqQoRkh
> =/mB7
> -----END PGP SIGNATURE-----