[R] read.table performance
Petr PIKAL
petr.pikal at precheza.cz
Fri Dec 9 13:55:07 CET 2011
>
> By the way, here's my original session information. (I can never remember
> the name of that command when I want it.) It's strange that Petr is
> having the problem with 2.14. It's relatively fast on my machine with R
> 2.14.
I am probably using a PC that is inferior by today's standards: Windows XP,
Intel 2.33 GHz, 2 GB memory.
Petr
>
> > sessionInfo()
> R version 2.13.0 (2011-04-13)
> Platform: i386-pc-mingw32/i386 (32-bit)
>
> locale:
> [1] LC_COLLATE=English_United States.1252
> [2] LC_CTYPE=English_United States.1252
> [3] LC_MONETARY=English_United States.1252
> [4] LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
> >
>
> On Thu, Dec 8, 2011 at 3:06 AM, Rainer M Krug <r.m.krug at gmail.com> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 08/12/11 09:32, Petr PIKAL wrote:
> > Hi
> >
> >> system.time(dat<-read.table("test2.txt"))
> >    user  system elapsed
> >   32.38    0.00   32.40
> >
> >> system.time(dat <- read.table('test2.txt', nrows=-1, sep='\t',
> > header=TRUE))
> >    user  system elapsed
> >   32.30    0.03   32.36
> >
> > Couldn't it be a Windows issue?
> Likely - here on Linux I get:
>
> > system.time(dat <- read.table('tmp/test2.txt', nrows=-1, sep='\t',
> header=TRUE))
> user system elapsed
> 1.560 0.000 1.579
> > sessionInfo()
> R version 2.14.0 (2011-10-31)
> Platform: i686-pc-linux-gnu (32-bit)
>
> locale:
> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
> > version
> _
> platform i686-pc-linux-gnu
> arch i686
> os linux-gnu
> system i686, linux-gnu
> status
> major 2
> minor 14.0
> year 2011
> month 10
> day 31
> svn rev 57496
> language R
> version.string R version 2.14.0 (2011-10-31)
> >
>
>
> Cheers,
>
> Rainer
>
> >                _
> > platform       i386-pc-mingw32
> > arch           i386
> > os             mingw32
> > system         i386, mingw32
> > status         Under development (unstable)
> > major          2
> > minor          14.0
> > year           2011
> > month          04
> > day            27
> > svn rev        55657
> > language       R
> > version.string R version 2.14.0 Under development (unstable) (2011-04-27 r55657)
> >>
> >
> >
> >> dim(dat)
> > [1] 7 3765
> >>
> >
> > But from the dat file it seems to me that its structure is somehow
> > weird.
> >
> >> head(names(dat))
> > [1] "X..Hydrogen" "Helium" "Lithium" "Beryllium" "Boron"
> > [6] "Carbon"
> >> tail(names(dat))
> > [1] "Sulfur.32"    "Chlorine.32"  "Argon.32"     "Potassium.32" "Calcium.32"
> > [6] "Scandium.32"
> >>
> >
> > There is a row of names that has repeating values. Maybe most of the
> > time is spent checking the validity of the names.
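
Petr's guess about name checking can be tested directly. The sketch below is only an illustration: it builds a small stand-in file whose header repeats a few element names (invented data, not Gene's file) and reads it with and without `check.names`. `check.names` is a documented `read.table` argument, but the thread does not establish that name checking is the cause of the slowdown.

```r
# Build a small tab-separated file whose header repeats names, loosely
# mimicking the repeated element names seen in names(dat). Hypothetical data.
elements <- c("Hydrogen", "Helium", "Lithium", "Beryllium")
header <- rep(elements, times = 50)            # 200 columns, names repeat
values <- matrix(seq_len(3 * length(header)), nrow = 3)
tmp <- tempfile(fileext = ".txt")
write.table(rbind(header, values), tmp, sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = FALSE)

# Default (check.names = TRUE): names are made syntactically valid and
# unique, so repeats become Hydrogen.1, Helium.1, and so on.
checked <- read.table(tmp, sep = "\t", header = TRUE)

# check.names = FALSE keeps the header exactly as written in the file.
raw <- read.table(tmp, sep = "\t", header = TRUE, check.names = FALSE)

anyDuplicated(names(checked))   # 0: duplicates were renamed
anyDuplicated(names(raw)) > 0   # TRUE: duplicates were kept
```

Wrapping each call in `system.time()` on the real file would show whether skipping the name check changes the timing at all.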
> >
> > Regards Petr
> >
> > r-help-bounces at r-project.org wrote on 07.12.2011 23:11:10:
> >
> >> peter dalgaard <pdalgd at gmail.com>
> >> Sent by: r-help-bounces at r-project.org
> >> 07.12.2011 23:11
> >>
> >> To: "R. Michael Weylandt" <michael.weylandt at gmail.com>
> >> Cc: r-help at r-project.org, Gene Leynes <gleynes at gmail.com>
> >> Subject: Re: [R] read.table performance
> >>
> >>
> >> On Dec 7, 2011, at 22:37 , R. Michael Weylandt wrote:
> >>
> >>> R 2.13.2 on Mac OS X 10.5.8 takes about 1.8s to read the file
> >>> verbatim: system.time(read.table("test2.txt"))
> >>
> >> About 2.3s with 2.14 on a 1.86 GHz MacBook Air 10.6.8.
> >>
> >> Gene, are you by any chance storing the file in a heavily
> >> virus-scanned system directory?
> >>
> >> -pd
> >>
> >>> Michael
> >>>
> >>> 2011/12/7 Gene Leynes <gleynes at gmail.com>:
> >>>> Peter,
> >>>>
> >>>> You're quite right; it's nearly impossible to make progress
> >>>> without a working example.
> >>>>
> >>>> I created an ** extremely simplified ** example for
> >>>> distribution. The real data has numeric, character, and
> >>>> boolean classes.
> >>>>
> >>>> The file still takes 25.08 seconds to read, despite its
> >>>> small size.
> >>>>
> >>>> I neglected to mention that I'm using R 2.13.0 and I'm on a
> >>>> Windows 7 machine (not that it should particularly matter
> >>>> with this type of data / functions).
> >>>>
> >>>> ## The code:
> >>>> options(stringsAsFactors=FALSE)
> >>>> system.time(dat <- read.table('test2.txt', nrows=-1, sep='\t',
> >>>>                               header=TRUE))
> >>>> str(dat, 0)
> >>>>
> >>>>
> >>>> Thanks again!
> >>>>
> >>>>
> >>>>
> >>>> On Wed, Dec 7, 2011 at 1:21 AM, peter dalgaard <pdalgd at gmail.com> wrote:
> >>>>
> >>>>>
> >>>>> On Dec 6, 2011, at 22:33 , Gene Leynes wrote:
> >>>>>
> >>>>>> Mark,
> >>>>>>
> >>>>>> Thanks for your suggestions.
> >>>>>>
> >>>>>> That's a good idea about the NULL columns; I didn't think
> >>>>>> of that. Surprisingly, it didn't have any effect on the
> >>>>>> time.
> >>>>>
> >>>>> Hmm, I think you want "character" and "NULL" there (i.e.,
> >>>>> quoted). Did you fix both?
> >>>>>
> >>>>>>> read.table(whatever, as.is=TRUE, colClasses =
> >>>>>>> c(rep(character,4), rep(NULL,3696)).
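
Peter's correction can be illustrated with a minimal sketch. The six-column file here is a hypothetical stand-in for the 3700-column one. The point is only that `colClasses` takes class names as strings, and that the string `"NULL"` tells `read.table` to skip a column.

```r
# Hypothetical stand-in file: 6 columns, of which only the first 2 are kept.
tmp <- tempfile(fileext = ".txt")
write.table(data.frame(a = c("x", "y"), b = c("p", "q"),
                       c = 1:2, d = 3:4, e = 5:6, f = 7:8),
            tmp, sep = "\t", quote = FALSE, row.names = FALSE)

# Class names must be character strings. Unquoted, rep(character, 4) is an
# error (a function cannot be replicated) and rep(NULL, n) is just NULL,
# so the unquoted version never reaches read.table as intended.
classes <- c(rep("character", 2), rep("NULL", 4))

dat <- read.table(tmp, sep = "\t", header = TRUE, colClasses = classes)
str(dat)   # only columns a and b are read, both as character
```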
> >>>>>
> >>>>> As a general matter, if you want people to dig into this,
> >>>>> they need some paraphrase of the file to play with. Would it
> >>>>> be possible to set up a small R program that generates a data
> >>>>> file which displays the issue? Everything I try seems to take
> >>>>> about a second to read in.
> >>>>>
> >>>>> -pd
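
A generator of the kind Peter asks for might look like the following sketch. The dimensions (7 rows, 3765 columns) echo the `dim(dat)` output reported elsewhere in this thread. The column names and random values are placeholders, so this file is not expected to reproduce the slowdown by itself, only to give a shareable starting point.

```r
# Generate a wide, short tab-separated file with placeholder contents.
set.seed(1)
n_rows <- 7
n_cols <- 3765
m <- matrix(round(runif(n_rows * n_cols), 3), nrow = n_rows)
colnames(m) <- paste("V", seq_len(n_cols), sep = "")

path <- file.path(tempdir(), "wide_test.txt")   # hypothetical file name
write.table(m, path, sep = "\t", quote = FALSE, row.names = FALSE)

# Time the same call that is used in the thread.
timing <- system.time(dat <- read.table(path, nrows = -1, sep = "\t",
                                        header = TRUE))
print(timing)
dim(dat)
```

Replacing the placeholder values with a mix of numeric, character, and logical columns, or with the real header row, would bring the file closer to the one under discussion.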
> >>>>>
> >>>>>>
> >>>>>> This problem was just a curiosity; I already did the
> >>>>>> import using Excel and VBA. I was just going to illustrate
> >>>>>> the power and simplicity of R, but ironically it's been
> >>>>>> much slower and harder in R... The VBA was painful and
> >>>>>> messy, and took me over an hour to write, but at least it
> >>>>>> worked quickly and reliably. The R code was clean and only
> >>>>>> took me about 5 minutes to write, but the run time was
> >>>>>> prohibitively slow!
> >>>>>>
> >>>>>> I profiled the code, but that offers little insight to
> >>>>>> me.
> >>>>>>
> >>>>>> Profile results with 10 line file:
> >>>>>>
> >>>>>>> summaryRprof("C:/Users/gene.leynes/Desktop/test.out")
> >>>>>> $by.self
> >>>>>>              self.time self.pct total.time total.pct
> >>>>>> scan             12.24    53.50      12.24     53.50
> >>>>>> read.table       10.58    46.24      22.88    100.00
> >>>>>> type.convert      0.04     0.17       0.04      0.17
> >>>>>> make.names        0.02     0.09       0.02      0.09
> >>>>>>
> >>>>>> $by.total
> >>>>>>              total.time total.pct self.time self.pct
> >>>>>> read.table        22.88    100.00     10.58    46.24
> >>>>>> scan              12.24     53.50     12.24    53.50
> >>>>>> type.convert       0.04      0.17      0.04     0.17
> >>>>>> make.names         0.02      0.09      0.02     0.09
> >>>>>>
> >>>>>> $sample.interval
> >>>>>> [1] 0.02
> >>>>>>
> >>>>>> $sampling.time
> >>>>>> [1] 22.88
> >>>>>>
> >>>>>>
> >>>>>> Profile results with 250 line file:
> >>>>>>
> >>>>>>> summaryRprof("C:/Users/gene.leynes/Desktop/test.out")
> >>>>>> $by.self
> >>>>>>              self.time self.pct total.time total.pct
> >>>>>> scan             23.88    68.15      23.88     68.15
> >>>>>> read.table       10.78    30.76      35.04    100.00
> >>>>>> type.convert      0.30     0.86       0.32      0.91
> >>>>>> character         0.02     0.06       0.02      0.06
> >>>>>> file              0.02     0.06       0.02      0.06
> >>>>>> lapply            0.02     0.06       0.02      0.06
> >>>>>> unlist            0.02     0.06       0.02      0.06
> >>>>>>
> >>>>>> $by.total
> >>>>>>                total.time total.pct self.time self.pct
> >>>>>> read.table          35.04    100.00     10.78    30.76
> >>>>>> scan                23.88     68.15     23.88    68.15
> >>>>>> type.convert         0.32      0.91      0.30     0.86
> >>>>>> sapply               0.04      0.11      0.00     0.00
> >>>>>> character            0.02      0.06      0.02     0.06
> >>>>>> file                 0.02      0.06      0.02     0.06
> >>>>>> lapply               0.02      0.06      0.02     0.06
> >>>>>> unlist               0.02      0.06      0.02     0.06
> >>>>>> simplify2array       0.02      0.06      0.00     0.00
> >>>>>>
> >>>>>> $sample.interval
> >>>>>> [1] 0.02
> >>>>>>
> >>>>>> $sampling.time
> >>>>>> [1] 35.04
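
For readers who want to produce a summary like the ones above, here is a minimal sketch of the `Rprof` / `summaryRprof` workflow. The data file is a generated placeholder, and the 0.02 s interval simply matches the `$sample.interval` shown in Gene's output. A very fast read may record no samples at all, which is why the summary call is wrapped in `tryCatch`.

```r
# Placeholder data: 5000 rows by 100 numeric columns.
tmp <- tempfile(fileext = ".txt")
write.table(as.data.frame(matrix(runif(5000 * 100), ncol = 100)),
            tmp, sep = "\t", quote = FALSE, row.names = FALSE)

prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.02)      # start sampling the call stack
for (i in 1:3) {
  dat <- read.table(tmp, sep = "\t", header = TRUE)
}
Rprof(NULL)                            # stop profiling

# summaryRprof() stops with an error if no samples were recorded.
prof <- tryCatch(summaryRprof(prof_file), error = function(e) NULL)
if (!is.null(prof)) print(head(prof$by.self))
```

In the `$by.self` table, time is attributed to the function actually executing, so a large `self.time` for `read.table` itself (as in both summaries above) points at work done in its own R-level code rather than in `scan` or `type.convert`.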
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Dec 6, 2011 at 2:34 PM, Mark Leeds <markleeds2 at gmail.com> wrote:
> >>>>>>
> >>>>>>> hi gene: maybe someone else will reply with some
> >>>>>>> subtleties that I'm not aware of. one other thing that
> >>>>>>> might help: if you know which columns you want, you can set
> >>>>>>> the others to NULL through colClasses and this should speed
> >>>>>>> things up also. For example, say you knew you only wanted
> >>>>>>> the first four columns and they were character. then you
> >>>>>>> could do,
> >>>>>>>
> >>>>>>> read.table(whatever, as.is=TRUE, colClasses =
> >>>>>>> c(rep(character,4), rep(NULL,3696)).
> >>>>>>>
> >>>>>>> hopefully someone else will say something that does the
> >>>>>>> trick. it seems odd to me as far as the difference in
> >>>>>>> timings ? good luck.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Dec 6, 2011 at 1:55 PM, Gene Leynes <gleynes at gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Mark,
> >>>>>>>>
> >>>>>>>> Thank you for the reply
> >>>>>>>>
> >>>>>>>> I neglected to mention that I had already set
> >>>>>>>> options(stringsAsFactors=FALSE)
> >>>>>>>>
> >>>>>>>> I agree, skipping the factor determination can help
> >>>>>>>> performance.
> >>>>>>>>
> >>>>>>>> The main reason that I wanted to use read.table is
> >>>>>>>> because it will correctly determine the column classes
> >>>>>>>> for me. I don't really want to specify 3700 column
> >>>>>>>> classes! (I'm not sure what they are anyway).
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Tue, Dec 6, 2011 at 12:40 PM, Mark Leeds <markleeds2 at gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> Hi Gene: Sometimes using colClasses in read.table
> >>>>>>>>> can speed things up. If you know what your variables
> >>>>>>>>> are ahead of time and what you want them to be, this
> >>>>>>>>> allows you to be specific by specifying character or
> >>>>>>>>> numeric, etc., and often it makes things faster. Others
> >>>>>>>>> will have more to say.
> >>>>>>>>>
> >>>>>>>>> Also, if most of your variables are characters, R
> >>>>>>>>> will try to convert them into factors by default. If
> >>>>>>>>> you use as.is = TRUE it won't do this and that might
> >>>>>>>>> speed things up also.
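
Mark's point about factor conversion can be seen in a short sketch with an invented two-column file. Note that the default he describes applied to R versions of that era. Since R 4.0.0 the default is `stringsAsFactors = FALSE`, so the factor case is requested explicitly here.

```r
# Invented two-column file: one character column, one integer column.
tmp <- tempfile(fileext = ".txt")
writeLines(c("name\tscore", "alpha\t1", "beta\t2"), tmp)

# as.is = TRUE leaves character columns as character vectors.
kept <- read.table(tmp, sep = "\t", header = TRUE, as.is = TRUE)

# stringsAsFactors = TRUE reproduces the old default: strings become factors.
fact <- read.table(tmp, sep = "\t", header = TRUE, stringsAsFactors = TRUE)

class(kept$name)   # "character"
class(fact$name)   # "factor"
```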
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Rejoinder: the above tidbits are just from
> >>>>>>>>> experience. I don't know if it's set in stone or a hard
> >>>>>>>>> and fast rule.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Tue, Dec 6, 2011 at 1:15 PM, Gene Leynes <gleynes at gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> ** Disclaimer: I'm looking for general
> >>>>>>>>>> suggestions ** I'm sorry, but can't send out the
> >>>>>>>>>> file I'm using, so there is no reproducible example.
> >>>>>>>>>>
> >>>>>>>>>> I'm using read.table and it's taking over 30
> >>>>>>>>>> seconds to read a tiny file. The strange thing is
> >>>>>>>>>> that it takes roughly the same amount of time if the
> >>>>>>>>>> file is 100 times larger.
> >>>>>>>>>>
> >>>>>>>>>> After re-reviewing the R Data Import/Export
> >>>>>>>>>> manual I think the best approach would be to use
> >>>>>>>>>> Python, or perhaps the readLines function, but I was
> >>>>>>>>>> hoping to understand why the simple read.table
> >>>>>>>>>> approach wasn't working as expected.
> >>>>>>>>>>
> >>>>>>>>>> Some relevant facts:
> >>>>>>>>>>
> >>>>>>>>>> 1. There are about 3700 columns. Maybe this is the
> >>>>>>>>>>    problem? Still, the file size is not very large.
> >>>>>>>>>> 2. The file encoding is ANSI, but I'm not specifying
> >>>>>>>>>>    that in the function. Setting fileEncoding="ANSI"
> >>>>>>>>>>    produces an "unsupported conversion" error.
> >>>>>>>>>> 3. readLines imports the lines quickly.
> >>>>>>>>>> 4. scan imports the file quickly also.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Obviously, scan and readLines would require more
> >>>>>>>>>> coding to identify columns, etc.
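
To give a sense of the extra coding Gene mentions, here is a rough sketch of a `readLines`-based import for a tab-separated file with a header row. The three-line file is invented for illustration, and the approach assumes no quoted fields and no embedded tabs. It is not a drop-in replacement for `read.table`.

```r
# Invented tab-separated file with a header row.
tmp <- tempfile(fileext = ".txt")
writeLines(c("id\tname\tvalue", "1\talpha\t2.5", "2\tbeta\t3.5"), tmp)

lines  <- readLines(tmp)
# fixed = TRUE treats the tab literally instead of as a regular expression.
# Caveat: strsplit() drops a trailing empty field at the end of a line.
fields <- strsplit(lines, "\t", fixed = TRUE)

header <- fields[[1]]
body   <- do.call(rbind, fields[-1])          # character matrix, one row per line
dat    <- as.data.frame(body, stringsAsFactors = FALSE)
names(dat) <- header

# Guess column types the way read.table does, keeping strings as character.
dat[] <- lapply(dat, type.convert, as.is = TRUE)
str(dat)
```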
> >>>>>>>>>>
> >>>>>>>>>> my code:
> >>>>>>>>>> system.time(dat <- read.table('C:/test.txt', nrows=-1,
> >>>>>>>>>>                               sep='\t', header=TRUE))
> >>>>>>>>>>
> >>>>>>>>>> It's taking 33.4 seconds and the file size is
> >>>>>>>>>> only 315 kb!
> >>>>>>>>>>
> >>>>>>>>>> Thanks
> >>>>>>>>>>
> >>>>>>>>>> Gene
> >>>>>>>>>>
> >>>>>>>>>> ______________________________________________
> >>>>>>>>>> R-help at r-project.org mailing list
> >>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>>>>>>>> PLEASE do read the posting guide
> >>>>>>>>>> http://www.R-project.org/posting-guide.html and
> >>>>>>>>>> provide commented, minimal, self-contained,
> >>>>>>>>>> reproducible
> > code.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>> -- Peter Dalgaard, Professor, Center for Statistics,
> >>>>> Copenhagen Business School Solbjerg Plads 3, 2000
> >>>>> Frederiksberg, Denmark Phone: (+45)38153501 Email:
> >>>>> pd.mes at cbs.dk Priv: PDalgd at gmail.com
> >>>>>
> >>>>
> >>
> >
>
> - --
> Rainer M. Krug, PhD (Conservation Ecology, SUN), MSc (Conservation
> Biology, UCT), Dipl. Phys. (Germany)
>
> Centre of Excellence for Invasion Biology
> Stellenbosch University
> South Africa
>
> Tel : +33 - (0)9 53 10 27 44
> Cell: +33 - (0)6 85 62 59 98
> Fax : +33 - (0)9 58 10 27 44
>
> Fax (D): +49 - (0)3 21 21 25 22 44
>
> email: Rainer at krugs.de
>
> Skype: RMkrug
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.11 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
>
> iEYEARECAAYFAk7gffsACgkQoYgNqgF2egpNpACeLbyAXB1pLGgyt7hAE7QAWe9i
> uV0An1Z8tvGw/1+40JM6YSe3aDqQoRkh
> =/mB7
> -----END PGP SIGNATURE-----