[R] read.table performance
Rainer M Krug
r.m.krug at gmail.com
Thu Dec 8 10:06:03 CET 2011
On 08/12/11 09:32, Petr PIKAL wrote:
> Hi
>
>> system.time(dat<-read.table("test2.txt"))
>    user  system elapsed
>   32.38    0.00   32.40
>
>> system.time(dat <- read.table('test2.txt', nrows=-1, sep='\t',
> header=TRUE))
>    user  system elapsed
>   32.30    0.03   32.36
>
> Couldn't it be a Windows issue?
Likely - here on Linux I get:
> system.time(dat <- read.table('tmp/test2.txt', nrows=-1, sep='\t',
header=TRUE))
user system elapsed
1.560 0.000 1.579
> sessionInfo()
R version 2.14.0 (2011-10-31)
Platform: i686-pc-linux-gnu (32-bit)
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
> version
_
platform i686-pc-linux-gnu
arch i686
os linux-gnu
system i686, linux-gnu
status
major 2
minor 14.0
year 2011
month 10
day 31
svn rev 57496
language R
version.string R version 2.14.0 (2011-10-31)
>
Cheers,
Rainer
> _
> platform       i386-pc-mingw32
> arch           i386
> os             mingw32
> system         i386, mingw32
> status         Under development (unstable)
> major          2
> minor          14.0
> year           2011
> month          04
> day            27
> svn rev        55657
> language       R
> version.string R version 2.14.0 Under development (unstable)
>                (2011-04-27 r55657)
>>
>
>
>> dim(dat)
> [1] 7 3765
>>
>
> But from the dat file it seems to me that its structure is somehow
> weird.
>
>> head(names(dat))
> [1] "X..Hydrogen" "Helium" "Lithium" "Beryllium" "Boron"
> [6] "Carbon"
>> tail(names(dat))
> [1] "Sulfur.32"    "Chlorine.32"  "Argon.32"     "Potassium.32"
> [5] "Calcium.32"   "Scandium.32"
>>
>
> There is a row of names with repeating values. Maybe most of the
> time is spent checking the validity of the names.
>
> Regards Petr
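
One way to test the name-checking hypothesis is to time read.table with and without check.names on a file of similar shape. The sketch below rests on assumptions: the generated file (7 rows, 3765 tab-separated columns, a header of repeating element names) only imitates test2.txt, and all variable names are made up for illustration. Timings will differ by machine and R version.

```r
# Hypothetical stand-in for test2.txt: 7 rows, 3765 tab-separated columns,
# with a header row whose names repeat.
set.seed(1)
n_col <- 3765
n_row <- 7
elements <- c("Hydrogen", "Helium", "Lithium", "Beryllium", "Boron", "Carbon")
header <- rep(elements, length.out = n_col)
body <- matrix(round(runif(n_row * n_col), 3), nrow = n_row)

path <- tempfile(fileext = ".txt")
writeLines(paste(header, collapse = "\t"), path)
write.table(body, path, append = TRUE, sep = "\t",
            row.names = FALSE, col.names = FALSE, quote = FALSE)

# Default: names are checked and made unique (Hydrogen, Hydrogen.1, ...).
t_checked <- system.time(
  dat_checked <- read.table(path, sep = "\t", header = TRUE)
)

# check.names = FALSE: names are taken from the header as they are.
t_raw <- system.time(
  dat_raw <- read.table(path, sep = "\t", header = TRUE, check.names = FALSE)
)

# Elapsed seconds for the two variants.
print(c(checked = t_checked[["elapsed"]], raw = t_raw[["elapsed"]]))
```

If the two elapsed times are close, name checking is probably not where the time goes.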
>
> r-help-bounces at r-project.org wrote on 07.12.2011 23:11:10:
>
>> From: peter dalgaard <pdalgd at gmail.com>
>> Sent by: r-help-bounces at r-project.org
>> Date: 07.12.2011 23:11
>> To: "R. Michael Weylandt" <michael.weylandt at gmail.com>
>> Cc: r-help at r-project.org, Gene Leynes <gleynes at gmail.com>
>> Subject: Re: [R] read.table performance
>>
>>
>> On Dec 7, 2011, at 22:37 , R. Michael Weylandt wrote:
>>
>>> R 2.13.2 on Mac OS X 10.5.8 takes about 1.8s to read the file
>>> verbatim: system.time(read.table("test2.txt"))
>>
>> About 2.3s with 2.14 on a 1.86 GHz MacBook Air 10.6.8.
>>
>> Gene, are you by any chance storing the file in a heavily
>> virus-scanned system directory?
>>
>> -pd
>>
>>> Michael
>>>
>>> 2011/12/7 Gene Leynes <gleynes at gmail.com>:
>>>> Peter,
>>>>
>>>> You're quite right; it's nearly impossible to make progress
>>>> without a working example.
>>>>
>>>> I created an ** extremely simplified ** example for
>>>> distribution. The real data has numeric, character, and
>>>> boolean classes.
>>>>
>>>> The file still takes 25.08 seconds to read, despite its
>>>> small size.
>>>>
>>>> I neglected to mention that I'm using R 2.13.0 and I'm on a
>>>> Windows 7 machine (not that it should particularly matter
>>>> with this type of data / functions).
>>>>
>>>> ## The code:
>>>> options(stringsAsFactors=FALSE)
>>>> system.time(dat <- read.table('test2.txt', nrows=-1, sep='\t',
>>>>                               header=TRUE))
>>>> str(dat, 0)
>>>>
>>>>
>>>> Thanks again!
>>>>
>>>>
>>>>
>>>> On Wed, Dec 7, 2011 at 1:21 AM, peter dalgaard
>>>> <pdalgd at gmail.com> wrote:
>>>>
>>>>>
>>>>> On Dec 6, 2011, at 22:33 , Gene Leynes wrote:
>>>>>
>>>>>> Mark,
>>>>>>
>>>>>> Thanks for your suggestions.
>>>>>>
>>>>>> That's a good idea about the NULL columns; I didn't think
>>>>>> of that. Surprisingly, it didn't have any effect on the
>>>>>> time.
>>>>>
>>>>> Hmm, I think you want "character" and "NULL" there (i.e.,
>>>>> quoted). Did you fix both?
>>>>>
>>>>>>> read.table(whatever, as.is=TRUE, colClasses =
>>>>>>> c(rep(character,4), rep(NULL,3696)).
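
For reference, a minimal sketch of the quoted form on a small made-up file. colClasses takes class names as strings: "character" keeps a column as text and "NULL" skips it. Unquoted, character and NULL refer to a function and the NULL object rather than to class names. The file layout and the variable names here are assumptions for illustration only.

```r
# Small hypothetical file: 4 character columns followed by 6 numeric ones.
path <- tempfile(fileext = ".txt")
df <- data.frame(a = c("x", "y"), b = c("p", "q"),
                 c = c("m", "n"), d = c("u", "v"),
                 matrix(1:12, nrow = 2),
                 stringsAsFactors = FALSE)
write.table(df, path, sep = "\t", row.names = FALSE, quote = FALSE)

# Quoted class names: keep the first four columns as text, skip the rest.
kept <- read.table(path, sep = "\t", header = TRUE,
                   colClasses = c(rep("character", 4), rep("NULL", 6)))
str(kept)
```

For the roughly 3700-column file discussed in this thread, the second rep() count would be 3696 instead of 6.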
>>>>>
>>>>> As a general matter, if you want people to dig into this,
>>>>> they need some paraphrase of the file to play with. Would it
>>>>> be possible to set up a small R program that generates a data
>>>>> file which displays the issue? Everything I try seems to take
>>>>> about a second to read in.
>>>>>
>>>>> -pd
>>>>>
>>>>>>
>>>>>> This problem was just a curiosity; I already did the
>>>>>> import using Excel and VBA. I was just going to illustrate
>>>>>> the power and simplicity of R, but ironically it's been
>>>>>> much slower and harder in R... The VBA was painful and
>>>>>> messy, and took me over an hour to write; but at least it
>>>>>> worked quickly and reliably. The R code was clean and only
>>>>>> took me about 5 minutes to write, but the run time was
>>>>>> prohibitively slow!
>>>>>>
>>>>>> I profiled the code, but that offers little insight to
>>>>>> me.
>>>>>>
>>>>>> Profile results with 10 line file:
>>>>>>
>>>>>>> summaryRprof("C:/Users/gene.leynes/Desktop/test.out")
>>>>>> $by.self
>>>>>>              self.time self.pct total.time total.pct
>>>>>> scan             12.24    53.50      12.24     53.50
>>>>>> read.table       10.58    46.24      22.88    100.00
>>>>>> type.convert      0.04     0.17       0.04      0.17
>>>>>> make.names        0.02     0.09       0.02      0.09
>>>>>>
>>>>>> $by.total
>>>>>>              total.time total.pct self.time self.pct
>>>>>> read.table        22.88    100.00     10.58    46.24
>>>>>> scan              12.24     53.50     12.24    53.50
>>>>>> type.convert       0.04      0.17      0.04     0.17
>>>>>> make.names         0.02      0.09      0.02     0.09
>>>>>>
>>>>>> $sample.interval
>>>>>> [1] 0.02
>>>>>>
>>>>>> $sampling.time
>>>>>> [1] 22.88
>>>>>>
>>>>>>
>>>>>> Profile results with 250 line file:
>>>>>>
>>>>>>> summaryRprof("C:/Users/gene.leynes/Desktop/test.out")
>>>>>> $by.self
>>>>>>              self.time self.pct total.time total.pct
>>>>>> scan             23.88    68.15      23.88     68.15
>>>>>> read.table       10.78    30.76      35.04    100.00
>>>>>> type.convert      0.30     0.86       0.32      0.91
>>>>>> character         0.02     0.06       0.02      0.06
>>>>>> file              0.02     0.06       0.02      0.06
>>>>>> lapply            0.02     0.06       0.02      0.06
>>>>>> unlist            0.02     0.06       0.02      0.06
>>>>>>
>>>>>> $by.total
>>>>>>                total.time total.pct self.time self.pct
>>>>>> read.table          35.04    100.00     10.78    30.76
>>>>>> scan                23.88     68.15     23.88    68.15
>>>>>> type.convert         0.32      0.91      0.30     0.86
>>>>>> sapply               0.04      0.11      0.00     0.00
>>>>>> character            0.02      0.06      0.02     0.06
>>>>>> file                 0.02      0.06      0.02     0.06
>>>>>> lapply               0.02      0.06      0.02     0.06
>>>>>> unlist               0.02      0.06      0.02     0.06
>>>>>> simplify2array       0.02      0.06      0.00     0.00
>>>>>>
>>>>>> $sample.interval
>>>>>> [1] 0.02
>>>>>>
>>>>>> $sampling.time
>>>>>> [1] 35.04
>>>>>>
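
For anyone wanting to produce a comparable profile, a minimal sketch follows. It assumes R was built with profiling support. The generated file, the repeat count and the variable names are arbitrary choices for illustration, not the setup used above.

```r
# Hypothetical data file: 1000 rows by 100 numeric columns.
path <- tempfile(fileext = ".txt")
write.table(matrix(runif(1000 * 100), nrow = 1000), path, sep = "\t",
            row.names = FALSE, quote = FALSE)

prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.02)    # start sampling every 20 ms
for (i in 1:20) {                    # repeat so the sampler sees enough work
  dat <- read.table(path, sep = "\t", header = TRUE)
}
Rprof(NULL)                          # stop profiling

prof <- summaryRprof(prof_file)
head(prof$by.self)                   # functions ranked by their own time
```

The by.self table shows where time is spent inside each function itself, while by.total also counts time spent in the functions it calls.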
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Dec 6, 2011 at 2:34 PM, Mark Leeds
>>>>>> <markleeds2 at gmail.com> wrote:
>>>>>>
>>>>>>> hi gene: maybe someone else will reply with some
>>>>>>> subtleties that I'm not aware of. one other thing that
>>>>>>> might help: if you know which columns you want, you can set
>>>>>>> the others to NULL through colClasses and this should speed
>>>>>>> things up also. For example, say you knew you only wanted
>>>>>>> the first four columns and they were character. then you
>>>>>>> could do,
>>>>>>>
>>>>>>> read.table(whatever, as.is=TRUE, colClasses =
>>>>>>> c(rep(character,4), rep(NULL,3696)).
>>>>>>>
>>>>>>> hopefully someone else will say something that does the
>>>>>>> trick. the difference in timings seems odd to me. good
>>>>>>> luck.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Dec 6, 2011 at 1:55 PM, Gene Leynes
>>>>>>> <gleynes at gmail.com> wrote:
>>>>>>>
>>>>>>>> Mark,
>>>>>>>>
>>>>>>>> Thank you for the reply
>>>>>>>>
>>>>>>>> I neglected to mention that I had already set
>>>>>>>> options(stringsAsFactors=FALSE)
>>>>>>>>
>>>>>>>> I agree, skipping the factor determination can help
>>>>>>>> performance.
>>>>>>>>
>>>>>>>> The main reason that I wanted to use read.table is
>>>>>>>> because it will correctly determine the column classes for
>>>>>>>> me. I don't really want to specify 3700 column classes!
>>>>>>>> (I'm not sure what they are anyway).
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Dec 6, 2011 at 12:40 PM, Mark Leeds
>>>>>>>> <markleeds2 at gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Gene: Sometimes using colClasses in read.table can
>>>>>>>>> speed things up. If you know what your variables are ahead
>>>>>>>>> of time and what you want them to be, this allows you to
>>>>>>>>> be specific by specifying character or numeric, etc., and
>>>>>>>>> often it makes things faster. others will have more to say.
>>>>>>>>>
>>>>>>>>> also, if most of your variables are characters, R will try
>>>>>>>>> to convert them into factors by default. If you use
>>>>>>>>> as.is = TRUE it won't do this and that might speed things
>>>>>>>>> up also.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Rejoinder: above tidbits are just from experience. I don't
>>>>>>>>> know if it's in stone or a hard and fast rule.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Dec 6, 2011 at 1:15 PM, Gene Leynes
>>>>>>>>> <gleynes at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> ** Disclaimer: I'm looking for general suggestions **
>>>>>>>>>> I'm sorry, but I can't send out the file I'm using, so
>>>>>>>>>> there is no reproducible example.
>>>>>>>>>>
>>>>>>>>>> I'm using read.table and it's taking over 30 seconds to
>>>>>>>>>> read a tiny file. The strange thing is that it takes
>>>>>>>>>> roughly the same amount of time if the file is 100 times
>>>>>>>>>> larger.
>>>>>>>>>>
>>>>>>>>>> After re-reviewing the Data Import/Export manual I think
>>>>>>>>>> the best approach would be to use Python, or perhaps the
>>>>>>>>>> readLines function, but I was hoping to understand why the
>>>>>>>>>> simple read.table approach wasn't working as expected.
>>>>>>>>>>
>>>>>>>>>> Some relevant facts:
>>>>>>>>>>
>>>>>>>>>> 1. There are about 3700 columns. Maybe this is the
>>>>>>>>>>    problem? Still the file size is not very large.
>>>>>>>>>> 2. The file encoding is ANSI, but I'm not specifying that
>>>>>>>>>>    in the function. Setting fileEncoding="ANSI" produces
>>>>>>>>>>    an "unsupported conversion" error
>>>>>>>>>> 3. readLines imports the lines quickly
>>>>>>>>>> 4. scan imports the file quickly also
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Obviously, scan and readLines would require more coding
>>>>>>>>>> to identify columns, etc.
>>>>>>>>>>
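
To give a sense of the extra coding a readLines route involves, here is a minimal sketch on a tiny made-up file. It assumes tab-separated fields, no quoting, no embedded tabs and no empty trailing fields (strsplit drops a trailing empty field). The column names and values are invented for illustration.

```r
# Tiny hypothetical tab-separated file with mixed column types.
path <- tempfile(fileext = ".txt")
writeLines(c("id\tname\tflag\tvalue",
             "1\talpha\tTRUE\t0.5",
             "2\tbeta\tFALSE\t1.5"), path)

lines  <- readLines(path)
fields <- strsplit(lines, "\t", fixed = TRUE)
header <- fields[[1]]
rows   <- do.call(rbind, fields[-1])   # character matrix, one row per line

# Let type.convert pick a class for each column, as read.table does.
cols <- lapply(seq_along(header),
               function(j) type.convert(rows[, j], as.is = TRUE))
names(cols) <- header
dat <- as.data.frame(cols, stringsAsFactors = FALSE)
str(dat)
```

This keeps automatic class detection, but it gives up read.table's handling of quotes, comments and ragged lines.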
>>>>>>>>>> my code:
>>>>>>>>>> system.time(dat <- read.table('C:/test.txt', nrows=-1,
>>>>>>>>>>                               sep='\t', header=TRUE))
>>>>>>>>>>
>>>>>>>>>> It's taking 33.4 seconds and the file size is
>>>>>>>>>> only 315 kb!
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>> Gene
>>>>>>>>>>
>>>>>>>>>> [[alternative HTML version deleted]]
>>>>>>>>>>
>>>>>>>>>> ______________________________________________
>>>>>>>>>> R-help at r-project.org mailing list
>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>>>>> PLEASE do read the posting guide
>>>>>>>>>> http://www.R-project.org/posting-guide.html and
>>>>>>>>>> provide commented, minimal, self-contained,
>>>>>>>>>> reproducible
> code.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>> -- Peter Dalgaard, Professor, Center for Statistics,
>>>>> Copenhagen Business School Solbjerg Plads 3, 2000
>>>>> Frederiksberg, Denmark Phone: (+45)38153501 Email:
>>>>> pd.mes at cbs.dk Priv: PDalgd at gmail.com
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>
>>
>
- --
Rainer M. Krug, PhD (Conservation Ecology, SUN), MSc (Conservation
Biology, UCT), Dipl. Phys. (Germany)
Centre of Excellence for Invasion Biology
Stellenbosch University
South Africa
Tel : +33 - (0)9 53 10 27 44
Cell: +33 - (0)6 85 62 59 98
Fax : +33 - (0)9 58 10 27 44
Fax (D): +49 - (0)3 21 21 25 22 44
email: Rainer at krugs.de
Skype: RMkrug