[R] read.table performance

Thu Dec 8 09:32:25 CET 2011

Hi

> system.time(dat<-read.table("test2.txt"))
   user  system elapsed 
  32.38    0.00   32.40

> system.time(dat <- read.table('test2.txt', nrows=-1, sep='\t', header=TRUE))
   user  system elapsed 
  32.30    0.03   32.36 

Couldn't it be a Windows issue?
               _  
platform       i386-pc-mingw32  
arch           i386  
os             mingw32  
system         i386, mingw32  
status         Under development (unstable)  
major          2  
minor          14.0  
year           2011  
month          04  
day            27  
svn rev        55657  
language       R  
version.string R version 2.14.0 Under development (unstable) (2011-04-27 r55657)
>

> dim(dat)
[1]    7 3765
>

But judging from dat, the structure of the file seems somewhat odd to me.

> head(names(dat))
[1] "X..Hydrogen" "Helium"      "Lithium"     "Beryllium"   "Boron" 
[6] "Carbon" 
> tail(names(dat))
[1] "Sulfur.32"    "Chlorine.32"  "Argon.32"     "Potassium.32" "Calcium.32" 
[6] "Scandium.32" 
>

There is a row of names with repeating values. Maybe most of the time is 
spent checking the validity of the names.
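
A minimal sketch for testing that idea. It assumes a small synthetic file
with repeated header names as a stand-in for test2.txt (the element names
and sizes are made up for illustration) and compares read.table with and
without check.names:

```r
# Hypothetical stand-in for test2.txt: tab-separated, repeated column names.
set.seed(1)
hdr <- rep(c("Hydrogen", "Helium", "Lithium", "Beryllium"), times = 50)
vals <- matrix(round(runif(7 * length(hdr)), 3), nrow = 7)
tmp <- tempfile(fileext = ".txt")
writeLines(c(paste(hdr, collapse = "\t"),
             apply(vals, 1, paste, collapse = "\t")), tmp)

# How many header entries are repeats of an earlier one?
raw_names <- scan(tmp, what = "", sep = "\t", nlines = 1, quiet = TRUE)
n_dup <- sum(duplicated(raw_names))

# Compare timings with and without name checking.
t_checked   <- system.time(d1 <- read.table(tmp, sep = "\t", header = TRUE))
t_unchecked <- system.time(d2 <- read.table(tmp, sep = "\t", header = TRUE,
                                            check.names = FALSE))
print(rbind(checked = t_checked[1:3], unchecked = t_unchecked[1:3]))
```

If name checking were the bottleneck, the unchecked run should be clearly
faster. On a file this small both timings are likely to be close to zero,
so the comparison only says something when run on the real file.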

Regards
Petr

r-help-bounces at r-project.org wrote on 07.12.2011 23:11:10:

> peter dalgaard <pdalgd at gmail.com> 
> Sent by: r-help-bounces at r-project.org
> Date: 07.12.2011 23:11
> To: "R. Michael Weylandt" <michael.weylandt at gmail.com>
> Cc: r-help at r-project.org, Gene Leynes <gleynes at gmail.com>
> Subject: Re: [R] read.table performance
> 
> 
> On Dec 7, 2011, at 22:37 , R. Michael Weylandt wrote:
> 
> > R 2.13.2 on Mac OS X 10.5.8 takes about 1.8s to read the file
> > verbatim: system.time(read.table("test2.txt"))
> 
> About 2.3s with 2.14 on a 1.86 GHz MacBook Air 10.6.8. 
> 
> Gene, are you by any chance storing the file in a heavily virus-scanned 
> system directory?
> 
> -pd
> 
> > Michael
> > 
> > 2011/12/7 Gene Leynes <gleynes at gmail.com>:
> >> Peter,
> >> 
> >> You're quite right; it's nearly impossible to make progress without a
> >> working example.
> >> 
> >> I created an ** extremely simplified ** example for distribution. The real
> >> data has numeric, character, and boolean classes.
> >> 
> >> The file still takes 25.08 seconds to read, despite its small size.
> >> 
> >> I neglected to mention that I'm using R 2.13.0 and I'm on a Windows 7
> >> machine (not that it should particularly matter with this type of data /
> >> functions).
> >> 
> >> ## The code:
> >> options(stringsAsFactors=FALSE)
> >> system.time(dat <- read.table('test2.txt', nrows=-1, sep='\t', header=TRUE))
> >> str(dat, 0)
> >> 
> >> 
> >> Thanks again!
> >> 
> >> 
> >> 
> >> On Wed, Dec 7, 2011 at 1:21 AM, peter dalgaard <pdalgd at gmail.com> wrote:
> >> 
> >>> 
> >>> On Dec 6, 2011, at 22:33 , Gene Leynes wrote:
> >>> 
> >>>> Mark,
> >>>> 
> >>>> Thanks for your suggestions.
> >>>> 
> >>>> That's a good idea about the NULL columns; I didn't think of that.
> >>>> Surprisingly, it didn't have any effect on the time.
> >>> 
> >>> Hmm, I think you want "character" and "NULL" there (i.e., quoted). Did you
> >>> fix both?
> >>> 
> >>>>> read.table(whatever, as.is=TRUE, colClasses = c(rep(character,4),
> >>>>> rep(NULL,3696)).
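
A minimal sketch of the quoted form, assuming a tiny hypothetical file with
6 columns of which only the first 2 are kept. Unquoted, character is the
function of that name rather than a class name, and rep(NULL, n) is just
NULL, which vanishes inside c():

```r
# Hypothetical stand-in file: 6 tab-separated columns, keep the first 2.
tmp <- tempfile(fileext = ".txt")
writeLines(c("a\tb\tc\td\te\tf",
             "x\ty\t1\t2\t3\t4",
             "u\tv\t5\t6\t7\t8"), tmp)

# Unquoted NULL silently disappears, so the vector is shorter than intended.
n_unquoted <- length(c(rep("character", 2), rep(NULL, 4)))

# Quoted "NULL" is a real element that tells read.table to skip the column.
cc <- c(rep("character", 2), rep("NULL", 4))
dat <- read.table(tmp, sep = "\t", header = TRUE, colClasses = cc)
str(dat)
```

Here n_unquoted has length 2 rather than 6, while the quoted version
returns only columns a and b.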
> >>> 
> >>> As a general matter, if you want people to dig into this, they need some
> >>> paraphrase of the file to play with. Would it be possible to set up a small
> >>> R program that generates a data file which displays the issue? Everything I
> >>> try seems to take about a second to read in.
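
One way such a generator could look, as a rough sketch. The function name
make_wide_file, the column names, and the sizes are assumptions chosen to
mimic "few rows, about 3700 columns"; it is not known to reproduce the
slowdown:

```r
# Sketch of a generator for a wide test file (all sizes are assumptions).
make_wide_file <- function(path, n_rows = 7, n_cols = 3700) {
  set.seed(42)
  hdr <- paste0("V", seq_len(n_cols))
  vals <- matrix(round(rnorm(n_rows * n_cols), 4), nrow = n_rows)
  write.table(vals, file = path, sep = "\t", quote = FALSE,
              row.names = FALSE, col.names = hdr)
  invisible(path)
}

f <- make_wide_file(tempfile(fileext = ".txt"))
print(system.time(dat <- read.table(f, sep = "\t", header = TRUE)))
dim(dat)
```

Varying n_rows and n_cols separately would show whether the read time
depends mostly on the number of columns, as the reported timings suggest.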
> >>> 
> >>> -pd
> >>> 
> >>>> 
> >>>> This problem was just a curiosity, I already did the import using Excel and
> >>>> VBA.  I was just going to illustrate the power and simplicity of R, but
> >>>> ironically it's been much slower and harder in R...
> >>>> The VBA was painful and messy, and took me over an hour to write; but at
> >>>> least it worked quickly and reliably.
> >>>> The R code was clean and only took me about 5 minutes to write, but the run
> >>>> time was prohibitively slow!
> >>>> 
> >>>> I profiled the code, but that offers little insight to me.
> >>>> 
> >>>> Profile results with 10 line file:
> >>>> 
> >>>>> summaryRprof("C:/Users/gene.leynes/Desktop/test.out")
> >>>> $by.self
> >>>>             self.time self.pct total.time total.pct
> >>>> scan             12.24    53.50      12.24     53.50
> >>>> read.table       10.58    46.24      22.88    100.00
> >>>> type.convert      0.04     0.17       0.04      0.17
> >>>> make.names        0.02     0.09       0.02      0.09
> >>>> 
> >>>> $by.total
> >>>>             total.time total.pct self.time self.pct
> >>>> read.table        22.88    100.00     10.58    46.24
> >>>> scan              12.24     53.50     12.24    53.50
> >>>> type.convert       0.04      0.17      0.04     0.17
> >>>> make.names         0.02      0.09      0.02     0.09
> >>>> 
> >>>> $sample.interval
> >>>> [1] 0.02
> >>>> 
> >>>> $sampling.time
> >>>> [1] 22.88
> >>>> 
> >>>> 
> >>>> Profile results with 250 line file:
> >>>> 
> >>>>> summaryRprof("C:/Users/gene.leynes/Desktop/test.out")
> >>>> $by.self
> >>>>             self.time self.pct total.time total.pct
> >>>> scan             23.88    68.15      23.88     68.15
> >>>> read.table       10.78    30.76      35.04    100.00
> >>>> type.convert      0.30     0.86       0.32      0.91
> >>>> character         0.02     0.06       0.02      0.06
> >>>> file              0.02     0.06       0.02      0.06
> >>>> lapply            0.02     0.06       0.02      0.06
> >>>> unlist            0.02     0.06       0.02      0.06
> >>>> 
> >>>> $by.total
> >>>>               total.time total.pct self.time self.pct
> >>>> read.table          35.04    100.00     10.78    30.76
> >>>> scan                23.88     68.15     23.88    68.15
> >>>> type.convert         0.32      0.91      0.30     0.86
> >>>> sapply               0.04      0.11      0.00     0.00
> >>>> character            0.02      0.06      0.02     0.06
> >>>> file                 0.02      0.06      0.02     0.06
> >>>> lapply               0.02      0.06      0.02     0.06
> >>>> unlist               0.02      0.06      0.02     0.06
> >>>> simplify2array       0.02      0.06      0.00     0.00
> >>>> 
> >>>> $sample.interval
> >>>> [1] 0.02
> >>>> 
> >>>> $sampling.time
> >>>> [1] 35.04
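
For reference, summaries like the two above come from Rprof followed by
summaryRprof. A minimal sketch, assuming a synthetic input file (the sizes
and the repeat count are arbitrary and only there to collect enough
samples):

```r
# Hypothetical wide input file, generated only so the sketch is runnable.
f <- tempfile(fileext = ".txt")
m <- matrix(round(runif(250 * 1000), 3), nrow = 250)
write.table(m, f, sep = "\t", quote = FALSE, row.names = FALSE)

prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.02)   # start sampling every 20 ms
for (i in 1:10) dat <- read.table(f, sep = "\t", header = TRUE)
Rprof(NULL)                         # stop profiling

prof <- summaryRprof(prof_file)
head(prof$by.self)
```

The by.self table attributes time to the function actually running when a
sample was taken, which is why scan and read.table dominate in the output
above.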
> >>>> 
> >>>> 
> >>>> 
> >>>> 
> >>>> On Tue, Dec 6, 2011 at 2:34 PM, Mark Leeds <markleeds2 at gmail.com> wrote:
> >>>> 
> >>>>> Hi Gene: maybe someone else will reply with some subtleties that I'm not
> >>>>> aware of. One other thing that might help: if you know which columns you
> >>>>> want, you can set the others to NULL through colClasses, and this should
> >>>>> speed things up also. For example, say you knew you only wanted the first
> >>>>> four columns and they were character. Then you could do,
> >>>>> 
> >>>>> read.table(whatever, as.is=TRUE, colClasses = c(rep(character,4),
> >>>>> rep(NULL,3696)).
> >>>>> 
> >>>>> Hopefully someone else will say something that does the trick. The
> >>>>> difference in timings seems odd to me. Good luck.
> >>>>> 
> >>>>> On Tue, Dec 6, 2011 at 1:55 PM, Gene Leynes <gleynes at gmail.com> wrote:
> >>>>> 
> >>>>>> Mark,
> >>>>>> 
> >>>>>> Thank you for the reply
> >>>>>> 
> >>>>>> I neglected to mention that I had already set
> >>>>>> options(stringsAsFactors=FALSE)
> >>>>>> 
> >>>>>> I agree, skipping the factor determination can help performance.
> >>>>>> 
> >>>>>> The main reason that I wanted to use read.table is because it will
> >>>>>> correctly determine the column classes for me.  I don't really want to
> >>>>>> specify 3700 column classes!  (I'm not sure what they are anyway).
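
One common workaround, sketched here under assumptions: let read.table
guess the classes from the first few rows and reuse them as colClasses for
the full read. The file below is a hypothetical stand-in with mixed
integer, character, logical and numeric columns:

```r
# Hypothetical stand-in for the real file.
tmp <- tempfile(fileext = ".txt")
df <- data.frame(id = 1:200, name = sprintf("n%03d", 1:200),
                 flag = rep(c(TRUE, FALSE), 100), x = seq(0.5, 100, by = 0.5),
                 stringsAsFactors = FALSE)
write.table(df, tmp, sep = "\t", quote = FALSE, row.names = FALSE)

# Step 1: let read.table guess the classes from a small sample of rows.
sample_rows <- read.table(tmp, sep = "\t", header = TRUE, nrows = 20,
                          stringsAsFactors = FALSE)
classes <- vapply(sample_rows, function(col) class(col)[1], character(1))

# Step 2: reuse the guessed classes for the full read.
dat <- read.table(tmp, sep = "\t", header = TRUE, colClasses = classes)
str(dat)
```

The guess is only as good as the sample: a column that looks like integer
in the first 20 rows but contains text or decimals further down will make
the full read fail, so nrows may need to be larger for messy data.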
> >>>>>> 
> >>>>>> 
> >>>>>> On Tue, Dec 6, 2011 at 12:40 PM, Mark Leeds <markleeds2 at gmail.com> wrote:
> >>>>>> 
> >>>>>>> Hi Gene: Sometimes using colClasses in read.table can speed things up.
> >>>>>>> If you know what your variables are ahead of time and what you want them to
> >>>>>>> be, this allows you to be specific by specifying character or numeric,
> >>>>>>> etc., and often it makes things faster. Others will have more to say.
> >>>>>>> 
> >>>>>>> Also, if most of your variables are characters, R will try to
> >>>>>>> convert them into factors by default. If you use as.is = TRUE it won't
> >>>>>>> do this, and that might speed things up also.
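
A small sketch of the difference, using inline text instead of a file. Note
that since R 4.0.0 the default is stringsAsFactors = FALSE, so the
factor conversion described above has to be requested explicitly to see it:

```r
txt <- "name\tvalue\nalpha\t1\nbeta\t2\ngamma\t3"

# Explicitly request the 2011-era default: strings become factors.
d_factor <- read.table(text = txt, sep = "\t", header = TRUE,
                       stringsAsFactors = TRUE)

# as.is = TRUE keeps character columns as plain character vectors.
d_char <- read.table(text = txt, sep = "\t", header = TRUE, as.is = TRUE)

class(d_factor$name)
class(d_char$name)
```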
> >>>>>>> 
> >>>>>>> 
> >>>>>>> Caveat: the above tidbits are just from experience. I don't know if
> >>>>>>> any of it is set in stone or a hard and fast rule.
> >>>>>>> 
> >>>>>>> On Tue, Dec 6, 2011 at 1:15 PM, Gene Leynes <gleynes at gmail.com> wrote:
> >>>>>>> 
> >>>>>>>> ** Disclaimer: I'm looking for general suggestions **
> >>>>>>>> I'm sorry, but I can't send out the file I'm using, so there is no
> >>>>>>>> reproducible example.
> >>>>>>>> 
> >>>>>>>> I'm using read.table and it's taking over 30 seconds to read a tiny
> >>>>>>>> file.
> >>>>>>>> The strange thing is that it takes roughly the same amount of time if
> >>>>>>>> the file is 100 times larger.
> >>>>>>>> 
> >>>>>>>> After re-reviewing the R Data Import/Export manual I think the best
> >>>>>>>> approach would be to use Python, or perhaps the readLines function, but I
> >>>>>>>> was hoping to understand why the simple read.table approach wasn't working
> >>>>>>>> as expected.
> >>>>>>>> 
> >>>>>>>> Some relevant facts:
> >>>>>>>> 
> >>>>>>>>  1. There are about 3700 columns.  Maybe this is the problem? Still, the
> >>>>>>>>     file size is not very large.
> >>>>>>>>  2. The file encoding is ANSI, but I'm not specifying that in the
> >>>>>>>>     function.  Setting fileEncoding="ANSI" produces an "unsupported
> >>>>>>>>     conversion" error.
> >>>>>>>>  3. readLines imports the lines quickly
> >>>>>>>>  4. scan imports the file quickly also
> >>>>>>>> 
> >>>>>>>> 
> >>>>>>>> Obviously, scan and readLines would require more coding to identify
> >>>>>>>> columns, etc.
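
A rough sketch of what that extra coding could look like with readLines,
assuming a hypothetical tab-separated file with a header row and no quoted
or empty fields:

```r
# Hypothetical stand-in file with a header row and tab-separated fields.
tmp <- tempfile(fileext = ".txt")
writeLines(c("a\tb\tc", "1\tx\tTRUE", "2\ty\tFALSE", "3\tz\tTRUE"), tmp)

lines  <- readLines(tmp)
fields <- strsplit(lines, "\t", fixed = TRUE)

# First line holds the column names; the rest are data rows.
hdr  <- fields[[1]]
body <- do.call(rbind, fields[-1])           # character matrix, one row per line
dat  <- as.data.frame(body, stringsAsFactors = FALSE)
names(dat) <- hdr

# Convert each column the way read.table would, keeping strings as character.
dat[] <- lapply(dat, type.convert, as.is = TRUE)
str(dat)
```

This gives up some of read.table's robustness: strsplit drops a trailing
empty field, and rows with differing numbers of fields would make rbind
recycle values instead of failing, so it only suits well-formed files.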
> >>>>>>>> 
> >>>>>>>> my code:
> >>>>>>>> system.time(dat <- read.table('C:/test.txt', nrows=-1, sep='\t', header=TRUE))
> >>>>>>>> 
> >>>>>>>> It's taking 33.4 seconds and the file size is only 315 KB!
> >>>>>>>> 
> >>>>>>>> Thanks
> >>>>>>>> 
> >>>>>>>> Gene
> >>>>>>>> 
> >>>>>>>>       [[alternative HTML version deleted]]
> >>>>>>>> 
> >>>>>>>> ______________________________________________
> >>>>>>>> R-help at r-project.org mailing list
> >>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>>>>>> PLEASE do read the posting guide
> >>>>>>>> http://www.R-project.org/posting-guide.html
> >>>>>>>> and provide commented, minimal, self-contained, reproducible code.
> >>>>>>>> 
> >>>>>>> 
> >>>>>>> 
> >>>>>> 
> >>>>> 
> >>>> 
> >>> 
> >>> --
> >>> Peter Dalgaard, Professor,
> >>> Center for Statistics, Copenhagen Business School
> >>> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> >>> Phone: (+45)38153501
> >>> Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
> >> 
> >> 
> 