[R] why is nrow() so slow?
David Winsemius
dwinsemius at comcast.net
Tue Sep 15 17:45:03 CEST 2009
On Sep 15, 2009, at 10:48 AM, ivo welch wrote:
> dear R wizards: here is the strange question for the day. It seems to me
> that nrow() is very slow. Let me explain what I mean:
>
> ds= data.frame( NA, x=rnorm(10000) ) ## a sample data set
>
>> system.time( { for (i in 1:10000) NA } ) ## doing nothing takes virtually no time
> user system elapsed
> 0.000 0.000 0.001
>
> ## this is something that should take time; we need to add 10,000 values
> ## 10,000 times
>> system.time( { for (i in 1:10000) mean(ds$x) } )
> user system elapsed
> 0.416 0.001 0.416
>
> ## alas, this should be very fast. it is just reading off an attribute of
> ## ds. it takes almost a quarter of the time of mean()!
>> system.time( { for (i in 1:10000) nrow(ds) } )
> user system elapsed
> 0.124 0.001 0.125
I am guessing that you are coming from a statistical paradigm where there
is an implicit looping construct in a data step. In R you find the number
of rows not with a loop, but with the nrow function used just once.
> ds= data.frame( NA, x=rnorm(10000) )
> system.time(nrow(ds))
user system elapsed
0 0 0
>
> ## here is an alternative way to implement nrows, which is already much faster:
>> system.time( { for (i in 1:10000) length(ds$x) } )
> user system elapsed
> 0.041 0.000 0.041
>
> is there a faster way to learn how big a data frame is?
> length(ds)
[1] 2
> nrow(ds)
[1] 10000
# Or:
> dim(ds)
[1] 10000 2
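For what it is worth, the per-call cost comes from how the count is obtained: nrow(x) is defined as dim(x)[1L], and for a data frame dim() dispatches to dim.data.frame(), which in turn reads the row-names information. The sketch below only illustrates that all three routes return the same number; the variable names are made up for this example, and actual timings will depend on the R version and machine.

```r
# Sample data frame matching the one in the question.
ds <- data.frame(NA, x = rnorm(10000))

# nrow() goes through dim(), which dispatches to dim.data.frame();
# that method dispatch is paid again on every call.
n_via_nrow <- nrow(ds)

# Two cheaper routes to the same count:
n_via_length  <- length(ds$x)             # length of a single column
n_via_rowinfo <- .row_names_info(ds, 2L)  # row count taken from the row-names attribute

print(c(n_via_nrow, n_via_length, n_via_rowinfo))
```

If the count is needed many times for the same data frame, computing it once and storing it in a variable avoids the repeated call altogether.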
> I know this sounds silly, but this is inside a "by" statement, where I
> figure out how many observations are in each subset. strangely, this takes
> a whole lot of time. I don't believe it is possible to ask "by" to attach
> an attribute to the data frame that stores the number of observations that
> it is actually passing.
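One possible approach for the by() case, offered as a minimal sketch rather than a tested recommendation: if only the subset sizes are needed, they can be computed once with table() on the grouping variable, outside by(). The grouping column g below is hypothetical (the original post does not show the real grouping variable), and inside by() the count is taken with length() on one column instead of nrow().

```r
set.seed(1)

# Hypothetical data: g is an assumed grouping column, x is the value column.
ds <- data.frame(g = sample(letters[1:3], 10000, replace = TRUE),
                 x = rnorm(10000))

# Group sizes computed once, in a single vectorized call.
counts_table <- table(ds$g)

# The same counts obtained inside by(), using length() on one column
# of each subset rather than nrow() on the whole subset.
counts_by <- by(ds, ds$g, function(sub) length(sub$x))

print(counts_table)
```

The counts from table() can then be looked up by group name, for example counts_table[["a"]], instead of being recomputed for every subset.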
>
> pointers appreciated.
>
> regards,
>
> /iaw
> --
> Ivo Welch (ivo.welch at brown.edu, ivo.welch at gmail.com)
>
David Winsemius, MD
Heritage Laboratories
West Hartford, CT