[R] loop vs. apply(): strange behavior with data frame?
Jim Holtman
jholtman at gmail.com
Thu Oct 22 06:05:36 CEST 2009
Try running Rprof on the two examples to see where the difference is.
You will probably find that, for the data frame, much of the time is
spent indexing it like a matrix ('['). Rprof is very helpful for
seeing where time is spent in your scripts.
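
A rough sketch of that profiling run follows. The function row_loop()
and the random data are illustrative stand-ins, not the poster's code;
they only mimic the access pattern of the nested loop in dist.for()
(two row extractions with '[' per iteration). The last lines assume
that converting the data frame once with as.matrix() is acceptable
for the data at hand, i.e. that all columns are numeric.

```r
# Sketch only: row_loop() and the random data are hypothetical stand-ins
# for the poster's dist.for() and rmvnorm() data.
set.seed(1)
m  <- matrix(rnorm(1000), nrow = 100, ncol = 10)
df <- as.data.frame(m)

# Same access pattern as the nested loop in dist.for():
# two row extractions with '[' per iteration.
row_loop <- function(data) {
  r <- nrow(data)
  total <- 0
  for (i in 1:r) {
    for (j in 1:r) {
      total <- total + sum(abs(data[i, ] - data[j, ]))
    }
  }
  total
}

# Profile the data frame case and summarise where the time went.
prof_file <- tempfile()
Rprof(prof_file)
res_df <- row_loop(df)
Rprof(NULL)
prof <- summaryRprof(prof_file)
print(head(prof$by.total, 10))  # check whether "[.data.frame" ranks high

# Converting once up front lets the loop index a plain matrix instead.
res_m <- row_loop(as.matrix(df))
stopifnot(isTRUE(all.equal(res_df, res_m)))
```

apply() coerces a data frame to a matrix with as.matrix() before it
iterates, which is presumably why dist.apply is barely affected by the
input type while the explicit loop is.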
On Oct 21, 2009, at 17:17, Roberto Perdisci
<roberto.perdisci at gmail.com> wrote:
> Hi everybody,
> I noticed strange behavior when using loops versus apply() on a
> data frame.
> The example below "explicitly" computes a distance matrix given a
> dataset. When the dataset is a matrix, everything works fine. But when
> the dataset is a data.frame, the dist.for function, written using
> nested loops, takes much longer than dist.apply.
>
> ######## USING FOR #######
>
> dist.for <- function(data) {
>
>   d <- matrix(0,nrow=nrow(data),ncol=nrow(data))
>   n <- ncol(data)
>   r <- nrow(data)
>
>   for(i in 1:r) {
>     for(j in 1:r) {
>       d[i,j] <- sum(abs(data[i,]-data[j,]))/n
>     }
>   }
>
>   return(as.dist(d))
> }
>
> ######## USING APPLY #######
>
> f <- function(data.row,data.rest) {
>
>   r2 <- as.double(apply(data.rest,1,g,data.row))
>
> }
>
> g <- function(row2,row1) {
>   return(sum(abs(row1-row2))/length(row1))
> }
>
> dist.apply <- function(data) {
>   d <- apply(data,1,f,data)
>
>   return(as.dist(d))
> }
>
>
> ######## TESTING #######
>
> library(mvtnorm)
> data <- rmvnorm(100,mean=seq(1,10),sigma=diag(1,nrow=10,ncol=10))
>
> tf <- system.time(df <- dist.for(data))
> ta <- system.time(da <- dist.apply(data))
>
> print(paste('diff = ',sum(as.matrix(df) - as.matrix(da))))
> print("tf = ")
> print(tf)
> print("ta = ")
> print(ta)
>
> print('----------------------------------')
> print('Same experiment on data.frame...')
> data2 <- as.data.frame(data)
>
> tf <- system.time(df <- dist.for(data2))
> ta <- system.time(da <- dist.apply(data2))
>
> print(paste('diff = ',sum(as.matrix(df) - as.matrix(da))))
> print("tf = ")
> print(tf)
> print("ta = ")
> print(ta)
>
> ########################
>
> Here is the output I get on my system (R version 2.7.1 on Debian
> lenny):
>
> [1] "diff = 0"
> [1] "tf = "
>    user  system elapsed
>   0.088   0.000   0.087
> [1] "ta = "
>    user  system elapsed
>   0.128   0.000   0.128
> [1] "----------------------------------"
> [1] "Same experiment on data.frame..."
> [1] "diff = 0"
> [1] "tf = "
>    user  system elapsed
>  35.031   0.000  35.029
> [1] "ta = "
>    user  system elapsed
>   0.184   0.000   0.185
>
> Could you explain why that happens?
>
> thank you,
> regards
>
> Roberto
>