[R] loop vs. apply(): strange behavior with data frame?
Roberto Perdisci
roberto.perdisci at gmail.com
Thu Oct 22 20:01:25 CEST 2009
Thanks for the suggestion.
I found some documentation on why accessing a data.frame using
matrix notation (e.g., [i,j]) is so expensive, which was the cause of
the problem.
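A minimal sketch of the workaround this suggests, assuming every column is numeric: coerce the data.frame to a matrix once, then index inside the loop. The name dist.for.matrix is made up for this illustration and is not from the thread. apply() performs the same coercion internally before iterating over rows, which is likely why dist.apply was not affected.

```r
# Same nested-loop distance as dist.for, but the input is coerced to a
# numeric matrix once, so m[i, ] inside the loop is plain matrix indexing
# rather than a call to the `[.data.frame` method.
dist.for.matrix <- function(data) {   # hypothetical name, for illustration
  m <- as.matrix(data)                # assumes all columns are numeric
  r <- nrow(m)
  n <- ncol(m)
  d <- matrix(0, nrow = r, ncol = r)
  for (i in seq_len(r)) {
    for (j in seq_len(r)) {
      d[i, j] <- sum(abs(m[i, ] - m[j, ])) / n
    }
  }
  as.dist(d)
}

set.seed(1)
x <- as.data.frame(matrix(rnorm(100 * 10), nrow = 100))
print(system.time(d1 <- dist.for.matrix(x)))

# Cross-check against the built-in: the mean absolute difference is the
# Manhattan distance divided by the number of columns.
d2 <- dist(x, method = "manhattan") / ncol(x)
stopifnot(isTRUE(all.equal(as.vector(d1), as.vector(d2))))
```

If the data only needs this one distance, calling dist() directly as in the cross-check avoids the R-level loops altogether.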
regards,
Roberto
On Thu, Oct 22, 2009 at 12:05 AM, Jim Holtman <jholtman at gmail.com> wrote:
> Try running Rprof on the two examples to see what the difference is. What
> you will probably see is that much of the time on the data frame is spent
> accessing it like a matrix ('['). Rprof is very helpful for seeing where
> time is spent in your scripts.
>
> Sent from my iPhone
>
> On Oct 21, 2009, at 17:17, Roberto Perdisci <roberto.perdisci at gmail.com>
> wrote:
>
>> Hi everybody,
>> I noticed a strange behavior when using loops versus apply() on a data
>> frame.
>> The example below "explicitly" computes a distance matrix given a
>> dataset. When the dataset is a matrix, everything works fine. But when
>> the dataset is a data.frame, the dist.for function written using
>> nested loops takes much longer than dist.apply.
>>
>> ######## USING FOR #######
>>
>> dist.for <- function(data) {
>>
>>   d <- matrix(0,nrow=nrow(data),ncol=nrow(data))
>>   n <- ncol(data)
>>   r <- nrow(data)
>>
>>   for(i in 1:r) {
>>     for(j in 1:r) {
>>       d[i,j] <- sum(abs(data[i,]-data[j,]))/n
>>     }
>>   }
>>
>>   return(as.dist(d))
>> }
>>
>> ######## USING APPLY #######
>>
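>> # distances from one row to every row of data.rest
>> f <- function(data.row,data.rest) {
>>   return(as.double(apply(data.rest,1,g,data.row)))
>> }
>>
>> # mean absolute difference between two rows
>> g <- function(row2,row1) {
>>   return(sum(abs(row1-row2))/length(row1))
>> }
>>
>> dist.apply <- function(data) {
>>   # apply() converts a data.frame to a matrix before iterating over rows
>>   d <- apply(data,1,f,data)
>>
>>   return(as.dist(d))
>> }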
>>
>>
>> ######## TESTING #######
>>
>> library(mvtnorm)
>> data <- rmvnorm(100,mean=seq(1,10),sigma=diag(1,nrow=10,ncol=10))
>>
>> tf <- system.time(df <- dist.for(data))
>> ta <- system.time(da <- dist.apply(data))
>>
>> print(paste('diff = ',sum(as.matrix(df) - as.matrix(da))))
>> print("tf = ")
>> print(tf)
>> print("ta = ")
>> print(ta)
>>
>> print('----------------------------------')
>> print('Same experiment on data.frame...')
>> data2 <- as.data.frame(data)
>>
>> tf <- system.time(df <- dist.for(data2))
>> ta <- system.time(da <- dist.apply(data2))
>>
>> print(paste('diff = ',sum(as.matrix(df) - as.matrix(da))))
>> print("tf = ")
>> print(tf)
>> print("ta = ")
>> print(ta)
>>
>> ########################
>>
>> Here is the output I get on my system (R version 2.7.1 on Debian lenny):
>>
>> [1] "diff = 0"
>> [1] "tf = "
>>    user  system elapsed
>>   0.088   0.000   0.087
>> [1] "ta = "
>>    user  system elapsed
>>   0.128   0.000   0.128
>> [1] "----------------------------------"
>> [1] "Same experiment on data.frame..."
>> [1] "diff = 0"
>> [1] "tf = "
>>    user  system elapsed
>>  35.031   0.000  35.029
>> [1] "ta = "
>>    user  system elapsed
>>   0.184   0.000   0.185
>>
>> Could you explain why that happens?
>>
>> thank you,
>> regards
>>
>> Roberto
>
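Jim Holtman's Rprof suggestion in the quoted reply could be tried along these lines. This is only a sketch: the data here is random, the row count is reduced to keep the run short, and the exact function names and timings in the profile will vary by R version and machine. If the profiled code finishes too quickly, Rprof may record no samples, so the loop should run for at least a few sampling intervals.

```r
# Profile the nested-loop access pattern from dist.for on a data.frame.
set.seed(1)
df <- as.data.frame(matrix(rnorm(60 * 10), nrow = 60))

prof.file <- tempfile()          # temporary file for the profiler output
Rprof(prof.file)                 # start sampling the call stack
r <- nrow(df)
n <- ncol(df)
d <- matrix(0, nrow = r, ncol = r)
for (i in seq_len(r)) {
  for (j in seq_len(r)) {
    d[i, j] <- sum(abs(df[i, ] - df[j, ])) / n
  }
}
Rprof(NULL)                      # stop profiling

# Functions ranked by total time; data.frame methods such as
# `[.data.frame` are likely to appear near the top.
prof <- summaryRprof(prof.file)
print(head(prof$by.total, 10))
unlink(prof.file)
```

Running the same loop on as.matrix(df) and comparing the two summaries shows where the extra time goes.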