[R] the first and last observation for each subject

Mon Jan 5 22:37:30 CET 2009

Here's some more timing's of Bill's function.  Although in this
example sapply has a clear performance advantage for smaller numbers
of groups (k) , gm is substantially faster for k >> 1000:

 gm <- function(x, group){ # medians by group:
   o<-order(group, x)
   group <- group[o]
   x <- x[o]
   changes <- group[-1] != group[-length(group)]
   first <- which(c(TRUE, changes))
   last <- which(c(changes, TRUE))
   lowerMedian <- x[floor((first+last)/2)]
   upperMedian <- x[ceiling((first+last)/2)]
   median <- (lowerMedian+upperMedian)/2
   names(median) <- group[first]
   median
 }

 ##

 for(k in 10^(1:6)){
  group<-sample(1:k, size=100000, replace=TRUE)
  x<-rnorm(length(group))*10 + group
  cat('number of groups:', k, '\n')
  cat('sapply\n')
  print(s <- unix.time(sapply(split(x,group), median)))
  cat('gm\n')
  print(g <- unix.time(-gm(x,group)))
  cat(' sapply/gm user ratio:', s[1]/g[1], '\n\n\n')
 }

number of groups: 10
sapply
   user  system elapsed
   0.01    0.00    0.01
gm
   user  system elapsed
   0.14    0.00    0.16
 sapply/gm user ratio: 0.07142857

number of groups: 100
sapply
   user  system elapsed
   0.02    0.00    0.02
gm
   user  system elapsed
   0.11    0.00    0.09
 sapply/gm user ratio: 0.1818182

number of groups: 1000
sapply
   user  system elapsed
   0.11    0.00    0.11
gm
   user  system elapsed
   0.11    0.00    0.11
 sapply/gm user ratio: 1

number of groups: 10000
sapply
   user  system elapsed
   1.00    0.00    1.04
gm
   user  system elapsed
   0.10    0.00    0.09
 sapply/gm user ratio: 10

number of groups: 1e+05
sapply
   user  system elapsed
   6.24    0.01    6.31
gm
   user  system elapsed
   0.16    0.00    0.17
 sapply/gm user ratio: 39

number of groups: 1e+06
sapply
   user  system elapsed
   9.00    0.03    8.92
gm
   user  system elapsed
   0.15    0.00    0.20
 sapply/gm user ratio: 60

> R.Version()
$platform
[1] "i386-pc-mingw32"

$arch
[1] "i386"

$os
[1] "mingw32"

$system
[1] "i386, mingw32"

$status
[1] ""

$major
[1] "2"

$minor
[1] "8.0"

$year
[1] "2008"

$month
[1] "10"

$day
[1] "20"

$`svn rev`
[1] "46754"

$language
[1] "R"

$version.string
[1] "R version 2.8.0 (2008-10-20)"

Kingsford Jones

On Mon, Jan 5, 2009 at 10:18 AM, William Dunlap <wdunlap at tibco.com> wrote:
> Arg, the 'sapply(...)' in the function was in the initial
> comment,
>   gm <- function(x, group){ # medians by group:
> sapply(split(x,group),median)
> but someone's mailer put a newline before the sapply
>   gm <- function(x, group){ # medians by group:
>   sapply(split(x,group),median)
> so it got executed, wasting all that time.  (Of course the
> same mailer will mess up this message in the same way -
> just delete the sapply() call from gm().)
>
> Bill Dunlap
> TIBCO Software Inc - Spotfire Division
> wdunlap tibco.com
>
>> -----Original Message-----
>> From: hadley wickham [mailto:h.wickham at gmail.com]
>> Sent: Monday, January 05, 2009 9:10 AM
>> To: William Dunlap
>> Cc: gallon.li at gmail.com; R help
>> Subject: Re: [R] the first and last observation for each subject
>>
>> > Another application of that technique can be used to quickly compute
>> > medians by groups:
>> >
>> > gm <- function(x, group){ # medians by group:
>> > sapply(split(x,group),median)
>> >   o<-order(group, x)
>> >   group <- group[o]
>> >   x <- x[o]
>> >   changes <- group[-1] != group[-length(group)]
>> >   first <- which(c(TRUE, changes))
>> >   last <- which(c(changes, TRUE))
>> >   lowerMedian <- x[floor((first+last)/2)]
>> >   upperMedian <- x[ceiling((first+last)/2)]
>> >   median <- (lowerMedian+upperMedian)/2
>> >   names(median) <- group[first]
>> >   median
>> > }
>> >
>> > For a 10^5 long x and a somewhat fewer than 3*10^4 distinct groups
>> > (in random order) the times are:
>> >
>> >> group<-sample(1:30000, size=100000, replace=TRUE)
>> >> x<-rnorm(length(group))*10 + group
>> >> unix.time(z0<-sapply(split(x,group), median))
>> >   user  system elapsed
>> >   2.72    0.00    3.20
>> >> unix.time(z1<-gm(x,group))
>> >   user  system elapsed
>> >   0.12    0.00    0.16
>> >> identical(z1,z0)
>> > [1] TRUE
>>
>> I get:
>>
>> > unix.time(z0<-sapply(split(x,group), median))
>>    user  system elapsed
>>   2.733   0.017   2.766
>> > unix.time(z1<-gm(x,group))
>>    user  system elapsed
>>   2.897   0.032   2.946
>>
>>
>> Hadley
>>
>>
>> --
>> http://had.co.nz/
>>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>