[R] the first and last observation for each subject
Kingsford Jones
kingsfordjones at gmail.com
Mon Jan 5 23:20:03 CET 2009
whoops -- I left the group size unchanged so k became greather than
the length of the group vector. When I increase the size to 1e7,
sapply is faster until it gets to k = 1e6.
warning: this takes awhile (particularly on my machine which seems to
be using just 1 of it's 2 cpus)
> for(k in 10^(1:6)){
+ group<-sample(1:k, size=1e7, replace=TRUE)
+ x<-rnorm(length(group))*10 + group
+ cat('number of groups:', k, '\n')
+ cat('sapply\n')
+ print(s <- unix.time(sapply(split(x,group), median)))
+ cat('gm\n')
+ print(g <- unix.time(-gm(x,group)))
+ cat(' sapply/gm user ratio:', s[1]/g[1], '\n\n\n')
+ }
number of groups: 10
sapply
user system elapsed
1.18 0.38 1.56
gm
user system elapsed
53.43 0.70 55.05
sapply/gm user ratio: 0.02208497
number of groups: 100
sapply
user system elapsed
1.17 0.23 1.42
gm
user system elapsed
49.89 0.61 51.60
sapply/gm user ratio: 0.02345159
number of groups: 1000
sapply
user system elapsed
1.29 0.25 1.72
gm
user system elapsed
45.23 0.34 46.55
sapply/gm user ratio: 0.02852089
number of groups: 10000
sapply
user system elapsed
2.80 0.09 2.87
gm
user system elapsed
41.04 0.55 42.85
sapply/gm user ratio: 0.06822612
number of groups: 1e+05
sapply
user system elapsed
14.65 0.16 15.18
gm
user system elapsed
38.28 0.65 39.55
sapply/gm user ratio: 0.3827064
number of groups: 1e+06
sapply
user system elapsed
126.63 0.42 129.21
gm
user system elapsed
37.56 0.84 38.78
sapply/gm user ratio: 3.371406
On Mon, Jan 5, 2009 at 2:37 PM, Kingsford Jones
<kingsfordjones at gmail.com> wrote:
> Here's some more timing's of Bill's function. Although in this
> example sapply has a clear performance advantage for smaller numbers
> of groups (k) , gm is substantially faster for k >> 1000:
>
> gm <- function(x, group){ # medians by group:
> o<-order(group, x)
> group <- group[o]
> x <- x[o]
> changes <- group[-1] != group[-length(group)]
> first <- which(c(TRUE, changes))
> last <- which(c(changes, TRUE))
> lowerMedian <- x[floor((first+last)/2)]
> upperMedian <- x[ceiling((first+last)/2)]
> median <- (lowerMedian+upperMedian)/2
> names(median) <- group[first]
> median
> }
>
> ##
>
> for(k in 10^(1:6)){
> group<-sample(1:k, size=100000, replace=TRUE)
> x<-rnorm(length(group))*10 + group
> cat('number of groups:', k, '\n')
> cat('sapply\n')
> print(s <- unix.time(sapply(split(x,group), median)))
> cat('gm\n')
> print(g <- unix.time(-gm(x,group)))
> cat(' sapply/gm user ratio:', s[1]/g[1], '\n\n\n')
> }
>
> number of groups: 10
> sapply
> user system elapsed
> 0.01 0.00 0.01
> gm
> user system elapsed
> 0.14 0.00 0.16
> sapply/gm user ratio: 0.07142857
>
>
> number of groups: 100
> sapply
> user system elapsed
> 0.02 0.00 0.02
> gm
> user system elapsed
> 0.11 0.00 0.09
> sapply/gm user ratio: 0.1818182
>
>
> number of groups: 1000
> sapply
> user system elapsed
> 0.11 0.00 0.11
> gm
> user system elapsed
> 0.11 0.00 0.11
> sapply/gm user ratio: 1
>
>
> number of groups: 10000
> sapply
> user system elapsed
> 1.00 0.00 1.04
> gm
> user system elapsed
> 0.10 0.00 0.09
> sapply/gm user ratio: 10
>
>
> number of groups: 1e+05
> sapply
> user system elapsed
> 6.24 0.01 6.31
> gm
> user system elapsed
> 0.16 0.00 0.17
> sapply/gm user ratio: 39
>
>
> number of groups: 1e+06
> sapply
> user system elapsed
> 9.00 0.03 8.92
> gm
> user system elapsed
> 0.15 0.00 0.20
> sapply/gm user ratio: 60
>
>
>> R.Version()
> $platform
> [1] "i386-pc-mingw32"
>
> $arch
> [1] "i386"
>
> $os
> [1] "mingw32"
>
> $system
> [1] "i386, mingw32"
>
> $status
> [1] ""
>
> $major
> [1] "2"
>
> $minor
> [1] "8.0"
>
> $year
> [1] "2008"
>
> $month
> [1] "10"
>
> $day
> [1] "20"
>
> $`svn rev`
> [1] "46754"
>
> $language
> [1] "R"
>
> $version.string
> [1] "R version 2.8.0 (2008-10-20)"
>
>
> Kingsford Jones
>
>
>
> On Mon, Jan 5, 2009 at 10:18 AM, William Dunlap <wdunlap at tibco.com> wrote:
>> Arg, the 'sapply(...)' in the function was in the initial
>> comment,
>> gm <- function(x, group){ # medians by group:
>> sapply(split(x,group),median)
>> but someone's mailer put a newline before the sapply
>> gm <- function(x, group){ # medians by group:
>> sapply(split(x,group),median)
>> so it got executed, wasting all that time. (Of course the
>> same mailer will mess up this message in the same way -
>> just delete the sapply() call from gm().)
>>
>> Bill Dunlap
>> TIBCO Software Inc - Spotfire Division
>> wdunlap tibco.com
>>
>>> -----Original Message-----
>>> From: hadley wickham [mailto:h.wickham at gmail.com]
>>> Sent: Monday, January 05, 2009 9:10 AM
>>> To: William Dunlap
>>> Cc: gallon.li at gmail.com; R help
>>> Subject: Re: [R] the first and last observation for each subject
>>>
>>> > Another application of that technique can be used to quickly compute
>>> > medians by groups:
>>> >
>>> > gm <- function(x, group){ # medians by group:
>>> > sapply(split(x,group),median)
>>> > o<-order(group, x)
>>> > group <- group[o]
>>> > x <- x[o]
>>> > changes <- group[-1] != group[-length(group)]
>>> > first <- which(c(TRUE, changes))
>>> > last <- which(c(changes, TRUE))
>>> > lowerMedian <- x[floor((first+last)/2)]
>>> > upperMedian <- x[ceiling((first+last)/2)]
>>> > median <- (lowerMedian+upperMedian)/2
>>> > names(median) <- group[first]
>>> > median
>>> > }
>>> >
>>> > For a 10^5 long x and a somewhat fewer than 3*10^4 distinct groups
>>> > (in random order) the times are:
>>> >
>>> >> group<-sample(1:30000, size=100000, replace=TRUE)
>>> >> x<-rnorm(length(group))*10 + group
>>> >> unix.time(z0<-sapply(split(x,group), median))
>>> > user system elapsed
>>> > 2.72 0.00 3.20
>>> >> unix.time(z1<-gm(x,group))
>>> > user system elapsed
>>> > 0.12 0.00 0.16
>>> >> identical(z1,z0)
>>> > [1] TRUE
>>>
>>> I get:
>>>
>>> > unix.time(z0<-sapply(split(x,group), median))
>>> user system elapsed
>>> 2.733 0.017 2.766
>>> > unix.time(z1<-gm(x,group))
>>> user system elapsed
>>> 2.897 0.032 2.946
>>>
>>>
>>> Hadley
>>>
>>>
>>> --
>>> http://had.co.nz/
>>>
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
More information about the R-help
mailing list