[R] the first and last observation for each subject
William Dunlap
wdunlap at tibco.com
Tue Jan 6 18:06:32 CET 2009
Just in case anyone is still interested, here are some
comparisons of the time it says to compute grouped medians
via sapply(split(x,group),median) and gm(x,group), which
uses the trick used by rle() to find the first and last
entries in each group.
Which method is fastest depends on the nature of your data:
the number of groups, k, the length of the data vector, n,
and probably the distribution of group size (which I didn't
look into).
If the overhead involved in calling a function were to go down
we would expect sapply() to look better in the cases where k/n
is close to 1.
Bill
From: William Dunlap
Sent: Monday, January 05, 2009 4:01 PM
To: 'Kingsford Jones'
Subject: RE: [R] the first and last observation for each subject
> -----Original Message-----
> From: Kingsford Jones [mailto:kingsfordjones at gmail.com]
> Sent: Monday, January 05, 2009 3:19 PM
> To: William Dunlap
> Subject: Re: [R] the first and last observation for each subject
>
> Thanks for the explanation -- it's much more satisfying to have some
> theory rather than 'just trying things out'...
>
> Kingsford
Trying things out helps sort out the constants in those 'order of'
assertions. Here are times for various n and k (both ranging over
4^(3:10), with k<=n) for gm and sapply:
> time.gm
k
n 64 256 1024 4096 16384 65536 262144 1048576
64 0.00 NA NA NA NA NA NA NA
256 0.00 0.00 NA NA NA NA NA NA
1024 0.00 0.00 0.00 NA NA NA NA NA
4096 0.00 0.00 0.00 0.00 NA NA NA NA
16384 0.00 0.01 0.02 0.02 0.04 NA NA NA
65536 0.08 0.06 0.07 0.06 0.05 0.09 NA NA
262144 0.36 0.36 0.33 0.34 0.28 0.26 0.42 NA
1048576 3.46 2.83 1.88 1.91 1.96 1.39 1.17 2.2
> time.sapply
k
n 64 256 1024 4096 16384 65536 262144 1048576
64 0.01 NA NA NA NA NA NA NA
256 0.01 0.03 NA NA NA NA NA NA
1024 0.02 0.03 0.09 NA NA NA NA NA
4096 0.00 0.03 0.14 0.41 NA NA NA NA
16384 0.02 0.05 0.14 0.55 1.61 NA NA NA
65536 0.02 0.03 0.14 0.55 2.22 6.54 NA NA
262144 0.03 0.05 0.17 0.60 2.28 8.93 27.44 NA
1048576 0.27 0.16 0.27 0.71 2.41 9.22 37.56 121.55
> time.gm<time.sapply
k
n 64 256 1024 4096 16384 65536 262144 1048576
64 TRUE NA NA NA NA NA NA NA
256 TRUE TRUE NA NA NA NA NA NA
1024 TRUE TRUE TRUE NA NA NA NA NA
4096 FALSE TRUE TRUE TRUE NA NA NA NA
16384 TRUE TRUE TRUE TRUE TRUE NA NA NA
65536 FALSE FALSE TRUE TRUE TRUE TRUE NA NA
262144 FALSE FALSE FALSE TRUE TRUE TRUE TRUE NA
1048576 FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
> round(time.gm/time.sapply,3)
k
n 64 256 1024 4096 16384 65536 262144 1048576
64 0.000 NA NA NA NA NA NA NA
256 0.000 0.000 NA NA NA NA NA NA
1024 0.000 0.000 0.000 NA NA NA NA NA
4096 NaN 0.000 0.000 0.000 NA NA NA NA
16384 0.000 0.200 0.143 0.036 0.025 NA NA NA
65536 4.000 2.000 0.500 0.109 0.023 0.014 NA NA
262144 12.000 7.200 1.941 0.567 0.123 0.029 0.015 NA
1048576 12.815 17.688 6.963 2.690 0.813 0.151 0.031 0.018
>
> On Mon, Jan 5, 2009 at 4:00 PM, William Dunlap
> <wdunlap at tibco.com> wrote:
> > Note that gm fully sorts the whole vector, taking order n*log(n)
> > time, then does an order n subscripting operation, then
> > some order k stuff.
> >
> > The sapply approach does k calls to median, each of which
> > takes order n/k time to partially sort its short input (assuming
> > all groups are about the same size), resulting in order n total
> > sorting time (I think the partial sort is order n, if not it is
> > pretty close to that for n up to 2^25).
> > It also involves some other order n and order k time operations.
> >
> > Thus for a fixed k, as n grows the sapply approach ought to
> > do better, but for a fixed n, as k grows gm ought to do better.
> > gm() only does better in some cases (when k/n is close to 1)
> > because it avoids the function call overhead involved in calling
> > median() many times.
> >
> > Bill Dunlap
> > TIBCO Software Inc - Spotfire Division
> > wdunlap tibco.com
> >
> >> -----Original Message-----
> >> From: Kingsford Jones [mailto:kingsfordjones at gmail.com]
> >> Sent: Monday, January 05, 2009 2:20 PM
> >> To: William Dunlap
> >> Cc: hadley wickham; R help
> >> Subject: Re: [R] the first and last observation for each subject
> >>
> >> whoops -- I left the group size unchanged so k became greather than
> >> the length of the group vector. When I increase the size to 1e7,
> >> sapply is faster until it gets to k = 1e6.
> >>
> >> warning: this takes awhile (particularly on my machine
> which seems to
> >> be using just 1 of it's 2 cpus)
> >>
> >> > for(k in 10^(1:6)){
> >> + group<-sample(1:k, size=1e7, replace=TRUE)
> >> + x<-rnorm(length(group))*10 + group
> >> + cat('number of groups:', k, '\n')
> >> + cat('sapply\n')
> >> + print(s <- unix.time(sapply(split(x,group), median)))
> >> + cat('gm\n')
> >> + print(g <- unix.time(-gm(x,group)))
> >> + cat(' sapply/gm user ratio:', s[1]/g[1], '\n\n\n')
> >> + }
> >> number of groups: 10
> >> sapply
> >> user system elapsed
> >> 1.18 0.38 1.56
> >> gm
> >> user system elapsed
> >> 53.43 0.70 55.05
> >> sapply/gm user ratio: 0.02208497
> >>
> >>
> >> number of groups: 100
> >> sapply
> >> user system elapsed
> >> 1.17 0.23 1.42
> >> gm
> >> user system elapsed
> >> 49.89 0.61 51.60
> >> sapply/gm user ratio: 0.02345159
> >>
> >>
> >> number of groups: 1000
> >> sapply
> >> user system elapsed
> >> 1.29 0.25 1.72
> >> gm
> >> user system elapsed
> >> 45.23 0.34 46.55
> >> sapply/gm user ratio: 0.02852089
> >>
> >>
> >> number of groups: 10000
> >> sapply
> >> user system elapsed
> >> 2.80 0.09 2.87
> >> gm
> >> user system elapsed
> >> 41.04 0.55 42.85
> >> sapply/gm user ratio: 0.06822612
> >>
> >>
> >> number of groups: 1e+05
> >> sapply
> >> user system elapsed
> >> 14.65 0.16 15.18
> >> gm
> >> user system elapsed
> >> 38.28 0.65 39.55
> >> sapply/gm user ratio: 0.3827064
> >>
> >>
> >> number of groups: 1e+06
> >> sapply
> >> user system elapsed
> >> 126.63 0.42 129.21
> >> gm
> >> user system elapsed
> >> 37.56 0.84 38.78
> >> sapply/gm user ratio: 3.371406
> >>
> >> On Mon, Jan 5, 2009 at 2:37 PM, Kingsford Jones
> >> <kingsfordjones at gmail.com> wrote:
> >> > Here's some more timing's of Bill's function. Although in this
> >> > example sapply has a clear performance advantage for
> smaller numbers
> >> > of groups (k) , gm is substantially faster for k >> 1000:
> >> >
> >> > gm <- function(x, group){ # medians by group:
> >> > o<-order(group, x)
> >> > group <- group[o]
> >> > x <- x[o]
> >> > changes <- group[-1] != group[-length(group)]
> >> > first <- which(c(TRUE, changes))
> >> > last <- which(c(changes, TRUE))
> >> > lowerMedian <- x[floor((first+last)/2)]
> >> > upperMedian <- x[ceiling((first+last)/2)]
> >> > median <- (lowerMedian+upperMedian)/2
> >> > names(median) <- group[first]
> >> > median
> >> > }
> >> >
> >> > ##
> >> >
> >> > for(k in 10^(1:6)){
> >> > group<-sample(1:k, size=100000, replace=TRUE)
> >> > x<-rnorm(length(group))*10 + group
> >> > cat('number of groups:', k, '\n')
> >> > cat('sapply\n')
> >> > print(s <- unix.time(sapply(split(x,group), median)))
> >> > cat('gm\n')
> >> > print(g <- unix.time(-gm(x,group)))
> >> > cat(' sapply/gm user ratio:', s[1]/g[1], '\n\n\n')
> >> > }
> >> >
> >> > number of groups: 10
> >> > sapply
> >> > user system elapsed
> >> > 0.01 0.00 0.01
> >> > gm
> >> > user system elapsed
> >> > 0.14 0.00 0.16
> >> > sapply/gm user ratio: 0.07142857
> >> >
> >> >
> >> > number of groups: 100
> >> > sapply
> >> > user system elapsed
> >> > 0.02 0.00 0.02
> >> > gm
> >> > user system elapsed
> >> > 0.11 0.00 0.09
> >> > sapply/gm user ratio: 0.1818182
> >> >
> >> >
> >> > number of groups: 1000
> >> > sapply
> >> > user system elapsed
> >> > 0.11 0.00 0.11
> >> > gm
> >> > user system elapsed
> >> > 0.11 0.00 0.11
> >> > sapply/gm user ratio: 1
> >> >
> >> >
> >> > number of groups: 10000
> >> > sapply
> >> > user system elapsed
> >> > 1.00 0.00 1.04
> >> > gm
> >> > user system elapsed
> >> > 0.10 0.00 0.09
> >> > sapply/gm user ratio: 10
> >> >
> >> >
> >> > number of groups: 1e+05
> >> > sapply
> >> > user system elapsed
> >> > 6.24 0.01 6.31
> >> > gm
> >> > user system elapsed
> >> > 0.16 0.00 0.17
> >> > sapply/gm user ratio: 39
> >> >
> >> >
> >> > number of groups: 1e+06
> >> > sapply
> >> > user system elapsed
> >> > 9.00 0.03 8.92
> >> > gm
> >> > user system elapsed
> >> > 0.15 0.00 0.20
> >> > sapply/gm user ratio: 60
> >> >
> >> >
> >> >> R.Version()
> >> > $platform
> >> > [1] "i386-pc-mingw32"
> >> >
> >> > $arch
> >> > [1] "i386"
> >> >
> >> > $os
> >> > [1] "mingw32"
> >> >
> >> > $system
> >> > [1] "i386, mingw32"
> >> >
> >> > $status
> >> > [1] ""
> >> >
> >> > $major
> >> > [1] "2"
> >> >
> >> > $minor
> >> > [1] "8.0"
> >> >
> >> > $year
> >> > [1] "2008"
> >> >
> >> > $month
> >> > [1] "10"
> >> >
> >> > $day
> >> > [1] "20"
> >> >
> >> > $`svn rev`
> >> > [1] "46754"
> >> >
> >> > $language
> >> > [1] "R"
> >> >
> >> > $version.string
> >> > [1] "R version 2.8.0 (2008-10-20)"
> >> >
> >> >
> >> > Kingsford Jones
> >> >
> >> >
> >> >
> >> > On Mon, Jan 5, 2009 at 10:18 AM, William Dunlap
> >> <wdunlap at tibco.com> wrote:
> >> >> Arg, the 'sapply(...)' in the function was in the initial
> >> >> comment,
> >> >> gm <- function(x, group){ # medians by group:
> >> >> sapply(split(x,group),median)
> >> >> but someone's mailer put a newline before the sapply
> >> >> gm <- function(x, group){ # medians by group:
> >> >> sapply(split(x,group),median)
> >> >> so it got executed, wasting all that time. (Of course the
> >> >> same mailer will mess up this message in the same way -
> >> >> just delete the sapply() call from gm().)
> >> >>
> >> >> Bill Dunlap
> >> >> TIBCO Software Inc - Spotfire Division
> >> >> wdunlap tibco.com
> >> >>
> >> >>> -----Original Message-----
> >> >>> From: hadley wickham [mailto:h.wickham at gmail.com]
> >> >>> Sent: Monday, January 05, 2009 9:10 AM
> >> >>> To: William Dunlap
> >> >>> Cc: gallon.li at gmail.com; R help
> >> >>> Subject: Re: [R] the first and last observation for
> each subject
> >> >>>
> >> >>> > Another application of that technique can be used to
> >> quickly compute
> >> >>> > medians by groups:
> >> >>> >
> >> >>> > gm <- function(x, group){ # medians by group:
> >> >>> > sapply(split(x,group),median)
> >> >>> > o<-order(group, x)
> >> >>> > group <- group[o]
> >> >>> > x <- x[o]
> >> >>> > changes <- group[-1] != group[-length(group)]
> >> >>> > first <- which(c(TRUE, changes))
> >> >>> > last <- which(c(changes, TRUE))
> >> >>> > lowerMedian <- x[floor((first+last)/2)]
> >> >>> > upperMedian <- x[ceiling((first+last)/2)]
> >> >>> > median <- (lowerMedian+upperMedian)/2
> >> >>> > names(median) <- group[first]
> >> >>> > median
> >> >>> > }
> >> >>> >
> >> >>> > For a 10^5 long x and a somewhat fewer than 3*10^4
> >> distinct groups
> >> >>> > (in random order) the times are:
> >> >>> >
> >> >>> >> group<-sample(1:30000, size=100000, replace=TRUE)
> >> >>> >> x<-rnorm(length(group))*10 + group
> >> >>> >> unix.time(z0<-sapply(split(x,group), median))
> >> >>> > user system elapsed
> >> >>> > 2.72 0.00 3.20
> >> >>> >> unix.time(z1<-gm(x,group))
> >> >>> > user system elapsed
> >> >>> > 0.12 0.00 0.16
> >> >>> >> identical(z1,z0)
> >> >>> > [1] TRUE
> >> >>>
> >> >>> I get:
> >> >>>
> >> >>> > unix.time(z0<-sapply(split(x,group), median))
> >> >>> user system elapsed
> >> >>> 2.733 0.017 2.766
> >> >>> > unix.time(z1<-gm(x,group))
> >> >>> user system elapsed
> >> >>> 2.897 0.032 2.946
> >> >>>
> >> >>>
> >> >>> Hadley
> >> >>>
> >> >>>
> >> >>> --
> >> >>> http://had.co.nz/
> >> >>>
> >> >>
> >> >> ______________________________________________
> >> >> R-help at r-project.org mailing list
> >> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> >> PLEASE do read the posting guide
> >> http://www.R-project.org/posting-guide.html
> >> >> and provide commented, minimal, self-contained,
> reproducible code.
> >> >>
> >> >
> >>
> >
>
More information about the R-help
mailing list