[R] finding peaks in a simple dataset with R

Gabor Grothendieck ggrothendieck at gmail.com
Wed Nov 23 18:29:03 CET 2005


On 11/23/05, Martin Maechler <maechler at stat.math.ethz.ch> wrote:
> >>>>> "Marc" == Marc Kirchner <marc.kirchner at iwr.uni-heidelberg.de>
> >>>>>     on Wed, 23 Nov 2005 14:33:28 +0000 writes:
>
>    >>
>    >> I wonder if we shouldn't polish that a bit and add to R's
>    >> standard 'utils' package.
>    >>
>
>    Marc> Hm, I figured out there are (at least) two versions out there, one being
>    Marc> the "original" idea and a modification.
>
>    Marc> === Petr Pikal in 2001 (based on Brian Ripley's idea)==
>    Marc> peaks <- function(series, span=3) {
>    Marc> z <- embed(series, span)
>    Marc> result <- max.col(z) == 1 + span %/% 2
>    Marc> result
>    Marc> }
>
>    Marc> versus
>
>    Marc> === Petr Pikal in 2004 ==
>    Marc> peaks2<-function(series,span=3) {
>    Marc> z <- embed(series, span)
>    Marc> s <- span%/%2
>    Marc> v<- max.col(z) == 1 + s
>    Marc> result <- c(rep(FALSE,s),v)
>    Marc> result <- result[1:(length(result)-s)]
>    Marc> result
>    Marc> }
>
> Thank you, Marc,
>
>    Marc> Comparison shows
>    >> peaks(c(1,4,1,1,6,1,5,1,1),3)
>    Marc> [1]  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE
>    Marc> which is a logical vector for elements 2:N-1 and
>
>    >> peaks2(c(1,4,1,1,6,1,5,1,1),3)
>    Marc> [1] FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE
>    Marc> which is a logical vector for elements 1:N-2.
>
>    Marc> As I would expect to "lose" (span-1)/2 elements on each side
>    Marc> of the vector, to me the 2001 version feels more natural.
>
> I think for the function to be more useful it the result should
> have the original vector length and hence I'd propose to also
> pad with FALSE at the upper end.
>
>    Marc> Also, both "suffer" from being non-deterministic in the
>    Marc> multiple-maxima-case (the two 4s here)
>
> yes, of course, because of max.col().
> Note that Venables & Ripley would consider this to be rather a feature.
>
>    Marc> This could (should?) be fixed by modifying the call to max.col()
>    Marc> result <- max.col(z, "first") == 1 + span %/% 2;
>
> Actually I think it should become an option, but I'd use "first"
> as default.
>
>    Marc> Just my two cents,
>
> Thank you.
>
> Here is my current proposal which also demonstrates why it's
> useful to pad with FALSE :
>
> peaks <-function(series, span = 3, ties.method = "first") {
>    if((span <- as.integer(span)) %% 2 != 1) stop("'span' must be odd")
>    z <- embed(series, span)
>    s <- span %/% 2
>    v <- max.col(z, ties.method = ties.method) == s + 1:1
>    pad <- rep(FALSE, s)
>    c(pad, v, pad)
> }
>
> y <- c(1,4,1,1,6,1,5,1,1) ; (ii <- which(peaks(y))); y[ii]
> ##- [1] 2 5 7
> ##- [1] 4 6 5
>
> set.seed(7)
> y <- rpois(100, lambda = 7)
> py <- peaks(y)
> plot(y, type="o", cex = 1/4, main = "y and peaks(y,3)")
> points(seq(y)[py], y[py], col = 2, cex = 1.5)
>
> p7 <- peaks(y,7)
> points(seq(y)[p7], y[p7], col = 3, cex = 2)
> mtext("peaks(y,7)", col=3)

I think ties are still problems with this approach
as:

set.seed(1)
peaks( c(1,2,2,2,3), 3 )

gives a peak in the 2,2,2 stretch.

Also NA would seem to be a better pad than false or maybe
it should be specifiable including whether there is padding at
all.   The zoo rapply and rollmax functions which can also be used
to specify a similar naive peak function have na.pad=
and align= arguments.  Also any peaks function should
be generic so that various time series classes can implement
their own methods and in the case of irregularly spaced series
note that there are two possibilities for span, the time distance
and the number of points, and they are not the same.

It might also be nice to be able to get back peaks, troughs or
both via  1,0,-1 in the output.




More information about the R-help mailing list