# [R] Classifying time series by shape over time

Philippe Grosjean phgrosjean at sciviews.org
Wed Mar 22 11:55:14 CET 2006

```
Hi,

turnpoints() in library(pastecs) determines whether the succession of
peaks and pits is random or not. I think the hypothesis here is a little
bit stronger: the series should fit a Gaussian.

I have thought a little about this problem, and I did not find a simple
solution. Here is what I came up with, although it is certainly open to
many criticisms (feel free to offer them!). The idea is to draw the
cumulative distribution of the hits and fit it with a logistic curve.
Then, predicted hits are back-calculated (knowing that the logistic curve
is symmetrical around 'xmid'), and the observed and predicted
distributions of the hits are compared using a Kolmogorov-Smirnov
goodness-of-fit test:

# Enter example data
id1 <- data.frame(
  dates = as.Date(c("2004-12-01", "2005-01-01", "2005-02-01",
    "2005-03-01", "2005-04-01", "2005-05-01", "2005-06-01",
    "2005-07-01", "2005-08-01", "2005-09-01", "2005-10-01",
    "2005-11-01", "2005-12-01")),
  hits  = c(3, 4, 10, 6, 35, 14, 33, 13, 3, 9, 8, 4, 3))
id2 <- data.frame(
  dates = as.Date(c("2001-01-01", "2001-02-01", "2001-03-01",
    "2001-04-01", "2001-05-01", "2001-06-01", "2001-07-01",
    "2001-08-01", "2001-09-01", "2001-10-01", "2001-11-01",
    "2001-12-01", "2002-01-01", "2002-02-01", "2002-03-01",
    "2002-04-01", "2002-05-01", "2002-06-01", "2002-07-01",
    "2002-08-01", "2002-09-01", "2002-10-01", "2002-11-01",
    "2002-12-01", "2003-01-01", "2003-02-01", "2003-03-01")),
  hits  = c(6, 5, 5, 6, 2, 5, 1, 6, 4, 10, 0, 3, 6,
    5, 1, 2, 4, 4, 0, 1, 0, 2, 2, 2, 2, 3, 7))

# What does it look like?
plot(id1$dates, id1$hits, type = "l")
plot(id2$dates, id2$hits, type = "l")

# Cumsum of hits and fit models
id1$datenum <- as.numeric(id1$dates)
id1$cumhits <- cumsum(id1$hits)
id1.fit <- nls(cumhits ~ SSlogis(datenum, Asym, xmid, scal), data = id1)
summary(id1.fit)
plot(id1$dates, id1$cumhits)
lines(id1$dates, predict(id1.fit))

id2$datenum <- as.numeric(id2$dates)
id2$cumhits <- cumsum(id2$hits)
id2.fit <- nls(cumhits ~ SSlogis(datenum, Asym, xmid, scal), data = id2)
summary(id2.fit)
plot(id2$dates, id2$cumhits)
lines(id2$dates, predict(id2.fit))

# Get xmid and recalculate predicted values for hits
xmid1 <- coef(id1.fit)["xmid"]
id1$hitspred <- predict(id1.fit,
  newdata = data.frame(datenum = xmid1 - abs(id1$datenum - xmid1)))
plot(id1$dates, id1$hits, ylim = range(c(id1$hits, id1$hitspred)))
lines(id1$dates, id1$hitspred)

xmid2 <- coef(id2.fit)["xmid"]
id2$hitspred <- predict(id2.fit,
  newdata = data.frame(datenum = xmid2 - abs(id2$datenum - xmid2)))
plot(id2$dates, id2$hits, ylim = range(c(id2$hits, id2$hitspred)))
lines(id2$dates, id2$hitspred)

# A two-sample Kolmogorov-Smirnov goodness-of-fit test
ks.test(id1$hits, id1$hitspred)  # H0 not rejected
ks.test(id2$hits, id2$hitspred)  # H0 rejected
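
# A possible way to repeat the steps above over many series. This is only
# a sketch: 'check_shape' is a hypothetical helper name, not a function
# from pastecs or any other package. nls() may fail to converge when the
# cumulative curve is not sigmoidal, so errors are caught and returned
# as NA instead of stopping a loop over all series.
check_shape <- function(dates, hits) {
  d <- data.frame(datenum = as.numeric(dates), cumhits = cumsum(hits))
  fit <- tryCatch(
    nls(cumhits ~ SSlogis(datenum, Asym, xmid, scal), data = d),
    error = function(e) NULL)
  if (is.null(fit)) return(NA_real_)       # no logistic fit obtained
  xmid <- coef(fit)["xmid"]
  # Same back-calculation as above: mirror the dates around 'xmid'
  pred <- predict(fit,
    newdata = data.frame(datenum = xmid - abs(d$datenum - xmid)))
  # Ties in the hits make ks.test() warn about approximate p-values
  suppressWarnings(ks.test(hits, pred)$p.value)
}
check_shape(id1$dates, id1$hits)
check_shape(id2$dates, id2$hits)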

Best,

Philippe Grosjean

Kjetil Brinchmann Halvorsen wrote:
> Andreas Neumann wrote:
>
>>Dear all,
>>
>>I have hundreds of thousands of univariate time series of the form:
>>character "seriesid", vector of Date, vector of integer
>>(some exemplary data is at the end of the mail)
>>
>>I am trying to find the ones which somehow "have a shape" over time that
>>looks like the histogram of a (skewed) normal distribution:
>>
>>> hist(rnorm(200,10,2))
>>
>>The "mean" is not interesting, i.e. it does not matter if the first
>>nonzero observation happens in the 2nd or the 40th month of observation.
>>So all that matters is: They should start sometime, the hits per month
>>increase, at some point they decrease and then they more or less
>>disappear.
>>
>>Short Example (hits at consecutive months (Dates omitted)):
>>1. series: 0 0 0 2 5 8 20 42 30 19 6 1 0 0 0                -> Good
>>2. series: 0 3 8 9 20 6 0 3 25 67 7 1 0 4 60 20 10 0 4      -> Bad
>>
>>Series 1 would be an ideal case of what I am looking for.
>>
>>Graphical inspection would be easy but is not an option due to the huge
>>number of series.
>>
>
>
> Does function turnpoints() in package pastecs help?
>
> Kjetil
>
>
>>Questions:
>>
>>1. Which (if at all) of the many packages that handle time series is
>>appropriate for my problem?
>>
>>2. Which general approach seems to be the most straightforward and best
>>supported by R?
>>- Is there a way to test the time series directly (preferably)?
>>- Or do I need to "type-cast" them as some kind of histogram
>>  data and then test against the pdf of e.g. a normal distribution (but
>>  how)?
>>- Or something totally different?
>>
>>
>>
>>     Andreas Neumann
>>
>>
>>
>>
>>Data Examples (id1 is good, id2 is bad):
>>
>>
>>>id1
>>
>>        dates       hits
>>1  2004-12-01         3
>>2  2005-01-01         4
>>3  2005-02-01        10
>>4  2005-03-01         6
>>5  2005-04-01        35
>>6  2005-05-01        14
>>7  2005-06-01        33
>>8  2005-07-01        13
>>9  2005-08-01         3
>>10 2005-09-01         9
>>11 2005-10-01         8
>>12 2005-11-01         4
>>13 2005-12-01         3
>>
>>
>>
>>>id2
>>
>>        dates       hits
>>1  2001-01-01         6
>>2  2001-02-01         5
>>3  2001-03-01         5
>>4  2001-04-01         6
>>5  2001-05-01         2
>>6  2001-06-01         5
>>7  2001-07-01         1
>>8  2001-08-01         6
>>9  2001-09-01         4
>>10 2001-10-01        10
>>11 2001-11-01         0
>>12 2001-12-01         3
>>13 2002-01-01         6
>>14 2002-02-01         5
>>15 2002-03-01         1
>>16 2002-04-01         2
>>17 2002-05-01         4
>>18 2002-06-01         4
>>19 2002-07-01         0
>>20 2002-08-01         1
>>21 2002-09-01         0
>>22 2002-10-01         2
>>23 2002-11-01         2
>>24 2002-12-01         2
>>25 2003-01-01         2
>>26 2003-02-01         3
>>27 2003-03-01         7
>>
>>______________________________________________
>>R-help at stat.math.ethz.ch mailing list
>>https://stat.ethz.ch/mailman/listinfo/r-help
>>