[R] Classifying time series by shape over time
Philippe Grosjean
phgrosjean at sciviews.org
Wed Mar 22 11:55:14 CET 2006
Hi,
turnpoints() in library(pastecs) tests whether the succession of peaks
and pits is random or not. I think the hypothesis here is a little
stronger: the series should fit a Gaussian shape.
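To illustrate what a "random succession of peaks and pits" check looks like, here is a minimal base-R sketch of the classical turning-point test. It is an assumption-laden illustration, not the pastecs implementation: the helper name turning_point_test is hypothetical, ties are simply not counted as turning points, and the normal approximation is rough for short series.

```r
# Hypothetical helper (not part of pastecs): classical turning-point test.
# Under randomness, a series of length n has on average 2 * (n - 2) / 3
# turning points, with variance (16 * n - 29) / 90.
turning_point_test <- function(x) {
  n <- length(x)
  d <- sign(diff(x))
  # A turning point is an interior value where the direction of change flips;
  # flat steps (ties) are not counted in this simple version.
  n_turns <- sum(d[-1] * d[-length(d)] < 0)
  expected <- 2 * (n - 2) / 3
  variance <- (16 * n - 29) / 90
  z <- (n_turns - expected) / sqrt(variance)
  c(turns = n_turns, expected = expected, z = z,
    p.value = 2 * pnorm(-abs(z)))
}

# The "good" short series quoted further down in this thread
good <- c(0, 0, 0, 2, 5, 8, 20, 42, 30, 19, 6, 1, 0, 0, 0)
turning_point_test(good)
```

A series with a single rise and fall has far fewer turning points than a random one, but so does any smooth trend, which is why a test of this kind is weaker than asking for a bell shape.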
I have only thought about this problem briefly, and I did not find a
simple solution. Here is what I came up with, though it is certainly open
to criticism (feel free!). The idea is to compute the cumulative sum of
the hits and fit it with a logistic curve. Predicted hits are then
back-calculated (knowing that the logistic curve is symmetrical around
'xmid'), and the observed and predicted distributions of the hits are
compared using a Kolmogorov-Smirnov goodness-of-fit test:
# Enter example data
id1 <- data.frame(
  dates = as.Date(c("2004-12-01", "2005-01-01", "2005-02-01",
    "2005-03-01", "2005-04-01", "2005-05-01", "2005-06-01",
    "2005-07-01", "2005-08-01", "2005-09-01", "2005-10-01",
    "2005-11-01", "2005-12-01")),
  hits = c(3, 4, 10, 6, 35, 14, 33, 13, 3, 9, 8, 4, 3))
id2 <- data.frame(
  dates = as.Date(c("2001-01-01", "2001-02-01", "2001-03-01",
    "2001-04-01", "2001-05-01", "2001-06-01", "2001-07-01",
    "2001-08-01", "2001-09-01", "2001-10-01", "2001-11-01",
    "2001-12-01", "2002-01-01", "2002-02-01", "2002-03-01",
    "2002-04-01", "2002-05-01", "2002-06-01", "2002-07-01",
    "2002-08-01", "2002-09-01", "2002-10-01", "2002-11-01",
    "2002-12-01", "2003-01-01", "2003-02-01", "2003-03-01")),
  hits = c(6, 5, 5, 6, 2, 5, 1, 6, 4, 10, 0, 3, 6,
    5, 1, 2, 4, 4, 0, 1, 0, 2, 2, 2, 2, 3, 7))
# What do the series look like?
plot(id1$dates, id1$hits, type = "l")
plot(id2$dates, id2$hits, type = "l")
# Cumsum of hits and fit models
id1$datenum <- as.numeric(id1$dates)
id1$cumhits <- cumsum(id1$hits)
id1.fit <- nls(cumhits ~ SSlogis(datenum, Asym, xmid, scal), data = id1)
summary(id1.fit)
plot(id1$dates, id1$cumhits)
lines(id1$dates, predict(id1.fit))
id2$datenum <- as.numeric(id2$dates)
id2$cumhits <- cumsum(id2$hits)
id2.fit <- nls(cumhits ~ SSlogis(datenum, Asym, xmid, scal), data = id2)
summary(id2.fit)
plot(id2$dates, id2$cumhits)
lines(id2$dates, predict(id2.fit))
# Get xmid and recalculate predicted values for hits.
# Dates after xmid are mirrored back onto the rising side of the curve,
# so the prediction rises up to xmid and falls symmetrically after it.
xmid1 <- coef(id1.fit)["xmid"]
id1$hitspred <- predict(id1.fit,
  newdata = data.frame(datenum = xmid1 - abs(id1$datenum - xmid1)))
plot(id1$dates, id1$hits, ylim = range(c(id1$hits, id1$hitspred)))
lines(id1$dates, id1$hitspred)
xmid2 <- coef(id2.fit)["xmid"]
id2$hitspred <- predict(id2.fit,
  newdata = data.frame(datenum = xmid2 - abs(id2$datenum - xmid2)))
plot(id2$dates, id2$hits, ylim = range(c(id2$hits, id2$hitspred)))
lines(id2$dates, id2$hitspred)
# A two-sample Kolmogorov-Smirnov goodness-of-fit test
ks.test(id1$hits, id1$hitspred) # H0 not rejected
ks.test(id2$hits, id2$hitspred) # H0 rejected
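Since the original question involves a very large number of series, the steps above could be wrapped in a function and applied series by series. The following is only a sketch under assumptions: the function name logistic_shape_test and its return columns are hypothetical, a failed nls() fit is reported as non-converged rather than stopping the run, and warnings from ks.test() about ties (likely with count data) are suppressed.

```r
# Hypothetical helper: applies the cumulative-logistic + KS steps to one series.
logistic_shape_test <- function(dates, hits) {
  d <- data.frame(datenum = as.numeric(dates), cumhits = cumsum(hits))
  # nls() can fail on series that are nothing like a logistic curve;
  # treat any error as "no fit" instead of aborting a batch run.
  fit <- tryCatch(
    nls(cumhits ~ SSlogis(datenum, Asym, xmid, scal), data = d),
    error = function(e) NULL)
  if (is.null(fit))
    return(data.frame(converged = FALSE, xmid = NA_real_,
                      p.value = NA_real_))
  xmid <- coef(fit)[["xmid"]]
  # Same mirroring around xmid as in the step-by-step code
  hitspred <- predict(fit,
    newdata = data.frame(datenum = xmid - abs(d$datenum - xmid)))
  ks <- suppressWarnings(ks.test(hits, hitspred))
  data.frame(converged = TRUE, xmid = xmid, p.value = ks$p.value)
}

# Example with the first series of this post
ex <- data.frame(
  dates = seq(as.Date("2004-12-01"), by = "month", length.out = 13),
  hits = c(3, 4, 10, 6, 35, 14, 33, 13, 3, 9, 8, 4, 3))
logistic_shape_test(ex$dates, ex$hits)
```

With the series stored in a list, something like do.call(rbind, lapply(series, function(s) logistic_shape_test(s$dates, s$hits))) would collect one row per series; series that did not converge, or with a small p-value, would be the candidates to discard.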
Best,
Philippe Grosjean
Kjetil Brinchmann Halvorsen wrote:
> Andreas Neumann wrote:
>
>>Dear all,
>>
>>I have hundreds of thousands of univariate time series of the form:
>>character "seriesid", vector of Date, vector of integer
>>(some exemplary data is at the end of the mail)
>>
>>I am trying to find the ones which somehow "have a shape" over time that
>>looks like the histogram of a (skewed) normal distribution:
>>
>>> hist(rnorm(200,10,2))
>>
>>The "mean" is not interesting, i.e. it does not matter if the first
>>nonzero observation happens in the 2nd or the 40th month of observation.
>>So all that matters is: They should start sometime, the hits per month
>>increase, at some point they decrease and then they more or less
>>disappear.
>>
>>Short Example (hits at consecutive months (Dates omitted)):
>>1. series: 0 0 0 2 5 8 20 42 30 19 6 1 0 0 0 -> Good
>>2. series: 0 3 8 9 20 6 0 3 25 67 7 1 0 4 60 20 10 0 4 -> Bad
>>
>>Series 1 would be an ideal case of what I am looking for.
>>
>>Graphical inspection would be easy but is not an option due to the huge
>>number of series.
>>
>
>
> Does function turnpoints() in package pastecs help?
>
> Kjetil
>
>
>>Questions:
>>
>>1. Which (if at all) of the many packages that handle time series is
>>appropriate for my problem?
>>
>>2. Which general approach seems to be the most straightforward and best
>>supported by R?
>>- Is there a way to test the time series directly (preferably)?
>>- Or do I need to "type-cast" them as some kind of histogram
>> data and then test against the pdf of e.g. a normal distribution (but
>> how)?
>>- Or something totally different?
>>
>>
>>Thank you for your time,
>>
>> Andreas Neumann
>>
>>
>>
>>
>>Data Examples (id1 is good, id2 is bad):
>>
>>
>>>id1
>>
>>        dates hits
>>1  2004-12-01    3
>>2  2005-01-01    4
>>3  2005-02-01   10
>>4  2005-03-01    6
>>5  2005-04-01   35
>>6  2005-05-01   14
>>7  2005-06-01   33
>>8  2005-07-01   13
>>9  2005-08-01    3
>>10 2005-09-01    9
>>11 2005-10-01    8
>>12 2005-11-01    4
>>13 2005-12-01    3
>>
>>
>>
>>>id2
>>
>>        dates hits
>>1  2001-01-01    6
>>2  2001-02-01    5
>>3  2001-03-01    5
>>4  2001-04-01    6
>>5  2001-05-01    2
>>6  2001-06-01    5
>>7  2001-07-01    1
>>8  2001-08-01    6
>>9  2001-09-01    4
>>10 2001-10-01   10
>>11 2001-11-01    0
>>12 2001-12-01    3
>>13 2002-01-01    6
>>14 2002-02-01    5
>>15 2002-03-01    1
>>16 2002-04-01    2
>>17 2002-05-01    4
>>18 2002-06-01    4
>>19 2002-07-01    0
>>20 2002-08-01    1
>>21 2002-09-01    0
>>22 2002-10-01    2
>>23 2002-11-01    2
>>24 2002-12-01    2
>>25 2003-01-01    2
>>26 2003-02-01    3
>>27 2003-03-01    7
>>
>>______________________________________________
>>R-help at stat.math.ethz.ch mailing list
>>https://stat.ethz.ch/mailman/listinfo/r-help
>>PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
>>