[R] time series processing - count of datestamp delta's, per group
Stephan Kolassa
Stephan.Kolassa at gmx.de
Sun Mar 23 12:07:44 CET 2014
Hi Martin,
it sounds like you want the difference between the first and the last
observation per user, not, e.g., all the date differences between
successive observations of each separate user. Correct me if I'm wrong.
That said, let's build some toy data:
set.seed(1)
dataset <- data.frame(User=sample(LETTERS[1:5],100,replace=TRUE),
                      Date=sample(as.Date("2014-01-01")+0:364,100,replace=TRUE))
Now we can calculate these differences and plot a histogram or tabulate:
foo <- with(dataset,by(Date,User,function(xx)diff(range(xx))))
hist(foo)
table(foo)
The key here is really the by() function, which calculates a function
(here an anonymous function "function(xx)diff(range(xx))") applied to
some data (here dataset$Date) separately for each level of a grouping
factor (here dataset$User).
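In case you do want the gaps between successive distinct dates of each user after all, here is a minimal sketch along the same lines. It rebuilds the toy data so it runs on its own; the names dates_by_user, span_days, gap_days and gap_table are made up for illustration. It uses split(), which also gives you one chunk per user in a list without writing anything to separate files.

```r
# Toy data as above, rebuilt here so the sketch is self-contained
set.seed(1)
dataset <- data.frame(User = sample(LETTERS[1:5], 100, replace = TRUE),
                      Date = sample(as.Date("2014-01-01") + 0:364, 100, replace = TRUE))

# One Date vector per user, held in memory as a named list
dates_by_user <- split(dataset$Date, dataset$User)

# First-to-last span per user, in days (same idea as the by() call)
span_days <- sapply(dates_by_user, function(xx) as.numeric(diff(range(xx))))

# Gaps between successive distinct dates of each user, in days
gap_days <- lapply(dates_by_user,
                   function(xx) as.numeric(diff(sort(unique(xx)))))

# Frequency table of all gaps, pooled over users
gap_table <- table(unlist(gap_days))
```

hist(unlist(gap_days)) would then plot the pooled gaps. I have not tried this on 800K rows and 20K users, so treat it as a starting point rather than a tuned solution.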
HTH,
Stephan
On 23.03.2014 01:32, Martin Tomko wrote:
> Apologies if the question is a bit naïve, I am a novice in time series data handling in R.
>
> I have the following type of data, in a long format (as the spacetime vignette calls it; the table also contains spatial data, not shown here):
>
> User | Date | Otherdata |
> A | 01/01/2014 | aa
> A | 01/01/2014 | bb
> A | 01/01/2014 | cc
> B | 01/01/2014 | aa
> B | 05/01/2014 | cc
> A | 07/01/2014 | aa
> C | 05/02/2014 | xx
> C | 20/02/2014 | yy
>
> Etc
> [A,B,C,…] are user Ids (some strings).
> Date is converted into a Date format (2013-10-15)
>
> The table is sorted by User and then by Date, and is over 800K records long. There are about 20K users.
>
> User | Date | Otherdata |
> A | 2014-01-01 | aa
> A | 2014-01-01 | bb
> A | 2014-01-01 | cc
> A | 2014-01-07 | aa
> B | 2014-01-01 | aa
> B | 2014-01-05 | cc
> C | 2014-02-05 | xx
> C | 2014-02-20 | yy
>
> I want to:
> Get a frequency table (and ultimately a plot) of the count of differences (in days) between records of a user. Meaning, I would first get the unique days recorded:
>
> A | 2014-01-01
> A | 2014-01-07
> B | 2014-01-01
> B | 2014-01-05
> C | 2014-02-05
> C | 2014-02-20
>
> And then want to run the differences between timestamps within a group defined by the user, in days:
> A| 6
> B| 4
> C|15
>
> Imagining that I have tens of thousands of records, I then want the table with the counts of differences (across all users); in our case it would be 6, 4 and 15, all with count = 1.
> In the larger sample, something like this:
> DeltaDays | Count
> 1 | 150
> 2 | 320
> …
> N | X
>
> I know there are all sorts of packages for time analysis, but I could not find a simple function like this (incl. searching here http://www.statoek.wiso.uni-goettingen.de/veranstaltungen/zeitreihen/sommer03/ts_r_intro.pdf ). I assume that something working on a simple data frame would be sufficient, but I am happy (prefer?) to use TS. I would appreciate any hints. The ultimate analysis also involves space, so hints in the direction of space-time are welcome. Ultimately, I would like to separate the records for each user into a dataset that can be handled separately, but splitting it into a large number of files does not seem wise. Any hints on that are also appreciated.
>
> Thanks,
> Martin