[R] time series processing - count of datestamp delta's, per group
Martin Tomko
tomkom at unimelb.edu.au
Mon Mar 24 07:55:07 CET 2014
Dear Stephen,
Thank you for your suggestion - I will give it a try. The by() function
looks like a step in the right direction, but your assumption about the
successive observations is incorrect:
I want "all the date differences between successive observations of each
separate user."
So I tried to modify the function foo as follows:
foo2 <- with(dataset,by(Date,User,function(yy)diff(yy)))
The structure of the resulting object is:
structure(list(
  A = structure(c(0, 3, 75, 41, 4, 6, 7, 19, 16, 30, 0, 3),
    units = "days", class = "difftime"),
  B = structure(c(6, 53, 14, 8, 7, 5, 3, 0, 15, 3, 10, 24, 13, 1, 7, 21,
    11, 5, 9, 28, 11, 3, 18, 12),
    units = "days", class = "difftime"),
  C = structure(c(8, 11, 4, 1, 12, 37, 17, 4, 9, 13, 18, 24, 23, 24, 15,
    42, 6, 34),
    units = "days", class = "difftime"),
  D = structure(c(2, 16, 15, 18, 14, 2, 19, 8, 35, 9, 2, 28, 11, 3, 1, 1,
    13, 28, 15, 9, 30, 5, 1, 16, 22),
    units = "days", class = "difftime"),
  E = structure(c(10, 30, 6, 20, 13, 7, 80, 32, 14, 57, 20, 3, 9, 0, 24, 0),
    units = "days", class = "difftime")),
  .Dim = 5L,
  .Dimnames = structure(list(User = c("A", "B", "C", "D", "E")),
    .Names = "User"),
  call = quote(by.default(data = Date, INDICES = User,
    FUN = function(yy) diff(yy))),
  class = "by")
I can access it as: foo2$A for example.
I have then used the following approach to unlisting them:
x <- unlist(foo2)
hist(x)
table(x)
This seems to do what I was after - thanks for your help!
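For reference, the original question asked for differences between the unique days of each user, whereas diff(yy) on its own keeps zero-day gaps from repeated dates and relies on the dates already being sorted within each user. Below is a minimal sketch of that variant, using the small example table from the original post. The names toy, gaps and delta_counts are made up for illustration, and simplify = FALSE is only there to keep one list element per user:

```r
# Toy data copied from the example table in the original question
toy <- data.frame(
  User = c("A", "A", "A", "A", "B", "B", "C", "C"),
  Date = as.Date(c("2014-01-01", "2014-01-01", "2014-01-01", "2014-01-07",
                   "2014-01-01", "2014-01-05", "2014-02-05", "2014-02-20")),
  stringsAsFactors = FALSE
)

# Per user: drop repeated days, sort, then take successive differences in days
gaps <- with(toy, by(Date, User, function(yy) {
  as.numeric(diff(sort(unique(yy))), units = "days")
}, simplify = FALSE))

# Pool the gaps of all users and count how often each gap length occurs
delta_counts <- table(unlist(gaps))
print(delta_counts)
```

On this toy table the gaps are 6, 4 and 15 days, each occurring once, which matches the worked example in the original question.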
Martin
Dr Martin Tomko
Lecturer/AURIN Technical Architecture Implementation Manager, Department
of Computing and Information Systems
Melbourne School of Engineering
The University of Melbourne, Victoria 3010
T: +61 (0) 9035 3298 | E: tomkom at unimelb.edu.au | W: www.tomko.org
On 23/03/2014 10:07 pm, "Stephan Kolassa" <Stephan.Kolassa at gmx.de> wrote:
>Hi Martin,
>
>it sounds like you want the difference between the first and the last
>observation per user, not, e.g., all the date differences between
>successive observations of each separate user. Correct me if I'm wrong.
>That said, let's build some toy data:
>
>set.seed(1)
>dataset <- data.frame(User=sample(LETTERS[1:5],100,replace=TRUE),
> Date=sample(as.Date("2014-01-01")+0:364,100,replace=TRUE))
>
>Now we can calculate these differences and plot a histogram or tabulate:
>
>foo <- with(dataset,by(Date,User,function(xx)diff(range(xx))))
>hist(foo)
>table(foo)
>
>The key here is really the by() function, which calculates a function
>(here an anonymous function "function(xx)diff(range(xx))") applied to
>some data (here dataset$Date) separately for each level of a grouping
>factor (here dataset$User).
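To make the grouping step concrete: for a plain vector, by() behaves much like splitting the data by the grouping factor and applying the function to each piece. A small sketch with made-up values follows; the names dates, users, span_days, via_by and via_split are illustrative only and do not come from the thread:

```r
# Hypothetical mini data set: three users, two observations each
dates <- as.Date(c("2014-01-01", "2014-01-07", "2014-01-01",
                   "2014-01-05", "2014-02-05", "2014-02-20"))
users <- c("A", "A", "B", "B", "C", "C")

# Span between first and last observation, as a plain number of days
span_days <- function(xx) as.numeric(diff(range(xx)), units = "days")

# by(): apply span_days to the dates of each user
via_by <- by(dates, users, span_days)

# Roughly the same computation spelled out with split() and sapply()
via_split <- sapply(split(dates, users), span_days)

print(via_by)
print(via_split)
```

Both calls should give the same per-user spans; by() mainly adds a dedicated print method and records the call.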
>
>HTH,
>Stephan
>
>
>On 23.03.2014 01:32, Martin Tomko wrote:
>> Apologies if the question is a bit naïve; I am a novice in time series
>>data handling in R.
>>
>> I have the following type of data, in a long format (as the spacetime
>>vignette calls it; the table also contains space, not shown here):
>>
>> User | Date | Otherdata |
>> A | 01/01/2014 | aa
>> A | 01/01/2014 | bb
>> A | 01/01/2014 | cc
>> B | 01/01/2014 | aa
>> B | 05/01/2014 | cc
>> A | 07/01/2014 | aa
>> C | 05/02/2014 | xx
>> C | 20/02/2014 | yy
>>
>> Etc
>> [A,B,C,...] are user IDs (some strings).
>> Date is converted into a Date format (2013-10-15)
>>
>> The table is sorted by User and then by Date, and is over 800K records
>>long. There are about 20K users.
>>
>> User | Date | Otherdata |
>> A | 2014-01-01 | aa
>> A | 2014-01-01 | bb
>> A | 2014-01-01 | cc
>> A | 2014-01-07 | aa
>> B | 2014-01-01 | aa
>> B | 2014-01-05 | cc
>> C | 2014-02-05 | xx
>> C | 2014-02-20 | yy
>>
>> I want to:
>> Get a frequency table ( and ultimately plot) of the count of
>>differences (in days) between records of a user. Meaning, I would first
>>get the unique days recorded:
>>
>> A | 2014-01-01
>> A | 2014-01-07
>> B | 2014-01-01
>> B | 2014-01-05
>> C | 2014-02-05
>> C | 2014-02-20
>>
>> And then want to run the differences between timestamps within a group
>>defined by the user, in days:
>> A| 6
>> B| 4
>> C|15
>>
>> Imagining that I have tens of thousands of records, I then want the
>>table with the counts of differences (across all users) (in our case
>>it would be 6, 4 and 15, all with count = 1).
>> In the larger sample, something like this:
>> DeltaDays | Count
>> 1 | 150
>> 2 | 320
>> ...
>> N | X
>>
>> I know there are all sorts of packages for time analysis, but I could
>>not find a simple function like this (incl. searching here
>>http://www.statoek.wiso.uni-goettingen.de/veranstaltungen/zeitreihen/sommer03/ts_r_intro.pdf).
>>I assume that something working on a simple data
>>frame would be sufficient, but I am happy ( prefer?) to use TS. I would
>>appreciate any hints. The ultimate analysis involves also space, so
>>hints in the direction of space-time are welcome. Ultimately, I would
>>like to separate records for each user into a dataset that can be
>>handled separately, but splitting it into a large number of files does
>>not seem wise. Any hint also appreciated.
>>
>> Thanks,
>> Martin
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>>http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>