[R] Must be a better way to collate sequenced data

Burke, Robin rburke at cs.depaul.edu
Mon Jun 8 11:28:46 CEST 2009


Thanks for the quick response. Sorry for being unclear with my example. Here is something more concrete:

user <- c(1, 2, 1, 2, 3, 1, 3, 4, 2,  3,  4,  1);
time <- c(100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200);
userCount <- c(1, 1, 2, 2, 1, 3, 2, 1, 3,  3,  2,  4);

period <- 200

utime.data <- data.frame(USER=user, TIME=time, USER_COUNT=userCount);

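The answer (the first TIME column is the grouping key from aggregate, the second is the summed TIME within each group, and PERC is the result of interest):

> utime.rcount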
  TIME TIME      PERC
1    0    0 1.4166667
2    1    4 1.4166667
3    3    9 0.9166667
4    6    6 0.2500000

I'm investigating the plyr package. I think splitting by users and re-merging may do the trick, provided I can re-merge in order of the transformed time value. That would avoid the costly sort operation in aggregate.
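
As a point of comparison, here is a minimal base R sketch of the same computation without explicit loops. It is not the plyr approach mentioned above. It assumes a period of 200, which is the bucket width that reproduces the posted output for this data, and it takes each user's earliest bucket with min() rather than looking for the row where USER_COUNT is 1 (the two agree when the rows are in time order). The names bucket, rel, perc, sums and rcount are made up for illustration.

```r
# Example data from the message above
utime.data <- data.frame(
  USER       = c(1, 2, 1, 2, 3, 1, 3, 4, 2, 3, 4, 1),
  TIME       = seq(100, 1200, by = 100),
  USER_COUNT = c(1, 1, 2, 2, 1, 3, 2, 1, 3, 3, 2, 4)
)
period <- 200  # assumption: the bucket width that matches the posted output

# Bucket the timestamps, then subtract each user's first bucket.
# ave() applies FUN within each group and returns a vector aligned with the rows.
bucket <- utime.data$TIME %/% period
rel    <- bucket - ave(bucket, utime.data$USER, FUN = min)

# Each transaction contributes 1 / (that user's maximum USER_COUNT).
perc <- 1 / ave(utime.data$USER_COUNT, utime.data$USER, FUN = max)

# Sum the contributions per relative bucket.
sums   <- tapply(perc, rel, sum)
rcount <- data.frame(TIME = as.numeric(names(sums)), PERC = as.vector(sums))
print(rcount)
```

Because ave() and tapply() work on whole vectors, there is no per-row subset() call and no separately built table of first entries.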

Robin Burke
Associate Professor
School of Computer Science, Telecommunications, and
   Information Systems
DePaul University 
(currently on leave at University College Dublin)

http://josquin.cti.depaul.edu/~rburke/

"The universe is made of stories, not of atoms" - Muriel Rukeyser



-----Original Message-----
From: Petr PIKAL [mailto:petr.pikal at precheza.cz] 
Sent: Monday, June 08, 2009 8:36 AM
To: Burke, Robin
Cc: r-help at r-project.org
Subject: Re: [R] Must be a better way to collate sequenced data

Hi

Nobody has your data, so your code is not reproducible. Here are just a few
comments:

augdata <<- as.data.frame(cbind(utime.atimes, utime.aperc))

data.frame(utime.atimes, utime.aperc) is enough. Using cbind is rather
dangerous because it produces a matrix, which can hold only one type of value.
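
A small illustration of the coercion, using made-up vectors (ids and label are hypothetical and not from the original data). In the posted code both vectors are numeric, so nothing is lost there, but the habit bites as soon as one column is character or factor.

```r
# Hypothetical columns of different types, for illustration only
ids   <- c(1, 2, 3)
label <- c("a", "b", "c")

m <- cbind(ids, label)        # matrix: every value is coerced to character
d <- data.frame(ids, label)   # data frame: each column keeps its own type

print(typeof(m))      # "character"
print(class(d$ids))   # "numeric"
```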

I am a little bit puzzled by your example.

u.profile<-c(50,20,10)
u.days<-c(1,2,3)
proc.prof<-u.profile/sum(u.profile)
data.frame(u.days, proc.prof)
  u.days proc.prof
1      1     0.625
2      2     0.250
3      3     0.125

On the other hand, you speak about normalization by the maximum value:

proc.prof<-u.profile/max(u.profile)
data.frame(u.days, proc.prof)
  u.days proc.prof
1      1       1.0
2      2       0.4
3      3       0.2

One suggestion that comes to mind:

1. Convert time.stamp to a POSIX date-time class (e.g. POSIXct)
2. Split your data by user
mylist <- split(data, users)
3. Transform each piece with lapply(mylist, desired transformation)
4. Aggregate by day within each piece of the list
5. Reassemble the list into a data frame
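
The steps above could be sketched roughly as follows, using the small example data from Robin's follow-up. This is only one way to fill in the "desired transformation". Step 1 is skipped because the example times are plain numbers, the period of 200 is an assumption chosen to match the posted output, and the names by.user, per.user, combined and result are made up for illustration.

```r
# Example data from the follow-up message (period of 200 is an assumption)
utime.data <- data.frame(
  USER       = c(1, 2, 1, 2, 3, 1, 3, 4, 2, 3, 4, 1),
  TIME       = seq(100, 1200, by = 100),
  USER_COUNT = c(1, 1, 2, 2, 1, 3, 2, 1, 3, 3, 2, 4)
)
period <- 200

# Step 2: one data frame per user
by.user <- split(utime.data, utime.data$USER)

# Steps 3-4: shift each user's time buckets to start at 0, weight each
# transaction by 1 / (profile length), and sum the weights per bucket
per.user <- lapply(by.user, function(d) {
  bucket <- d$TIME %/% period
  rel    <- bucket - min(bucket)
  w      <- rep(1 / max(d$USER_COUNT), nrow(d))
  s      <- tapply(w, rel, sum)
  data.frame(TIME = as.numeric(names(s)), PERC = as.vector(s))
})

# Step 5: back to a single data frame, then combine across users
combined <- do.call(rbind, per.user)
result   <- aggregate(combined["PERC"], combined["TIME"], sum)
print(result)
```

The final aggregate() here runs on the per-user summaries rather than on every transaction, so it sees far fewer rows than the original data when users have many transactions per bucket.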

Maybe some functions from the plyr or doBy packages could help you.

Regards
Petr




r-help-bounces at r-project.org wrote on 07.06.2009 23:55:00:

> I have data that looks like this
> 
> time_stamp (seconds)  user_id
> 
> The data is (partially) ordered by time - in that sometimes transactions
> occur at the same timestamp. The output I want is collated by transaction
> time on a per-user basis, normalized by the maximum number of transactions
> per user, and aggregated over each day. So, if the users have 50
> transactions on the first day, 20 transactions on the second day, and 10
> transactions on the third day, the output would be as follows, if each
> transaction represents 0.01 (that is, 1%) of each user's total profile. (In
> reality, they all have different profile lengths, so a transaction
> represents a different percentage for each user.)
> 
> time_since_first_transaction (days)        percent_of_profile
> 1       0.50
> 2       0.20
> 3       0.10
> 
> I have the following code that computes the right answer, but it is really
> inefficient, so I'm sure that I'm doing something wrong. Really inefficient
> means > 30 minutes for a 100 k item data frame on a 2.2 GHz machine, and my
> 1-million data set has never finished. I'm no stranger to functional
> programming (Lisp programmer) but I can't figure out a way to subtract the
> first timestamp for user A from all of the other timestamps for user A
> without either (a) building a separate table of "first entries for each
> user", which I do here, or (b) re-computing the initial entry for each user
> with every row, which is what I did before and is even more inefficient.
> Another killer operation seems to be the aggregate step on the last line,
> which I use to collate the data by days. It seems very slow, but I don't
> know any other way to do this. I realize that I am living proof that one can
> program in C no matter what language one uses - so I would appreciate any
> enlightenment on offer. If there's no better way, I'll pre-process
> everything in Perl, but I'd rather learn the "R" way to do things like this.
> Thanks.
> 
> # Build table of times
> utime.times <<- utime.data["TIME"] %/% period;
> utime.tstart <<- vector("numeric", length=max(utime.data["USER"]));
> for (i in 1:nrow(utime.data))
> {
>     if (as.numeric(utime.data[i, "USER_COUNT"])==1)
>     {
>         day <- utime.times[i, "TIME"];
>         user <- utime.data[i, "USER"];
>         utime.tstart[user] <<- day;
>     }
> }
>
> # Build table of maximum profile sizes
> utime.userMax <<- aggregate(utime.data["USER_COUNT"],
>                             utime.data["USER"],
>                             max);
>
> utime.atimes <<- vector("numeric", length=nrow(utime.data));
> utime.aperc <<- vector("numeric", length=nrow(utime.data));
> augdata <<- as.data.frame(cbind(utime.atimes, utime.aperc));
> names(augdata) <<- c("TIME", "PERC");
> for (i in 1:nrow(utime.data))
> {
>     # adjust time according to user start time
>     augdata[i, "TIME"] <<-
>         utime.times[i,"TIME"] - utime.tstart[utime.data[i,"USER"]];
>     # look up maximum user count
>     umax <- subset(utime.userMax,
>                    USER==as.numeric(utime.data[i, "USER"]))["USER_COUNT"];
>     augdata[i, "PERC"] <<- 1.0/umax;
> }
>
> utime.rcount <<- aggregate(augdata, augdata["TIME"], sum);
> ....
> 
> 



