[R] Read 2 rows in 1 dataframe for diff - longitudinal data

William Dunlap wdunlap at tibco.com
Tue Jun 4 21:25:47 CEST 2013


Since you have sorted the data.frame by 'subid', breaking ties with 'year',
doesn't the following do the same thing as the other solutions.
  f4 <- function(df) df[ c(TRUE,diff(df$var1)!=0) & c(FALSE,diff(df$subid)==0), ]
It gives the same answer for your df2 and is quicker than the others.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com


> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
> Of arun
> Sent: Tuesday, June 04, 2013 10:19 AM
> To: R help
> Subject: Re: [R] Read 2 rows in 1 dataframe for diff - longitudinal data
> 
> 
> 
> Hi,
> 
> By comparing some of the solutions:
>  set.seed(25)
>  subid<- sample(30:50,22e5,replace=TRUE)
> set.seed(27)
> year<- sample(1990:2012,22e5,replace=TRUE)
> set.seed(35)
>  var1<- sample(c(1,3,5,7),22e5,replace=TRUE)
> df2<- data.frame(subid,year,var1)
> df2<- df2[order(df2$subid,df2$year),]
> system.time(res<-subset(ddply(df2,.(subid),mutate,delta=c(FALSE,var1[-1]!=var1[-
> length(var1)])),delta)[,-4])
> #  user  system elapsed
>  # 8.036   0.132   8.188
> 
> system.time(res2<-df2[ as.logical( ave( df2$var1, df2$subid, FUN=function(x) c( FALSE,
> x[-1] != x[-length(x)]) ) ), ])
> #  user  system elapsed
>  # 1.220   0.000   1.222
> system.time(res3<-df2[with(df2,unlist(tapply(var1,list(subid),FUN=function(x)
> c(FALSE,diff(x)!=0)),use.names=FALSE)),])
> #  user  system elapsed
>  # 1.729   0.000   1.730
> identical(res2,res3)
> #[1] TRUE
> 
> row.names(res)<-1:nrow(res)
>  row.names(res2)<-1:nrow(res)
>  identical(res,res2)
> #[1] TRUE
> 
> I found half an hour a bit too extreme by comparing the above numbers.
> 
> 
> A.K.
> 
> 
> David:
> 
> 6     47 1999   1
> 
> should not be included in the output list because, we are trying
>  to detect changes within the subid's.  1999 was the first year for
> subject 47 and changes have to be detected after that year - hence we
> were using ddply to group. Your solution ran very fast as expected.
> 
> AK- I have a large dataset and your solution is taking too long -
>  as a matter of fact i had to kill it afte 1/2 hr on a 22K row dataset.
> 
> Thanks for the suggestions.
> 
> -ST
> 
> 
> ----- Original Message -----
> From: David Winsemius <dwinsemius at comcast.net>
> To: arun <smartpink111 at yahoo.com>
> Cc: R help <r-help at r-project.org>
> Sent: Tuesday, June 4, 2013 11:13 AM
> Subject: Re: [R] Read 2 rows in 1 dataframe for diff - longitudinal data
> 
> 
> On Jun 3, 2013, at 9:51 PM, arun wrote:
> 
> > If it is grouped by "subid" (that would be the difference in the number of changes)
> >
> > subset(ddply(df1,.(subid),mutate,delta=c(FALSE,var[-1]!=var[-length(var)])),delta)[,-4]
> > #   subid year var
> > #3     36 2003   3
> > #7     47 2001   3
> > #9     47 2005   1
> > #10    47 2007   3
> > A.K.
> 
> I'm not sure why the first one retruns integer values from the ave() call but the second
> version works:
> 
> > df1[ ave( df1$var, df1$subid, FUN=function(x) c( FALSE, x[-1] != x[-length(x)]) ), ]
>     subid year var
> 1      36 1999   1
> 1.1    36 1999   1
> 1.2    36 1999   1
> 1.3    36 1999   1
> 
> ave( df1$var, df1$subid, FUN=function(x) c( FALSE, x[-1] != x[-length(x)]))
> [1] 0 0 1 0 0 0 1 0 1 1
> 
> Perhaps one of the single item groups sabotaged my simple function.
> 
> 
> > df1[ as.logical( ave( df1$var, df1$subid, FUN=function(x) c( FALSE, x[-1] != x[-length(x)])
> ) ), ]
>    subid year var
> 3     36 2003   3
> 7     47 2001   3
> 9     47 2005   1
> 10    47 2007   3
> 
> --
> David.
> >
> >
> > ----- Original Message -----
> > From: David Winsemius <dwinsemius at comcast.net>
> > To: arun <smartpink111 at yahoo.com>
> > Cc: R help <r-help at r-project.org>
> > Sent: Tuesday, June 4, 2013 12:37 AM
> > Subject: Re: [R] Read 2 rows in 1 dataframe for diff - longitudinal data
> >
> >
> > On Jun 3, 2013, at 7:10 PM, arun wrote:
> >
> >> Hi,
> >> May be this helps:
> >> res1<-df1[with(df1,unlist(tapply(var,list(subid),FUN=function(x)
> c(FALSE,diff(x)!=0)),use.names=FALSE)),]
> >>   res1
> >> #   subid year var
> >> #3     36 2003   3
> >> #7     47 2001   3
> >> #9     47 2005   1
> >> #10    47 2007   3
> >> #or
> >> library(plyr)
> >>   subset(ddply(df1,.(subid),mutate,delta=c(FALSE,diff(var)!=0)),delta)[,-4]
> >> #   subid year var
> >> #3     36 2003   3
> >> #7     47 2001   3
> >> #9     47 2005   1
> >> #10    47 2007   3
> >> A.K.
> >>
> > It's pretty simple with logical indexing:
> >
> >> df1[ c(FALSE, df1$var[-1]!=df1$var[-length(df1$var)]), ]
> >    subid year var
> > 3     36 2003   3
> > 6     47 1999   1
> > 7     47 2001   3
> > 9     47 2005   1
> > 10    47 2007   3
> >
> >
> > When I count the number of changes in value of var is give me 5. Not sure why you are
> both leaving out row 6.
> >
> > --
> > David.
> >>
> >>
> >> I need to output a dataframe whenever var changes a value.
> >>
> >> df1 <-
> data.frame(subid=rep(c(36,47),each=5),year=rep(seq(1999,2007,2),2),var=c(1,1,3,3,3,1,3
> ,3,1,3))
> >>     subid year var
> >> 1     36 1999   1
> >> 2     36 2001   1
> >> 3     36 2003   3
> >> 4     36 2005   3
> >> 5     36 2007   3
> >> 6     47 1999   1
> >> 7     47 2001   3
> >> 8     47 2003   3
> >> 9     47 2005   1
> >> 10    47 2007   3
> >>>
> >>
> >> I need:
> >> 36 2003   3
> >> 47 2001   3
> >> 47 2005   1
> >> 47 2007   3
> >>
> >> I am trying to use ddply over subid and use the diff function, but it is not working quiet
> right.
> >>
> >>> dd <- ddply(df1,.(subid),summarize,delta=diff(var) != 0)
> >>> dd
> >>    subid delta
> >> 1    36 FALSE
> >> 2    36  TRUE
> >> 3    36 FALSE
> >> 4    36 FALSE
> >> 5    47  TRUE
> >> 6    47 FALSE
> >> 7    47  TRUE
> >> 8    47  TRUE
> >>
> >> I would appreciate any help on this.
> >> Thank You!
> >> -ST
> >>
> >> ______________________________________________
> >> R-help at r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
> >
> > David Winsemius
> > Alameda, CA, USA
> >
> 
> David Winsemius
> Alameda, CA, USA
> 
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



More information about the R-help mailing list