[R] Read 2 rows in 1 dataframe for diff - longitudinal data
William Dunlap
wdunlap at tibco.com
Tue Jun 4 21:25:47 CEST 2013
Since you have sorted the data.frame by 'subid', breaking ties with 'year',
doesn't the following do the same thing as the other solutions?
f4 <- function(df) df[ c(TRUE,diff(df$var1)!=0) & c(FALSE,diff(df$subid)==0), ]
It gives the same answer for your df2 and is quicker than the others.
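
As a rough illustration of why f4 works, here is a minimal sketch on the ten-row toy data from the original question. It assumes the data are already sorted by subid and then year, that subid is numeric (diff() would not work on a factor or character column), and that the value column is named var1 as in df2 (the original post called it var). The intermediate names changed and same_subid are only for readability and are not part of the one-liner above.

```r
# Toy data from the original question, with the value column renamed to var1
df <- data.frame(
  subid = rep(c(36, 47), each = 5),
  year  = rep(seq(1999, 2007, 2), 2),
  var1  = c(1, 1, 3, 3, 3, 1, 3, 3, 1, 3)
)

f4 <- function(df) {
  changed    <- c(TRUE,  diff(df$var1)  != 0)  # value differs from the previous row
  same_subid <- c(FALSE, diff(df$subid) == 0)  # previous row belongs to the same subject
  df[changed & same_subid, ]
}

res_f4 <- f4(df)
res_f4   # keeps rows 3, 7, 9 and 10

# Grouped ave() approach from earlier in the thread, for comparison
res_ave <- df[as.logical(ave(df$var1, df$subid,
                             FUN = function(x) c(FALSE, x[-1] != x[-length(x)]))), ]
all.equal(res_f4, res_ave)
```

The second condition is what drops row 6: its value differs from row 5, but row 5 belongs to a different subid, so it is not counted as a change.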
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of arun
> Sent: Tuesday, June 04, 2013 10:19 AM
> To: R help
> Subject: Re: [R] Read 2 rows in 1 dataframe for diff - longitudinal data
>
>
>
> Hi,
>
> By comparing some of the solutions:
> set.seed(25)
> subid<- sample(30:50,22e5,replace=TRUE)
> set.seed(27)
> year<- sample(1990:2012,22e5,replace=TRUE)
> set.seed(35)
> var1<- sample(c(1,3,5,7),22e5,replace=TRUE)
> df2<- data.frame(subid,year,var1)
> df2<- df2[order(df2$subid,df2$year),]
> library(plyr)
> system.time(res<-subset(ddply(df2,.(subid),mutate,delta=c(FALSE,var1[-1]!=var1[-length(var1)])),delta)[,-4])
> # user system elapsed
> # 8.036 0.132 8.188
>
> system.time(res2<-df2[ as.logical( ave( df2$var1, df2$subid, FUN=function(x) c( FALSE, x[-1] != x[-length(x)]) ) ), ])
> # user system elapsed
> # 1.220 0.000 1.222
> system.time(res3<-df2[with(df2,unlist(tapply(var1,list(subid),FUN=function(x) c(FALSE,diff(x)!=0)),use.names=FALSE)),])
> # user system elapsed
> # 1.729 0.000 1.730
> identical(res2,res3)
> #[1] TRUE
>
> row.names(res)<-1:nrow(res)
> row.names(res2)<-1:nrow(res2)
> identical(res,res2)
> #[1] TRUE
>
> Judging by the timings above, half an hour seems rather extreme.
>
>
> A.K.
>
>
> David:
>
> 6 47 1999 1
>
> should not be included in the output because we are trying
> to detect changes within each subid. 1999 was the first year for
> subject 47, and changes can only be detected after that year; hence we
> were using ddply to group. Your solution ran very fast, as expected.
>
> AK - I have a large dataset and your solution is taking too long;
> as a matter of fact, I had to kill it after 1/2 hr on a 22K-row dataset.
>
> Thanks for the suggestions.
>
> -ST
>
>
> ----- Original Message -----
> From: David Winsemius <dwinsemius at comcast.net>
> To: arun <smartpink111 at yahoo.com>
> Cc: R help <r-help at r-project.org>
> Sent: Tuesday, June 4, 2013 11:13 AM
> Subject: Re: [R] Read 2 rows in 1 dataframe for diff - longitudinal data
>
>
> On Jun 3, 2013, at 9:51 PM, arun wrote:
>
> > If it is grouped by "subid" (that would be the difference in the number of changes)
> >
> > subset(ddply(df1,.(subid),mutate,delta=c(FALSE,var[-1]!=var[-length(var)])),delta)[,-4]
> > # subid year var
> > #3 36 2003 3
> > #7 47 2001 3
> > #9 47 2005 1
> > #10 47 2007 3
> > A.K.
>
> I'm not sure why the first one returns integer values from the ave() call, but the second
> version works:
>
> > df1[ ave( df1$var, df1$subid, FUN=function(x) c( FALSE, x[-1] != x[-length(x)]) ), ]
> subid year var
> 1 36 1999 1
> 1.1 36 1999 1
> 1.2 36 1999 1
> 1.3 36 1999 1
>
> ave( df1$var, df1$subid, FUN=function(x) c( FALSE, x[-1] != x[-length(x)]))
> [1] 0 0 1 0 0 0 1 0 1 1
>
> Perhaps one of the single item groups sabotaged my simple function.
>
>
> > df1[ as.logical( ave( df1$var, df1$subid, FUN=function(x) c( FALSE, x[-1] != x[-length(x)]) ) ), ]
> subid year var
> 3 36 2003 3
> 7 47 2001 3
> 9 47 2005 1
> 10 47 2007 3
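
A likely explanation for the difference between the two calls above, offered as a sketch rather than a definitive diagnosis: ave() writes the per-group results back into a copy of its first argument, so the result keeps the type of df1$var (numeric) and the logical values are coerced to 0 and 1. Indexing a data frame with a numeric vector of 0s and 1s drops the 0s and selects row 1 once for every 1, which matches the four copies of row 1 shown earlier. The helper name changed below is hypothetical, introduced only for this illustration.

```r
df1 <- data.frame(
  subid = rep(c(36, 47), each = 5),
  year  = rep(seq(1999, 2007, 2), 2),
  var   = c(1, 1, 3, 3, 3, 1, 3, 3, 1, 3)
)

# Hypothetical helper: TRUE where a value differs from the previous one in its group
changed <- function(x) c(FALSE, x[-1] != x[-length(x)])

flags <- ave(df1$var, df1$subid, FUN = changed)
is.logical(flags)   # FALSE: the result has the type of df1$var
is.numeric(flags)   # TRUE

wrong <- df1[flags, ]              # numeric index: row 1 repeated once per 1
right <- df1[as.logical(flags), ]  # logical index: the rows where var changed
```

Under this reading, the single-item groups are not the cause; wrapping the result in as.logical() is what turns the 0/1 values back into a row filter.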
>
> --
> David.
> >
> >
> > ----- Original Message -----
> > From: David Winsemius <dwinsemius at comcast.net>
> > To: arun <smartpink111 at yahoo.com>
> > Cc: R help <r-help at r-project.org>
> > Sent: Tuesday, June 4, 2013 12:37 AM
> > Subject: Re: [R] Read 2 rows in 1 dataframe for diff - longitudinal data
> >
> >
> > On Jun 3, 2013, at 7:10 PM, arun wrote:
> >
> >> Hi,
> >> Maybe this helps:
> >> res1<-df1[with(df1,unlist(tapply(var,list(subid),FUN=function(x) c(FALSE,diff(x)!=0)),use.names=FALSE)),]
> >> res1
> >> # subid year var
> >> #3 36 2003 3
> >> #7 47 2001 3
> >> #9 47 2005 1
> >> #10 47 2007 3
> >> #or
> >> library(plyr)
> >> subset(ddply(df1,.(subid),mutate,delta=c(FALSE,diff(var)!=0)),delta)[,-4]
> >> # subid year var
> >> #3 36 2003 3
> >> #7 47 2001 3
> >> #9 47 2005 1
> >> #10 47 2007 3
> >> A.K.
> >>
> > It's pretty simple with logical indexing:
> >
> >> df1[ c(FALSE, df1$var[-1]!=df1$var[-length(df1$var)]), ]
> > subid year var
> > 3 36 2003 3
> > 6 47 1999 1
> > 7 47 2001 3
> > 9 47 2005 1
> > 10 47 2007 3
> >
> >
> > When I count the number of changes in the value of var, I get 5. Not sure why you are
> > both leaving out row 6.
> >
> > --
> > David.
> >>
> >>
> >> I need to output the rows of a data frame where var changes value.
> >>
> >> df1 <- data.frame(subid=rep(c(36,47),each=5),year=rep(seq(1999,2007,2),2),var=c(1,1,3,3,3,1,3,3,1,3))
> >> subid year var
> >> 1 36 1999 1
> >> 2 36 2001 1
> >> 3 36 2003 3
> >> 4 36 2005 3
> >> 5 36 2007 3
> >> 6 47 1999 1
> >> 7 47 2001 3
> >> 8 47 2003 3
> >> 9 47 2005 1
> >> 10 47 2007 3
> >>>
> >>
> >> I need:
> >> 36 2003 3
> >> 47 2001 3
> >> 47 2005 1
> >> 47 2007 3
> >>
> >> I am trying to use ddply over subid and use the diff function, but it is not working quite
> >> right.
> >>
> >>> dd <- ddply(df1,.(subid),summarize,delta=diff(var) != 0)
> >>> dd
> >> subid delta
> >> 1 36 FALSE
> >> 2 36 TRUE
> >> 3 36 FALSE
> >> 4 36 FALSE
> >> 5 47 TRUE
> >> 6 47 FALSE
> >> 7 47 TRUE
> >> 8 47 TRUE
> >>
> >> I would appreciate any help on this.
> >> Thank You!
> >> -ST
> >>
> >> ______________________________________________
> >> R-help at r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
> >
> > David Winsemius
> > Alameda, CA, USA
> >
>
> David Winsemius
> Alameda, CA, USA
>
>