[R] problem of data manipulation

Bert Gunter gunter.berton at gene.com
Mon Jan 18 21:31:37 CET 2010


Absolutely... so long as you assume the dates are in order -- or at least
that the earliest date of a group appears first. 
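
If that ordering cannot be assumed, sorting by the grouping columns and then by date is one way to establish it before applying the run-based calculation quoted below. A minimal sketch, assuming the poster's data frame with columns var1 to var4; the names a_shuffled, ord and a_sorted are my own and appear nowhere in the thread:

```r
# Rebuild the poster's example data (column names as in the original post).
a <- data.frame(
  var1 = c("s", "c", "c", "n", "n", "n"),
  var2 = c(1, 1, 1, 2, 2, 2),
  var3 = c(2, 2, 2, 1, 1, 1),
  var4 = c("01/01/1999", "10/02/2000", "13/02/2000",
           "11/02/2000", "15/02/2000", "23/02/2000"),
  stringsAsFactors = FALSE
)

# Deliberately scramble the rows to mimic unsorted input (hypothetical).
a_shuffled <- a[c(6, 3, 1, 5, 2, 4), ]

# Sort by the three grouping columns, then by the parsed date, so that each
# group is contiguous and its earliest date comes first.
dt  <- as.Date(a_shuffled$var4, "%d/%m/%Y")
ord <- order(a_shuffled$var1, a_shuffled$var2, a_shuffled$var3, dt)
a_sorted <- a_shuffled[ord, ]
a_sorted
```

Note that the sort changes the order of the groups themselves (here "c", "n", "s"), so the original row order has to be restored separately if it matters.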

-- Bert



-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
Behalf Of William Dunlap
Sent: Monday, January 18, 2010 12:15 PM
To: Bert Gunter; rusers.sh; r-help at r-project.org
Subject: Re: [R] problem of data manipulation

> -----Original Message-----
> From: r-help-bounces at r-project.org 
> [mailto:r-help-bounces at r-project.org] On Behalf Of Bert Gunter
> Sent: Monday, January 18, 2010 11:54 AM
> To: 'rusers.sh'; r-help at r-project.org
> Subject: Re: [R] problem of data manipulation
> 
> One way to do it:
> 
> 1. Convert your date column to the Date class using the 
> as.Date() function.
> This allows you to do the necessary arithmetic on the dates below.
> dt <- as.Date(a[,4],"%d/%m/%Y")
> 
> 2. Create a factor out of your first three columns whose levels are in the
> same order as the unique rows. Something like the following should do it:
> fac <- do.call(paste,a[,-4])
> fac <- factor(fac, levels=unique(fac))
> 
> This allows you to choose the groups of rows whose dates you wish to
> compare and maintain their correct order in the data frame.
> 
> 3. Then use tapply: 
> a[unlist(tapply(dt,fac,function(x)x-min(x) < 7)),]
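
The three quoted steps can be put together as one self-contained script. This is only a sketch, assuming the data frame from the original post; the names keep and result are mine, not Bert's:

```r
# Rebuild the poster's example data (column names as in the original post).
a <- data.frame(
  var1 = c("s", "c", "c", "n", "n", "n"),
  var2 = c(1, 1, 1, 2, 2, 2),
  var3 = c(2, 2, 2, 1, 1, 1),
  var4 = c("01/01/1999", "10/02/2000", "13/02/2000",
           "11/02/2000", "15/02/2000", "23/02/2000"),
  stringsAsFactors = FALSE
)

# Step 1: parse the day/month/year strings into Date objects.
dt <- as.Date(a[, 4], "%d/%m/%Y")

# Step 2: one grouping factor from the first three columns, with levels in
# order of first appearance.
fac <- do.call(paste, a[, -4])
fac <- factor(fac, levels = unique(fac))

# Step 3: within each group, flag rows less than 7 days after the group's
# earliest date, then flatten the per-group results into one logical vector.
keep <- unlist(tapply(dt, fac, function(x) x - min(x) < 7))
result <- a[keep, ]
result
```

Because unlist() concatenates the per-group results in level order, the flattened vector lines up with the rows of a only when each group's rows are contiguous, as they are in this example.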

You can do this without unpacking and repacking
the data.frame (with tapply) based on the following
sort of calculation:

  > isFirstInRun <- function(x)c(TRUE, x[-1] != x[-length(x)])
  > f <- with(a, isFirstInRun(var1) | isFirstInRun(var2) |
  +           isFirstInRun(var3))
  > firstRowInRun <- which(f)
  > runNumber <- cumsum(f)
  > dt <- as.Date(a$var4, "%d/%m/%Y")
  > DaysSinceStartOfRun <- dt - dt[firstRowInRun[runNumber]]
  > DaysSinceStartOfRun
  Time differences in days
  [1]  0  0  3  0  4 12
  > a[ DaysSinceStartOfRun < 7, ]
    var1 var2 var3       var4
  1    s    1    2 01/01/1999
  2    c    1    2 10/02/2000
  3    c    1    2 13/02/2000
  4    n    2    1 11/02/2000
  5    n    2    1 15/02/2000

Is that what you wanted?
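
For comparison, a variant that does not depend on row order at all can be sketched with ave(), which returns each row's group minimum in the original row positions. This is not from either post above; the names grp, days, group_start and keep are my own, and the data frame is assumed to be the one from the original question:

```r
# Rebuild the poster's example data (column names as in the original post).
a <- data.frame(
  var1 = c("s", "c", "c", "n", "n", "n"),
  var2 = c(1, 1, 1, 2, 2, 2),
  var3 = c(2, 2, 2, 1, 1, 1),
  var4 = c("01/01/1999", "10/02/2000", "13/02/2000",
           "11/02/2000", "15/02/2000", "23/02/2000"),
  stringsAsFactors = FALSE
)

# Dates as plain day counts, so min() and subtraction return numbers.
days <- as.numeric(as.Date(a$var4, "%d/%m/%Y"))

# A single grouping factor; drop = TRUE discards var1/var2/var3 combinations
# that never occur, which would otherwise hand empty groups to min().
grp <- interaction(a$var1, a$var2, a$var3, drop = TRUE)

# Earliest day of each row's group, aligned with the rows of a.
group_start <- ave(days, grp, FUN = min)

# Keep rows less than 7 days after their group's earliest date.
keep <- days - group_start < 7
a[keep, ]
```

The trade-off is that the rows need not be sorted or contiguous, at the cost of building the grouping factor explicitly.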

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com 
> 
> (unlist is needed to remove the list structure and 
> concatenate the logical
> indices to obtain the subscripting vector).
> 
> Bert Gunter
> Genentech Nonclinical Statistics
> 
> -----Original Message-----
> From: r-help-bounces at r-project.org 
> [mailto:r-help-bounces at r-project.org] On
> Behalf Of rusers.sh
> Sent: Monday, January 18, 2010 10:40 AM
> To: r-help at r-project.org
> Subject: [R] problem of data manipulation
> 
> Hello,
>   See my problem below.
> a<-data.frame(c("s","c","c","n","n","n"),c(rep(1,3),rep(2,3)),
>               c(rep(2,3),rep(1,3)),
>               c("01/01/1999","10/02/2000","13/02/2000",
>                 "11/02/2000","15/02/2000","23/02/2000"))
> colnames(a)<-c("var1","var2","var3","var4")
> > a
>   var1 var2 var3       var4
> 1    s    1    2 01/01/1999
> 2    c    1    2 10/02/2000
> 3    c    1    2 13/02/2000
> 4    n    2    1 11/02/2000
> 5    n    2    1 15/02/2000
> 6    n    2    1 23/02/2000
> 
>   I want to select the observations whose "var4" dates differ by less than
> 7 days among the cases with the same values of var1, var2 and var3.
>   The observations that have the same var1, var2 and var3 are part1 (obs2
> and obs3) and part2 (obs4, obs5 and obs6).
>   For obs2 and obs3, the date difference is less than 7, so we do not need
> to delete either of them.
>   For obs4, obs5 and obs6, we can see that obs6 should be deleted because
> its date is more than 7 days later than obs4.
>   So the final dataset should be obs1, obs2, obs3, obs4 and obs5.
>   I have a lot of observations in my dataset, so I hope to do this
> automatically.  Any ideas on this?
>   Thanks.
> -- 
> -----------------
> Jane Chang
> Queen's
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 


