[R] problem of data manipulation

Bert Gunter gunter.berton at gene.com
Tue Jan 19 00:06:44 CET 2010


Subject to finite precision arithmetic, it should work as before (either
approach – perhaps with an explicit as.factor() cast first for your numeric
columns in my case).
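
A minimal sketch of the factor-based approach with numeric var2 and var3. It assumes the layout of the example data frame from the original post; the decimal values are made up for illustration and are not from the real dataset. paste() turns the numbers into text when building the key, and ave() is swapped in here for tapply() plus unlist() so that the result stays in row order even when the rows of a group are not contiguous.

```r
# Example data; the decimal values are made up for illustration.
a <- data.frame(
  var1 = c("s", "c", "c", "n", "n", "n"),
  var2 = c(12.34, 12.34, 12.34, 56.78, 56.78, 56.78),
  var3 = c(2.5, 2.5, 2.5, 1.5, 1.5, 1.5),
  var4 = c("01/01/1999", "10/02/2000", "13/02/2000",
           "11/02/2000", "15/02/2000", "23/02/2000"),
  stringsAsFactors = FALSE
)

dt <- as.Date(a$var4, "%d/%m/%Y")

# Build one grouping key per row; paste() converts the numbers to text.
key <- do.call(paste, a[, c("var1", "var2", "var3")])
fac <- factor(key, levels = unique(key))

# Days since the earliest date within each group, returned in row order.
days <- ave(as.numeric(dt), fac, FUN = function(x) x - min(x))

keep <- days < 7
a[keep, ]
```

With this data the first five rows are kept and the last one is dropped.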

HOWEVER, as usual, finite precision arithmetic may now mess up either way
of doing it, depending on how the numbers in the columns were derived and what
you mean by "the same." Remember, on a computer, sqrt(2)^2 != 2.
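
To illustrate the caveat, here is a small sketch. Two numbers that are "the same" on paper can differ in their last bits if they were computed differently, and exact comparison then puts them in different groups. Rounding before grouping is one possible workaround; the choice of 2 decimals is only an assumption about what "the same" means for your data.

```r
x <- sqrt(2)^2
x == 2                      # FALSE: the two values differ in the last bits
isTRUE(all.equal(x, 2))     # TRUE: equal up to a small tolerance

# Two values that are "the same" on paper but were derived differently.
v <- c(0.1 + 0.2, 0.3)
v[1] == v[2]                # FALSE
length(unique(v))           # 2, so exact matching sees two groups

# Rounding first makes the grouping key stable.
vr <- round(v, 2)
length(unique(vr))          # 1
```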

-- Bert Gunter
Genentech Nonclinical Statistics 

________________________________________
From: rusers.sh [mailto:rusers.sh at gmail.com] 
Sent: Monday, January 18, 2010 2:29 PM
To: William Dunlap
Cc: Bert Gunter; r-help at r-project.org
Subject: Re: [R] problem of data manipulation

I just remembered that var2 and var3 in my actual dataset
are numerical data, e.g. 12.34, not factors. The example data above is
misleading.
  Suppose var2 and var3 are numerical variables, not factors. How should we
do it?
  Very sorry for the confusion.
2010/1/18 William Dunlap <wdunlap at tibco.com>
> -----Original Message-----
> From: Bert Gunter [mailto:gunter.berton at gene.com]
> Sent: Monday, January 18, 2010 12:32 PM
> To: William Dunlap; 'rusers.sh'; r-help at r-project.org
> Subject: RE: [R] problem of data manipulation
>
> Absolutely... so long as you assume the dates are in order, or at least
> that the earliest date of a group appears first.
>
> -- Bert
>
Yes, I forgot to mention that requirement.  When
there are a lot of small groups, run-based methods
(sort, then deal with a run at a time) can save a
lot of time.  They may also make the intent of
the code clearer, but not everyone sees it that way.
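
A sketch of the "sort, then deal with a run at a time" idea, which drops the ordering requirement by sorting first. It reuses isFirstInRun() and the example data from the quoted messages, with the rows shuffled on purpose. The shuffle order and the names b, ord, dtSorted, and startOfRun are made up for this illustration.

```r
isFirstInRun <- function(x) c(TRUE, x[-1] != x[-length(x)])

a <- data.frame(
  var1 = c("s", "c", "c", "n", "n", "n"),
  var2 = c(1, 1, 1, 2, 2, 2),
  var3 = c(2, 2, 2, 1, 1, 1),
  var4 = c("01/01/1999", "10/02/2000", "13/02/2000",
           "11/02/2000", "15/02/2000", "23/02/2000"),
  stringsAsFactors = FALSE
)
a <- a[c(6, 3, 1, 5, 2, 4), ]   # shuffle so groups are no longer contiguous

dt <- as.Date(a$var4, "%d/%m/%Y")

# Sort by the grouping columns, then by date, so that each group forms one
# run and the earliest date of the group comes first in its run.
ord <- order(a$var1, a$var2, a$var3, dt)
b <- a[ord, ]
dtSorted <- dt[ord]

# TRUE wherever any grouping column changes, i.e. at the start of each run.
f <- isFirstInRun(b$var1) | isFirstInRun(b$var2) | isFirstInRun(b$var3)
startOfRun <- which(f)[cumsum(f)]
daysSinceStartOfRun <- dtSorted - dtSorted[startOfRun]

b[daysSinceStartOfRun < 7, ]
```

The result comes back in sorted order rather than the original row order; the original row names are kept, so the rows can still be traced back.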

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On
> Behalf Of William Dunlap
> Sent: Monday, January 18, 2010 12:15 PM
> To: Bert Gunter; rusers.sh; r-help at r-project.org
> Subject: Re: [R] problem of data manipulation
>
> > -----Original Message-----
> > From: r-help-bounces at r-project.org
> > [mailto:r-help-bounces at r-project.org] On Behalf Of Bert Gunter
> > Sent: Monday, January 18, 2010 11:54 AM
> > To: 'rusers.sh'; r-help at r-project.org
> > Subject: Re: [R] problem of data manipulation
> >
> > One way to do it:
> >
> > 1. Convert your date column to the Date class using the
> > as.Date() function.
> > This allows you to do the necessary arithmetic on the dates below.
> > dt <- as.Date(a[,4],"%d/%m/%Y")
> >
> > 2. Create a factor out of your first three columns whose
> > levels are in the same order as the unique rows. Something
> > like the following should do it:
> > fac <- do.call(paste,a[,-4])
> > fac <- factor(fac, levels=unique(fac))
> >
> > This allows you to choose the groups of rows whose dates you
> > wish to compare
> > and maintain their correct order in the data frame
> >
> > 3. Then use tapply:
> > a[unlist(tapply(dt,fac,function(x)x-min(x) < 7)),]
>
> You can do this without unpacking and repacking
> the data.frame (with tapply) based on the following
> sort of calculation:
>
>   > isFirstInRun <- function(x)c(TRUE, x[-1] != x[-length(x)])
>   > f <- with(a, isFirstInRun(var1) | isFirstInRun(var2) | isFirstInRun(var3))
>   > firstRowInRun <- which(f)
>   > runNumber <- cumsum(f)
>   > dt <- as.Date(a$var4, "%d/%m/%Y")
>   > DaysSinceStartOfRun <- dt - dt[firstRowInRun[runNumber]]
>   > DaysSinceStartOfRun
>   Time differences in days
>   [1]  0  0  3  0  4 12
>   > a[ DaysSinceStartOfRun < 7, ]
>     var1 var2 var3       var4
>   1    s    1    2 01/01/1999
>   2    c    1    2 10/02/2000
>   3    c    1    2 13/02/2000
>   4    n    2    1 11/02/2000
>   5    n    2    1 15/02/2000
>
> Is that what you wanted?
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
> >
> > (unlist is needed to remove the list structure and concatenate the
> > logical indices to obtain the subscripting vector).
> >
> > Bert Gunter
> > Genentech Nonclinical Statistics
> >
> > -----Original Message-----
> > From: r-help-bounces at r-project.org
> > [mailto:r-help-bounces at r-project.org] On
> > Behalf Of rusers.sh
> > Sent: Monday, January 18, 2010 10:40 AM
> > To: r-help at r-project.org
> > Subject: [R] problem of data manipulation
> >
> > Hello,
> >   See my problem below.
> > a<-data.frame(c("s","c","c","n","n","n"),c(rep(1,3),rep(2,3)),
> >               c(rep(2,3),rep(1,3)),
> >               c("01/01/1999","10/02/2000","13/02/2000",
> >                 "11/02/2000","15/02/2000","23/02/2000"))
> > colnames(a)<-c("var1","var2","var3","var4")
> > > a
> >   var1 var2 var3       var4
> > 1    s    1    2    01/01/1999
> > 2    c    1    2    10/02/2000
> > 3    c    1    2    13/02/2000
> > 4    n    2    1    11/02/2000
> > 5    n    2    1    15/02/2000
> > 6    n    2    1    23/02/2000
> >
> >   I want to select the observations whose difference in "var4" is less
> > than 7 days among the cases with the same values of var1, var2 and var3.
> >   The observations that have the same var1, var2 and var3 are part1
> > (obs2 and obs3) and part2 (obs4, obs5, and obs6).
> >   For obs2 and obs3, the date difference is less than 7, so we do not
> > need to delete either of them.
> >   For obs4, obs5, and obs6, we can see that obs6 should be deleted because
> > its date is more than 7 days later than that of obs4.
> >   So the final dataset should contain obs1, obs2, obs3, obs4, and obs5.
> >   I have a lot of observations in my dataset, so I hope to do this
> > automatically.  Any ideas on this?
> >   Thanks.
> > --
> > -----------------
> > Jane Chang
> > Queen's
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
>



-- 
-----------------
Jane Chang
Queen's


