[R] Help with selection of continuous data

Mon Jun 21 12:32:26 CEST 2021

Hi André

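Here is another approach using split/lapply:

# per date: TRUE where ID is below 9
lll <- lapply(split(A$ID, A$Date), function(x) x<9)
# keep those rows only on dates that have at least 8 such IDs
A$select <- unlist(lapply(lll, function(x) x*sum(x)>=8))
A
A[A$select,]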
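
One caveat on the code above: split() returns the groups in sorted Date order, so unlist() lines up with the rows of A only when the data frame is already ordered by Date (as in the example). Below is a minimal sketch of an order-preserving variant using ave(). The data frame D and the names by_split and by_ave are made up for illustration.

```r
# Hypothetical toy data whose dates are NOT in sorted order ("b" before "a").
D <- data.frame(Date = rep(c("b", "a"), times = c(9, 4)),
                ID   = c(1:9, 1:4))

# split() orders the groups by sorted Date, so unlist() returns the "a"
# rows first and the flags no longer match the row order of D.
lll <- lapply(split(D$ID, D$Date), function(x) x < 9)
by_split <- unname(unlist(lapply(lll, function(x) x * sum(x) >= 8)))

# ave() applies the same per-date rule but writes the result back
# into the original row positions.
by_ave <- ave(D$ID < 9, D$Date, FUN = function(x) x & sum(x) >= 8)

D[by_ave, ]
```

With sorted data both versions should give the same flags, so this only matters if the real data frame is not ordered by Date.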

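However, if your real data frame does not have the same properties as the one you showed, the results could be wrong.

For example, if A$ID within a date does not contain the 8 consecutive values 1:8 but instead something like 1,1,2,2,3,3,4,4,5,5,... or 1,1,1,1,1,1,1,1,..., the code only counts the values below 9 and would still select those rows.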
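
Below is a minimal sketch of that failure case, together with a membership test that avoids it. The data frame B and the names count_sel, ok_dates and sel are invented for this illustration, and the rule "a date qualifies only if every value in 1:8 occurs" is my reading of the question.

```r
# Hypothetical toy data: date "d1" has IDs 1..9; date "d2" has eight IDs
# below 9, but only the values 1..4, each repeated twice.
B <- data.frame(Date = rep(c("d1", "d2"), times = c(9, 8)),
                ID   = c(1:9, rep(1:4, each = 2)))

# Counting approach: a date passes as soon as it has 8 IDs below 9,
# so the rows of "d2" are selected as well.
lll <- lapply(split(B$ID, B$Date), function(x) x < 9)
count_sel <- unlist(lapply(lll, function(x) x * sum(x) >= 8))

# Membership approach: a date passes only if every value in 1:8 occurs.
ok_dates <- names(which(sapply(split(B$ID, B$Date),
                               function(x) all(1:8 %in% x))))
sel <- B$Date %in% ok_dates & B$ID %in% 1:8

B[sel, ]
```

The membership version does not depend on row order, but it still keeps duplicated IDs if a qualifying date contains them.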

Cheers
Petr

> -----Original Message-----
> From: R-help <r-help-bounces using r-project.org> On Behalf Of Jim Lemon
> Sent: Monday, June 21, 2021 12:11 PM
> To: Eric Berger <ericjberger using gmail.com>
> Cc: R mailing list <r-help using r-project.org>; André Luis Neves
> <andrluis using ualberta.ca>
> Subject: Re: [R] Help with selection of continuous data
> 
> Hi Andre,
> I've taken a different approach to that employed by Eric:
> 
> A <- data.frame(
>   c("01/01/2020","01/01/2020","01/01/2020","01/01/2020","01/01/2020","01/01/2020",
>     "01/01/2020","01/01/2020","01/01/2020","01/01/2020","01/01/2020","01/01/2020",
>     "01/02/2020","01/02/2020","01/02/2020","01/02/2020",
>     "01/03/2020","01/03/2020","01/03/2020","01/03/2020","01/03/2020","01/03/2020",
>     "01/03/2020",
>     "01/04/2020","01/04/2020","01/04/2020","01/04/2020","01/04/2020",
>     "01/04/2020","01/04/2020","01/04/2020","01/04/2020"),
>   c(23,22,12,24,26,19,34,15,17,19,23,33,23,34,25,23,25,24,34,33,31,32,24,22,21,
>     23,22,22,21,23,23,21),
>   c(13,11,12,9,8,9,7,10,11,9,6,11,9,8,9,10,11,12,9,8,10,4,6,9,8,9,10,11,14,12,
>     13,11),
>   c(1,2,3,4,5,6,7,8,9,10,11,12,1,2,3,4,1,2,
>     3,4,5,6,7,1,2,3,4,5,6,7,8,9))
> colnames(A) <- c("Date", "CO2", "CH4", "ID")
> # add a variable to compile selected rows
> A$select <- FALSE
> # get all unique dates
> alldates <- unique(A$Date)
> for(date in alldates) {
>  # get indices for this date
>  date_indices <- which(A$Date == date)
>  # only mark the first 8 as TRUE
>  A$select[date_indices[1:8]] <- all(1:8 %in% A$ID[date_indices])
> }
> A
> A[A$select,]
> 
> If you don't want to add a column you can set up "select" as a vector.
> 
> Jim
> 
> On Mon, Jun 21, 2021 at 6:18 PM Eric Berger <ericjberger using gmail.com> wrote:
> >
> > Hi André,
> > It's not 100% clear to me what you are asking. I am interpreting the
> > question as selecting the data from those dates for which all of
> > 1,2,3,4,5,6,7,8 appear in the ID column.
> > My approach determines the dates satisfying this property, which I put
> > into a vector dtV. Then I take the rows of A for which the date is in
> > the vector dtV.
> >
> > library(dplyr)
> > dtV <- A %>% mutate(x=2^(ID-1)) %>% group_by(Date) %>%
> >   summarise(y=(sum(unique(x))%%256==255)) %>% filter(y==TRUE) %>%
> >   select(Date)
> > B <- A[ A$Date %in% dtV$Date, ]
> >
> > B is the subset of A that you want.
> >
> > HTH,
> > Eric
> >
> >
> >
> > On Mon, Jun 21, 2021 at 10:23 AM André Luis Neves
> > <andrluis using ualberta.ca>
> > wrote:
> >
> > > Dear R users,
> > >
> > > I want to select only the data containing a continuous number of
> > > *ID* from
> > > 1-8 in each *DATE*. Note, I do not want to select data that do not
> > > contain a continuous number in *ID* from 1-8 (e.g. data on *DATE*
> > > 01/02/2020 and 01/03/2020). The dataset is a huge matrix with 24
> > > columns and 1.5 million rows, but I have prepared reproducible code
> > > for your reference below.
> > >
> > > Here is the reproducible code:
> > >
> > > A =
> > > data.frame(c("01/01/2020","01/01/2020","01/01/2020","01/01/2020","01/01/2020",
> > >              "01/01/2020","01/01/2020","01/01/2020","01/01/2020","01/01/2020",
> > >              "01/01/2020","01/01/2020",
> > >              "01/02/2020","01/02/2020","01/02/2020","01/02/2020",
> > >              "01/03/2020","01/03/2020","01/03/2020","01/03/2020","01/03/2020",
> > >              "01/03/2020","01/03/2020",
> > >              "01/04/2020","01/04/2020","01/04/2020","01/04/2020","01/04/2020",
> > >              "01/04/2020","01/04/2020","01/04/2020","01/04/2020"),
> > >            c(23,22,12,24,26,19,34,15,17,19,23,33,
> > >              23,34,25,23,25,24,34,33,31,32,24,22,21,23,22,22,21,23,23,21),
> > >            c(13,11,12,9,8,9,7,10,11,9,6,11,
> > >              9,8,9,10,11,12,9,8,10,4,6,9,8,9,10,11,14,12,13,11),
> > >            c(1,2,3,4,5,6,7,8,9,10,11,12,1,2,3,4,1,2,
> > >              3,4,5,6,7,1,2,3,4,5,6,7,8,9))
> > > colnames(A) <- c("Date", "CO2", "CH4", "ID")
> > > A
> > >
> > > Thank you,
> > > --
> > > Andre
> > >
> > >         [[alternative HTML version deleted]]
> > >
> > > ______________________________________________
> > > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide
> > > http://www.R-project.org/posting-guide.html
> > > and provide commented, minimal, self-contained, reproducible code.
> > >
> >
> >         [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> 
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-
> guide.html
> and provide commented, minimal, self-contained, reproducible code.