[R] Data-frame selection

peter dalgaard pdalgd at gmail.com
Sun Oct 11 00:47:56 CEST 2015


These situations where the desired results depend on the order of observations in a dataset do tend to get a little tricky (this is one kind of problem that is easier to handle in a SAS DATA step with its sequential processing paradigm). I think this will do it:

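keep <- function(d)
   with(d, {
     n <- length(Group)
     # TRUE for row 1 and for every row whose Group differs from the previous row
     i <- c(TRUE, Group[-n] != Group[-1])
     # number the sequences within each Group; keep only sequence 1
     unsplit(lapply(split(i, Group), cumsum), Group) == 1
   })
# apply keep() within each ID and put the results back in row order
kp <- unsplit(lapply(split(teste, teste$ID), keep), teste$ID)
teste[kp,]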

I.e. keep() is a function applied to each ID-subset of the data frame, returning a logical vector of the observations that you want to keep. 
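
As a quick sanity check of the split/keep/unsplit pattern, here is a minimal sketch on a made-up data frame. The name toy and its values are hypothetical, not part of the posted data; keep() is repeated only so the snippet runs on its own.

```r
# keep() as posted, repeated so this snippet is self-contained.
keep <- function(d)
  with(d, {
    n <- length(Group)
    i <- c(TRUE, Group[-n] != Group[-1])
    unsplit(lapply(split(i, Group), cumsum), Group) == 1
  })

# Hypothetical toy data: ID "a" has the sequences T2, T3, T2, T3;
# ID "b" has a single row.
toy <- data.frame(
  ID    = factor(c("a", "a", "a", "a", "a", "b")),
  Group = factor(c("T2", "T2", "T3", "T2", "T3", "T2"),
                 levels = c("T2", "T3")),
  Var   = c(1, 2, 3, 4, 5, 6)
)

# One logical vector per ID, reassembled in the original row order.
kp <- unsplit(lapply(split(toy, toy$ID), keep), toy$ID)
toy[kp, ]   # rows 1, 2, 3 (first T2 and T3 sequences of "a") and row 6
```

Rows 4 and 5 are dropped because they belong to the second T2 and second T3 sequences within ID "a".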

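i is an indicator that an observation is the first in a sequence, i.e. the first row of the ID-subset or a row whose Group differs from that of the previous row. Splitting i by Group and cumsum'ing numbers the sequences within each group: 1 for the first sequence, 2 for the next, etc. The observations to keep are the ones for which this value is 1.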
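
To make the intermediate values concrete, the sketch below traces them for the Group column of ID 1 in the posted data (rows 3 to 20: seven T2, six T3, two T2, three T3). The name run_no is introduced here only for illustration.

```r
# Group sequence of ID 1 (rows 3 to 20 of the posted data).
Group <- factor(rep(c("T2", "T3", "T2", "T3"), times = c(7, 6, 2, 3)))
n <- length(Group)

# TRUE at the first position and wherever Group changes.
i <- c(TRUE, Group[-n] != Group[-1])
which(i)             # positions 1, 8, 14, 16 start a new sequence

# Within each Group, cumsum(i) numbers that group's sequences 1, 2, ...
run_no <- unsplit(lapply(split(i, Group), cumsum), Group)
run_no               # 1 for the first 13 positions, 2 for the last 5

# Positions belonging to the first sequence of their Group.
which(run_no == 1)   # positions 1 to 13, i.e. rows 3 to 15 of the data
```

Positions 14 to 18 get the value 2 and are dropped, which corresponds to discarding rows 16 to 20.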

-pd

> On 10 Oct 2015, at 22:27 , Cacique Samurai <caciquesamurai at gmail.com> wrote:
> 
> Hello Jeff!
> 
> Thanks very much for your prompt reply, but this is not exactly what I
> need. I need the first sequence of records. In the example that I sent, I
> need the first seven lines of group "T2" in ID "1" (lines 3 to 9) and
> the other six lines of group "T3" in ID "1" (lines 10 to 15). I have to
> discard lines 16 to 20, which represent repeated sequential records of
> those groups in the same ID.
> 
> For other IDs (I sent just a small piece of my data) I have many more
> sequential lines of records of each group in each ID, and many
> sequential records that should be discarded. In some cases, I have just
> one record of a group in an ID.
> 
> As I said, I tried to use a labeling variable that marks the seven
> lines 3 to 9 as 1 (first sequence of T2 in ID 1), lines 10 to 15 as 1
> (first sequence of T3 in ID 1), lines 16 and 17 as 2 (second sequence
> of T2 in ID 1) and lines 18 to 20 as 2 (second sequence of T3 in ID
> 1), and so on... Then it would be easy to take just the first record for each
> ID. But the code that I wrote was a long, long sequence of loops that in the end
> did not work as I wanted.
> 
> Once more, thanks in advance for your attention and help,
> 
> Raoni
> 
> 2015-10-10 13:13 GMT-03:00 Jeff Newmiller <jdnewmil at dcn.davis.ca.us>:
>> ?aggregate
>> 
>> in base R. Make a short function that returns the first element of a vector and give that to aggregate.
>> 
>> Or...
>> 
>> library(dplyr)
>> ( test %>% group_by( ID, Group ) %>% summarise( Var=first( Var ) ) %>% as.data.frame )
>> 
>> On October 10, 2015 8:38:00 AM PDT, Cacique Samurai <caciquesamurai at gmail.com> wrote:
>>> Hello R-Helpers!
>>> 
>>> I have a data-frame as below (dput at the end of the mail) and need to
>>> select just the first sequence of occurrences of each "Group" in each
>>> "ID".
>>> 
>>> For example, for ID "1" I have two sequential occurrences of T2 and
>>> two sequential occurrences of T3:
>>> 
>>>> test [test$ID == 1, ]
>>>  ID Group  Var
>>> 3   1    T2 2.94
>>> 4   1    T2 3.23
>>> 5   1    T2 1.40
>>> 6   1    T2 1.62
>>> 7   1    T2 2.43
>>> 8   1    T2 2.53
>>> 9   1    T2 2.25
>>> 10  1    T3 1.66
>>> 11  1    T3 2.86
>>> 12  1    T3 0.53
>>> 13  1    T3 1.66
>>> 14  1    T3 3.24
>>> 15  1    T3 1.34
>>> 16  1    T2 1.86
>>> 17  1    T2 3.03
>>> 18  1    T3 3.63
>>> 19  1    T3 2.78
>>> 20  1    T3 1.49
>>> 
>>> As output, I need just the first group of T2 and T3 for this ID, like:
>>> 
>>> ID Group  Var
>>> 3   1    T2 2.94
>>> 4   1    T2 3.23
>>> 5   1    T2 1.40
>>> 6   1    T2 1.62
>>> 7   1    T2 2.43
>>> 8   1    T2 2.53
>>> 9   1    T2 2.25
>>> 10  1    T3 1.66
>>> 11  1    T3 2.86
>>> 12  1    T3 0.53
>>> 13  1    T3 1.66
>>> 14  1    T3 3.24
>>> 15  1    T3 1.34
>>> 
>>> For the other IDs I have just one occurrence or one sequence of occurrences of
>>> each Group.
>>> 
>>> I tried to use a labeling variable, but cannot figure out how to do this
>>> without many, many loops.
>>> 
>>> Thanks in advance,
>>> 
>>> Raoni
>>> 
>>> dput (teste)
>>> structure(list(ID = structure(c(3L, 4L, 1L, 1L, 1L, 1L, 1L, 1L,
>>> 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
>>> 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
>>> 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("1", "2",
>>> "3", "4"), class = "factor"), Group = structure(c(1L, 2L, 1L,
>>> 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L,
>>> 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L,
>>> 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L), .Label =
>>> c("T2",
>>> "T3"), class = "factor"), Var = c(0.32, 1.59, 2.94, 3.23, 1.4,
>>> 1.62, 2.43, 2.53, 2.25, 1.66, 2.86, 0.53, 1.66, 3.24, 1.34, 1.86,
>>> 3.03, 3.63, 2.78, 1.49, 2, 2.39, 1.65, 2.05, 2.75, 2.23, 1.39,
>>> 2.66, 1.05, 2.52, 2.49, 2.97, 0.43, 1.36, 0.79, 1.71, 1.95, 2.73,
>>> 2.73, 2.39, 2.17, 2.34, 2.42, 1.75, 0.66, 1.64, 0.24, 2.11, 2.11,
>>> 1.18)), .Names = c("ID", "Group", "Var"), row.names = c(NA, 50L
>>> ), class = "data.frame")
>>> 
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>> 
> 
> 
> 
> -- 
> Raoni Rosa Rodrigues
> Research Associate of Fish Transposition Center CTPeixes
> Universidade Federal de Minas Gerais - UFMG
> Brasil
> rodrigues.raoni at gmail.com
> 

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com


