[R] R Processing dataframe by group - equivalent to SAS by group processing with a first. and retain statements

Wed Nov 27 22:38:36 CET 2024

The grouping solutions offered seem to be the obvious way to do this and
may even be more efficient in R than what follows below.  However, note
that they do some unnecessary work: the ordering of the data frame already
implicitly provides the grouping, so the hashing (or whatever the grouping
functions use under the hood) to determine it is not needed.

So I was wondering how easy it would be to use purely elementary means to
take advantage of this and avoid the "unnecessary" work.  A more or less
obvious approach that occurred to me was to use R's rle() function. I'll
first give a prolix, step-by-step explanation for those who may not have
used rle(). Then I'll give a concise version of code.

Assume "dat" is the example data frame of two columns that John gave.
Then:
rle(dat[,1])  ##gives a list with two components:
Run Length Encoding
  lengths: int [1:3] 10 6 2
  values : int [1:3] 1 2 3

This gives us the grouping for the ID column: 10 1's, followed by 6 2's,
followed by 2 3's.
Clearly, the row indices for the first row in each group are 1, 11, and 17.
We can get this from the "lengths" component of rle() by:

lens <- rle(dat[,1])$lengths
## Then
cumsum(c(1, lens[-length(lens)]))
[1]  1 11 17
## Therefore, the first days are
dat[cumsum(c(1, lens[-length(lens)])), 2]
[1]  1  5 10
##  So just rep() this with lens to give the FirstDay column:
rep(dat[cumsum(c(1, lens[-length(lens)])), 2], lens)
[1]  1  1  1  1  1  1  1  1  1  1  5  5  5  5  5  5 10 10
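For anyone who wants to run the steps above end to end, here is a self-contained sketch. It assumes the data frame has columns named ID and Day (John's posted code names the second column "date", so "dat" is rebuilt here under that assumption), and the intermediate names "runs", "starts" and "first_days" are mine, introduced only for illustration.

```r
# Assumed reconstruction of the example data; the column names ID and Day
# are chosen to match the code in this message.
dat <- data.frame(
  ID  = rep(c(1, 2, 3), times = c(10, 6, 2)),
  Day = c(rep(1:5, each = 2), rep(5, 3), rep(6, 3), rep(10, 2))
)

runs   <- rle(dat$ID)                        # consecutive runs of identical IDs
lens   <- runs$lengths                       # run lengths: 10, 6, 2
starts <- cumsum(c(1, lens[-length(lens)]))  # first row of each run: 1, 11, 17
first_days <- dat$Day[starts]                # Day at those rows: 1, 5, 10

# Repeat each first day along its run to fill the new column
dat$FirstDay <- rep(first_days, lens)
dat
```

Nothing here looks up or hashes the ID values; only adjacent rows are compared, which is the point of the exercise.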

Here's a concise version of the code:

lens <- rle(dat$ID)$lengths
dat <- within(dat,
   FirstDay <- Day[cumsum(c(1, lens[-length(lens)]))] |> rep(lens)
)
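One caveat worth making explicit: rle() only sees consecutive runs, so this approach relies on each ID occupying one contiguous block of rows. The sketch below wraps the idea in a small helper that checks for that and compares the result with the ave() solution from the thread. The helper name first_in_run() is hypothetical (it is not an existing R function), and the example data are rebuilt under the assumption that the columns are named ID and Day.

```r
# Hypothetical helper (not from the thread): the first value of `x` within
# each consecutive run of `g`, repeated along that run.
first_in_run <- function(x, g) {
  stopifnot(length(x) == length(g))
  if (length(x) == 0L) return(x)
  r <- rle(g)
  # A value appearing in more than one run means the groups are not contiguous
  if (anyDuplicated(r$values) > 0L)
    stop("groups are not contiguous; sort by the grouping variable first")
  starts <- cumsum(c(1L, r$lengths[-length(r$lengths)]))
  rep(x[starts], r$lengths)
}

# Assumed reconstruction of the example data
dat <- data.frame(
  ID  = rep(c(1, 2, 3), times = c(10, 6, 2)),
  Day = c(rep(1:5, each = 2), rep(5, 3), rep(6, 3), rep(10, 2))
)

dat$FirstDay <- first_in_run(dat$Day, dat$ID)

# Should agree with the grouping-based solution on sorted data
stopifnot(all(dat$FirstDay == ave(dat$Day, dat$ID, FUN = function(x) x[1L])))
```

On unsorted IDs the helper stops with an error instead of silently treating each run as a separate group, whereas ave() would still group by value.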

Again, I realize that this sacrifices the clarity of the other solutions
that have been given, so I certainly do not claim that it is "better".
Nevertheless, I hope it shows another approach that might be interesting
and occasionally even useful.

Cheers,
Bert

On Wed, Nov 27, 2024 at 11:38 AM Jeff Newmiller via R-help <
r-help using r-project.org> wrote:

> Was wondering when this would be suggested. But the question was about
> getting the final dataframe...
>
>
> newdta <- olddta
> newdta$FirstDay <- ave(newdta$date, newdta$ID, FUN = \(x) x[1L])
>
> On November 27, 2024 11:13:49 AM PST, Rui Barradas <ruipbarradas using sapo.pt>
> wrote:
> >At 16:30 on 27/11/2024, Sorkin, John wrote:
> >> I am an old, long time SAS programmer. I need to produce R code that
> processes a dataframe in a manner that is equivalent to that produced by
> using a by statement in SAS and an if first.day statement and a retain
> statement:
> >>
> >> I want to take data (olddata) that looks like this
> >> ID   Day
> >> 1    1
> >> 1    1
> >> 1    2
> >> 1    2
> >> 1    3
> >> 1    3
> >> 1    4
> >> 1    4
> >> 1    5
> >> 1    5
> >> 2    5
> >> 2    5
> >> 2    5
> >> 2    6
> >> 2    6
> >> 2    6
> >> 3    10
> >> 3    10
> >>
> >> and make it look like this:
> >> (within each ID I am copying the first value of Day into a new
> variable, FirstDay, and propagating the FirstDay value through all rows
> that have the same ID):
> >>
> >> ID   Day     FirstDay
> >> 1    1       1
> >> 1    1       1
> >> 1    2       1
> >> 1    2       1
> >> 1    3       1
> >> 1    3       1
> >> 1    4       1
> >> 1    4       1
> >> 1    5       1
> >> 1    5       1
> >> 2    5       5
> >> 2    5       5
> >> 2    5       5
> >> 2    6       5
> >> 2    6       5
> >> 2    6       5
> >> 3    10      10
> >> 3    10      10
> >>
> >> SAS code that can do this is:
> >>
> >> proc sort data=olddata;
> >>    by ID Day;
> >> run;
> >>
> >> data newdata;
> >>    retain FirstDay;
> >>    set olddata;
> >>    by ID;
> >>    if first.ID then FirstDay=Day;
> >> run;
> >>
> >> I have NO idea how to do this in R (so I can't post test-code), but
> below I have R code that creates olddata:
> >>
> >> ID <- c(rep(1,10),rep(2,6),rep(3,2))
> >> date <- c(rep(1,2),rep(2,2),rep(3,2),rep(4,2),rep(5,2),
> >>            rep(5,3),rep(6,3),rep(10,2))
> >> date
> >> olddata <- data.frame(ID=ID,date=date)
> >> olddata
> >>
> >> Any suggestions on how to do this would be appreciated. . . I have
> worked on this for more than 12-hours, despite multiple web searches I have
> gotten nowhere. . .
> >>
> >> Thanks
> >> John
> >>
> >>
> >>
> >>
> >> John David Sorkin M.D., Ph.D.
> >> Professor of Medicine, University of Maryland School of Medicine;
> >> Associate Director for Biostatistics and Informatics, Baltimore VA
> Medical Center Geriatrics Research, Education, and Clinical Center;
> >> PI Biostatistics and Informatics Core, University of Maryland School of
> Medicine Claude D. Pepper Older Americans Independence Center;
> >> Senior Statistician University of Maryland Center for Vascular Research;
> >>
> >> Division of Gerontology and Palliative Care,
> >> 10 North Greene Street
> >> GRECC (BT/18/GR)
> >> Baltimore, MD 21201-1524
> >> Cell phone 443-418-5382
> >>
> >>
> >>
> >> ______________________________________________
> >> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide
> https://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
> >Hello,
> >
> >Isn't ?ave the simplest way?
> >The first one-liner assumes the dates are sorted in ascending order.
> >
> >
> >ave(olddata$date, olddata$ID, FUN = \(x) x[1L])
> >#>  [1]  1  1  1  1  1  1  1  1  1  1  5  5  5  5  5  5 10 10
> >
> >
> >If the dates are not sorted,
> >
> >
> >ave(olddata$date, olddata$ID, FUN = \(x) min(x))
> >
> >
> >
> >Hope this helps,
> >
> >Rui Barradas
> >
> >
>
> --
> Sent from my phone. Please excuse my brevity.
>
