[R] Identify first row of each ID within a data frame, create a variable first =1 for the first row and first=0 of all other rows

Mon Dec 2 05:18:40 CET 2024

Dear Colleagues,

I am grateful to all of you for helping me with my question: how to write R code that identifies the first row of each ID within a data frame and creates a variable first that is 1 for the first row and 0 for all repeats of the ID.

WOW!!!
I just saw Boris Steipe's answer to my question:
olddata$first <- as.numeric(! duplicated(olddata$ID))
The solution is elegant, short, easy to understand, and it uses base R! All important characteristics of a good solution, at least for me. While I want to learn solutions using packages that extend base R, I believe that a good programmer first learns how to do something in the base language and, once that is learned, explores ways to solve a programming problem using advanced packages.
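
For readers who want to try the one-liner, here is a minimal sketch that rebuilds the example data from the original question and applies it. The compact rep(..., each = ) construction of the date column is a convenience of this sketch, not the code from the question, and as.integer is used in place of as.numeric as suggested elsewhere in the thread.

```r
# Rebuild the example data from the original question.
olddata <- data.frame(
  ID   = c(rep(1, 10), rep(2, 6), rep(3, 2)),
  date = c(rep(1:5, each = 2), rep(5:6, each = 3), rep(10, 2))
)

# duplicated() is FALSE the first time a value appears and TRUE afterwards,
# so negating it flags the first row of each ID.
olddata$first <- as.integer(!duplicated(olddata$ID))

# Row positions flagged as the first row of an ID
which(olddata$first == 1L)   # 1 11 17
```

Note that duplicated() flags the first occurrence of an ID anywhere in the column, so if the same ID reappeared later in a separate block of rows, that second block would not get a 1.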

Each and every one of you (I hope I did not miss anyone in my list of email addresses) took the time to read my emails and respond to me. Your collective help is invaluable, and I am in your collective debt.

Many, many thanks,
John

John David Sorkin M.D., Ph.D.
Professor of Medicine, University of Maryland School of Medicine;
Associate Director for Biostatistics and Informatics, Baltimore VA Medical Center Geriatrics Research, Education, and Clinical Center;
PI Biostatistics and Informatics Core, University of Maryland School of Medicine Claude D. Pepper Older Americans Independence Center;
Senior Statistician University of Maryland Center for Vascular Research;

Division of Gerontology and Palliative Care,
10 North Greene Street
GRECC (BT/18/GR)
Baltimore, MD 21201-1524
Cell phone 443-418-5382

________________________________________
From: Bert Gunter <bgunter.4567 using gmail.com>
Sent: Sunday, December 1, 2024 11:30 AM
To: Rui Barradas
Cc: Sorkin, John; r-help using r-project.org (r-help using r-project.org)
Subject: Re: [R] Identify first row of each ID within a data frame, create a variable first =1 for the first row and first=0 of all other rows

Rui:
"Of these two, diff is faster. But of all the solutions posted so far,
Ben Bolker's is the fastest."

But the explicit version of diff is still considerably faster:

> D <- c(rep(1,10),rep(2,6),rep(3,2))

> microbenchmark(c(1L,diff(D)), times = 1000L)
Unit: microseconds
           expr   min    lq    mean median    uq    max neval
 c(1L, diff(D)) 3.075 3.198 3.34396   3.28 3.362 29.684  1000

> microbenchmark( as.integer(!duplicated(D)), times =1000L)
Unit: microseconds
                       expr   min    lq     mean median   uq  max neval
 as.integer(!duplicated(D)) 1.476 1.558 1.644264  1.599 1.64 16.4  1000

> microbenchmark( D - c(0L, D[-length(D)]), times = 1000L)
Unit: nanoseconds  ## note that unit is nanoseconds not microseconds
                     expr min  lq    mean median  uq  max neval
 D - c(0L, D[-length(D)]) 369 410 489.335    492 533 9840  1000
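
A caveat on the two arithmetic versions: they return a 1/0 flag on this data because the IDs are 1, 2, 3, that is, they start at 1 and increase by exactly 1 between groups. The sketch below compares each element with its predecessor instead, which does not depend on the ID values. The helper name first_of_run and the vector E are hypothetical, introduced only for illustration, and the helper was not part of the timings above.

```r
# Hypothetical helper (not from the thread): flag the first row of each run
# of identical IDs by comparing every element with its predecessor.
first_of_run <- function(id) {
  n <- length(id)
  if (n == 0L) return(integer(0))
  as.integer(c(TRUE, id[-1L] != id[-n]))
}

D <- c(rep(1, 10), rep(2, 6), rep(3, 2))
E <- c(rep(5, 3), rep(9, 2))   # IDs that neither start at 1 nor step by 1

first_of_run(D)           # same 1/0 pattern as as.integer(!duplicated(D))
first_of_run(E)           # 1 0 0 1 0
E - c(0, E[-length(E)])   # 5 0 0 4 0: the subtraction is no longer a 0/1 flag
```

Unlike duplicated(), this flags the start of every contiguous run, so an ID that reappears in a later block would be flagged again.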

Cheers,
Bert

On Sat, Nov 30, 2024 at 11:05 PM Rui Barradas <ruipbarradas using sapo.pt> wrote:
>
> At 02:27 on 01/12/2024, Sorkin, John wrote:
> > Dear R help folks,
> >
> > First, my apologies for sending several related questions to the list server. I am trying to learn how to manipulate data in R . . . and am having difficulty getting my program to work. I greatly appreciate the help and support list members give!
> >
> > I am trying to write a program that will run through a data frame organized by ID and for the first line of each new group of data lines that has the same ID create a new variable first that will be 1 for the first line of the group and 0 for all other lines.
> >
> > e.g. if my original data is
> >   olddata
> >     ID date
> >      1     1
> >      1     1
> >      1     2
> >      1     2
> >      1     3
> >      1     3
> >      1     4
> >      1     4
> >      1     5
> >      1     5
> >      2     5
> >      2     5
> >      2     5
> >      2     6
> >      2     6
> >      2     6
> >      3   10
> >      3   10
> >
> > the new data will be
> > newdata
> >     ID date  first
> >      1     1       1
> >      1     1       0
> >      1     2       0
> >      1     2       0
> >      1     3       0
> >      1     3       0
> >      1     4       0
> >      1     4       0
> >      1     5       0
> >      1     5       0
> >      2     5       1
> >      2     5       0
> >      2     5       0
> >      2     6       0
> >      2     6       0
> >      2     6       0
> >      3   10       1
> >      3   10       0
> >
> > When I run the program below, I receive the following error:
> > Error in df[, "ID"] : incorrect number of dimensions
> >
> > My code:
> > # Create data.frame
> > ID <- c(rep(1,10),rep(2,6),rep(3,2))
> > date <- c(rep(1,2),rep(2,2),rep(3,2),rep(4,2),rep(5,2),
> >            rep(5,3),rep(6,3),rep(10,2))
> > olddata <- data.frame(ID=ID,date=date)
> > class(olddata)
> > cat("This is the original data frame","\n")
> > print(olddata)
> >
> > # This function is supposed to identify the first row
> > # within each level of ID and, for the first row, set
> > # the variable first to 1, and for all rows other than
> > # the first row set first to 0.
> > mydoit <- function(df){
> >    value <- ifelse (first(df[,"ID"]),1,0)
> >    cat("value=",value,"\n")
> >    df[,"first"] <- value
> > }
> > newdata <- aggregate(olddata,list(olddata[,"ID"]),mydoit)
> >
> > Thank you,
> > John
> >
> > ______________________________________________
> > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide https://www.r-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> Hello,
>
> And here are two other solutions.
>
>
> olddata$first <- with(olddata, ave(seq_along(ID), ID, FUN = \(x) x == x[1L]))
>
> olddata$first <- c(1L, diff(olddata$ID))
>
>
> Of these two, diff is faster. But of all the solutions posted so far,
> Ben Bolker's is the fastest. And it can be made a little faster if
> as.integer substitutes for as.numeric.
> And dplyr::mutate now has a .by argument, which avoids the explicit call
> to group_by, with a performance gain.
>
>
> library(microbenchmark)
> library(dplyr)  # needed for %>%, group_by(), mutate() and row_number()
>
> mb <- microbenchmark(
>    ave = with(olddata, ave(seq_along(ID), ID, FUN = \(x) x == x[1L])),
>    dup_num = as.numeric(! duplicated(olddata$ID)),
>    dup_int = as.integer(! duplicated(olddata$ID)),
>    diff = c(1L, diff(olddata$ID)),
>    dplyr_grp = olddata %>% group_by(ID) %>%
>      mutate(first = as.integer(row_number() == 1)),
>    dplyr = olddata %>%
>      mutate(first = as.integer(row_number() == 1), .by = ID)
> )
> print(mb, order = "median")
>
>
>
> However, note that dplyr operates on entire data frames and is therefore
> expected to be slower when tested against instructions that process one
> column only.
>
>
> Hope this helps,
>
> Rui Barradas