[R] Identify first row of each ID within a data frame, create a variable first =1 for the first row and first=0 of all other rows

Mon Dec 2 06:39:28 CET 2024

John,

Thanks for enlightening us so we better understand.

I won't argue with your wish to learn to do things in base R first. I started that way myself and found that many of the commands did not fit easily into a single worldview. Many functions I read about were promptly forgotten, especially those without good documentation or enough examples of real-world usage.

This is why some packages that came later are important: they generally try to provide a somewhat consistent set of tools that are often also faster and more flexible. Packages are usually created to meet real needs, and I note that some of those needs may be subtle. Original R was often inconsistent in the order of function arguments, while dplyr and other tidyverse functions try as much as possible to make the first argument the one normally passed through a pipeline. R fairly recently added a native pipe operator (|>) that may be faster than the magrittr pipe but in some ways makes some functionality harder. The rest of R has not really been changed to make using functions in pipelines easy.

You seem to have also looked at data.table and, given that you may have large amounts of data, its design may also be beneficial to you.

But as I do not want to relearn lots of R functions I never use, I will bow out from further discussion as what I would offer these days would probably not be what you want.

My personal opinion is that proper use of R can actually be far easier and more flexible than what you had with proprietary software, which may largely consist of frequently used canned reports.

I do want to point out a few things to consider.

When you go grouping, you may want to consider grouping (as well as sorting) by multiple variables. You mention a variable with about 500 possibilities and then another variable with an ID number, but did not say whether the ID number was unique across them all.
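To illustrate the point about grouping by more than one variable, here is a minimal base R sketch. The column name site and the toy values are assumptions made up for illustration; they stand in for the variable with about 500 levels. It relies on duplicated() accepting a data frame and comparing whole rows of the selected columns.

```r
# Hypothetical data: 'site' is an assumed name, not taken from the thread.
df <- data.frame(
  site = c("A", "A", "A", "B", "B"),
  ID   = c(1, 1, 2, 1, 1)
)

# Flag the first row of each (site, ID) combination rather than of each ID.
df$first <- as.integer(!duplicated(df[c("site", "ID")]))
print(df)
```

If ID alone were used here, the first row of site B would not be flagged, because ID 1 has already appeared under site A.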

And, I want to note you may want to also look into testing the sanity of your data. That is a wide area too. Things like duplicates, for example.

I do not know how many steps you can handle but there are sometimes designs that make an algorithm work differently.

Consider your request to find the first row in each grouping and add a column with a 1, and 0 for all others. If that is what you need, fine.

But what if, instead, you just added a row number within each group? Some rows would have a 1, and some may have a 2, 3, or 4.

When you want to do something to just the rows with a 1, you can filter out a subset of the data easily enough or apply a command only to those rows. But if you want to test whether any entry has more than 4 rows, this could allow you to detect an error. Other ideas might be possible if that is how the data was saved.
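A rough base R sketch of the row-number idea, using the ID vector from the original question. The column name rownum and the threshold of 4 are only illustrative choices, and ave() with seq_along is one of several ways to number rows within a group.

```r
# ID values as in the original question
ID <- c(rep(1, 10), rep(2, 6), rep(3, 2))
olddata <- data.frame(ID = ID)

# Number the rows within each ID: 1, 2, 3, ... restarting at every new ID
olddata$rownum <- ave(seq_along(olddata$ID), olddata$ID, FUN = seq_along)

# The first row of each ID is simply the subset where rownum is 1
first_rows <- olddata[olddata$rownum == 1, ]

# Sanity check: which IDs have more than 4 rows?
too_many <- unique(olddata$ID[olddata$rownum > 4])
print(too_many)
```

The 0/1 indicator can still be derived when needed, for example with as.integer(olddata$rownum == 1).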

And, if it really is a 0/1 choice, fine, but consider the advantages or disadvantages of what you save in the new column. Storing a numeric or an int can take up space when storing a Boolean or TRUE/FALSE is what you need. R gives you lots of flexibility which perhaps you did not have to think about before.
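On the storage question, a quick check with object.size() may be worth running on your own machine. This is only a sketch: my understanding is that R stores logical and integer values in four bytes each and doubles in eight, so the space saving comes mainly from avoiding a double, while a logical column mostly documents the yes/no intent. The vector length n is arbitrary.

```r
n <- 1e6  # arbitrary length, for illustration only

size_lgl <- object.size(logical(n))  # TRUE/FALSE storage
size_int <- object.size(integer(n))  # what as.integer() gives
size_dbl <- object.size(numeric(n))  # what as.numeric() gives

print(c(logical = size_lgl, integer = size_int, double = size_dbl))
```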

All I know is that much of what you want to do is easily done with a pipeline or two in dplyr, which specializes in grouped analysis, summary reports, and so on. It may not match how you think. But this is your task, and you choose what makes sense.

-----Original Message-----
From: Sorkin, John <jsorkin using som.umaryland.edu> 
Sent: Sunday, December 1, 2024 11:19 PM
To: Bert Gunter <bgunter.4567 using gmail.com>; Rui Barradas <ruipbarradas using sapo.pt>; twoolman using ontargettek.com; tebert using ufl.edu; Bert Gunter <bgunter.4567 using gmail.com>; jdnewmil using dcn.davis.ca.us; avi.e.gross using gmail.com; therneau using mayo.edu; dwinsemius using comcast.net; tebert using ufl.edu; rmh using temple.edu; ken.knoblauch using inserm.fr; boris.steipe using utoronto.ca
Cc: r-help using r-project.org (r-help using r-project.org) <r-help using r-project.org>; kimmo.elo using uef.fi
Subject: Re: [R] Identify first row of each ID within a data frame, create a variable first =1 for the first row and first=0 of all other rows

Dear Colleagues,

I am grateful to all of you for helping me with my question, how to write R code that will identify the first row of each ID within a data frame, create a variable first=1 for the first row and first=0 for all repeats of the ID.

WOW!!!
I just saw Boris Steipe's answer to my question:
olddata$first <- as.numeric(! duplicated(olddata$ID))
The solution is elegant, short, easy to understand, and it uses base R! All important characteristics of a good solution, at least for me. While I want to learn solutions using packages that extend base R, I believe that a good programmer learns how to do something using the base language and, once that is learned, explores ways to solve a programming problem using advanced packages.

Each and every one of you (I hope I did not miss anyone in my list of email addresses) took the time to read my emails and respond to me. Your collective help is invaluable, and I am in your collective debt.

Many, many thanks,
John

John David Sorkin M.D., Ph.D.
Professor of Medicine, University of Maryland School of Medicine;
Associate Director for Biostatistics and Informatics, Baltimore VA Medical Center Geriatrics Research, Education, and Clinical Center;
PI Biostatistics and Informatics Core, University of Maryland School of Medicine Claude D. Pepper Older Americans Independence Center;
Senior Statistician University of Maryland Center for Vascular Research;

Division of Gerontology and Palliative Care,
10 North Greene Street
GRECC (BT/18/GR)
Baltimore, MD 21201-1524
Cell phone 443-418-5382

________________________________________
From: Bert Gunter <bgunter.4567 using gmail.com>
Sent: Sunday, December 1, 2024 11:30 AM
To: Rui Barradas
Cc: Sorkin, John; r-help using r-project.org (r-help using r-project.org)
Subject: Re: [R] Identify first row of each ID within a data frame, create a variable first =1 for the first row and first=0 of all other rows

Rui:
"Of these two, diff is faster. But of all the solutions posted so far,
Ben Bolker's is the fastest."

But the explicit version of diff is still considerably faster:

> D <- c(rep(1,10),rep(2,6),rep(3,2))

> microbenchmark(c(1L,diff(D)), times = 1000L)
Unit: microseconds
           expr   min    lq    mean median    uq    max neval
 c(1L, diff(D)) 3.075 3.198 3.34396   3.28 3.362 29.684  1000

> microbenchmark( as.integer(!duplicated(D)), times =1000L)
Unit: microseconds
                       expr   min    lq     mean median   uq  max neval
 as.integer(!duplicated(D)) 1.476 1.558 1.644264  1.599 1.64 16.4  1000

> microbenchmark( D - c(0L, D[-length(D)]), times = 1000L)
Unit: nanoseconds  ## note that unit is nanoseconds not microseconds
                     expr min  lq    mean median  uq  max neval
 D - c(0L, D[-length(D)]) 369 410 489.335    492 533 9840  1000
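A caution when comparing these timings: the three expressions give the same 0/1 result on D only because the IDs there are sorted and increase by exactly 1. The small sketch below uses a made-up vector D2 (not from the thread) to show how they can diverge, together with one possible base R way to flag a change from the previous row.

```r
# Made-up IDs that jump by more than 1 and also decrease
D2 <- c(1, 1, 4, 4, 2, 2)

by_diff <- c(1L, diff(D2))              # size of each jump, not a 0/1 flag
by_dup  <- as.integer(!duplicated(D2))  # first occurrence of each value

# Flag rows whose ID differs from the previous row's ID
by_change <- as.integer(c(TRUE, D2[-1] != D2[-length(D2)]))

print(rbind(by_diff, by_dup, by_change))
```

Note that by_dup and by_change answer slightly different questions: if an ID reappears later in unsorted data, !duplicated() will not flag it again, while the change-based version will.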

Cheers,
Bert

On Sat, Nov 30, 2024 at 11:05 PM Rui Barradas <ruipbarradas using sapo.pt> wrote:
>
> At 02:27 on 01/12/2024, Sorkin, John wrote:
> > Dear R help folks,
> >
> > First, my apologies for sending several related questions to the list server. I am trying to learn how to manipulate data in R . . . and am having difficulty getting my program to work. I greatly appreciate the help and support list members give!
> >
> > I am trying to write a program that will run through a data frame organized by ID and for the first line of each new group of data lines that has the same ID create a new variable first that will be 1 for the first line of the group and 0 for all other lines.
> >
> > e.g. if my original data is
> >   olddata
> >     ID date
> >      1     1
> >      1     1
> >      1     2
> >      1     2
> >      1     3
> >      1     3
> >      1     4
> >      1     4
> >      1     5
> >      1     5
> >      2     5
> >      2     5
> >      2     5
> >      2     6
> >      2     6
> >      2     6
> >      3   10
> >      3   10
> >
> > the new data will be
> > newdata
> >     ID date  first
> >      1     1       1
> >      1     1       0
> >      1     2       0
> >      1     2       0
> >      1     3       0
> >      1     3       0
> >      1     4       0
> >      1     4       0
> >      1     5       0
> >      1     5       0
> >      2     5       1
> >      2     5       0
> >      2     5       0
> >      2     6       0
> >      2     6       0
> >      2     6       0
> >      3   10       1
> >      3   10       0
> >
> > When I run the program below, I receive the following error:
> > Error in df[, "ID"] : incorrect number of dimensions
> >
> > My code:
> > # Create data.frame
> > ID <- c(rep(1,10),rep(2,6),rep(3,2))
> > date <- c(rep(1,2),rep(2,2),rep(3,2),rep(4,2),rep(5,2),
> >            rep(5,3),rep(6,3),rep(10,2))
> > olddata <- data.frame(ID=ID,date=date)
> > class(olddata)
> > cat("This is the original data frame","\n")
> > print(olddata)
> >
> > # This function is supposed to identify the first row
> > # within each level of ID and, for the first row, set
> > # the variable first to 1, and for all rows other than
> > # the first row set first to 0.
> > mydoit <- function(df){
> >    value <- ifelse (first(df[,"ID"]),1,0)
> >    cat("value=",value,"\n")
> >    df[,"first"] <- value
> > }
> > newdata <- aggregate(olddata,list(olddata[,"ID"]),mydoit)
> >
> > Thank you,
> > John
> >
> > ______________________________________________
> > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide https://www.r-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> Hello,
>
> And here are two other solutions.
>
>
> olddata$first <- with(olddata, ave(seq_along(ID), ID, FUN = \(x) x == x[1L]))
>
> olddata$first <- c(1L, diff(olddata$ID))
>
>
> Of these two, diff is faster. But of all the solutions posted so far,
> Ben Bolker's is the fastest. And it can be made a little faster if
> as.integer substitutes for as.numeric.
> And dplyr::mutate now has a .by argument, which avoids the explicit call
> to group_by, with a performance gain.
>
>
> library(dplyr)            # provides %>%, group_by(), mutate(), row_number()
> library(microbenchmark)
>
> mb <- microbenchmark(
>    ave = with(olddata, ave(seq_along(ID), ID, FUN = \(x) x == x[1L])),
>    dup_num = as.numeric(! duplicated(olddata$ID)),
>    dup_int = as.integer(! duplicated(olddata$ID)),
>    diff = c(1L, diff(olddata$ID)),
>    dplyr_grp = olddata %>% group_by(ID) %>%
>      mutate(first = as.integer(row_number() == 1)),
>    dplyr = olddata %>%
>      mutate(first = as.integer(row_number() == 1), .by = ID)
> )
> print(mb, order = "median")
>
>
>
> However, note that dplyr operates on entire data frames and is therefore
> expected to be slower when tested against instructions that process one
> column only.
>
>
> Hope this helps,
>
> Rui Barradas
>
>