[R] Identify first row of each ID within a data frame, create a variable first =1 for the first row and first=0 of all other rows

@vi@e@gross m@iii@g oii gm@ii@com
Sun Dec 1 07:18:09 CET 2024


I was wondering along similar lines, Bert.

One way to get help is to ask how to do some single step of a larger strategy. That can lead to answers that may not be as applicable to the scenario.

Another way would be to include a synopsis of what they are trying to do.

But, as John says he is trying to learn and improve his abilities, perhaps he is getting what he wants.
After watching some of the exchanges in multiple questions, many seem to revolve around a wish to deal with sorted grouped data. He seems to have looked at some base R methods as well as packages like dplyr using tibbles as well as another package and format.

What interests me from a dplyr perspective is how many small helper functions it makes available, some of which have been mentioned here. If you want to add a column that contains the same value for every row of a group, such as the minimum, the mean, the first value and many other things, it is very easily doable.
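
To make that concrete, here is a minimal sketch, assuming the dplyr package is installed and reusing the olddata frame from John's post quoted below. The new column names (min_date, mean_date, first_date) are only my illustrative choices, not anything John asked for.

```r
library(dplyr)

# Rebuild the example data from the original post
olddata <- data.frame(
  ID   = c(rep(1, 10), rep(2, 6), rep(3, 2)),
  date = c(rep(1, 2), rep(2, 2), rep(3, 2), rep(4, 2), rep(5, 2),
           rep(5, 3), rep(6, 3), rep(10, 2))
)

# Each summary is computed once per ID and recycled to every row of that group
with_summaries <- olddata %>%
  group_by(ID) %>%
  mutate(min_date   = min(date),
         mean_date  = mean(date),
         first_date = first(date)) %>%
  ungroup()

print(with_summaries)
```

The data frame keeps all 18 rows; only the new columns repeat within each ID.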

The latest request seems to be a bit different, as it wants a column with a 1 (presumably for TRUE) only for the first entry in the group. Again, this is fairly easy using one of several hooks, such as testing whether the row number within the group is 1. There are many variations on the answer supplied, depending on style and need, such as making a column that contains the row number and, in a later step, setting to zero those that are not a one.
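
One possible version of that, again only a sketch and assuming dplyr is installed, uses row_number() within each group. I also show a base R variant using duplicated(), which needs no packages. The names newdata and first follow John's post; first_base is just a name I made up for the comparison.

```r
library(dplyr)

# Rebuild the example data from the original post
olddata <- data.frame(
  ID   = c(rep(1, 10), rep(2, 6), rep(3, 2)),
  date = c(rep(1, 2), rep(2, 2), rep(3, 2), rep(4, 2), rep(5, 2),
           rep(5, 3), rep(6, 3), rep(10, 2))
)

# dplyr: within each ID, flag the row whose position in the group is 1
newdata <- olddata %>%
  group_by(ID) %>%
  mutate(first = as.integer(row_number() == 1)) %>%
  ungroup()

# Base R: duplicated() is FALSE the first time an ID value appears
first_base <- as.integer(!duplicated(olddata$ID))

print(newdata)
```

Both approaches mark rows 1, 11 and 17 of this example, which matches the newdata John listed.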

But sometimes you want to ask what the overall algorithm is. Do you need extra columns to then use for some purpose, or could that purpose have been served another way, such as doing some calculation only when the row number is one?
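
As a hedged illustration of skipping the extra column altogether: I do not know what John's downstream purpose is, so the "baseline row" and "span of dates" calculations here are hypothetical stand-ins, written in base R.

```r
# Rebuild the example data from the original post
olddata <- data.frame(
  ID   = c(rep(1, 10), rep(2, 6), rep(3, 2)),
  date = c(rep(1, 2), rep(2, 2), rep(3, 2), rep(4, 2), rep(5, 2),
           rep(5, 3), rep(6, 3), rep(10, 2))
)

# Logical index of the first row of each ID; nothing is added to the data frame
is_first <- !duplicated(olddata$ID)

# Hypothetical purpose 1: pull out each ID's first row directly
baseline <- olddata[is_first, ]

# Hypothetical purpose 2: a per-ID calculation with no flag at all,
# here the difference between the largest and smallest date in each group
span <- tapply(olddata$date, olddata$ID, function(d) max(d) - min(d))

print(baseline)
print(span)
```

If the flag was only ever going to be used as a filter or a grouping device, something along these lines may be all that is needed.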

As noted, R makes some operations fairly natural, in ways that differ from the "natural" way another program/environment does it. Sometimes a translation is not worth doing as compared to a reworked algorithm that makes good use of whichever package and related functionality you want to use. 

Assuming all these questions relate to the same project, I am not clear if and where the lookback at previous row/value fits.
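
For what it is worth, a lookback can produce the same flag, which may or may not be how the pieces connect. This base R sketch assumes the rows are already sorted so that each ID forms one contiguous block, as in John's example; prev_id is a helper name of my own.

```r
# Rebuild the example data from the original post
olddata <- data.frame(
  ID   = c(rep(1, 10), rep(2, 6), rep(3, 2)),
  date = c(rep(1, 2), rep(2, 2), rep(3, 2), rep(4, 2), rep(5, 2),
           rep(5, 3), rep(6, 3), rep(10, 2))
)

n <- nrow(olddata)

# ID of the previous row, with NA for the very first row (nothing to look back at)
prev_id <- c(NA, olddata$ID[-n])

# A row starts a group when there is no previous row or the ID has changed
olddata$first <- as.integer(is.na(prev_id) | olddata$ID != prev_id)

print(olddata)
```

Unlike duplicated(), this version would flag an ID again if it reappeared later in unsorted data, so the two are only interchangeable when the data are grouped contiguously.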

Of course, John may not be free to share more in public.

Anyone want to suggest a book or two on data processing of this sort using R that might illustrate with examples galore on how various problems are solved and then perhaps some will be similar enough ...

-----Original Message-----
From: R-help <r-help-bounces using r-project.org> On Behalf Of Bert Gunter
Sent: Saturday, November 30, 2024 11:34 PM
To: Sorkin, John <jsorkin using som.umaryland.edu>
Cc: r-help using r-project.org (r-help using r-project.org) <r-help using r-project.org>
Subject: Re: [R] Identify first row of each ID within a data frame, create a variable first =1 for the first row and first=0 of all other rows

May I ask *why* you want to do this?

It sounds to me like you're using SAS-like strategies for your
data analysis rather than R-like.

-- Bert

On Sat, Nov 30, 2024 at 6:27 PM Sorkin, John <jsorkin using som.umaryland.edu> wrote:
>
> Dear R help folks,
>
> First, my apologies for sending several related questions to the list server. I am trying to learn how to manipulate data in R . . . and am having difficulty getting my program to work. I greatly appreciate the help and support list members give!
>
> I am trying to write a program that will run through a data frame organized by ID and for the first line of each new group of data lines that has the same ID create a new variable first that will be 1 for the first line of the group and 0 for all other lines.
>
> e.g. if my original data is
>  olddata
>    ID date
>     1     1
>     1     1
>     1     2
>     1     2
>     1     3
>     1     3
>     1     4
>     1     4
>     1     5
>     1     5
>     2     5
>     2     5
>     2     5
>     2     6
>     2     6
>     2     6
>     3   10
>     3   10
>
> the new data will be
> newdata
>    ID date  first
>     1     1       1
>     1     1       0
>     1     2       0
>     1     2       0
>     1     3       0
>     1     3       0
>     1     4       0
>     1     4       0
>     1     5       0
>     1     5       0
>     2     5       1
>     2     5       0
>     2     5       0
>     2     6       0
>     2     6       0
>     2     6       0
>     3   10       1
>     3   10       0
>
> When I run the program below, I receive the following error:
> Error in df[, "ID"] : incorrect number of dimensions
>
> My code:
> # Create data.frame
> ID <- c(rep(1,10),rep(2,6),rep(3,2))
> date <- c(rep(1,2),rep(2,2),rep(3,2),rep(4,2),rep(5,2),
>           rep(5,3),rep(6,3),rep(10,2))
> olddata <- data.frame(ID=ID,date=date)
> class(olddata)
> cat("This is the original data frame","\n")
> print(olddata)
>
> # This function is supposed to identify the first row
> # within each level of ID and, for the first row, set
> # the variable first to 1, and for all rows other than
> # the first row set first to 0.
> mydoit <- function(df){
>   value <- ifelse (first(df[,"ID"]),1,0)
>   cat("value=",value,"\n")
>   df[,"first"] <- value
> }
> newdata <- aggregate(olddata,list(olddata[,"ID"]),mydoit)
>
> Thank you,
> John
>
>
> John David Sorkin M.D., Ph.D.
> Professor of Medicine, University of Maryland School of Medicine;
> Associate Director for Biostatistics and Informatics, Baltimore VA Medical Center Geriatrics Research, Education, and Clinical Center;
> PI Biostatistics and Informatics Core, University of Maryland School of Medicine Claude D. Pepper Older Americans Independence Center;
> Senior Statistician University of Maryland Center for Vascular Research;
>
> Division of Gerontology and Palliative Care,
> 10 North Greene Street
> GRECC (BT/18/GR)
> Baltimore, MD 21201-1524
> Cell phone 443-418-5382
>
>
>
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide https://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



