[R] Identify first row of each ID within a data frame, create a variable first =1 for the first row and first=0 of all other rows

Bert Gunter bgunter.4567 at gmail.com
Mon Dec 2 06:43:45 CET 2024


OK.

(Note: I am ccing this to the list so that others can correct any
mistakes, misunderstandings, or misstatements that I may make, and
also give you more and better advice about how to proceed. For
example, there may be environmental packages (see the Environmetrics
task view, https://CRAN.R-project.org/view=Environmetrics) that
already do everything you want from your data source that someone else
could tell you about. Why reinvent the wheel if you don't have to?)

Unfortunately, you failed to show us the critical information
necessary to give you a definitive answer: the structure of a single
record. However, I will wing it based on your description and assume
that each record contains something like the following: Latitude,
Longitude, Date, Time, pm25, and maybe other stuff. If this is the
case,

1) You do not need to sort anything. R has robust capabilities for
manipulating and computing with dates and times, though in this case
you only need dates. R knows all about how to order dates and times.

2) You do not have to physically group your records according to
geographic location. R knows how to extract and manipulate groups
without this.

3) And as all you need is a daily average, you do not need to worry
about when days begin or end.
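
For instance, assuming the date-time stamp is stored as (or can be
converted to) a POSIXct column, the calendar date and a day number
fall out directly. This is only a sketch: the column names lat_lon,
Time and pm25, the time zone, and the toy values are all assumptions
for illustration, not your actual data.

```r
## Toy data: two locations, 15-minute readings spanning two calendar days.
## All names and values here are hypothetical.
times <- seq(as.POSIXct("2024-11-01 00:00:00", tz = "UTC"),
             as.POSIXct("2024-11-02 23:45:00", tz = "UTC"),
             by = "15 min")
dat <- data.frame(
  lat_lon = rep(c("39.29_-76.61", "38.90_-77.04"), each = length(times)),
  Time    = rep(times, times = 2),
  pm25    = seq_len(2 * length(times))
)

## Calendar date of each reading; no sorting is required for this
dat$Date <- as.Date(dat$Time, tz = "UTC")

## Day number (1, 2, ...) counted from the earliest date in the data
dat$daynum <- as.integer(dat$Date - min(dat$Date)) + 1L

table(dat$daynum)
```

Because the day is computed from the time stamp itself, the row order
and the position of the 00:00 record within each location do not
matter.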

If this is a reasonably accurate interpretation of what needs to be
done, and your data are in a data frame called dat, then something
like the following will do it:
## First combine latitude and longitude into a single location key, loc.
## The separator keeps distinct coordinate pairs from pasting to the same string.
dat$loc <- with(dat, paste(Latitude, Longitude, sep = "_"))

## You could also convert the latitude and longitude into actual
## location names if you like; the "factor" data structure in R is one
## simple way to do this, for example.

## Then a one-liner in base R does what I think you want:
avgs <- aggregate(pm25 ~ loc + Date, FUN = mean, data = dat)
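
To make that concrete, here is a small self-contained sketch on
made-up data. The column names follow the assumed record structure
above and the values are invented. It shows the per-location daily
mean and, since the stated goal was a daily average across all
locations, the pooled version obtained by dropping loc from the
formula.

```r
## Toy data; all column names and values are made up for illustration
dat <- data.frame(
  Latitude  = c(39.29, 39.29, 39.29, 38.90, 38.90, 38.90),
  Longitude = c(-76.61, -76.61, -76.61, -77.04, -77.04, -77.04),
  Date      = as.Date(c("2024-11-01", "2024-11-01", "2024-11-02",
                        "2024-11-01", "2024-11-02", "2024-11-02")),
  pm25      = c(10, 20, 30, 30, 50, 70)
)
dat$loc <- with(dat, paste(Latitude, Longitude, sep = "_"))

## Daily mean for each location: one row per loc/Date combination
by_loc_day <- aggregate(pm25 ~ loc + Date, FUN = mean, data = dat)

## Daily mean pooled over all locations: one row per Date
by_day <- aggregate(pm25 ~ Date, FUN = mean, data = dat)

print(by_loc_day)
print(by_day)
```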

See ?aggregate for details. Note in particular that you can get
results simultaneously for several different pm's or whatevers. I
should also note that the so-called "Tidyverse" and "data.table"
groups of packages, and likely others, also can easily do these sorts
of things, though of course with different syntax, semantics, and
functionality, perhaps in ways that you might find simpler to master.
There are many good tutorials available for both base R and these
packages, but from my ignorant perspective, you need to first spend
some time to learn about R's basic data structures (factors, lists,
vectors, etc.) if you want to use R for serious data manipulation.
Finally, my best advice would be to forget about SAS if you wish to
use R. Trying to translate SAS paradigms into R is the devil's work.
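
As a sketch of the "several pollutants at once" remark: aggregate()
accepts cbind() on the left-hand side of the formula. The pm10 column
below is hypothetical, added only to have a second variable to
average, and the values are invented.

```r
## Toy data; pm10 is a hypothetical second pollutant column
dat <- data.frame(
  Date = as.Date(c("2024-11-01", "2024-11-01", "2024-11-02", "2024-11-02")),
  pm25 = c(10, 20, 30, 50),
  pm10 = c(15, 25, 40, 60)
)

## One call returns the daily mean of both columns
avgs2 <- aggregate(cbind(pm25, pm10) ~ Date, FUN = mean, data = dat)
print(avgs2)
```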

Cheers,
Bert






On Sun, Dec 1, 2024 at 7:29 PM Sorkin, John <jsorkin using som.umaryland.edu> wrote:
>
> Bert, Avi:
>
> I stand accused of "using SAS-like strategies for your data analysis rather than R-like [analyses]." I am guilty, but I beg the court's mercy ;). I have been a SAS programmer for more than 35 years. I have used R (and S-Plus) for about 20 years, but mostly for statistical analyses that required little or no data manipulation. I now need to use R for both statistical analyses and data manipulation and am trying to do in R what I can do easily in SAS.
>
> Here is a full description of my data manipulation problem:
> I have satellite data of airborne pollutants, obtained every 15 minutes over four days from approximately 500 geographic areas: 438 observations per area over the 4 days * 500 geographic locations = approximately 250,000 individual observations. (I say approximately because I have slightly less than four days of data and slightly fewer than 500 geographic areas. The exact number of observations is of minor importance.)
>
> Each of the 500 geographic areas has a fixed longitude and latitude. Each of the 438 observations for each of the approximately 500 geographic areas has a date-time stamp. I need to compute average 24-hour pollutant (e.g. pm2.5) exposure for each day, across ALL 500 geographic areas. To accomplish this, I need to
>
> 1) Group data from each of the 500 geographic areas together
>
> 2) Within each geographic area order the observations by day and time
>         1) and 2) are easily accomplished using the R order function,
>       mydata<-mydata[order(mydata$lat_lon,mydata$Time),]
>
>  or SAS proc sort:
>         proc sort data=mydata;
>            by lat_lon daytime;  /* lat long is a string giving latitude and longitude */
>         run;
>
> 3) For each geographic area, determine the records that mark the start and stop of each of the four days, let's say from 00:00 hrs to 23:59 hours, and create a variable, daynum, that indicates the day number (valid values 1 to 4). I do not know how to accomplish this in R. This is the part of my analysis that I asked the R community to help me write. I know how this can be easily done in SAS:
> * Arrange data date and time within each geographic region (i.e. lat_lon and daytime);
> proc sort data=mydata;
>   by lat_lon daytime;
> run;
>
> data mydata;
>   /* Within each geographic area, carry the preceding value of daynum forward to each new record */
>   retain daynum;
>   set mydata;
>      by lat_lon daytime;
>   /* Initialize daynum to 0 for the first record from a given geographic location */
>   if first.lat_lon then daynum=0;
>   /* Determine start of each day */
>   mytime = timepart(daytime);  /* Extract time from date-time value */
>   if mytime eq '00:00:00't then daynum=daynum+1; /* Increment daynum for each new day */
> run;
>
> 4) Get average value for a pollutant, pm25, by day across all 500 geographic areas. This is easily done in SAS using proc sort and proc means.
> proc sort data=mydata;
>   by daynum;
> run;
>
> * Get mean pm2.5 by day across all 500 geographic regions.;
> proc means data=mydata;
>   by daynum;
>   var pm25;
> run;
>
> If I can get step (3) above accomplished in R, I know how to accomplish step 4) in R using the by function:
> by(mydata[,"pm25"], mydata[,"daynum"],mean)
>
> I am trying to write the analysis described, and written in SAS, for 3) above in R. Please understand that I am fluent in SAS, and (except for straightforward analyses that require little or no data manipulation, where I am an intermediate programmer) I am an R tyro.
>
> Thank you for your help. My apologies for the long description of what I am trying to do. I sent this because you asked what I was trying to do and why I was doing it from the perspective of a SAS programmer rather than a matrix-based R programmer.
>
> John David Sorkin M.D., Ph.D.
> Professor of Medicine, University of Maryland School of Medicine;
> Associate Director for Biostatistics and Informatics, Baltimore VA Medical Center Geriatrics Research, Education, and Clinical Center;
> PI Biostatistics and Informatics Core, University of Maryland School of Medicine Claude D. Pepper Older Americans Independence Center;
> Senior Statistician University of Maryland Center for Vascular Research;
>
> Division of Gerontology and Palliative Care,
> 10 North Greene Street
> GRECC (BT/18/GR)
> Baltimore, MD 21201-1524
> Cell phone 443-418-5382
>
>
>
>
> ________________________________________
> From: Bert Gunter <bgunter.4567 using gmail.com>
> Sent: Saturday, November 30, 2024 11:33 PM
> To: Sorkin, John
> Cc: r-help using r-project.org (r-help using r-project.org)
> Subject: Re: [R] Identify first row of each ID within a data frame, create a variable first =1 for the first row and first=0 of all other rows
>
> May I ask *why* you want to do this?
>
> It sounds to me like you're using SAS-like strategies for your
> data analysis rather than R-like.
>
> -- Bert
>
> On Sat, Nov 30, 2024 at 6:27 PM Sorkin, John <jsorkin using som.umaryland.edu> wrote:
> >
> > Dear R help folks,
> >
> > First, my apologies for sending several related questions to the list server. I am trying to learn how to manipulate data in R . . . and am having difficulty getting my program to work. I greatly appreciate the help and support list members give!
> >
> > I am trying to write a program that will run through a data frame organized by ID and for the first line of each new group of data lines that has the same ID create a new variable first that will be 1 for the first line of the group and 0 for all other lines.
> >
> > e.g. if my original data is
> >  olddata
> >    ID date
> >     1     1
> >     1     1
> >     1     2
> >     1     2
> >     1     3
> >     1     3
> >     1     4
> >     1     4
> >     1     5
> >     1     5
> >     2     5
> >     2     5
> >     2     5
> >     2     6
> >     2     6
> >     2     6
> >     3   10
> >     3   10
> >
> > the new data will be
> > newdata
> >    ID date  first
> >     1     1       1
> >     1     1       0
> >     1     2       0
> >     1     2       0
> >     1     3       0
> >     1     3       0
> >     1     4       0
> >     1     4       0
> >     1     5       0
> >     1     5       0
> >     2     5       1
> >     2     5       0
> >     2     5       0
> >     2     6       0
> >     2     6       0
> >     2     6       0
> >     3   10       1
> >     3   10       0
> >
> > When I run the program below, I receive the following error:
> > Error in df[, "ID"] : incorrect number of dimensions
> >
> > My code:
> > # Create data.frame
> > ID <- c(rep(1,10),rep(2,6),rep(3,2))
> > date <- c(rep(1,2),rep(2,2),rep(3,2),rep(4,2),rep(5,2),
> >           rep(5,3),rep(6,3),rep(10,2))
> > olddata <- data.frame(ID=ID,date=date)
> > class(olddata)
> > cat("This is the original data frame","\n")
> > print(olddata)
> >
> > # This function is supposed to identify the first row
> > # within each level of ID and, for the first row, set
> > # the variable first to 1, and for all rows other than
> > # the first row set first to 0.
> > mydoit <- function(df){
> >   value <- ifelse (first(df[,"ID"]),1,0)
> >   cat("value=",value,"\n")
> >   df[,"first"] <- value
> > }
> > newdata <- aggregate(olddata,list(olddata[,"ID"]),mydoit)
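
A note on the code above, offered as a sketch rather than the only
way to do this. aggregate() on a data frame hands FUN each column of
each group as a plain vector, not as a data frame, so df[, "ID"]
inside mydoit is a two-subscript index into a vector; that is what
"incorrect number of dimensions" is reporting. Also, first() is not a
base R function. For the first-row flag itself, no grouping function
is needed: duplicated() marks every occurrence of a value after its
first, so negating it flags the first row of each ID.

```r
## Same example data as in the question
ID <- c(rep(1, 10), rep(2, 6), rep(3, 2))
date <- c(rep(1, 2), rep(2, 2), rep(3, 2), rep(4, 2), rep(5, 2),
          rep(5, 3), rep(6, 3), rep(10, 2))
olddata <- data.frame(ID = ID, date = date)

## duplicated() is FALSE the first time a value appears and TRUE afterwards,
## so its negation is TRUE only on the first row of each ID
newdata <- olddata
newdata$first <- as.integer(!duplicated(newdata$ID))

print(newdata)
```

This relies only on the order in which rows appear, so "first" means
first in the current row order; sort beforehand if a particular order
is intended.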
> >
> > Thank you,
> > John


