[R] how to find "first" or "last" record after sort in R

Avi Gross @v|gro@@ @end|ng |rom ver|zon@net
Thu Sep 9 21:43:21 CEST 2021


I am sure there are many good ways to do the task including taking the data.frame out into a list of data.frames and making the change to each by taking the nth row that matches nrow(it) and changing it and then recombining.

What follows are several attempts leading up to one at the end I find is probably the best choice.

I did the following sample demo using the dplyr package in the tidyverse but want to explain. My data was three small groups of 1 then 2 then 3. The second column in each had the same number as the group and it was unique for that group. If the last item can be a duplicate of another item, this method changes too much:

library(dplyr)

mydf <-
  tribble(
    ~grouper, ~val,
    1, 1,
    2, 1,
    2, 2,
    3, 1,
    3, 2,
    3, 3,
  )

mydf %>% group_by(grouper) %>% mutate(val2 = last(val), val=ifelse(val==val2,0,val))

The result is this:

> mydf %>% group_by(grouper) %>% mutate(val2 = last(val), val=ifelse(val==val2,0,val))
# A tibble: 6 x 3
# Groups:   grouper [3]
grouper   val  val2
<dbl> <dbl> <dbl>
  1       1     0     1
2       2     1     2
3       2     0     2
4       3     1     3
5       3     2     3
6       3     0     3

Now obviously this introduced an extra temporary row called val2, which is easily removed by many methods like piping to select(-val2) ...

But that is not needed as a shorter and more direct method is this:

mydf %>% 
  group_by(grouper) %>% 
  mutate(val = ifelse(val==last(val), 
                      0, 
                      val))

But some more research shows the helper functions that make this trivial.

Recall you wanted the last row in each group altered, I think to have an NA in column. I used 0 above but can use NA just as easily or any constant. The functions are:

n() gives the number of rows in the group.
row_number() gives the number of the current row as the functionality is being applied, within that group. The condition being offered is that n() == row_number() so this version surgically changes just the last rows no matter what other rows contain.

mydf %>% 
  group_by(grouper) %>% 
  mutate(val = ifelse(row_number() == n(), 
                      0, 
                      val))

If you have no interest in using a package like this, someone else will likely point you to a way. I suspect using something like split() to make a list of data.frames then applying some functionality to each smaller data.frame to get the result then recombining it back.


-----Original Message-----
From: R-help <r-help-bounces using r-project.org> On Behalf Of Kai Yang via R-help
Sent: Thursday, September 9, 2021 3:00 PM
To: R-help Mailing List <r-help using r-project.org>
Subject: [R] how to find "first" or "last" record after sort in R

Hello List,
Please look at the sample data frame below:

ID         date1              date2             date3
1    2015-10-08    2015-12-17    2015-07-23

2    2016-01-16    NA                 2015-10-08
3    2016-08-01    NA                 2017-01-10
3    2017-01-10    NA                 2016-01-16
4    2016-01-19    2016-02-24   2016-08-01
5    2016-03-01    2016-03-10   2016-01-19 This data frame was sorted by ID and date1. I need to set the column date3 as missing for the "last" record for each ID. In the sample data set, the ID 1, 2, 4 and 5 has one row only, so they can be consider as first and last records. the data3 can be set as missing. But the ID 3 has 2 rows. Since I sorted the data by ID and date1, the ID=3 and date1=2017-01-10 should be the last record only. I need to set date3=NA for this row only.

the question is, how can I identify the "last" record and set it as NA in date3 column.
Thank you,
Kai
	[[alternative HTML version deleted]]

______________________________________________
R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



More information about the R-help mailing list