[R] Creating a new by variable in a dataframe

Sat Oct 20 18:28:42 CEST 2012

Hi Bill,

Thanks for the reply.
You are right, my line was unnecessarily complicated. Either of these:
d$flag<-unlist(lapply(split(d,d$date),function(x) x[3]==max(x[3])),use.names=FALSE)
#or
d$flag<-unlist(lapply(split(d,d$date),function(x) x[3]==max(x[3])))
should have done the same job.
str(d)
#'data.frame':    10 obs. of  4 variables:
# $ transaction: chr  "T01" "T02" "T03" "T04" ...
# $ date       : Date, format: "2012-10-19" "2012-10-19" ...
# $ time       : int  8 9 10 11 12 13 14 15 16 17
# $ flag       : logi  FALSE FALSE FALSE TRUE TRUE FALSE ...

I am getting an error message with:
d$flag2 <- unlist(lapply(unname(split(d[[3]], d$date), function(x)x==max(x))))
Error in match.fun(FUN) : argument "FUN" is missing, with no default
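
The error appears to come from a misplaced parenthesis: as written, the anonymous function is passed to unname() rather than to lapply(), so lapply() never receives FUN. A minimal sketch of what the call was probably meant to be, with the example data rebuilt so it runs on its own (sprintf() and rep() are only shortcuts for retyping the data):

```r
# Rebuild the example data from the thread (time stored as the integer hour).
d <- data.frame(
  stringsAsFactors = FALSE,
  transaction = sprintf("T%02d", 1:10),
  date = as.Date(rep(c("2012-10-19", "2012-10-22", "2012-10-23"), c(4, 1, 5))),
  time = 8:17
)

# Close unname() right after split(), so the function becomes lapply()'s FUN.
d$flag2 <- unlist(lapply(unname(split(d[[3]], d$date)), function(x) x == max(x)))
d$flag2
```

With the parenthesis moved, flag2 should match the flag column shown in the str() output above.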

A.K.

----- Original Message -----
From: William Dunlap <wdunlap at tibco.com>
To: arun <smartpink111 at yahoo.com>; Flavio Barros <flaviomargarito at gmail.com>
Cc: R help <r-help at r-project.org>; ramoss <ramine.mossadegh at finra.org>
Sent: Saturday, October 20, 2012 12:04 PM
Subject: RE: [R] Creating a new by variable in a dataframe

> d$flag<-unlist(rbind(lapply(split(d,d$date),function(x) x[3]==max(x[3]))))

I think that line is unnecessarily complicated. lapply() returns a list,
and rbind() applied to a single list argument, L, mainly adds dimensions
c(1, length(L)) to it (it also turns its names into column names). unlist()
doesn't care about the dimensions, so you may as well leave out the rbind().
The only difference in the results with and without calling rbind() is that
the rbind() version omits the names from flag. Use the more direct unname()
on split()'s output or unlist()'s output if that concerns you.
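
A small sketch of the point about rbind() and names, using a made-up list L (the list and its element names are assumptions for illustration only, not taken from the data in this thread):

```r
# A hypothetical named list, standing in for the output of lapply(split(...)).
L <- list(a = c(TRUE, FALSE), b = TRUE)

dim(L)         # a plain list has no dimensions
dim(rbind(L))  # rbind() wraps the list into a one-row list-matrix

# unlist() builds names from the list's names attribute ...
names(unlist(L))
# ... which the list-matrix no longer has (they moved into the dimnames),
names(unlist(rbind(L)))
# so dropping the names directly gives the same values more plainly.
identical(unlist(rbind(L)), unlist(unname(L)))
```

Passing use.names=FALSE to unlist() is another direct way to get the unnamed result.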

Also, if you are interested in saving time and memory when the input, d, is large,
you will be better off applying split() to just the column of the data.frame
that you want split instead of to the entire data.frame.
   d$flag2 <- unlist(lapply(unname(split(d[[3]], d$date), function(x)x==max(x))))
(I used d[[3]] instead of the more readable d$time to follow your original more closely.)

You ought to check that the data is sorted by date: otherwise these put the
flags in the wrong rows, because unlist() returns them in group order rather
than in the original row order.
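
To illustrate the sorting caveat, here is a rough sketch on a small made-up data frame u whose dates are interleaved (u and its values are hypothetical). It also shows one possible way around the problem with ave(), which writes each group's result back into the original row positions:

```r
# Hypothetical unsorted data: the two dates are interleaved.
u <- data.frame(
  date = as.Date(c("2012-10-23", "2012-10-19", "2012-10-23", "2012-10-19")),
  time = c(13L, 11L, 17L, 8L)
)

# split() groups the rows by date, so unlist() returns the flags in
# group order (all of 10-19 first, then 10-23), not in row order.
wrong <- unlist(lapply(unname(split(u$time, u$date)), function(x) x == max(x)))

# ave() applies the function per group and keeps the original row order.
# It returns the same type as its first argument (integer here), hence as.logical().
right <- as.logical(ave(u$time, u$date, FUN = function(x) x == max(x)))

cbind(u, wrong, right)
```

Sorting the data by date and time first, as in the SAS code, would also make the split()/unlist() version safe.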

What result do you want when there are several transactions at the last time
in the day?
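
The answer matters because the two approaches in this thread behave differently on ties. A short sketch on a made-up day with two transactions at the same final hour (the data frame tied is hypothetical; isLastInGroup() is the definition quoted further down in this thread):

```r
# Hypothetical day with two transactions sharing the latest time.
tied <- data.frame(
  date = as.Date(rep("2012-10-19", 3)),
  time = c(9L, 11L, 11L)
)

# Comparing against the maximum flags every row that ties for the last time.
is_max_time <- as.logical(ave(tied$time, tied$date, FUN = function(x) x == max(x)))

# Flagging by position marks only the final row of each date.
isLastInGroup <- function(...) {
  ave(logical(length(..1)), ..., FUN = function(x) seq_along(x) == length(x))
}
is_last_row <- isLastInGroup(tied$date)

cbind(tied, is_max_time, is_last_row)
```

The positional version matches what SAS's last.tdate does after sorting; the max() version flags both tied rows.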

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
> Of arun
> Sent: Friday, October 19, 2012 7:49 PM
> To: Flavio Barros
> Cc: R help; ramoss
> Subject: Re: [R] Creating a new by variable in a dataframe
> 
> 
> 
> Hi,
> Here is a way without using "ifelse()", on the same example dataset.
> d <- data.frame(stringsAsFactors = FALSE, transaction = c("T01", "T02",
> "T03", "T04", "T05", "T06", "T07", "T08", "T09", "T10"),date =
> c("2012-10-19", "2012-10-19", "2012-10-19", "2012-10-19", "2012-10-22",
> "2012-10-23", "2012-10-23", "2012-10-23", "2012-10-23", "2012-10-23"),time
> = c("08:00", "09:00", "10:00", "11:00", "12:00", "13:00", "14:00", "15:00",
> "16:00", "17:00"))
> 
> d$date <- as.Date(d$date,format="%Y-%m-%d")
> d$time<-strptime(d$time,format="%H:%M")$hour
> d$flag<-unlist(rbind(lapply(split(d,d$date),function(x) x[3]==max(x[3]))))
> d$datetime<-as.POSIXct(paste(d$date,d$time),format="%Y-%m-%d %H")
> d1<-d[,c(1,5,4)]
> d1
> #   transaction            datetime  flag
> #1          T01 2012-10-19 08:00:00 FALSE
> #2          T02 2012-10-19 09:00:00 FALSE
> #3          T03 2012-10-19 10:00:00 FALSE
> #4          T04 2012-10-19 11:00:00  TRUE
> #5          T05 2012-10-22 12:00:00  TRUE
> #6          T06 2012-10-23 13:00:00 FALSE
> #7          T07 2012-10-23 14:00:00 FALSE
> #8          T08 2012-10-23 15:00:00 FALSE
> #9          T09 2012-10-23 16:00:00 FALSE
> #10         T10 2012-10-23 17:00:00  TRUE
> 
> str(d1)
> #'data.frame':    10 obs. of  3 variables:
> # $ transaction: chr  "T01" "T02" "T03" "T04" ...
> # $ datetime   : POSIXct, format: "2012-10-19 08:00:00" "2012-10-19 09:00:00" ...
> # $ flag       : logi  FALSE FALSE FALSE TRUE TRUE FALSE ...
> 
> A.K.
> 
> 
> ----- Original Message -----
> From: Flavio Barros <flaviomargarito at gmail.com>
> To: William Dunlap <wdunlap at tibco.com>
> Cc: "r-help at r-project.org" <r-help at r-project.org>; ramoss
> <ramine.mossadegh at finra.org>
> Sent: Friday, October 19, 2012 4:24 PM
> Subject: Re: [R] Creating a new by variable in a dataframe
> 
> I think I have a better solution:
> 
> ## Example data.frame
> d <- data.frame(stringsAsFactors = FALSE, transaction = c("T01", "T02",
> "T03", "T04", "T05", "T06", "T07", "T08", "T09", "T10"),date =
> c("2012-10-19", "2012-10-19", "2012-10-19", "2012-10-19", "2012-10-22",
> "2012-10-23", "2012-10-23", "2012-10-23", "2012-10-23", "2012-10-23"),time
> = c("08:00", "09:00", "10:00", "11:00", "12:00", "13:00", "14:00", "15:00",
> "16:00", "17:00"))
> 
> ## Date transformation
> d$date <- as.Date(d$date)
> d$time <- strptime(d$time, format='%H')
> 
> library(reshape)
> 
> ## Create a factor to split the data
> fdate <- factor(format(d$date, '%D'))
> 
> ## Create a list of logical vectors, TRUE for the last transaction of each day
> ex <- sapply(split(d, fdate), function(x)
> ifelse(as.numeric(x[,'time'])==max(as.numeric(x[,'time'])),TRUE,FALSE))
> 
> ## Coerce to a logical vector
> flag <- unlist(rbind(ex))
> 
> ## With the transform function (it is in base R) we can add the flag column
> d <- transform(d, flag = flag)
> 
> On Fri, Oct 19, 2012 at 3:51 PM, William Dunlap <wdunlap at tibco.com> wrote:
> 
> > Suppose your data frame is
> > d <- data.frame(
> >      stringsAsFactors = FALSE,
> >      transaction = c("T01", "T02", "T03", "T04", "T05", "T06",
> >         "T07", "T08", "T09", "T10"),
> >      date = c("2012-10-19", "2012-10-19", "2012-10-19",
> >         "2012-10-19", "2012-10-22", "2012-10-23",
> >         "2012-10-23", "2012-10-23", "2012-10-23",
> >         "2012-10-23"),
> >      time = c("08:00", "09:00", "10:00", "11:00", "12:00",
> >         "13:00", "14:00", "15:00", "16:00", "17:00"
> >         ))
> > (Convert the date and time to your favorite classes, it doesn't matter
> > here.)
> >
> > A general way to say if an item is the last of its group is:
> >   isLastInGroup <- function(...)  ave(logical(length(..1)), ...,
> > FUN=function(x)seq_along(x)==length(x))
> >   is_last_of_dayA <- with(d, isLastInGroup(date))
> > If you know your data is sorted by date you could save a little time for
> > large datasets by using
> >   isLastInRun <- function(x) c(x[-1] != x[-length(x)], TRUE)
> >   is_last_of_dayB <- isLastInRun(d$date)
> > The above d is sorted by date so you get the same results for both:
> >   > cbind(d, is_last_of_dayA, is_last_of_dayB)
> >      transaction       date  time is_last_of_dayA is_last_of_dayB
> >   1          T01 2012-10-19 08:00           FALSE           FALSE
> >   2          T02 2012-10-19 09:00           FALSE           FALSE
> >   3          T03 2012-10-19 10:00           FALSE           FALSE
> >   4          T04 2012-10-19 11:00            TRUE            TRUE
> >   5          T05 2012-10-22 12:00            TRUE            TRUE
> >   6          T06 2012-10-23 13:00           FALSE           FALSE
> >   7          T07 2012-10-23 14:00           FALSE           FALSE
> >   8          T08 2012-10-23 15:00           FALSE           FALSE
> >   9          T09 2012-10-23 16:00           FALSE           FALSE
> >   10         T10 2012-10-23 17:00            TRUE            TRUE
> >
> >
> > Bill Dunlap
> > Spotfire, TIBCO Software
> > wdunlap tibco.com
> >
> >
> > > -----Original Message-----
> > > From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
> > On Behalf
> > > Of ramoss
> > > Sent: Friday, October 19, 2012 10:52 AM
> > > To: r-help at r-project.org
> > > Subject: [R] Creating a new by variable in a dataframe
> > >
> > > Hello,
> > >
> > > I have a dataframe w/ 3 variables of interest: transaction,date(tdate) &
> > > time(event_tim).
> > > How could I create a 4th variable (last_trans) that would flag the last
> > > transaction of the day for each day?
> > > In SAS I use:
> > > proc sort data=all6;
> > > by tdate event_tim;
> > > run;
> > >          /*Create last transaction flag per day*/
> > > data all6;
> > >   set all6;
> > >   by tdate event_tim;
> > >   last_trans=last.tdate;
> > >
> > > Thanks ahead for any suggestions.
> > >
> > >
> > >
> > > --
> > > > View this message in context:
> > > > http://r.789695.n4.nabble.com/Creating-a-new-by-variable-in-a-dataframe-tp4646782.html
> > > Sent from the R help mailing list archive at Nabble.com.
> > >
> > > ______________________________________________
> > > R-help at r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > and provide commented, minimal, self-contained, reproducible code.
> >
> >
> 
> 
> 
> --
> Att,
> 
> Flávio Barros
> 