[R] run a calculation function over time fields, ordered and grouped by variables
jdnewmil
jdnewmil at dcn.davis.ca.us
Tue May 26 02:04:17 CEST 2015
Another way:
#create four columns with route id, stop sequence interval time and
route start time
ssq <- c( 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 2, 3, 4, 5, 6, 7, 8 )
tint <- c( "00:00", "00:12", "00:03", "00:06", "00:09", "00:02", "00:04"
, "00:00", "00:08", "00:10", "00:10"
, "00:00", "00:02", "00:04", "00:08", "00:02", "00:01", "00:01"
)
tst <- c( rep( "18:20", 7 )
, rep( "10:50", 4 )
, rep( "16:15", 7 ) )
rtid <- c( rep( "a", 7 )
, rep( "b", 4 )
, rep( "c", 7 ) )
# Don't use cbind to make data frames... it usually ends up
# forcing all columns to be character or factors
# Also, avoid using "df" as a variable name... it is the name of
# a function in base R, so that gets confusing fast
DF <- data.frame( ssq, tint, tst, rtid, stringsAsFactors=FALSE )
DF
#correct data set should look like this
tarr <- c( "18:20", "18:32", "18:35", "18:41", "18:50", "18:52", "18:56"
, "10:50", "10:58", "11:08", "11:18"
, "16:15", "16:17", "16:21", "16:29", "16:31", "16:32", "16:33"
)
DF2 <- data.frame( DF, tarr, stringsAsFactors=FALSE )
DF2
library(dplyr)
DFs <- ( DF
%>% group_by( rtid )
%>% mutate( tarr
= as.character( as.POSIXct( tst, format="%H:%M", tz="GMT"
)
+ as.difftime(
cumsum(
as.numeric(
as.difftime( tint, format="%H:%M"
)
, units="mins"
)
)
, units="mins"
)
, format="%H:%M" )
)
%>% as.data.frame # removes grouping behavior from result
)
identical( DFs, DF2 )
On 2015-05-25 15:43, Jim Lemon wrote:
> Hi gavinr,
> Perhaps this will do what you want.
>
> add_HH_MM<-function(x) {
> t1bits<-strsplit(as.character(x$tst),":")
> t2bits<-strsplit(as.character(x$tint),":")
>
> hours<-as.numeric(lapply(t1bits,"[",1))+cumsum(as.numeric(lapply(t2bits,"[",1)))
>
> minutes<-as.numeric(lapply(t1bits,"[",2))+cumsum(as.numeric(lapply(t2bits,"[",2)))
> next_hour<-minutes > 59
> # adjust for running into the next hour
> minutes[next_hour]<-minutes[next_hour]-60
> hours[next_hour]<-hours[next_hour]+1
> # adjust for running into the next day
> hours[hours > 23]<-hours[hours > 23]-24
>
> return(paste(formatC(hours,width=2,flag=0),formatC(minutes,width=2,flag=0),sep=":"))
> }
>
> df$tarr<-unlist(by(df,df$rtid,add_HH_MM))
>
> Jim
>
>
> On Tue, May 26, 2015 at 5:28 AM, gavinr <g.rudge at bham.ac.uk> wrote:
>> I’ve got some transit data relating to bus stops for a GIS data set.
>> Each
>> row represents one stop on a route. For each record I have the start
>> time
>> of the route, a sequence in which a bus stops, the time the bus
>> arrives at
>> the first stop and the time taken to get to each of the stops from the
>> last
>> one in the sequence. Not all sequences of stops starts with the
>> number 1,
>> some may start with a higher number.
>> I need to make a new variable which has the time the bus arrives at
>> each
>> stop by using the start time from the stop with the lowest sequence
>> number,
>> to populate all of the arrival times for each stop in each route.
>>
>> I have a very simple example below with just three routes and a few
>> stops in
>> each. My actual data set has a few million rows. I've also created a
>> version of the data set I'm aiming to get.
>>
>> There are two problems here. Firstly getting the data into the
>> correct
>> format to do the calculations with
>> durations, and secondly running a function over the data set to obtain
>> the
>> times.
>> It is the durations that are critical not the date, so using the POSIX
>> methods doesn’t really seem appropriate here. Ultimately the times
>> are
>> going to be used in a route solver in an ArcSDE geodatabase. I tried
>> to use
>> strptime to format my times, but could not get them into a data.frame
>> as
>> presumably they are a list. In this example I’ve left them as
>> strings.
>>
>> Any help is much appreciated.
>>
>> #create four columns with route id, stop sequence interval time and
>> route
>> start time
>> ssq<-c(3,4,5,6,7,8,9,1,2,3,4,2,3,4,5,6,7,8)
>> tint<-c("00:00","00:12","00:03","00:06","00:09","00:02","00:04","00:00","00:08","00:10","00:10","00:00","00:02","00:04","00:08","00:02","00:01","00:01")
>> tst<-c(rep("18:20",7),rep("10:50",4),rep("16:15",7))
>> rtid<-c(rep("a",7),rep("b",4),rep("c",7))
>> df<-data.frame(cbind(ssq,tint,tst,rtid))
>> df
>>
>> #correct data set should look like this
>> tarr<-c("18:20","18:32","18:35","18:41","18:50","18:52","18:56","10:50","10:58","11:08","11:18","16:15","16:17","16:21","16:29","16:31","16:32","16:33")
>> df2<-cbind(df,tarr)
>> df2
>>
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://r.789695.n4.nabble.com/run-a-calculation-function-over-time-fields-ordered-and-grouped-by-variables-tp4707655.html
>> Sent from the R help mailing list archive at Nabble.com.
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
More information about the R-help
mailing list