[R] Bus stop sequence matching problem
Adam Lawrence
alaw005 at gmail.com
Mon Sep 1 02:07:06 CEST 2014
Thank you for the help everyone, it has been a very helpful but steep
learning curve for me. I have ended up doing a loop as suggested by David
as I could understand this a bit better and seems can apply more generally.
I have set out my solution below in case that helps anyone.
I am interested though if the loop can be vectorized, as from my reading
this is the preferred (faster) R approach but I can't get my head around
how it works.
Regards
Adam
#
# I could either 1) loop through stops and check onoff, or 2) loop through
onoff and check stops.
#
# I chose option 2 because the first option might miss data if a stop
occurs more than once in the sequence and the
# first occurance of stop_onoff does not relate to the first occurance (e.g
for stop_sequence A,B,C,D,B,A if stop_onoff
# records are B, A then on first iteration option 1 will link the first
stop_sequence A to stop_onoff A and subsequent
# iterations will ignore stop_onoff B because occurs before A).
#
bus_load_profile <- function(stop_onoff, stop_sequence) {
# Add additional columns to stop_sequence to hold matched on/off/load data
stop_sequence <- cbind(stop_sequence, on=0, off=0, load=0)
# Start stop_sequence index at 1 (i.e. over all records)
idx_stop_sequence <- 1
# Loop through stop_onoff records and assign to stop_sequence
for (idx_stop_onoff in 1:nrow(stop_onoff)) {
# Get stop_onoff ref
ref_onoff <- as.character(stop_onoff$ref[idx_stop_onoff])
# Get candidate refs from stop_sequence (i.e only those occuring after
any previous matches)
ref_sequence_canidates <-
as.character(stop_sequence$ref[idx_stop_sequence:nrow(stop_sequence)])
# Match the first occurance on ref_onoff in ref_sequence_canidates
match_stop_sequence <- which(ref_onoff == ref_sequence_canidates)[1]
# Update stop_sequence index to match current record
idx_stop_sequence <- idx_stop_sequence + match_stop_sequence
# Add stop_onoff data to the stop_sequence ()
stop_sequence$on[idx_stop_sequence - 1] <- stop_onoff$on[idx_stop_onoff]
stop_sequence$off[idx_stop_sequence - 1] <-
stop_onoff$off[idx_stop_onoff]
}
# Generate load profile
stop_sequence$load <- cumsum(stop_sequence$on - stop_sequence$off)
# return data.frame with load profile
return(stop_sequence)
}
# Original test data:
# Test data, note use of data.frames because data is pulled from database,
noting real stop_equence data has
# additional columns. NB I have removed the seq column from stop_sequence
as the stops are by definition in sequence
stop_sequence <- data.frame(ref=c('A','B','C','D','B','A'))
stop_onoff <-
data.frame(ref=c('A','D','B','A'),on=c(5,0,10,0),off=c(0,2,2,6))
# Get bus load profile
bus_load_profile(stop_onoff, stop_sequence)
#Aditional test:
stop_sequence <- data.frame(ref=c('A','B','C','D','B','A'))
stop_onoff <- data.frame(ref=c('B','A'),on=c(10,0),off=c(4,6))
bus_load_profile(stop_onoff, stop_sequence)
On 31 August 2014 01:08, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> Try dtw. First convert ref to numeric since dtw does not handle
> character input. Then align using dtw and NA out repeated values in
> the alignment. Finally zap ugly row names and calculate loading:
>
> library(dtw)
> s1 <- as.numeric(stop_sequence$ref)
> s2 <- as.numeric(factor(as.character(stop_onoff$ref),
> levels(stop_sequence$ref)))
> a <- dtw(s1, s2)
> DF <- cbind(stop_sequence,
> stop_onoff[replace(a$index2, c(FALSE, diff(a$index2) == 0), NA),
> ])[-3]
> rownames(DF) <- NULL
> transform(DF, loading = cumsum(ifelse(is.na(on), 0, on)) -
> cumsum(ifelse(is.na(off), 0, off)))
>
> giving:
>
> seq ref on off loading
> 1 10 A 5 0 5
> 2 20 B NA NA 5
> 3 30 C NA NA 5
> 4 40 D 0 2 3
> 5 50 B 10 2 11
> 6 60 A 0 6 5
>
> You will need to test this with more data and tweak it if necessary
> via the various dtw arguments.
>
>
> On Fri, Aug 29, 2014 at 8:46 PM, Adam Lawrence <alaw005 at gmail.com> wrote:
> > I am hoping someone can help me with a bus stop sequencing problem in R,
> > where I need to match counts of people getting on and off a bus to the
> > correct stop in the bus route stop sequence. I have tried looking
> > online/forums for sequence matching but seems to refer to numeric
> sequences
> > or DNA matching and over my head. I am after a simple example if anyone
> can
> > please help.
> >
> > I have two data series as per below (from database), that I want to
> > combine. In this example “stop_sequence” includes the equence (seq) of
> bus
> > stops and “stop_onoff” is a count of people getting on and off at certain
> > stops (there is no entry if noone gets on or off).
> >
> > stop_sequence <- data.frame(seq=c(10,20,30,40,50,60),
> > ref=c('A','B','C','D','B','A'))
> > ## seq ref
> > ## 1 10 A
> > ## 2 20 B
> > ## 3 30 C
> > ## 4 40 D
> > ## 5 50 B
> > ## 6 60 A
> > stop_onoff <-
> > data.frame(ref=c('A','D','B','A'),on=c(5,0,10,0),off=c(0,2,2,6))
> > ## ref on off
> > ## 1 A 5 0
> > ## 2 D 0 2
> > ## 3 B 10 2
> > ## 4 A 0 6
> >
> > I need to match the stop_onoff numbers in the right sto sequence, with
> the
> > correctly matched output as follows (load is a cumulative count of on and
> > off)
> >
> > desired_output <- data.frame(seq=c(10,20,30,40,50,60),
> > ref=c('A','B','C','D','B','A'),
> > on=c(5,'-','-',0,10,0),off=c(0,'-','-',2,2,6), load=c(5,0,0,3,11,5))
> > ## seq ref on off load
> > ## 1 10 A 5 0 5
> > ## 2 20 B - - 0
> > ## 3 30 C - - 0
> > ## 4 40 D 0 2 3
> > ## 5 50 B 10 2 11
> > ## 6 60 A 0 6 5
> >
> > In this example the stop “B” is matched to the second stop “B” in the
> stop
> > sequence and not the first because the onoff data is after stop “D”.
> >
> > Any guidance much appreciated.
> >
> > Regards
> > Adam
> >
> > [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
>
>
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
>
[[alternative HTML version deleted]]
More information about the R-help
mailing list