[R] Bus stop sequence matching problem

Adam Lawrence alaw005 at gmail.com
Mon Sep 1 02:07:06 CEST 2014


Thank you for the help everyone, it has been a very helpful but steep
learning curve for me. I have ended up doing a loop as suggested by David
as I could understand this a bit better and seems can apply more generally.
I have set out my solution below in case that helps anyone.

I am interested though if the loop can be vectorized, as from my reading
this is the preferred (faster) R approach but I can't get my head around
how it works.

Regards
Adam

#
# I could either 1) loop through stops and check onoff, or 2) loop through
onoff and check stops.
#
# I chose option 2 because the first option might miss data if a stop
occurs more than once in the sequence and the
# first occurance of stop_onoff does not relate to the first occurance (e.g
for stop_sequence A,B,C,D,B,A if stop_onoff
# records are B, A then on first iteration option 1 will link the first
stop_sequence A to stop_onoff A and subsequent
# iterations will ignore stop_onoff B because occurs before A).
#
bus_load_profile <- function(stop_onoff, stop_sequence) {

  # Add additional columns to stop_sequence to hold matched on/off/load data
  stop_sequence <- cbind(stop_sequence, on=0, off=0, load=0)

  # Start stop_sequence index at 1 (i.e. over all records)
  idx_stop_sequence <- 1

  # Loop through stop_onoff records and assign to stop_sequence
  for (idx_stop_onoff in 1:nrow(stop_onoff)) {

    # Get stop_onoff ref
    ref_onoff <- as.character(stop_onoff$ref[idx_stop_onoff])

    # Get candidate refs from stop_sequence (i.e only those occuring after
any previous matches)
    ref_sequence_canidates <-
as.character(stop_sequence$ref[idx_stop_sequence:nrow(stop_sequence)])

    # Match the first occurance on ref_onoff in ref_sequence_canidates
    match_stop_sequence <- which(ref_onoff == ref_sequence_canidates)[1]

    # Update stop_sequence index to match current record
    idx_stop_sequence <- idx_stop_sequence + match_stop_sequence

    # Add stop_onoff data to the stop_sequence ()
    stop_sequence$on[idx_stop_sequence - 1] <- stop_onoff$on[idx_stop_onoff]
    stop_sequence$off[idx_stop_sequence - 1] <-
stop_onoff$off[idx_stop_onoff]
  }

  # Generate load profile
  stop_sequence$load <- cumsum(stop_sequence$on - stop_sequence$off)

  # return data.frame with load profile
  return(stop_sequence)

}


# Original test data:

# Test data, note use of data.frames because data is pulled from database,
noting real stop_equence data has
# additional columns. NB I have removed the seq column from stop_sequence
as the stops are by definition in sequence
stop_sequence <- data.frame(ref=c('A','B','C','D','B','A'))
stop_onoff <-
data.frame(ref=c('A','D','B','A'),on=c(5,0,10,0),off=c(0,2,2,6))
# Get bus load profile
bus_load_profile(stop_onoff, stop_sequence)

#Aditional test:
stop_sequence <- data.frame(ref=c('A','B','C','D','B','A'))
stop_onoff <- data.frame(ref=c('B','A'),on=c(10,0),off=c(4,6))
bus_load_profile(stop_onoff, stop_sequence)


On 31 August 2014 01:08, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:

> Try dtw.  First convert ref to numeric since dtw does not handle
> character input.  Then align using dtw and NA out repeated values in
> the alignment.  Finally zap ugly row names and calculate loading:
>
> library(dtw)
> s1 <- as.numeric(stop_sequence$ref)
> s2 <- as.numeric(factor(as.character(stop_onoff$ref),
> levels(stop_sequence$ref)))
> a <- dtw(s1, s2)
> DF <- cbind(stop_sequence,
>       stop_onoff[replace(a$index2, c(FALSE, diff(a$index2) == 0), NA),
> ])[-3]
> rownames(DF) <- NULL
> transform(DF, loading = cumsum(ifelse(is.na(on), 0, on)) -
>                         cumsum(ifelse(is.na(off), 0, off)))
>
> giving:
>
>   seq ref on off loading
> 1  10   A  5   0       5
> 2  20   B NA  NA       5
> 3  30   C NA  NA       5
> 4  40   D  0   2       3
> 5  50   B 10   2      11
> 6  60   A  0   6       5
>
> You will need to test this with more data and tweak it if necessary
> via the various dtw arguments.
>
>
> On Fri, Aug 29, 2014 at 8:46 PM, Adam Lawrence <alaw005 at gmail.com> wrote:
> > I am hoping someone can help me with a bus stop sequencing problem in R,
> > where I need to match counts of people getting on and off a bus to the
> > correct stop in the bus route stop sequence. I have tried looking
> > online/forums for sequence matching but seems to refer to numeric
> sequences
> > or DNA matching and over my head. I am after a simple example if anyone
> can
> > please help.
> >
> > I have two data series as per below (from database), that I want to
> > combine. In this example “stop_sequence” includes the equence (seq) of
> bus
> > stops and “stop_onoff” is a count of people getting on and off at certain
> > stops (there is no entry if noone gets on or off).
> >
> > stop_sequence <- data.frame(seq=c(10,20,30,40,50,60),
> > ref=c('A','B','C','D','B','A'))
> > ##   seq ref
> > ## 1  10   A
> > ## 2  20   B
> > ## 3  30   C
> > ## 4  40   D
> > ## 5  50   B
> > ## 6  60   A
> > stop_onoff <-
> > data.frame(ref=c('A','D','B','A'),on=c(5,0,10,0),off=c(0,2,2,6))
> > ##   ref on off
> > ## 1   A  5   0
> > ## 2   D  0   2
> > ## 3   B 10   2
> > ## 4   A  0   6
> >
> > I need to match the stop_onoff numbers in the right sto sequence, with
> the
> > correctly matched output as follows (load is a cumulative count of on and
> > off)
> >
> > desired_output <- data.frame(seq=c(10,20,30,40,50,60),
> > ref=c('A','B','C','D','B','A'),
> > on=c(5,'-','-',0,10,0),off=c(0,'-','-',2,2,6), load=c(5,0,0,3,11,5))
> > ##   seq ref on off load
> > ## 1  10   A  5   0    5
> > ## 2  20   B  -   -    0
> > ## 3  30   C  -   -    0
> > ## 4  40   D  0   2    3
> > ## 5  50   B 10   2   11
> > ## 6  60   A  0   6    5
> >
> > In this example the stop “B” is matched to the second stop “B” in the
> stop
> > sequence and not the first because the onoff data is after stop “D”.
> >
> > Any guidance much appreciated.
> >
> > Regards
> > Adam
> >
> >         [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
>
>
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
>

	[[alternative HTML version deleted]]



More information about the R-help mailing list