[R] fuzzy merge

jim holtman jholtman at gmail.com
Wed Apr 9 13:38:12 CEST 2008


Here is one way of doing it.  Create a new dataframe with the values,
look for the breaks in times between the two sequences and then create
a results dataframe:

> # convert time to POSIXct
> d1$time1 <- as.POSIXct(paste(d1[[1]], d1[[2]]))
> d2$time2 <- as.POSIXct(paste(d2[[1]], d2[[2]]))
> d1$ID1 <- row.names(d1)
> d2$ID2 <- row.names(d2)
> d1
  ID1               time1 dt
1   1 2008-01-02 13:11:00 10
2   2 2008-01-02 14:20:00 20
3   3 2008-01-02 15:42:00 30
4   4 2008-01-02 16:45:00 40
5   5 2008-01-02 17:42:00 50
6   6 2008-01-02 20:40:00 60
> d2
    ID2               time2  d1
101 101 2008-01-02 14:29:00  75
102 102 2008-01-02 17:55:00 105
103 103 2008-02-07 20:01:00   8
> # create dataframe of times
> checkTime <- data.frame(index=c(seq(nrow(d1)), seq(nrow(d2))),
+     ID=c(d1$ID1, d2$ID2),
+     times=c(d1$time1, d2$time2), key=c(rep(0, nrow(d1)), rep(1, nrow(d2))))
> # sort by time
> checkTime <- checkTime[order(checkTime$times),]
> # find the breaks (transition from 0 -> 1)
> breaks <- which(diff(checkTime$key) == 1)
> # find out which ones are withing 15 minutes
> dif.15 <- which(difftime(checkTime$times[breaks + 1], checkTime$times[breaks], units='mins') <= 15)
> # print out the values
> data.frame(ID1=checkTime$ID[breaks[dif.15]], time1=checkTime$times[breaks[dif.15]],
+     time2=checkTime$times[breaks[dif.15] + 1])
  ID1               time1               time2
1   2 2008-01-02 14:20:00 2008-01-02 14:29:00
2   5 2008-01-02 17:42:00 2008-01-02 17:55:00
>
>
>


On Wed, Apr 9, 2008 at 4:53 AM, ravi <rv15i at yahoo.se> wrote:
> Hi,
> I would like to merge two data frames. It is just that I want the merging to be done with some kind of a fuzzy criterion. Let me explain.
> My first data frame looks like this :
>
> ID1                     time1                                dt
> 1                        2008-01-02 13:11                10
> 2                        2008-01-02 14:20                20
> 3                        2008-01-02 15:42                30
> 4                        2008-01-02 16:45                40
> 5                        2008-01-02 17:42                50
> 6                        2008-01-02 20:40                60
>
>
> My second data frame :
>
> ID2                        time2                                d1
> 101                        2008-01-02 14:29                75
> 102                        2008-01-02 17:55                105
> 103                        2008-02-07 20:01                8
>
>
>
> I want the merging to be done such that time2 is in the range between time1 and (time1+15 min).
> That is, my merged data frame should be :
>
> ID1                     time1                                    time2
> 2                        2008-01-02 14:20                2008-01-02 14:29
> 5                        2008-01-02 17:42                2008-01-02 17:55
>
>
> My data frames have thousands of records. If the two data frames are d1 and d2,
>
> d3<-merge(d1,d2,by.x=time1,by.y=time2)
> will work only for exact matching. One possible option is to match the times for the date and hour times only (by filtering away the minute data).
> But this is only a partial solution as I am not interested in data where the time difference is more than 15 minutes.
>
> How can I make the merge to work for fuzzy matching?
> Would it be easier to convert the times into data classes? Or, it better to treat them as strings and use regular expresssions for doing the matching?
>
> I would appreciate any help that I can get.
> Thanking You,
> Ravi
>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?



More information about the R-help mailing list