[R] merging and obtaining the nearest value

Sun Aug 19 23:58:04 CEST 2012

And the following, f4,  uses the same algorithm as f2 but codes
it somewhat more efficiently.  It uses the same closestValue()
function.
f4 <- function (A, B)  {
    A$TYPE <- as.factor(A$TYPE)
    uA <- levels(A$TYPE)
    As <- split(A$DATE, A$TYPE)
    Bs <- split(B$Special_Date, factor(B$TYPE, levels = uA))
    closest <- numeric(nrow(A))
    split(closest, A$TYPE) <- mapply(closestValue, As, Bs)
    A$Difference <- A$DATE - closest
    A
}

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com


> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
> Of William Dunlap
> Sent: Sunday, August 19, 2012 1:49 PM
> To: Francesco; r-help at r-project.org
> Subject: Re: [R] merging and obtaining the nearest value
> 
> The following, f2(A,B), should do well with lots of rows in A and B
> as long as the number of types is not huge.
> 
> f2 <- function(A, B) {
>     types <- as.character(unique(A$TYPE))
>     result <- numeric(nrow(A))
>     Bs <- split(B$Special_Date, B$TYPE)
>     for(type in types) {
>         w <- A$TYPE == type
>         # can omit the sort() below if you know that B$Special_Date is sorted.
>         result[w] <- closestValue(A$DATE[w], sort(Bs[[type]]))
>     }
>     A$Difference <- A$DATE - result
>     A
> }
> 
> closestValue <- function (x, vec)
> {
>     # for each value in x, find closest value in vec.
>     # Break ties by using highest.
>     # Assume vec is sorted.
>     intervalNo <- findInterval(x, vec)
>     lowerValue <- vec[pmax(1, intervalNo)]
>     upperValue <- vec[pmin(length(vec), intervalNo+1)]
>     ifelse(x - lowerValue < upperValue - x, lowerValue, upperValue)
> }
> 
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
> 
> 
> > -----Original Message-----
> > From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
> > Of William Dunlap
> > Sent: Sunday, August 19, 2012 9:43 AM
> > To: Francesco; r-help at r-project.org
> > Subject: Re: [R] merging and obtaining the nearest value
> >
> > How many different types are there?  Just a handful or many thousands?
> > For this sort of problem it is often handy to write a function which generates
> > datasets of the sort you are thinking of but parameterized by the
> > number of rows, levels, etc., so you can see how the execution time
> > varies with these things.
> >
> > If there are just a few types, try looping over types and using findInterval
> > to see where A$Date fits into the sequence of B$Special_Date.
> >
> >
> > Bill Dunlap
> > Spotfire, TIBCO Software
> > wdunlap tibco.com
> >
> >
> > > -----Original Message-----
> > > From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
> Behalf
> > > Of Francesco
> > > Sent: Sunday, August 19, 2012 4:01 AM
> > > To: r-help at r-project.org
> > > Subject: Re: [R] merging and obtaining the nearest value
> > >
> > > Dear Riu, Many thanks for your suggestion
> > >
> > > However these are just simplified examples... in reality the dataset A
> > > contains millions of observations and B several thousands of rows...
> > > Could I still use a modified form of your suggestion?
> > >
> > > Thanks
> > >
> > > On 19 August 2012 12:51, Rui Barradas <ruipbarradas at sapo.pt> wrote:
> > > > Hello,
> > > >
> > > > Try the following.
> > > >
> > > >
> > > > A <- read.table(text="
> > > >
> > > > TYPE   DATE
> > > > A            2
> > > > A            5
> > > > A            20
> > > > B            10
> > > > B            2
> > > > ", header = TRUE)
> > > >
> > > >
> > > > B <- read.table(text="
> > > >
> > > > TYPE  Special_Date
> > > > A              2
> > > > A              6
> > > > A              20
> > > > A              22
> > > > B              5
> > > > B              6
> > > > ", header = TRUE)
> > > >
> > > > result <- do.call( rbind, lapply(split(merge(A, B), list(m$DATE, m$TYPE)),
> > > > function(x){
> > > >         a <- abs(x$DATE - x$Special_Date)
> > > >         if(nrow(x)) x[which(min(a) == a), ] }) )
> > > > result$Difference <- result$DATE - result$Special_Date
> > > > result$Special_Date <- NULL
> > > > rownames(result) <- seq_len(nrow(result))
> > > > result
> > > >
> > > >
> > > > Also, it's a good practice to post data examples using dput(). For instance,
> > > >
> > > > dput(A)
> > > > structure(list(TYPE = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("A",
> > > > "B"), class = "factor"), DATE = c(2L, 5L, 20L, 10L, 2L)), .Names = c("TYPE",
> > > > "DATE"), class = "data.frame", row.names = c(NA, -5L))
> > > >
> > > > Now all we have to do is run the statement A <- structure(... etc...) to
> > > > have an exact copy of the data example.
> > > > Anyway, your example with input and the wanted result was very welcome.
> > > >
> > > > Hope this helps,
> > > >
> > > > Rui Barradas
> > > >
> > > > Em 19-08-2012 11:10, Francesco escreveu:
> > > >>
> > > >> Dear R-help
> > > >>
> > > >> Î would like to know if there is a short solution in R for this
> > > >> merging problem...
> > > >>
> > > >> Let say I have a dataset A as:
> > > >>
> > > >> TYPE   DATE
> > > >> A            2
> > > >> A            5
> > > >> A            20
> > > >> B            10
> > > >> B            2
> > > >>
> > > >> (there can be duplicates for the same type and date)
> > > >>
> > > >> and I have another dataset B as :
> > > >>
> > > >> TYPE  Special_Date
> > > >> A              2
> > > >> A              6
> > > >> A              20
> > > >> A              22
> > > >> B              5
> > > >> B              6
> > > >>
> > > >> The question is : I would like to obtain the difference between the
> > > >> date of each observation in A and the closest special date in B with
> > > >> the same type. In case of ties I would take the latest date of the
> > > >> two.
> > > >>
> > > >> For example I would obtain here
> > > >>
> > > >> TYPE   DATE   Difference
> > > >> A            2            0=2-2
> > > >> A            5            -1=5-6
> > > >> A            20            0=20-20
> > > >> B            10           +4=10-6
> > > >> B            2             -3=2-5
> > > >>
> > > >> Do you know how to (simply?) obtain this in R?
> > > >>
> > > >> Many thanks!
> > > >> Best Regards
> > > >>
> > > >> ______________________________________________
> > > >> R-help at r-project.org mailing list
> > > >> https://stat.ethz.ch/mailman/listinfo/r-help
> > > >> PLEASE do read the posting guide
> > > >> http://www.R-project.org/posting-guide.html
> > > >> and provide commented, minimal, self-contained, reproducible code.
> > > >
> > > >
> > >
> > > ______________________________________________
> > > R-help at r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > and provide commented, minimal, self-contained, reproducible code.
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.