[R] merging and obtaining the nearest value

Mon Aug 20 00:53:39 CEST 2012

Hello,

You're right, your solution is much faster, but it doesn't remove 
duplicates.
When I ran f4() with larger datasets it poduced an error,

Error in findInterval(x, vec) : 'vec' must be sorted non-decreasingly

So here they all are.

f1 <- function(A, B){
     m <- merge(A, B)
     result <- do.call( rbind, lapply(split(m, list(m$DATE, m$TYPE)), 
function(x){
         if(nrow(x)){
               a <- abs(x$DATE - x$Special_Date)
               x[which.min(a), ] }}) )
     result$Difference <- result$DATE - result$Special_Date
     result$Special_Date <- NULL
     rownames(result) <- seq_len(nrow(result))
     result
}

closestValue <- function (x, vec)
{
     # for each value in x, find closest value in vec.
     # Break ties by using highest.
     # Assume vec is sorted.
     intervalNo <- findInterval(x, vec)
     lowerValue <- vec[pmax(1, intervalNo)]
     upperValue <- vec[pmin(length(vec), intervalNo+1)]
     ifelse(x - lowerValue < upperValue - x, lowerValue, upperValue)
}
f4 <- function (A, B)  {
     A$TYPE <- as.factor(A$TYPE)
     uA <- levels(A$TYPE)
     As <- split(A$DATE, A$TYPE)
     B <- B[order(B$TYPE, B$Special_Date), ]
     Bs <- split(B$Special_Date, factor(B$TYPE, levels = uA))
     closest <- numeric(nrow(A))
     split(closest, A$TYPE) <- mapply(closestValue, As, Bs)
     A$Difference <- A$DATE - closest
     A
}

# Test data, not many types
nA <- 1e3
nB <- 1e4
set.seed(1)
ta <- sample(LETTERS, nA, TRUE); da <- sample(1e2, nA, TRUE)
tb <- sample(LETTERS, nB, TRUE); db <- sample(nB, nB, TRUE)

aa <- data.frame(TYPE = ta, DATE = da)
bb <- data.frame(TYPE = tb, Special_Date = db)

t1 <- system.time(r1 <- f1(aa, bb))
t4 <- system.time(r4 <- f4(aa, bb))
rbind(t1 = t1, t4 = t4)

sum( duplicated(r4) )  # 165

Rui Barradas
Em 19-08-2012 22:58, William Dunlap escreveu:
> And the following, f4,  uses the same algorithm as f2 but codes
> it somewhat more efficiently.  It uses the same closestValue()
> function.
> f4 <- function (A, B)  {
>      A$TYPE <- as.factor(A$TYPE)
>      uA <- levels(A$TYPE)
>      As <- split(A$DATE, A$TYPE)
>      Bs <- split(B$Special_Date, factor(B$TYPE, levels = uA))
>      closest <- numeric(nrow(A))
>      split(closest, A$TYPE) <- mapply(closestValue, As, Bs)
>      A$Difference <- A$DATE - closest
>      A
> }
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
>
>
>> -----Original Message-----
>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
>> Of William Dunlap
>> Sent: Sunday, August 19, 2012 1:49 PM
>> To: Francesco; r-help at r-project.org
>> Subject: Re: [R] merging and obtaining the nearest value
>>
>> The following, f2(A,B), should do well with lots of rows in A and B
>> as long as the number of types is not huge.
>>
>> f2 <- function(A, B) {
>>      types <- as.character(unique(A$TYPE))
>>      result <- numeric(nrow(A))
>>      Bs <- split(B$Special_Date, B$TYPE)
>>      for(type in types) {
>>          w <- A$TYPE == type
>>          # can omit the sort() below if you know that B$Special_Date is sorted.
>>          result[w] <- closestValue(A$DATE[w], sort(Bs[[type]]))
>>      }
>>      A$Difference <- A$DATE - result
>>      A
>> }
>>
>> closestValue <- function (x, vec)
>> {
>>      # for each value in x, find closest value in vec.
>>      # Break ties by using highest.
>>      # Assume vec is sorted.
>>      intervalNo <- findInterval(x, vec)
>>      lowerValue <- vec[pmax(1, intervalNo)]
>>      upperValue <- vec[pmin(length(vec), intervalNo+1)]
>>      ifelse(x - lowerValue < upperValue - x, lowerValue, upperValue)
>> }
>>
>> Bill Dunlap
>> Spotfire, TIBCO Software
>> wdunlap tibco.com
>>
>>
>>> -----Original Message-----
>>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
>>> Of William Dunlap
>>> Sent: Sunday, August 19, 2012 9:43 AM
>>> To: Francesco; r-help at r-project.org
>>> Subject: Re: [R] merging and obtaining the nearest value
>>>
>>> How many different types are there?  Just a handful or many thousands?
>>> For this sort of problem it is often handy to write a function which generates
>>> datasets of the sort you are thinking of but parameterized by the
>>> number of rows, levels, etc., so you can see how the execution time
>>> varies with these things.
>>>
>>> If there are just a few types, try looping over types and using findInterval
>>> to see where A$Date fits into the sequence of B$Special_Date.
>>>
>>>
>>> Bill Dunlap
>>> Spotfire, TIBCO Software
>>> wdunlap tibco.com
>>>
>>>
>>>> -----Original Message-----
>>>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
>> Behalf
>>>> Of Francesco
>>>> Sent: Sunday, August 19, 2012 4:01 AM
>>>> To: r-help at r-project.org
>>>> Subject: Re: [R] merging and obtaining the nearest value
>>>>
>>>> Dear Riu, Many thanks for your suggestion
>>>>
>>>> However these are just simplified examples... in reality the dataset A
>>>> contains millions of observations and B several thousands of rows...
>>>> Could I still use a modified form of your suggestion?
>>>>
>>>> Thanks
>>>>
>>>> On 19 August 2012 12:51, Rui Barradas <ruipbarradas at sapo.pt> wrote:
>>>>> Hello,
>>>>>
>>>>> Try the following.
>>>>>
>>>>>
>>>>> A <- read.table(text="
>>>>>
>>>>> TYPE   DATE
>>>>> A            2
>>>>> A            5
>>>>> A            20
>>>>> B            10
>>>>> B            2
>>>>> ", header = TRUE)
>>>>>
>>>>>
>>>>> B <- read.table(text="
>>>>>
>>>>> TYPE  Special_Date
>>>>> A              2
>>>>> A              6
>>>>> A              20
>>>>> A              22
>>>>> B              5
>>>>> B              6
>>>>> ", header = TRUE)
>>>>>
>>>>> result <- do.call( rbind, lapply(split(merge(A, B), list(m$DATE, m$TYPE)),
>>>>> function(x){
>>>>>          a <- abs(x$DATE - x$Special_Date)
>>>>>          if(nrow(x)) x[which(min(a) == a), ] }) )
>>>>> result$Difference <- result$DATE - result$Special_Date
>>>>> result$Special_Date <- NULL
>>>>> rownames(result) <- seq_len(nrow(result))
>>>>> result
>>>>>
>>>>>
>>>>> Also, it's a good practice to post data examples using dput(). For instance,
>>>>>
>>>>> dput(A)
>>>>> structure(list(TYPE = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("A",
>>>>> "B"), class = "factor"), DATE = c(2L, 5L, 20L, 10L, 2L)), .Names = c("TYPE",
>>>>> "DATE"), class = "data.frame", row.names = c(NA, -5L))
>>>>>
>>>>> Now all we have to do is run the statement A <- structure(... etc...) to
>>>>> have an exact copy of the data example.
>>>>> Anyway, your example with input and the wanted result was very welcome.
>>>>>
>>>>> Hope this helps,
>>>>>
>>>>> Rui Barradas
>>>>>
>>>>> Em 19-08-2012 11:10, Francesco escreveu:
>>>>>> Dear R-help
>>>>>>
>>>>>> Î would like to know if there is a short solution in R for this
>>>>>> merging problem...
>>>>>>
>>>>>> Let say I have a dataset A as:
>>>>>>
>>>>>> TYPE   DATE
>>>>>> A            2
>>>>>> A            5
>>>>>> A            20
>>>>>> B            10
>>>>>> B            2
>>>>>>
>>>>>> (there can be duplicates for the same type and date)
>>>>>>
>>>>>> and I have another dataset B as :
>>>>>>
>>>>>> TYPE  Special_Date
>>>>>> A              2
>>>>>> A              6
>>>>>> A              20
>>>>>> A              22
>>>>>> B              5
>>>>>> B              6
>>>>>>
>>>>>> The question is : I would like to obtain the difference between the
>>>>>> date of each observation in A and the closest special date in B with
>>>>>> the same type. In case of ties I would take the latest date of the
>>>>>> two.
>>>>>>
>>>>>> For example I would obtain here
>>>>>>
>>>>>> TYPE   DATE   Difference
>>>>>> A            2            0=2-2
>>>>>> A            5            -1=5-6
>>>>>> A            20            0=20-20
>>>>>> B            10           +4=10-6
>>>>>> B            2             -3=2-5
>>>>>>
>>>>>> Do you know how to (simply?) obtain this in R?
>>>>>>
>>>>>> Many thanks!
>>>>>> Best Regards
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide
>>>>>> http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.