[R] merging and obtaining the nearest value
Rui Barradas
ruipbarradas at sapo.pt
Mon Aug 20 00:53:39 CEST 2012
Hello,
You're right, your solution is much faster, but it doesn't remove
duplicates.
When I ran f4() with larger datasets it poduced an error,
Error in findInterval(x, vec) : 'vec' must be sorted non-decreasingly
So here they all are.
f1 <- function(A, B){
m <- merge(A, B)
result <- do.call( rbind, lapply(split(m, list(m$DATE, m$TYPE)),
function(x){
if(nrow(x)){
a <- abs(x$DATE - x$Special_Date)
x[which.min(a), ] }}) )
result$Difference <- result$DATE - result$Special_Date
result$Special_Date <- NULL
rownames(result) <- seq_len(nrow(result))
result
}
closestValue <- function (x, vec)
{
# for each value in x, find closest value in vec.
# Break ties by using highest.
# Assume vec is sorted.
intervalNo <- findInterval(x, vec)
lowerValue <- vec[pmax(1, intervalNo)]
upperValue <- vec[pmin(length(vec), intervalNo+1)]
ifelse(x - lowerValue < upperValue - x, lowerValue, upperValue)
}
f4 <- function (A, B) {
A$TYPE <- as.factor(A$TYPE)
uA <- levels(A$TYPE)
As <- split(A$DATE, A$TYPE)
B <- B[order(B$TYPE, B$Special_Date), ]
Bs <- split(B$Special_Date, factor(B$TYPE, levels = uA))
closest <- numeric(nrow(A))
split(closest, A$TYPE) <- mapply(closestValue, As, Bs)
A$Difference <- A$DATE - closest
A
}
# Test data, not many types
nA <- 1e3
nB <- 1e4
set.seed(1)
ta <- sample(LETTERS, nA, TRUE); da <- sample(1e2, nA, TRUE)
tb <- sample(LETTERS, nB, TRUE); db <- sample(nB, nB, TRUE)
aa <- data.frame(TYPE = ta, DATE = da)
bb <- data.frame(TYPE = tb, Special_Date = db)
t1 <- system.time(r1 <- f1(aa, bb))
t4 <- system.time(r4 <- f4(aa, bb))
rbind(t1 = t1, t4 = t4)
sum( duplicated(r4) ) # 165
Rui Barradas
Em 19-08-2012 22:58, William Dunlap escreveu:
> And the following, f4, uses the same algorithm as f2 but codes
> it somewhat more efficiently. It uses the same closestValue()
> function.
> f4 <- function (A, B) {
> A$TYPE <- as.factor(A$TYPE)
> uA <- levels(A$TYPE)
> As <- split(A$DATE, A$TYPE)
> Bs <- split(B$Special_Date, factor(B$TYPE, levels = uA))
> closest <- numeric(nrow(A))
> split(closest, A$TYPE) <- mapply(closestValue, As, Bs)
> A$Difference <- A$DATE - closest
> A
> }
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
>
>
>> -----Original Message-----
>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
>> Of William Dunlap
>> Sent: Sunday, August 19, 2012 1:49 PM
>> To: Francesco; r-help at r-project.org
>> Subject: Re: [R] merging and obtaining the nearest value
>>
>> The following, f2(A,B), should do well with lots of rows in A and B
>> as long as the number of types is not huge.
>>
>> f2 <- function(A, B) {
>> types <- as.character(unique(A$TYPE))
>> result <- numeric(nrow(A))
>> Bs <- split(B$Special_Date, B$TYPE)
>> for(type in types) {
>> w <- A$TYPE == type
>> # can omit the sort() below if you know that B$Special_Date is sorted.
>> result[w] <- closestValue(A$DATE[w], sort(Bs[[type]]))
>> }
>> A$Difference <- A$DATE - result
>> A
>> }
>>
>> closestValue <- function (x, vec)
>> {
>> # for each value in x, find closest value in vec.
>> # Break ties by using highest.
>> # Assume vec is sorted.
>> intervalNo <- findInterval(x, vec)
>> lowerValue <- vec[pmax(1, intervalNo)]
>> upperValue <- vec[pmin(length(vec), intervalNo+1)]
>> ifelse(x - lowerValue < upperValue - x, lowerValue, upperValue)
>> }
>>
>> Bill Dunlap
>> Spotfire, TIBCO Software
>> wdunlap tibco.com
>>
>>
>>> -----Original Message-----
>>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
>>> Of William Dunlap
>>> Sent: Sunday, August 19, 2012 9:43 AM
>>> To: Francesco; r-help at r-project.org
>>> Subject: Re: [R] merging and obtaining the nearest value
>>>
>>> How many different types are there? Just a handful or many thousands?
>>> For this sort of problem it is often handy to write a function which generates
>>> datasets of the sort you are thinking of but parameterized by the
>>> number of rows, levels, etc., so you can see how the execution time
>>> varies with these things.
>>>
>>> If there are just a few types, try looping over types and using findInterval
>>> to see where A$Date fits into the sequence of B$Special_Date.
>>>
>>>
>>> Bill Dunlap
>>> Spotfire, TIBCO Software
>>> wdunlap tibco.com
>>>
>>>
>>>> -----Original Message-----
>>>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
>> Behalf
>>>> Of Francesco
>>>> Sent: Sunday, August 19, 2012 4:01 AM
>>>> To: r-help at r-project.org
>>>> Subject: Re: [R] merging and obtaining the nearest value
>>>>
>>>> Dear Riu, Many thanks for your suggestion
>>>>
>>>> However these are just simplified examples... in reality the dataset A
>>>> contains millions of observations and B several thousands of rows...
>>>> Could I still use a modified form of your suggestion?
>>>>
>>>> Thanks
>>>>
>>>> On 19 August 2012 12:51, Rui Barradas <ruipbarradas at sapo.pt> wrote:
>>>>> Hello,
>>>>>
>>>>> Try the following.
>>>>>
>>>>>
>>>>> A <- read.table(text="
>>>>>
>>>>> TYPE DATE
>>>>> A 2
>>>>> A 5
>>>>> A 20
>>>>> B 10
>>>>> B 2
>>>>> ", header = TRUE)
>>>>>
>>>>>
>>>>> B <- read.table(text="
>>>>>
>>>>> TYPE Special_Date
>>>>> A 2
>>>>> A 6
>>>>> A 20
>>>>> A 22
>>>>> B 5
>>>>> B 6
>>>>> ", header = TRUE)
>>>>>
>>>>> result <- do.call( rbind, lapply(split(merge(A, B), list(m$DATE, m$TYPE)),
>>>>> function(x){
>>>>> a <- abs(x$DATE - x$Special_Date)
>>>>> if(nrow(x)) x[which(min(a) == a), ] }) )
>>>>> result$Difference <- result$DATE - result$Special_Date
>>>>> result$Special_Date <- NULL
>>>>> rownames(result) <- seq_len(nrow(result))
>>>>> result
>>>>>
>>>>>
>>>>> Also, it's a good practice to post data examples using dput(). For instance,
>>>>>
>>>>> dput(A)
>>>>> structure(list(TYPE = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("A",
>>>>> "B"), class = "factor"), DATE = c(2L, 5L, 20L, 10L, 2L)), .Names = c("TYPE",
>>>>> "DATE"), class = "data.frame", row.names = c(NA, -5L))
>>>>>
>>>>> Now all we have to do is run the statement A <- structure(... etc...) to
>>>>> have an exact copy of the data example.
>>>>> Anyway, your example with input and the wanted result was very welcome.
>>>>>
>>>>> Hope this helps,
>>>>>
>>>>> Rui Barradas
>>>>>
>>>>> Em 19-08-2012 11:10, Francesco escreveu:
>>>>>> Dear R-help
>>>>>>
>>>>>> Î would like to know if there is a short solution in R for this
>>>>>> merging problem...
>>>>>>
>>>>>> Let say I have a dataset A as:
>>>>>>
>>>>>> TYPE DATE
>>>>>> A 2
>>>>>> A 5
>>>>>> A 20
>>>>>> B 10
>>>>>> B 2
>>>>>>
>>>>>> (there can be duplicates for the same type and date)
>>>>>>
>>>>>> and I have another dataset B as :
>>>>>>
>>>>>> TYPE Special_Date
>>>>>> A 2
>>>>>> A 6
>>>>>> A 20
>>>>>> A 22
>>>>>> B 5
>>>>>> B 6
>>>>>>
>>>>>> The question is : I would like to obtain the difference between the
>>>>>> date of each observation in A and the closest special date in B with
>>>>>> the same type. In case of ties I would take the latest date of the
>>>>>> two.
>>>>>>
>>>>>> For example I would obtain here
>>>>>>
>>>>>> TYPE DATE Difference
>>>>>> A 2 0=2-2
>>>>>> A 5 -1=5-6
>>>>>> A 20 0=20-20
>>>>>> B 10 +4=10-6
>>>>>> B 2 -3=2-5
>>>>>>
>>>>>> Do you know how to (simply?) obtain this in R?
>>>>>>
>>>>>> Many thanks!
>>>>>> Best Regards
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide
>>>>>> http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
More information about the R-help
mailing list