[R] merging and obtaining the nearest value
William Dunlap
wdunlap at tibco.com
Sun Aug 19 23:58:04 CEST 2012
And the following, f4, uses the same algorithm as f2 but codes
it somewhat more efficiently. It uses the same closestValue()
function.
f4 <- function (A, B) {
A$TYPE <- as.factor(A$TYPE)
uA <- levels(A$TYPE)
As <- split(A$DATE, A$TYPE)
Bs <- split(B$Special_Date, factor(B$TYPE, levels = uA))
closest <- numeric(nrow(A))
split(closest, A$TYPE) <- mapply(closestValue, As, Bs)
A$Difference <- A$DATE - closest
A
}
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
> Of William Dunlap
> Sent: Sunday, August 19, 2012 1:49 PM
> To: Francesco; r-help at r-project.org
> Subject: Re: [R] merging and obtaining the nearest value
>
> The following, f2(A,B), should do well with lots of rows in A and B
> as long as the number of types is not huge.
>
> f2 <- function(A, B) {
> types <- as.character(unique(A$TYPE))
> result <- numeric(nrow(A))
> Bs <- split(B$Special_Date, B$TYPE)
> for(type in types) {
> w <- A$TYPE == type
> # can omit the sort() below if you know that B$Special_Date is sorted.
> result[w] <- closestValue(A$DATE[w], sort(Bs[[type]]))
> }
> A$Difference <- A$DATE - result
> A
> }
>
> closestValue <- function (x, vec)
> {
> # for each value in x, find closest value in vec.
> # Break ties by using highest.
> # Assume vec is sorted.
> intervalNo <- findInterval(x, vec)
> lowerValue <- vec[pmax(1, intervalNo)]
> upperValue <- vec[pmin(length(vec), intervalNo+1)]
> ifelse(x - lowerValue < upperValue - x, lowerValue, upperValue)
> }
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
>
>
> > -----Original Message-----
> > From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
> > Of William Dunlap
> > Sent: Sunday, August 19, 2012 9:43 AM
> > To: Francesco; r-help at r-project.org
> > Subject: Re: [R] merging and obtaining the nearest value
> >
> > How many different types are there? Just a handful or many thousands?
> > For this sort of problem it is often handy to write a function which generates
> > datasets of the sort you are thinking of but parameterized by the
> > number of rows, levels, etc., so you can see how the execution time
> > varies with these things.
> >
> > If there are just a few types, try looping over types and using findInterval
> > to see where A$Date fits into the sequence of B$Special_Date.
> >
> >
> > Bill Dunlap
> > Spotfire, TIBCO Software
> > wdunlap tibco.com
> >
> >
> > > -----Original Message-----
> > > From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
> Behalf
> > > Of Francesco
> > > Sent: Sunday, August 19, 2012 4:01 AM
> > > To: r-help at r-project.org
> > > Subject: Re: [R] merging and obtaining the nearest value
> > >
> > > Dear Riu, Many thanks for your suggestion
> > >
> > > However these are just simplified examples... in reality the dataset A
> > > contains millions of observations and B several thousands of rows...
> > > Could I still use a modified form of your suggestion?
> > >
> > > Thanks
> > >
> > > On 19 August 2012 12:51, Rui Barradas <ruipbarradas at sapo.pt> wrote:
> > > > Hello,
> > > >
> > > > Try the following.
> > > >
> > > >
> > > > A <- read.table(text="
> > > >
> > > > TYPE DATE
> > > > A 2
> > > > A 5
> > > > A 20
> > > > B 10
> > > > B 2
> > > > ", header = TRUE)
> > > >
> > > >
> > > > B <- read.table(text="
> > > >
> > > > TYPE Special_Date
> > > > A 2
> > > > A 6
> > > > A 20
> > > > A 22
> > > > B 5
> > > > B 6
> > > > ", header = TRUE)
> > > >
> > > > result <- do.call( rbind, lapply(split(merge(A, B), list(m$DATE, m$TYPE)),
> > > > function(x){
> > > > a <- abs(x$DATE - x$Special_Date)
> > > > if(nrow(x)) x[which(min(a) == a), ] }) )
> > > > result$Difference <- result$DATE - result$Special_Date
> > > > result$Special_Date <- NULL
> > > > rownames(result) <- seq_len(nrow(result))
> > > > result
> > > >
> > > >
> > > > Also, it's a good practice to post data examples using dput(). For instance,
> > > >
> > > > dput(A)
> > > > structure(list(TYPE = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("A",
> > > > "B"), class = "factor"), DATE = c(2L, 5L, 20L, 10L, 2L)), .Names = c("TYPE",
> > > > "DATE"), class = "data.frame", row.names = c(NA, -5L))
> > > >
> > > > Now all we have to do is run the statement A <- structure(... etc...) to
> > > > have an exact copy of the data example.
> > > > Anyway, your example with input and the wanted result was very welcome.
> > > >
> > > > Hope this helps,
> > > >
> > > > Rui Barradas
> > > >
> > > > Em 19-08-2012 11:10, Francesco escreveu:
> > > >>
> > > >> Dear R-help
> > > >>
> > > >> Î would like to know if there is a short solution in R for this
> > > >> merging problem...
> > > >>
> > > >> Let say I have a dataset A as:
> > > >>
> > > >> TYPE DATE
> > > >> A 2
> > > >> A 5
> > > >> A 20
> > > >> B 10
> > > >> B 2
> > > >>
> > > >> (there can be duplicates for the same type and date)
> > > >>
> > > >> and I have another dataset B as :
> > > >>
> > > >> TYPE Special_Date
> > > >> A 2
> > > >> A 6
> > > >> A 20
> > > >> A 22
> > > >> B 5
> > > >> B 6
> > > >>
> > > >> The question is : I would like to obtain the difference between the
> > > >> date of each observation in A and the closest special date in B with
> > > >> the same type. In case of ties I would take the latest date of the
> > > >> two.
> > > >>
> > > >> For example I would obtain here
> > > >>
> > > >> TYPE DATE Difference
> > > >> A 2 0=2-2
> > > >> A 5 -1=5-6
> > > >> A 20 0=20-20
> > > >> B 10 +4=10-6
> > > >> B 2 -3=2-5
> > > >>
> > > >> Do you know how to (simply?) obtain this in R?
> > > >>
> > > >> Many thanks!
> > > >> Best Regards
> > > >>
> > > >> ______________________________________________
> > > >> R-help at r-project.org mailing list
> > > >> https://stat.ethz.ch/mailman/listinfo/r-help
> > > >> PLEASE do read the posting guide
> > > >> http://www.R-project.org/posting-guide.html
> > > >> and provide commented, minimal, self-contained, reproducible code.
> > > >
> > > >
> > >
> > > ______________________________________________
> > > R-help at r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > and provide commented, minimal, self-contained, reproducible code.
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
More information about the R-help
mailing list