[R] Removing rows with earlier dates
William Dunlap
wdunlap at tibco.com
Fri Dec 24 19:44:50 CET 2010
> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of Ali Salekfard
> Sent: Friday, December 24, 2010 5:46 AM
> To: r-help at r-project.org
> Subject: [R] Removing rows with earlier dates
>
> Hi all,
>
> I'm new to the list but have benfited from it quite
> extensively. Straight to
> my rather strange question:
>
> I have a data frame that contains mapping rules in this way:
>
> ACCOUNT, RULE COLUMNS, Effective Date
>
>
> The dataframe comes from a database that stores all dates.
> What I would like
> to do is to create a data frame with only the most recent
> rule for each
> account. In traditional programming languages I would loop
> through each
> account find the most recent rule(s) and fill up my updated
> data frame.
Do you mean you would loop through the accounts and for
each account loop through all rules for that account looking
for the rule with the latest date?
>
> Does anyone have any better idea to use R's magic (Its syntax is still
> magical to me) for this problem?
I like to think of this sort of a problem as a one involving
"runs" (sequences of identical data points). The following
function identifies which data points are the last in a run:
isLastInRun <- function (x, ...)
{
retval <- c(x[-1] != x[-length(x)], TRUE)
for (y in list(...)) {
stopifnot(length(x) == length(y))
retval <- retval | c(x[-1] != x[-length(x)], TRUE)
}
retval
}
E.g.,
> isLastInRun(rep(1:2,each=5), rep(1:3,c(3,3,4)))
[1] FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE FALSE TRUE
If your data is sorted by ACCOUNT with ties broken by
`Effective Date` then you can get what I think you want with
f0 <- function(data) {
data[isLastInRun(data[,"ACCOUNT"]), ]
}
If it is not sorted then sort it first and then do the above with
f1 <- function(data) {
data <- data[order(data[,"ACCOUNT"], data[,"Effective Date"]), ]
f0(data)
}
If it is important that the output records be in the same order
as the input records then you can use
f2 <- function(data) {
o <- order(data[, "ACCOUNT"], data[, "Effective Date"])
tmp <- logical(length(o))
tmp[o] <- isLastInRun(data[o, "ACCOUNT"])
data[tmp,]
}
(It may be faster to sort the output of f1 rather that
sorting the input, as f2 does, but f2's method is a bit
simpler to write.)
This approach is typically faster than tapply when there
are a lot of small groups and runs less risk of corrupting
the data when using unusual classes in the columns of your
data.frame.
Typical usage is
> d <- data.frame(check.names=FALSE,
+ "Effective Date"=c(1004,1008,1004,1007,1005),
+ ACCOUNT= c( 11, 12, 11, 12, 11),
+ Rule= c( "A", "B", "C", "D", "E"))
> f1(d)
Effective Date ACCOUNT Rule
5 1005 11 E
2 1008 12 B
> f2(d)
Effective Date ACCOUNT Rule
2 1008 12 B
5 1005 11 E
Did I interpret your question correctly?
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
>
> By the way the list of rules is quite extensive (144643 lines to be
> precise), and there are usually 1-3 most recent rules (rows) for each
> account.
>
> Thanks.
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
More information about the R-help
mailing list