[R] Removing rows with earlier dates
Martin Maechler
maechler at stat.math.ethz.ch
Wed Dec 29 15:39:08 CET 2010
>>>>> David Winsemius <dwinsemius at comcast.net>
>>>>> on Fri, 24 Dec 2010 11:47:05 -0500 writes:
> On Dec 24, 2010, at 11:04 AM, David Winsemius wrote:
>>
>> On Dec 24, 2010, at 8:45 AM, Ali Salekfard wrote:
>>
>>> Hi all,
>>>
>>> I'm new to the list but have benfited from it quite extensively.
>>> Straight to
>>> my rather strange question:
>>>
>>> I have a data frame that contains mapping rules in this way:
>>>
>>> ACCOUNT, RULE COLUMNS, Effective Date
>>>
>>>
>>> The dataframe comes from a database that stores all dates. What I
>>> would like
>>> to do is to create a data frame with only the most recent rule for
>>> each
>>> account. In traditional programming languages I would loop through
>>> each
>>> account find the most recent rule(s) and fill up my updated data
>>> frame.
>>>
>>> Does anyone have any better idea to use R's magic (Its syntax is
>>> still
>>> magical to me) for this problem?
>>
>> It's going to remain magic until you start thinking about what is
>> needed. In this case the need is for a good understanding of the
>> structure of the data object and the str function is the usual way
>> to examine such AND to then communicate with the list. Read the
>> Posting Guide again and the references it cites, please.
>>
>>>
>>
>> Here would have been my first attempt, assuming a dataframe named
>> dfrm:
>> #make sure the most recent is on top
>> dfrm <- dfrm[ order(dfrm["Effective Date"], decreasing=TRUE), ]
>> # then pull the first record within ACCOUNT
>> tapply(dfrm, dfrm$ACCOUNT , FUN= "[", 1 , )
>>
>>
>>> By the way the list of rules is quite extensive (144643 lines to be
>>> precise), and there are usually 1-3 most recent rules (rows) for each
>>> account.
>>
>> That is a bit different than the initial problem statement in which
>> you asked for the "only the most recent" within each account. How
>> are we supposed to get 3 _most_ recent rules? I think you are
>> expecting us to read your mind regarding how you are thinking about
>> this problem and pull all the records with the maximum date within
>> an account.
>>
>> Perhaps this effort to create a logical vector would be in the right
>> direction:
>>
>> dfrm[ ave(dfrm["Effective Date"], dfrm[ , "ACCOUNT"], function(x) x
>> == max(x), ]
>>
>> It should pull all records for which the Effective Date is equal to
>> the maximum within ACCOUNT. It is going to depend on whether
>> "Effective Date" of of a class that can be properly compared with
>> max(). Both Date and character representations of dates in standard
>> y-m-d form would qualify. Other date formats might not:
>> > max("01-02-2011", "02-01-2010")
>> [1] "02-01-2010"
>>
> When I used the strategy on the airquality dataset I do not get the
> results I expected, but a modification did succeed:
>> airquality[ airquality$Day == ave(airquality$Day, airquality$Month,
> FUN=function(x){ max(x)} ), ]
> Ozone Solar.R Wind Temp Month Day
> 31 37 279 7.4 76 5 31
> 61 NA 138 8.0 83 6 30
> 92 59 254 9.2 81 7 31
> 123 85 188 6.3 94 8 31
> 153 20 223 11.5 68 9 30
Hmm, yes, but " FUN = function(x) { max(x) } "
is so ugly that it hurts my R-eyes.
Just use 'FUN = max' .. please ..
and as we are in making things more readable,
I'd like to propose using with() in these cases -->
> airquality[with(airquality, Day == ave(Day, Month, FUN=max)),]
Ozone Solar.R Wind Temp Month Day
31 37 279 7.4 76 5 31
61 NA 138 8.0 83 6 30
92 59 254 9.2 81 7 31
123 85 188 6.3 94 8 31
153 20 223 11.5 68 9 30
Regards,
Martin Maechler, ETH Zurich
> I do suspect it requires that the dataframe be sorted to get the
> joint conditions lined up correctly. The earlier method should have
> used an as.logical() wrapper and would then not have needed pre-
> sorting the dataframe, so try instead:
> frm[ as.logical(ave(dfrm["Effective Date"], dfrm[ , "ACCOUNT"],
> function(x) x == max(x)), ]
>>
>>
>> --
>> David Winsemius, MD
>> West Hartford, CT
More information about the R-help
mailing list