[R] remove

Sun Feb 12 08:19:11 CET 2017

Hi Jeff,

Why do you say ave() is better suited *because* it always returns a 
vector that is just as long as the input vector? Is it because that 
equal-length property allows match() to be avoided, so that the 
subsequent subsetting is faster with very large datasets?
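
To make the question concrete, here is a minimal sketch of the two shapes of result I have in mind. It reuses the example data from your message below; n_unique, per_group, per_row, keep_a and keep_b are names I made up for illustration, and tapply() stands in for by() only because it also returns one value per group. It says nothing about speed, only about the extra mapping step.

```r
# Same example data as in the quoted message below
DF <- read.table(text = "first week last
Alex 1 West
Bob 1 John
Cory 1 Jack
Cory 2 Jack
Bob 2 John
Bob 3 John
Alex 2 Joseph
Alex 3 West
Alex 4 West", header = TRUE, as.is = TRUE)

# Hypothetical helper: number of distinct values in a vector
n_unique <- function(x) length(unique(x))

# One value per group (3 groups), as with by()
per_group <- tapply(DF$last, DF$first, n_unique)

# The per-group result has to be mapped back onto the rows
# (here with %in%, which is built on match()) before subsetting
keep_a <- DF$first %in% names(per_group)[per_group == 1]

# One value per row (9 rows), so it can index the rows directly.
# ave() returns character here because DF$last is character.
per_row <- ave(DF$last, DF$first, FUN = n_unique)
keep_b <- per_row == "1"

# Both routes select the same rows
identical(keep_a, keep_b)
DF[keep_b, ]
```
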

Thanks, Philip

On 12/02/2017 5:42 PM, Jeff Newmiller wrote:
> The "by" function aggregates and returns a result with generally fewer 
> rows than the original data. Since you are looking to index the rows 
> in the original data set, the "ave" function is better suited because 
> it always returns a vector that is just as long as the input vector:
>
> # I usually work with character data rather than factors if I plan
> # to modify the data (e.g. removing rows)
> DF <- read.table( text=
> 'first  week last
> Alex    1  West
> Bob     1  John
> Cory    1  Jack
> Cory    2  Jack
> Bob     2  John
> Bob     3  John
> Alex    2  Joseph
> Alex    3  West
> Alex    4  West
> ', header = TRUE, as.is = TRUE )
>
> err <- ave( DF$last
>           , DF[ , "first", drop = FALSE]
>           , FUN = function( lst ) {
>               length( unique( lst ) )
>             }
>           )
> result <- DF[ "1" == err, ]
> result
>
> Notice that ave() returns a vector of the same type as its first 
> argument, so even though FUN returns a number, the err vector is 
> character (which is why it is compared with "1" above).
>
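
A small sketch of the type behaviour described above, as I understand it. The vectors first_name and last_name are just the two columns of the example typed out by hand, and n_unique is a made-up helper name.

```r
# The 'first' and 'last' columns of the example, typed out as plain vectors
first_name <- c("Alex", "Bob", "Cory", "Cory", "Bob", "Bob", "Alex", "Alex", "Alex")
last_name  <- c("West", "John", "Jack", "Jack", "John", "John", "Joseph", "West", "West")

# Hypothetical helper: number of distinct values in a vector
n_unique <- function(x) length(unique(x))

# The input is character, so the counts are coerced to character strings
err_chr <- ave(last_name, first_name, FUN = n_unique)
is.character(err_chr)

# Convert explicitly if a numeric comparison is preferred
err_int <- as.integer(err_chr)
keep <- err_int == 1
```

Either comparison, "1" == err_chr or 1 == err_int, should pick out the same rows; the conversion only makes the intent more explicit.
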
> If you wanted to be able to examine more than one other column in 
> determining the keep/reject decision, you could do:
>
> err2 <- ave( seq_along( DF$first )
>            , DF[ , "first", drop = FALSE]
>            , FUN = function( n ) {
>               length( unique( DF[ n, "last" ] ) )
>              }
>            )
> result2 <- DF[ 1 == err2, ]
> result2
>
> and then you would have the option to re-use the "n" index to look at 
> other columns as well.
>
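
For what it is worth, here is how I read the "re-use the index" idea, as a sketch only. The data frame DF2 and its city column are invented for illustration and are not part of Val's data; err3 and result4 are likewise made-up names.

```r
# Hypothetical data: 'city' is an invented extra column
DF2 <- data.frame(
  first = c("Alex", "Alex", "Bob", "Bob", "Cory", "Cory"),
  last  = c("West", "West", "John", "John", "Jack", "Jack"),
  city  = c("Oslo", "Oslo", "Rome", "Bern", "Lima", "Lima"),
  stringsAsFactors = FALSE
)

# For each first name, count the distinct (last, city) combinations.
# The index n gives the row positions of the current group, so any
# number of columns can be inspected inside FUN.
err3 <- ave(seq_along(DF2$first), DF2$first, FUN = function(n) {
  nrow(unique(DF2[n, c("last", "city")]))
})

# Keep only the people whose records agree in both columns
result4 <- DF2[err3 == 1, ]
result4
```

Because the first argument is a vector of row positions, err3 stays numeric and can be compared with 1 rather than "1".
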
> Finally, here is a dplyr solution:
>
> library(dplyr)
> result3 <- (   DF
>            %>% group_by( first ) # like a prep for ave or by
>            %>% mutate( err = length( unique( last ) ) ) # similar to ave
>            %>% filter( 1 == err ) # drop the rows with too many last names
>            %>% select( -err ) # drop the temporary column
>            %>% as.data.frame # convert back to a plain-jane data frame
>            )
> result3
>
> which uses a small set of verbs in a pipeline of functions to go from 
> input to result in one pass.
>
> If your data set is really big (running out of memory big) then you 
> might want to investigate the data.table or sqlite packages, either of 
> which can be combined with dplyr to get a standardized syntax for 
> managing larger amounts of data. However, most people actually aren't 
> running out of memory so in most cases the extra horsepower isn't 
> actually needed.
>
> On Sun, 12 Feb 2017, P Tennant wrote:
>
>> Hi Val,
>>
>> The by() function could be used here. With the dataframe dfr:
>>
>> # split the data by first name and check for more than one last name
>> # for each first name
>> res <- by(dfr, dfr['first'], function(x) length(unique(x$last)) > 1)
>> # make the result more easily manipulated
>> res <- as.table(res)
>> res
>> # first
>> # Alex   Bob  Cory
>> # TRUE FALSE FALSE
>>
>> # then use this result to subset the data
>> nw.dfr <- dfr[!dfr$first %in% names(res[res]) , ]
>> # sort if needed
>> nw.dfr[order(nw.dfr$first) , ]
>>
>>   first week last
>> 2   Bob    1 John
>> 5   Bob    2 John
>> 6   Bob    3 John
>> 3  Cory    1 Jack
>> 4  Cory    2 Jack
>>
>>
>> Philip
>>
>> On 12/02/2017 4:02 PM, Val wrote:
>>> Hi all,
>>> I have a big data set and want to remove rows conditionally.
>>> In my data file each person was recorded for several weeks. Somehow
>>> during the recording period, their last name was misreported. For
>>> each person, the last name should be the same; otherwise the person
>>> should be removed from the data. For example, in the following data
>>> set, Alex was found to have two last names.
>>>
>>> Alex   West
>>> Alex   Joseph
>>>
>>> Alex should be removed from the data. If this happens, I want to
>>> remove all rows with Alex. Here is my data set:
>>>
>>> df<- read.table(header=TRUE, text='first  week last
>>> Alex    1  West
>>> Bob     1  John
>>> Cory    1  Jack
>>> Cory    2  Jack
>>> Bob     2  John
>>> Bob     3  John
>>> Alex    2  Joseph
>>> Alex    3  West
>>> Alex    4  West ')
>>>
>>> Desired output
>>>
>>>        first  week last
>>> 1     Bob     1   John
>>> 2     Bob     2   John
>>> 3     Bob     3   John
>>> 4     Cory     1   Jack
>>> 5     Cory     2   Jack
>>>
>>> Thank you in advance
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide 
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>
> ---------------------------------------------------------------------------
> Jeff Newmiller                        The     .....       .....  Go Live...
> DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>                                       Live:   OO#.. Dead: OO#..  Playing
> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
> ---------------------------------------------------------------------------