[R] remove
Val
valkremk at gmail.com
Mon Feb 13 00:30:49 CET 2017
Hi Jeff and all,
How do I get the number of unique first names in the two data sets?
For the first one:
result2 <- DF[ 1 == err2, ]
length(unique(result2$first))
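
A minimal sketch of one way to get both counts, assuming "the two data sets" means the original DF and the filtered result2 from Jeff's message quoted below (that reading is an assumption, and the names n_all and n_kept are only illustrative):

```r
# Example data, copied from Jeff's message quoted below
DF <- read.table( text =
'first week last
Alex 1 West
Bob 1 John
Cory 1 Jack
Cory 2 Jack
Bob 2 John
Bob 3 John
Alex 2 Joseph
Alex 3 West
Alex 4 West
', header = TRUE, as.is = TRUE )

# For each row, the number of distinct last names recorded for that first name
err2 <- ave( seq_along( DF$first )
           , DF[ , "first", drop = FALSE ]
           , FUN = function( n ) length( unique( DF[ n, "last" ] ) )
           )
result2 <- DF[ 1 == err2, ]

n_all  <- length( unique( DF$first ) )       # first names before filtering
n_kept <- length( unique( result2$first ) )  # first names with a single last name
c( all = n_all, kept = n_kept, removed = n_all - n_kept )
```

With the example data this should give 3 first names before filtering and 2 afterwards, so one name (Alex) is dropped.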
On Sun, Feb 12, 2017 at 12:42 AM, Jeff Newmiller
<jdnewmil at dcn.davis.ca.us> wrote:
> The "by" function aggregates and returns a result with generally fewer rows
> than the original data. Since you are looking to index the rows in the
> original data set, the "ave" function is better suited because it always
> returns a vector that is just as long as the input vector:
>
> # I usually work with character data rather than factors if I plan
> # to modify the data (e.g. removing rows)
> DF <- read.table( text=
> 'first week last
> Alex 1 West
> Bob 1 John
> Cory 1 Jack
> Cory 2 Jack
> Bob 2 John
> Bob 3 John
> Alex 2 Joseph
> Alex 3 West
> Alex 4 West
> ', header = TRUE, as.is = TRUE )
>
> err <- ave( DF$last
> , DF[ , "first", drop = FALSE]
> , FUN = function( lst ) {
> length( unique( lst ) )
> }
> )
> result <- DF[ "1" == err, ]
> result
>
> Notice that the ave function returns a vector of the same type as was given
> to it, so even though the function returns a numeric value, the err
> vector is character.
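
The type behaviour described above can be checked with a small sketch. The vectors below just restate the example data, and the helper name count_unique is made up for illustration:

```r
first <- c( "Alex", "Bob", "Cory", "Cory", "Bob", "Bob", "Alex", "Alex", "Alex" )
last  <- c( "West", "John", "Jack", "Jack", "John", "John", "Joseph", "West", "West" )

# Hypothetical helper: number of distinct values in a vector
count_unique <- function( x ) length( unique( x ) )

# Character input: the counts are stored back into a character vector
err_chr <- ave( last, first, FUN = count_unique )
class( err_chr )

# Integer input (row positions): the counts stay integer
err_int <- ave( seq_along( last ), first
              , FUN = function( i ) count_unique( last[ i ] ) )
class( err_int )
```

This is why the first example compares against "1" while the index-based version can compare against 1.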
>
> If you wanted to be able to examine more than one other column in
> determining the keep/reject decision, you could do:
>
> err2 <- ave( seq_along( DF$first )
> , DF[ , "first", drop = FALSE]
> , FUN = function( n ) {
> length( unique( DF[ n, "last" ] ) )
> }
> )
> result2 <- DF[ 1 == err2, ]
> result2
>
> and then you would have the option to re-use the "n" index to look at other
> columns as well.
>
> Finally, here is a dplyr solution:
>
> library(dplyr)
> result3 <- ( DF
> %>% group_by( first ) # like a prep for ave or by
> %>% mutate( err = length( unique( last ) ) ) # similar to ave
> %>% filter( 1 == err ) # drop the rows with too many last names
> %>% select( -err ) # drop the temporary column
> %>% as.data.frame # convert back to a plain-jane data frame
> )
> result3
>
> which uses a small set of verbs in a pipeline of functions to go from input
> to result in one pass.
>
> If your data set is really big (running-out-of-memory big) then you might
> want to investigate the data.table or RSQLite packages, either of which can
> be combined with dplyr to get a standardized syntax for managing larger
> amounts of data. However, most people aren't actually running out of memory,
> so in most cases the extra horsepower isn't needed.
>
>
> On Sun, 12 Feb 2017, P Tennant wrote:
>
>> Hi Val,
>>
>> The by() function could be used here. With the dataframe dfr:
>>
>> # split the data by first name and check for more than one last name
>> # for each first name
>> res <- by(dfr, dfr['first'], function(x) length(unique(x$last)) > 1)
>> # make the result more easily manipulated
>> res <- as.table(res)
>> res
>> # first
>> # Alex Bob Cory
>> # TRUE FALSE FALSE
>>
>> # then use this result to subset the data
>> nw.dfr <- dfr[!dfr$first %in% names(res[res]) , ]
>> # sort if needed
>> nw.dfr[order(nw.dfr$first) , ]
>>
>> first week last
>> 2 Bob 1 John
>> 5 Bob 2 John
>> 6 Bob 3 John
>> 3 Cory 1 Jack
>> 4 Cory 2 Jack
>>
>>
>> Philip
>>
>> On 12/02/2017 4:02 PM, Val wrote:
>>>
>>> Hi all,
>>> I have a big data set and want to remove rows conditionally.
>>> In my data file each person was recorded for several weeks. Somehow,
>>> during the recording periods, their last name was misreported. For
>>> each person, the last name should be the same; otherwise the person
>>> should be removed from the data. For example, in the following data
>>> set, Alex was found to have two last names:
>>>
>>> Alex West
>>> Alex Joseph
>>>
>>> Alex should be removed from the data. If this happens, I want to
>>> remove all rows with Alex. Here is my data set:
>>>
>>> df<- read.table(header=TRUE, text='first week last
>>> Alex 1 West
>>> Bob 1 John
>>> Cory 1 Jack
>>> Cory 2 Jack
>>> Bob 2 John
>>> Bob 3 John
>>> Alex 2 Joseph
>>> Alex 3 West
>>> Alex 4 West ')
>>>
>>> Desired output
>>>
>>> first week last
>>> 1 Bob 1 John
>>> 2 Bob 2 John
>>> 3 Bob 3 John
>>> 4 Cory 1 Jack
>>> 5 Cory 2 Jack
>>>
>>> Thank you in advance
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
>>
>
> ---------------------------------------------------------------------------
> Jeff Newmiller The ..... ..... Go Live...
> DCN:<jdnewmil at dcn.davis.ca.us> Basics: ##.#. ##.#. Live Go...
> Live: OO#.. Dead: OO#.. Playing
> Research Engineer (Solar/Batteries O.O#. #.O#. with
> /Software/Embedded Controllers) .OO#. .OO#. rocks...1k
> ---------------------------------------------------------------------------