[R] remove
Val
valkremk at gmail.com
Mon Feb 13 05:18:37 CET 2017
Hi Jeff and All,
When I examined the excluded data, i.e., first names with
different last names, I noticed that some last names were not
recorded.
For instance, I modified the data as follows:
DF <- read.table( text=
'first week last
Alex 1 West
Bob 1 John
Cory 1 Jack
Cory 2 -
Bob 2 John
Bob 3 John
Alex 2 Joseph
Alex 3 West
Alex 4 West
', header = TRUE, as.is = TRUE )
err2 <- ave( seq_along( DF$first )
, DF[ , "first", drop = FALSE]
, FUN = function( n ) {
length( unique( DF[ n, "last" ] ) )
}
)
result2 <- DF[ 1 == err2, ]
result2
first week last
2 Bob 1 John
5 Bob 2 John
6 Bob 3 John
However, I want to keep Cory's records. A last name that was not recorded
is assumed to be the same as that person's recorded last name.
The final output should be
first week last
Bob 1 John
Bob 2 John
Bob 3 John
Cory 1 Jack
Cory 2 -
Thank you again!
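
A minimal sketch of one way to get that output, assuming "-" is the only marker for an unrecorded last name. It reuses the index-based ave() approach from earlier in the thread but drops "-" before counting distinct last names. The names nlast and kept are illustrative only and do not come from the thread.

```r
# Same modified data as above; "-" marks a last name that was not recorded
DF <- read.table( text=
'first week last
Alex 1 West
Bob 1 John
Cory 1 Jack
Cory 2 -
Bob 2 John
Bob 3 John
Alex 2 Joseph
Alex 3 West
Alex 4 West
', header = TRUE, as.is = TRUE )

# For each first name, count the distinct last names that were recorded,
# ignoring the "-" placeholder (assumption: "-" always means "not recorded")
nlast <- ave( seq_along( DF$first )
            , DF[ , "first", drop = FALSE]
            , FUN = function( n ) {
                lst <- DF[ n, "last" ]
                length( unique( lst[ lst != "-" ] ) )
              }
            )

# Keep people with at most one recorded last name, then sort for display
kept <- DF[ nlast <= 1, ]
kept <- kept[ order( kept$first, kept$week ), ]
kept
```

Note that nlast <= 1 also keeps a person whose last name was never recorded at all (count of 0). If such people should be dropped instead, 1 == nlast would do that.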
On Sun, Feb 12, 2017 at 7:28 PM, Val <valkremk at gmail.com> wrote:
> Sorry Jeff, I did not finish my email. I accidentally touched the send button.
> My question came up when I used this one
> length(unique(result2$first))
> vs
> dim(result2[!duplicated(result2[,c('first')]),]) [1]
>
> I got different results, but I have now found the problem.
>
> Thank you!
>
> On Sun, Feb 12, 2017 at 6:31 PM, Jeff Newmiller
> <jdnewmil at dcn.davis.ca.us> wrote:
>> Your question mystifies me, since it looks to me like you already know the answer.
>> --
>> Sent from my phone. Please excuse my brevity.
>>
>> On February 12, 2017 3:30:49 PM PST, Val <valkremk at gmail.com> wrote:
>>>Hi Jeff and all,
>>> How do I get the number of unique first names in the two data sets?
>>>
>>>for the first one,
>>>result2 <- DF[ 1 == err2, ]
>>>length(unique(result2$first))
>>>
>>>
>>>
>>>
>>>On Sun, Feb 12, 2017 at 12:42 AM, Jeff Newmiller
>>><jdnewmil at dcn.davis.ca.us> wrote:
>>>> The "by" function aggregates and returns a result with generally fewer rows
>>>> than the original data. Since you are looking to index the rows in the
>>>> original data set, the "ave" function is better suited because it always
>>>> returns a vector that is just as long as the input vector:
>>>>
>>>> # I usually work with character data rather than factors if I plan
>>>> # to modify the data (e.g. removing rows)
>>>> DF <- read.table( text=
>>>> 'first week last
>>>> Alex 1 West
>>>> Bob 1 John
>>>> Cory 1 Jack
>>>> Cory 2 Jack
>>>> Bob 2 John
>>>> Bob 3 John
>>>> Alex 2 Joseph
>>>> Alex 3 West
>>>> Alex 4 West
>>>> ', header = TRUE, as.is = TRUE )
>>>>
>>>> err <- ave( DF$last
>>>> , DF[ , "first", drop = FALSE]
>>>> , FUN = function( lst ) {
>>>> length( unique( lst ) )
>>>> }
>>>> )
>>>> result <- DF[ "1" == err, ]
>>>> result
>>>>
>>>> Notice that the ave function returns a vector of the same type as was given
>>>> to it, so even though the function returns a numeric the err
>>>> vector is character.
>>>>
>>>> If you wanted to be able to examine more than one other column in
>>>> determining the keep/reject decision, you could do:
>>>>
>>>> err2 <- ave( seq_along( DF$first )
>>>> , DF[ , "first", drop = FALSE]
>>>> , FUN = function( n ) {
>>>> length( unique( DF[ n, "last" ] ) )
>>>> }
>>>> )
>>>> result2 <- DF[ 1 == err2, ]
>>>> result2
>>>>
>>>> and then you would have the option to re-use the "n" index to look at other
>>>> columns as well.
>>>>
>>>> Finally, here is a dplyr solution:
>>>>
>>>> library(dplyr)
>>>> result3 <- ( DF
>>>> %>% group_by( first ) # like a prep for ave or by
>>>> %>% mutate( err = length( unique( last ) ) ) # similar to ave
>>>> %>% filter( 1 == err ) # drop the rows with too many last names
>>>> %>% select( -err ) # drop the temporary column
>>>> %>% as.data.frame # convert back to a plain-jane data frame
>>>> )
>>>> result3
>>>>
>>>> which uses a small set of verbs in a pipeline of functions to go from input
>>>> to result in one pass.
>>>>
>>>> If your data set is really big (running out of memory big) then you might
>>>> want to investigate the data.table or sqlite packages, either of which can
>>>> be combined with dplyr to get a standardized syntax for managing larger
>>>> amounts of data. However, most people actually aren't running out of memory
>>>> so in most cases the extra horsepower isn't actually needed.
>>>>
>>>>
>>>> On Sun, 12 Feb 2017, P Tennant wrote:
>>>>
>>>>> Hi Val,
>>>>>
>>>>> The by() function could be used here. With the dataframe dfr:
>>>>>
>>>>> # split the data by first name and check for more than one last name
>>>>> # for each first name
>>>>> res <- by(dfr, dfr['first'], function(x) length(unique(x$last)) > 1)
>>>>> # make the result more easily manipulated
>>>>> res <- as.table(res)
>>>>> res
>>>>> # first
>>>>> # Alex Bob Cory
>>>>> # TRUE FALSE FALSE
>>>>>
>>>>> # then use this result to subset the data
>>>>> nw.dfr <- dfr[!dfr$first %in% names(res[res]) , ]
>>>>> # sort if needed
>>>>> nw.dfr[order(nw.dfr$first) , ]
>>>>>
>>>>> first week last
>>>>> 2 Bob 1 John
>>>>> 5 Bob 2 John
>>>>> 6 Bob 3 John
>>>>> 3 Cory 1 Jack
>>>>> 4 Cory 2 Jack
>>>>>
>>>>>
>>>>> Philip
>>>>>
>>>>> On 12/02/2017 4:02 PM, Val wrote:
>>>>>>
>>>>>> Hi all,
>>>>>> I have a big data set and want to remove rows conditionally.
>>>>>> In my data file each person was recorded for several weeks. Somehow
>>>>>> during the recording periods, their last name was misreported. For
>>>>>> each person, the last name should be the same; otherwise the person
>>>>>> should be removed from the data. For example, in the following data
>>>>>> set, Alex was found to have two last names.
>>>>>>
>>>>>> Alex West
>>>>>> Alex Joseph
>>>>>>
>>>>>> Alex should be removed from the data. If this happens, I want to
>>>>>> remove all rows with Alex. Here is my data set
>>>>>>
>>>>>> df<- read.table(header=TRUE, text='first week last
>>>>>> Alex 1 West
>>>>>> Bob 1 John
>>>>>> Cory 1 Jack
>>>>>> Cory 2 Jack
>>>>>> Bob 2 John
>>>>>> Bob 3 John
>>>>>> Alex 2 Joseph
>>>>>> Alex 3 West
>>>>>> Alex 4 West ')
>>>>>>
>>>>>> Desired output
>>>>>>
>>>>>> first week last
>>>>>> 1 Bob 1 John
>>>>>> 2 Bob 2 John
>>>>>> 3 Bob 3 John
>>>>>> 4 Cory 1 Jack
>>>>>> 5 Cory 2 Jack
>>>>>>
>>>>>> Thank you in advance
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide
>>>>>> http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> ---------------------------------------------------------------------------
>>>> Jeff Newmiller The ..... ..... Go Live...
>>>> DCN:<jdnewmil at dcn.davis.ca.us> Basics: ##.#. ##.#. Live Go...
>>>> Live: OO#.. Dead: OO#.. Playing
>>>> Research Engineer (Solar/Batteries O.O#. #.O#. with
>>>> /Software/Embedded Controllers) .OO#. .OO#. rocks...1k
>>>> ---------------------------------------------------------------------------