[R] remove
Val
valkremk at gmail.com
Mon Feb 13 02:28:53 CET 2017
Sorry Jeff, I did not finish my email. I accidentally touched the send button.
My question was about the results I got when I used this one
length(unique(result2$first))
vs
dim(result2[!duplicated(result2[,c('first')]),])[1]
I did get different results, but I have now found the problem.
Thank you!
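
For readers following along, here is a minimal sketch comparing the two counts on the example data from earlier in the thread. It rebuilds result2 the way Jeff's reply below defines it. The names n_unique, n_dedup and n_rows are introduced here only for illustration; they are not from the original messages.

```r
# Example data from earlier in the thread
DF <- read.table(text = "first week last
Alex 1 West
Bob 1 John
Cory 1 Jack
Cory 2 Jack
Bob 2 John
Bob 3 John
Alex 2 Joseph
Alex 3 West
Alex 4 West
", header = TRUE, as.is = TRUE)

# Number of distinct last names per first name, repeated on every row
err2 <- ave(seq_along(DF$first), DF$first,
            FUN = function(n) length(unique(DF[n, "last"])))
result2 <- DF[1 == err2, ]

# Two ways of counting distinct first names
n_unique <- length(unique(result2$first))
n_dedup  <- dim(result2[!duplicated(result2[, "first"]), ])[1]

# For contrast: the number of rows, not the number of names
n_rows <- nrow(result2)

c(unique = n_unique, dedup = n_dedup, rows = n_rows)
```

On this data the first two approaches should agree, since both count each first name once. nrow(result2) counts rows rather than distinct names, so it is generally larger.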
On Sun, Feb 12, 2017 at 6:31 PM, Jeff Newmiller
<jdnewmil at dcn.davis.ca.us> wrote:
> Your question mystifies me, since it looks to me like you already know the answer.
> --
> Sent from my phone. Please excuse my brevity.
>
> On February 12, 2017 3:30:49 PM PST, Val <valkremk at gmail.com> wrote:
>>Hi Jeff and all,
>> How do I get the number of unique first names in the two data sets?
>>
>>for the first one,
>>result2 <- DF[ 1 == err2, ]
>>length(unique(result2$first))
>>
>>
>>
>>
>>On Sun, Feb 12, 2017 at 12:42 AM, Jeff Newmiller
>><jdnewmil at dcn.davis.ca.us> wrote:
>>> The "by" function aggregates and returns a result with generally fewer rows
>>> than the original data. Since you are looking to index the rows in the
>>> original data set, the "ave" function is better suited because it always
>>> returns a vector that is just as long as the input vector:
>>>
>>> # I usually work with character data rather than factors if I plan
>>> # to modify the data (e.g. removing rows)
>>> DF <- read.table( text=
>>> 'first week last
>>> Alex 1 West
>>> Bob 1 John
>>> Cory 1 Jack
>>> Cory 2 Jack
>>> Bob 2 John
>>> Bob 3 John
>>> Alex 2 Joseph
>>> Alex 3 West
>>> Alex 4 West
>>> ', header = TRUE, as.is = TRUE )
>>>
>>> err <- ave( DF$last
>>> , DF[ , "first", drop = FALSE]
>>> , FUN = function( lst ) {
>>> length( unique( lst ) )
>>> }
>>> )
>>> result <- DF[ "1" == err, ]
>>> result
>>>
>>> Notice that the ave function returns a vector of the same type as was given
>>> to it, so even though the function returns a numeric, the err
>>> vector is character.
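
The type behaviour described above can be checked with a short sketch on the same example data. The name err_num is hypothetical and used only for this illustration; the second call mirrors the seq_along() idea that appears later in this message.

```r
# Example data from the thread
DF <- read.table(text = "first week last
Alex 1 West
Bob 1 John
Cory 1 Jack
Cory 2 Jack
Bob 2 John
Bob 3 John
Alex 2 Joseph
Alex 3 West
Alex 4 West
", header = TRUE, as.is = TRUE)

# ave() writes the per-group results back into a copy of its first argument,
# so a character input gives a character result
err <- ave(DF$last, DF$first, FUN = function(lst) length(unique(lst)))
class(err)

# Passing row indices instead keeps the result numeric
err_num <- ave(seq_along(DF$last), DF$first,
               FUN = function(i) length(unique(DF$last[i])))
class(err_num)

# Either form can be used to select the rows to keep
identical(DF["1" == err, ], DF[1 == err_num, ])
```

Comparing against "1" works for the character version, but the numeric version avoids relying on how the counts are converted to text.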
>>>
>>> If you wanted to be able to examine more than one other column in
>>> determining the keep/reject decision, you could do:
>>>
>>> err2 <- ave( seq_along( DF$first )
>>> , DF[ , "first", drop = FALSE]
>>> , FUN = function( n ) {
>>> length( unique( DF[ n, "last" ] ) )
>>> }
>>> )
>>> result2 <- DF[ 1 == err2, ]
>>> result2
>>>
>>> and then you would have the option to re-use the "n" index to look at other
>>> columns as well.
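
As a sketch of what re-using the "n" index might look like, the example below adds a second, purely hypothetical rule: a person is kept only if they have a single last name and no week is recorded twice. The second rule and the name keep are assumptions made for illustration; they are not part of the original question.

```r
# Example data from the thread
DF <- read.table(text = "first week last
Alex 1 West
Bob 1 John
Cory 1 Jack
Cory 2 Jack
Bob 2 John
Bob 3 John
Alex 2 Joseph
Alex 3 West
Alex 4 West
", header = TRUE, as.is = TRUE)

keep <- ave(seq_along(DF$first), DF$first, FUN = function(n) {
  # n holds the row numbers belonging to one first name
  one_last    <- length(unique(DF[n, "last"])) == 1
  no_dup_week <- !anyDuplicated(DF[n, "week"])  # hypothetical extra check
  as.integer(one_last && no_dup_week)           # 1 = keep, 0 = reject
})

result <- DF[keep == 1, ]
result
```

Because the function only receives row numbers, any column of DF can be inspected inside it without changing the call to ave().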
>>>
>>> Finally, here is a dplyr solution:
>>>
>>> library(dplyr)
>>> result3 <- ( DF
>>>   %>% group_by( first ) # like a prep for ave or by
>>>   %>% mutate( err = length( unique( last ) ) ) # similar to ave
>>>   %>% filter( 1 == err ) # drop the rows with too many last names
>>>   %>% select( -err ) # drop the temporary column
>>>   %>% as.data.frame # convert back to a plain-jane data frame
>>> )
>>> result3
>>>
>>> which uses a small set of verbs in a pipeline of functions to go from input
>>> to result in one pass.
>>>
>>> If your data set is really big (running out of memory big) then you might
>>> want to investigate the data.table or sqlite packages, either of which can
>>> be combined with dplyr to get a standardized syntax for managing larger
>>> amounts of data. However, most people actually aren't running out of memory
>>> so in most cases the extra horsepower isn't actually needed.
>>>
>>>
>>> On Sun, 12 Feb 2017, P Tennant wrote:
>>>
>>>> Hi Val,
>>>>
>>>> The by() function could be used here. With the dataframe dfr:
>>>>
>>>> # split the data by first name and check for more than one last name
>>>> # for each first name
>>>> res <- by(dfr, dfr['first'], function(x) length(unique(x$last)) > 1)
>>>> # make the result more easily manipulated
>>>> res <- as.table(res)
>>>> res
>>>> # first
>>>> # Alex Bob Cory
>>>> # TRUE FALSE FALSE
>>>>
>>>> # then use this result to subset the data
>>>> nw.dfr <- dfr[!dfr$first %in% names(res[res]) , ]
>>>> # sort if needed
>>>> nw.dfr[order(nw.dfr$first) , ]
>>>>
>>>> first week last
>>>> 2 Bob 1 John
>>>> 5 Bob 2 John
>>>> 6 Bob 3 John
>>>> 3 Cory 1 Jack
>>>> 4 Cory 2 Jack
>>>>
>>>>
>>>> Philip
>>>>
>>>> On 12/02/2017 4:02 PM, Val wrote:
>>>>>
>>>>> Hi all,
>>>>> I have a big data set and want to remove rows conditionally.
>>>>> In my data file each person was recorded for several weeks. Somehow
>>>>> during the recording periods, their last name was misreported. For
>>>>> each person, the last name should be the same; otherwise that person
>>>>> should be removed from the data. For example, in the following data
>>>>> set, Alex was found to have two last names.
>>>>>
>>>>> Alex West
>>>>> Alex Joseph
>>>>>
>>>>> Alex should be removed from the data. If this happens, then I want to
>>>>> remove all rows with Alex. Here is my data set:
>>>>>
>>>>> df<- read.table(header=TRUE, text='first week last
>>>>> Alex 1 West
>>>>> Bob 1 John
>>>>> Cory 1 Jack
>>>>> Cory 2 Jack
>>>>> Bob 2 John
>>>>> Bob 3 John
>>>>> Alex 2 Joseph
>>>>> Alex 3 West
>>>>> Alex 4 West ')
>>>>>
>>>>> Desired output
>>>>>
>>>>> first week last
>>>>> 1 Bob 1 John
>>>>> 2 Bob 2 John
>>>>> 3 Bob 3 John
>>>>> 4 Cory 1 Jack
>>>>> 5 Cory 2 Jack
>>>>>
>>>>> Thank you in advance
>>>>>
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>>
>>>>
>>>
>>> ---------------------------------------------------------------------------
>>> Jeff Newmiller The ..... ..... Go Live...
>>> DCN:<jdnewmil at dcn.davis.ca.us> Basics: ##.#. ##.#. Live Go...
>>> Live: OO#.. Dead: OO#.. Playing
>>> Research Engineer (Solar/Batteries O.O#. #.O#. with
>>> /Software/Embedded Controllers) .OO#. .OO#. rocks...1k
>>> ---------------------------------------------------------------------------