[R] Remove duplicates from a data frame but with some special requirements
gcam
gcam032 at gmail.com
Thu Dec 17 20:31:15 CET 2009
Thanks Gray,
This helps; I'd completely forgotten about the subset command. However, it
doesn't quite get me where I need to be. Perhaps an example will help. I will
simplify my data frame to the three important variables:
ESR_ref  ESR_ref_edit  Loaded
1.1      1.1           Y
1.1.1    1.1           NC
1.1.2    1.1           Y
2.1      2.1           N
2.1.1    2.1           Y
2.1.2    2.1           PU
2.1.3    2.1           Y
3.1      3.1           Y
4.1      4.1           N
4.1.1    4.1           PU
So I've created the "edit" variable so that I can test for duplicates (i.e.
samples with more than one sub-sample), because the individual sub-samples
are not of interest at this point. I just want one sub-sample per sample.
However, consider 2.1: removing duplicates would keep only the first line,
which has an "N". If at least one of the sub-samples has a "Y", I want that
line rather than a sub-sample with another code. 1.1 in this example works
by default because its first sub-sample is a "Y", so that information is
retained.
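
One possible way to express this requirement, offered as a minimal sketch rather than a definitive answer: sort the rows so that, within each sample, any "Y" row comes first, and only then drop the duplicates. The data frame below is rebuilt from the table above; the names `first_seen`, `ord`, `Samps_sorted` and `sub1` are made up for illustration, and the sketch assumes that any "Y" row of a sample is acceptable (it keeps the earliest one).

```r
# Example data from the table above. ESR_ref is stored as character so that
# values such as "1.1.1" are not parsed as numbers.
Samps <- data.frame(
  ESR_ref      = c("1.1", "1.1.1", "1.1.2", "2.1", "2.1.1", "2.1.2",
                   "2.1.3", "3.1", "4.1", "4.1.1"),
  ESR_ref_edit = c("1.1", "1.1", "1.1", "2.1", "2.1", "2.1",
                   "2.1", "3.1", "4.1", "4.1"),
  Loaded       = c("Y", "NC", "Y", "N", "Y", "PU", "Y", "Y", "N", "PU"),
  stringsAsFactors = FALSE
)

# The suggested subset() keeps every "Y" row as well as each first row,
# so samples 1.1 and 2.1 still appear more than once.
sub1 <- subset(Samps, !duplicated(ESR_ref_edit) | Loaded == "Y")

# Rank each sample by first appearance so the samples keep their original order.
first_seen <- match(Samps$ESR_ref_edit, unique(Samps$ESR_ref_edit))

# Within each sample, put rows with Loaded == "Y" first (FALSE sorts before
# TRUE). order() leaves remaining ties in their original order.
ord <- order(first_seen, Samps$Loaded != "Y")
Samps_sorted <- Samps[ord, ]

# The first row of each sample is now a "Y" row whenever one exists,
# so dropping duplicates keeps the preferred row.
Samps_working <- Samps_sorted[!duplicated(Samps_sorted$ESR_ref_edit), ]
print(Samps_working)
```

On the example data this keeps the rows with ESR_ref 1.1, 2.1.1, 3.1 and 4.1. Sample 4.1 has no "Y" row, so its first row (the "N") is retained.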
Thanks
Gareth
Gray Calhoun-2 wrote:
>
> Hi,
> Try:
>
> subset(Samps, !duplicated(Samps$ESR_ref_edit) | Samps$Loaded == "Y")
>
> I'd need specific code to be sure that this is exactly what you want
> (i.e. you would need to specify the input and the desired output), but
> indexing with a logical vector is probably going to be the solution.
>
> Best,
> Gray
>
> On Wed, Dec 16, 2009 at 7:55 PM, gcam <gcam032 at gmail.com> wrote:
>>
>> Hi all.
>>
>> So I have a data frame with multiple columns/variables. The first variable
>> is a major sample name for which there are some sub-samples. Currently I
>> have used the following command to remove the duplicates:
>>
>> Samps_working <- Samps[!duplicated(Samps$ESR_ref_edit), ]
>>
>> This removes all of the duplicated sample rows.
>>
>> However, I just realised that, of course, this keeps only the first
>> observation of each duplicated set, whereas I wish to retain any that have
>> the code "Y" in another variable, Samps$Loaded. I'm at a bit of a loss as
>> to how best to approach this problem.
>>
>> Just to reiterate: I want to remove all duplicate lines based on sample
>> name, but I want the lines to be removed with a preference given to those
>> that do not include a "Y" in the Loaded variable column.
>> --
>> View this message in context:
>> http://n4.nabble.com/Remove-duplicates-from-a-data-frame-but-with-some-special-requirements-tp965745p965745.html
>> Sent from the R help mailing list archive at Nabble.com.
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>
>
> --
> Gray Calhoun
>
> Assistant Professor of Economics
> Iowa State University
--
View this message in context: http://n4.nabble.com/Remove-duplicates-from-a-data-frame-but-with-some-special-requirements-tp965745p974312.html
Sent from the R help mailing list archive at Nabble.com.