[R] Remove duplicates from a data frame but with some special requirements
Gray Calhoun
gray.calhoun at gmail.com
Fri Dec 18 03:17:17 CET 2009
The easiest thing might be to sort on the Loaded column so that the
preferred codes come first, and then drop the duplicates:
#### start of example
d <- read.table(textConnection("ESR_ref ESR_ref_edit Loaded
1.1 1.1 Y
1.1.1 1.1 NC
1.1.2 1.1 Y
2.1 2.1 N
2.1.1 2.1 Y
2.1.2 2.1 PU
2.1.3 2.1 Y
3.1 3.1 Y
4.1 4.1 N
4.1.1 4.1 PU"), header = TRUE)
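# Rank the codes by preference: "Y" first, "N" last.
d$Loaded <- ordered(d$Loaded, levels = c("Y", "NC", "PU", "N"))
# order() leaves ties in their original order within each code.
dSorted <- d[order(d$Loaded),]
# duplicated() flags every row after the first one seen for each sample.
subset(dSorted, !duplicated(dSorted$ESR_ref_edit))
#### end of example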
You could also try using tapply.
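A rough sketch of what the tapply route could look like, assuming the same
preference order for the codes as above (Y, then NC, PU, N). The names
"keep" and "best" are only for illustration, and the data is rebuilt here
so the block runs on its own:

```r
# Same example data as above; colClasses keeps ESR_ref as text.
d <- read.table(text = "ESR_ref ESR_ref_edit Loaded
1.1 1.1 Y
1.1.1 1.1 NC
1.1.2 1.1 Y
2.1 2.1 N
2.1.1 2.1 Y
2.1.2 2.1 PU
2.1.3 2.1 Y
3.1 3.1 Y
4.1 4.1 N
4.1.1 4.1 PU", header = TRUE,
                colClasses = c("character", "numeric", "character"))

# Assumed preference order: "Y" is best, "N" is worst.
d$Loaded <- ordered(d$Loaded, levels = c("Y", "NC", "PU", "N"))

# For each sample, pick the row number whose Loaded code ranks best.
# which.min() returns the first such row when several rows tie.
keep <- tapply(seq_len(nrow(d)), d$ESR_ref_edit,
               function(i) i[which.min(as.integer(d$Loaded[i]))])

# as.vector() drops the array attributes before indexing the data frame.
best <- d[as.vector(keep), ]
print(best)
```

On the example data this should keep 1.1, 2.1.1, 3.1 and 4.1.1, the same
rows as the sorting approach, but without reordering the data frame first.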
--Gray
On Thu, Dec 17, 2009 at 11:31 AM, gcam <gcam032 at gmail.com> wrote:
>
> Thanks Gray,
>
> This helps, I'd completely forgotten about the subset command. However, it
> doesn't quite get me where I need. Perhaps an example will help. I will
> simplify my dataframe to the three important variables:
>
> ESR_ref ESR_ref_edit Loaded
> 1.1 1.1 Y
> 1.1.1 1.1 NC
> 1.1.2 1.1 Y
> 2.1 2.1 N
> 2.1.1 2.1 Y
> 2.1.2 2.1 PU
> 2.1.3 2.1 Y
> 3.1 3.1 Y
> 4.1 4.1 N
> 4.1.1 4.1 PU
>
> So I've created the "edit" variable so I can test for duplicates (i.e.
> samples with more than one sub-sample) because this is not of interest at
> this point. I just want one subsample per sample. However, consider 2.1:
> removing duplicates would leave only its first line, which has an "N".
> If at least one of the subsamples has a "Y", I want that line rather than
> a subsample with another code. 1.1 in this example works by default
> because the first subsample is a "Y", so that information is retained.
>
> Thanks
>
> Gareth
>
>
> Gray Calhoun-2 wrote:
>>
>> Hi,
>> Try:
>>
>> subset(Samps, !duplicated(Samps$ESR_ref_edit) | Samps$Loaded == "Y")
>>
>> I'd need specific code to be sure that this is exactly what you want
>> (i.e., you specify input and desired output), but indexing with a logical
>> vector is probably going to be the solution.
>>
>> Best,
>> Gray
>>
>> On Wed, Dec 16, 2009 at 7:55 PM, gcam <gcam032 at gmail.com> wrote:
>>>
>>> Hi all.
>>>
>>> So I have a data frame with multiple columns/variables. The first variable
>>> is a major sample name for which there are some sub-samples. Currently I
>>> have used the following command to remove the duplicates:
>>>
>>> Samps_working<-Samps[-c(which(duplicated(Samps$ESR_ref_edit))),]
>>>
>>> This removes all of the duplicated sample rows.
>>>
>>> However, I just realised that, of course, this keeps only the first
>>> observation of each duplicated set, whereas I wish to retain any that
>>> have the code "Y" in another variable, Samps$Loaded. I'm at a bit of a
>>> loss as to how best to approach this problem.
>>>
>>> Just to reiterate: I want to remove all duplicate lines based on sample
>>> name, but when choosing which lines to remove, preference should go to
>>> those that do not include a "Y" in the Loaded column.
>>> --
>>> View this message in context:
>>> http://n4.nabble.com/Remove-duplicates-from-a-data-frame-but-with-some-special-requirements-tp965745p965745.html
>>> Sent from the R help mailing list archive at Nabble.com.
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>>
>>
>> --
>> Gray Calhoun
>>
>> Assistant Professor of Economics
>> Iowa State University
>>
>
> --
> View this message in context: http://n4.nabble.com/Remove-duplicates-from-a-data-frame-but-with-some-special-requirements-tp965745p974312.html
>
--
Gray Calhoun
Assistant Professor of Economics
Iowa State University