[R] Remove duplicates from a data frame but with some special requirements

Gray Calhoun gray.calhoun at gmail.com
Fri Dec 18 03:17:17 CET 2009


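The easiest approach might be to sort on the Loaded column so that the "Y"
rows come first and then drop the duplicates; duplicated() keeps the first
row of each group, which will be a "Y" row whenever one exists: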

#### start of example
d <- read.table(textConnection("ESR_ref   ESR_ref_edit    Loaded
1.1          1.1                  Y
1.1.1        1.1                  NC
1.1.2        1.1                 Y
2.1           2.1                  N
2.1.1         2.1                 Y
2.1.2        2.1                  PU
2.1.3        2.1                   Y
3.1           3.1                  Y
4.1           4.1                  N
4.1.1        4.1                   PU"), header = TRUE)

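# Rank the codes in order of preference, so that "Y" sorts first
d$Loaded <- ordered(d$Loaded, levels = c("Y", "NC", "PU", "N"))
# Sort by preference; tied rows keep their original order
dSorted <- d[order(d$Loaded),]
# Keep the first (most preferred) row for each sample
subset(dSorted, !duplicated(dSorted$ESR_ref_edit))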
#### end of code
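
A possible variation on the example above, offered as a sketch rather than a tested recipe: ordering on ESR_ref_edit first and Loaded second keeps the result grouped by sample rather than by Loaded code. The data frame is rebuilt with data.frame() only so that the snippet runs on its own; the names pref, dBySample and result are made up for the illustration.

```r
# Same data as in the example, built directly so the snippet is self-contained
d <- data.frame(
  ESR_ref      = c("1.1", "1.1.1", "1.1.2", "2.1", "2.1.1",
                   "2.1.2", "2.1.3", "3.1", "4.1", "4.1.1"),
  ESR_ref_edit = c(1.1, 1.1, 1.1, 2.1, 2.1, 2.1, 2.1, 3.1, 4.1, 4.1),
  Loaded       = c("Y", "NC", "Y", "N", "Y", "PU", "Y", "Y", "N", "PU"),
  stringsAsFactors = FALSE
)

# Preference order for the codes (hypothetical name), most preferred first
pref <- c("Y", "NC", "PU", "N")
d$Loaded <- ordered(d$Loaded, levels = pref)

# Sort by sample, then by preference within each sample
dBySample <- d[order(d$ESR_ref_edit, d$Loaded), ]

# The first row of each sample is now its most preferred sub-sample
result <- dBySample[!duplicated(dBySample$ESR_ref_edit), ]
result
```

With this ordering, a sample that has no "Y" row falls back to its next most preferred code.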

You could also try using tapply to pick the preferred row within each sample.
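
For what it's worth, here is one way the tapply route might look. This is only a sketch: the data frame is rebuilt with data.frame() so the snippet runs on its own, and the names pref, rnk, keep and result are invented for the illustration. The idea is to rank each Loaded code by preference and have tapply return, for each sample, the row number with the best rank.

```r
# Same data as in the example, built directly so the snippet is self-contained
d <- data.frame(
  ESR_ref      = c("1.1", "1.1.1", "1.1.2", "2.1", "2.1.1",
                   "2.1.2", "2.1.3", "3.1", "4.1", "4.1.1"),
  ESR_ref_edit = c(1.1, 1.1, 1.1, 2.1, 2.1, 2.1, 2.1, 3.1, 4.1, 4.1),
  Loaded       = c("Y", "NC", "Y", "N", "Y", "PU", "Y", "Y", "N", "PU"),
  stringsAsFactors = FALSE
)

# Preference order for the codes (hypothetical name), most preferred first
pref <- c("Y", "NC", "PU", "N")

# Numeric rank of each row's code: 1 for "Y", 4 for "N"
rnk <- match(d$Loaded, pref)

# For each sample, return the row number with the best (lowest) rank
keep <- tapply(seq_len(nrow(d)), d$ESR_ref_edit,
               function(i) i[which.min(rnk[i])])

# Drop the array attributes before using the result as a row index
result <- d[as.integer(keep), ]
result
```

Since which.min() returns the first minimum, the earliest row wins when several sub-samples share the best code.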

--Gray

On Thu, Dec 17, 2009 at 11:31 AM, gcam <gcam032 at gmail.com> wrote:
>
> Thanks Gray,
>
> This helps; I'd completely forgotten about the subset command.  However, it
> doesn't quite get me where I need to be.  Perhaps an example will help.  I
> will simplify my data frame to the three important variables:
>
> ESR_ref   ESR_ref_edit    Loaded
> 1.1          1.1                  Y
> 1.1.1        1.1                  NC
> 1.1.2        1.1                 Y
> 2.1           2.1                  N
> 2.1.1         2.1                 Y
> 2.1.2        2.1                  PU
> 2.1.3        2.1                   Y
> 3.1           3.1                  Y
> 4.1           4.1                  N
> 4.1.1        4.1                   PU
>
> So I've created the "edit" variable so I can test for duplicates (i.e.
> samples with more than one sub-sample) because this is not of interest at
> this point.  I just want one subsample per sample.  However, consider 2.1:
> removing duplicates would leave only its first line, which has an "N".
> But if at least one of the subsamples has a "Y", then I want that line
> rather than a subsample with another code.  1.1 in this example works by
> default because the first subsample is a "Y", so that information is
> retained.
>
> Thanks
>
> Gareth
>
>
> Gray Calhoun-2 wrote:
>>
>> Hi,
>> Try:
>>
>> subset(Samps, !duplicated(Samps$ESR_ref_edit) | Samps$Loaded == "Y")
>>
>> I'd need specific code to be sure that this is exactly what you want
>> (ie you specify input and desired output), but indexing with a logical
>> vector is probably going to be the solution.
>>
>> Best,
>> Gray
>>
>> On Wed, Dec 16, 2009 at 7:55 PM, gcam <gcam032 at gmail.com> wrote:
>>>
>>> Hi all.
>>>
>>> So I have a data frame with multiple columns/variables.  The first variable
>>> is a major sample name for which there are some sub-samples.  Currently I
>>> have used the following command to remove the duplicates:
>>>
>>> Samps_working<-Samps[-c(which(duplicated(Samps$ESR_ref_edit))),]
>>>
>>> This removes all of the duplicated sample rows.
>>>
>>> However, I just realised that, of course, this keeps only the first
>>> observation of each duplicated set.  I wish instead to retain any that have
>>> the code "Y" in another variable, Samps$Loaded.  I'm at a bit of a loss as
>>> to how best to approach this problem.
>>>
>>> Just to reiterate: I want to remove all duplicate lines based on sample
>>> name, but with a preference for removing those that do not include a "Y"
>>> in the Loaded column.
>>> --
>>> View this message in context:
>>> http://n4.nabble.com/Remove-duplicates-from-a-data-frame-but-with-some-special-requirements-tp965745p965745.html
>>> Sent from the R help mailing list archive at Nabble.com.
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>>
>>
>> --
>> Gray Calhoun
>>
>> Assistant Professor of Economics
>> Iowa State University
>>
>
> --
> View this message in context: http://n4.nabble.com/Remove-duplicates-from-a-data-frame-but-with-some-special-requirements-tp965745p974312.html
> Sent from the R help mailing list archive at Nabble.com.
>



-- 
Gray Calhoun

Assistant Professor of Economics
Iowa State University



