[R] subset

Tue May 16 21:34:02 CEST 2006

On Tue, 2006-05-16 at 14:54 -0400, Guenther, Cameron wrote:
> Marc, 
> I have tried unique but unique looks at the entire row.  

Right, as I noted in the last line of my reply.

> I have a data
> set with a variable TRIPID.  The dataset has 469,000 rows.  In most
> cases TRIPID is a unique value.  However, in some cases I have the same
> TRIPID value but different values for other variables.  What this
> amounts to is a data entry error.  I need to get rid of the repeated
> rows that have the same TRIPID but different co-variables.  
> Thanks for your help.
> Cam 

If I am reading correctly, rather than retaining all unique rows, you
actually want to remove all rows with duplicated TRIPID values,
presuming that you don't know which row is correct.

In other words, if there are two rows with the same TRIPID value, you
want both rows removed?

I think what you want is this, presuming that 'x' is the data frame that
contains the column "TRIPID":

  NewDF <- subset(x, !TRIPID %in% TRIPID[duplicated(TRIPID)])

What this does is identify the actual values of TRIPID that are
duplicated (TRIPID[duplicated(TRIPID)]) and then subset 'x', retaining
only the rows of 'x' where the value of TRIPID is _not_ among the
duplicated values.
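
As a quick illustration, here is a small sketch on made-up data. The
data frame below and its CATCH column are hypothetical and stand in
for your real 'x'; only the TRIPID column name comes from your
description:

```r
# Hypothetical toy data: TRIPID 2 and TRIPID 4 were each entered more than once
x <- data.frame(TRIPID = c(1, 2, 2, 3, 4, 4, 4, 5),
                CATCH  = c(10, 20, 25, 30, 40, 41, 42, 50))

# The TRIPID values that occur more than once
dups <- unique(x$TRIPID[duplicated(x$TRIPID)])

# Keep only rows whose TRIPID is not among the duplicated values,
# so every copy of a repeated TRIPID is dropped
NewDF <- subset(x, !TRIPID %in% dups)

NewDF$TRIPID
```

With this toy input, only the rows for TRIPID 1, 3 and 5 should
remain, since every row sharing a TRIPID with another row is removed
rather than just the later copies.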

Check me on that though.

HTH,

Marc Schwartz

> On Tue, 2006-05-16 at 14:37 -0400, Guenther, Cameron wrote:
> > Hello everyone,
> > 
> > I have a large dataset (x) with some rows that have duplicate 
> > variables that I would like to remove.  I find which rows are the 
> > duplicates with X1<-which(duplicated(x)).  That gives me the rows with
> > duplicated variables.  Now, how can I remove just those rows from the
> > original data frame?  I think I can create a new data frame without
> > the duplicates using subset.  I have tried:
> > Subset(x,!x1) and subset(x,!x[x1,])
> > I can't seem to find the correct syntax.  Any advice?
> > Thanks in advance
> 
> Even easier would be to use unique():
> 
>   NewDF <- unique(x)
> 
> NewDF will contain rows from 'x' with duplicates removed.
> 
> See ?unique for more information.
> 
> unique(), which has a data.frame method, is basically:
> 
>   x[!duplicated(x), , drop = FALSE]
> 
> where 'drop = FALSE' covers the case where 'x' has only a single
> column, so that the result remains a data frame.
> 
> Note that the above presumes that you want to test all columns in 'x'
> for dups.
> 
> HTH,
> 
> Marc Schwartz