[R] how to delete specific rows in a data frame where the first column matches any string from a list

Andrew Choens andy.choens at gmail.com
Fri Feb 6 22:30:17 CET 2009


I regularly deal with a similar pattern at work. People send me these
big long .csv files and I have to run them through some pattern analysis
to decide which rows I keep and which rows I kill off.

As others have mentioned, Perl is a good candidate for this task.
Another option would be a quick SQL query. It should be a snap to pull
this into something like Access or OOo Base ... or, better yet, a
real database like Postgres or MySQL.

In case you aren't too familiar with SQL, this query could be done by
deleting the rows using a self join (syntax varies by product).
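
For comparison, the same self-match idea can be sketched in R without
leaving the session. This is only a sketch on made-up data: the data
frame `a` and its columns `id` and `val` are placeholders standing in
for file A, not names from the original files.

```r
# Dummy stand-in for file A: the first column is the key, the rest is payload.
a <- data.frame(id  = c("rs1", "rs2", "rs2", "rs3", "rs4", "rs4"),
                val = 1:6,
                stringsAsFactors = FALSE)

# Keys that occur more than once in the first column.
dup_ids <- unique(a$id[duplicated(a$id)])

# Keep only the rows whose key is not among the duplicated ones.
a_unique <- a[!(a$id %in% dup_ids), , drop = FALSE]
```

This drops every copy of a duplicated key, which is how I read the
request. If one copy of each key should survive instead,
a[!duplicated(a$id), ] would be the variant to try.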

But if the pattern is as simple as it sounds and/or this is a
one-time job, SQL is overkill for the situation.

I often use sed in places where Perl is overkill, but I can't think of
any way to match from row to row with sed. If anyone knows how to do
this with sed, it would (probably) be easier than trying to learn
Perl. And I would like to know how to do this with sed too.
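
That said, R itself may be enough here, assuming the entries in file B
are exact strings rather than regular expressions. Below is a minimal
sketch using %in% on made-up data; `a`, `b`, `id` and `val` are
placeholders. With the real files they would presumably come from
something like read.table() and readLines().

```r
# Dummy stand-ins: 'a' plays file A, 'b' plays the one-column file B.
a <- data.frame(id  = c("rs1", "rs2", "rs3", "rs4"),
                val = c(10, 20, 30, 40),
                stringsAsFactors = FALSE)
b <- c("rs2", "rs4", "rs9")

# TRUE for rows whose first column is NOT listed in b.
keep <- !(a[[1]] %in% b)
a_kept <- a[keep, , drop = FALSE]

# With an empty list nothing matches, so every row is kept.
none_listed <- a[!(a[[1]] %in% character(0)), , drop = FALSE]
```

Because this indexes with a logical vector rather than negative
positions, it does not hit the empty-result problem described in the
quoted message below when nothing matches.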


On Fri, 2009-02-06 at 16:04 -0500, Laura Rodriguez Murillo wrote:
> yep, it definitely sounds like a job for perl, but I don't know perl
> (unfortunately). I'm still stuck with this so I'm giving more details
> in case it helps:
> 
> I have file A with 382 columns and 300000 rows. There are rows where
> only the entry in first column is duplicated in other rows. In these
> cases, I need to delete the entire row.
> 
> I also have a file B (one column and around 280000 rows) with a list
> of the entries that are repeated. So I was trying to look for the ones
> that match and get rid of the entire row.
> 
> Thank you!
> 
> Laura
> 
> 2009/2/6 Wacek Kusnierczyk <Waclaw.Marcin.Kusnierczyk at idi.ntnu.no>:
> > Laura Rodriguez Murillo wrote:
> >> Thank you. I think grep would do it, but the list of expressions I
> >> need to match is too long so they are stored in a file.
> >
> > what does 'too long' mean?
> >
> >> So the
> >> question would be how I can tell R to look into that file to look for
> >> the expressions that I want to match.
> >>
> >
> > i guess you may still successfully use r for this, but to me it sounds
> > like a perfect job for perl.  let me know if you need more help.
> >
> > note, in the below, you'd use 'data[,2]' instead of 'd[,2]' (or 'd'
> > instead of 'data').  sorry for the typo.  mark, thanks for pointing this
> > out -- the more obvious the mistake, the less visible ;)
> >
> > vQ
> >
> >
> >> Thank you again for your help
> >>
> >> Laura
> >>
> >> 2009/2/6 Wacek Kusnierczyk <Waclaw.Marcin.Kusnierczyk at idi.ntnu.no>:
> >>
> >>> Laura Rodriguez Murillo wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> I'm new to the mailing list but I would appreciate it if you could help
> >>>> me with this:
> >>>> I have a big matrix from which I need to delete specific rows. The
> >>>> second entry in the rows to delete should match any string within a
> >>>> list (another file with just one column).
> >>>> Thank you so much!
> >>>>
> >>>>
> >>>>
> >>> here's one way to do it, illustrated with dummy data:
> >>>
> >>> # dummy character matrix
> >>> data = matrix(replicate(20, paste(sample(letters, 20), collapse="")),
> >>> ncol=2)
> >>>
> >>> # drop rows whose second column contains 'a'
> >>> data[-grep('a', d[,2]),]
> >>>
> >>> this will work also if your data is actually a data frame:
> >>>
> >>> data = as.data.frame(data)
> >>> data[-grep('a', d[,2]),]
> >>>
> >>> note, because negating an empty index vector selects nothing, this won't
> >>> work correctly if there are *no* rows that match the pattern:
> >>>
> >>> data[-grep('1', d[,2]),]
> >>> # should return all of data, but returns an empty matrix
> >>>
> >>> with the upcoming version of r, grep will have an additional argument
> >>> which will make this problem easy to fix:
> >>>
> >>> data[grep('a', d[,2], invert=TRUE),]
> >>>
> >>>
> >>> vQ
> >
> >
> 
-- 
This is the price and the promise of citizenship.
        -- Barack Obama, 44th President of the United States