[R] how to delete specific rows in a data frame where the first column matches any string from a list

Fri Feb 6 22:04:55 CET 2009

yep, it definitely sounds like a job for perl, but I don't know perl
(unfortunately). I'm still stuck with this so I'm giving more details
in case it helps:

I have file A with 382 columns and 300000 rows. In some rows, the
entry in the first column also appears in the first column of other
rows (only that entry is duplicated, not the rest of the row). In
these cases, I need to delete the entire row.

I also have a file B (one column and around 280000 rows) with a list
of the entries that are repeated. So I was trying to look for the ones
that match and get rid of the entire row.
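
For reference, the matching step described above could be sketched in R
roughly as follows. This is only a sketch under assumptions: the objects
'a' and 'b' are small hypothetical stand-ins for file A and file B, and
it uses exact matching with %in% rather than grep, since the entries in
file B are complete strings and not patterns.

```r
# Hypothetical stand-ins for file A and file B. In practice they might be
# loaded with read.table() and readLines(); the real file names are not
# given in this thread.
a <- data.frame(id = c("x1", "x2", "x3", "x4"),
                v1 = c(10, 20, 30, 40),
                stringsAsFactors = FALSE)
b <- c("x2", "x4")

# Keep only the rows whose first column is NOT listed in b.
# %in% compares whole strings, so "x2" does not match "x20".
kept <- a[!(a[[1]] %in% b), , drop = FALSE]
print(kept)
```

Because %in% returns one TRUE/FALSE value per row, this also behaves
sensibly when nothing in the first column matches: all rows are kept.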

Thank you!

Laura

2009/2/6 Wacek Kusnierczyk <Waclaw.Marcin.Kusnierczyk at idi.ntnu.no>:
> Laura Rodriguez Murillo wrote:
>> Thank you. I think grep would do it, but the list of expressions I
>> need to match is too long so they are stored in a file.
>
> what does 'too long' mean?
>
>> So the
>> question would be how I can tell R to look into that file to look for
>> the expressions that I want to match.
>>
>
> i guess you may still successfully use r for this, but to me it sounds
> like a perfect job for perl.  let me know if you need more help.
>
> note, in the below, you'd use 'data[,2]' instead of 'd[,2]' (or 'd'
> instead of 'data').  sorry for the typo.  mark, thanks for pointing this
> out -- the more obvious the mistake, the less visible ;)
>
> vQ
>
>
>> Thank you again for your help
>>
>> Laura
>>
>> 2009/2/6 Wacek Kusnierczyk <Waclaw.Marcin.Kusnierczyk at idi.ntnu.no>:
>>
>>> Laura Rodriguez Murillo wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm new to the mailing list, but I would appreciate it if you could
>>>> help me with this:
>>>> I have a big matrix from which I need to delete specific rows. The
>>>> second entry of each row to delete should match any string within a
>>>> list (another file with just one column).
>>>> Thank you so much!
>>>>
>>>>
>>>>
>>> here's one way to do it, illustrated with dummy data:
>>>
>>> # dummy character matrix
>>> data = matrix(replicate(20, paste(sample(letters, 20), collapse="")),
>>> ncol=2)
>>>
>>> # drop rows whose second column matches 'a'
>>> data[-grep('a', d[,2]),]
>>>
>>> this will work also if your data is actually a data frame:
>>>
>>> data = as.data.frame(data)
>>> data[-grep('a', d[,2]),]
>>>
>>> note, due to the way negative indexing handles an empty grep result,
>>> this won't work correctly if there are *no* rows that match the pattern:
>>>
>>> data[-grep('1', d[,2]),]
>>> # should return all of data, but returns an empty matrix
>>>
>>> with the upcoming version of r, grep will have an additional argument
>>> which will make this problem easy to fix:
>>>
>>> data[grep('a', d[,2], invert=TRUE),]
>>>
>>>
>>> vQ
>
>
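
The empty-match pitfall mentioned in the quoted message can also be
sidestepped with a logical index instead of a negative one. The sketch
below assumes a small made-up matrix 'm' (not the poster's data) and
uses grepl, which returns one TRUE/FALSE value per element.

```r
# Small made-up character matrix: column 1 holds row labels,
# column 2 holds the strings to be searched.
m <- matrix(c("r1", "r2", "r3", "abc", "bcd", "cde"), ncol = 2)

# Negative indexing: no string in column 2 contains "1", so grep() returns
# integer(0), and m[-integer(0), ] selects zero rows instead of all of them.
dropped_all <- m[-grep("1", m[, 2]), , drop = FALSE]

# Logical indexing: !grepl() is TRUE for every non-matching row,
# so all rows are kept when nothing matches.
kept_all <- m[!grepl("1", m[, 2]), , drop = FALSE]

# With a pattern that does match ("abc" contains "a"), only that row goes.
kept_some <- m[!grepl("a", m[, 2]), , drop = FALSE]
```

The 'drop = FALSE' argument keeps the result a matrix even when only one
row survives.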