[R] Selecting rows from a DF where the value in a selected column matches any element of a vector.

Sun Apr 13 00:52:04 CEST 2014

Hi Andrew,

On Apr 12, 2014, at 6:36 PM, Andrew Hoerner <ahoerner at rprogress.org> wrote:

> Thanks Sarah! That worked!
> 
> And you are quite right about the absence of parentheses and "EC07_A1$" 's.
> I apologize for sending that code snip -- I am not quite sure how I managed
> to do it, since I had already fixed those problems and changed the code in
> order to get the error message I posted.
> 
> Apropos of nothing in particular, before I could successfully implement
> your fix, I also had to learn another new thing. When saving a CSV file
> with write.table, if you use sep=", " (that's double-quote comma space
> double-quote) R puts the space _inside_ the quotation marks around
> character variables. I'm not sure I would call that a bug, but I bet more
> people are surprised by it than expect it.

It shouldn’t; that’s incorrect. Can you provide a reproducible example?
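
For what it's worth, this is the sort of minimal check I have in mind. It is only a sketch: the toy data frame `d` and its column names are made up for illustration and are not taken from your data. With file left at its default, write.table() prints to the console, so capture.output() can collect the lines without creating a file.

```r
# Toy data frame; the names and values are invented for illustration.
d <- data.frame(id = c("a", "b"), sector = c("32", "44"),
                stringsAsFactors = FALSE)

# Collect the lines write.table() would write when sep = ", ".
out <- capture.output(write.table(d, sep = ", ", row.names = FALSE))

# Print the raw lines to see where the space falls relative to the quotes.
writeLines(out)
```

I would expect the space to land between the quoted fields rather than inside them; if your output looks different, please post the exact call and a few lines of the file.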

When I look at your code & my reply, I notice that the quote marks are wrong too; could that be the actual problem?
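
In case it helps, here is the same idea with every code fully quoted and comma-separated, run against a made-up four-row stand-in for EC07_A1. The rows and the helper names `geo_codes`, `sector_codes`, `keep` and `collapsed` are introduced here purely for illustration. Treat it as a sketch rather than a tested solution for your file.

```r
# Toy stand-in for EC07_A1; the rows are invented for illustration.
EC07_A1 <- data.frame(
  GEO_ID = c("01000US", "04000US06", "99000US00", "01000US"),
  SECTOR = c("32", "99", "44", "45"),
  stringsAsFactors = FALSE
)

geo_codes <- c("01000US", "04000US06", "33000US488", "31000US41860",
               "31400US4186036084", "05000US06001", "E6000US0600153000")
sector_codes <- c("32", "33", "42", "44", "45", "51", "54", "61", "71", "81")

# %in% returns one TRUE/FALSE per row, so it can index the rows directly.
keep <- EC07_A1$GEO_ID %in% geo_codes & EC07_A1$SECTOR %in% sector_codes
ECwork <- EC07_A1[keep, ]

# By contrast, any(x == codes) recycles the shorter vector along the longer
# one and then collapses to a single TRUE/FALSE, so it cannot pick out rows.
collapsed <- suppressWarnings(any(EC07_A1$SECTOR == sector_codes))
```

Here `keep` has one element per row of the data frame, whereas `collapsed` is a single logical value, which is why the any(... == ...) form cannot be used to select rows.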

Sarah

> 
> Again, many thanks!
> 
> Andrew
> 
> 
> On Sat, Apr 12, 2014 at 6:04 AM, Sarah Goslee <sarah.goslee at gmail.com> wrote:
> 
>> You need %in% instead.
>> 
>> This is untested, but something like this should work:
>> 
>> 
>> ECwork  <-  EC07_A1[ EC07_A1$GEO_ID %in% c("01000US", "04000US06",
>>      "33000US488", "31000US41860", "31400US4186036084" "05000US06001",
>>      "E6000US0600153000") &
>>      EC07_A1$SECTOR %in% c("32", "33", "42", 44", 45", 51", 54", 61",
>>      "71", "81"), ]
>> 
>> (Note that your original code snippet had a shortage of ) and didn't
>> specify the data frame from which to take the columns.)
>> 
>> Sarah
>> 
>> On Sat, Apr 12, 2014 at 8:36 AM, Andrew Hoerner <ahoerner at rprogress.org>
>> wrote:
>>> Dear Folks--
>>> I have a file with 3 million-odd rows of data from the 2007 U.S. Economic
>>> Census. I am trying to pare it down to a subset of rows that both (1) has
>>> any one of a vector of NAICS economic sector codes, and (2) also has any
>>> one of a vector of geographic ID codes.
>>> 
>>> Here is the code I am trying to use.
>>> 
>>> ECwork  <-  EC07_A1[ any(GEO_ID == c("01000US", "04000US06",
>>>      "33000US488", "31000US41860", "31400US4186036084" "05000US06001",
>>>      "E6000US0600153000") &
>>>      any(SECTOR == c("32", "33", "42", 44", 45", 51", 54", 61", "71",
>>>      "81"), ]
>>> 
>>> I get back the following error:
>>> 
>>> Warning message:
>>> In EC07_A1$SECTOR == c("32", "33", "42", "44", "45", "51", "54",  :
>>>  longer object length is not a multiple of shorter object length
>>> 
>>> I see what R is doing.  Instead of comparing each element of the column
>>> SECTOR to the row vector of codes, and returning a logical vector of the
>>> length of SECTOR with rows marked as TRUE that match any of the codes, it
>>> is lining my code list up with SECTOR as a column vector and doing
>>> element-by-element testing, and then recycling the code list over three
>>> million rows. But I am not sure how to make it do what I want -- test the
>>> sector code in each row against the vector of codes I am looking for. I
>>> would be grateful if anyone could suggest an alternative that would achieve
>>> my ends.
>>> 
>>> Oh, and I would add, if there is a way of correctly doing this with
>>> the extract function [], I would like to know what it is. If not, I guess
>>> I'd like to know that too.
>>> 
>>> Sincerely, Andrew Hoerner
>>> 
>>> --
>>> J. Andrew Hoerner
>>> Director, Sustainable Economics Program
>>> Redefining Progress
>>> (510) 507-4820
>>> 
>> --
>> Sarah Goslee
>> http://www.functionaldiversity.org
>>