[R] Selecting rows from a DF where the value in a selected column matches any element of a vector.

Sun Apr 13 02:19:24 CEST 2014

See inline.

On Apr 12, 2014, at 7:04 PM, Andrew Hoerner <ahoerner at rprogress.org> wrote:

> Oops! Spoke too soon.
> 
> Your fix fixed the problem I was having before, but it turns out the test is now accepting every line. So there is still some problem with the logic or with my implementation of it.
> 
> I thought I should produce a reproducible example without 3 million lines of data. I made a version with only the geography information and test. Here is the code I am now using, applied to a file with only the first 8 lines of my geo data in it:

Please use dput() to provide data, rather than copy & paste, since that’s not reproducible.
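
For anyone unsure what that looks like in practice, here is a minimal sketch. The data frame geo is a hypothetical stand-in for the first few rows of the geography file, not the real data, and the round trip through capture.output() is only there to show that the dput() text rebuilds the same object.

```r
# Hypothetical stand-in for the first rows of the geography data
geo <- data.frame(
  originalRow = c("1", "3115", "5501"),
  GEO_ID      = c("01000US", "04000US01", "04000US02"),
  GEOGRAPHY   = c("United States", "Alabama", "Alaska"),
  stringsAsFactors = FALSE
)

# dput() prints an R expression that rebuilds the object;
# this printed text is what gets pasted into the email
dput(geo)

# Round trip: capture that text and evaluate it to recover the object
txt  <- paste(capture.output(dput(geo)), collapse = "\n")
geo2 <- eval(parse(text = txt))
all.equal(geo, geo2)
```

A reader can paste the dput() output straight into their own session, which is not possible with a printed table.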

> 
> First I read the data in and print it out:
> 
> GEOshort.DF <- read.table("C:\\Users\\andrewH\\Documents\\Oakland Tech Project\\GEO_short.csv", 
>                       header = FALSE, sep = ",", quote = "\"",  dec = ".", skip=1, col.names=
>                       c("originalRow", "GEO_ID", "GEOGRAPHY"), fill = TRUE, colClasses="character")
> 
> Which yields:
> 
> > GEOshort.DF
>   originalRow    GEO_ID     GEOGRAPHY
> 1           1   01000US United States
> 2        3115 04000US01       Alabama
> 3        5501 04000US02        Alaska
> 4        7924 04000US04       Arizona
> 5       10571 04000US05      Arkansas
> 6       14342 04000US06    California
> 7       17913 04000US08      Colorado
> 8       20442 04000US09   Connecticut
> 
> 
> Then I try to select the rows that match my geo-codes:
> 
> GEOextract.DF  <- GEOshort.DF[
>   any(GEOshort.DF$GEO_ID %in% c("01000US", "04000US06", "33000US488", "31000US41860", 
>                               "31400US4186036084", "05000US06001", "E6000US0600153000")), ]

But that's not what I suggested: if you use any(), then if there are any matches it will return a single TRUE, and that one value is recycled across every row, so you get all the rows. You need:

GEOextract.DF  <- GEOshort.DF[
GEOshort.DF$GEO_ID %in% c("01000US", "04000US06", "33000US488", "31000US41860", 
                              "31400US4186036084", "05000US06001", "E6000US0600153000"), ]

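To see the difference on the eight rows shown above, here is a small sketch. The names geo and wanted are made up for illustration, and the data frame is retyped from the printed output rather than read from the original file.

```r
# Eight rows retyped from the printed data; names are hypothetical
geo <- data.frame(
  GEO_ID    = c("01000US", "04000US01", "04000US02", "04000US04",
                "04000US05", "04000US06", "04000US08", "04000US09"),
  GEOGRAPHY = c("United States", "Alabama", "Alaska", "Arizona",
                "Arkansas", "California", "Colorado", "Connecticut"),
  stringsAsFactors = FALSE
)
wanted <- c("01000US", "04000US06", "33000US488", "31000US41860",
            "31400US4186036084", "05000US06001", "E6000US0600153000")

# any() collapses the test to one TRUE, which is recycled over all rows
all_rows  <- geo[any(geo$GEO_ID %in% wanted), ]

# %in% alone gives one TRUE/FALSE per row, so only matching rows are kept
some_rows <- geo[geo$GEO_ID %in% wanted, ]

nrow(all_rows)       # 8
nrow(some_rows)      # 2
some_rows$GEOGRAPHY  # "United States" "California"
```
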
You can check this yourself by running just the logical portion: compare

any(GEOshort.DF$GEO_ID %in% c("01000US", "04000US06", "33000US488", "31000US41860", 
                              "31400US4186036084", "05000US06001", "E6000US0600153000"))
with

GEOshort.DF$GEO_ID %in% c("01000US", "04000US06", "33000US488", "31000US41860", 
                              "31400US4186036084", "05000US06001", "E6000US0600153000")

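The same kind of check also shows why == with a vector of codes gave the warning in the original post. This is only a sketch with short, made-up vectors (sector and codes are hypothetical, not the census data):

```r
sector <- c("32", "44", "81", "11", "32")  # hypothetical column values
codes  <- c("32", "33", "42", "44")        # hypothetical, shortened code list

# == compares position by position, recycling the shorter vector;
# with lengths 5 and 4 R also warns about the mismatch
eq  <- suppressWarnings(sector == codes)

# %in% asks, for each element of sector, whether it occurs anywhere in codes
inn <- sector %in% codes

eq                # TRUE FALSE FALSE FALSE  TRUE
inn               # TRUE  TRUE FALSE FALSE  TRUE
length(any(inn))  # 1: any() reduces the whole vector to one value
```

Only the %in% result has one meaningful answer per row, which is what the row index inside [ ] needs.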
> 
>         "... But pattern-matching doesn't equal comprehension."  --Peter Watts

Happy to help a Peter Watts fan. 

Sarah

> 
> 
> On Sat, Apr 12, 2014 at 6:04 AM, Sarah Goslee <sarah.goslee at gmail.com> wrote:
> You need %in% instead.
> 
> This is untested, but something like this should work:
> 
> 
> ECwork  <-  EC07_A1[ EC07_A1$GEO_ID %in% c("01000US", "04000US06", "33000US488",
> "31000US41860", "31400US4186036084", "05000US06001", "E6000US0600153000") &
>       EC07_A1$SECTOR %in% c("32", "33", "42", "44", "45", "51", "54", "61", "71",
> "81"), ]
> 
> (Note that your original code snippet had a shortage of ) and didn't
> specify the data frame from which to take the columns.)
> 
> Sarah
> 
> On Sat, Apr 12, 2014 at 8:36 AM, Andrew Hoerner <ahoerner at rprogress.org> wrote:
> > Dear Folks--
> > I have a file with 3 million-odd rows of data from the 2007 U.S. Economic
> > Census. I am trying to pare it down to a subset of rows that both (1) has
> > any one of a vector of NAICS economic sector codes, and (2) also has any
> > one of a vector of geographic ID codes.
> >
> > Here is the code I am trying to use.
> >
> > ECwork  <-  EC07_A1[ any(GEO_ID == c("01000US", "04000US06", "33000US488",
> > "31000US41860", "31400US4186036084" "05000US06001", "E6000US0600153000") &
> >       any(SECTOR == c("32", "33", "42", 44", 45", 51", 54", 61", "71",
> > "81"), ]
> >
> > I get back the following warning:
> >
> > Warning message:
> > In EC07_A1$SECTOR == c("32", "33", "42", "44", "45", "51", "54",  :
> >   longer object length is not a multiple of shorter object length
> >
> > I see what R is doing.  Instead of comparing each element of the column
> > SECTOR to the row vector of codes, and returning a logical vector of the
> > length of SECTOR with rows marked as TRUE that match any of the codes, it
> > is lining my code list up with SECTOR as a column vector and doing
> > element-by-element testing, and then recycling the code list over three
> > million rows. But I am not sure how to make it do what I want -- test the
> > sector code in each row against the vector of codes I am looking for. I
> > would be grateful if anyone could suggest an alternative that would achieve
> > my ends.
> >
> > Oh, and I would add, if there is a way of correctly doing this with
> > the extract function [], I would like to know what it is. If not, I guess
> > I'd like to know that too.
> >
> > Sincerely, Andrew Hoerner
> >
> > --
> > J. Andrew Hoerner
> > Director, Sustainable Economics Program
> > Redefining Progress
> > (510) 507-4820
> >
> --
> Sarah Goslee
> http://www.functionaldiversity.org