[R] subset and na.rm not really suppressing <NA> values

Thu Jan 23 03:59:12 CET 2014

I don't think na.rm is a valid parameter for the subset function. I would normally use the is.na function to logically test for NA values. I also don't know where your VALID_EMAIL variable is coming from.

a <- subset(mydf, !is.na(EMAIL_ADDRESS))
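To make the difference concrete, here is a minimal sketch on made-up data. The contents of mydf below are hypothetical, and I am assuming VALID_EMAIL is a logical column, since the original post does not say. As far as I can tell, subset on a data frame passes unknown arguments such as na.rm into "..." and silently ignores them, so the filter only ever looks at the condition you give it.

```r
# Hypothetical stand-in for the poster's data; VALID_EMAIL is assumed to be logical.
mydf <- data.frame(
  EMAIL_ADDRESS = c("a@example.com", NA, "not-an-email", NA),
  VALID_EMAIL   = c(TRUE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# The condition only tests VALID_EMAIL, so rows with a missing address are kept.
a <- subset(mydf, VALID_EMAIL == FALSE, select = EMAIL_ADDRESS)

# Testing EMAIL_ADDRESS explicitly with is.na() drops those rows.
b <- subset(mydf, VALID_EMAIL == FALSE & !is.na(EMAIL_ADDRESS),
            select = EMAIL_ADDRESS)

print(a)
print(b)
```

In this sketch a still contains the two rows whose address is missing, while b keeps only the row with an actual (if invalid) address.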

The na.strings argument to read.csv and friends is used to help recognise strings in the input that should be treated as NA. If the literal text "<NA>" does not appear in your input file, then adding it to na.strings will have no effect on the data import.
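A small sketch of that point, using an inline CSV string in place of your file (csv_text and its contents are hypothetical). The "<NA>" you see on screen is just how R prints a missing value in a character or factor column; it is not text that came from the file.

```r
# Hypothetical three-row input; the second EMAIL_ADDRESS field is empty.
csv_text <- "ID,EMAIL_ADDRESS\n1,a@example.com\n2,\n3,NA\n"

# Same import twice, with and without "<NA>" in na.strings.
with_empty <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                       na.strings = c("NA", ""))
with_extra <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                       na.strings = c("NA", "", "<NA>"))

# Which addresses were read as missing?
print(is.na(with_empty$EMAIL_ADDRESS))

# The extra "<NA>" entry changes nothing because that text is not in the input.
print(identical(with_empty, with_extra))
```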

---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
--------------------------------------------------------------------------- 
Sent from my phone. Please excuse my brevity.

Jeff Johnson <mrjefftoyou at gmail.com> wrote:
>I have a dataset "mydf" with a field EMAIL_ADDRESS. When importing, I
>specified:
>mydf <- read.csv(file = extract, header = TRUE, stringsAsFactors = FALSE,
>                 na.strings = c("NA", ""))
>
>I've also tried setting na.strings = c("NA", "", "<NA>") but I don't know if
>it's appropriate to put <NA> there.
>
>I'm running
>a <- subset(mydf, VALID_EMAIL == FALSE, na.rm = TRUE, select = EMAIL_ADDRESS)
>dput(head(a,5))
>
>structure(list(EMAIL_ADDRESS = c(NA_character_, NA_character_,
>NA_character_, NA_character_, NA_character_)), .Names = "EMAIL_ADDRESS",
>row.names = c(17L, 22L, 23L, 24L, 30L), class = "data.frame")
>
>The results show a lot of <NA> values on screen and in the dput statement.
>
>I don't quite understand why it is doing that. I would have expected it to
>exclude those since I had the na.rm = TRUE statement. Do you have any
>suggestions?
>
>Thanks!