[R] na.omit - Is it working properly?

Wed May 4 08:02:51 CEST 2011

On May 3, 2011, at 21:18 , Kalicin, Sarah wrote:

> 
> I have a workaround for this, but can someone explain why the first example does not work properly? I believe it worked in the previous version of R, selecting just the rows with F.WW=200525 and omitting the NAs. I just upgraded to 2.13. I am also concerned that the row numbers differ between the selections; should I be worried? FYI, I show only the first few rows for demonstration, so please do not worry that the number of rows shown is not equal. - Sarah
> 
> With na.omit around the column, the result shows values in the F.WW column other than 200525, along with NA. I was hoping that this would omit all the NAs and show all the rows where P$F.WW=200525. I believe it did with the previous version of R.

That's highly unlikely. na.omit(P$F.WW) has fewer elements than there are rows in P, so you get vector recycling in the style of

> thuesen[c(F,F,F,F,T),]
   blood.glucose short.velocity
5            7.2           1.27
10          12.2           1.22
15           6.7           1.52
20          16.1           1.05

(now why don't we get the usual warning about "not a multiple of" in this case?)
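
For readers without the thuesen data at hand, here is a minimal sketch of the same recycling effect. The data frame d and its values are made up for illustration and are not from the original post.

```r
# Toy data frame standing in for the real data (hypothetical values)
d <- data.frame(id = 1:10, x = c(5, NA, 7, 5, NA, 5, 8, 5, 9, 5))

# A logical index of length 5 applied to a data frame with 10 rows
short <- c(FALSE, FALSE, FALSE, FALSE, TRUE)

# The index is recycled to length 10, so every fifth row is selected
picked <- d[short, ]
print(picked$id)
```

Here the short index selects rows 5 and 10, just as the length-5 index above selects every fifth row of thuesen.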

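Worse, if you omit observations prior to comparison, the result won't line up. E.g. in the thuesen data, obs. 16 has a missing short.velocity, so after na.omit() every later value sits one position earlier and the wrong rows are picked: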

> thuesen[na.omit(thuesen$short.velocity)==1.12,]
   blood.glucose short.velocity
16           8.6             NA
22           4.9           1.03

whereas in fact 

> subset(thuesen, short.velocity==1.12)
   blood.glucose short.velocity
17           4.2           1.12
23           8.8           1.12
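
The same misalignment can be reproduced on a small made-up data frame. This is only a sketch; d, id and x are hypothetical names and values chosen for illustration.

```r
# Toy data: one missing value in position 2 (hypothetical values)
d <- data.frame(id = 1:6, x = c(2, NA, 3, 7, 3, 1))

# Comparison done on the NA-free vector: length 5, positions shifted by one
shifted <- na.omit(d$x) == 3

# Applied to the 6-row data frame, the TRUEs land on the wrong rows
wrong <- d[shifted, ]

# Comparison done on the full column, with NA handled by subset()
right <- subset(d, x == 3)

print(wrong$id)   # rows picked by the shifted index
print(right$id)   # rows where x really equals 3
```

In this sketch the shifted index picks rows 2 and 4 (neither of which has x equal to 3), while subset() picks rows 3 and 5.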

> P[na.omit(P$F.WW)==200525, c(51, 52)]
>          F.WW        R.WW
> 45      200525          NA
> 53          NA          NA
> 61      200534      200534
> 63      200608      200608
> 66      200522      200541
> 80          NA          NA
> 150     200521      200516
> 231     200530      200530
> 
> Without na.omit, the F.WW==200525 selection seems to work, but lots of NAs are included. This is what is expected!! The row numbers are not the same as in the above example, except for the first row.
>> P[P$F.WW==200525, c(51, 52)]
>            F.WW     R.WW
> 45        200525          NA
> NA            NA          NA
> NA.1          NA          NA
> NA.2          NA          NA
> NA.3          NA          NA
> 57        200525      200526
> 65        200525          NA
> 67        200525          NA
> 70        200525      200525
> NA.4          NA          NA
> NA.5          NA          NA
> 86        200525          NA

Presumably, a number of rows got omitted here? The NAs are a bit of a pain, but that's the way things work: if R cannot tell whether an observation should be included (the comparison evaluates to NA), you get an NA-filled row.

> thuesen[thuesen$short.velocity==1.12,]
   blood.glucose short.velocity
NA            NA             NA
17           4.2           1.12
23           8.8           1.12

To avoid this, you explicitly test for NA using is.na() or use subset(), which does it internally.
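
A minimal sketch of the two approaches side by side, again on a made-up data frame (d, id and x are hypothetical and not part of the original data):

```r
# Toy data with one missing value (hypothetical values)
d <- data.frame(id = 1:6, x = c(2, NA, 3, 7, 3, 1))

# Plain comparison: the NA in the index produces an NA-filled row
with_na <- d[d$x == 3, ]

# Explicit NA test combined with the comparison
explicit <- d[!is.na(d$x) & d$x == 3, ]

# subset() treats an NA condition as "do not select"
via_subset <- subset(d, x == 3)

print(with_na)
print(explicit)
print(via_subset)
```

Both explicit and via_subset should contain only rows 3 and 5, whereas with_na carries an extra all-NA row.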

> 
> na.omit excludes the NAs. This is what I want. My concern is why the row numbers do not match any of those shown in the examples above.
>> na.omit(P[P$F.WW==200525, c(51, 52)])
>        F.WW        R.WW
> 57    200525      200526
> 70    200525      200525
> 161   200525      200525
> 245   200525      200525
> 246   200525      200525
> 247   200525      200526
> 256   200525      200525
> 266   200525      200525
> 269   200525      200525
> 271   200525      200526
> 276   200525      200526
> 278   200525      200526
> 

Well, now you remove rows with NA _anywhere_, so e.g. row #65 is out because R.WW is missing. I expect #161 and higher were just chopped from the earlier list.
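
To illustrate the difference, here is a small sketch. The column names F.WW and R.WW follow the post, but the data frame d, the id column and all values are invented for the example.

```r
# Toy data (hypothetical values); row 1 has F.WW present but R.WW missing
d <- data.frame(id   = 1:5,
                F.WW = c(200525, NA, 200525, 200534, 200525),
                R.WW = c(NA, NA, 200526, 200534, 200525))

# Select on F.WW only: row 1 is kept even though R.WW is NA
sel <- subset(d, F.WW == 200525)

# na.omit() on the data frame drops every row with an NA in any column
complete_only <- na.omit(sel)

print(sel)
print(complete_only)
```

In this sketch sel keeps rows 1, 3 and 5, and na.omit() then removes row 1. The row names in both results are the labels of the original rows, which is why they do not run 1, 2, 3, ...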

In short, nothing out of the ordinary seems to be going on here.

-- 
Peter Dalgaard
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com