[R] na.omit - Is it working properly?
peter dalgaard
pdalgd at gmail.com
Wed May 4 08:02:51 CEST 2011
On May 3, 2011, at 21:18 , Kalicin, Sarah wrote:
>
> I have a workaround for this, but can someone explain why the first example does not work properly? I believed it worked in the previous version of R, by selecting just the rows=200525 and omitting the NA's. I just upgraded to 2.13. I am also concerned about the row numbers being different in the selections; should I be worried? FYI, I just selected the first few rows for demonstration, so please do not worry that the number of rows shown is not equal. - Sarah
>
> With na.omit around the column, but it is showing other values in the F.WW column other than 200525, along with NA. I was hoping that this would omit all the NA's, and show all the rows that P$F.WW=200525. I believe it did with the previous version of R.
That's highly unlikely. na.omit(P$F.WW) has fewer elements than there are rows in P, so you get vector recycling in the style of
> thuesen[c(F,F,F,F,T),]
blood.glucose short.velocity
5 7.2 1.27
10 12.2 1.22
15 6.7 1.52
20 16.1 1.05
(now why don't we get the usual warning about "not a multiple of" in this case?)
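A minimal sketch of the recycling, using a made-up data frame d (hypothetical, not the thuesen data): a logical index shorter than the number of rows is silently repeated until it covers all rows, and a comparison on an na.omit()'ed column is exactly such a short index.

```r
# Hypothetical toy data: 6 rows, two missing values in x.
d <- data.frame(id = 1:6, x = c(10, NA, 30, 40, NA, 60))

# A logical index of length 3 is recycled to F T F F T F over the 6 rows.
idx <- c(FALSE, TRUE, FALSE)
picked <- d[idx, ]
print(picked$id)        # rows 2 and 5 are selected

# na.omit() drops the NA entries, so the comparison is shorter than nrow(d).
short <- na.omit(d$x) == 30
print(length(short))    # 4 elements
print(nrow(d))          # 6 rows
```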
Worse, if you omit observations prior to comparison, the result won't line up with the rows. E.g. in the thuesen data, obs. 16 has a missing short.velocity, so
> thuesen[na.omit(thuesen$short.velocity)==1.12,]
blood.glucose short.velocity
16 8.6 NA
22 4.9 1.03
whereas in fact
> subset(thuesen, short.velocity==1.12)
blood.glucose short.velocity
17 4.2 1.12
23 8.8 1.12
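The same misalignment can be reproduced on a small made-up data frame (d is hypothetical, chosen only for illustration): once the NA entries are removed, position k in the shortened vector no longer refers to row k, so the wrong rows come back.

```r
# Hypothetical toy data: the value 30 sits in rows 3 and 6.
d <- data.frame(id = 1:6, x = c(10, NA, 30, 40, NA, 30))

# na.omit(d$x) is c(10, 30, 40, 30); the comparison gives F T F T,
# which is then recycled over 6 rows as F T F T F T.
bad <- d[na.omit(d$x) == 30, ]
print(bad)     # rows 2, 4 and 6: only one of them actually has x == 30

# subset() compares on the full column and drops NA results itself.
good <- subset(d, x == 30)
print(good)    # rows 3 and 6
```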
> P[na.omit(P$F.WW)==200525, c(51, 52)]
> F.WW R.WW
> 45 200525 NA
> 53 NA NA
> 61 200534 200534
> 63 200608 200608
> 66 200522 200541
> 80 NA NA
> 150 200521 200516
> 231 200530 200530
>
> No na.omit, the F.WW=200525 seems to work, but lots of NA included. This is what is expected!! The row numbers are not the same as the above example, except the first row.
>> P[P$F.WW==200525, c(51, 52)]
> F.WW R.WW
> 45 200525 NA
> NA NA NA
> NA.1 NA NA
> NA.2 NA NA
> NA.3 NA NA
> 57 200525 200526
> 65 200525 NA
> 67 200525 NA
> 70 200525 200525
> NA.4 NA NA
> NA.5 NA NA
> 86 200525 NA
Presumably, a number of rows got omitted here? The NAs are a bit of a pain, but that's the way things work: if you don't know whether an observation should be included, you get an NA-filled row.
> thuesen[thuesen$short.velocity==1.12,]
blood.glucose short.velocity
NA NA NA
17 4.2 1.12
23 8.8 1.12
To avoid this, you explicitly test for NA using is.na() or use subset() which does it internally.
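As a sketch of those options on a made-up data frame (d is hypothetical), the explicit is.na() test, which(), and subset() should all return the same rows without any NA-filled ones:

```r
# Hypothetical toy data: the value 30 sits in rows 3 and 6.
d <- data.frame(id = 1:6, x = c(10, NA, 30, 40, NA, 30))

# Plain comparison: NA in the index yields NA-filled rows.
with_na <- d[d$x == 30, ]
print(nrow(with_na))    # 4 rows, two of them all NA

# Explicit test for NA before comparing.
explicit <- d[!is.na(d$x) & d$x == 30, ]

# which() keeps only the positions that are TRUE, dropping NA.
via_which <- d[which(d$x == 30), ]

# subset() treats NA in the condition as FALSE.
via_subset <- subset(d, x == 30)

print(explicit)
```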
>
> na.omit excludes the NA's. This is what I want. The concern I have is why the row numbers do not match any of those shown in the examples above.
>> na.omit(P[P$F.WW==200525, c(51, 52)])
> F.WW R.WW
> 57 200525 200526
> 70 200525 200525
> 161 200525 200525
> 245 200525 200525
> 246 200525 200525
> 247 200525 200526
> 256 200525 200525
> 266 200525 200525
> 269 200525 200525
> 271 200525 200526
> 276 200525 200526
> 278 200525 200526
>
Well, now you remove rows with NA _anywhere_, so e.g. row #65 is out because R.WW is missing. I expect #161 and higher were just chopped from the earlier list.
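To illustrate the difference on a made-up two-column data frame (d, a and b are hypothetical names): na.omit() on the selected rows drops every row with a missing value in any column, whereas subset() on one column only removes the rows where that comparison is NA.

```r
# Hypothetical toy data: a is the column being filtered, b has its own NA.
d <- data.frame(a = c(5, 5, NA, 5, 7), b = c(NA, 1, 2, 3, 4))

# Plain comparison: rows 1, 2, 4 plus one NA-filled row for row 3.
sel <- d[d$a == 5, ]

# na.omit() removes the NA-filled row AND row 1, because b is missing there.
cleaned <- na.omit(sel)
print(rownames(cleaned))    # "2" "4"

# subset() only requires a == 5 to be TRUE, so row 1 is kept.
only_a <- subset(d, a == 5)
print(rownames(only_a))     # "1" "2" "4"
```

The original row names are carried along in both cases, which is why the row numbers printed after na.omit() can differ from those in an earlier, truncated listing.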
In short, nothing out of the ordinary seems to be going on here.
--
Peter Dalgaard
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com