[R] Effect of na.omit()

Tue Dec 29 22:33:13 CET 2009

On 29-Dec-09 21:11:38, James Rome wrote:
> I had an NA in one row of my data frame, so I called na.omit().
> But I do not understand where that row disappeared to.
> 
>>fri=na.omit(fri)
>> fri
>       Date.Only    DAY Hour Min15 Quarter Arrival.Val Arrival4
> 1    09/05/2008 Friday    8    33       3          32        8
> 2    10/24/2008 Friday   21    86       4          28        7
> 3    10/31/2008 Friday    8    33       4          20        5
> 4    10/31/2008 Friday    8    34       4          20        5
> 5    10/31/2008 Friday    8    35       4          12        3
> ....
> 1233 08/28/2009 Friday    0     2       3          12        3
> 1234 09/18/2009 Friday   22    92       3           8        2
> 1235 09/18/2009 Friday   23    93       3          20        5
>> fri[1235,]
>    Date.Only  DAY Hour Min15 Quarter Arrival.Val Arrival4
> NA      <NA> <NA>   NA    NA      NA          NA       NA
>> fri[1234,]
>       Date.Only    DAY Hour Min15 Quarter Arrival.Val Arrival4
> 1235 09/18/2009 Friday   23    93       3          20        5
> 
> So, the index numbers of the rows do not seem to have been updated.
> They are not part of my data frame (I think), so why didn't the rows
> renumber themselves?
> 
> Thanks,
> Jim Rome

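Because the numbers displayed at the left of the rows are not the
row numbers of the structure being displayed: they are in fact
row *names*! An expression like fri[1235,] indexes by position, and
after na.omit() only 1234 rows remain, hence the row of NAs.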

These are so assigned (by default) when the dataframe is created.
Example:

  DF <- data.frame(col1=c(1,2,3,4),col2=c(2,3,4,5),col3=c(3,4,5,6))
  DF
  #   col1 col2 col3
  # 1    1    2    3
  # 2    2    3    4
  # 3    3    4    5
  # 4    4    5    6
  row.names(DF)
  # [1] "1" "2" "3" "4"

  DF[c(1,3,4),]
  #   col1 col2 col3
  # 1    1    2    3
  # 3    3    4    5
  # 4    4    5    6

  row.names(DF) <- c("A","B","C","D")
  DF
  #   col1 col2 col3
  # A    1    2    3
  # B    2    3    4
  # C    3    4    5
  # D    4    5    6

  DF[c(1,3,4),]
  #   col1 col2 col3
  # A    1    2    3
  # C    3    4    5
  # D    4    5    6

So subsetting the rows named (1,2,3,4) down to (1,3,4) is treated
exactly like subsetting the rows named (A,B,C,D) down to (A,C,D):
the row-names travel with the rows.
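
To tie this back to the original question, here is a minimal sketch of
the difference between indexing by position and indexing by row-name
after na.omit(). The data frame DF2 and its columns are made up for
illustration; they are not the poster's data.

```r
# Hypothetical data frame with one incomplete row (row 2)
DF2 <- data.frame(x = c(1, NA, 3, 4), y = c(10, 20, 30, 40))
DF2 <- na.omit(DF2)

row.names(DF2)   # the original labels "1", "3", "4" survive
DF2[3, ]         # by position: the third remaining row, labelled "4"
DF2["3", ]       # by name: the row labelled "3"
DF2[4, ]         # position 4 no longer exists, so a row of NAs comes back
```

Quoting the label, as in DF2["3", ], selects by row-name; an unquoted
number always selects by position.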

If you want to "re-number" the rows after eliminating some rows
(with na.omit) then you could do

  row.names(fri) <- 1:nrow(fri)

Example:

  DF1 <-  DF[c(1,3,4),]
  DF1
  #   col1 col2 col3
  # A    1    2    3
  # C    3    4    5
  # D    4    5    6
  row.names(DF1) <- (1:nrow(DF1))
  DF1
  #   col1 col2 col3
  # 1    1    2    3
  # 2    3    4    5
  # 3    4    5    6
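
An alternative, sketched here on a made-up data frame DF3, is to assign
NULL to the row-names, which resets them to the default automatic
numbering without having to spell out the sequence:

```r
# Hypothetical data frame; keep rows 1, 3 and 4 only
DF3 <- data.frame(col1 = c(1, 2, 3, 4), col2 = c(2, 3, 4, 5))
DF3 <- DF3[c(1, 3, 4), ]
row.names(DF3)           # still the original labels "1", "3", "4"

row.names(DF3) <- NULL   # reset to automatic row-names
row.names(DF3)           # now "1", "2", "3"
```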

However, it is often very useful to keep the original "numbering"
(i.e. the numerical row-names), since this is then a record of
which rows in the dataframe got used. For example, in a regression
with some missing data coded as "NA", the model-matrix will retain
the original "numbering", so you can identify which cases (rows)
got used by looking at the row.names() of the model matrix.

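Note that row.names() returns these as character strings ("1", "3", ...),
not as numeric values. The result can be used directly as an index into
the original dataset (selecting by row-name), or converted with
as.integer() to obtain the row positions.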
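
A minimal sketch of that regression case, using a made-up data frame
dat and assuming the default na.action (na.omit) is in effect. Since
row.names() gives character strings, as.integer() is applied where row
positions are wanted:

```r
# Hypothetical data with a missing x in row 3 and a missing y in row 4
dat <- data.frame(x = c(1, 2, NA, 4, 5),
                  y = c(2.1, 3.9, 6.2, NA, 9.8))

fit  <- lm(y ~ x, data = dat)   # incomplete rows are dropped before fitting
mm   <- model.matrix(fit)
used <- row.names(mm)           # labels of the rows actually used

by_name <- dat[used, ]               # select by row-name
by_pos  <- dat[as.integer(used), ]   # select by position

used
by_name
```

Converting to positions is only safe here because dat still carries its
default row-names; had they been renumbered or replaced by labels,
selecting by name would be the reliable route.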

Hoping this helps,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 29-Dec-09                                       Time: 21:33:10
------------------------------ XFMail ------------------------------