[R] maintaining variable types in data frames

Mike Miller mbmiller at taxa.epi.umn.edu
Thu Jan 22 18:36:50 CET 2009


Suppose X and Y are two data frames with the same structure, variable 
names and dimensions but with different data and different patterns of 
missing values.  I want to replace the missing values in Y with the 
corresponding values from X.  I'll construct a simple two-by-two case:

> X <- as.data.frame(matrix(c("a","b",1,2),2,2), stringsAsFactors=FALSE)
> X[,2] <- as.integer(X[,2])
> str(X)
'data.frame':   2 obs. of  2 variables:
   $ V1: chr  "a" "b"
   $ V2: int  1 2

> Y <- as.data.frame(matrix(c("c","d",NA,4),2,2), stringsAsFactors=FALSE)
> Y[,2] <- as.integer(Y[,2])
> str(Y)
'data.frame':   2 obs. of  2 variables:
   $ V1: chr  "c" "d"
   $ V2: int  NA 4

This seems to be what I want to do...

> Y[is.na(Y)] <- X[is.na(Y)]

...and it works except that the structure of Y is changed so that Y$V2 is 
now of type chr instead of type int:

> str(Y)
'data.frame':   2 obs. of  2 variables:
   $ V1: chr  "c" "d"
   $ V2: chr  "1" "4"

This behavior makes sense because the vector X[is.na(Y)] is of the 
character type:

> is.character(X[is.na(Y)])
[1] TRUE
> str(X[is.na(Y)])
   chr "1"
> X[is.na(Y)]
[1] "1"

The last couple of results seem weird at first.  The "1" was originally an 
integer but now it is a character.  The reason is that the typing is done 
before any selection happens: when a data frame is indexed by a logical 
matrix such as is.na(Y), R first converts the whole data frame to a 
matrix, and a matrix can hold only one type.  Because X has a chr column, 
every element becomes chr.  Only afterward is the single selected element 
picked out, but by then all four elements, any of which could have been 
selected, are already characters.
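
A minimal sketch of that coercion, assuming that indexing a data frame 
with a logical matrix goes through as.matrix() (which is how I understand 
`[.data.frame` to behave).  The names M and sel are introduced here only 
for illustration:

```r
# Rebuild X from the example above with explicit column types.
X <- data.frame(V1 = c("a", "b"), V2 = 1:2, stringsAsFactors = FALSE)

# A matrix holds a single type, so the mixed chr/int data frame
# becomes a character matrix as a whole.
M <- as.matrix(X)
typeof(M)        # "character"

# Selecting with a logical matrix picks from that character matrix,
# so the selected element is already a character.
sel <- matrix(c(FALSE, FALSE, TRUE, FALSE), 2, 2)
typeof(X[sel])   # "character"
identical(X[sel], M[sel])
```

The point is only that the type is fixed by the conversion of all of X, 
not by which elements end up being selected.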

Suppose there were no NA elements in Y, what should we expect to see if we 
repeat what we did above?

> Y <- as.data.frame(matrix(c("c","d",3,4),2,2), stringsAsFactors=FALSE)
> Y[,2] <- as.integer(Y[,2])
> str(Y)
'data.frame':   2 obs. of  2 variables:
   $ V1: chr  "c" "d"
   $ V2: int  3 4

Even though there are no elements in X[is.na(Y)], the empty result is 
still of type chr:

> is.vector(X[is.na(Y)])
[1] TRUE
> is.character(X[is.na(Y)])
[1] TRUE
> str(X[is.na(Y)])
   chr(0)
> X[is.na(Y)]
character(0)

So what happens if we do this...

> Y[is.na(Y)] <- X[is.na(Y)]

...will it change the structure of Y so that Y$V2 becomes type chr?

> str(Y)
'data.frame':   2 obs. of  2 variables:
   $ V1: chr  "c" "d"
   $ V2: int  3 4

No.  I think there is an obvious reason for that:  Y was not changed, and 
more specifically, Y$V2 was not changed, so no change was made to the 
variable types.
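
As a quick check of that reasoning, here is a small sketch that compares 
Y before and after the assignment when nothing is missing.  The name 
`before` is introduced only for the comparison:

```r
X <- data.frame(V1 = c("a", "b"), V2 = 1:2, stringsAsFactors = FALSE)
Y <- data.frame(V1 = c("c", "d"), V2 = 3:4, stringsAsFactors = FALSE)

# Keep a copy so the result can be compared with the original.
before <- Y

# is.na(Y) is all FALSE, so there is nothing to replace.
Y[is.na(Y)] <- X[is.na(Y)]

identical(before, Y)   # TRUE: Y is unchanged
is.integer(Y$V2)       # TRUE: V2 is still an integer column
```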

It all makes sense, but I want an easy way to maintain the structure of a 
data frame when I do this kind of operation. I ought to be able to do 
something like this:

Ytypes <- get_types(Y)

Y[is.na(Y)] <- X[is.na(Y)]

Y <- use_types(Y, Ytypes)

That kind of system would ensure that the basic structure of the data 
frame can be maintained.  I don't want to have to check by hand, and 
sometimes it would be impossible to do so.
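
To make the idea concrete, here is a rough sketch of what such a pair of 
helpers might look like.  get_types() and use_types() are hypothetical 
names, not existing R functions, and this version assumes every column has 
a class with a matching as.<class>() function (integer, numeric, 
character, logical).  Factors, dates and similar classes would need more 
care, and I believe doubles that pass through the character matrix may 
lose digits because of formatting.

```r
# Hypothetical helper: record the class of every column.
get_types <- function(df) lapply(df, class)

# Hypothetical helper: coerce each column back to its recorded class
# by calling the matching as.<class>() function.
use_types <- function(df, types) {
  for (nm in names(types)) {
    if (!identical(class(df[[nm]]), types[[nm]])) {
      coerce <- match.fun(paste0("as.", types[[nm]][1]))
      df[[nm]] <- coerce(df[[nm]])
    }
  }
  df
}

X <- data.frame(V1 = c("a", "b"), V2 = 1:2, stringsAsFactors = FALSE)
Y <- data.frame(V1 = c("c", "d"), V2 = c(NA, 4L), stringsAsFactors = FALSE)

Ytypes <- get_types(Y)
Y[is.na(Y)] <- X[is.na(Y)]    # V2 becomes chr here
Y <- use_types(Y, Ytypes)     # V2 is coerced back to int
str(Y)
```

Note that the result has to be assigned back to Y, since an R function 
returns a modified copy rather than changing its argument in place.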

So what's the trick?  Is there a trick?
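
One candidate, offered as a sketch rather than a definitive answer, is to 
avoid the matrix conversion altogether and fill in the missing values one 
column at a time, assuming X and Y have the same column names and row 
order.  The loop variable names are illustrative only:

```r
X <- data.frame(V1 = c("a", "b"), V2 = 1:2, stringsAsFactors = FALSE)
Y <- data.frame(V1 = c("c", "d"), V2 = c(NA, 4L), stringsAsFactors = FALSE)

# Work column by column so that values are only ever moved between
# vectors of the same type and no coercion to character occurs.
for (nm in names(Y)) {
  miss <- is.na(Y[[nm]])
  Y[[nm]][miss] <- X[[nm]][miss]
}
str(Y)
```

This only preserves types when the matching columns of X and Y already 
have the same type, as they do in the example.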

Mike



