[R] maintaining variable types in data frames
Mike Miller
mbmiller at taxa.epi.umn.edu
Thu Jan 22 18:36:50 CET 2009
Suppose X and Y are two data frames with the same structure, variable
names and dimensions but with different data and different patterns of
missing values. I want to replace the missing values in Y with the
corresponding values from X. I'll construct a simple two-by-two case:
> X <- as.data.frame(matrix(c("a","b",1,2),2,2), stringsAsFactors=FALSE)
> X[,2] <- as.integer(X[,2])
> str(X)
'data.frame': 2 obs. of 2 variables:
$ V1: chr "a" "b"
$ V2: int 1 2
> Y <- as.data.frame(matrix(c("c","d",NA,4),2,2), stringsAsFactors=FALSE)
> Y[,2] <- as.integer(Y[,2])
> str(Y)
'data.frame': 2 obs. of 2 variables:
$ V1: chr "c" "d"
$ V2: int NA 4
This seems to be what I want to do...
> Y[is.na(Y)] <- X[is.na(Y)]
...and it works except that the structure of Y is changed so that Y$V2 is
now of type chr instead of type int:
> str(Y)
'data.frame': 2 obs. of 2 variables:
$ V1: chr "c" "d"
$ V2: chr "1" "4"
This behavior makes sense because the vector X[is.na(Y)] is of the
character type:
> is.character(X[is.na(Y)])
[1] TRUE
> str(X[is.na(Y)])
chr "1"
> X[is.na(Y)]
[1] "1"
The last couple of results seem weird at first. The "1" was originally an
integer but now it is a character. This *must* be because the typing is
done before the selection: to index X with the logical matrix is.na(Y), R
first has to treat all of X as a single matrix, and the only type that can
hold both the chr column and the int column is character. Only afterward
does it find that just one of the four elements of X will be selected, but
by then it was prepared for any of the four, even all four of them, to be
selected.
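One way to check this explanation is to compare against an explicit
as.matrix() conversion. The sketch below assumes that indexing a data
frame with a logical matrix goes through as.matrix(), which is how the
base R source of `[.data.frame` handles a matrix index; the variable name
m is only for illustration.

```r
# Rebuild the two example data frames from the post.
X <- data.frame(V1 = c("a", "b"), V2 = 1:2, stringsAsFactors = FALSE)
Y <- data.frame(V1 = c("c", "d"), V2 = c(NA, 4L), stringsAsFactors = FALSE)

# A matrix can hold only one type, so mixing chr and int columns
# forces every element to character.
m <- as.matrix(X)
typeof(m)

# Selecting from the converted matrix should match selecting from X.
identical(X[is.na(Y)], m[is.na(Y)])
```

If the two selections are identical, the integer was already a character
string before any element was picked out.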
Suppose there were no NA elements in Y. What should we expect to see if we
repeat what we did above?
> Y <- as.data.frame(matrix(c("c","d",3,4),2,2), stringsAsFactors=FALSE)
> Y[,2] <- as.integer(Y[,2])
> str(Y)
'data.frame': 2 obs. of 2 variables:
$ V1: chr "c" "d"
$ V2: int 3 4
Even though there are no elements in X[is.na(Y)], the empty result is
still of type chr:
> is.vector(X[is.na(Y)])
[1] TRUE
> is.character(X[is.na(Y)])
[1] TRUE
> str(X[is.na(Y)])
chr(0)
> X[is.na(Y)]
character(0)
So what happens if we do this...
> Y[is.na(Y)] <- X[is.na(Y)]
...will it change the structure of Y so that Y$V2 becomes type chr?
> str(Y)
'data.frame': 2 obs. of 2 variables:
$ V1: chr "c" "d"
$ V2: int 3 4
No. I think there is an obvious reason for that: Y was not changed, and
more specifically, Y$V2 was not changed, so no change was made to the
variable types.
It all makes sense, but I want an easy way to maintain the structure of a
data frame when I do this kind of operation. I ought to be able to do
something like this:
Ytypes <- get_types(Y)
Y[is.na(Y)] <- X[is.na(Y)]
Y <- use_types(Y, Ytypes)
That kind of system would ensure that the basic structure of the data
frame can be maintained. I don't want to have to check by hand, and
sometimes it would be impossible to do so.
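A minimal sketch of what such a pair of helpers could look like follows.
The names get_types and use_types are the hypothetical ones from the
pseudocode above, not base R functions, and the sketch assumes every
column has a class with a matching as.<class>() converter (integer,
numeric, character, logical, factor and so on).

```r
# Hypothetical helper: record the first class of every column.
get_types <- function(df) {
  vapply(df, function(col) class(col)[1], character(1))
}

# Hypothetical helper: convert each column back to its recorded class.
# Assumes a function named as.<class> exists for every recorded class.
use_types <- function(df, types) {
  stopifnot(identical(names(df), names(types)))
  for (nm in names(df)) {
    if (class(df[[nm]])[1] != types[[nm]]) {
      convert <- match.fun(paste0("as.", types[[nm]]))
      df[[nm]] <- convert(df[[nm]])
    }
  }
  df
}

X <- data.frame(V1 = c("a", "b"), V2 = 1:2, stringsAsFactors = FALSE)
Y <- data.frame(V1 = c("c", "d"), V2 = c(NA, 4L), stringsAsFactors = FALSE)

Ytypes <- get_types(Y)
Y[is.na(Y)] <- X[is.na(Y)]    # V2 becomes chr here
Y <- use_types(Y, Ytypes)     # V2 is converted back to int
str(Y)
```

Because the values still pass through character on the way, this round
trip may not be lossless for every type (doubles with many significant
digits, for example), so it is worth checking on real data.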
So what's the trick? Is there a trick?
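One possible approach, offered as a sketch rather than an established
idiom, is to avoid the matrix-style index altogether and replace the
missing values one column at a time, so that each column only ever
receives values of its own type. The function name fill_missing is
hypothetical, and the sketch assumes X and Y have identical dimensions and
column names, as in the example.

```r
# Hypothetical helper: fill NAs in y from the same positions in x,
# working column by column so no cross-column coercion can occur.
fill_missing <- function(y, x) {
  stopifnot(identical(dim(x), dim(y)), identical(names(x), names(y)))
  for (j in seq_along(y)) {
    miss <- is.na(y[[j]])
    y[[j]][miss] <- x[[j]][miss]
  }
  y
}

X <- data.frame(V1 = c("a", "b"), V2 = 1:2, stringsAsFactors = FALSE)
Y <- data.frame(V1 = c("c", "d"), V2 = c(NA, 4L), stringsAsFactors = FALSE)

Y <- fill_missing(Y, X)
str(Y)
```

A value that is missing in both data frames simply stays NA.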
Mike