[R] maintaining variable types in data frames

Mike Miller mbmiller at taxa.epi.umn.edu
Fri Jan 23 04:44:02 CET 2009


On Thu, 22 Jan 2009, Mike Miller wrote:

> Suppose X and Y are two data frames with the same structures, variable 
> names and dimensions but with different data and different patterns of 
> missing.  I want to replace missing values in Y with corresponding 
> values from X.  I'll construct a simple two-by-two case:
>
>> X <- as.data.frame(matrix(c("a","b",1,2),2,2), stringsAsFactors=FALSE)
>> X[,2] <- as.integer(X[,2])
>> str(X)
> 'data.frame':   2 obs. of  2 variables:
>  $ V1: chr  "a" "b"
>  $ V2: int  1 2
>
>> Y <- as.data.frame(matrix(c("c","d",NA,4),2,2), stringsAsFactors=FALSE)
>> Y[,2] <- as.integer(Y[,2])
>> str(Y)
> 'data.frame':   2 obs. of  2 variables:
>  $ V1: chr  "c" "d"
>  $ V2: int  NA 4
>
> This seems to be what I want to do...
>
>> Y[is.na(Y)] <- X[is.na(Y)]
>
> ...and it works except that the structure of Y is changed so that Y$V2 is now 
> of type chr instead of type int:
>
>> str(Y)
> 'data.frame':   2 obs. of  2 variables:
>  $ V1: chr  "c" "d"
>  $ V2: chr  "1" "4"


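I figured out a good answer.  The type changes because indexing a data 
frame with the logical matrix is.na(Y) converts the whole frame to a 
matrix, which here is of type chr.  Instead, we can pick the columns we 
want to work with and loop over them one at a time.  This avoids problems 
with changing variable types: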

cols <- 38:47
keep <- is.na(Y)
for (i in cols) { nas <- which(keep[,i]); if ( length(nas) > 0 ) { Y[nas,i] <- X[nas,i] }}

Something like that makes for a good one-liner on the interactive command 
line, but this looks neater in a script:

cols <- 38:47
keep <- is.na(Y)
for (i in cols) {
   nas <- which(keep[,i])
   if ( length(nas) > 0 ) {
      Y[nas,i] <- X[nas,i]
   }
}
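
As a quick check of why the column-wise loop helps, here is a minimal 
sketch on the two-by-two example from the original question.  The names 
miss and Y2 are only for illustration, and the loop runs over all columns 
rather than 38:47:

```r
# Rebuild the two-by-two example from the original question.
X <- data.frame(V1 = c("a", "b"), V2 = c(1L, 2L), stringsAsFactors = FALSE)
Y <- data.frame(V1 = c("c", "d"), V2 = c(NA, 4L), stringsAsFactors = FALSE)

# Indexing a mixed-type data frame with a logical matrix goes through
# as.matrix(), so the extracted value is character even though it comes
# from an integer column.
miss <- is.na(Y)
class(X[miss])

# Column-wise replacement never mixes columns, so V2 stays integer.
Y2 <- Y
for (i in seq_along(Y2)) {
  nas <- which(miss[, i])
  if (length(nas) > 0) {
    Y2[nas, i] <- X[nas, i]
  }
}
str(Y2)
```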

It shouldn't be too hard to write a function that does that kind of thing.
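
One possible shape for such a function is sketched below.  The name 
fill_missing and its arguments are hypothetical, not something from an 
existing package, and the sketch assumes X and Y have identical 
dimensions and column names:

```r
# Hypothetical helper (name and arguments are not from the original post):
# fill NA entries of Y with the values at the same positions in X,
# one column at a time so that each column keeps its own type.
fill_missing <- function(Y, X, cols = seq_along(Y)) {
  stopifnot(is.data.frame(X), is.data.frame(Y),
            identical(dim(X), dim(Y)),
            identical(names(X), names(Y)))
  for (i in cols) {
    nas <- which(is.na(Y[[i]]))
    if (length(nas) > 0) {
      Y[nas, i] <- X[nas, i]
    }
  }
  Y
}

# Usage on the two-by-two example from the original question.
X <- data.frame(V1 = c("a", "b"), V2 = c(1L, 2L), stringsAsFactors = FALSE)
Y <- data.frame(V1 = c("c", "d"), V2 = c(NA, 4L), stringsAsFactors = FALSE)
Z <- fill_missing(Y, X)
str(Z)
```

The cols argument defaults to every column, but a subset such as 38:47 
or a vector of column names should work the same way.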

The only problem I know of involves factors: if X and Y have factor 
columns whose levels are not exactly the same, the assignment could go 
wrong.  It would probably take a few more lines to deal with this.
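
One way those few lines might look is sketched here, on a made-up 
one-column example.  Without the level fix, copying a value whose level 
is missing from the Y column would be expected to give NA with an 
"invalid factor level" warning.  The variable all_levels and the choice 
to take the union of the two level sets are assumptions, not a tested 
general solution:

```r
# Made-up example: the factor columns have different level sets.
X <- data.frame(g = factor(c("a", "b")))
Y <- data.frame(g = factor(c(NA, "c")))

for (i in seq_along(Y)) {
  if (is.factor(Y[[i]]) && is.factor(X[[i]])) {
    # Extend Y's levels with any levels that occur only in X, so that
    # values copied from X are valid levels of the Y column.
    all_levels <- union(levels(Y[[i]]), levels(X[[i]]))
    Y[[i]] <- factor(Y[[i]], levels = all_levels)
  }
  nas <- which(is.na(Y[[i]]))
  if (length(nas) > 0) {
    Y[nas, i] <- X[nas, i]
  }
}
str(Y)
```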

A couple of people wrote to me with helpful suggestions, but no one had a 
really great, established kind of solution.  I'm a little surprised.  But, 
with an average of 125 messages per day (!) on this list, I shouldn't be 
surprised that a long message like this one won't be read by everyone.

Best,
Mike



