[R] error in NORM lib
(Ted Harding)
Ted.Harding at nessie.mcc.ac.uk
Tue Nov 15 01:45:48 CET 2005
Folks,
Leo Gürther and I have been privately discussing the
problems with imputation using NORM which he originally
described on 9 November. Essentially, he observed that
many of the imputed missing values were totally absurd,
being well out of any range comatible with the observed
values of the variables.
After following a few false trails, we have discovered
the reason. People interested in using NORM (and CAT and
MIX and maybe PAN) may well be interested in this reason!
The dataset, which can be downloaded from his URL
http://www.anicca-vijja.de/lg/dframe.Rdata
consists of a matrix with 74 columns and 200 rows.
There are 553 missing values out of the 14800 (less
than 4%), and the distributions of the observed values
of the variables are well-behaved. So this should not
be a problematic dataset.
61 of the 74 columns have missing values (NAs) in them,
and this is the reason why NORM fails.
Specifically, the first few lines of the code of the
function prelim.norm() are as follows:
if (is.vector(x))
x <- matrix(x, length(x), 1)
n <- nrow(x)
p <- ncol(x)
storage.mode(x) <- "double"
r <- 1 * is.na(x)
nmis <- as.integer(apply(r, 2, sum))
names(nmis) <- dimnames(x)[[2]]
mdp <- as.integer((r %*% (2^((1:ncol(x)) - 1))) + 1)
and, as can be seen from the last line, if there are
missing values in a column with index > 31 then
(r %*% (2^((1:ncol(x)) - 1))) + 1 >= 2^31
and then applying as.integer() to this value returns NA
since as.integer only works for numbers no greater than
.Machine$integer.max, normally 2^31 - 1. (Is the situation
different for R on say 64-bit machines?)
The value of mdp[i] is a "packed" binary encoding of the
column positions of any NAs in row i: if bit j-1 (counting
from 0) in the binary representation of mdp[i] is 1, then
there is an NA in column j of row i.
The vector mdp is used at various places in the NORM routines,
and the effect on the imputations of having NAs in it, when
the functioning of the routines depends on unpacking the
encoding, is catastrophic. (Experiment had shown, indeed,
that imputing with a subset of fewer than 32 columns always
gave acceptable results).
The upshot of this is that NORM cannot be used for multiple
imputations if there are more than 31 columns in the data
which have NAs in them.
You could have more than 31 columns of data -- indeed Leo's
74 would have worked then -- if the columns are re-ordered
so that all the columns with NAs are at the left, provided
there are fewer than 32 with NAs. Unfortunately Leo has 61.
There is in principle no necessity to represent NA positions
in this way, but that is how Shafer did it and it was carried
over into R. An alternative method would simply be to have
a 0/1 matrix of NA indicators, but the code for the NORM
functions would have to be picked through to replace the
unpacking of mdp -- and this includes FORTRAN routines
(Oh dear, echoes of the "open source and R" discussion)!
So removing this limitation would not be trivial.
I have not noticed mention of the limitation in the documentation
of the NORM functions.
Exactly the same construction of mdp, and therefore exactly the
same problem, occurs in prelim.cat in CAT, for which I'm joint
maintainer with Fernando Tusell, so we had better try to look
into that! Any lessons we learn will be broadcast, so should be
useful for NORM as well.
And, for good measure, in MIX it occurs twice over in prelim.mix:
once in constructing mdpz for the continuous variables, and
once in mdpw for the categorical variables. This is perhaps
less likely in practice to cause the problem in MIX, since it
would arise only if either there were more than 31 columns of
continuous variables with NAs, or more than 31 of categorical
variables; so MIXers can spread their bets.
Again, I have not noticed that the limitation is mentioned
in the documentation of MIX; and I'm pretty sure it is not
in the documentation of CAT!
Any suggestions or guidance from people who are familiar with
NORM and MIX will be most welcome.
I should add that I have not looked into PAN, but would not
be surprised if it were there as well.
I've written this explanation in consultation with Leo Gürtler,
and he has proposed that I should publish it to the R List;
but please consider that it is a joint effort.
Best wishes to all,
Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 15-Nov-05 Time: 00:45:46
------------------------------ XFMail ------------------------------
More information about the R-help
mailing list