[R] missing values imputation

Wed May 12 19:23:05 CEST 2004

(Ted Harding) <Ted.Harding at nessie.mcc.ac.uk> writes:

> On 12-May-04 Rolf Turner wrote:
>> Anne Piotet wrote:
>> 
>>> What R functionality is there for missing-value imputation
>>> (with a substantial proportion of missing data)?  I would prefer to use
>>> maximum likelihood methods; is the EM algorithm implemented, and in
>>> which package?
>> 
>>       The so-called ``EM algorithm'' is ***NOT*** an
>>       algorithm.  It is a methodology or a unifying concept.
>>       It would be impossible to ``implement'' it.  (Except
>>       possibly by means of some extremely advanced and
>>       sophisticated Artificial Intelligence software.)
>
> Do we understand the same thing by "EM Algorithm"?
>
> The one I'm thinking of -- formulated under that name by Dempster,
> Laird and Rubin in 1977 ("Maximum likelihood from incomplete data
> via the EM algorithm", JRSS(B) 39, 1-38) -- is indeed an algorithm
> in exactly the same sense as any iterative search for the maximum of a
> function.
>
> Essentially, in the context of data modelled by an underlying exponential
> family distribution where there is incomplete information about the
> values which have this distribution, it proceeds by
>
> Start: Choose starting estimates for the parameters of the distribution
> E: Using the current parameter values, compute the expected values
>    of the sufficient statistics conditional on the observed information
> M: Solve the maximum-likelihood equations (which are functions of the
>    sufficient statistics) using the expected values computed in (E)
> If sufficiently converged, stop. Otherwise, make the current parameter
> values equal to the values estimated in (M) and return to (E).
>
> Algorithm, this, or not????
>
> And where does "extremely advanced and sophisticated Artificial
> Intelligence software" come into it? You can, in some cases, perform
> the above EM algorithm by hand.
>
> Which "EM Algorithm" are you thinking of?
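
The Start / E / M / stop loop quoted above can be sketched for one case
small enough to check by hand. This is only an illustration, not code
from the thread: it assumes exponentially distributed data where some
values are right-censored (we know only that they exceed a threshold),
and the function name and arguments are hypothetical. Python is used
purely for the sketch.

```python
def em_censored_exponential(observed, censored_at, rate=1.0,
                            tol=1e-10, max_iter=1000):
    """EM estimate of an exponential rate from right-censored data.

    observed    -- fully observed values
    censored_at -- thresholds; each true value is only known to exceed its threshold
    rate        -- starting estimate (the "Start" step)
    """
    n = len(observed) + len(censored_at)
    for _ in range(max_iter):
        # E step: expected value of the sufficient statistic (the sum of all
        # values) given the observed information. For the exponential
        # distribution, E[X | X > c] = c + 1/rate.
        expected_sum = sum(observed) + sum(c + 1.0 / rate for c in censored_at)
        # M step: the complete-data ML equation gives rate = n / sum(x),
        # evaluated at the expected sum from the E step.
        new_rate = n / expected_sum
        # Stop when the estimate has settled; otherwise iterate again.
        if abs(new_rate - rate) < tol:
            return new_rate
        rate = new_rate
    return rate


estimate = em_censored_exponential([0.5, 1.2, 2.0], [3.0, 3.0])
```

For this model the maximum-likelihood estimate also has a closed form
(the number of uncensored values divided by the total of all observed
values and censoring thresholds, here 3 / 9.7), so the iteration can be
checked against it.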

Thanks, Ted :-) -- to extend it a bit, one can imagine using
approximate solutions for the two steps (simulation methods to get the
expected values, and a similar range of approaches for the maximization)
to get a general (but possibly not robust) computational solution for
the parametric problem.  Just plug in a formula for the likelihood and
the sufficient statistics...
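
As a rough illustration of the simulation idea (a sketch under stated
assumptions, not a general solver): take exponentially distributed data
with some right-censored values, and replace the exact expectation in
the E step by an average over simulated draws. The function name, its
arguments, and the choice of Python are all hypothetical.

```python
import random


def mcem_censored_exponential(observed, censored_at, rate=1.0,
                              n_iter=50, n_draws=2000, seed=0):
    """Monte Carlo EM for an exponential rate with right-censored data."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(observed) + len(censored_at)
    for _ in range(n_iter):
        # Approximate E step: simulate each censored value from its
        # conditional distribution given X > c (c plus a fresh exponential
        # draw, by memorylessness) and average the draws.
        expected_sum = sum(observed)
        for c in censored_at:
            draws = [c + rng.expovariate(rate) for _ in range(n_draws)]
            expected_sum += sum(draws) / n_draws
        # M step: complete-data ML estimate using the simulated expectation.
        rate = n / expected_sum
    # The estimate keeps fluctuating with the simulation noise, so a fixed
    # number of iterations is used instead of a convergence test.
    return rate


approx = mcem_censored_exponential([0.5, 1.2, 2.0], [3.0, 3.0])
```

The result only approximates the exact maximum-likelihood value, and how
close it gets depends on the number of draws, which is one reason such
schemes may not be robust.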

Of course, thousands of papers have been written on these variations
(likelihood, specific implementations of the E and M steps).  

best,
-tony

-- 
rossini at u.washington.edu            http://www.analytics.washington.edu/ 
Biomedical and Health Informatics   University of Washington
Biostatistics, SCHARP/HVTN          Fred Hutchinson Cancer Research Center
UW (Tu/Th/F): 206-616-7630 FAX=206-543-3461 | Voicemail is unreliable
FHCRC  (M/W): 206-667-7025 FAX=206-667-4812 | use Email