[BioC] missing values in limma/contrasts.fit
Albyn Jones
jones at reed.edu
Wed Dec 16 18:31:40 CET 2009
Gordon
Thanks for the toughtful response. I should have included an example;
I will put a small one together soon. I am inmeshed in giving and grading
exams at the moment, so it won't happen immediately.
albyn
On Wed, Dec 16, 2009 at 10:09:59AM +1100, Gordon K Smyth wrote:
> Dear Albyn,
>
> It is always helpful to give a code example, preferably something which can
> be run by others, but at least showing us the details of the situation.
> There isn't any way to confirm from your email that the issue you mention
> is the root cause of the problem. Your colleague's data example would have
> to quite unusual for this to be true.
>
> I have plenty of sympathy with researchers carefully examining the
> microarray image and marking scratches or other imperfections, but it is
> hard to imagine that they would flag tens of thousands of spots in this way
> and, if they did, it might be better to drop the array entirely.
>
> lmFit() already computes exact standard errors.
>
> In many large data problems, in which many contrasts might be experimented
> with, it is convenient to be able to compute a basic fit object and then to
> apply various contrasts posthoc. This was especially true in the early
> days of microarray analysis with somewhat slower computers than today. This
> is the reason for the separate contrasts.fit() function.
>
> I made an early design decision that fit objects could be order
> ngenes*nparameters in size, but not order ngenes*nparameters^2, so it was
> not possible to store genewise QR decompositions. This leads to the
> computational efficiency issue, which applies to contrasts.fit, not to
> lmFit.
>
> If a dataset has so many spot-specific weights of different sizes, so that
> the approximation involved in contrasts.fit() is not trustworthy, then it
> is simple enough to run lmFit() a couple of times with different
> parametrizations to get all the contrasts desired.
>
> I have considered in the past adding a 'contrasts' argument to lmFit, so
> that users could fit the basic regression and simultaneously extract all
> desired contrasts, using the exact covariance matrices. (This is not quite
> as simple as you might think, because lmFit supports a number of different
> covariance structures, robust fits etc, so the changes have to be made in a
> number of low level functions. And it changes the structure of the fit
> object which is passed on.) I can still do so if there is a demand for it.
>
> Best wishes
> Gordon
>
>> Date: Mon, 14 Dec 2009 11:15:05 -0800
>> From: Albyn Jones <jones at reed.edu>
>> Subject: [BioC] missing values in limma/contrasts.fit
>> To: bioconductor at stat.math.ethz.ch
>> Message-ID: <20091214191505.GD11908 at laplace.reed.edu>
>> Content-Type: text/plain; charset=us-ascii
>>
>> Dear BioConductor Folk
>>
>> The help file for contrasts.fit states:
>>
>> "Warning. For efficiency reasons, this function does not
>> re-factorize the design matrix for each probe. A consequence is
>> that, if the design matrix is non-orthogonal and the original fit
>> included quality weights or missing values, then the unscaled
>> standard deviations produced by this function are approximate
>> rather than exact. The approximation is usually acceptable...."
>>
>> My attention was attracted to the statement when a colleague in
>> biology asked me why one would get different sets of probes identified
>> as differentially expressed, depending on which individual or
>> biological sample was selected as the reference in a balanced loop
>> design.
>>
>> My experience, admittedly limited, suggests that the computational
>> efficiency gain is not worth the loss of accuracy. Even if one has to
>> sacrifice the efficiency of a single pass through the raw data, at
>> least one gets correct results. I have hacked a version of lmFit to
>> evaluate contrasts with standard errors based on the exact covariance
>> matrix. It runs esssentially as quickly as lmFit, so I find the
>> efficiency argument uncompelling.
>>
>> A search of the archive produced several discussions of missing values
>> in limma. The main argument I see is Gordon Smyth's (Date: 2008-03-08)
>>
>> "The ideal solution is not to introduce missing values into your
>> data in the first place. In my experimence, missing values are
>> almost always avoidable. I have never seen a situation where it
>> was necessary or desirable to introduce a large proportion of
>> missing values."
>>
>> My colleagues in biology report that they inspect their arrays
>> visually and note probes which have been scratched, probes covered by
>> background blobs and the like. These categories seem to satisfy the
>> missing-at-random criterion: the probe is marked NA not because it is
>> saturated or below background, but because it was unreadable for
>> reasons unrelated to the response.
>>
>> I'd appreciate feedback: has anyone else already done this? Would
>> others find this useful? Are there objections I have overlooked?
>>
>> albyn
>
> ______________________________________________________________________
> The information in this email is confidential and inte...{{dropped:5}}
More information about the Bioconductor
mailing list