[BioC] missing values in limma/contrasts.fit

Wed Dec 16 18:31:40 CET 2009

Gordon

Thanks for the toughtful response.  I should have included an example;
I will put a small one together soon.  I am inmeshed in giving and grading
exams at the moment, so it won't happen immediately.

albyn

On Wed, Dec 16, 2009 at 10:09:59AM +1100, Gordon K Smyth wrote:
> Dear Albyn,
>
> It is always helpful to give a code example, preferably something which can 
> be run by others, but at least showing us the details of the situation.  
> There isn't any way to confirm from your email that the issue you mention 
> is the root cause of the problem.  Your colleague's data example would have 
> to quite unusual for this to be true.
>
> I have plenty of sympathy with researchers carefully examining the 
> microarray image and marking scratches or other imperfections, but it is 
> hard to imagine that they would flag tens of thousands of spots in this way 
> and, if they did, it might be better to drop the array entirely.
>
> lmFit() already computes exact standard errors.
>
> In many large data problems, in which many contrasts might be experimented 
> with, it is convenient to be able to compute a basic fit object and then to 
> apply various contrasts posthoc.  This was especially true in the early 
> days of microarray analysis with somewhat slower computers than today. This 
> is the reason for the separate contrasts.fit() function.
>
> I made an early design decision that fit objects could be order 
> ngenes*nparameters in size, but not order ngenes*nparameters^2, so it was 
> not possible to store genewise QR decompositions.  This leads to the 
> computational efficiency issue, which applies to contrasts.fit, not to 
> lmFit.
>
> If a dataset has so many spot-specific weights of different sizes, so that 
> the approximation involved in contrasts.fit() is not trustworthy, then it 
> is simple enough to run lmFit() a couple of times with different 
> parametrizations to get all the contrasts desired.
>
> I have considered in the past adding a 'contrasts' argument to lmFit, so 
> that users could fit the basic regression and simultaneously extract all 
> desired contrasts, using the exact covariance matrices. (This is not quite 
> as simple as you might think, because lmFit supports a number of different 
> covariance structures, robust fits etc, so the changes have to be made in a 
> number of low level functions.  And it changes the structure of the fit 
> object which is passed on.)  I can still do so if there is a demand for it.
>
> Best wishes
> Gordon
>
>> Date: Mon, 14 Dec 2009 11:15:05 -0800
>> From: Albyn Jones <jones at reed.edu>
>> Subject: [BioC] missing values in limma/contrasts.fit
>> To: bioconductor at stat.math.ethz.ch
>> Message-ID: <20091214191505.GD11908 at laplace.reed.edu>
>> Content-Type: text/plain; charset=us-ascii
>>
>> Dear BioConductor Folk
>>
>> The help file for contrasts.fit states:
>>
>>     "Warning. For efficiency reasons, this function does not
>>     re-factorize the design matrix for each probe. A consequence is
>>     that, if the design matrix is non-orthogonal and the original fit
>>     included quality weights or missing values, then the unscaled
>>     standard deviations produced by this function are approximate
>>     rather than exact. The approximation is usually acceptable...."
>>
>> My attention was attracted to the statement when a colleague in
>> biology asked me why one would get different sets of probes identified
>> as differentially expressed, depending on which individual or
>> biological sample was selected as the reference in a balanced loop
>> design.
>>
>> My experience, admittedly limited, suggests that the computational
>> efficiency gain is not worth the loss of accuracy.  Even if one has to
>> sacrifice the efficiency of a single pass through the raw data, at
>> least one gets correct results.  I have hacked a version of lmFit to
>> evaluate contrasts with standard errors based on the exact covariance
>> matrix.  It runs esssentially as quickly as lmFit, so I find the
>> efficiency argument uncompelling.
>>
>> A search of the archive produced several discussions of missing values
>> in limma.  The main argument I see is Gordon Smyth's (Date: 2008-03-08)
>>
>>   "The ideal solution is not to introduce missing values into your
>>    data in the first place.  In my experimence, missing values are
>>    almost always avoidable.  I have never seen a situation where it
>>    was necessary or desirable to introduce a large proportion of
>>    missing values."
>>
>> My colleagues in biology report that they inspect their arrays
>> visually and note probes which have been scratched, probes covered by
>> background blobs and the like.  These categories seem to satisfy the
>> missing-at-random criterion: the probe is marked NA not because it is
>> saturated or below background, but because it was unreadable for
>> reasons unrelated to the response.
>>
>> I'd appreciate feedback: has anyone else already done this? Would
>> others find this useful?  Are there objections I have overlooked?
>>
>> albyn
>
> ______________________________________________________________________
> The information in this email is confidential and inte...{{dropped:5}}