[BioC] quality assessment and preprocessing for tiling array-based CGH data

Wed Oct 22 16:32:34 CEST 2008

Sean Davis wrote:
> On Wed, Oct 22, 2008 at 9:51 AM, Leon Yee <yee.leon at gmail.com> wrote:
>> Dear all,
>>
>>    Is there any well-established routine for quality assessment and
>> preprocessing of array CGH data, especially tiling array-based CGH data? I
>> found that most quality assessment methods for array data are aimed at
>> expression arrays, while few deal with array CGH data.
>>    We are using the Agilent 244k CGH array for rat, and we now have the text
>> files produced by Feature Extraction, but don't know whether they are of good
>> quality. Could anyone help provide some clues? Thanks in advance!
>>
>>    After read.maimages(), we got the RGList object, which contains several
>> components including R, G, Rb, Gb, and so on.  The probes are of 3 types:
>> -1, 1 and 0. 0 means normal probe; -1 means negative control, I guess, and
>> the probe names are like (-)3xSLv1, NC1_00000002, etc. [no corresponding probe
>> sequence]; 1 means positive control, I guess, and the probe names are like
>> DarkCorner, DCP_008001.0, RnCGHBrightCorner, SRN_800002, etc. [no
>> corresponding probe sequence].  The number of -1 is 1275, while the number
>> of 1 is 4217, each of which has its R, Rb, G, Gb values. Do we need these
>> values for quality assessment and normalization? How?
>>    In addition, among the normal probes, we have 1000 probes that are each
>> repeated 3 times on the array. How could we use these data for quality
>> assessment and normalization?
> 
> You generally will not want to do any normalization besides a possible
> shift of the center.  Any linear normalization that affects the slope
> of the M vs. A plot or nonlinear normalization will likely decrease
> signal.  As for quality control, a good, general measure to track is
> the dlrs (derivative log ratio spread), a robust estimate of the
> probe-to-probe noise standard deviation.
> 
> 
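> dlrs <-
>   function(x) {
>     # x: log ratios of one array, in chromosome and position order
>     nx <- length(x)
>     if (nx<3) {
>       stop("Vector length>2 needed for computation")
>     }
>     # differences between neighbouring probes
>     tmp <- embed(x,2)
>     diffs <- tmp[,2]-tmp[,1]
>     # IQR of a normal is about 1.34*sd; sqrt(2) because each
>     # difference combines the noise of two probes
>     dlrs <- IQR(diffs)/(sqrt(2)*1.34)
>     return(dlrs)
>   }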
> 
> For Agilent arrays, most of the dlrs should be around or under 0.2,
> generally.  However, this might vary a bit based on lab-to-lab
> variation.  In any case, if there is a significant outlier, that is
> suspect.  The input to the above function is the log ratios for a
> single array arranged in chromosome and position order.
> 
> Sean
> 
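As a quick check of the dlrs function quoted above, here is a minimal sketch on simulated data. The probe count, noise level, gained segment and seed are all assumptions made for illustration, not values from a real array, and the function is rewritten with diff() instead of embed(), which only flips the sign of the differences.

```r
# dlrs as quoted above, written with diff() instead of embed();
# the sign of the differences is flipped, which does not change the IQR
dlrs <- function(x) {
  if (length(x) < 3) stop("Vector length>2 needed for computation")
  diffs <- diff(x)
  IQR(diffs) / (sqrt(2) * 1.34)
}

set.seed(1)      # arbitrary seed, for reproducibility only
n     <- 5000    # hypothetical number of position-ordered probes
sigma <- 0.2     # assumed per-probe noise SD, in log2 ratio units

# Flat profile with one simulated single-copy gain in the middle
truth <- rep(0, n)
truth[2001:3000] <- log2(3 / 2)
x <- truth + rnorm(n, sd = sigma)

dlrs(x)  # close to sigma: only two of the n-1 differences span a breakpoint
sd(x)    # larger than sigma, because it also counts the shift in the mean
```

The point of the comparison is that the plain standard deviation mixes real copy-number changes with noise, while dlrs mostly ignores them, provided the probes are in genomic order.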

Hi, Sean

    Thanks for your advice. However, I still have several questions:

    1. The input to dlrs is the log ratios: should these be the log ratios 
taken from the text file produced by Feature Extraction, or the ones 
calculated from RGList --> MAList?  I have searched the mailing list and 
seen a post of yours that mentioned the difference between the log ratio 
from Feature Extraction and the default M value from read.maimages.

    2. I can get the log ratios of all features, including those with 
control type -1 and 1, but these features don't have chromosome positions. 
Does this mean I can leave them all out of the quality assessment?

    3. Some probes with names like "chr2_random:xxxxx-yyyyyy" cannot be 
mapped to a proper chromosome position, so I should remove these values 
from the input to dlrs. Is that right?

    4. How should I handle those 1000 probes that are each repeated 3 times? 
Each group of three maps to the same chromosome position.
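
To make questions 2 to 4 concrete, here is a rough sketch of one way the input vector for dlrs could be prepared. It is not a confirmed answer. The data frame `probes` and its column names (ControlType, SystematicName, M) are hypothetical stand-ins for one array's annotation and log ratios. Dropping controls and "_random" probes, collapsing replicates to their median, and the chromosome list 1-20, X, Y for rat are all assumptions.

```r
# Hypothetical probe table for one array (names and values are made up)
probes <- data.frame(
  ControlType    = c(1, -1, 0, 0, 0, 0, 0, 0, 0, 0),
  SystematicName = c("DarkCorner", "NC1_00000002",
                     "chr2:5000-5060", "chr1:900-960", "chr1:100-160",
                     "chr2_random:10-70", "chr1:100-160", "chr1:100-160",
                     "chr10:300-360", "chr2:200-260"),
  M              = c(-3.0, -2.5, 0.10, -0.05, 0.02,
                     0.40, 0.06, -0.02, 0.00, 0.12),
  stringsAsFactors = FALSE
)

# Keep only normal probes (control type 0)
feat <- probes[probes$ControlType == 0, ]

# Keep only names of the form chr<name>:<start>-<end>;
# this drops entries such as chr2_random:...
pat  <- "^chr([0-9]+|X|Y):([0-9]+)-([0-9]+)$"
feat <- feat[grepl(pat, feat$SystematicName), ]

# Parse chromosome and start position out of the name
feat$chr   <- sub(pat, "\\1", feat$SystematicName)
feat$start <- as.numeric(sub(pat, "\\2", feat$SystematicName))

# Collapse probes sharing a position (the replicated ones) to their median
agg <- aggregate(feat["M"],
                 by = list(chr = feat$chr, start = feat$start),
                 FUN = median)

# Order by chromosome (assumed rat order 1..20, X, Y), then by position
chr_rank <- match(as.character(agg$chr), c(as.character(1:20), "X", "Y"))
agg <- agg[order(chr_rank, agg$start), ]

agg$M  # position-ordered log ratios, in the form dlrs() expects
```

Collapsing the replicates keeps zero-distance neighbours from contributing three near-identical positions to the differences. Whether that is preferable to keeping all three copies is exactly what question 4 asks.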

Regards,
Leon