[BioC] quality assessment and preprocessing for tiling array-based CGH data

Wed Oct 22 16:02:35 CEST 2008

On Wed, Oct 22, 2008 at 9:51 AM, Leon Yee <yee.leon at gmail.com> wrote:
> Dear all,
>
>    Is there any well-established routine for quality assessment and
> preprocessing of array CGH data, especially tiling array-based CGH data? I
> found that most quality assessment methods for array data are aimed at
> expression arrays, while few address array CGH data.
>    We are using the Agilent 244k CGH array for rat. We now have the text
> files produced by Feature Extraction, but don't know whether they are of
> good quality. Could anyone help provide some clues? Thanks in advance!
>
>    After read.maimages(), we got the RGList object, which contains several
> components including R, G, Rb, Gb, and so on.  The probes are of 3 types:
> -1, 1 and 0. 0 means normal probe; -1 means negative control, I guess, and
> the probe names are like (-)3xSLv1, NC1_00000002, etc. [no corresponding
> probe sequence]; 1 means positive control, I guess, and the probe names are
> like DarkCorner, DCP_008001.0, RnCGHBrightCorner, SRN_800002, etc. [no
> corresponding probe sequence].  There are 1275 probes of type -1 and 4217
> of type 1, each of which has its R, Rb, G, Gb values. Do we need these
> values for quality assessment and normalization? If so, how?
>    In addition, among the normal probes, 1000 probes are replicated 3 times
> on the array. How could we use these data for quality assessment and
> normalization?

You generally will not want to do any normalization besides a possible
shift of the center.  Any linear normalization that changes the slope
of the M vs. A plot, or any nonlinear normalization, will likely
attenuate real copy-number signal.  As for quality control, a good,
general measure to track is the dlrs (derivative log ratio spread), a
robust estimate of the probe-to-probe noise standard deviation computed
from the differences between adjacent log ratios:

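dlrs <-
  function(x) {
    # x: log ratios for one array, sorted by chromosome and position
    nx <- length(x)
    if (nx<3) {
      stop("Vector length>2 needed for computation")
    }
    # differences between each pair of adjacent probes
    tmp <- embed(x,2)
    diffs <- tmp[,2]-tmp[,1]
    # IQR/1.34 estimates the SD of the differences (normal data);
    # sqrt(2) rescales from a difference of two probes to a single probe
    dlrs <- IQR(diffs)/(sqrt(2)*1.34)
    return(dlrs)
  }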
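
As a rough check of the scaling constants, here is a small simulation
sketch. The noise level, segment means and probe counts are made-up
illustrative values, and the one-line dlrs is only a compact restatement
of the function above (the sign of the differences does not affect the
IQR), repeated so the snippet runs on its own. It suggests that dlrs
stays near the per-probe noise SD even when a copy-number step inflates
the ordinary standard deviation.

```r
# Compact restatement of the dlrs function, so this snippet is self-contained
dlrs <- function(x) IQR(diff(x)) / (sqrt(2) * 1.34)

set.seed(1)
noise_sd <- 0.2   # assumed per-probe noise SD (illustrative)

# Simulated log ratios in genomic order: a flat segment followed by a
# gained segment (the 0.58 shift is an arbitrary illustrative value)
x <- c(rnorm(5000, mean = 0,    sd = noise_sd),
       rnorm(5000, mean = 0.58, sd = noise_sd))

sd(x)    # inflated by the step between the two segments
dlrs(x)  # should stay close to noise_sd
```

Because only adjacent differences enter the calculation, a single
breakpoint contributes one large difference out of thousands, which the
IQR ignores.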

For Agilent arrays, the dlrs is generally around or under 0.2, though
the typical value can vary somewhat from lab to lab.  In any case, an
array whose dlrs is a clear outlier relative to the others is suspect.
The input to the above function is the vector of log ratios for a
single array, arranged in chromosome and position order.
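
To make the ordering and per-array use concrete, here is a minimal
sketch on simulated data. The names chr, pos and M are hypothetical
stand-ins for probe annotation and a probes-by-arrays matrix of log2
ratios (non-control probes only); how you obtain them from your own
RGList is not shown. The 0.2 cutoff is just the rough guide mentioned
above, not a hard rule.

```r
# Compact restatement of the dlrs function, so this snippet is self-contained
dlrs <- function(x) IQR(diff(x), na.rm = TRUE) / (sqrt(2) * 1.34)

set.seed(42)
n_probes <- 2000

# Hypothetical annotation, in arbitrary (unsorted) array order
chr <- sample(c("1", "2", "10", "X"), n_probes, replace = TRUE)
pos <- sample.int(1e8, n_probes)

# Hypothetical log2 ratios: probes in rows, arrays in columns;
# array a3 is simulated with much higher noise
M <- cbind(a1 = rnorm(n_probes,  0.10, 0.15),
           a2 = rnorm(n_probes, -0.05, 0.12),
           a3 = rnorm(n_probes,  0.00, 0.45))

# Sort by chromosome, then position (rat: 20 autosomes plus X and Y)
chr_levels <- c(as.character(1:20), "X", "Y")
ord <- order(match(chr, chr_levels), pos)
M_sorted <- M[ord, , drop = FALSE]

# Center shift only: subtract each array's median log ratio
M_centered <- sweep(M_sorted, 2, apply(M_sorted, 2, median, na.rm = TRUE))

# One dlrs value per array; flag arrays above the rough 0.2 guide
qc <- apply(M_centered, 2, dlrs)
qc
names(qc)[qc > 0.2]
```

Note that the median shift does not change the dlrs, since a constant
cancels in the adjacent differences; it is included only because it is
the one normalization step suggested above.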

Sean