[BioC] duplicateCorrelation function and custom array design

Bela Tiwari btiwari at ceh.ac.uk
Fri Oct 29 16:45:17 CEST 2004


I have been working with data given to me by biologists who designed
and printed their own arrays. In short, they have 48 blocks with 1089
spots per block. Duplicate spots of a "gene" are printed on the same
block. I.e. a gene appears on an array in two spots, both within a single
block.

I wanted to run the duplicateCorrelation function on their data, and in
so doing, I discovered rather a lot about their array layout, which in
hindsight, I should have asked about first. Genes and species-specific
control spots have been spotted twice per block, with a nice even spacing
of 520 between duplicates. However, the arrays also have blank "buffer"
wells, which do not appear to be at a 520 spacing. Including them in an
object passed to the duplicateCorrelation function causes a failure when
it reaches the unwrapdups function. The latter function sensibly expects
the number of spots on an array, divided by the "spacing" and the number
of duplicates, to give a whole number. Because of the way the blank buffer
wells have been laid out and labelled in the GAL file, data with the array
layout they defined does not satisfy this. In some ways I am grateful that
the function failed, since it made me stop and find out a lot more about
the array layout!
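
The whole-number requirement described above can be checked before calling
the function. Below is a minimal sketch in base R. Note that
check_dup_layout is a hypothetical helper (it is not part of limma), and
the figure of 1040 gene and control spots per block is an assumption
inferred from two duplicates at a spacing of 520.

```r
# Hypothetical helper, not part of limma: TRUE when the number of rows
# splits evenly into groups of ndups * spacing spots.
check_dup_layout <- function(nspots, ndups = 2, spacing = 1) {
  nspots %% (ndups * spacing) == 0
}

nblocks <- 48
spots_per_block <- 1089      # from the array description
genes_per_block <- 2 * 520   # assumption: 2 duplicates x spacing of 520

# Whole array, including the blank buffer spots
full_ok <- check_dup_layout(nblocks * spots_per_block,
                            ndups = 2, spacing = 520)

# Gene and control spots only
subset_ok <- check_dup_layout(nblocks * genes_per_block,
                              ndups = 2, spacing = 520)
```

Under these assumptions the full array fails the check, while the subset
without buffer spots passes it.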

So, here are my questions:

Firstly, after normalising my RGList object and getting an MAList
object, I can get a list of indices for just the genes and
species-specific controls. I have read through the duplicateCorrelation
function and believe that if I input the indices into the function
sensibly, I can get a value for the correlation. However, just because I
believe it doesn't mean it's true! Below are the lines I have changed in
the duplicateCorrelation function so that only the gene and control
spots are used to generate the correlation values. If anyone out there
knows this function well and has a moment, can they check that what I
have done at least verges on sensible?  And if not, any advice on how to
deal with this situation would be most welcome!

Effectively, only 3 lines have changed - the parameter list now has
indices (an integer vector), and M and weights now take in only those
entries from the object with those indices.

mydupcorr <- function (object, indices, design = rep(1, ncol(M)),
    ndups = 2, spacing = 1, block = NULL, trim = 0.15, weights = NULL)
{
    if (is(object, "MAList")) {
        M <- object$M[indices,] #altered to add indices
        if (missing(design) && !is.null(object$design))
            design <- object$design
        if (missing(ndups) && !is.null(object$printer$ndups))
            ndups <- object$printer$ndups
        if (missing(spacing) && !is.null(object$printer$spacing))
            spacing <- object$printer$spacing
        if (missing(weights) && !is.null(object$weights))
            weights <- object$weights[indices,] #altered to add indices
        # ... rest of duplicateCorrelation unchanged ...

In my case, a sample command would be:

dupcorr <- mydupcorr(myMAList, indices, design = design1, ndups = 2,
spacing = 520)

Hmmm, as I write this, I have just realised that I could have done all of
this on the command line with the unmodified function, like:

dupcorr <- duplicateCorrelation(myMAList[indices,], design = design1,
ndups = 2, spacing = 520)

but my essential question remains the same - is this sensible?
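
One way to sanity-check a subset before trusting the correlation is to
confirm that, after subsetting, the duplicates still sit exactly "spacing"
rows apart. Below is a rough sketch in base R. dups_line_up is a
hypothetical helper (not a limma function), the spot IDs are made-up toy
data, and the reshaping only mirrors my understanding of the layout that
the spacing and ndups arguments imply.

```r
# Hypothetical helper, not part of limma: TRUE when every spot ID matches
# the ID found spacing, 2 * spacing, ... rows further on within each group
# of ndups * spacing rows.
dups_line_up <- function(ids, ndups = 2, spacing = 1) {
  n <- length(ids)
  if (n %% (ndups * spacing) != 0) return(FALSE)
  ngroups <- n / (ndups * spacing)
  # a[i, j, k] is the j-th copy of the i-th spot in group k
  a <- array(ids, dim = c(spacing, ndups, ngroups))
  first <- a[, rep(1, ndups), , drop = FALSE]
  all(a == first)
}

# Toy data: two blocks, three genes spotted twice at spacing 3,
# plus one blank buffer spot at the end of each block.
ids <- c("g1", "g2", "g3", "g1", "g2", "g3", "blank",
         "g4", "g5", "g6", "g4", "g5", "g6", "blank")

with_blanks    <- dups_line_up(ids, ndups = 2, spacing = 3)
kept           <- ids[ids != "blank"]
without_blanks <- dups_line_up(kept, ndups = 2, spacing = 3)
```

In this toy layout the check fails with the blanks included and passes
once they are dropped. If the buffer spots sat between the two copies of
a gene, removing them would change the effective spacing, so the check is
worth running on the real spot IDs.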

My second question is due to my lack of experience with the functions
involved - if I try to use the consensus correlation value generated via
the above function as input into the lmFit function, will it matter if I
include only the MAList elements for my genes and species-specific
controls? I.e. does it matter if I give myMAList[indices,] as the
object parameter to the lmFit function, rather than the whole MAList
object? I don't think lmFit needs to refer back to the array layout as
stored within myMAList$printer, but I'm not well versed enough to know
whether there are downstream effects of passing only a subset of the
MAList object to lmFit.

And finally, if you have made it this far in the email: if anyone has
suggestions for web pages, articles, or other documents that give
advice on how to design a good array layout, I'd love to hear about
them. The biologists I'm working with will be designing some new arrays
soon, and tips on how they should lay things out, especially with
regard to the "usual" requirements that software programs/functions
may have, would be great!

Thank you,

Bela Tiwari

Dr. Bela Tiwari
Lead Bioinformatician

CEH Oxford
Mansfield Road
Oxford, OX1 3SR
01865 281975
