[BioC] Integrating Codelink data with bioconductor (using affyand limmafunctions)

Mon Apr 25 13:35:57 CEST 2005

Dear Gordon,

Thanks for your response. I will use the data as early but, What do you 
think it could affect more to normalization process: Some points 
assigned as NA values or some point with lowers A values as one of the 
intensitues was assigned a value of say 0.01?

I'd let you see my class definition and parser of course. This is really 
the first time a make use of classes and store all things as an R 
package so I thought that the best way to make something usable and 
quick without having to read completly "writting R extensions" was using 
others packages to learn (that is one of the greatness of opensource :). 
Of course I will have to read it one day.
Briefly:
1. The parser read exported txt files from codelink software. It works 
fine with 3 different chips so I think it should work fine with all 
types. A problem is that exported text data have custom fields (and you 
can chose within all fields including Raw_intensity, Median_foreground, 
etc) So it could be possible to found files with missing fields not 
exported. I know that it is possible to export as XML but a didn't try 
that yet.

2. The class definition is very simple. I based it in RGlist and used 
almost all redefinitions of dim() as.matrix() etc... that you use in 
limma. I also based a subsetting system in the one used in AffyBatch 
objects in affy. A Codelink object stores as a list 3 matrices. One of 
intensities, one of Flags and one last with probe name and probe type. I 
actually named it "val" "flags" and "info" slots but i don't thing they 
are appropiate so this week I want to import all possible fields and 
name it as they are called in the exported files. I probably too make 
comprobation about the fields present and warn or error if a *must have* 
field is missing.

When I have a more clear and clean code I will not have any problems in 
let you see it.

D

Gordon Smyth escribió:
> Dear Diego,
> 
> We are interested in introducing Codelink support into limma at some 
> stage, so we'd be interested to see your class definition and parser at 
> some time. The main reason we don't support Codelink yet is some 
> uncertainty about how best to handle the multiple fields (sub-arrays) 
> which are part of the Codelink platform.
> 
> I think that you would almost certainly be better off avoiding missing 
> values in the first place rather than trying to accommodate them during 
> the normalization process. See for example:
> 
>   https://stat.ethz.ch/pipermail/bioconductor/2005-April/008348.html
> 
> Note that while normalizeQuantiles() does allow for missing values, the 
> missing values are assumed to have occured at random. If the missing 
> values are actually associated with low intensities, this is not missing 
> at random and the normalization process will be biased. Even if you use 
> a method other than quantile normalization, the whole process of 
> between-array normalization is problematic with missing values. Better 
> to avoid them.
> 
> Gordon
> 
>> Date: Fri, 22 Apr 2005 14:45:53 +0200
>> From: Diego D?ez Ruiz <ddiez at iib.uam.es>
>> Subject: [BioC] Integrating Codelink data with bioconductor (using
>>         affy and        limmafunctions)
>> To: bioconductor at stat.math.ethz.ch
>>
>> Dear BioC users and developers:
>>
>> I'm working with Codelink plattform microarray data and i want to apply
>> some of the knowledge currently available for other platforms as
>> Affymetrix o cDNA spotted microarrays to deal with this data. I have
>> been using Affymetrix data  for a while and i used suscessfully the
>> great affy package. I also worked with spoted cDNA data and use the
>> also great package limma mainly. Currently this is what I have done:
>>
>> 1. Write a parser for exported text data from the codelink software.
>>
>> 2. I found convenient to create a new class for storing data from this
>> platform because there are no unique identifiers as in Affymetrix and so
>> an exprSet object was not appropiate. I think an RGlist/MAlist-like
>> definition could do the job fine so I almost cut and paste the classes
>> definition in the limma package an modified it a little (any concerns
>> about using other's code is wellcome. I'm willing to release this code
>> under LGPL licence) to make a currently named "Codelink" class. The
>> class currently stores signal values, flag values, type of spot and
>> codelink identifiers. Also I make some annotation packages for the
>> human10k, humanwholegenome and ratwholegenome bioarrays but i have to
>> test it deeper. A remember and older thread in the BioC list about
>> modifiying the exprSet definition to deal with this problem. Is there
>> any plan to change it so I can use a more standard approach better that
>> create a new class?
>>
>> 3. Created/adapted some functions for plotting: Density plots, MAplot,
>> Scatter plot that let see the position of diferent type of spots in
>> codelink chips: that is FIDUCIAL, DISCOVERY, POSITIVE, etc.
>>
>>
>> All that worked more or less. Data can be imported as RAW data (default)
>> or normalized (Codelink median normalization) and as log2 values
>> (default) or not. When I calculate log2 values, i initially decided to
>> set negative values (obtained from subtracting background to signal
>> sometimes) to somethign like 0.01. This works fine with
>> normalize.quantiles and normalize.loess from affy package but create
>> artifact point in a MAplot and a low intensity peak in a density plot.
>> Then, I saw a post from Gordon Smyth (sorry dont have link) telling that
>> limma make negative values NA prior to log2. I wanted to do that and
>> modified the functions involved. Finaly this is when I found a mayor
>> problem:
>>
>> 1) normalize.loess: I got to modify this function to allow NA values.
>> The function compares each array whith the others so if you have 10
>> arrays then each array is compared and modified (after loess
>> estimation) 10 times. If in each iteration I select the non NA values
>> created when you calculate A or M then in each iteration you have a
>> different subset of genes used in the normalization process. As a little
>> example, if you have 4 arrays with 5 genes:
>>
>>         chip1   chip2   chip3   chip4
>> gen1    1       10      NA      5
>> gen2    3       NA      40      6
>> gen3    4       NA      NA      7
>> gen4    3       12      45      6
>> gen5    2       1       4       3
>>
>> normalize.loess take a matrix as argument and do all pairwise 
>> comparisons:
>>
>> chip1-chip2: used gen1,gen4,gen5 to estimate loess curve.
>> chip1-chip3: used gen2,gen4,gen5 to estimate loess curve.
>> chip1-chip4: used all genes to estimate loess curve.
>> chip2-chip3: used gen4,gen5 to estimate loess curve.
>> chip2-chip4: used gen1,gen4,gen5 to estimate loess curve.
>> chip3-chip4: used gen2,gen4,gen5 to estimate loess curve.
>>
>> The code modified/added to allow for NA is between #:
>> (from normalize.loess in affy 1.5.8)
>>
>> ------------------------------------------------------------------------
>>
>> normalize.loess <- function (mat, subset = sample(1:(dim(mat)[1]),
>> min(c(5000, nrow(mat)))),
>>      epsilon = 10^-2, maxit = 1, log.it = TRUE, verbose = TRUE,
>>      span = 2/3, family.loess = "symmetric")
>> {
>>      J <- dim(mat)[2]
>>      II <- dim(mat)[1]
>>      newData <- mat
>>      if (log.it) {
>>          mat <- log2(mat)
>>          newData <- log2(newData)
>>      }
>>      change <- epsilon + 1
>>      fs <- matrix(0, II, J)
>>      iter <- 0
>>      w <- c(0, rep(1, length(subset)), 0)
>>      while (iter < maxit) {
>>          iter <- iter + 1
>>          means <- matrix(0, II, J)
>>          for (j in 1:(J - 1)) {
>>              for (k in (j + 1):J) {
>>                  y <- newData[, j] - newData[, k]
>>                  x <- (newData[, j] + newData[, k])/2
>>                 #####################################
>>                  # Select genes that are not set to NA
>>                  sel <- which(!is.na(as.character(y)))
>>                  y <- y[sel]
>>                  x <- x[sel]
>>                  #####################################
>>                  index <- c(order(x)[1], subset, order(-x)[1])
>>                  xx <- x[index]
>>                  yy <- y[index]
>>                  aux <- loess(yy ~ xx, span = span, degree = 1,
>>                    weights = w, family = family.loess)
>>                  aux <- predict(aux, data.frame(xx = x))/J
>>                 #####################################
>>                  # Apply correction to genes not NA.
>>                  means[sel, j] <- means[sel, j] + aux
>>                  means[sel, k] <- means[sel, k] - aux
>>                 #####################################
>>                  if (verbose)
>>                    cat("Done with", j, "vs", k, " in iteration ",
>>                      iter, "\n")
>>              }
>>          }
>>          fs <- fs + means
>>          newData <- mat - fs
>>          change <- max(colMeans((means[subset, ])^2))
>>          if (verbose)
>>              cat(iter, change, "\n")
>>          oldfs <- fs
>>      }
>>      if ((change > epsilon) & (maxit > 1))
>>          warning(paste("No convergence after", maxit, "iterations.\n"))
>>      if (log.it) {
>>          return(2^newData)
>>      }
>>      else return(newData)
>> }
>>
>> ------------------------------------------------------------------------
>>
>> I'm not an statitician so not sure if this could be a correct approach:
>> Is this correct or I cannot do that?
>>
>> 2) With normalize.quantiles all seems to work fine... the function don't
>> say nothing about NA values and the density plots seems correct (almost
>> same density plot for all chips) but the MAplot have changed greatly
>> with a great number of genes with greater values of M (in absolute
>> value) that is, a wider cloud of points. I don't know why is this
>> happening. I also tried to see at the C code for normalize.quantiles (at
>> qnorm.c) but not found an easy answer. Hopefully I finally found that I
>> can use normalizeQuantiles from limma package that allows for NA values.
>>
>> By the way, is this really necesary or can I assign a low value to
>> negative ones?. I think it is better to use NA but...
>> Something I'm going to do sonner is to import all the values obtained
>> from the Codelink data (mean foreground and background, median
>> foreground and background, etc) so it could be possible to calculate a
>> signal value within R and then use a method that avoids negative values.
>> But that could be the same in some cases to assign a small value (Or I'm
>> wrong)
>>
>> I'm working with a linux box (Debian Sarge) with R 2.0.1 compiled from
>> sources and BioC 1.5.
>>
>> Any comment on all that would be very appreciated.
>>
>> Thanks in advanced!
>>
>>
>> D.
> 
>