[BioC] Integrating Codelink data with bioconductor (using
affyand limmafunctions)
Diego Díez Ruiz
ddiez at iib.uam.es
Mon Apr 25 13:35:57 CEST 2005
Dear Gordon,
Thanks for your response. I will use the data as early but, What do you
think it could affect more to normalization process: Some points
assigned as NA values or some point with lowers A values as one of the
intensitues was assigned a value of say 0.01?
I'd let you see my class definition and parser of course. This is really
the first time a make use of classes and store all things as an R
package so I thought that the best way to make something usable and
quick without having to read completly "writting R extensions" was using
others packages to learn (that is one of the greatness of opensource :).
Of course I will have to read it one day.
Briefly:
1. The parser read exported txt files from codelink software. It works
fine with 3 different chips so I think it should work fine with all
types. A problem is that exported text data have custom fields (and you
can chose within all fields including Raw_intensity, Median_foreground,
etc) So it could be possible to found files with missing fields not
exported. I know that it is possible to export as XML but a didn't try
that yet.
2. The class definition is very simple. I based it in RGlist and used
almost all redefinitions of dim() as.matrix() etc... that you use in
limma. I also based a subsetting system in the one used in AffyBatch
objects in affy. A Codelink object stores as a list 3 matrices. One of
intensities, one of Flags and one last with probe name and probe type. I
actually named it "val" "flags" and "info" slots but i don't thing they
are appropiate so this week I want to import all possible fields and
name it as they are called in the exported files. I probably too make
comprobation about the fields present and warn or error if a *must have*
field is missing.
When I have a more clear and clean code I will not have any problems in
let you see it.
D
Gordon Smyth escribió:
> Dear Diego,
>
> We are interested in introducing Codelink support into limma at some
> stage, so we'd be interested to see your class definition and parser at
> some time. The main reason we don't support Codelink yet is some
> uncertainty about how best to handle the multiple fields (sub-arrays)
> which are part of the Codelink platform.
>
> I think that you would almost certainly be better off avoiding missing
> values in the first place rather than trying to accommodate them during
> the normalization process. See for example:
>
> https://stat.ethz.ch/pipermail/bioconductor/2005-April/008348.html
>
> Note that while normalizeQuantiles() does allow for missing values, the
> missing values are assumed to have occured at random. If the missing
> values are actually associated with low intensities, this is not missing
> at random and the normalization process will be biased. Even if you use
> a method other than quantile normalization, the whole process of
> between-array normalization is problematic with missing values. Better
> to avoid them.
>
> Gordon
>
>> Date: Fri, 22 Apr 2005 14:45:53 +0200
>> From: Diego D?ez Ruiz <ddiez at iib.uam.es>
>> Subject: [BioC] Integrating Codelink data with bioconductor (using
>> affy and limmafunctions)
>> To: bioconductor at stat.math.ethz.ch
>>
>> Dear BioC users and developers:
>>
>> I'm working with Codelink plattform microarray data and i want to apply
>> some of the knowledge currently available for other platforms as
>> Affymetrix o cDNA spotted microarrays to deal with this data. I have
>> been using Affymetrix data for a while and i used suscessfully the
>> great affy package. I also worked with spoted cDNA data and use the
>> also great package limma mainly. Currently this is what I have done:
>>
>> 1. Write a parser for exported text data from the codelink software.
>>
>> 2. I found convenient to create a new class for storing data from this
>> platform because there are no unique identifiers as in Affymetrix and so
>> an exprSet object was not appropiate. I think an RGlist/MAlist-like
>> definition could do the job fine so I almost cut and paste the classes
>> definition in the limma package an modified it a little (any concerns
>> about using other's code is wellcome. I'm willing to release this code
>> under LGPL licence) to make a currently named "Codelink" class. The
>> class currently stores signal values, flag values, type of spot and
>> codelink identifiers. Also I make some annotation packages for the
>> human10k, humanwholegenome and ratwholegenome bioarrays but i have to
>> test it deeper. A remember and older thread in the BioC list about
>> modifiying the exprSet definition to deal with this problem. Is there
>> any plan to change it so I can use a more standard approach better that
>> create a new class?
>>
>> 3. Created/adapted some functions for plotting: Density plots, MAplot,
>> Scatter plot that let see the position of diferent type of spots in
>> codelink chips: that is FIDUCIAL, DISCOVERY, POSITIVE, etc.
>>
>>
>> All that worked more or less. Data can be imported as RAW data (default)
>> or normalized (Codelink median normalization) and as log2 values
>> (default) or not. When I calculate log2 values, i initially decided to
>> set negative values (obtained from subtracting background to signal
>> sometimes) to somethign like 0.01. This works fine with
>> normalize.quantiles and normalize.loess from affy package but create
>> artifact point in a MAplot and a low intensity peak in a density plot.
>> Then, I saw a post from Gordon Smyth (sorry dont have link) telling that
>> limma make negative values NA prior to log2. I wanted to do that and
>> modified the functions involved. Finaly this is when I found a mayor
>> problem:
>>
>> 1) normalize.loess: I got to modify this function to allow NA values.
>> The function compares each array whith the others so if you have 10
>> arrays then each array is compared and modified (after loess
>> estimation) 10 times. If in each iteration I select the non NA values
>> created when you calculate A or M then in each iteration you have a
>> different subset of genes used in the normalization process. As a little
>> example, if you have 4 arrays with 5 genes:
>>
>> chip1 chip2 chip3 chip4
>> gen1 1 10 NA 5
>> gen2 3 NA 40 6
>> gen3 4 NA NA 7
>> gen4 3 12 45 6
>> gen5 2 1 4 3
>>
>> normalize.loess take a matrix as argument and do all pairwise
>> comparisons:
>>
>> chip1-chip2: used gen1,gen4,gen5 to estimate loess curve.
>> chip1-chip3: used gen2,gen4,gen5 to estimate loess curve.
>> chip1-chip4: used all genes to estimate loess curve.
>> chip2-chip3: used gen4,gen5 to estimate loess curve.
>> chip2-chip4: used gen1,gen4,gen5 to estimate loess curve.
>> chip3-chip4: used gen2,gen4,gen5 to estimate loess curve.
>>
>> The code modified/added to allow for NA is between #:
>> (from normalize.loess in affy 1.5.8)
>>
>> ------------------------------------------------------------------------
>>
>> normalize.loess <- function (mat, subset = sample(1:(dim(mat)[1]),
>> min(c(5000, nrow(mat)))),
>> epsilon = 10^-2, maxit = 1, log.it = TRUE, verbose = TRUE,
>> span = 2/3, family.loess = "symmetric")
>> {
>> J <- dim(mat)[2]
>> II <- dim(mat)[1]
>> newData <- mat
>> if (log.it) {
>> mat <- log2(mat)
>> newData <- log2(newData)
>> }
>> change <- epsilon + 1
>> fs <- matrix(0, II, J)
>> iter <- 0
>> w <- c(0, rep(1, length(subset)), 0)
>> while (iter < maxit) {
>> iter <- iter + 1
>> means <- matrix(0, II, J)
>> for (j in 1:(J - 1)) {
>> for (k in (j + 1):J) {
>> y <- newData[, j] - newData[, k]
>> x <- (newData[, j] + newData[, k])/2
>> #####################################
>> # Select genes that are not set to NA
>> sel <- which(!is.na(as.character(y)))
>> y <- y[sel]
>> x <- x[sel]
>> #####################################
>> index <- c(order(x)[1], subset, order(-x)[1])
>> xx <- x[index]
>> yy <- y[index]
>> aux <- loess(yy ~ xx, span = span, degree = 1,
>> weights = w, family = family.loess)
>> aux <- predict(aux, data.frame(xx = x))/J
>> #####################################
>> # Apply correction to genes not NA.
>> means[sel, j] <- means[sel, j] + aux
>> means[sel, k] <- means[sel, k] - aux
>> #####################################
>> if (verbose)
>> cat("Done with", j, "vs", k, " in iteration ",
>> iter, "\n")
>> }
>> }
>> fs <- fs + means
>> newData <- mat - fs
>> change <- max(colMeans((means[subset, ])^2))
>> if (verbose)
>> cat(iter, change, "\n")
>> oldfs <- fs
>> }
>> if ((change > epsilon) & (maxit > 1))
>> warning(paste("No convergence after", maxit, "iterations.\n"))
>> if (log.it) {
>> return(2^newData)
>> }
>> else return(newData)
>> }
>>
>> ------------------------------------------------------------------------
>>
>> I'm not an statitician so not sure if this could be a correct approach:
>> Is this correct or I cannot do that?
>>
>> 2) With normalize.quantiles all seems to work fine... the function don't
>> say nothing about NA values and the density plots seems correct (almost
>> same density plot for all chips) but the MAplot have changed greatly
>> with a great number of genes with greater values of M (in absolute
>> value) that is, a wider cloud of points. I don't know why is this
>> happening. I also tried to see at the C code for normalize.quantiles (at
>> qnorm.c) but not found an easy answer. Hopefully I finally found that I
>> can use normalizeQuantiles from limma package that allows for NA values.
>>
>> By the way, is this really necesary or can I assign a low value to
>> negative ones?. I think it is better to use NA but...
>> Something I'm going to do sonner is to import all the values obtained
>> from the Codelink data (mean foreground and background, median
>> foreground and background, etc) so it could be possible to calculate a
>> signal value within R and then use a method that avoids negative values.
>> But that could be the same in some cases to assign a small value (Or I'm
>> wrong)
>>
>> I'm working with a linux box (Debian Sarge) with R 2.0.1 compiled from
>> sources and BioC 1.5.
>>
>> Any comment on all that would be very appreciated.
>>
>> Thanks in advanced!
>>
>>
>> D.
>
>
More information about the Bioconductor
mailing list