[BioC] Drosophila tiling array background correction

Wed Jul 16 16:01:43 CEST 2008

Hi Ben,

I don't really have any answers for you, but I'm interested in hearing  
them myself as well :-)

I'm making a few comments below ... I'm sorry, they turned out to be  
rather long and I'm not making any claim that I know exactly what I'm  
doing, I'm just trying what makes most sense to me at the moment.

If anybody has any positive/negative feedback, I'd be happy to hear.

On Jul 15, 2008, at 8:53 PM, Benjamin Lansdell wrote:

> Hi,
>
> I have been trying to analyse drosophila development tiling arrays  
> using Bioconductor. In particular I would like to perform some form  
> of background correction for probe affinities (such as GCRMA) and  
> then quantile normalisation but have been unable to find any  
> packages that can do so without a .cdf file. I have a collection  
> of .cel files and a .bpmap file.

I've been wrestling with 1.0R Drosophila tiling array myself and  
haven't really been able to use the AffyBatch objects I get (from  
reading their .CEL files with ReadAffy) with any success. By that I  
mean I haven't been able to then pass the batch object into any other  
processing step for further analysis/QA assessment, even though I  
think I made the appropriate CDFs correctly (more on that below)

So, really the only thing I use the AffyBatch object is to just get  
the expression info from it:

 > exps <- exprs(myAffyBatch)

You can then quantile normalize the data by just passing the exps  
expression matrix into the affy::normalize.quantiles function since it  
only expects a matrix.

I've also done my own MvA plots for rudimentary QA.

> I know the package oligo can perform RMA using a package that you  
> build from a .bpmap file but this isn't quite what I want.

Can it? My impression was that RMA relies on the idea of "probe sets,"  
which I don't think really applies to tiling arrays as much. I see  
that it does SNPRMA ... I don't really know what the details of that  
are, though.

Certainly you can annotate your probes to construct your own probe  
sets, though, but you also have a majority of probes on the array that  
don't really belong to any probe set.

> Is it possible to create the necessary structures myself (AffyBatch,  
> something that contains the probe sequences, etc) for use with the  
> many packages that seem to rely on a .cdf file?

I've built the appropriate CDF-like structure (I think) by using the  
makePlatformDesign::makePDpackage, but still never really got it act  
like a CDF for my affybatch objects like Bioconductor expects. If you  
haven't done so, I've put mine up previously for someone else to  
download, and you can still grab them. They were built for both chips  
of the 1.0R version of the tiling array (forward and backward strand  
(aka MF and MR)) on a 64-bit Linux machine:

http://cbio.mskcc.org/~lianos/files/bioconductor/

I've been manually attacking this data, probably doing more work than  
necessary since I can't figure out how to leverage most of  
bioconductor tools just yet.

Ideally I'd like to take a similar approach to what was done here:
http://genomebiology.com/2008/9/7/R112

The methods appeared earlier in the proceedings of this year's PSB  
conference. They've also provided MATLAB source code implementing  
their analysis techniques:
http://www.fml.tuebingen.mpg.de/raetsch/projects/PSBTiling

In reality, as a first approach, I'm probably just going to implement  
(in R) their slightly modified version of M. Gerstein's Sequence  
Quantile Normalization:
http://bioinformatics.oxfordjournals.org/cgi/content/abstract/23/8/988

and explore the signals I'm getting off different loci of the genome.  
I'm hoping I can pull this off within the time constraints of my  
current rotation project. If others find it useful, and I'm  
successful, I'd be happy to share it back w/ the R/BioC community.

In my adventures, I've created some SQLite database that you might  
find useful:

(i) A DB with all of the annotation data for the Dmel (v5) genome  
(protein coding genes, RNA's, exonic positions, etc) from the GTF file  
taken from ensembl:
ftp://ftp.ensembl.org/pub/current_gtf/

(ii) A DB with meta data about the tiling chips, this includes the  
following tables:
   (a) probe information as extracted from the bpmap file (sequence, x/ 
y chip coordinates, pm/mm, etc)
   (b) alignment information for every probe to the latest release of  
the genome. For this I used MUMmer and had it return all alignments >=  
23 NTs in length (taking Wolfgang Huber's choice from his tilingArray  
package)
   (c) a table used for quick lookup of how many times each probe  
aligns perfectly, and "almost perfectly" to the reference genome.

I'm now building RData probe_annotation objects for each chromosome  
that tie this information together. This data.frame essentially has:

(i) probe information (sequence, x/y, pm/mm)
(ii) number of perfect/close hits for the probe
(iii) where in the genome the probe (perfectly) lands: gene name, type  
of gene (protein coding, miRNA, snoRNA, rRNA, etc) and whether it's  
exonic/intronic.

If you think any of these would be helpful to you, let me know and  
I'll try to put them up when I think they're reasonably correct.

Like I said before, I'm probably not leveraging the BioC tools that  
are available as I've gotten lost in the myriad of options (which are  
good to have) that are there.

-steve

--
Steve Lianoglou
Graduate Student: Physiology, Biophysics and Systems Biology
Weill Cornell Medical College

http://cbio.mskcc.org/~lianos