[BioC] Drosophila tiling array background correction
Steve Lianoglou
mailinglist.honeypot at gmail.com
Wed Jul 16 16:01:43 CEST 2008
Hi Ben,
I don't really have any answers for you, but I'm interested in hearing
them myself as well :-)
I'm making a few comments below. I'm sorry they turned out to be
rather long. I'm not claiming that I know exactly what I'm doing; I'm
just trying what makes the most sense to me at the moment.
If anybody has any feedback, positive or negative, I'd be happy to
hear it.
On Jul 15, 2008, at 8:53 PM, Benjamin Lansdell wrote:
> Hi,
>
> I have been trying to analyse drosophila development tiling arrays
> using Bioconductor. In particular I would like to perform some form
> of background correction for probe affinities (such as GCRMA) and
> then quantile normalisation but have been unable to find any
> packages that can do so without a .cdf file. I have a collection
> of .cel files and a .bpmap file.
I've been wrestling with the 1.0R Drosophila tiling arrays myself and
haven't really been able to use the AffyBatch objects I get (from
reading their .CEL files with ReadAffy) with any success. By that I
mean I haven't been able to pass the batch object into any other
processing step for further analysis/QA assessment, even though I
think I made the appropriate CDFs correctly (more on that below).
So, really the only thing I use the AffyBatch object for is to get
the expression info out of it:
> exps <- exprs(myAffyBatch)
You can then quantile normalize the data by passing the exps
expression matrix to the preprocessCore::normalize.quantiles function,
since it only expects a matrix.
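
For anyone who wants to see what quantile normalization does to such a
matrix, here is a minimal sketch. It is written in plain Python rather
than R, purely as an illustration, and it is not the Bioconductor
implementation: the function name quantile_normalize and the toy numbers
are made up, and ties are broken by position rather than averaged.

```python
def quantile_normalize(columns):
    """Quantile-normalise equal-length columns (one list per array).

    Illustrative sketch only: ties are broken by position, not averaged.
    """
    n = len(columns[0])
    if any(len(col) != n for col in columns):
        raise ValueError("all columns must have the same length")
    # Sort each column, then average across columns at each rank to get
    # the shared reference distribution.
    sorted_cols = [sorted(col) for col in columns]
    reference = [sum(vals) / len(columns) for vals in zip(*sorted_cols)]
    normalized = []
    for col in columns:
        # Probe indices ordered by intensity (stable for ties).
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]  # replace value by reference quantile
        normalized.append(out)
    return normalized


# Toy data: three "arrays" with four probes each (made-up numbers).
arrays = [[5.0, 2.0, 3.0, 4.0], [4.0, 1.0, 4.0, 2.0], [3.0, 4.0, 6.0, 8.0]]
normalized = quantile_normalize(arrays)
```

After this step every column holds the same set of values, only in a
different probe order.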
I've also done my own MvA plots for rudimentary QA.
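
The numbers behind an MvA plot are simple to compute by hand. The sketch
below (plain Python again, only for illustration) assumes two arrays of
raw, strictly positive intensities; ma_values is a made-up name and the
intensities are toy values.

```python
import math


def ma_values(x, y):
    """Return (M, A) lists for two arrays of positive raw intensities.

    M is the log2 ratio and A the mean log2 intensity of each probe.
    """
    if len(x) != len(y):
        raise ValueError("arrays must have the same number of probes")
    m, a = [], []
    for xi, yi in zip(x, y):
        lx, ly = math.log2(xi), math.log2(yi)
        m.append(lx - ly)        # M: difference of log intensities
        a.append((lx + ly) / 2)  # A: average of log intensities
    return m, a


# Toy intensities for three probes on two arrays (made-up numbers).
m, a = ma_values([8.0, 16.0, 4.0], [2.0, 16.0, 16.0])
```

Plotting M against A for each pair of arrays should give a cloud centred
on M = 0 if the arrays are comparable.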
> I know the package oligo can perform RMA using a package that you
> build from a .bpmap file but this isn't quite what I want.
Can it? My impression was that RMA relies on the idea of "probe sets,"
which I don't think really applies to tiling arrays as much. I see
that it does SNPRMA ... I don't really know what the details of that
are, though.
You can certainly annotate your probes to construct your own probe
sets, but the majority of probes on the array still won't really
belong to any probe set.
> Is it possible to create the necessary structures myself (AffyBatch,
> something that contains the probe sequences, etc) for use with the
> many packages that seem to rely on a .cdf file?
I've built what I think is the appropriate CDF-like structure using
makePlatformDesign::makePDpackage, but I never really got it to act
like a CDF for my AffyBatch objects the way Bioconductor expects. In
case you haven't built one yourself, I put mine up previously for
someone else to download, and you can still grab them. They were built
on a 64-bit Linux machine for both chips of the 1.0R version of the
tiling array (forward and backward strand, aka MF and MR):
http://cbio.mskcc.org/~lianos/files/bioconductor/
I've been manually attacking this data, probably doing more work than
necessary since I can't figure out how to leverage most of the
Bioconductor tools just yet.
Ideally I'd like to take a similar approach to what was done here:
http://genomebiology.com/2008/9/7/R112
The methods appeared earlier in the proceedings of this year's PSB
conference. They've also provided MATLAB source code implementing
their analysis techniques:
http://www.fml.tuebingen.mpg.de/raetsch/projects/PSBTiling
In reality, as a first approach, I'm probably just going to implement
(in R) their slightly modified version of M. Gerstein's Sequence
Quantile Normalization:
http://bioinformatics.oxfordjournals.org/cgi/content/abstract/23/8/988
and explore the signals I'm getting off different loci of the genome.
I'm hoping I can pull this off within the time constraints of my
current rotation project. If others find it useful, and I'm
successful, I'd be happy to share it back w/ the R/BioC community.
In my adventures, I've created some SQLite databases that you might
find useful:
(i) A DB with all of the annotation data for the Dmel (v5) genome
(protein coding genes, RNAs, exonic positions, etc) from the GTF file
taken from Ensembl:
ftp://ftp.ensembl.org/pub/current_gtf/
(ii) A DB with metadata about the tiling chips. This includes the
following tables:
(a) probe information as extracted from the bpmap file (sequence,
x/y chip coordinates, pm/mm, etc)
(b) alignment information for every probe to the latest release of
the genome. For this I used MUMmer and had it return all alignments >=
23 NTs in length (taking Wolfgang Huber's choice from his tilingArray
package)
(c) a table used for quick lookup of how many times each probe
aligns perfectly, and "almost perfectly" to the reference genome.
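
To make the layout of (ii) concrete, here is a rough sketch using
Python's built-in sqlite3 module. All table and column names, the toy
rows, and the cut-offs are assumptions made for illustration, not my
actual schema: probes are taken to be 25-mers, a "perfect" hit is a
25 nt match, and "almost perfect" is anything from 23 up to 24 nt.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- (a) probe information from the bpmap file (hypothetical columns)
CREATE TABLE probe (
    probe_id INTEGER PRIMARY KEY,
    sequence TEXT NOT NULL,
    x INTEGER NOT NULL,
    y INTEGER NOT NULL,
    is_pm INTEGER NOT NULL
);
-- (b) one row per alignment of a probe to the genome
CREATE TABLE alignment (
    probe_id INTEGER NOT NULL REFERENCES probe(probe_id),
    chromosome TEXT NOT NULL,
    start_pos INTEGER NOT NULL,
    match_length INTEGER NOT NULL
);
""")

# Toy rows only; sequences and positions are made up.
con.executemany("INSERT INTO probe VALUES (?, ?, ?, ?, ?)", [
    (1, "ACGTACGTACGTACGTACGTACGTA", 10, 20, 1),
    (2, "TTGACCATGGCATTAGCCGATAGCA", 11, 20, 1),
    (3, "GGCATTCAGGCATTAACGGATCCAT", 12, 20, 1),
])
con.executemany("INSERT INTO alignment VALUES (?, ?, ?, ?)", [
    (1, "2L", 100, 25), (1, "3R", 500, 25), (1, "X", 42, 23),
    (2, "2L", 900, 25),
])

# (c) quick-lookup table: number of perfect (25 nt) and near (< 25 nt)
# hits per probe. Probes without any alignment get 0 for both counts.
con.execute("""
CREATE TABLE hit_count AS
SELECT p.probe_id,
       COALESCE(SUM(a.match_length = 25), 0) AS perfect_hits,
       COALESCE(SUM(a.match_length < 25), 0) AS near_hits
FROM probe AS p
LEFT JOIN alignment AS a ON a.probe_id = p.probe_id
GROUP BY p.probe_id
""")

rows = con.execute(
    "SELECT probe_id, perfect_hits, near_hits FROM hit_count ORDER BY probe_id"
).fetchall()
```

Probes with exactly one perfect hit and no near hits are the ones I'd
trust most when reading off a signal at a given locus.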
I'm now building RData probe_annotation objects for each chromosome
that tie this information together. Each data.frame essentially has:
(i) probe information (sequence, x/y, pm/mm)
(ii) number of perfect/close hits for the probe
(iii) where in the genome the probe (perfectly) lands: gene name, type
of gene (protein coding, miRNA, snoRNA, rRNA, etc) and whether it's
exonic/intronic.
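
The lookup in (iii) boils down to an interval check. Below is a
simplified sketch in plain Python, not the code I'm actually running:
the Gene class, the annotate function and the toy genes are all
hypothetical, a probe is reduced to a single 1-based position, and the
linear scan would need to become an indexed lookup for a real
chromosome.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene record; coordinates are 1-based and inclusive."""
    name: str
    biotype: str
    start: int
    end: int
    exons: tuple  # tuple of (start, end) pairs


def annotate(position, genes):
    """Return (gene name, biotype, region) for a genomic position."""
    for gene in genes:
        if gene.start <= position <= gene.end:
            in_exon = any(s <= position <= e for s, e in gene.exons)
            return gene.name, gene.biotype, "exonic" if in_exon else "intronic"
    # No gene overlaps this position.
    return None, None, "intergenic"


# Toy annotation for one chromosome (made-up names and coordinates).
genes = [
    Gene("geneA", "protein_coding", 100, 500, ((100, 200), (400, 500))),
    Gene("geneB", "snoRNA", 900, 980, ((900, 980),)),
]

exonic_hit = annotate(150, genes)
intronic_hit = annotate(300, genes)
intergenic_hit = annotate(700, genes)
```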
If you think any of these would be helpful to you, let me know and
I'll try to put them up when I think they're reasonably correct.
Like I said before, I'm probably not leveraging the BioC tools that
are available as I've gotten lost in the myriad of options (which are
good to have) that are there.
-steve
--
Steve Lianoglou
Graduate Student: Physiology, Biophysics and Systems Biology
Weill Cornell Medical College
http://cbio.mskcc.org/~lianos