[BioC] Drosophila tiling array background correction
Tobias Straub
tstraub at med.uni-muenchen.de
Wed Jul 16 18:22:34 CEST 2008
Hi,
you can extract the signals manually (watch out that takes some time..).
#####################################################
library(affy)
library(affxparser)
#here the location of your bpmap
bpmap_file <- "Dm35b_MR_v02-3_BDGPv4h.bpmap"
bpmap <- readBpmap(bpmap_file)
summary(bpmap)
# you get a list of chromosome mappings only a few of them are
Drosophila ones
# these you specify in here:
allchrs <- c("Dm:BDGPv4h;chr2L",
"Dm:BDGPv4h;chr2R","Dm:BDGPv4h;chr2h","Dm:BDGPv4h;chr3L",
"Dm:BDGPv4h;chr3R","Dm:BDGPv4h;chr3h","Dm:BDGPv4h;chr4",
"Dm:BDGPv4h;chr4h","Dm:BDGPv4h;chrU","Dm:BDGPv4h;chrX",
"Dm:BDGPv4h;chrXh","Dm:BDGPv4h;chrYh")
cel_files <- c("file1.cel","file2.cel",
"file3.cel", "file4.cel",
"file5.cel", "file6.cel")
cel_header <- readCelHeader(cel_files[1])
cel_indices <- lapply(bpmap, function(x) xy2indices(x[["pmx"]],
x[["pmy"]], nr=cel_header[["cols"]]))
chr<-allchrs[1]
data <- readCelIntensities(cel_files, cel_indices[[chr]])
signals <- data
colnames(signals) <- cel_files
layout <- data.frame(chr=
factor(bpmap[[chr]][["seqInfo"]][["name"]]), seq=
bpmap[[chr]][["probeseq"]], pos=bpmap[[chr]][["startpos"]],
stringsAsFactors = FALSE)
layout$probe.id <- paste(layout[,1], layout[,3], sep="_")
lm <- layout
for (chr in allchrs[2:length(allchrs)]) {
data <- readCelIntensities(cel_files, cel_indices[[chr]])
signals <-rbind(signals, data)
layout <- data.frame(chr=
factor(bpmap[[chr]][["seqInfo"]][["name"]]), seq=
bpmap[[chr]][["probeseq"]], pos=bpmap[[chr]][["startpos"]],
stringsAsFactors = FALSE)
layout$probe.id <- paste(layout[,1], layout[,3], sep="_")
lm <- rbind(lm, layout)
}
rownames(signals)<-lm$probe.id
layout<-data.frame("probe.id"=lm$probe.id, "chr"=sub("chr","",lm$chr),
"pos"=lm$pos, "sequence"=lm$seq)
layout$length=c(25)
####################################
you end up with a 'signal' matrix and a 'layout' dataframe
the signals you can easily normalize without having them in an
AffyBatch object.
sth. like
norm <- exprs(vsn2(signals, subsample=200000))
or whatever else
best
Tobias
On Jul 16, 2008, at 4:01 PM, Steve Lianoglou wrote:
> Hi Ben,
>
> I don't really have any answers for you, but I'm interested in
> hearing them myself as well :-)
>
> I'm making a few comments below ... I'm sorry, they turned out to be
> rather long and I'm not making any claim that I know exactly what
> I'm doing, I'm just trying what makes most sense to me at the moment.
>
> If anybody has any positive/negative feedback, I'd be happy to hear.
>
> On Jul 15, 2008, at 8:53 PM, Benjamin Lansdell wrote:
>
>> Hi,
>>
>> I have been trying to analyse drosophila development tiling arrays
>> using Bioconductor. In particular I would like to perform some form
>> of background correction for probe affinities (such as GCRMA) and
>> then quantile normalisation but have been unable to find any
>> packages that can do so without a .cdf file. I have a collection
>> of .cel files and a .bpmap file.
>
> I've been wrestling with 1.0R Drosophila tiling array myself and
> haven't really been able to use the AffyBatch objects I get (from
> reading their .CEL files with ReadAffy) with any success. By that I
> mean I haven't been able to then pass the batch object into any
> other processing step for further analysis/QA assessment, even
> though I think I made the appropriate CDFs correctly (more on that
> below)
>
> So, really the only thing I use the AffyBatch object is to just get
> the expression info from it:
>
> > exps <- exprs(myAffyBatch)
>
> You can then quantile normalize the data by just passing the exps
> expression matrix into the affy::normalize.quantiles function since
> it only expects a matrix.
>
> I've also done my own MvA plots for rudimentary QA.
>
>> I know the package oligo can perform RMA using a package that you
>> build from a .bpmap file but this isn't quite what I want.
>
> Can it? My impression was that RMA relies on the idea of "probe
> sets," which I don't think really applies to tiling arrays as much.
> I see that it does SNPRMA ... I don't really know what the details
> of that are, though.
>
> Certainly you can annotate your probes to construct your own probe
> sets, though, but you also have a majority of probes on the array
> that don't really belong to any probe set.
>
>> Is it possible to create the necessary structures myself
>> (AffyBatch, something that contains the probe sequences, etc) for
>> use with the many packages that seem to rely on a .cdf file?
>
>
> I've built the appropriate CDF-like structure (I think) by using the
> makePlatformDesign::makePDpackage, but still never really got it act
> like a CDF for my affybatch objects like Bioconductor expects. If
> you haven't done so, I've put mine up previously for someone else to
> download, and you can still grab them. They were built for both
> chips of the 1.0R version of the tiling array (forward and backward
> strand (aka MF and MR)) on a 64-bit Linux machine:
>
> http://cbio.mskcc.org/~lianos/files/bioconductor/
>
> I've been manually attacking this data, probably doing more work
> than necessary since I can't figure out how to leverage most of
> bioconductor tools just yet.
>
> Ideally I'd like to take a similar approach to what was done here:
> http://genomebiology.com/2008/9/7/R112
>
> The methods appeared earlier in the proceedings of this year's PSB
> conference. They've also provided MATLAB source code implementing
> their analysis techniques:
> http://www.fml.tuebingen.mpg.de/raetsch/projects/PSBTiling
>
> In reality, as a first approach, I'm probably just going to
> implement (in R) their slightly modified version of M. Gerstein's
> Sequence Quantile Normalization:
> http://bioinformatics.oxfordjournals.org/cgi/content/abstract/23/8/988
>
> and explore the signals I'm getting off different loci of the
> genome. I'm hoping I can pull this off within the time constraints
> of my current rotation project. If others find it useful, and I'm
> successful, I'd be happy to share it back w/ the R/BioC community.
>
> In my adventures, I've created some SQLite database that you might
> find useful:
>
> (i) A DB with all of the annotation data for the Dmel (v5) genome
> (protein coding genes, RNA's, exonic positions, etc) from the GTF
> file taken from ensembl:
> ftp://ftp.ensembl.org/pub/current_gtf/
>
> (ii) A DB with meta data about the tiling chips, this includes the
> following tables:
> (a) probe information as extracted from the bpmap file (sequence, x/
> y chip coordinates, pm/mm, etc)
> (b) alignment information for every probe to the latest release of
> the genome. For this I used MUMmer and had it return all alignments
> >= 23 NTs in length (taking Wolfgang Huber's choice from his
> tilingArray package)
> (c) a table used for quick lookup of how many times each probe
> aligns perfectly, and "almost perfectly" to the reference genome.
>
> I'm now building RData probe_annotation objects for each chromosome
> that tie this information together. This data.frame essentially has:
>
> (i) probe information (sequence, x/y, pm/mm)
> (ii) number of perfect/close hits for the probe
> (iii) where in the genome the probe (perfectly) lands: gene name,
> type of gene (protein coding, miRNA, snoRNA, rRNA, etc) and whether
> it's exonic/intronic.
>
> If you think any of these would be helpful to you, let me know and
> I'll try to put them up when I think they're reasonably correct.
>
> Like I said before, I'm probably not leveraging the BioC tools that
> are available as I've gotten lost in the myriad of options (which
> are good to have) that are there.
>
> -steve
>
> --
> Steve Lianoglou
> Graduate Student: Physiology, Biophysics and Systems Biology
> Weill Cornell Medical College
>
> http://cbio.mskcc.org/~lianos
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
======================================================================
Dr. Tobias Straub Adolf-Butenandt-Institute, Molecular Biology
tel: +49-89-2180 75 439 Schillerstr. 44, 80336 Munich, Germany
More information about the Bioconductor
mailing list