[BioC] Drosophila tiling array background correction

Thu Jul 17 17:37:07 CEST 2008

Thanks for your replies. Tobias, I'm hoping it won't come to creating  
everything manually but if it does thanks for the code!

Steve, we seem to be having the same problem. I can use ReadAffy, and  
can even specify the package that I create using makePDpackage (which  
I outlined here http://article.gmane.org/ 
gmane.science.biology.informatics.conductor/19012 ) but not use the  
AffyBatch for anything other than exprs().

In reference to using RMA in oligo I should have been clearer. I'm  
really only interested in background correction and normalisation.  
The summarisation step doesn't make as much sense. oligo seems to be  
able to do rma() using its TilingFeatureSet object which would  
presumably not perform summarisation. I haven't checked whether it's  
given the right answer, and that it hasn't summarised something that  
it shouldn't have, because I'd like to try something more involved  
like GCRMA correction but perhaps it's something you can consider.

Thanks for the links you provided. I'm looking at the Matlab source  
code at the moment. I'll let you know how it goes.

Regards,
Ben.

On 17/07/2008, at 2:22 AM, Tobias Straub wrote:

> Hi,
> you can extract the signals manually (watch out that takes some  
> time..).
>
> #####################################################
>
> library(affy)
> library(affxparser)
>
> #here the location of your bpmap
> bpmap_file <- "Dm35b_MR_v02-3_BDGPv4h.bpmap"
> bpmap <- readBpmap(bpmap_file)
> summary(bpmap)
>
> # you get a list of chromosome mappings only a few of them are  
> Drosophila ones
> # these you specify in here:
> allchrs <- c("Dm:BDGPv4h;chr2L",
> 	"Dm:BDGPv4h;chr2R","Dm:BDGPv4h;chr2h","Dm:BDGPv4h;chr3L",
> 	"Dm:BDGPv4h;chr3R","Dm:BDGPv4h;chr3h","Dm:BDGPv4h;chr4",
> 	"Dm:BDGPv4h;chr4h","Dm:BDGPv4h;chrU","Dm:BDGPv4h;chrX",
> 	"Dm:BDGPv4h;chrXh","Dm:BDGPv4h;chrYh")
>
> cel_files <- c("file1.cel","file2.cel",
> 	"file3.cel", "file4.cel",
> 	"file5.cel", "file6.cel")
>
> cel_header <- readCelHeader(cel_files[1])
> cel_indices <- lapply(bpmap, function(x) xy2indices(x[["pmx"]], x 
> [["pmy"]], nr=cel_header[["cols"]]))
>
> chr<-allchrs[1]
> data <- readCelIntensities(cel_files, cel_indices[[chr]])
> signals <- data
> colnames(signals) <- cel_files
> layout <- data.frame(chr=
> 	factor(bpmap[[chr]][["seqInfo"]][["name"]]), seq=
> 	bpmap[[chr]][["probeseq"]], pos=bpmap[[chr]][["startpos"]],
> 	stringsAsFactors = FALSE)
> layout$probe.id <- paste(layout[,1], layout[,3], sep="_")
> lm <- layout
>
> for (chr in allchrs[2:length(allchrs)]) {
> 	data <- readCelIntensities(cel_files, cel_indices[[chr]])
> 	signals <-rbind(signals, data)
> 	layout <- data.frame(chr=
> 		factor(bpmap[[chr]][["seqInfo"]][["name"]]), seq=
> 		bpmap[[chr]][["probeseq"]], pos=bpmap[[chr]][["startpos"]],
> 		stringsAsFactors = FALSE)
> 	layout$probe.id <- paste(layout[,1], layout[,3], sep="_")
> 	lm <- rbind(lm, layout)
> }
>
> rownames(signals)<-lm$probe.id
> layout<-data.frame("probe.id"=lm$probe.id, "chr"=sub("chr","",lm 
> $chr), "pos"=lm$pos, "sequence"=lm$seq)
> layout$length=c(25)
>
> ####################################
>
> you end up with a 'signal' matrix and a 'layout' dataframe
> the signals you can easily normalize without having them in an  
> AffyBatch object.
>
> sth. like
>
> norm <- exprs(vsn2(signals, subsample=200000))
>
> or whatever else
>
> best
> Tobias
>
>
> On Jul 16, 2008, at 4:01 PM, Steve Lianoglou wrote:
>
>> Hi Ben,
>>
>> I don't really have any answers for you, but I'm interested in  
>> hearing them myself as well :-)
>>
>> I'm making a few comments below ... I'm sorry, they turned out to  
>> be rather long and I'm not making any claim that I know exactly  
>> what I'm doing, I'm just trying what makes most sense to me at the  
>> moment.
>>
>> If anybody has any positive/negative feedback, I'd be happy to hear.
>>
>> On Jul 15, 2008, at 8:53 PM, Benjamin Lansdell wrote:
>>
>>> Hi,
>>>
>>> I have been trying to analyse drosophila development tiling  
>>> arrays using Bioconductor. In particular I would like to perform  
>>> some form of background correction for probe affinities (such as  
>>> GCRMA) and then quantile normalisation but have been unable to  
>>> find any packages that can do so without a .cdf file. I have a  
>>> collection of .cel files and a .bpmap file.
>>
>> I've been wrestling with 1.0R Drosophila tiling array myself and  
>> haven't really been able to use the AffyBatch objects I get (from  
>> reading their .CEL files with ReadAffy) with any success. By that  
>> I mean I haven't been able to then pass the batch object into any  
>> other processing step for further analysis/QA assessment, even  
>> though I think I made the appropriate CDFs correctly (more on that  
>> below)
>>
>> So, really the only thing I use the AffyBatch object is to just  
>> get the expression info from it:
>>
>> > exps <- exprs(myAffyBatch)
>>
>> You can then quantile normalize the data by just passing the exps  
>> expression matrix into the affy::normalize.quantiles function  
>> since it only expects a matrix.
>>
>> I've also done my own MvA plots for rudimentary QA.
>>
>>> I know the package oligo can perform RMA using a package that you  
>>> build from a .bpmap file but this isn't quite what I want.
>>
>> Can it? My impression was that RMA relies on the idea of "probe  
>> sets," which I don't think really applies to tiling arrays as  
>> much. I see that it does SNPRMA ... I don't really know what the  
>> details of that are, though.
>>
>> Certainly you can annotate your probes to construct your own probe  
>> sets, though, but you also have a majority of probes on the array  
>> that don't really belong to any probe set.
>>
>>> Is it possible to create the necessary structures myself  
>>> (AffyBatch, something that contains the probe sequences, etc) for  
>>> use with the many packages that seem to rely on a .cdf file?
>>
>>
>> I've built the appropriate CDF-like structure (I think) by using  
>> the makePlatformDesign::makePDpackage, but still never really got  
>> it act like a CDF for my affybatch objects like Bioconductor  
>> expects. If you haven't done so, I've put mine up previously for  
>> someone else to download, and you can still grab them. They were  
>> built for both chips of the 1.0R version of the tiling array  
>> (forward and backward strand (aka MF and MR)) on a 64-bit Linux  
>> machine:
>>
>> http://cbio.mskcc.org/~lianos/files/bioconductor/
>>
>> I've been manually attacking this data, probably doing more work  
>> than necessary since I can't figure out how to leverage most of  
>> bioconductor tools just yet.
>>
>> Ideally I'd like to take a similar approach to what was done here:
>> http://genomebiology.com/2008/9/7/R112
>>
>> The methods appeared earlier in the proceedings of this year's PSB  
>> conference. They've also provided MATLAB source code implementing  
>> their analysis techniques:
>> http://www.fml.tuebingen.mpg.de/raetsch/projects/PSBTiling
>>
>> In reality, as a first approach, I'm probably just going to  
>> implement (in R) their slightly modified version of M. Gerstein's  
>> Sequence Quantile Normalization:
>> http://bioinformatics.oxfordjournals.org/cgi/content/abstract/ 
>> 23/8/988
>>
>> and explore the signals I'm getting off different loci of the  
>> genome. I'm hoping I can pull this off within the time constraints  
>> of my current rotation project. If others find it useful, and I'm  
>> successful, I'd be happy to share it back w/ the R/BioC community.
>>
>> In my adventures, I've created some SQLite database that you might  
>> find useful:
>>
>> (i) A DB with all of the annotation data for the Dmel (v5) genome  
>> (protein coding genes, RNA's, exonic positions, etc) from the GTF  
>> file taken from ensembl:
>> ftp://ftp.ensembl.org/pub/current_gtf/
>>
>> (ii) A DB with meta data about the tiling chips, this includes the  
>> following tables:
>>  (a) probe information as extracted from the bpmap file (sequence,  
>> x/y chip coordinates, pm/mm, etc)
>>  (b) alignment information for every probe to the latest release  
>> of the genome. For this I used MUMmer and had it return all  
>> alignments >= 23 NTs in length (taking Wolfgang Huber's choice  
>> from his tilingArray package)
>>  (c) a table used for quick lookup of how many times each probe  
>> aligns perfectly, and "almost perfectly" to the reference genome.
>>
>> I'm now building RData probe_annotation objects for each  
>> chromosome that tie this information together. This data.frame  
>> essentially has:
>>
>> (i) probe information (sequence, x/y, pm/mm)
>> (ii) number of perfect/close hits for the probe
>> (iii) where in the genome the probe (perfectly) lands: gene name,  
>> type of gene (protein coding, miRNA, snoRNA, rRNA, etc) and  
>> whether it's exonic/intronic.
>>
>> If you think any of these would be helpful to you, let me know and  
>> I'll try to put them up when I think they're reasonably correct.
>>
>> Like I said before, I'm probably not leveraging the BioC tools  
>> that are available as I've gotten lost in the myriad of options  
>> (which are good to have) that are there.
>>
>> -steve
>>
>> --
>> Steve Lianoglou
>> Graduate Student: Physiology, Biophysics and Systems Biology
>> Weill Cornell Medical College
>>
>> http://cbio.mskcc.org/~lianos
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/ 
>> gmane.science.biology.informatics.conductor
>
> ======================================================================
> Dr. Tobias Straub         Adolf-Butenandt-Institute, Molecular Biology
> tel: +49-89-2180 75 439         Schillerstr. 44, 80336 Munich, Germany
> ======================================================================
>
>