[BioC] Mapping microarray probes to the genome using findOverlaps
Hervé Pagès
hpages at fhcrc.org
Mon Mar 21 18:47:16 CET 2011
Sorry Ravi, I've no idea how to do this. H.
On 03/13/2011 07:03 PM, Ravi Karra wrote:
> Thanks for the suppprt Herve!
> Do you or Benilton (or anyone else!) have any ideas on how to redesign the .ndf file to allow PdInfoBuilder to use the updated probeset information from these alignments?
> Thanks again for all of the help,
> Ravi
>
> On Mar 8, 2011, at 8:29 PM, Hervé Pagès wrote:
>
>> Hi Ravi,
>>
>> FWIW, I just made a BSgenome package for Zv9. It will become available
>> to BioC devel users in a couple of hours (source package only for now).
>>
>> Cheers,
>> H.
>>
>>
>> On 03/07/2011 07:11 PM, Ravi Karra wrote:
>>> Hi Sean and Herve,
>>>
>>> Thanks a ton for the suggestions! I modified Herve's approach to use
>>> biomaRt. Zv9 is available via Biomart/Ensembl and has incorporated many
>>> of the VEGA transcripts, many of which are supposed to be on this array.
>>> I ended up mapping to the cdna's from zv9 and like Herve only match
>>> about 60% of probes to the ensembl transcripts. I too found many probes
>>> that map to multiple transcripts. Like the brainarray group, I chose to
>>> use the probes as a part of transcript-specific probesets... i.e. a
>>> given probe can contribute to the expression of multiple transcripts.
>>> Before trying to extend mapping to other databases as suggested by Sean,
>>> I wanted to try and create a custom ndf file and annotation package to
>>> give the analysis a go. However, I keep getting the error below using
>>> pdinfrobuilder and my custom ndf file. Did I construct my new ndf file
>>> incorrectly? I included my code and session info below. Sorry for the
>>> inefficient code... I am learning on the fly!
>>>
>>> Thanks everyone! This list has been immensely helpful!
>>>
>>> Ravi
>>>
>>> =======================================================================
>>> Building annotation package for Nimblegen Expression Array
>>> NDF: newndf.ndf
>>> XYS: 428234_Slot26_Cycle9_Duke_Fang_2010-08-06_532.xys
>>> =======================================================================
>>> Parsing file: newndf.ndf... OK
>>> Parsing file: 428234_Slot26_Cycle9_Duke_Fang_2010-08-06_532.xys... OK
>>> Merging NDF and XYS files... OK
>>> Preparing contents for featureSet table... OK
>>> Preparing contents for bgfeature table... OK
>>> Preparing contents for pmfeature table... OK
>>> Creating package in ./pd.newndf
>>> Inserting 27649 rows into table featureSet... OK
>>> Inserting 200368 rows into table pmfeature... Error in
>>> sqliteExecStatement(con, statement, bind.data) :
>>> RS-DBI driver: (RS_SQLite_exec: could not execute: constraint failed)
>>>
>>> My code:
>>>
>>> #1. Load the probes from the ndf file:
>>> library(Biostrings)
>>> array_file= "/Users/ravikarra/Documents/Research Projects/Regeneration
>>> Gene Expression/Bioconductor Code/Array Mapping/090818_Zv7_EXPR.ndf"
>>> probe_info= read.table(array_file, header= TRUE, sep= '\t')
>>> probes= DNAStringSet(as.character(probe_info$PROBE_SEQUENCE))
>>> probeid= probe_info$PROBE_DESIGN_ID
>>> probeid= as.character(probeid)
>>> names(probes) = probeid
>>> real_probes= which(width(probes) == 60) #don't align the control, empty,
>>> or the crosshyb probes
>>> probes= probes[real_probes]
>>>
>>> #2. Compute the transcriptome (51569 transcripts with sequences as of
>>> March 7, .2011):
>>> library(biomaRt)
>>> ensembl= useMart("ensembl")
>>> ensemblzv9= useDataset("drerio_gene_ensembl",mart=ensembl)
>>> txIDs= getBM(attributes= c("ensembl_transcript_id"), mart= ensemblzv9)[,1]
>>> txSeqs= getSequence(id= txIDs, type= "ensembl_transcript_id", seqType=
>>> "cdna", mart= ensemblzv9)
>>> #make the DNAStringSet object
>>> zv9txDict= DNAStringSet(txSeqs$cdna)
>>> names(zv9txDict) = txSeqs$ensembl_transcript_id
>>>
>>> #3 Map 'probes' to 'txseqs' (allowing 2 mismatches):
>>> pdict2<- PDict(probes, max.mismatch=2)
>>> m2= vwhichPDict(pdict2, zv9txDict, max.mismatch=2)
>>> makeMap<- function(m, probeids, txnames) {
>>> probeid<- probeids[unlist(m)]
>>> txname<- rep.int(txnames, elementLengths(m))
>>> ans<- unique(data.frame(probeid, txname))
>>> rownames(ans)<- NULL
>>> ans
>>> }
>>> map= makeMap(m2, names(probes), names(zv9txDict))
>>>
>>> #4. Remove transcripts with less than 3 probes
>>> txcounts= data.frame(table(map$txname))
>>> dim(txcounts) #start with 32961 different transcripts
>>> usefultxs= txcounts[txcounts$Freq> 2,][,1] #end with 27647 transcripts
>>> map= map[map$txname%in% usefultxs, ]
>>>
>>> #5: Make a new ndf file with the updated transcript information
>>> #probe_info is the ndf file I originally loaded
>>> rownames(probe_info) = probe_info[,1]
>>> ndf= probe_info[as.character(map$probeid),]
>>> ndf$SEQ_ID= map$txname
>>>
>>> #append the information for the control probes to the new ndf
>>> control= probe_info[probe_info$PROBE_CLASS!= "experimental",]
>>> ndf= rbind(ndf, control)
>>>
>>> # change all the probes with unmapped transcripts to random but keep
>>> controls intact
>>> noalign= !(as.character(probe_info$PROBE_DESIGN_ID) %in%
>>> as.character(ndf$PROBE_DESIGN_ID))
>>> noalignprobes= probe_info[noalign,]
>>> noalignprobes$SEQ_ID= "noalign"
>>> ndf= rbind(ndf, noalignprobes)
>>> write.table (ndf, "newndf.ndf", sep = '\t', row.names = FALSE)
>>>
>>> #6. Make the new annotation package
>>> #make custom chip descriptor file for Nimblegen
>>> library(pdInfoBuilder)
>>> filename_ndf<- list.files('.',pattern='.ndf',full.names=T)
>>> filename_ndf= new("ScalarCharacter", filename_ndf)
>>> filename_xys<- list.files('.', pattern=".xys",full.names= T) [1]
>>> filename_xys= new("ScalarCharacter", filename_xys)
>>> seed<- new("NgsExpressionPDInfoPkgSeed", ndfFile=filename_ndf,
>>> xysFile=filename_xys, author="Ravi Karra",email="ravi.karra at duke.edu
>>> <mailto:ravi.karra at duke.edu>", biocViews="AnnotationData", chipName=
>>> "Danio rerio 12x135K Array", genomebuild="Ensembl and UCSC Zv7",
>>> manufacturer= "Nimblegen", url=
>>> "http://www.nimblegen.com/products/exp/eukaryotic.html", organism=
>>> "Danio rerio", version= "1.1")
>>> makePdInfoPackage(seed, destDir= ".")
>>>
>>>> sessionInfo()
>>> R version 2.12.2 (2011-02-25)
>>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>>
>>> locale:
>>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods
>>> [7] base
>>>
>>> other attached packages:
>>> [1] Biostrings_2.18.4 IRanges_1.8.9 biomaRt_2.6.0
>>> [4] pdInfoBuilder_1.14.1 oligo_1.14.0 oligoClasses_1.12.2
>>> [7] affxparser_1.22.1 RSQLite_0.9-4 DBI_0.2-5
>>> [10] Biobase_2.10.0
>>>
>>> loaded via a namespace (and not attached):
>>> [1] affyio_1.18.0 preprocessCore_1.12.0 RCurl_1.4-3
>>> [4] splines_2.12.2 XML_3.2-0
>>>
>>>
>>> On Mar 1, 2011, at 7:08 AM, Sean Davis wrote:
>>>
>>>>
>>>>
>>>> On Tue, Mar 1, 2011 at 4:20 AM, Pages, Herve<hpages at fhcrc.org
>>>> <mailto:hpages at fhcrc.org>> wrote:
>>>>
>>>> Hi Ravi,
>>>>
>>>> Sean's suggestion to map the probes directly to the transcriptome can
>>>> easily be achieved with the Biostrings/BSgenome/GenomicFeatures
>>>> infrastructure.
>>>>
>>>> The code below uses vwhichPDict (a variant of matchPDict) and the
>>>> danRer6
>>>> genome from UCSC (release name: Sanger Institute Zv8). Note that
>>>> UCSC just
>>>> released danRer7 (Sanger Institute Zv9) but we don't have a
>>>> BSgenome package
>>>> for it yet.
>>>>
>>>> Also I don't think we support the Nimblegen Zebrafish 12 x135K
>>>> Expression
>>>> chip so I'm using the zebrafish chip from Affymetrix instead.
>>>>
>>>> Depending on your machine and the quality of your internet connection,
>>>> this code should run in 8-15 minutes. Less than 800 Mb of RAM is used.
>>>>
>>>> ## Load the probes:
>>>> library(zebrafishprobe)
>>>> library(Biostrings)
>>>> probes<- DNAStringSet(zebrafishprobe)
>>>>
>>>> ## Compute the transcriptome (32992 transcripts):
>>>> library(GenomicFeatures)
>>>> txdb<- makeTranscriptDbFromUCSC(genome="danRer6",
>>>> tablename="ensGene")
>>>> library(BSgenome.Drerio.UCSC.danRer6)
>>>> txseqs<- extractTranscriptsFromGenome(Drerio, txdb)
>>>> ## Note that saving 'txseqs' with write.XStringSet() generates
>>>> ## a FASTA file that is 51M.
>>>>
>>>> ## Map 'probes' to 'txseqs' (allowing 2 mismatches):
>>>> pdict2<- PDict(probes, max.mismatch=2)
>>>> m2<- vwhichPDict(pdict2, txseqs, max.mismatch=2)
>>>> makeMap<- function(m, probeids, txnames)
>>>> {
>>>> probeid<- probeids[unlist(m)]
>>>> txname<- rep.int<http://rep.int/>(txnames, elementLengths(m))
>>>> ans<- unique(data.frame(probeid, txname))
>>>> rownames(ans)<- NULL
>>>> ans
>>>> }
>>>> map<- makeMap(m2, zebrafishprobe$Probe.Set.Name
>>>> <http://Probe.Set.Name/>, names(txseqs))
>>>>
>>>> Note that the longest step is not the call to vwhichPDict() but the
>>>> extraction of the transcriptome.
>>>>
>>>> However, the final result doesn't look that good. A quick look at the
>>>> mapping shows that only 62.4% of the probe ids are mapped and that
>>>> some
>>>> probe ids are mapped to more than 1 transcript:
>>>>
>>>>
>>>> Nice example, Herve!
>>>>
>>>> Mapping to multiple transcripts is to be expected, so one needs to
>>>> come up with ways of dealing with the situation. Missing probes is
>>>> also typical. These can be dealt with by mapping to the unmatched
>>>> probes to other transcript databases. For example, one could map to
>>>> Ensembl first, RefSeq second, all zebrafish mRNAs next, all zebrafish
>>>> ESTs next.
>>>>
>>>> > head(map)
>>>> probeid txname
>>>> 1 Dr.19494.1.S1_at ENSDART00000048610
>>>> 2 Dr.19494.1.S1_at ENSDART00000103479
>>>> 3 Dr.19494.1.S1_at ENSDART00000103478
>>>> 4 Dr.10343.1.S1_at ENSDART00000006449
>>>> 5 Dr.12057.1.A1_at ENSDART00000054814
>>>> 6 Dr.22271.1.A1_at ENSDART00000054814
>>>> > length(unique(zebrafishprobe$Probe.Set.Name
>>>> <http://Probe.Set.Name/>))
>>>> [1] 15617
>>>> > length(unique(map$probeid))
>>>> [1] 9746
>>>> > any(duplicated(map$probeid))
>>>> [1] TRUE
>>>>
>>>> But maybe you will have more luck with the Nimblegen Zebrafish 12
>>>> x135K
>>>> Expression chip.
>>>>
>>>>
>>>> Keep in mind that with 60mer probes, there is potentially a need to
>>>> allow gaps in the alignment and that consecutive matches of even 40 or
>>>> 50 base pairs are likely to generate a signal on a microarray.
>>>>
>>>> Sean
>>>>
>>>> One more thing. While testing the above code, I discovered 2 problems
>>>> that were preventing it to work: one with the
>>>> extractTranscriptsFromGenome()
>>>> function and one with the vwhichPDict() function. Those problems
>>>> are now
>>>> fixed in the release and devel versions of GenomicFeatures (v
>>>> 1.2.4& 1.3.13)
>>>> and Biostrings (v 2.18.3& 2.19.12). These new versions will
>>>> become available
>>>> via biocLite() in the next 24-36 hours.
>>>>
>>>> Hope this helps.
>>>> H.
>>>>
>>>>
>>>> ----- Original Message -----
>>>> From: "Ravi Karra"<ravi.karra at gmail.com
>>>> <mailto:ravi.karra at gmail.com>>
>>>> To: "Sean Davis"<sdavis2 at mail.nih.gov<mailto:sdavis2 at mail.nih.gov>>
>>>> Cc: bioconductor at r-project.org<mailto:bioconductor at r-project.org>
>>>> Sent: Saturday, February 26, 2011 2:56:02 PM
>>>> Subject: Re: [BioC] Mapping microarray probes to the genome using
>>>> findOverlaps
>>>>
>>>> Hi Sean,
>>>> I could do that, but am not sure how to. The annotated zebrafish
>>>> genome is the Tubingen strain and the some of the probes on the
>>>> array are from the EK and AB strains. This means that I need to
>>>> allow for SNP's in the alignment. I originally tried to align the
>>>> probes to the Ensembl Transcripts using matchPDict but when I
>>>> allowed for 2 mismatches (max.mismatch = 2) across the probe
>>>> sequences, my computer never stopped running the program! I found
>>>> TopHat to be much faster (8 min) and TopHat allows for a few
>>>> nucleotide wobble by default!.
>>>> Do you have a suggestion(s) on another way to align the array to
>>>> the Ensembl Transcripts?
>>>> Thanks,
>>>> Ravi
>>>>
>>>>
>>>>
>>>>
>>>> On Feb 26, 2011, at 5:45 PM, Sean Davis wrote:
>>>>
>>>> >
>>>> >
>>>> > On Sat, Feb 26, 2011 at 5:29 PM, Ravi Karra
>>>> <ravi.karra at gmail.com<mailto:ravi.karra at gmail.com>> wrote:
>>>> > Hello,
>>>> >
>>>> > I am still trying to map probes on the Nimblegen Zebrafish 12
>>>> x135K Expression to the Zv9 version of the zebrafish genome
>>>> available from Ensembl! I am very reluctantly pursuing an
>>>> alignment approach to annotation as the original annotation
>>>> provided with the array is quite outdated.
>>>> >
>>>> > I performed a gapped alignment using the individual probe
>>>> sequences (60-mers) from the array using TopHat and loaded the
>>>> results into Bioconductor as a GappedAlignments object. I have
>>>> made a TranscriptDb object using the Zv9 genome from Ensembl.
>>>> Next, I plan to use findOverlaps for the annotation. What is the
>>>> best way to get the overlaps (by exon, cds, or by transcript)? I
>>>> am a little concerned that using transcriptsByOverlaps might be a
>>>> bit too broad and result in mapping reads to multiple genes (for
>>>> example transcripts in the genome that have overlapping genomic
>>>> coordinates). By contrast, mapping with exonsByOverlaps and
>>>> cdsByOverlaps might be too restrictive and miss information in the
>>>> UTR's. My gut feeling is that annotating by cds is the
>>>> "in-between" approach. What is the recommended approach for
>>>> RNA-seq? As you can tell, I am quite new to this!
>>>> >
>>>> >
>>>> > Hi, Ravi. Sorry to answer a question with more questions, but
>>>> why not just map the probes against Ensembl Transcripts or refseq?
>>>> What is the advantage of mapping to the genome and then going back
>>>> to the transcripts?
>>>> >
>>>> > Sean
>>>>
>>>>
>>>> [[alternative HTML version deleted]]
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org<mailto:Bioconductor at r-project.org>
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org<mailto:Bioconductor at r-project.org>
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>>
>>>
>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M2-B876
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone: (206) 667-5791
>> Fax: (206) 667-1319
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list