[Bioc-devel] mapping probes/probesets between platforms

Thu Jul 22 20:41:43 CEST 2010

On 07/22/2010 10:39 AM, Hervé Pagès wrote:
> Hi Peter,
>
> On 07/21/2010 03:24 PM, Sean Davis wrote:
>> There are many alternatives for alignment including blast, blat, gmap,
>> ssaha, etc. This could also be done with biostrings.
>>
>> Sean
>>
>> On Jul 21, 2010 2:43 PM, "Bazeley,
>> Peter"<Peter.Bazeley at rockets.utoledo.edu>
>> wrote:
>>
>> Hi Sean,
>>
>> For the second option, is there a specific package that you had in mind?
>> Perhaps the GenomicFeatures package, which can download UCSC tables.
>
> Depends which "second option" we are talking about.
>
> In Sean's original suggestion:
>
> the second is to map between common gene ids (unigene, ensg, entrez
> gene id, hgnc gene symbol, etc.)
>
> I think you could use our regular .db annotations which are
> gene-centric. They contain all kinds of mappings between probeset ids
> and other ids like Entrez Gene ids. If there is no .db package for
> your platform, look at the SQLforge vignette in the AnnotationDbi
> package for how to make your own.
>
> Here is an example of how you could map hgu95av2 probesets to hgu133b
> probesets based on their corresponding Entrez Gene ids:
>
> library(hgu95av2.db)
> library(hgu133b.db)
>
> ## See the mappings available for hgu95av2:
> ls('package:hgu95av2.db')
> ## and for hgu133b:
> ls('package:hgu133b.db')
>
> ## Make sure the ENTREZID mappings are what you are looking for:
> ?hgu95av2ENTREZID
> ?hgu133bENTREZID
>
> x <- toTable(hgu95av2ENTREZID)
> x[1:10, ]
> y <- toTable(hgu133bENTREZID)
> y[1:10, ]
>
> ## Combine 'x' and 'y' data frames i.e. "join" the 2 data frames based
> ## on the values in their gene_id columns:
> common_ids <- intersect(x$gene_id, y$gene_id)
> xy <- x[x$gene_id %in% common_ids, ]
> y_id2probe <- y$probe_id
> names(y_id2probe) <- y$gene_id
> y_id2probe <- y_id2probe[names(y_id2probe) %in% common_ids]
> tmp <- split(y_id2probe, names(y_id2probe))
> tmp <- tmp[xy$gene_id]
> xy <- xy[rep.int(seq_len(nrow(xy)), elementLengths(tmp)), ]
> row.names(xy) <- NULL
> xy$probe_id2 <- unname(unlist(tmp))
> xy[1:10, ] # the mapping is not one-to-one

One important correction to this. A feature that was added 2 or so 
releases ago in AnnotationDbi is to not include probes that hit
multiple genes into the maps. To add those "multiple probes" now
you need to do:

   all_hgu95av2ENTREZID <- toggleProbes(hgu95av2ENTREZID, "all")
   all_hgu133bENTREZID <- toggleProbes(hgu133bENTREZID, "all")
   x <- toTable(all_hgu95av2ENTREZID)
   y <- toTable(all_hgu133bENTREZID)

Then proceed as before.

I also forgot to mention that elementLengths() is defined in IRanges
so you need to make sure this package is loaded.

Finally, as an alternative, this "join" operation can be done directly
at the SQL level:

   sql <- paste("ATTACH DATABASE '",
                hgu133b_dbfile(),
                "' AS hg133b", sep="")
   dbGetQuery(hgu95av2_dbconn(), sql)
   sql <- "SELECT hgu95av2probes.probe_id AS probe_id,
                  hgu95av2probes.gene_id AS gene_id,
                  hg133bprobes.probe_id AS probe_id2
           FROM probes AS hgu95av2probes
             INNER JOIN hg133b.probes AS hg133bprobes
               ON (hgu95av2probes.gene_id=hg133bprobes.gene_id)"
   xy <- dbGetQuery(hgu95av2_dbconn(), sql)

This is of course much faster but requires that you have some
notions of SQL and know the underlying db schemas. See the
AnnotationDbi vignette for more information.

Cheers,
H.

>
>
> But maybe by "second option" you meant Sean's entirely different
> proposed approach to realign the probe sequences to a transcript
> library. This is more complicated but still doable with our tools.
> The main steps could be:
>
> - use the GenomicFeatures package together with the BSgenome
> data package that contains the reference genome for your platforms
> to extract the transcriptome sequences;
>
> - use vwhichPDict() (from the Biostrings package) twice: first to
> map the probe sequences in hgu95av2probe to the transcriptome
> and a second time to map the probe sequences in hgu133bprobe to
> the transcriptome;
>
> - combine the 2 mappings in a fashion similar to how 'x' and 'y'
> above were combined.
>
> Note that this time the result will be a mapping between the probes
> (not the probesets) of the 2 platforms. Some extra work would then
> be needed to convert this into a mapping between probeset ids.
>
> Hope that helps.
>
> Cheers,
> H.
>
>>
>>
>> Thanks,
>> Pete
>> ________________________________
>> From: seandavi at gmail.com [seandavi at gmail.com] on behalf of Sean Davis [
>> sdavis2 at mail.nih.gov]
>> Sent: Wednesday, July 21, 2010 7:27 AM
>> To: Bazeley, Peter
>> Cc: bioc-devel at stat.math.ethz.ch
>> Subject: Re: [Bioc-devel] mapping probes/probesets between platforms
>>
>>
>>
>>
>> On Tue, Jul 20, 2010 at 10:50 PM, Bazeley, Peter<
>> Peter.Bazeley at rockets.utoledo.edu<mailto:Peter....
>> [[alternative HTML version deleted]]
>>
>>
>> _______________________________________________
>> Bioc-devel at stat.math.ethz.ch mailing list
>> https://s...
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioc-devel at stat.math.ethz.ch mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319