On Thu, Nov 20, 2008 at 3:03 PM, Hervé Pagès <hpages@fhcrc.org> wrote:

> Hi Peter,
>
> You can find the genome coordinates of a set of probes by using the
> matchPDict()/countPDict() tool from Biostrings + the genome for hg18:
>
>  library(hgu133plus2probe)
>  library(BSgenome.Hsapiens.UCSC.hg18)
>  probes <- DNAStringSet(
>             hgu133plus2probe$sequence[hgu133plus2probe$Probe.Set.Name ==
> "201268_at"]
>           )
>
>  pdict1 <- PDict(probes)
>  nhits1 <- t(sapply(seqnames(Hsapiens),
>                    function(name) countPDict(pdict1, Hsapiens[[name]])))
>
> The 'nhits1' matrix summarizes the number of hits per chromosome (plus
> strand only). See which chromosomes have hits with:
>
>  > which(rowSums(nhits1) != 0)
>  chr12 chr17
>    12    17
>
> For completeness, we should also search the minus strands of all
> chromosomes.
> This can easily be done by taking the reverse complement sequences of the
> probes with pdict2 <- PDict(reverseComplement(probes)) and then using the
> same code as above.
>
> Then use matchPDict() to get the coordinates of the hits in chr17:
>
>  chr17_hits <- unlist(matchPDict(pdict1, Hsapiens$chr17))
>
> If you want to know the genes where those hits occur, you can use
> biomaRt as explained by Steffen on this list a few days ago:
>
>  https://stat.ethz.ch/pipermail/bioconductor/2008-November/025208.html
>
> Once you've retrieved all the genes coordinates, you can do the
> following:
>
>  # Extract the genes that belong to chr17 (plus strand only):
>  chr17_genes <- geneLocs[geneLocs$chromosome_name == "17" & geneLocs$strand
> == 1, ]
>
>  # Use overlap() from the IRanges package to find the genes
>  # where the hits occur:
>  > tree <- IntervalTree(IRanges(start=chr17_genes$start_position,
> end=chr17_genes$end_position))
>  > overlap(tree, chr17_hits, multiple = FALSE)
>  [1] 33 33 33 33 33 33 33 33 33 33
>
> So the gene hit by probe 201268_at is:
>
>  > chr17_genes[33, ]
>     ensembl_gene_id strand chromosome_name start_position end_position
>  954 ENSG00000011052      1              17       46585919     46604104
>
> which is what is reported by the hgu133plus2ENSEMBL map:
>
>  > hgu133plus2ENSEMBL[['201268_at']]
>  [1] "ENSG00000011052"
>
> Note that the same approach shows that the hits in chr12 are also within
> a gene boundaries (gene ENSG00000123009) but I'll let more qualified people
> comment about this.
>

I'll just comment here that when mapping expression probes, it is probably
more meaningful to map the probes to a transcript database rather than the
genome.  Transcripts are discontinuous in the genome and the genome contains
repeats that are not likely present in the transcript space.  That said,
transcript databases are limited, so there may be tradeoffs.

Sean


>
> Bazeley, Peter wrote:
>
>> Hello,
>>
>> R version: 2.8.0
>>
>> I just installed the hgu133plus2.db package, and am looking at the
>> hgu133plus2CHRLOC environment. I've noticed that some of the probeset
>> entries (e.g. "201268_at") have multiple locations compared to Affy's
>> annotation file. I'd like to figure out if these multiple locations are
>> current, in which case it is some sort of overlapping/repeating duplication.
>> For example:
>>
>>  as.list(hgu133plus2CHRLOC)$'201268_at'
>>>
>>      17       17       17       17 46598879 46597889 46598637 46599081
>> indicates that the probeset maps to 4 locations. Compare this to the
>> alignments info in the Affy's annotation file (from 7/8/08,
>> http://www.affymetrix.com/Auth/analysis/downloads/na26/ivt/HG-U133_Plus_2.na26.annot.csv.zip):
>>
>> chr12:119204403-119205041 (+) // 91.49 // q24.31 ///
>> chr17:46598810-46604103 (+) // 96.87 // q21.33
>>
>> which suggests one location on chromosome 17 (I'm ignoring chromosome 12
>> for now). This is a "_at" probeset, so it should map uniquely to a sequence,
>> according to Affy's "Data Analysis Fundamentals" document (and speaking to a
>> rep).
>>
>> >From the information provided by "?hgu133plus2CHRLOC", I downloaded
>> ftp://hgdownload.cse.ucsc.edu/goldenPath/currentGenomes/Homo_sapiens/database/affyU133Plus2.txt.gzfrom UCSC to see how this occured, but it is not clear. Actually, the file:
>>
>> http://www.affymetrix.com/Auth/analysis/downloads/psl/HG-U133_Plus_2.link.psl.zip
>> from Affy's support page has the same alignment info. Here's the relevant
>> PSL info:
>>
>> Target sequence name: chr17
>> Alignment start position in target: 46598810
>> Alignment end position in target: 46604103
>> Number of blocks in the alignment (a block contains no gaps): 5
>> Comma-separated list of sizes of each block: 47,130,102,113,257,
>> Comma-separated list of starting positions of each block in target:
>> 46598810,46599186,46600601,46602296,46603846,
>>
>>
>> The second location provided by CHRLOC (46597889) occurs before the start
>> of the alignment in the PSL info, so perhaps this one CHRLOC location
>> corresponds to the PSL alignment? The mappings were obtained from UCSC on
>> 2006-Apr14, so perhaps additional alignments existed at the time, which have
>> since been removed.
>>
>>
>> Thank you for any help. Hopefully I'm just missing something obvious
>> (well, non-obvious for me).
>>
>> Peter Bazeley
>>
>>        [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor@stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M2-B876
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages@fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor@stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>

	[[alternative HTML version deleted]]