[BioC] multiple locations for probeset in hgu133plus2CHRLOC vs. UCSC PSL data

Keith Satterley keith at wehi.EDU.AU
Tue Nov 18 06:53:26 CET 2008


Hi Peter,

I can add some extra information that may explain some of the data. If I may 
take the liberty, I would point you to a tool I have written and made available 
on my Institute's web site: the GABOS/GAFEP tool at 
http://bioinf.wehi.edu.au/gabos/. It will allow you to conveniently retrieve 
data from various gene/probe definition files.

On GABOS, select the hg18 genome, check the affyU133Plus2 annotation file and 
uncheck the other annotation files. Click the "Zero GAFEP Params" button, then 
enter 201268_at in the "List of Gene Names" box. Click the "Retrieve Sequence" 
button.

You will get the five blocks of sequence, with their co-ordinates relative to 
chr17 as shown below:
=======================
 > hg18 chr17 + affyU133Plus2 201268_at Exon '1/5 [ 1 47 ] 46598811 46598857 [ -0 47 +0 ] 46598811 46598857
TCTGCTCTCCCAGCGCAGCGCCGCCGCCCGGCCCCTCCAGCTTCCCG
 > hg18 chr17 + affyU133Plus2 201268_at Exon '2/5 [ 1 130 ] 46599187 46599316 [ -0 130 +0 ] 46599187 46599316
GACCATGGCCAACCTGGAGCGCACCTTCATCGCCATCAAGCCGGACGGCGTGCAGCGCGGCCTGGTGGGCGAGATCATCAAGCGCTTCGAGCAGAAGGGA
TTCCGCCTCGTGGCCATGAAGTTCCTCCGG
 > hg18 chr17 + affyU133Plus2 201268_at Exon '3/5 [ 1 102 ] 46600602 46600703 [ -0 102 +0 ] 46600602 46600703
GCCTCTGAAGAACACCTGAAGCAGCACTACATTGACCTGAAAGACCGACCATTCTTCCCTGGGCTGGTGAAGTACATGAACTCAGGGCCGGTTGTGGCCA
TG
 > hg18 chr17 + affyU133Plus2 201268_at Exon '4/5 [ 1 113 ] 46602297 46602409 [ -0 113 +0 ] 46602297 46602409
GTCTGGGAGGGGCTGAACGTGGTGAAGACAGGCCGAGTGATGCTTGGGGAGACCAATCCAGCAGATTCAAAGCCAGGCACCATTCGTGGGGACTTCTGCA
TTCAGGTTGGCAG
 > hg18 chr17 + affyU133Plus2 201268_at Exon '5/5 [ 1 257 ] 46603847 46604103 [ -0 257 +0 ] 46603847 46604103
GAACATCATTCATGGCAGTGATTCAGTAAAAAGTGCTGAAAAAGAAATCAGCCTATGGTTTAAGCCTGAAGAACTGGTTGACTACAAGTCTTGTGCTCAT
GACTGGGTCTATGAATAAGAGGTGGACACAACAGCAGTCTCCTTCAGCACGGCGTGGTGTGTCCCTGGACACAGCTCTTCATTCCATTGACTTAGAGGCA
ACAGGATTGATCATTCTTTTATAGAGCATATTTGCCAATAAAGCTTTTGGAAGCCGG
=======================
My understanding is that the affy files on the UCSC site define the gene that 
the affy probes were designed around.
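
The block co-ordinates listed above can be reproduced from the PSL fields in the 
original post (quoted at the end of this message), assuming that PSL target 
starts are zero-based and that GABOS reports one-based, inclusive positions. The 
following is only a minimal Python sketch of that arithmetic; the function name 
is made up for illustration and is not part of GABOS or any package:

```python
# PSL fields copied from the alignment quoted in the original post.
block_sizes = [47, 130, 102, 113, 257]
t_starts = [46598810, 46599186, 46600601, 46602296, 46603846]


def psl_blocks_to_one_based(starts, sizes):
    """Convert PSL blocks (zero-based start, length) to one-based
    inclusive (start, end) pairs.

    Assumption: the listing above uses one-based inclusive positions.
    """
    return [(start + 1, start + size) for start, size in zip(starts, sizes)]


blocks = psl_blocks_to_one_based(t_starts, block_sizes)
for number, (start, end) in enumerate(blocks, 1):
    print(f"Exon {number}/{len(blocks)}: {start} {end}")
```

Under those assumptions the first block comes out as 46598811 46598857 and the 
last as 46603847 46604103, matching the five headers listed above.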

You can also use the GABOS tool to retrieve genes defined around your area of 
interest. To do this, select hg18 and chr17, check refFlat (which is the set of 
RefSeq genes with their browser gene name included) or any of the other gene 
definition files, and click the "Zero GAFEP Params" button. Then enter a Sequence 
Range under the chromosome selection; in your situation, 46.5m-46.7m should 
cover your area of interest. I would also suggest you check the box "Do NOT 
display Sequence Data" before clicking the "Retrieve Sequence" button.

About 60 lines (each corresponding to an exon) are listed. You can see that the 
NM_001018137-NME2 gene corresponds to your affy probe. (Note that the GABOS start 
co-ordinates are one greater than your affy co-ordinates, presumably because PSL 
start positions are zero-based while GABOS reports one-based positions.) Below 
is the NM_001018137-NME2 gene data retrieved by GABOS.
=======================
 > hg18 chr17 + refFlat NM_001018137-NME2 Exon '1/5 [ 1 152 ] 46597890 46598041 [ -0 152 +0 ] 46597890 46598041
 > hg18 chr17 + refFlat NM_001018137-NME2 Exon '2/5 [ 1 130 ] 46599187 46599316 [ -0 130 +0 ] 46599187 46599316
 > hg18 chr17 + refFlat NM_001018137-NME2 Exon '3/5 [ 1 102 ] 46600602 46600703 [ -0 102 +0 ] 46600602 46600703
 > hg18 chr17 + refFlat NM_001018137-NME2 Exon '4/5 [ 1 113 ] 46602297 46602409 [ -0 113 +0 ] 46602297 46602409
 > hg18 chr17 + refFlat NM_001018137-NME2 Exon '5/5 [ 1 258 ] 46603847 46604104 [ -0 258 +0 ] 46603847 46604104
========================
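
To make the differences between the two GABOS listings explicit, here is a rough 
Python sketch that compares the refFlat exons with the affyU133Plus2 blocks, exon 
by exon. It also checks whether any of the CHRLOC values from the original post 
equals the refFlat exon 1 start minus one. Reading such a match as a zero-based 
transcript start is an assumption on my part, not something confirmed by the 
hgu133plus2.db documentation, and the function name is made up for illustration:

```python
# One-based inclusive co-ordinates copied from the two GABOS listings.
affy_blocks = [
    (46598811, 46598857),
    (46599187, 46599316),
    (46600602, 46600703),
    (46602297, 46602409),
    (46603847, 46604103),
]
refflat_exons = [
    (46597890, 46598041),
    (46599187, 46599316),
    (46600602, 46600703),
    (46602297, 46602409),
    (46603847, 46604104),
]


def compare_exons(first, second):
    """Return (exon number, start offset, end offset) for each pair of
    exons whose co-ordinates differ; offsets are second minus first."""
    differences = []
    for number, (a, b) in enumerate(zip(first, second), 1):
        if a != b:
            differences.append((number, b[0] - a[0], b[1] - a[1]))
    return differences


print(compare_exons(affy_blocks, refflat_exons))
# prints [(1, -921, -816), (5, 0, 1)]

# CHRLOC values reported in the original post for 201268_at.
chrloc = [46598879, 46597889, 46598637, 46599081]

# Assumption: a CHRLOC value may be a zero-based transcript start.
print(refflat_exons[0][0] - 1 in chrloc)
# prints True
```

Exons 2 to 4 are identical in the two listings; exon 1 differs in both start and 
end, and exon 5 differs by a single base at its end. The CHRLOC value 46597889 
is one less than the refFlat exon 1 start.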

Hope this helps explain the data a little. I'll leave it to others to explain 
how the hgu133plus2.db package works.

Keith

========================
Keith Satterley
Bioinformatics Division
The Walter and Eliza Hall Institute of Medical Research
Parkville, Melbourne,
Victoria, Australia
=======================

Sean Davis wrote:
> On Mon, Nov 17, 2008 at 8:28 PM, Bazeley, Peter
> <Peter.Bazeley at utoledo.edu> wrote:
> 
>> Hello,
>>
>> R version: 2.8.0
>>
>> I just installed the hgu133plus2.db package, and am looking at the
>> hgu133plus2CHRLOC environment. I've noticed that some of the probeset
>> entries (e.g. "201268_at") have multiple locations compared to Affy's
>> annotation file. I'd like to figure out if these multiple locations are
>> current, in which case it is some sort of overlapping/repeating duplication.
>> For example:
>>
>>> as.list(hgu133plus2CHRLOC)$'201268_at'
>>      17       17       17       17
>> 46598879 46597889 46598637 46599081
>>
>> indicates that the probeset maps to 4 locations. Compare this to the
>> alignment info in Affy's annotation file (from 7/8/08,
>> http://www.affymetrix.com/Auth/analysis/downloads/na26/ivt/HG-U133_Plus_2.na26.annot.csv.zip
>> ):
>>
>> chr12:119204403-119205041 (+) // 91.49 // q24.31 ///
>> chr17:46598810-46604103 (+) // 96.87 // q21.33
>>
>> which suggests one location on chromosome 17 (I'm ignoring chromosome 12
>> for now). This is a "_at" probeset, so it should map uniquely to a sequence,
>> according to Affy's "Data Analysis Fundamentals" document (and speaking to a
>> rep).
>>
>> From the information provided by "?hgu133plus2CHRLOC", I downloaded
>>
>> ftp://hgdownload.cse.ucsc.edu/goldenPath/currentGenomes/Homo_sapiens/database/affyU133Plus2.txt.gz
>> from UCSC to see how this occurred, but it is not clear. Actually, the file:
>>
>> http://www.affymetrix.com/Auth/analysis/downloads/psl/HG-U133_Plus_2.link.psl.zip
>> from Affy's support page has the same alignment info. Here's the relevant
>> PSL info:
>>
>> Target sequence name: chr17
>> Alignment start position in target: 46598810
>> Alignment end position in target: 46604103
>> Number of blocks in the alignment (a block contains no gaps): 5
>> Comma-separated list of sizes of each block: 47,130,102,113,257,
>> Comma-separated list of starting positions of each block in target:
>> 46598810,46599186,46600601,46602296,46603846,
>>
>>
>> The second location provided by CHRLOC (46597889) occurs before the start
>> of the alignment in the PSL info, so perhaps this one CHRLOC location
>> corresponds to the PSL alignment? The mappings were obtained from UCSC on
>> 2006-Apr14, so perhaps additional alignments existed at the time, which have
>> since been removed.
>>
>>
>> Thank you for any help. Hopefully I'm just missing something obvious (well,
>> non-obvious for me).
>>
> 
> Marc can answer with more authority, but I think that the confusion has to
> do with the fact that everything is mapped through Entrez Gene ID and NOT
> the transcript.  If you look in the UCSC genome browser from which the
> alignments are created, you will see that Entrez ID 4831 has four RefSeqs
> associated with it.  Hence, there are four alignments.  With the actual
> probe sequences, one could potentially make an argument for one transcript
> over another, but relying on affy's call of which transcript is the
> "representative" one is probably not a reliable way to choose one transcript
> over another.
> 
> Hope that helps,
> Sean
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
