[BioC] unmapped keys in hugene10stprobeset.db

Wed Aug 18 01:35:14 CEST 2010

Thanks for the clarification, Marc.
Most of my 'improvements' are not Entrez Gene-centric. I glean far more annotation for each probeset by parsing the mrna_assignment field when the gene_assignment field is empty. This usually yields at least a GenBank ID and a description of the transcript and, as I pointed out earlier, picks up microRNAs and snoRNAs.
I'll investigate the potential relationships between the new annotations I'm uncovering and Entrez Gene.
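
A minimal sketch of this kind of fallback parsing, not the actual code used. It assumes that assignments within mrna_assignment are separated by " /// ", that sub-fields are separated by " // " with the accession first and the description third, and that "---" marks an empty field. The helper name parse_mrna_assignment and the example row are hypothetical.

```r
# Hypothetical helper: use mrna_assignment only when gene_assignment is empty.
# Assumed layout: assignments split by " /// ", sub-fields split by " // ",
# accession in position 1, source in 2, description in 3; "---" means empty.
parse_mrna_assignment <- function(gene_assignment, mrna_assignment) {
  empty <- function(x) is.na(x) || x == "" || x == "---"
  if (!empty(gene_assignment) || empty(mrna_assignment)) {
    return(NULL)  # gene-level annotation exists, or there is nothing to parse
  }
  # Split into individual assignments, then each assignment into sub-fields
  entries <- strsplit(mrna_assignment, " /// ", fixed = TRUE)[[1]]
  fields  <- strsplit(entries, " // ", fixed = TRUE)
  data.frame(
    accession   = vapply(fields, function(f) trimws(f[1]), character(1)),
    source      = vapply(fields, function(f) trimws(f[2]), character(1)),
    description = vapply(fields, function(f) trimws(f[3]), character(1)),
    stringsAsFactors = FALSE
  )
}

# Made-up row for illustration only (not taken from the Affymetrix file)
example <- paste(
  "NR_000001 // RefSeq // hypothetical small nucleolar RNA // chr1 // 100",
  "XX000001 // GenBank // hypothetical cDNA clone // chr1 // 100",
  sep = " /// "
)
res <- parse_mrna_assignment("---", example)
print(res)
```

The function returns NULL whenever gene_assignment is already populated, so existing gene-level annotation is never overridden by the fallback.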
cheers,
Mark

On 18/08/2010, at 3:12 AM, Marc Carlson wrote:

> Hi Mark,
> 
> You should talk to me about annotations.  I maintain the annotation
> repository here and make sure that all of the packages get re-made for
> each release etc.  This particular package was contributed and is
> maintained by Arthur Li.  So I will contact the two of you off list as
> needed, depending on what you find out in the "improvement" department.
> 
> Something that may help you to be aware of as you explore this is that
> the annotations and the SQLForge code that generates them are all Entrez
> Gene-centric.  So you need to be able to connect the probe to an Entrez
> Gene ID that was not mapped to before in order to "improve" them.  But,
> if you have new information about probes that map to things like
> microRNAs, then that really could help since there *are* Entrez Gene IDs
> for those things in NCBI (and in our supporting "org" packages).  This is
> true even though these things are not really genes in the strictest
> sense of the word. 
> 
> 
>  Marc
> 
> 
> On 08/16/2010 05:05 PM, Mark Cowley wrote:
>> hi Paul & Marc,
>> in addition to the thousands of control probes, there are non-protein-coding genes on these arrays - things like snoRNAs and precursor microRNAs which might not have a classical gene symbol.
>> I find that the mrna_assignment column from the Affy csv has a lot more information for these genes than the BioC annotation packages, so I'll do what you suggest Marc & try to 'improve' the mapping via the SQLForge code. I've done a fair amount of the groundwork on this already, so who could I communicate these changes to?
>> 
>> cheers,
>> Mark
>> -----------------------------------------------------
>> Mark Cowley, PhD
>> 
>> Peter Wills Bioinformatics Centre
>> Garvan Institute of Medical Research, Sydney, Australia
>> -----------------------------------------------------
>> 
>> On 17/08/2010, at 9:26 AM, Marc Carlson wrote:
>> 
>> 
>>> Hi Paul,
>>> 
>>> I looked into this for you.  Often there will be discrepancies like this
>>> for purely historical reasons.  For example, Affy may have made the
>>> probes based on one idea about what the transcriptome looked like and
>>> then this could have changed by the time they shipped their product. 
>>> That kind of discrepancy happens all the time and especially with older
>>> chips.  But in your case, you also seem to have a lot of control probes
>>> on this platform. 
>>> 
>>> You can extract the unmatched probes like this:
>>> 
>>> library (hugene10stprobeset.db)
>>> a = hugene10stprobesetENTREZID
>>> oddProbes = keys(a)[! (keys(a) %in% mappedkeys(a))]
>>> 
>>> I actually pulled down the .csv mapping from Affymetrix that Arthur Li
>>> would have used to generate this database.  And I noticed that all the
>>> oddProbes I was looking at were control probes.  In fact, more than 4
>>> thousand of these probes are control probes.  Looking more closely at
>>> this file, you will see that many, many other probes have no gene
>>> mapping to them even though they are not listed as control probes.  What
>>> is going on with some of those probesets?  Why has Affy refused to
>>> assign an identity to those ones?  That is really more of a question for
>>> Affymetrix than for us.
>>> 
>>> When we map these IDs to make annotation packages, we look for known
>>> gene IDs from the manufacturer (unigene, refseq etc.), and we then map
>>> those onto entrez gene IDs from NCBI and from there onto other
>>> annotations.  But if the people who make the array are not willing to
>>> tell us what these things map to then we could really only speculate
>>> about what they are.
>>> 
>>> But, if you have some external information that helps you to decide what
>>> these probes really map to, (maybe you have mapped the probesets onto
>>> the genome yourself or else maybe you feel that you can extract a little
>>> more data out of Affy's .csv file than this author did), then in that
>>> case you can always feed that "improved" mapping into the SQLForge code
>>> in the AnnotationDbi package and generate your very own version of this
>>> annotation package.  It is pretty straightforward to do so and is
>>> described in the SQLForge vignette here:
>>> 
>>> http://www.bioconductor.org/packages/release/bioc/html/AnnotationDbi.html
>>> 
>>> I hope this helps explain things,
>>> 
>>> 
>>> Marc
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On 08/16/2010 02:17 PM, Paul Shannon wrote:
>>> 
>>>> Here's an annotation question someone might be able to help me out with.  I'll be grateful.
>>>> 
>>>> Affymetrix describes their 'GeneChip Human Gene 1.0 ST Array':
>>>> 
>>>> Each of the 28,869 genes is represented on the array by approximately 26 probes spread across the full length of the gene, providing a more complete and more accurate picture of gene expression than 3’ based expression array designs. ... The Gene 1.0 ST Array uses a subset of probes from the Human Exon 1.0 ST Array and covers only well-annotated content. 
>>>> 
>>>> This sounds to me like affy started with sequence from exons of ~29k genes and created probes.  
>>>> But when I look at the bioc annotation for this chip (hugene10stprobeset.db, hugene10sttranscriptcluster.db), I find that about 7% of the probesets are NOT annotated to geneIDs.  The sibling mapping, hugene10sttranscriptclusterENTREZID, though smaller, has a higher proportion of unmapped keys.
>>>> 
>>>>   library (hugene10stprobeset.db)
>>>>   library (hugene10sttranscriptcluster.db)
>>>>   bm = hugene10stprobesetENTREZID
>>>>   length (keys (bm))    #  257022
>>>>   count.mappedkeys (bm) #  238141
>>>>                # unmapped:  18881
>>>>   cm = hugene10sttranscriptclusterENTREZID
>>>>   length (keys (cm))    #  33257
>>>>   count.mappedkeys (cm) #  21787
>>>> 
>>>> The same proportions (unmapped/mapped) seem to hold true for hugene10stprobesetENSEMBL as well.
>>>> 
>>>> Can anyone suggest where I can get entrez geneID annotations for these unmapped probes?   Or otherwise clear up my confusion? 
>>>> 
>>>> Thanks!
>>>> 
>>>> - Paul
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at stat.math.ethz.ch
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>> 
>>>> 
>>> 
>> 
>> 
>> 
>> 
>