[BioC] unmapped keys in hugene10stprobeset.db
Mark Cowley
m.cowley at garvan.org.au
Wed Aug 18 01:35:14 CEST 2010
Thanks for the clarification, Marc.
Most of my 'improvements' are not Entrez Gene-centric. I glean far more annotation for each probeset by parsing the mrna_assignment field when the gene_assignment field is empty. This usually yields at least a GenBank ID and a description of the transcript and, as I pointed out earlier, microRNAs and snoRNAs.
I'll investigate the potential relationships between the new annotation that I'm uncovering and Entrez Gene.
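
A minimal sketch of that fallback parsing, in base R. It assumes the usual layout of the Affymetrix annotation csv, where multiple assignments in one field are separated by " /// ", the sub-fields of each assignment by " // ", and an empty field is written as "---". The helper name parse_assignment and the example string are hypothetical, made up purely for illustration, and the three column names are my own labels for the first three sub-fields.

```r
# Hypothetical helper: split one mrna_assignment string into a data frame
# with one row per assignment. Returns NULL when the field is empty.
# Assumes " /// " between assignments, " // " between sub-fields and
# "---" for an empty field.
parse_assignment <- function(x) {
  if (is.na(x) || !nzchar(x) || x == "---") return(NULL)
  entries <- strsplit(x, " /// ", fixed = TRUE)[[1]]
  fields <- strsplit(entries, " // ", fixed = TRUE)
  # Pick the n-th sub-field of every assignment (NA if it is missing).
  pick <- function(n) vapply(fields, function(f) f[n], character(1))
  data.frame(
    accession = pick(1),
    source = pick(2),
    description = pick(3),
    stringsAsFactors = FALSE
  )
}

# Invented example value, not taken from the real csv.
example <- paste(
  "NR_000001 // RefSeq // Hypothetical small nucleolar RNA // chr1 // 100",
  "AB000001 // GenBank // Hypothetical snoRNA transcript // chr1 // 100",
  sep = " /// "
)
res <- parse_assignment(example)
print(res[, c("accession", "source")])
```

Only the first three sub-fields are kept because they are what gives an accession and a transcript description; the remaining sub-fields are ignored here.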
cheers,
Mark
On 18/08/2010, at 3:12 AM, Marc Carlson wrote:
> Hi Mark,
>
> You should talk to me about annotations. I maintain the annotation
> repository here and make sure that all of the packages get re-made for
> each release, etc. This particular package was contributed and is
> maintained by Arthur Li. So I will contact the two of you off list as
> needed, depending on what you find out in the "improvement" department.
>
> Something that may help you to be aware of as you explore this is that
> the annotations and the SQLForge code that generates them are all Entrez
> Gene-centric. So you need to be able to connect the probe to an Entrez
> Gene ID that was not mapped to before in order to "improve" them. But,
> if you have new information about probes that map to things like
> microRNAs, then that really could help since there *are* Entrez Gene IDs
> for those things in NCBI (and in our supporting "org" packages). This is
> true even though these things are not really genes in the strictest
> sense of the word.
>
>
> Marc
>
>
> On 08/16/2010 05:05 PM, Mark Cowley wrote:
>> hi Paul & Marc,
>> in addition to the thousands of control probes, there are non-protein-coding genes on these arrays, such as snoRNAs and precursor microRNAs, which might not have a classical gene symbol.
>> I find that the mrna_assignment column from the Affy csv has a lot more information for these genes than the BioC annotation packages, so I'll do what you suggest, Marc, and try to 'improve' the mapping via the SQLForge code. I've done a fair amount of the groundwork on this already, so who could I communicate these changes to?
>>
>> cheers,
>> Mark
>> -----------------------------------------------------
>> Mark Cowley, PhD
>>
>> Peter Wills Bioinformatics Centre
>> Garvan Institute of Medical Research, Sydney, Australia
>> -----------------------------------------------------
>>
>> On 17/08/2010, at 9:26 AM, Marc Carlson wrote:
>>
>>
>>> Hi Paul,
>>>
>>> I looked into this for you. Often there will be discrepancies like this
>>> for purely historical reasons. For example, Affy may have made the
>>> probes based on one idea about what the transcriptome looked like and
>>> then this could have changed by the time they shipped their product.
>>> That kind of discrepancy happens all the time and especially with older
>>> chips. But in your case, you also seem to have a lot of control probes
>>> on this platform.
>>>
>>> You can extract the unmatched probes like this:
>>>
>>> library(hugene10stprobeset.db)
>>> a = hugene10stprobesetENTREZID
>>> # probeset IDs that have no Entrez Gene ID mapped to them
>>> oddProbes = keys(a)[!(keys(a) %in% mappedkeys(a))]
>>>
>>> I actually pulled down the .csv mapping from Affymetrix that Arthur Li
>>> would have used to generate this database. And I noticed that all the
>>> oddProbes I was looking at were control probes. In fact, more than 4
>>> thousand of these probes are control probes. Looking more closely at
>>> this file, you will see that many, many other probes have no gene
>>> mapping to them even though they are not listed as control probes. What
>>> is going on with some of those probesets? Why has Affy refused to
>>> assign an identity to those? That is really more of a question for
>>> Affymetrix than for us.
>>>
>>> When we map these IDs to make annotation packages, we look for known
>>> gene IDs from the manufacturer (unigene, refseq etc.), and we then map
>>> those onto entrez gene IDs from NCBI and from there onto other
>>> annotations. But if the people who make the array are not willing to
>>> tell us what these things map to then we could really only speculate
>>> about what they are.
>>>
>>> But, if you have some external information that helps you to decide what
>>> these probes really map to, (maybe you have mapped the probesets onto
>>> the genome yourself or else maybe you feel that you can extract a little
>>> more data out of Affy's .csv file than this author did), then in that
>>> case you can always feed that "improved" mapping into the SQLForge code
>>> in the AnnotationDbi package and generate your very own version of this
>>> annotation package. It is pretty straightforward to do so and is
>>> described in the SQLForge vignette here:
>>>
>>> http://www.bioconductor.org/packages/release/bioc/html/AnnotationDbi.html
>>>
>>> I hope this helps explain things,
>>>
>>>
>>> Marc
>>>
>>> On 08/16/2010 02:17 PM, Paul Shannon wrote:
>>>
>>>> Here's an annotation question someone might be able to help me out with. I'll be grateful.
>>>>
>>>> Affymetrix describes their 'GeneChip Human Gene 1.0 ST Array':
>>>>
>>>> Each of the 28,869 genes is represented on the array by approximately 26 probes spread across the full length of the gene, providing a more complete and more accurate picture of gene expression than 3’ based expression array designs. ... The Gene 1.0 ST Array uses a subset of probes from the Human Exon 1.0 ST Array and covers only well-annotated content.
>>>>
>>>> This sounds to me like affy started with sequence from exons of ~29k genes and created probes.
>>>> But when I look at the bioc annotation for this chip (hugene10stprobeset.db, hugene10sttranscriptcluster.db), I find that about 7% of the probesets are NOT annotated to geneIDs. The sibling map, hugene10sttranscriptclusterENTREZID, though smaller, has a higher proportion of unmapped keys.
>>>>
>>>> library(hugene10stprobeset.db)
>>>> library(hugene10sttranscriptcluster.db)
>>>> bm = hugene10stprobesetENTREZID
>>>> length(keys(bm))        # 257022
>>>> count.mappedkeys(bm)    # 238141
>>>> # unmapped: 18881
>>>> cm = hugene10sttranscriptclusterENTREZID
>>>> length(keys(cm))        # 33257
>>>> count.mappedkeys(cm)    # 21787
>>>> # unmapped: 11470
>>>>
>>>> The same proportions (unmapped/mapped) seem to hold true for hugene10stprobesetENSEMBL as well.
>>>>
>>>> Can anyone suggest where I can get entrez geneID annotations for these unmapped probes? Or otherwise clear up my confusion?
>>>>
>>>> Thanks!
>>>>
>>>> - Paul
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at stat.math.ethz.ch
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>>
>>>
>>
>