[BioC] unmapped keys in hugene10stprobeset.db
Mark Cowley
m.cowley at garvan.org.au
Tue Aug 17 02:05:06 CEST 2010
Hi Paul & Marc,
In addition to the thousands of control probes, there are non-protein-coding genes on these arrays - things like snoRNAs and precursor microRNAs, which might not have a classical gene symbol.
I find that the mrna_assignment column from the Affy csv has a lot more information for these genes than the BioC annotation packages, so I'll do what you suggest, Marc, and try to 'improve' the mapping via the SQLForge code. I've done a fair amount of the groundwork on this already, so who could I communicate these changes to?
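A minimal sketch of the kind of parsing involved, in base R. It assumes that the mrna_assignment field separates assignments with " /// " and the fields within an assignment with " // ", with the accession first and the source second, and that "---" marks an empty value; all of this is worth checking against the actual csv. The helper name parseMrnaAssignment and the example string are made up for illustration and are not part of SQLForge or of the real annotation file.

```r
# Hypothetical helper: split one mrna_assignment string into a data frame
# with one row per assignment. Assumes " /// " between assignments and
# " // " between fields (accession first, source second).
parseMrnaAssignment <- function(x) {
  # Assumption: NA or "---" means that no assignment was given
  if (is.na(x) || x == "---") {
    return(data.frame(accession = character(0), source = character(0),
                      stringsAsFactors = FALSE))
  }
  entries <- strsplit(x, " /// ", fixed = TRUE)[[1]]
  fields <- strsplit(entries, " // ", fixed = TRUE)
  data.frame(
    accession = vapply(fields, function(f) trimws(f[1]), character(1)),
    source    = vapply(fields, function(f) trimws(f[2]), character(1)),
    stringsAsFactors = FALSE
  )
}

# Made-up example value, not taken from the real annotation file
example <- paste(
  "NR_000001 // RefSeq // hypothetical small nucleolar RNA",
  "ENST00000000001 // ENSEMBL // hypothetical ncrna",
  sep = " /// "
)
res <- parseMrnaAssignment(example)
print(res)
```

The accessions recovered this way would still need mapping onto Entrez Gene IDs before they could be fed into SQLForge.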
cheers,
Mark
-----------------------------------------------------
Mark Cowley, PhD
Peter Wills Bioinformatics Centre
Garvan Institute of Medical Research, Sydney, Australia
-----------------------------------------------------
On 17/08/2010, at 9:26 AM, Marc Carlson wrote:
> Hi Paul,
>
> I looked into this for you. Often there will be discrepancies like this
> for purely historical reasons. For example, Affy may have made the
> probes based on one idea about what the transcriptome looked like and
> then this could have changed by the time they shipped their product.
> That kind of discrepancy happens all the time and especially with older
> chips. But in your case, you also seem to have a lot of control probes
> on this platform.
>
> You can extract the unmatched probes like this:
>
> library (hugene10stprobeset.db)
> a = hugene10stprobesetENTREZID
> # keys with no Entrez Gene ID: all keys minus the mapped ones
> oddProbes = keys(a)[! (keys(a) %in% mappedkeys(a))]
>
> I actually pulled down the .csv mapping from Affymetrix that Arthur Li
> would have used to generate this database. And I noticed that all the
> oddProbes I was looking at were control probes. In fact, more than 4
> thousand of these probes are control probes. Looking more closely at
> this file, you will see that many, many other probes have no gene
> mapping to them even though they are not listed as control probes. What
> is going on with some of those probesets? Why has Affy refused to
> assign an identity to those ones? That is really more of a question for
> Affymetrix than for us.
>
> When we map these IDs to make annotation packages, we look for known
> gene IDs from the manufacturer (UniGene, RefSeq etc.), and we then map
> those onto Entrez Gene IDs from NCBI and from there onto other
> annotations. But if the people who make the array are not willing to
> tell us what these things map to then we could really only speculate
> about what they are.
>
> But, if you have some external information that helps you to decide what
> these probes really map to, (maybe you have mapped the probesets onto
> the genome yourself or else maybe you feel that you can extract a little
> more data out of Affy's .csv file than this author did), then in that
> case you can always feed that "improved" mapping into the SQLForge code
> in the AnnotationDbi package and generate your very own version of this
> annotation package. It is pretty straightforward to do so and is
> described in the SQLForge vignette here:
>
> http://www.bioconductor.org/packages/release/bioc/html/AnnotationDbi.html
>
> I hope this helps explain things,
>
>
> Marc
>
> On 08/16/2010 02:17 PM, Paul Shannon wrote:
>> Here's an annotation question someone might be able to help me out with. I'll be grateful.
>>
>> Affymetrix describes their 'GeneChip Human Gene 1.0 ST Array':
>>
>> Each of the 28,869 genes is represented on the array by approximately 26 probes spread across the full length of the gene, providing a more complete and more accurate picture of gene expression than 3’ based expression array designs. ... The Gene 1.0 ST Array uses a subset of probes from the Human Exon 1.0 ST Array and covers only well-annotated content.
>>
>> This sounds to me like Affy started with sequence from exons of ~29k genes and created probes.
>> But when I look at the BioC annotation for this chip (hugene10stprobeset.db, hugene10sttranscriptcluster.db), I find that about 7% of the probes are NOT annotated to gene IDs. The sibling package's map, hugene10sttranscriptclusterENTREZID, though smaller, has a higher proportion of unmapped keys.
>>
>> library (hugene10stprobeset.db)
>> library (hugene10sttranscriptcluster.db)
>> bm = hugene10stprobesetENTREZID
>> length (keys (bm)) # 257022
>> count.mappedkeys (bm) # 238141
>> # unmapped: 18881
>> cm = hugene10sttranscriptclusterENTREZID
>> length (keys (cm)) # 33257
>> count.mappedkeys (cm) # 21787
>> # unmapped: 11470
>>
>> The same proportions (unmapped/mapped) seem to hold true for hugene10stprobesetENSEMBL as well.
>>
>> Can anyone suggest where I can get Entrez Gene ID annotations for these unmapped probes? Or otherwise clear up my confusion?
>>
>> Thanks!
>>
>> - Paul
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>