[Bioc-devel] (missing?) UCSCKG -> SYMBOL mappings in Homo.sapiens (etc.)

Thu Feb 14 00:35:26 CET 2013

Just posting an update on this.

Just as I was composing a carefully worded email to the folks at UCSC, I 
see they seem to have fixed the table browser so that it now looks the 
same as the FTP site (and hence the results that come back from 
rtracklayer).  This means that the UCSC files now look like our 
annotation again.
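
In case anyone wants to repeat this kind of check later, here is a minimal 
sketch of comparing the transcript IDs from two sources.  The helper name 
compare_ids and the two toy vectors are made up for illustration (they 
reuse IDs from this thread and are NOT real downloads).  The commented 
select() call is the one from the earlier message, written with the newer 
columns= argument in place of cols=.

```r
# Hypothetical helper: report which IDs are unique to each source.
compare_ids <- function(a, b) {
  list(
    only_in_a = setdiff(a, b),   # present in a, missing from b
    only_in_b = setdiff(b, a),   # present in b, missing from a
    shared    = intersect(a, b)  # present in both
  )
}

# Toy stand-ins for illustration only; not actual UCSC or package contents.
table_browser_ids <- c("uc002yjx.2", "uc010gkz.1")
annotation_ids    <- c("uc002yjx.2")

res <- compare_ids(table_browser_ids, annotation_ids)
res$only_in_a
## [1] "uc010gkz.1"

# With Homo.sapiens installed, the symbol lookup itself would look like
# this (older releases spell the argument cols= instead of columns=):
#   library(Homo.sapiens)
#   select(Homo.sapiens, columns=c("SYMBOL","TXNAME"),
#          keys=c("uc002yjx.2","uc021whl.1"), keytype="TXNAME")
```

If only_in_a and only_in_b both come back empty, the two sources agree on 
which IDs exist (though not necessarily on what they map to).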

   Marc

On 02/12/2013 01:35 PM, Marc Carlson wrote:
> On 02/12/2013 10:04 AM, Tim Triche, Jr. wrote:
>> re:  '[BioC] question about Gviz' thread fallout:
>>
>> Yesterday I rolled a relatively simple programmatic way to label UCSC
>> KnownGene entries with their symbols.  However, some isoforms (e.g. some
>> for NRIP1 and CDKN2B) seem to be missing from the mappings.
>>
>> Investigating a bit, and referring to ?org.Hs.egUCSCKG, I find
>>
>> ...This mapping is based on the very latest build available at UCSC
>>      for this organism as of March 2010.  2.6 is the last release where
>>      you can expect it to be here.  The GenomicFeatures package
>>      contains functionality that replaces the need for this mapping...
>>
>> Alas, I'm too thick to find where, in the TxDb or elsewhere, I could
>> retrieve Hugo IDs for UCSC KnownGene entries without using
>> org.Hs.egSYMBOL.  The latter is what I usually do:
>>
>>     library(Homo.sapiens)
>>
>>     txs <- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>     head(names(txs))
>>     ## [1] "1"         "10"        "100"       "1000"      "10000"     "100008586"
>>
>>     names(txs) <- mget(names(txs), org.Hs.egSYMBOL, ifnotfound=NA)
>>     head(names(txs))
>>     ## [1] "A1BG"    "NAT2"    "ADA"     "CDH2"    "AKT3"    "GAGE12F"
>>
>> Now, I thought for a while, hell, this gets them all!  But, not really...
>>
>>     txs$NRIP1
>>     ## GRanges with 1 range and 2 metadata columns:
>>     ##       seqnames               ranges strand |     tx_id     tx_name
>>     ##          <Rle>            <IRanges>  <Rle> | <integer> <character>
>>     ##   [1]    chr21 [16333556, 16437126]      - |     71301  uc002yjx.2
>>
>> Well, that's one of the isoforms.  But what about the other ones?
>>
>>     org.Hs.egUCSCKG[[ "c002yjx.1" ]]
>>     ## NULL
>>
>>     org.Hs.egUCSCKG[[ "uc010gkz.1" ]]
>>     ## NULL
>>
>> I know UCSC identifiers can be a bit of a pain in the ass, but there do
>> exist mappings for these.  If they're going to be used as primary
>> identifiers for the TxDb packages, would it be possible to update them?
>>
>> If it's an issue of time constraints, I will take a stab at it, but that
>> will almost guarantee more prattling from me on the mailing list.  On the
>> other hand, it might move GAF3.0 annotations out of the station.
>>
>> Much obliged for any insights from the core developers.
>
>
> Hi Tim,
>
> So continuing from the other thread...
>
> 1st thing I noticed is that if you try to look up the two known gene IDs
> that you gave me you will not have any luck.  From using the web
> service, it seems that they are not actually valid UCSC known gene IDs.
> At 1st I thought that maybe there had been updates since the last
> Bioconductor release in October, but pasting these IDs into the UCSC
> genome browser only led me to this:
>
> # Sorry, couldn't locate uc010gkz.1 in genome database
> # Sorry, couldn't locate c002yjx.1 in genome database
>
> So at this point I was a little curious where you actually got these ids
> from?  (I will actually return to this in a minute)
>
> Anyhow, looking deeper, the website indicates that there is another
> isoform for NRIP1 (other than "uc002yjx.2").  It is called
> "uc021whl.1".  And it does indeed come up empty-handed if you call
> select like this:
>
> select(Homo.sapiens, cols=c("SYMBOL","TXNAME"),
> keys=c("uc002yjx.2","uc021whl.1"), keytype="TXNAME")
>
> So what happened here?  Well the track data from UCSC doesn't have a
> gene assigned to that isoform yet.  So the DB has no way of knowing that
> it's connected.  Incidentally, this is still true even if you were to
> download it this morning.
>
> So here we have a situation where the UCSC web site has been updated,
> but their track table (and in particular the table called
> "knownToLocusLink") is not perfectly in sync with the web site.
>
> Even weirder is the fact that if you use the "table browser" to download
> the "latest" "knownToLocusLink" table (which is yet another service on
> their web site), you will get a table that has two isoforms (associated
> with NRIP1) that look very similar to the ones you mentioned before.
> In fact I am willing to guess that this is where you got these from, and
> that the shortened one is just a copy-paste typo.
>
> So the problem here is that there seem to be three different ways to get
> the same kind of data from UCSC genome browser.  There is the
> website/browser.  There is rtracklayer, and then there is also the web
> form access to the table browser (which is what I think you used).  AND
> they all three seem to be in disagreement with each other.  My suspicion
> is that some of these are just more up to date than others.  But I think
> that only UCSC will really know which one is most current or why they
> seem to disagree.
>
> I have CC'd Michael who maintains the excellent rtracklayer package in
> case he has some insight.
>
>
>     Marc
>
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel