[BioC] question about TranscriptDb
Marc Carlson
mcarlson at fhcrc.org
Tue Dec 11 00:34:50 CET 2012
Unfortunately, if we did that, there could be all sorts of unwanted
consequences.
By doing this, you would be introducing an arbitrary number of new
strings as IDs for all of these orphaned transcripts. And unlike NAs
(which are the traditional way of indicating that data is missing in R),
you would get no warnings about any of these when you used them in
subsequent analysis. Others could use your new faux IDs to get into
all sorts of trouble. And it would be even worse because they would be
mixed in with real IDs (Entrez Gene IDs), which would lend them a
confusing air of authenticity. Downstream users might even mix the faux
IDs from different species, etc.
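
To make the NA point concrete, here is a small base-R sketch. The
table, the transcript names, and the "UNKNOWN1" label are all invented
for illustration; nothing here comes from a real TxDb package.

```r
# Hypothetical transcript-to-gene table; names and IDs are made up.
tx2gene <- data.frame(
  tx_name = c("txA", "txB", "txC"),
  gene_id = c("1234", NA, "5678"),
  stringsAsFactors = FALSE
)

# With NA, the gap is visible and easy to test for.
n_missing_na <- sum(is.na(tx2gene$gene_id))

# Replace the NA with a placeholder string, as proposed.
faux <- tx2gene
faux$gene_id[is.na(faux$gene_id)] <- "UNKNOWN1"

# Now nothing flags the orphan: is.na() finds no gaps, and the
# placeholder is counted as one more "gene" alongside the real IDs.
n_missing_faux <- sum(is.na(faux$gene_id))
n_genes_na     <- length(unique(na.omit(tx2gene$gene_id)))
n_genes_faux   <- length(unique(faux$gene_id))
```

In the NA version the missing value can be detected and dropped; in the
placeholder version it silently inflates the gene count.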
And even if we accepted the risks, we don't even have a good way of
always grouping the unassigned transcripts, which means that transcripts
that are probably from the same gene will be assigned like this:
unknown1 = tx1 (overlaps with tx2)
unknown2 = tx2 (overlaps with tx1)
etc.
So this strategy would also end up implying things that we know are
sometimes not going to be true. Meanwhile, these half-wrong unknown
transcript assignments would be mixed in with the "real" ones...
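
A rough base-R sketch of that grouping problem follows. The
coordinates and the one-placeholder-per-transcript scheme are
assumptions made up for the example, and overlap is used only as a
stand-in for "probably the same gene", which is itself not reliable.

```r
# Hypothetical orphaned transcripts on one chromosome and strand.
orphans <- data.frame(
  tx_name = c("tx1", "tx2", "tx3"),
  start   = c(100, 150, 900),
  end     = c(200, 300, 950),
  stringsAsFactors = FALSE
)

# Naive scheme: one placeholder ID per transcript.
orphans$naive_id <- paste0("unknown", seq_len(nrow(orphans)))

# Two closed intervals overlap when each starts before the other ends.
overlaps <- function(i, j) {
  orphans$start[i] <= orphans$end[j] && orphans$start[j] <= orphans$end[i]
}

# tx1 and tx2 overlap, yet the naive scheme labels them as different genes.
tx1_tx2_overlap <- overlaps(1, 2)
tx1_tx2_same_id <- orphans$naive_id[1] == orphans$naive_id[2]
```

Here tx1 and tx2 overlap but receive unknown1 and unknown2, so the
placeholder IDs assert a distinction that the data does not support.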
I could go on and on, but I am hoping you can see some of what I am
concerned about?
Anyhow, you can already discover which genes are associated with
transcripts in many other ways. The simplest approach is probably to
just use select() like this:
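library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
# all transcript names known to the package
k <- keys(txdb, "TXNAME")
# map each transcript name to its gene ID
# (older AnnotationDbi releases spelled the 'columns' argument 'cols')
res <- select(txdb, columns=c("TXNAME","GENEID"), keys=k, keytype="TXNAME")
head(res)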
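
Once you have a table like that, picking out the transcripts with no
gene attached is plain data-frame work. The sketch below uses a mock
data frame in place of a real select() result so that it runs without
the annotation package; it assumes unmatched transcripts show up with
NA in GENEID. Only uc021ums.1 comes from this thread; the other names
and IDs are placeholders.

```r
# Mock of a select() result; in real use `res` would come from
# select(txdb, ...). Rows other than uc021ums.1 are invented.
res <- data.frame(
  TXNAME = c("txA", "uc021ums.1", "txB"),
  GENEID = c("1234", NA, "5678"),
  stringsAsFactors = FALSE
)

# Transcripts that have no Entrez Gene ID attached.
orphaned <- res$TXNAME[is.na(res$GENEID)]

# Transcripts grouped by gene, with the NA rows left out.
has_gene <- !is.na(res$GENEID)
by_gene  <- split(res$TXNAME[has_gene], res$GENEID[has_gene])
```

Keeping the NA rows separate like this leaves the decision about what
to do with orphaned transcripts to each analysis.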
Alternatively, you could also do something like this (if you had
already called transcripts() as below):
t <- transcripts(txdb,columns="gene_id")
as.character(mcols(t)$gene_id)
Marc
On 12/10/2012 12:25 PM, Ryan C. Thompson wrote:
> I have also been bitten by the fact that some transcripts are missing
> gene IDs. Is it possible to add placeholder gene IDs to these? For
> example, just assigning them UNKNOWN1, UNKNOWN2, etc.?
>
> On Mon 10 Dec 2012 11:40:35 AM PST, Marc Carlson wrote:
>> Hi Matthew,
>>
>> Thanks for your detailed exploration of this. After looking more
>> closely, I think the confusion here is being caused by the fact that you
>> are looking at the kgXref table, whereas what was actually used to attach
>> gene IDs to the TxDb database is the knownToLocusLink
>> <http://genome.ucsc.edu/cgi-bin/hgTables?hgsid=316115443&hgta_doSchemaDb=hg19&hgta_doSchemaTable=knownToLocusLink>
>>
>> table. Adding to the mayhem, UCSC has apparently decided to allow
>> different values to exist in the latest versions of these two tables.
>>
>> We chose to use the Entrez Gene IDs as gene identifiers because (unlike
>> gene symbols) they represent a real identifier and can thus be relied on
>> to not have multiple different meanings etc.
>>
>>
>> Marc
>>
>>
>>
>> On 12/10/2012 09:06 AM, Matthew D. Wilkerson wrote:
>>> Hello,
>>>
>>> I have a question about the gene_id attribute of
>>> TxDb.Hsapiens.UCSC.hg19.knownGene, version 2.80 (latest).
>>>
>>> I noticed that some transcripts such as uc021ums.1, do not have an
>>> associated gene_id.
>>>
>>> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
>>> t <- transcripts(txdb, columns=c("gene_id","tx_id","tx_name","cds_id","cds_name"))
>>>
>>> t[ which(elementMetadata(t)[,"tx_name"]=="uc021ums.1"), ]
>>>
>>> I understand that some ucsc genes might not have an entrez gene id
>>> associated.
>>> I checked this locus and found that currently UCSC db does have this
>>> locus associated with LINGO3.
>>>
>>> #field                        value
>>> hg19.knownGene.name           uc021ums.1
>>> hg19.knownGene.chrom          chr19
>>> hg19.knownGene.strand         -
>>> hg19.knownGene.txStart        2289996
>>> hg19.knownGene.txEnd          2291775
>>> hg19.knownGene.cdsStart       2289996
>>> hg19.knownGene.cdsEnd         2291775
>>> hg19.knownGene.exonCount      1
>>> hg19.knownGene.exonStarts     2289996,
>>> hg19.knownGene.exonEnds       2291775,
>>> hg19.knownGene.proteinID      P0C6S8
>>> hg19.knownGene.alignID        uc021ums.1
>>> hg19.kgXref.kgID              uc021ums.1
>>> hg19.kgXref.geneSymbol        LINGO3
>>>
>>>
>>> The kgXref table was last updated 2/5/12.
>>>
>>>
>>> The bioconductor package was made on:
>>> Creation time: 2012-09-10 12:56:25 -0700 (Mon, 10 Sep 2012)
>>>
>>> If this date also refers to the date of download, then why is this
>>> transcript not affiliated with LINGO3?
>>> If not, then what date does known gene refer to?
>>>
>>>
>>> Thanks,
>>> Matt
>>>
>>
>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor