[BioC] TranscriptDb of GENCODE Genes
Marc Carlson
mcarlson at fhcrc.org
Mon Aug 12 20:44:54 CEST 2013
Hi Dario,
I think it's great to discuss this. It is not often enough that people
point out just how unusual some of these gtf files can be.
However, it is not the intended job of the makeTranscriptDbFromGFF()
function to do data cleanup or custom pre-filtering. That function is
already stressed enough just handling all the peculiar ways that GFF and
GTF files can represent essentially the same information. So much so
that my effort to keep one job per function is already being
strained in this case.
But if you wanted to write some functions that clean up unwanted
features from GTF and GFF files, it is possible that other people might
find them useful too.
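As a rough illustration of that kind of cleanup (a sketch only, not part of GenomicFeatures), a small shell function could drop the incomplete transcripts before the file ever reaches makeTranscriptDbFromGFF(). The name filter_incomplete_transcripts is made up for this example. It assumes a tab-separated GTF in which the tag names quoted below (cds_start_NF, cds_end_NF, mRNA_start_NF, mRNA_end_NF) appear verbatim in the attribute column (column 9), and in which every feature line carries a transcript_id attribute.

```shell
# Hypothetical helper: print a GTF file without any feature line that belongs
# to a transcript flagged with one of the GENCODE *_NF tags.
# Assumption: the tag names appear verbatim in column 9, and each feature
# line has a transcript_id "..." attribute there.
# The file is read twice, so the argument must be a regular file, not a pipe.
filter_incomplete_transcripts() {
    gtf="$1"
    awk -F'\t' '
        # Pass 1: remember the transcript_id of every line carrying an *_NF tag.
        NR == FNR {
            if ($9 ~ /(cds|mRNA)_(start|end)_NF/ &&
                match($9, /transcript_id "[^"]+"/))
                flagged[substr($9, RSTART, RLENGTH)] = 1
            next
        }
        # Pass 2: keep header/comment lines unchanged.
        /^#/ { print; next }
        # Pass 2: skip every line of a flagged transcript, print the rest.
        {
            if (match($9, /transcript_id "[^"]+"/) &&
                (substr($9, RSTART, RLENGTH) in flagged))
                next
            print
        }
    ' "$gtf" "$gtf"
}
```

Something like `filter_incomplete_transcripts gencode.v17.annotation.gtf > gencode.v17.complete.gtf` (the output file name is only an example) would then give a file that can be passed to makeTranscriptDbFromGFF() as usual. Whether throwing these transcripts away is the right choice depends on the analysis.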
Marc
On 08/09/2013 12:00 AM, Dario Strbenac wrote:
> Hello,
>
> Who else uses the GENCODE annotation in their analyses ? I just found out that some transcripts are annotated as incomplete fragments. This is described in http://www.gencodegenes.org/gencode_tags.html but not in "GENCODE: the reference human genome annotation for The ENCODE Project." Genome Research, 2012.
>
> cds_end_NF : the coding region end could not be confirmed.
> cds_start_NF : the coding region start could not be confirmed.
> mRNA_end_NF : the mRNA end could not be confirmed.
> mRNA_start_NF : the mRNA start could not be confirmed.
>
> Over 10 % of transcripts are missing their RNA ends and almost as many are missing either a 5' UTR or a 3' UTR.
>
> /nb/dario/genes$ egrep -c "(HAVANA|ENSEMBL) transcript" gencode.v17.annotation.gtf
> 194871
> /nb/dario/genes$ egrep "(HAVANA|ENSEMBL) transcript" gencode.v17.annotation.gtf | grep -c mRNA_end_NF -
> 21699
> /nb/dario/genes$ egrep "(HAVANA|ENSEMBL) transcript" gencode.v17.annotation.gtf | grep -c cds_end_NF -
> 19788
>
> Have you been using this gene annotation as-is for counting in windows around transcription start sites or transcription end sites ? Have you been using the functions fiveUTRsByTranscript or threeUTRsByTranscript ? If so, your results are incorrect, too.
>
> Also, could the function makeTranscriptDbFromGFF be given a way to filter on elements of the attribute column ? As it stands, this finding makes the function unusable for reading the GENCODE annotation into R.
>
> This can also be observed by noticing that some transcripts have a 3' UTR, but no 5' UTR, and vice-versa :
>
> genes <- makeTranscriptDbFromGFF("gencode.v17.annotation.gtf", format = "gtf", exonRankAttributeName = "exon_number")
> UTR5 <- fiveUTRsByTranscript(genes, use.names = TRUE)
> UTR3 <- threeUTRsByTranscript(genes, use.names = TRUE)
> whichNo3prime <- setdiff(names(UTR5), names(UTR3))
> whichNo5prime <- setdiff(names(UTR3), names(UTR5))
>
>> length(whichNo5prime)
> [1] 12217
>> length(whichNo3prime)
> [1] 16675
>
> So, 12217 transcripts have a 3' UTR but no 5' UTR, and 16675 transcripts have a 5' UTR but no 3' UTR.
>
> Also, note that some transcripts don't have the expected attribute set. Have a look at ENST00000381469.2 in a genome browser and notice it's missing mRNA_start_NF. Or, is it possible to start translation from the very first 3 bases of a transcript ?
>
> --------------------------------------
> Dario Strbenac
> PhD Student
> University of Sydney
> Camperdown NSW 2050
> Australia