[BioC] TranscriptDb of GENCODE Genes

Mon Aug 12 20:44:54 CEST 2013

Hi Dario,

I think it's great to discuss this.  It is not often enough that people 
point out just how unusual some of these gtf files can be.

However, it is not the intended job of the makeTranscriptDbFromGFF() 
function to do data cleanup or custom pre-filtering.  That function is 
already stressed enough just handling all the peculiar ways that GFF and 
GTF files can represent essentially the same information.  So much so 
that my effort to keep to one job per function is already being 
strained in this case.

But if you wanted to write some functions that clean up unwanted 
features from GTF and GFF files, it is possible that other people might 
find them useful too.
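
As a rough sketch of what such a cleanup function might look like, 
something along these lines would drop every transcript flagged with one 
of the *_NF tags before the file is handed to makeTranscriptDbFromGFF(). 
The name dropTaggedTranscripts() and the output file name are made up for 
illustration; this is not a GenomicFeatures function. It also assumes 
that tags appear in the attribute column as tag "..." and that feature 
lines carry a transcript_id "..." attribute, so treat it as a starting 
point rather than a tested solution.

```r
# Hypothetical helper, base R only: remove all lines belonging to
# transcripts that carry any of the given GENCODE tags.
dropTaggedTranscripts <- function(gtfLines,
                                  tags = c("cds_start_NF", "cds_end_NF",
                                           "mRNA_start_NF", "mRNA_end_NF")) {
    # Assumed attribute layout: tag "mRNA_end_NF"; transcript_id "ENST...";
    tagPattern <- paste0("tag \"(", paste(tags, collapse = "|"), ")\"")
    idPattern <- "transcript_id \"[^\"]+\""

    # Pull out the transcript_id attribute of each line (NA where absent,
    # for example on comment lines).
    hasId <- grepl(idPattern, gtfLines)
    ids <- rep(NA_character_, length(gtfLines))
    ids[hasId] <- regmatches(gtfLines, regexpr(idPattern, gtfLines))

    # Transcripts with at least one tagged line are dropped entirely,
    # including their exon, CDS and UTR lines.
    badIds <- unique(ids[hasId & grepl(tagPattern, gtfLines)])
    gtfLines[!(ids %in% badIds)]
}

# Toy input: T1 is tagged as incomplete, T2 is not.
demo <- c(
    "##description: toy example",
    "chr1\tHAVANA\ttranscript\t1\t100\t.\t+\t.\tgene_id \"G1\"; transcript_id \"T1\"; tag \"mRNA_end_NF\";",
    "chr1\tHAVANA\texon\t1\t100\t.\t+\t.\tgene_id \"G1\"; transcript_id \"T1\";",
    "chr1\tHAVANA\ttranscript\t200\t300\t.\t+\t.\tgene_id \"G2\"; transcript_id \"T2\"; tag \"basic\";"
)
kept <- dropTaggedTranscripts(demo)

# On the real file (output file name is hypothetical):
# gtf <- readLines("gencode.v17.annotation.gtf")
# writeLines(dropTaggedTranscripts(gtf), "gencode.v17.filtered.gtf")
```

The filtered file could then be passed to makeTranscriptDbFromGFF() in the 
same way as the original one. Whether dropping these transcripts outright 
is the right thing to do will depend on the analysis.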

   Marc

On 08/09/2013 12:00 AM, Dario Strbenac wrote:
> Hello,
>
> Who else uses the GENCODE annotation in their analyses ? I just found out that some transcripts are annotated as incomplete fragments. This is described in http://www.gencodegenes.org/gencode_tags.html but not in "GENCODE: the reference human genome annotation for The ENCODE Project." Genome Research, 2012.
>
> cds_end_NF : the coding region end could not be confirmed.
> cds_start_NF : the coding region start could not be confirmed.
> mRNA_end_NF : the mRNA end could not be confirmed.
> mRNA_start_NF : the mRNA start could not be confirmed.
>
> Over 10 % of transcripts are missing their RNA ends and almost as many are missing either a 5' UTR or a 3' UTR.
>
> /nb/dario/genes$ egrep -c "(HAVANA|ENSEMBL)     transcript" gencode.v17.annotation.gtf
> 194871
> /nb/dario/genes$ egrep "(HAVANA|ENSEMBL)        transcript" gencode.v17.annotation.gtf | grep -c mRNA_end_NF -
> 21699
> /nb/dario/genes$ egrep "(HAVANA|ENSEMBL)        transcript" gencode.v17.annotation.gtf | grep -c cds_end_NF -
> 19788
>
> Have you been using this gene annotation as-is for counting in windows around transcription start sites or transcription end sites ? Have you been using the functions fiveUTRsByTranscript or threeUTRsByTranscript ? If so, your results are incorrect, too.
>
> Also, could the function makeTranscriptDbFromGFF be given a way to filter on elements of the attribute column ? As it stands, this finding makes the function unusable for reading the GENCODE annotation into R.
>
> This can also be observed by noticing that some transcripts have a 3' UTR, but no 5' UTR, and vice-versa :
>
> genes <- makeTranscriptDbFromGFF("gencode.v17.annotation.gtf", format = "gtf", exonRankAttributeName = "exon_number")
> UTR5 <- fiveUTRsByTranscript(genes, use.names = TRUE)
> UTR3 <- threeUTRsByTranscript(genes, use.names = TRUE)
> whichNo3prime <- setdiff(names(UTR5), names(UTR3))
> whichNo5prime <- setdiff(names(UTR3), names(UTR5))
>
>> length(whichNo5prime)
> [1] 12217
>> length(whichNo3prime)
> [1] 16675
>
> So, 12217 transcripts have a 3' UTR but no 5' UTR, and 16675 transcripts have a 5' UTR but no 3' UTR.
>
> Also, note that some transcripts don't have the expected attribute set. Have a look at ENST00000381469.2 in a genome browser and notice it's missing mRNA_start_NF. Or, is it possible to start translation from the very first 3 bases of a transcript ?
>
> --------------------------------------
> Dario Strbenac
> PhD Student
> University of Sydney
> Camperdown NSW 2050
> Australia