[BioC] MakeTranscriptDbFromGFF
Marc Carlson
mcarlson at fhcrc.org
Fri May 3 00:52:38 CEST 2013
Hi Ugo,
The 15 GB tarball you sent me to contains several different GTF files
for genes. I grabbed this one as it seemed to be the most recent:
Mus_musculus/UCSC/mm9/Annotation/Archives/archive-2013-03-06-15-01-24/Genes/genes.gtf
So looking at this file I can reproduce the problem you mentioned. And
it shows me two problems. The 1st problem is that the only field that
seems to contain any information about exon positions is called phase
(and not "exon_number" as was in the code I see from before). There is
not actually any field called "exon_number" in this file. Either way,
one thing you can check is to make sure that the string you give here is
the same as the appropriate field name that is used by the file. There
is no way to know this information in advance since GTF does not specify
how to encode this information (and in fact the information is entirely
optional).
The second problem is that even "phase" can't work right now since the
authors of this gtf file have decided to only associate the exon rank
information only with CDS and never with exons features. So there is
not any actual 'exon' position information in this file, only
information for CDS positions. Now that I see people doing these files
in this way, I plan to enhance the parser so that it can process files
of this kind.
Is there a reason why you wanted to use this file and not the data
contained in this package here?
http://www.bioconductor.org/packages/2.12/data/annotation/html/TxDb.Mmusculus.UCSC.mm9.knownGene.html
Marc
On 05/02/2013 02:59 AM, Ugo Borello wrote:
> Dear Marc
> Sorry I was not precise on the origin of the gtf annotation file; I got the
> gtf file from here:
> http://tophat.cbcb.umd.edu/igenomes.shtml
>
> And more precisely from the Mus musculus/UCSC/mm9 folder
> Here the description of the content of the folder:
> ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/README.txt
>
> I realized reading the README.txt file that actually the genes.gtf file I
> used is the Ensembl annotation of the mm9 release.
> So, I changed dataSource = "Ensembl" in the function call and I got the same
> error message:
> Error in data.frame(..., check.names = FALSE) :
> arguments imply differing number of rows: 541775, 0
>
> At the end of my previous email you have the result of calling:
> annFile<- import.gff('genes.gtf', format='gtf', asRangedData=FALSE)
>
>
> Thank you
>
> Ugo
>
>> From: Marc Carlson <mcarlson at fhcrc.org>
>> Date: Wed, 01 May 2013 15:33:02 -0700
>> To: <bioconductor at r-project.org>
>> Subject: Re: [BioC] MakeTranscriptDbFromGFF
>>
>> Hi Ugo,
>>
>> Which UCSC file was it that you were trying to process?
>>
>>
>> Marc
>>
>>
>>
>> On 05/01/2013 02:21 AM, Ugo Borello wrote:
>>> Good morning,
>>>
>>> I have a little problem creating a TranscriptDb object using the function
>>> makeTranscriptDbFromGFF. I want to use this annotation to count the overlaps
>>> of my genomic alignments with genes.
>>>
>>>
>>> I ran:
>>>
>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
>>> + exonRankAttributeName = "exon_number",
>>> + chrominfo = chrominfo,
>>> + dataSource = "UCSC",
>>> + species = "Mus musculus")
>>>
>>> And I got this error message:
>>> Error in data.frame(..., check.names = FALSE) :
>>> arguments imply differing number of rows: 541775, 0
>>>
>>>
>>> "chrominfo" was (info retrieved from the fasta genome file):
>>>
>>>> chrominfo
>>> chrom length is_circular
>>> 1 chr10 129993255 FALSE
>>> 2 chr11 121843856 FALSE
>>> 3 chr12 121257530 FALSE
>>> 4 chr13 120284312 FALSE
>>> 5 chr14 125194864 FALSE
>>> 6 chr15 103494974 FALSE
>>> 7 chr16 98319150 FALSE
>>> 8 chr17 95272651 FALSE
>>> 9 chr18 90772031 FALSE
>>> 10 chr19 61342430 FALSE
>>> 11 chr1 197195432 FALSE
>>> 12 chr2 181748087 FALSE
>>> 13 chr3 159599783 FALSE
>>> 14 chr4 155630120 FALSE
>>> 15 chr5 152537259 FALSE
>>> 16 chr6 149517037 FALSE
>>> 17 chr7 152524553 FALSE
>>> 18 chr8 131738871 FALSE
>>> 19 chr9 124076172 FALSE
>>> 20 chrM 16299 TRUE
>>> 21 chrX 166650296 FALSE
>>> 22 chrY 15902555 FALSE
>>>
>>>
>>> I ran it again without the "exonRankAttributeName" argument and I got:
>>>
>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
>>> + chrominfo = chrominfo,
>>> + dataSource = "UCSC",
>>> + species = "Mus musculus")
>>> extracting transcript information
>>> Estimating transcript ranges.
>>> Extracting gene IDs
>>> Processing splicing information for gtf file.
>>> Deducing exon rank from relative coordinates provided
>>> Prepare the 'metadata' data frame ... metadata: OK
>>> Error in .checkForeignKey(transcripts_tx_chrom, NA, "transcripts$tx_chrom",
>>> :
>>> all the values in 'transcripts$tx_chrom' must be present in
>>> 'chrominfo$chrom'
>>> In addition: Warning message:
>>> In .deduceExonRankings(exs, format = "gtf") :
>>> Infering Exon Rankings. If this is not what you expected, then please be
>>> sure that you have provided a valid attribute for exonRankAttributeName
>>>
>>>
>>> Without the "chrominfo" argument I got the same error message as the first
>>> time:
>>>
>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
>>> + exonRankAttributeName = "exon_number",
>>> + dataSource = "UCSC",
>>> + species = "Mus musculus")
>>>
>>> Error in data.frame(..., check.names = FALSE) :
>>> arguments imply differing number of rows: 541775, 0
>>>
>>>
>>> Finally when I eliminated both the "exonRankAttributeName" and the
>>> "chrominfo" arguments it worked but the warning reminded me of the
>>> "exonRankAttributeName" argument and the chromosome names are now different
>>> from the ones in the genome file and there is no info on their length
>>>
>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
>>> + dataSource = "UCSC",
>>> + species = "Mus musculus")
>>> extracting transcript information
>>> Estimating transcript ranges.
>>> Extracting gene IDs
>>> Processing splicing information for gtf file.
>>> Deducing exon rank from relative coordinates provided
>>> Prepare the 'metadata' data frame ... metadata: OK
>>> Now generating chrominfo from available sequence names. No chromosome length
>>> information is available.
>>> Warning messages:
>>> 1: In .deduceExonRankings(exs, format = "gtf") :
>>> Infering Exon Rankings. If this is not what you expected, then please be
>>> sure that you have provided a valid attribute for exonRankAttributeName
>>> 2: In matchCircularity(chroms, circ_seqs) :
>>> None of the strings in your circ_seqs argument match your seqnames.
>>>
>>>> seqinfo(txdb)
>>> Seqinfo of length 32
>>> seqnames seqlengths isCircular genome
>>> chr13 <NA> FALSE <NA>
>>> chr9 <NA> FALSE <NA>
>>> chr6 <NA> FALSE <NA>
>>> chrX <NA> FALSE <NA>
>>> chr17 <NA> FALSE <NA>
>>> chr2 <NA> FALSE <NA>
>>> chr7 <NA> FALSE <NA>
>>> chr18 <NA> FALSE <NA>
>>> chr8 <NA> FALSE <NA>
>>> ... ... ... ...
>>> chrY_random <NA> FALSE <NA>
>>> chrX_random <NA> FALSE <NA>
>>> chr5_random <NA> FALSE <NA>
>>> chr4_random <NA> FALSE <NA>
>>> chrY <NA> FALSE <NA>
>>> chr7_random <NA> FALSE <NA>
>>> chr17_random <NA> FALSE <NA>
>>> chr13_random <NA> FALSE <NA>
>>> chr1_random <NA> FALSE <NA>
>>>
>>>
>>>
>>>
>>> What am I doing wrong in my original call to makeTranscriptDbFromGFF?
>>>
>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
>>> exonRankAttributeName = "exon_number",
>>> chrominfo = chrominfo,
>>> dataSource = "UCSC",
>>> species = "Mus musculus")
>>>
>>> Why am I getting this unfair error message?
>>> Thank you for your help
>>> Ugo
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> FYI, this is my annFile (My gtf annotation file was downloaded together
>>> with a fasta file containing the mouse genome from UCSC):
>>>
>>>> annFile
>>> GRanges with 595632 ranges and 9 metadata columns:
>>> seqnames ranges strand | source type
>>> score phase
>>> <Rle> <IRanges> <Rle> | <factor> <factor>
>>> <numeric> <integer>
>>> [1] chr1 [3204563, 3207049] - | unknown exon
>>> <NA> <NA>
>>> [2] chr1 [3206103, 3206105] - | unknown stop_codon
>>> <NA> <NA>
>>> [3] chr1 [3206106, 3207049] - | unknown CDS
>>> <NA> 2
>>> [4] chr1 [3411783, 3411982] - | unknown CDS
>>> <NA> 1
>>> [5] chr1 [3411783, 3411982] - | unknown exon
>>> <NA> <NA>
>>> ... ... ... ... ... ... ...
>>> ... ...
>>> [595628] chrY_random [54422360, 54422362] + | unknown stop_codon
>>> <NA> <NA>
>>> [595629] chrY_random [58501955, 58502946] + | unknown exon
>>> <NA> <NA>
>>> [595630] chrY_random [58502132, 58502812] + | unknown CDS
>>> <NA> 0
>>> [595631] chrY_random [58502132, 58502134] + | unknown start_codon
>>> <NA> <NA>
>>> [595632] chrY_random [58502813, 58502815] + | unknown stop_codon
>>> <NA> <NA>
>>> gene_id transcript_id gene_name p_id tss_id
>>> <character> <character> <character> <character> <character>
>>> [1] Xkr4 NM_001011874 Xkr4 P2739 TSS1881
>>> [2] Xkr4 NM_001011874 Xkr4 P2739 TSS1881
>>> [3] Xkr4 NM_001011874 Xkr4 P2739 TSS1881
>>> [4] Xkr4 NM_001011874 Xkr4 P2739 TSS1881
>>> [5] Xkr4 NM_001011874 Xkr4 P2739 TSS1881
>>> ... ... ... ... ... ...
>>> [595628] LOC100039753 NM_001017394 LOC100039753 P10196 TSS19491
>>> [595629] LOC100039614 NM_001160137_4 LOC100039614 P22060 TSS4342
>>> [595630] LOC100039614 NM_001160137_4 LOC100039614 P22060 TSS4342
>>> [595631] LOC100039614 NM_001160137_4 LOC100039614 P22060 TSS4342
>>> [595632] LOC100039614 NM_001160137_4 LOC100039614 P22060 TSS4342
>>> ---
>>> seqlengths:
>>> chr1 chr10 chr11 chr12 ... chrX_random
>>> chrY chrY_random
>>> NA NA NA NA ... NA
>>> NA NA
>>>
>>>
>>>> sessionInfo()
>>> R version 3.0.0 (2013-04-03)
>>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>>>
>>> locale:
>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>>
>>> attached base packages:
>>> [1] parallel stats graphics grDevices utils datasets methods
>>> base
>>>
>>> other attached packages:
>>> [1] rtracklayer_1.20.1 Rbowtie_1.0.2 Rsamtools_1.12.2
>>> Biostrings_2.28.0
>>> [5] GenomicFeatures_1.12.0 AnnotationDbi_1.22.1 Biobase_2.20.0
>>> GenomicRanges_1.12.2
>>> [9] IRanges_1.18.0 BiocGenerics_0.6.0
>>>
>>> loaded via a namespace (and not attached):
>>> [1] BiocInstaller_1.10.0 biomaRt_2.16.0 bitops_1.0-5
>>> BSgenome_1.28.0
>>> [5] DBI_0.2-5 grid_3.0.0 hwriter_1.3
>>> lattice_0.20-15
>>> [9] RCurl_1.95-4.1 RSQLite_0.11.3 ShortRead_1.18.0
>>> stats4_3.0.0
>>> [13] tools_3.0.0 XML_3.95-0.2 zlibbioc_1.6.0
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
More information about the Bioconductor
mailing list