[BioC] MakeTranscriptDbFromGFF
Ugo Borello
ugo.borello at inserm.fr
Thu May 2 11:59:50 CEST 2013
Dear Marc,
Sorry, I was not precise about the origin of the gtf annotation file. I got the
gtf file from here:
http://tophat.cbcb.umd.edu/igenomes.shtml
and, more precisely, from the Mus musculus/UCSC/mm9 folder.
Here is the description of the contents of that folder:
ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/README.txt
Reading the README.txt file, I realized that the genes.gtf file I used is
actually the Ensembl annotation of the mm9 release.
So I changed the function call to use dataSource = "Ensembl", and I got the same
error message:
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 541775, 0
At the end of my previous email you will find the result of calling:
annFile <- import.gff('genes.gtf', format='gtf', asRangedData=FALSE)
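
One check that may help narrow this down is whether the imported file carries an "exon_number" attribute at all. The sketch below is base R only, so it runs without the annotation file: the column names are copied by hand from the annFile printout in the quoted message below, and mcol_names, attr_name and has_exon_rank are illustrative names. With the real object, names(mcols(annFile)) should give the same vector, assuming mcols() is available in the installed GenomicRanges.

```r
# Metadata column names copied from the annFile printout quoted below.
# With the real object: mcol_names <- names(mcols(annFile))
mcol_names <- c("source", "type", "score", "phase",
                "gene_id", "transcript_id", "gene_name", "p_id", "tss_id")

# Attribute name passed as exonRankAttributeName in the failing call.
attr_name <- "exon_number"

# TRUE only if the imported annotation actually has that attribute.
has_exon_rank <- attr_name %in% mcol_names
print(has_exon_rank)
```

If this prints FALSE, the file has no exon_number attribute for the function to read. That would be consistent with the "541775, 0" row mismatch, but the link is an assumption and is not confirmed anywhere in this thread.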
Thank you
Ugo
> From: Marc Carlson <mcarlson at fhcrc.org>
> Date: Wed, 01 May 2013 15:33:02 -0700
> To: <bioconductor at r-project.org>
> Subject: Re: [BioC] MakeTranscriptDbFromGFF
>
> Hi Ugo,
>
> Which UCSC file was it that you were trying to process?
>
>
> Marc
>
>
>
> On 05/01/2013 02:21 AM, Ugo Borello wrote:
>> Good morning,
>>
>> I have a little problem creating a TranscriptDb object using the function
>> makeTranscriptDbFromGFF. I want to use this annotation to count the overlaps
>> of my genomic alignments with genes.
>>
>>
>> I ran:
>>
>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
>> + exonRankAttributeName = "exon_number",
>> + chrominfo = chrominfo,
>> + dataSource = "UCSC",
>> + species = "Mus musculus")
>>
>> And I got this error message:
>> Error in data.frame(..., check.names = FALSE) :
>> arguments imply differing number of rows: 541775, 0
>>
>>
>> "chrominfo" was (info retrieved from the fasta genome file):
>>
>>> chrominfo
>>    chrom    length is_circular
>> 1  chr10 129993255       FALSE
>> 2  chr11 121843856       FALSE
>> 3  chr12 121257530       FALSE
>> 4  chr13 120284312       FALSE
>> 5  chr14 125194864       FALSE
>> 6  chr15 103494974       FALSE
>> 7  chr16  98319150       FALSE
>> 8  chr17  95272651       FALSE
>> 9  chr18  90772031       FALSE
>> 10 chr19  61342430       FALSE
>> 11  chr1 197195432       FALSE
>> 12  chr2 181748087       FALSE
>> 13  chr3 159599783       FALSE
>> 14  chr4 155630120       FALSE
>> 15  chr5 152537259       FALSE
>> 16  chr6 149517037       FALSE
>> 17  chr7 152524553       FALSE
>> 18  chr8 131738871       FALSE
>> 19  chr9 124076172       FALSE
>> 20  chrM     16299        TRUE
>> 21  chrX 166650296       FALSE
>> 22  chrY  15902555       FALSE
>>
>>
>> I ran it again without the "exonRankAttributeName" argument and I got:
>>
>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
>> + chrominfo = chrominfo,
>> + dataSource = "UCSC",
>> + species = "Mus musculus")
>> extracting transcript information
>> Estimating transcript ranges.
>> Extracting gene IDs
>> Processing splicing information for gtf file.
>> Deducing exon rank from relative coordinates provided
>> Prepare the 'metadata' data frame ... metadata: OK
>> Error in .checkForeignKey(transcripts_tx_chrom, NA, "transcripts$tx_chrom", :
>>   all the values in 'transcripts$tx_chrom' must be present in 'chrominfo$chrom'
>> In addition: Warning message:
>> In .deduceExonRankings(exs, format = "gtf") :
>> Infering Exon Rankings. If this is not what you expected, then please be
>> sure that you have provided a valid attribute for exonRankAttributeName
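
The foreign-key error above says that every sequence name used by the transcripts must also appear in chrominfo$chrom. A minimal base R sketch of that comparison follows. The first vector is typed in from the chrominfo table above; the second is only a partial, illustrative list of sequence names visible in the outputs further down. With the real objects one would presumably compare seqlevels(annFile) against chrominfo$chrom instead.

```r
# Chromosome names listed in the chrominfo data frame shown above.
chrominfo_chroms <- c(paste0("chr", 1:19), "chrM", "chrX", "chrY")

# A few sequence names that appear in the GTF-derived output (partial list).
gtf_chroms <- c("chr1", "chr13", "chrX", "chrY",
                "chr1_random", "chr13_random", "chrY_random")

# Names present in the annotation but missing from chrominfo.
missing_chroms <- setdiff(gtf_chroms, chrominfo_chroms)
print(missing_chroms)
```

Any name returned here would need its own row in chrominfo, or the corresponding records would need to be dropped from the annotation, before this check could pass. Treat that as a suggestion rather than a confirmed fix.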
>>
>>
>> Without the "chrominfo" argument I got the same error message as the first
>> time:
>>
>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
>> + exonRankAttributeName = "exon_number",
>> + dataSource = "UCSC",
>> + species = "Mus musculus")
>>
>> Error in data.frame(..., check.names = FALSE) :
>> arguments imply differing number of rows: 541775, 0
>>
>>
>> Finally, when I removed both the "exonRankAttributeName" and the
>> "chrominfo" arguments, the call worked. However, the warning still points to
>> the "exonRankAttributeName" argument, the chromosome names now differ from
>> the ones in the genome file, and there is no information on their lengths:
>>
>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
>> + dataSource = "UCSC",
>> + species = "Mus musculus")
>> extracting transcript information
>> Estimating transcript ranges.
>> Extracting gene IDs
>> Processing splicing information for gtf file.
>> Deducing exon rank from relative coordinates provided
>> Prepare the 'metadata' data frame ... metadata: OK
>> Now generating chrominfo from available sequence names. No chromosome length
>> information is available.
>> Warning messages:
>> 1: In .deduceExonRankings(exs, format = "gtf") :
>> Infering Exon Rankings. If this is not what you expected, then please be
>> sure that you have provided a valid attribute for exonRankAttributeName
>> 2: In matchCircularity(chroms, circ_seqs) :
>> None of the strings in your circ_seqs argument match your seqnames.
>>
>>> seqinfo(txdb)
>> Seqinfo of length 32
>>   seqnames     seqlengths isCircular genome
>>   chr13              <NA>      FALSE   <NA>
>>   chr9               <NA>      FALSE   <NA>
>>   chr6               <NA>      FALSE   <NA>
>>   chrX               <NA>      FALSE   <NA>
>>   chr17              <NA>      FALSE   <NA>
>>   chr2               <NA>      FALSE   <NA>
>>   chr7               <NA>      FALSE   <NA>
>>   chr18              <NA>      FALSE   <NA>
>>   chr8               <NA>      FALSE   <NA>
>>   ...                 ...        ...    ...
>>   chrY_random        <NA>      FALSE   <NA>
>>   chrX_random        <NA>      FALSE   <NA>
>>   chr5_random        <NA>      FALSE   <NA>
>>   chr4_random        <NA>      FALSE   <NA>
>>   chrY               <NA>      FALSE   <NA>
>>   chr7_random        <NA>      FALSE   <NA>
>>   chr17_random       <NA>      FALSE   <NA>
>>   chr13_random       <NA>      FALSE   <NA>
>>   chr1_random        <NA>      FALSE   <NA>
>>
>>
>>
>>
>> What am I doing wrong in my original call to makeTranscriptDbFromGFF?
>>
>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
>> exonRankAttributeName = "exon_number",
>> chrominfo = chrominfo,
>> dataSource = "UCSC",
>> species = "Mus musculus")
>>
>> Why am I getting this unfair error message?
>> Thank you for your help
>> Ugo
>>
>> FYI, this is my annFile (My gtf annotation file was downloaded together
>> with a fasta file containing the mouse genome from UCSC):
>>
>>> annFile
>> GRanges with 595632 ranges and 9 metadata columns:
>>             seqnames               ranges strand |   source        type     score     phase
>>                <Rle>            <IRanges>  <Rle> | <factor>    <factor> <numeric> <integer>
>>      [1]        chr1   [3204563, 3207049]      - |  unknown        exon      <NA>      <NA>
>>      [2]        chr1   [3206103, 3206105]      - |  unknown  stop_codon      <NA>      <NA>
>>      [3]        chr1   [3206106, 3207049]      - |  unknown         CDS      <NA>         2
>>      [4]        chr1   [3411783, 3411982]      - |  unknown         CDS      <NA>         1
>>      [5]        chr1   [3411783, 3411982]      - |  unknown        exon      <NA>      <NA>
>>      ...         ...                  ...    ... |      ...         ...       ...       ...
>> [595628] chrY_random [54422360, 54422362]      + |  unknown  stop_codon      <NA>      <NA>
>> [595629] chrY_random [58501955, 58502946]      + |  unknown        exon      <NA>      <NA>
>> [595630] chrY_random [58502132, 58502812]      + |  unknown         CDS      <NA>         0
>> [595631] chrY_random [58502132, 58502134]      + |  unknown start_codon      <NA>      <NA>
>> [595632] chrY_random [58502813, 58502815]      + |  unknown  stop_codon      <NA>      <NA>
>> gene_id transcript_id gene_name p_id tss_id
>> <character> <character> <character> <character> <character>
>> [1] Xkr4 NM_001011874 Xkr4 P2739 TSS1881
>> [2] Xkr4 NM_001011874 Xkr4 P2739 TSS1881
>> [3] Xkr4 NM_001011874 Xkr4 P2739 TSS1881
>> [4] Xkr4 NM_001011874 Xkr4 P2739 TSS1881
>> [5] Xkr4 NM_001011874 Xkr4 P2739 TSS1881
>> ... ... ... ... ... ...
>> [595628] LOC100039753 NM_001017394 LOC100039753 P10196 TSS19491
>> [595629] LOC100039614 NM_001160137_4 LOC100039614 P22060 TSS4342
>> [595630] LOC100039614 NM_001160137_4 LOC100039614 P22060 TSS4342
>> [595631] LOC100039614 NM_001160137_4 LOC100039614 P22060 TSS4342
>> [595632] LOC100039614 NM_001160137_4 LOC100039614 P22060 TSS4342
>> ---
>> seqlengths:
>>  chr1 chr10 chr11 chr12 ... chrX_random chrY chrY_random
>>    NA    NA    NA    NA ...          NA   NA          NA
>>
>>
>>> sessionInfo()
>> R version 3.0.0 (2013-04-03)
>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>>
>> locale:
>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>
>> attached base packages:
>> [1] parallel  stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>>  [1] rtracklayer_1.20.1     Rbowtie_1.0.2          Rsamtools_1.12.2       Biostrings_2.28.0
>>  [5] GenomicFeatures_1.12.0 AnnotationDbi_1.22.1   Biobase_2.20.0         GenomicRanges_1.12.2
>>  [9] IRanges_1.18.0         BiocGenerics_0.6.0
>>
>> loaded via a namespace (and not attached):
>>  [1] BiocInstaller_1.10.0 biomaRt_2.16.0       bitops_1.0-5         BSgenome_1.28.0
>>  [5] DBI_0.2-5            grid_3.0.0           hwriter_1.3          lattice_0.20-15
>>  [9] RCurl_1.95-4.1       RSQLite_0.11.3       ShortRead_1.18.0     stats4_3.0.0
>> [13] tools_3.0.0          XML_3.95-0.2         zlibbioc_1.6.0
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>