[BioC] MakeTranscriptDbFromGFF
Marc Carlson
mcarlson at fhcrc.org
Thu May 2 00:33:02 CEST 2013
Hi Ugo,
Which UCSC file was it that you were trying to process?
Marc
On 05/01/2013 02:21 AM, Ugo Borello wrote:
> Good morning,
>
> I have a little problem creating a TranscriptDb object using the function
> makeTranscriptDbFromGFF. I want to use this annotation to count the overlaps
> of my genomic alignments with genes.
>
>
> I ran:
>
>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
> + exonRankAttributeName = "exon_number",
> + chrominfo = chrominfo,
> + dataSource = "UCSC",
> + species = "Mus musculus")
>
> And I got this error message:
> Error in data.frame(..., check.names = FALSE) :
> arguments imply differing number of rows: 541775, 0
>
>
> "chrominfo" was (info retrieved from the fasta genome file):
>
>> chrominfo
> chrom length is_circular
> 1 chr10 129993255 FALSE
> 2 chr11 121843856 FALSE
> 3 chr12 121257530 FALSE
> 4 chr13 120284312 FALSE
> 5 chr14 125194864 FALSE
> 6 chr15 103494974 FALSE
> 7 chr16 98319150 FALSE
> 8 chr17 95272651 FALSE
> 9 chr18 90772031 FALSE
> 10 chr19 61342430 FALSE
> 11 chr1 197195432 FALSE
> 12 chr2 181748087 FALSE
> 13 chr3 159599783 FALSE
> 14 chr4 155630120 FALSE
> 15 chr5 152537259 FALSE
> 16 chr6 149517037 FALSE
> 17 chr7 152524553 FALSE
> 18 chr8 131738871 FALSE
> 19 chr9 124076172 FALSE
> 20 chrM 16299 TRUE
> 21 chrX 166650296 FALSE
> 22 chrY 15902555 FALSE
>
>
> I ran it again without the "exonRankAttributeName" argument and I got:
>
>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
> + chrominfo = chrominfo,
> + dataSource = "UCSC",
> + species = "Mus musculus")
> extracting transcript information
> Estimating transcript ranges.
> Extracting gene IDs
> Processing splicing information for gtf file.
> Deducing exon rank from relative coordinates provided
> Prepare the 'metadata' data frame ... metadata: OK
> Error in .checkForeignKey(transcripts_tx_chrom, NA, "transcripts$tx_chrom",
> :
> all the values in 'transcripts$tx_chrom' must be present in
> 'chrominfo$chrom'
> In addition: Warning message:
> In .deduceExonRankings(exs, format = "gtf") :
> Infering Exon Rankings. If this is not what you expected, then please be
> sure that you have provided a valid attribute for exonRankAttributeName
>
>
> Without the "chrominfo" argument I got the same error message as the first
> time:
>
>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
> + exonRankAttributeName = "exon_number",
> + dataSource = "UCSC",
> + species = "Mus musculus")
>
> Error in data.frame(..., check.names = FALSE) :
> arguments imply differing number of rows: 541775, 0
>
>
> Finally when I eliminated both the "exonRankAttributeName" and the
> "chrominfo" arguments it worked but the warning reminded me of the
> "exonRankAttributeName" argument and the chromosome names are now different
> from the ones in the genome file and there is no info on their length
>
>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
> + dataSource = "UCSC",
> + species = "Mus musculus")
> extracting transcript information
> Estimating transcript ranges.
> Extracting gene IDs
> Processing splicing information for gtf file.
> Deducing exon rank from relative coordinates provided
> Prepare the 'metadata' data frame ... metadata: OK
> Now generating chrominfo from available sequence names. No chromosome length
> information is available.
> Warning messages:
> 1: In .deduceExonRankings(exs, format = "gtf") :
> Infering Exon Rankings. If this is not what you expected, then please be
> sure that you have provided a valid attribute for exonRankAttributeName
> 2: In matchCircularity(chroms, circ_seqs) :
> None of the strings in your circ_seqs argument match your seqnames.
>
>> seqinfo(txdb)
> Seqinfo of length 32
> seqnames seqlengths isCircular genome
> chr13 <NA> FALSE <NA>
> chr9 <NA> FALSE <NA>
> chr6 <NA> FALSE <NA>
> chrX <NA> FALSE <NA>
> chr17 <NA> FALSE <NA>
> chr2 <NA> FALSE <NA>
> chr7 <NA> FALSE <NA>
> chr18 <NA> FALSE <NA>
> chr8 <NA> FALSE <NA>
> ... ... ... ...
> chrY_random <NA> FALSE <NA>
> chrX_random <NA> FALSE <NA>
> chr5_random <NA> FALSE <NA>
> chr4_random <NA> FALSE <NA>
> chrY <NA> FALSE <NA>
> chr7_random <NA> FALSE <NA>
> chr17_random <NA> FALSE <NA>
> chr13_random <NA> FALSE <NA>
> chr1_random <NA> FALSE <NA>
>
>
>
>
> What am I doing wrong in my original call to makeTranscriptDbFromGFF?
>
> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
> exonRankAttributeName = "exon_number",
> chrominfo = chrominfo,
> dataSource = "UCSC",
> species = "Mus musculus")
>
> Why am I getting this unfair error message?
> Thank you for your help
> Ugo
>
>
>
>
>
>
>
>
> FYI, this is my annFile (My gtf annotation file was downloaded together
> with a fasta file containing the mouse genome from UCSC):
>
>> annFile
> GRanges with 595632 ranges and 9 metadata columns:
> seqnames ranges strand | source type
> score phase
> <Rle> <IRanges> <Rle> | <factor> <factor>
> <numeric> <integer>
> [1] chr1 [3204563, 3207049] - | unknown exon
> <NA> <NA>
> [2] chr1 [3206103, 3206105] - | unknown stop_codon
> <NA> <NA>
> [3] chr1 [3206106, 3207049] - | unknown CDS
> <NA> 2
> [4] chr1 [3411783, 3411982] - | unknown CDS
> <NA> 1
> [5] chr1 [3411783, 3411982] - | unknown exon
> <NA> <NA>
> ... ... ... ... ... ... ...
> ... ...
> [595628] chrY_random [54422360, 54422362] + | unknown stop_codon
> <NA> <NA>
> [595629] chrY_random [58501955, 58502946] + | unknown exon
> <NA> <NA>
> [595630] chrY_random [58502132, 58502812] + | unknown CDS
> <NA> 0
> [595631] chrY_random [58502132, 58502134] + | unknown start_codon
> <NA> <NA>
> [595632] chrY_random [58502813, 58502815] + | unknown stop_codon
> <NA> <NA>
> gene_id transcript_id gene_name p_id tss_id
> <character> <character> <character> <character> <character>
> [1] Xkr4 NM_001011874 Xkr4 P2739 TSS1881
> [2] Xkr4 NM_001011874 Xkr4 P2739 TSS1881
> [3] Xkr4 NM_001011874 Xkr4 P2739 TSS1881
> [4] Xkr4 NM_001011874 Xkr4 P2739 TSS1881
> [5] Xkr4 NM_001011874 Xkr4 P2739 TSS1881
> ... ... ... ... ... ...
> [595628] LOC100039753 NM_001017394 LOC100039753 P10196 TSS19491
> [595629] LOC100039614 NM_001160137_4 LOC100039614 P22060 TSS4342
> [595630] LOC100039614 NM_001160137_4 LOC100039614 P22060 TSS4342
> [595631] LOC100039614 NM_001160137_4 LOC100039614 P22060 TSS4342
> [595632] LOC100039614 NM_001160137_4 LOC100039614 P22060 TSS4342
> ---
> seqlengths:
> chr1 chr10 chr11 chr12 ... chrX_random
> chrY chrY_random
> NA NA NA NA ... NA
> NA NA
>
>
>> sessionInfo()
> R version 3.0.0 (2013-04-03)
> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] parallel stats graphics grDevices utils datasets methods
> base
>
> other attached packages:
> [1] rtracklayer_1.20.1 Rbowtie_1.0.2 Rsamtools_1.12.2
> Biostrings_2.28.0
> [5] GenomicFeatures_1.12.0 AnnotationDbi_1.22.1 Biobase_2.20.0
> GenomicRanges_1.12.2
> [9] IRanges_1.18.0 BiocGenerics_0.6.0
>
> loaded via a namespace (and not attached):
> [1] BiocInstaller_1.10.0 biomaRt_2.16.0 bitops_1.0-5
> BSgenome_1.28.0
> [5] DBI_0.2-5 grid_3.0.0 hwriter_1.3
> lattice_0.20-15
> [9] RCurl_1.95-4.1 RSQLite_0.11.3 ShortRead_1.18.0
> stats4_3.0.0
> [13] tools_3.0.0 XML_3.95-0.2 zlibbioc_1.6.0
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list