[BioC] MakeTranscriptDbFromGFF
Ugo Borello
ugo.borello at inserm.fr
Wed May 1 11:21:30 CEST 2013
Good morning,
I have a little problem creating a TranscriptDb object using the function
makeTranscriptDbFromGFF. I want to use this annotation to count the overlaps
of my genomic alignments with genes.
I ran:
> txdb <- makeTranscriptDbFromGFF(file = annFile, format = "gtf",
+ exonRankAttributeName = "exon_number",
+ chrominfo = chrominfo,
+ dataSource = "UCSC",
+ species = "Mus musculus")
And I got this error message:
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 541775, 0
"chrominfo" was (info retrieved from the fasta genome file):
> chrominfo
chrom length is_circular
1 chr10 129993255 FALSE
2 chr11 121843856 FALSE
3 chr12 121257530 FALSE
4 chr13 120284312 FALSE
5 chr14 125194864 FALSE
6 chr15 103494974 FALSE
7 chr16 98319150 FALSE
8 chr17 95272651 FALSE
9 chr18 90772031 FALSE
10 chr19 61342430 FALSE
11 chr1 197195432 FALSE
12 chr2 181748087 FALSE
13 chr3 159599783 FALSE
14 chr4 155630120 FALSE
15 chr5 152537259 FALSE
16 chr6 149517037 FALSE
17 chr7 152524553 FALSE
18 chr8 131738871 FALSE
19 chr9 124076172 FALSE
20 chrM 16299 TRUE
21 chrX 166650296 FALSE
22 chrY 15902555 FALSE
I ran it again without the "exonRankAttributeName" argument and I got:
> txdb <- makeTranscriptDbFromGFF(file = annFile, format = "gtf",
+ chrominfo = chrominfo,
+ dataSource = "UCSC",
+ species = "Mus musculus")
extracting transcript information
Estimating transcript ranges.
Extracting gene IDs
Processing splicing information for gtf file.
Deducing exon rank from relative coordinates provided
Prepare the 'metadata' data frame ... metadata: OK
Error in .checkForeignKey(transcripts_tx_chrom, NA, "transcripts$tx_chrom", :
all the values in 'transcripts$tx_chrom' must be present in 'chrominfo$chrom'
In addition: Warning message:
In .deduceExonRankings(exs, format = "gtf") :
Infering Exon Rankings. If this is not what you expected, then please be
sure that you have provided a valid attribute for exonRankAttributeName
Without the "chrominfo" argument I got the same error message as the first
time:
> txdb <- makeTranscriptDbFromGFF(file = annFile, format = "gtf",
+ exonRankAttributeName = "exon_number",
+ dataSource = "UCSC",
+ species = "Mus musculus")
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 541775, 0
Finally, when I removed both the "exonRankAttributeName" and the
"chrominfo" arguments, it worked. However, the warning reminded me of the
"exonRankAttributeName" argument, the chromosome names now differ from the
ones in the genome file, and there is no information on their lengths:
> txdb <- makeTranscriptDbFromGFF(file = annFile, format = "gtf",
+ dataSource = "UCSC",
+ species = "Mus musculus")
extracting transcript information
Estimating transcript ranges.
Extracting gene IDs
Processing splicing information for gtf file.
Deducing exon rank from relative coordinates provided
Prepare the 'metadata' data frame ... metadata: OK
Now generating chrominfo from available sequence names. No chromosome length information is available.
Warning messages:
1: In .deduceExonRankings(exs, format = "gtf") :
Infering Exon Rankings. If this is not what you expected, then please be
sure that you have provided a valid attribute for exonRankAttributeName
2: In matchCircularity(chroms, circ_seqs) :
None of the strings in your circ_seqs argument match your seqnames.
> seqinfo(txdb)
Seqinfo of length 32
seqnames seqlengths isCircular genome
chr13 <NA> FALSE <NA>
chr9 <NA> FALSE <NA>
chr6 <NA> FALSE <NA>
chrX <NA> FALSE <NA>
chr17 <NA> FALSE <NA>
chr2 <NA> FALSE <NA>
chr7 <NA> FALSE <NA>
chr18 <NA> FALSE <NA>
chr8 <NA> FALSE <NA>
... ... ... ...
chrY_random <NA> FALSE <NA>
chrX_random <NA> FALSE <NA>
chr5_random <NA> FALSE <NA>
chr4_random <NA> FALSE <NA>
chrY <NA> FALSE <NA>
chr7_random <NA> FALSE <NA>
chr17_random <NA> FALSE <NA>
chr13_random <NA> FALSE <NA>
chr1_random <NA> FALSE <NA>
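For what it's worth, the seqinfo above lists 32 sequences, including "_random" ones, while my chrominfo has only 22 rows, so the names clearly do not all match. Below is a minimal base R sketch of how the mismatch could be listed. The two short vectors are hypothetical stand-ins for my real data; in practice the names would come from the annotation and from chrominfo$chrom.

```r
# Hypothetical stand-ins: a few sequence names as they appear in the GTF ...
gtf_seqnames <- c("chr1", "chr13", "chrY", "chr1_random", "chrY_random")
# ... and a few of the names present in my chrominfo data frame.
chrominfo_chrom <- c("chr1", "chr13", "chrY", "chrM")

# Names used by the annotation but absent from chrominfo
missing_chroms <- setdiff(gtf_seqnames, chrominfo_chrom)
print(missing_chroms)
```

If names like these turn up, I suppose either chrominfo needs rows for them or those features have to be dropped from the annotation before building the TranscriptDb, but I am not sure which is intended.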
What am I doing wrong in my original call to makeTranscriptDbFromGFF?
txdb <- makeTranscriptDbFromGFF(file = annFile, format = "gtf",
exonRankAttributeName = "exon_number",
chrominfo = chrominfo,
dataSource = "UCSC",
species = "Mus musculus")
Why am I getting this error message?
Thank you for your help
Ugo
FYI, this is my annFile (the GTF annotation file was downloaded from UCSC
together with a FASTA file containing the mouse genome):
> annFile
GRanges with 595632 ranges and 9 metadata columns:
            seqnames               ranges strand   |   source        type     score     phase
               <Rle>            <IRanges>  <Rle>   | <factor>    <factor> <numeric> <integer>
     [1]        chr1   [3204563, 3207049]      -   |  unknown        exon      <NA>      <NA>
     [2]        chr1   [3206103, 3206105]      -   |  unknown  stop_codon      <NA>      <NA>
     [3]        chr1   [3206106, 3207049]      -   |  unknown         CDS      <NA>         2
     [4]        chr1   [3411783, 3411982]      -   |  unknown         CDS      <NA>         1
     [5]        chr1   [3411783, 3411982]      -   |  unknown        exon      <NA>      <NA>
     ...         ...                  ...    ... ...      ...         ...       ...       ...
[595628] chrY_random [54422360, 54422362]      +   |  unknown  stop_codon      <NA>      <NA>
[595629] chrY_random [58501955, 58502946]      +   |  unknown        exon      <NA>      <NA>
[595630] chrY_random [58502132, 58502812]      +   |  unknown         CDS      <NA>         0
[595631] chrY_random [58502132, 58502134]      +   |  unknown start_codon      <NA>      <NA>
[595632] chrY_random [58502813, 58502815]      +   |  unknown  stop_codon      <NA>      <NA>
gene_id transcript_id gene_name p_id tss_id
<character> <character> <character> <character> <character>
[1] Xkr4 NM_001011874 Xkr4 P2739 TSS1881
[2] Xkr4 NM_001011874 Xkr4 P2739 TSS1881
[3] Xkr4 NM_001011874 Xkr4 P2739 TSS1881
[4] Xkr4 NM_001011874 Xkr4 P2739 TSS1881
[5] Xkr4 NM_001011874 Xkr4 P2739 TSS1881
... ... ... ... ... ...
[595628] LOC100039753 NM_001017394 LOC100039753 P10196 TSS19491
[595629] LOC100039614 NM_001160137_4 LOC100039614 P22060 TSS4342
[595630] LOC100039614 NM_001160137_4 LOC100039614 P22060 TSS4342
[595631] LOC100039614 NM_001160137_4 LOC100039614 P22060 TSS4342
[595632] LOC100039614 NM_001160137_4 LOC100039614 P22060 TSS4342
---
seqlengths:
       chr1       chr10       chr11       chr12 ... chrX_random        chrY chrY_random
         NA          NA          NA          NA ...          NA          NA          NA
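Looking at the nine metadata columns printed above, I do not see one called "exon_number", which may be related to the first error. Below is a minimal sketch of that check, with the column names copied by hand from the output above. On the real object I assume they could be obtained with something like colnames(mcols(annFile)), which I have not run here.

```r
# Metadata column names copied from the GRanges display above
meta_cols <- c("source", "type", "score", "phase",
               "gene_id", "transcript_id", "gene_name", "p_id", "tss_id")

# Attribute name passed as exonRankAttributeName in my original call
rank_attr <- "exon_number"

# TRUE only if the annotation actually carries that attribute
has_rank_attr <- rank_attr %in% meta_cols
print(has_rank_attr)
```

If the attribute really is absent, that might explain why the calls with exonRankAttributeName = "exon_number" fail while the ones without it only warn, but this is a guess on my part.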
> sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base
other attached packages:
[1] rtracklayer_1.20.1     Rbowtie_1.0.2          Rsamtools_1.12.2       Biostrings_2.28.0
[5] GenomicFeatures_1.12.0 AnnotationDbi_1.22.1   Biobase_2.20.0         GenomicRanges_1.12.2
[9] IRanges_1.18.0         BiocGenerics_0.6.0
loaded via a namespace (and not attached):
 [1] BiocInstaller_1.10.0 biomaRt_2.16.0       bitops_1.0-5         BSgenome_1.28.0
 [5] DBI_0.2-5            grid_3.0.0           hwriter_1.3          lattice_0.20-15
 [9] RCurl_1.95-4.1       RSQLite_0.11.3       ShortRead_1.18.0     stats4_3.0.0
[13] tools_3.0.0          XML_3.95-0.2         zlibbioc_1.6.0