[BioC] MakeTranscriptDbFromGFF

Wed May 1 11:21:30 CEST 2013

Good morning,

I am having a problem creating a TranscriptDb object with the function
makeTranscriptDbFromGFF. I want to use this annotation to count the overlaps
of my genomic alignments with genes.

I ran:

> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
+ exonRankAttributeName = "exon_number",
+ chrominfo = chrominfo,
+ dataSource = "UCSC",
+ species = "Mus musculus")

And I got this error message:
Error in data.frame(..., check.names = FALSE) :
  arguments imply differing number of rows: 541775, 0

The "chrominfo" data frame (built from information in the FASTA genome file) was:

> chrominfo
   chrom    length is_circular
1  chr10 129993255       FALSE
2  chr11 121843856       FALSE
3  chr12 121257530       FALSE
4  chr13 120284312       FALSE
5  chr14 125194864       FALSE
6  chr15 103494974       FALSE
7  chr16  98319150       FALSE
8  chr17  95272651       FALSE
9  chr18  90772031       FALSE
10 chr19  61342430       FALSE
11  chr1 197195432       FALSE
12  chr2 181748087       FALSE
13  chr3 159599783       FALSE
14  chr4 155630120       FALSE
15  chr5 152537259       FALSE
16  chr6 149517037       FALSE
17  chr7 152524553       FALSE
18  chr8 131738871       FALSE
19  chr9 124076172       FALSE
20  chrM     16299        TRUE
21  chrX 166650296       FALSE
22  chrY  15902555       FALSE
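
For reference, a minimal sketch of how a data frame with this layout could be assembled in base R. The vector `chr_lengths` and the name `chrominfo_sketch` are hypothetical stand-ins, and only three of the lengths shown above are copied in; this is not the exact code used to build the object above.

```r
# Hypothetical named vector of sequence lengths (three values copied from
# the table above); in practice these would come from the genome FASTA file.
chr_lengths <- c(chr1 = 197195432, chr2 = 181748087, chrM = 16299)

# Data frame with the three columns shown in the printout above.
chrominfo_sketch <- data.frame(
  chrom       = names(chr_lengths),
  length      = as.integer(chr_lengths),
  is_circular = names(chr_lengths) == "chrM",  # assumption: only chrM is circular
  stringsAsFactors = FALSE
)
chrominfo_sketch
```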

I ran it again without the "exonRankAttributeName" argument and I got:

> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
+ chrominfo = chrominfo,
+ dataSource = "UCSC",
+ species = "Mus musculus")
extracting transcript information
Estimating transcript ranges.
Extracting gene IDs
Processing splicing information for gtf file.
Deducing exon rank from relative coordinates provided
Prepare the 'metadata' data frame ... metadata: OK
Error in .checkForeignKey(transcripts_tx_chrom, NA, "transcripts$tx_chrom",  :
  all the values in 'transcripts$tx_chrom' must be present in 'chrominfo$chrom'
In addition: Warning message:
In .deduceExonRankings(exs, format = "gtf") :
  Infering Exon Rankings.  If this is not what you expected, then please be sure that you have provided a valid attribute for exonRankAttributeName

Without the "chrominfo" argument I got the same error message as the first
time:

> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
+ exonRankAttributeName = "exon_number",
+ dataSource = "UCSC",
+ species = "Mus musculus")

Error in data.frame(..., check.names = FALSE) :
  arguments imply differing number of rows: 541775, 0

Finally, when I removed both the "exonRankAttributeName" and the
"chrominfo" arguments, it worked. However, the warning reminded me of the
"exonRankAttributeName" argument, the chromosome names are now different
from the ones in the genome file, and there is no information on their lengths:

> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
+ dataSource = "UCSC",
+ species = "Mus musculus")
extracting transcript information
Estimating transcript ranges.
Extracting gene IDs
Processing splicing information for gtf file.
Deducing exon rank from relative coordinates provided
Prepare the 'metadata' data frame ... metadata: OK
Now generating chrominfo from available sequence names. No chromosome length information is available.
Warning messages:
1: In .deduceExonRankings(exs, format = "gtf") :
  Infering Exon Rankings.  If this is not what you expected, then please be sure that you have provided a valid attribute for exonRankAttributeName
2: In matchCircularity(chroms, circ_seqs) :
  None of the strings in your circ_seqs argument match your seqnames.

> seqinfo(txdb)
Seqinfo of length 32
seqnames     seqlengths isCircular genome
chr13              <NA>      FALSE   <NA>
chr9               <NA>      FALSE   <NA>
chr6               <NA>      FALSE   <NA>
chrX               <NA>      FALSE   <NA>
chr17              <NA>      FALSE   <NA>
chr2               <NA>      FALSE   <NA>
chr7               <NA>      FALSE   <NA>
chr18              <NA>      FALSE   <NA>
chr8               <NA>      FALSE   <NA>
...                 ...        ...    ...
chrY_random        <NA>      FALSE   <NA>
chrX_random        <NA>      FALSE   <NA>
chr5_random        <NA>      FALSE   <NA>
chr4_random        <NA>      FALSE   <NA>
chrY               <NA>      FALSE   <NA>
chr7_random        <NA>      FALSE   <NA>
chr17_random       <NA>      FALSE   <NA>
chr13_random       <NA>      FALSE   <NA>
chr1_random        <NA>      FALSE   <NA>
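
The seqinfo output above lists names such as chrY_random that do not appear in the chrominfo table earlier in this post. A minimal sketch of how the two sets of names could be compared in base R follows; the two vectors are hypothetical subsets of the names printed here, not the full objects.

```r
# Hypothetical subsets of the sequence names printed in this post.
annotation_chroms <- c("chr13", "chr9", "chrY", "chrY_random", "chr1_random")
chrominfo_chroms  <- c("chr13", "chr9", "chrY")

# Names used by the annotation but absent from the chrominfo table.
missing_chroms <- setdiff(annotation_chroms, chrominfo_chroms)
missing_chroms
```

With the real objects, the first vector would presumably come from the sequence names of the annotation and the second from chrominfo$chrom.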

What am I doing wrong in my original call to makeTranscriptDbFromGFF?

txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
                               exonRankAttributeName = "exon_number",
                               chrominfo = chrominfo,
                               dataSource = "UCSC",
                               species = "Mus musculus")

Why am I getting this error message?
Thank you for your help.
Ugo

FYI, this is my annFile (my GTF annotation file was downloaded from UCSC
together with a FASTA file containing the mouse genome):

> annFile
GRanges with 595632 ranges and 9 metadata columns:
              seqnames               ranges strand   |   source        type     score     phase
                 <Rle>            <IRanges>  <Rle>   | <factor>    <factor> <numeric> <integer>
       [1]        chr1   [3204563, 3207049]      -   |  unknown        exon      <NA>      <NA>
       [2]        chr1   [3206103, 3206105]      -   |  unknown  stop_codon      <NA>      <NA>
       [3]        chr1   [3206106, 3207049]      -   |  unknown         CDS      <NA>         2
       [4]        chr1   [3411783, 3411982]      -   |  unknown         CDS      <NA>         1
       [5]        chr1   [3411783, 3411982]      -   |  unknown        exon      <NA>      <NA>
       ...         ...                  ...    ... ...      ...         ...       ...       ...
  [595628] chrY_random [54422360, 54422362]      +   |  unknown  stop_codon      <NA>      <NA>
  [595629] chrY_random [58501955, 58502946]      +   |  unknown        exon      <NA>      <NA>
  [595630] chrY_random [58502132, 58502812]      +   |  unknown         CDS      <NA>         0
  [595631] chrY_random [58502132, 58502134]      +   |  unknown start_codon      <NA>      <NA>
  [595632] chrY_random [58502813, 58502815]      +   |  unknown  stop_codon      <NA>      <NA>
                gene_id  transcript_id    gene_name        p_id      tss_id
            <character>    <character>  <character> <character> <character>
       [1]         Xkr4   NM_001011874         Xkr4       P2739     TSS1881
       [2]         Xkr4   NM_001011874         Xkr4       P2739     TSS1881
       [3]         Xkr4   NM_001011874         Xkr4       P2739     TSS1881
       [4]         Xkr4   NM_001011874         Xkr4       P2739     TSS1881
       [5]         Xkr4   NM_001011874         Xkr4       P2739     TSS1881
       ...          ...            ...          ...         ...         ...
  [595628] LOC100039753   NM_001017394 LOC100039753      P10196    TSS19491
  [595629] LOC100039614 NM_001160137_4 LOC100039614      P22060     TSS4342
  [595630] LOC100039614 NM_001160137_4 LOC100039614      P22060     TSS4342
  [595631] LOC100039614 NM_001160137_4 LOC100039614      P22060     TSS4342
  [595632] LOC100039614 NM_001160137_4 LOC100039614      P22060     TSS4342
  ---
  seqlengths:
           chr1        chr10        chr11        chr12 ...  chrX_random         chrY  chrY_random
             NA           NA           NA           NA ...           NA           NA           NA
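
The printout above shows nine metadata columns. A minimal base R sketch of how one could check whether a given attribute name is among them follows; the vector `meta_cols` is typed in by hand from the printout, and with the real object it would presumably be obtained with something like `colnames(mcols(annFile))` (an assumption about the installed GenomicRanges version).

```r
# Metadata column names copied by hand from the GRanges printout above.
meta_cols <- c("source", "type", "score", "phase", "gene_id",
               "transcript_id", "gene_name", "p_id", "tss_id")

# Is the attribute passed as exonRankAttributeName among them?
has_exon_number <- "exon_number" %in% meta_cols
has_exon_number
```

Whether the outcome of this check is related to the first error message is not established here.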

> sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] rtracklayer_1.20.1     Rbowtie_1.0.2          Rsamtools_1.12.2       Biostrings_2.28.0
 [5] GenomicFeatures_1.12.0 AnnotationDbi_1.22.1   Biobase_2.20.0         GenomicRanges_1.12.2
 [9] IRanges_1.18.0         BiocGenerics_0.6.0

loaded via a namespace (and not attached):
 [1] BiocInstaller_1.10.0 biomaRt_2.16.0       bitops_1.0-5         BSgenome_1.28.0
 [5] DBI_0.2-5            grid_3.0.0           hwriter_1.3          lattice_0.20-15
 [9] RCurl_1.95-4.1       RSQLite_0.11.3       ShortRead_1.18.0     stats4_3.0.0
[13] tools_3.0.0          XML_3.95-0.2         zlibbioc_1.6.0