[BioC] MakeTranscriptDbFromGFF

Thu May 2 00:33:02 CEST 2013

Hi Ugo,

Which UCSC file was it that you were trying to process?

   Marc

On 05/01/2013 02:21 AM, Ugo Borello wrote:
> Good morning,
>
> I have a little problem creating a TranscriptDb object using the function
> makeTranscriptDbFromGFF. I want to use this annotation to count the overlaps
> of my genomic alignments with genes.
>
>
> I ran:
>
>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
> + exonRankAttributeName = "exon_number",
> + chrominfo = chrominfo,
> + dataSource = "UCSC",
> + species = "Mus musculus")
>
> And I got this error message:
> Error in data.frame(..., check.names = FALSE) :
>    arguments imply differing number of rows: 541775, 0
>
>
> "chrominfo" was (info retrieved from the fasta genome file):
>
>> chrominfo
>     chrom    length is_circular
> 1  chr10 129993255       FALSE
> 2  chr11 121843856       FALSE
> 3  chr12 121257530       FALSE
> 4  chr13 120284312       FALSE
> 5  chr14 125194864       FALSE
> 6  chr15 103494974       FALSE
> 7  chr16  98319150       FALSE
> 8  chr17  95272651       FALSE
> 9  chr18  90772031       FALSE
> 10 chr19  61342430       FALSE
> 11  chr1 197195432       FALSE
> 12  chr2 181748087       FALSE
> 13  chr3 159599783       FALSE
> 14  chr4 155630120       FALSE
> 15  chr5 152537259       FALSE
> 16  chr6 149517037       FALSE
> 17  chr7 152524553       FALSE
> 18  chr8 131738871       FALSE
> 19  chr9 124076172       FALSE
> 20  chrM     16299        TRUE
> 21  chrX 166650296       FALSE
> 22  chrY  15902555       FALSE
>
>
> I ran it again without the "exonRankAttributeName" argument and I got:
>
>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
> + chrominfo = chrominfo,
> + dataSource = "UCSC",
> + species = "Mus musculus")
> extracting transcript information
> Estimating transcript ranges.
> Extracting gene IDs
> Processing splicing information for gtf file.
> Deducing exon rank from relative coordinates provided
> Prepare the 'metadata' data frame ... metadata: OK
> Error in .checkForeignKey(transcripts_tx_chrom, NA, "transcripts$tx_chrom", :
>    all the values in 'transcripts$tx_chrom' must be present in 'chrominfo$chrom'
> In addition: Warning message:
> In .deduceExonRankings(exs, format = "gtf") :
>    Infering Exon Rankings.  If this is not what you expected, then please be
> sure that you have provided a valid attribute for exonRankAttributeName
>
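
One way to narrow down the 'chrominfo$chrom' error quoted above is to list the sequence names that occur in the annotation but not in the chrominfo table. The sketch below uses base R only and toy stand-ins: the vector ann_seqnames and the three-row chrominfo are hypothetical, abbreviated from values shown in this post. In a real session the names would presumably be taken from the annotation itself, for example with seqlevels() on the GRanges.

```r
# Toy chrominfo table: three rows copied from the post, the rest omitted.
chrominfo <- data.frame(
  chrom       = c("chr1", "chrY", "chrM"),
  length      = c(197195432, 15902555, 16299),
  is_circular = c(FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the sequence names found in the GTF annotation.
ann_seqnames <- c("chr1", "chrY", "chrY_random", "chr1_random")

# Names used by the annotation that chrominfo does not describe.
missing_chroms <- setdiff(unique(ann_seqnames), chrominfo$chrom)
print(missing_chroms)

# TRUE only if every annotation sequence name has a chrominfo row.
all_present <- all(ann_seqnames %in% chrominfo$chrom)
print(all_present)
```

In the post, seqinfo(txdb) reports 32 sequences (including the *_random ones) while chrominfo has 22 rows, so a check like this would be expected to report the *_random names.
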
>
> Without the "chrominfo" argument I got the same error message as the first
> time:
>
>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
> + exonRankAttributeName = "exon_number",
> + dataSource = "UCSC",
> + species = "Mus musculus")
>
> Error in data.frame(..., check.names = FALSE) :
>    arguments imply differing number of rows: 541775, 0
>
>
> Finally, when I dropped both the "exonRankAttributeName" and the
> "chrominfo" arguments, the call worked. However, a warning reminded me of
> the "exonRankAttributeName" argument, the chromosome names now differ from
> the ones in the genome file, and there is no information on their lengths:
>
>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
> + dataSource = "UCSC",
> + species = "Mus musculus")
> extracting transcript information
> Estimating transcript ranges.
> Extracting gene IDs
> Processing splicing information for gtf file.
> Deducing exon rank from relative coordinates provided
> Prepare the 'metadata' data frame ... metadata: OK
> Now generating chrominfo from available sequence names. No chromosome length
> information is available.
> Warning messages:
> 1: In .deduceExonRankings(exs, format = "gtf") :
>    Infering Exon Rankings.  If this is not what you expected, then please be
> sure that you have provided a valid attribute for exonRankAttributeName
> 2: In matchCircularity(chroms, circ_seqs) :
>    None of the strings in your circ_seqs argument match your seqnames.
>
>> seqinfo(txdb)
> Seqinfo of length 32
> seqnames     seqlengths isCircular genome
> chr13              <NA>      FALSE   <NA>
> chr9               <NA>      FALSE   <NA>
> chr6               <NA>      FALSE   <NA>
> chrX               <NA>      FALSE   <NA>
> chr17              <NA>      FALSE   <NA>
> chr2               <NA>      FALSE   <NA>
> chr7               <NA>      FALSE   <NA>
> chr18              <NA>      FALSE   <NA>
> chr8               <NA>      FALSE   <NA>
> ...                 ...        ...    ...
> chrY_random        <NA>      FALSE   <NA>
> chrX_random        <NA>      FALSE   <NA>
> chr5_random        <NA>      FALSE   <NA>
> chr4_random        <NA>      FALSE   <NA>
> chrY               <NA>      FALSE   <NA>
> chr7_random        <NA>      FALSE   <NA>
> chr17_random       <NA>      FALSE   <NA>
> chr13_random       <NA>      FALSE   <NA>
> chr1_random        <NA>      FALSE   <NA>
>
> What am I doing wrong in my original call to makeTranscriptDbFromGFF?
>
> txdb <- makeTranscriptDbFromGFF(file = annFile, format = "gtf",
>                                 exonRankAttributeName = "exon_number",
>                                 chrominfo = chrominfo,
>                                 dataSource = "UCSC",
>                                 species = "Mus musculus")
>
> Why am I getting this unfair error message?
> Thank you for your help
> Ugo
>
> FYI, this is my annFile (my GTF annotation file was downloaded from UCSC
> together with a FASTA file containing the mouse genome):
>
>> annFile
> GRanges with 595632 ranges and 9 metadata columns:
>                seqnames               ranges strand   |   source        type     score     phase
>                   <Rle>            <IRanges>  <Rle>   | <factor>    <factor> <numeric> <integer>
>         [1]        chr1   [3204563, 3207049]      -   |  unknown        exon      <NA>      <NA>
>         [2]        chr1   [3206103, 3206105]      -   |  unknown  stop_codon      <NA>      <NA>
>         [3]        chr1   [3206106, 3207049]      -   |  unknown         CDS      <NA>         2
>         [4]        chr1   [3411783, 3411982]      -   |  unknown         CDS      <NA>         1
>         [5]        chr1   [3411783, 3411982]      -   |  unknown        exon      <NA>      <NA>
>         ...         ...                  ...    ... ...      ...         ...       ...       ...
>    [595628] chrY_random [54422360, 54422362]      +   |  unknown  stop_codon      <NA>      <NA>
>    [595629] chrY_random [58501955, 58502946]      +   |  unknown        exon      <NA>      <NA>
>    [595630] chrY_random [58502132, 58502812]      +   |  unknown         CDS      <NA>         0
>    [595631] chrY_random [58502132, 58502134]      +   |  unknown start_codon      <NA>      <NA>
>    [595632] chrY_random [58502813, 58502815]      +   |  unknown  stop_codon      <NA>      <NA>
>                  gene_id  transcript_id    gene_name        p_id      tss_id
>              <character>    <character>  <character> <character> <character>
>         [1]         Xkr4   NM_001011874         Xkr4       P2739     TSS1881
>         [2]         Xkr4   NM_001011874         Xkr4       P2739     TSS1881
>         [3]         Xkr4   NM_001011874         Xkr4       P2739     TSS1881
>         [4]         Xkr4   NM_001011874         Xkr4       P2739     TSS1881
>         [5]         Xkr4   NM_001011874         Xkr4       P2739     TSS1881
>         ...          ...            ...          ...         ...         ...
>    [595628] LOC100039753   NM_001017394 LOC100039753      P10196    TSS19491
>    [595629] LOC100039614 NM_001160137_4 LOC100039614      P22060     TSS4342
>    [595630] LOC100039614 NM_001160137_4 LOC100039614      P22060     TSS4342
>    [595631] LOC100039614 NM_001160137_4 LOC100039614      P22060     TSS4342
>    [595632] LOC100039614 NM_001160137_4 LOC100039614      P22060     TSS4342
>    ---
>    seqlengths:
>             chr1        chr10        chr11        chr12 ...  chrX_random         chrY  chrY_random
>               NA           NA           NA           NA ...           NA           NA           NA
>
>
>> sessionInfo()
> R version 3.0.0 (2013-04-03)
> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
>   [1] rtracklayer_1.20.1     Rbowtie_1.0.2          Rsamtools_1.12.2       Biostrings_2.28.0
>   [5] GenomicFeatures_1.12.0 AnnotationDbi_1.22.1   Biobase_2.20.0         GenomicRanges_1.12.2
>   [9] IRanges_1.18.0         BiocGenerics_0.6.0
>
> loaded via a namespace (and not attached):
>   [1] BiocInstaller_1.10.0 biomaRt_2.16.0       bitops_1.0-5         BSgenome_1.28.0
>   [5] DBI_0.2-5            grid_3.0.0           hwriter_1.3          lattice_0.20-15
>   [9] RCurl_1.95-4.1       RSQLite_0.11.3       ShortRead_1.18.0     stats4_3.0.0
> [13] tools_3.0.0          XML_3.95-0.2         zlibbioc_1.6.0
>