[BioC] MakeTranscriptDbFromGFF

Ugo Borello ugo.borello at inserm.fr
Thu May 2 11:59:50 CEST 2013


Dear Marc
Sorry I was not precise on the origin of the gtf annotation file; I got the
gtf file from here:
http://tophat.cbcb.umd.edu/igenomes.shtml

And more precisely from the Mus musculus/UCSC/mm9 folder
Here the description of the content of the folder:
ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/README.txt

I realized reading the README.txt file that actually the genes.gtf file I
used is the Ensembl annotation of the mm9 release.
So, I changed dataSource = "Ensembl" in the function call and I got the same
error message:
Error in data.frame(..., check.names = FALSE) :
  arguments imply differing number of rows: 541775, 0

At the end of my previous email you have the result of calling:
annFile<- import.gff('genes.gtf', format='gtf', asRangedData=FALSE)


Thank you

Ugo

> From: Marc Carlson <mcarlson at fhcrc.org>
> Date: Wed, 01 May 2013 15:33:02 -0700
> To: <bioconductor at r-project.org>
> Subject: Re: [BioC] MakeTranscriptDbFromGFF
> 
> Hi Ugo,
> 
> Which UCSC file was it that you were trying to process?
> 
> 
>    Marc
> 
> 
> 
> On 05/01/2013 02:21 AM, Ugo Borello wrote:
>> Good morning,
>> 
>> I have a little problem creating a TranscriptDb object using the function
>> makeTranscriptDbFromGFF. I want to use this annotation to count the overlaps
>> of my genomic alignments with genes.
>> 
>> 
>> I ran:
>> 
>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
>> + exonRankAttributeName = "exon_number",
>> + chrominfo = chrominfo,
>> + dataSource = "UCSC",
>> + species = "Mus musculus")
>> 
>> And I got this error message:
>> Error in data.frame(..., check.names = FALSE) :
>>    arguments imply differing number of rows: 541775, 0
>> 
>> 
>> "chrominfo" was (info retrieved from the fasta genome file):
>> 
>>> chrominfo
>>     chrom    length is_circular
>> 1  chr10 129993255       FALSE
>> 2  chr11 121843856       FALSE
>> 3  chr12 121257530       FALSE
>> 4  chr13 120284312       FALSE
>> 5  chr14 125194864       FALSE
>> 6  chr15 103494974       FALSE
>> 7  chr16  98319150       FALSE
>> 8  chr17  95272651       FALSE
>> 9  chr18  90772031       FALSE
>> 10 chr19  61342430       FALSE
>> 11  chr1 197195432       FALSE
>> 12  chr2 181748087       FALSE
>> 13  chr3 159599783       FALSE
>> 14  chr4 155630120       FALSE
>> 15  chr5 152537259       FALSE
>> 16  chr6 149517037       FALSE
>> 17  chr7 152524553       FALSE
>> 18  chr8 131738871       FALSE
>> 19  chr9 124076172       FALSE
>> 20  chrM     16299        TRUE
>> 21  chrX 166650296       FALSE
>> 22  chrY  15902555       FALSE
>> 
>> 
>> I ran it again without the "exonRankAttributeName" argument and I got:
>> 
>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
>> + chrominfo = chrominfo,
>> + dataSource = "UCSC",
>> + species = "Mus musculus")
>> extracting transcript information
>> Estimating transcript ranges.
>> Extracting gene IDs
>> Processing splicing information for gtf file.
>> Deducing exon rank from relative coordinates provided
>> Prepare the 'metadata' data frame ... metadata: OK
>> Error in .checkForeignKey(transcripts_tx_chrom, NA, "transcripts$tx_chrom",
>> :
>>    all the values in 'transcripts$tx_chrom' must be present in
>> 'chrominfo$chrom'
>> In addition: Warning message:
>> In .deduceExonRankings(exs, format = "gtf") :
>>    Infering Exon Rankings.  If this is not what you expected, then please be
>> sure that you have provided a valid attribute for exonRankAttributeName
>> 
>> 
>> Without the "chrominfo" argument I got the same error message as the first
>> time:
>> 
>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
>> + exonRankAttributeName = "exon_number",
>> + dataSource = "UCSC",
>> + species = "Mus musculus")
>> 
>> Error in data.frame(..., check.names = FALSE) :
>>    arguments imply differing number of rows: 541775, 0
>> 
>> 
>> Finally when I eliminated both the "exonRankAttributeName" and the
>> "chrominfo" arguments it worked but the warning reminded me of the
>> "exonRankAttributeName" argument and the chromosome names are now different
>> from the ones in the genome file and there is no info on their length
>> 
>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
>> + dataSource = "UCSC",
>> + species = "Mus musculus")
>> extracting transcript information
>> Estimating transcript ranges.
>> Extracting gene IDs
>> Processing splicing information for gtf file.
>> Deducing exon rank from relative coordinates provided
>> Prepare the 'metadata' data frame ... metadata: OK
>> Now generating chrominfo from available sequence names. No chromosome length
>> information is available.
>> Warning messages:
>> 1: In .deduceExonRankings(exs, format = "gtf") :
>>    Infering Exon Rankings.  If this is not what you expected, then please be
>> sure that you have provided a valid attribute for exonRankAttributeName
>> 2: In matchCircularity(chroms, circ_seqs) :
>>    None of the strings in your circ_seqs argument match your seqnames.
>> 
>>> seqinfo(txdb)
>> Seqinfo of length 32
>> seqnames     seqlengths isCircular genome
>> chr13              <NA>      FALSE   <NA>
>> chr9               <NA>      FALSE   <NA>
>> chr6               <NA>      FALSE   <NA>
>> chrX               <NA>      FALSE   <NA>
>> chr17              <NA>      FALSE   <NA>
>> chr2               <NA>      FALSE   <NA>
>> chr7               <NA>      FALSE   <NA>
>> chr18              <NA>      FALSE   <NA>
>> chr8               <NA>      FALSE   <NA>
>> ...                 ...        ...    ...
>> chrY_random        <NA>      FALSE   <NA>
>> chrX_random        <NA>      FALSE   <NA>
>> chr5_random        <NA>      FALSE   <NA>
>> chr4_random        <NA>      FALSE   <NA>
>> chrY               <NA>      FALSE   <NA>
>> chr7_random        <NA>      FALSE   <NA>
>> chr17_random       <NA>      FALSE   <NA>
>> chr13_random       <NA>      FALSE   <NA>
>> chr1_random        <NA>      FALSE   <NA>
>> 
>> 
>> 
>> 
>> What am I doing wrong in my original call to makeTranscriptDbFromGFF?
>> 
>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
>>                                 exonRankAttributeName = "exon_number",
>>                                 chrominfo = chrominfo,
>>                                 dataSource = "UCSC",
>>                                 species = "Mus musculus")
>> 
>> Why am I getting this unfair error message?
>> Thank you for your help
>> Ugo
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> FYI, this is my annFile (My gtf  annotation file was downloaded together
>> with a fasta file containing the mouse genome from UCSC):
>> 
>>> annFile
>> GRanges with 595632 ranges and 9 metadata columns:
>>                seqnames               ranges strand   |   source        type
>> score     phase
>>                   <Rle>            <IRanges>  <Rle>   | <factor>    <factor>
>> <numeric> <integer>
>>         [1]        chr1   [3204563, 3207049]      -   |  unknown        exon
>> <NA>      <NA>
>>         [2]        chr1   [3206103, 3206105]      -   |  unknown  stop_codon
>> <NA>      <NA>
>>         [3]        chr1   [3206106, 3207049]      -   |  unknown         CDS
>> <NA>         2
>>         [4]        chr1   [3411783, 3411982]      -   |  unknown         CDS
>> <NA>         1
>>         [5]        chr1   [3411783, 3411982]      -   |  unknown        exon
>> <NA>      <NA>
>>         ...         ...                  ...    ... ...      ...         ...
>> ...       ...
>>    [595628] chrY_random [54422360, 54422362]      +   |  unknown  stop_codon
>> <NA>      <NA>
>>    [595629] chrY_random [58501955, 58502946]      +   |  unknown        exon
>> <NA>      <NA>
>>    [595630] chrY_random [58502132, 58502812]      +   |  unknown         CDS
>> <NA>         0
>>    [595631] chrY_random [58502132, 58502134]      +   |  unknown start_codon
>> <NA>      <NA>
>>    [595632] chrY_random [58502813, 58502815]      +   |  unknown  stop_codon
>> <NA>      <NA>
>>                  gene_id  transcript_id    gene_name        p_id      tss_id
>>              <character>    <character>  <character> <character> <character>
>>         [1]         Xkr4   NM_001011874         Xkr4       P2739     TSS1881
>>         [2]         Xkr4   NM_001011874         Xkr4       P2739     TSS1881
>>         [3]         Xkr4   NM_001011874         Xkr4       P2739     TSS1881
>>         [4]         Xkr4   NM_001011874         Xkr4       P2739     TSS1881
>>         [5]         Xkr4   NM_001011874         Xkr4       P2739     TSS1881
>>         ...          ...            ...          ...         ...         ...
>>    [595628] LOC100039753   NM_001017394 LOC100039753      P10196    TSS19491
>>    [595629] LOC100039614 NM_001160137_4 LOC100039614      P22060     TSS4342
>>    [595630] LOC100039614 NM_001160137_4 LOC100039614      P22060     TSS4342
>>    [595631] LOC100039614 NM_001160137_4 LOC100039614      P22060     TSS4342
>>    [595632] LOC100039614 NM_001160137_4 LOC100039614      P22060     TSS4342
>>    ---
>>    seqlengths:
>>             chr1        chr10        chr11        chr12 ...  chrX_random
>> chrY  chrY_random
>>               NA           NA           NA           NA ...           NA
>> NA           NA
>> 
>> 
>>> sessionInfo()
>> R version 3.0.0 (2013-04-03)
>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>> 
>> locale:
>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>> 
>> attached base packages:
>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>> base
>> 
>> other attached packages:
>>   [1] rtracklayer_1.20.1     Rbowtie_1.0.2          Rsamtools_1.12.2
>> Biostrings_2.28.0
>>   [5] GenomicFeatures_1.12.0 AnnotationDbi_1.22.1   Biobase_2.20.0
>> GenomicRanges_1.12.2
>>   [9] IRanges_1.18.0         BiocGenerics_0.6.0
>> 
>> loaded via a namespace (and not attached):
>>   [1] BiocInstaller_1.10.0 biomaRt_2.16.0       bitops_1.0-5
>> BSgenome_1.28.0
>>   [5] DBI_0.2-5            grid_3.0.0           hwriter_1.3
>> lattice_0.20-15
>>   [9] RCurl_1.95-4.1       RSQLite_0.11.3       ShortRead_1.18.0
>> stats4_3.0.0
>> [13] tools_3.0.0          XML_3.95-0.2         zlibbioc_1.6.0
>> 
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list