[BioC] MakeTranscriptDbFromGFF

Fri May 3 22:10:47 CEST 2013

Thank you very much, Marc.
I will work on it.
Ugo

Quoting Marc Carlson <mcarlson at fhcrc.org>:

> Hi Ugo,
>
>
> On 05/03/2013 04:00 AM, Ugo Borello wrote:
>> Dear Carl,
>> Thank you very much; it makes sense now.
>>
>> To quantify gene expression of my RNASeq samples I  use your
>> TxDb.Mmusculus.UCSC.mm9.knownGene annotation together with the alignments to
>> the BSgenome.Mmusculus.UCSC.mm9 genome.
>> Now that I used the UCSC mm9 mouse genome from the Illumina iGenomes I
>> wanted to use their annotation.
>>
>>
>> Anyway, I have now a general question.
>> For the mouse mm9 genome (genome.fa) obtained from Illumina iGenomes
>>> genome<- scanFaIndex('genome.fa')
>>> seqlengths(genome)
>>     chr10     chr11     chr12     chr13     chr14     chr15     chr16
>> chr17     chr18     chr19
>> 129993255 121843856 121257530 120284312 125194864 103494974  98319150
>> 95272651  90772031  61342430
>>      chr1      chr2      chr3      chr4      chr5      chr6      chr7
>> chr8      chr9      chrM
>> 197195432 181748087 159599783 155630120 152537259 149517037 152524553
>> 131738871 124076172     16299
>>      chrX      chrY
>> 166650296  15902555
>>
>> For your TxDb.Mmusculus.UCSC.mm9.knownGene
>>> seqlengths( TxDb.Mmusculus.UCSC.mm9.knownGene)
>>         chr1         chr2         chr3         chr4         chr5
>> chr6         chr7         chr8         chr9
>>    197195432    181748087    159599783    155630120    152537259
>> 149517037    152524553    131738871    124076172
>>        chr10        chr11        chr12        chr13        chr14
>> chr15        chr16        chr17        chr18
>>    129993255    121843856    121257530    120284312    125194864
>> 103494974     98319150     95272651     90772031
>>        chr19         chrX         chrY         chrM  chr1_random
>> chr3_random  chr4_random  chr5_random  chr7_random
>>     61342430    166650296     15902555        16299      1231697
>> 41899       160594       357350       362490
>>  chr8_random  chr9_random chr13_random chr16_random chr17_random
>> chrX_random  chrY_random chrUn_random
>>       849593       449403       400311         3994       628739
>> 1785075     58682461      5900358
>>
>>
>> So. I want to filter out the '_random' stuff; is this the only and right way
>> to do it?
>
> It depends on whether you want to use the _random stuff.  My impression
> of how people use this is that most people don't.  So they would use
> isActiveSeq() to toggle those off.
>
>>> ann<- TxDb.Mmusculus.UCSC.mm9.knownGene
>>> isActiveSeq(ann)[seqlevels(ann)] <- FALSE
>>> isActiveSeq(ann) <- c("chr10"=TRUE, "chr11"=TRUE,
>> +                       "chr12"=TRUE,"chr13"=TRUE,
>> +                       "chr14"=TRUE,"chr15"=TRUE,
>> +                       "chr16"=TRUE, "chr17"=TRUE,
>> +                       "chr18"=TRUE, "chr19"=TRUE,
>> +                       "chr1"=TRUE, "chr2"=TRUE,
>> +                       "chr3"=TRUE, "chr4"=TRUE,
>> +                       "chr5"=TRUE, "chr6"=TRUE,
>> +                       "chr7"=TRUE, "chr8"=TRUE,
>> +                       "chr9"=TRUE, "chrM"=TRUE,
>> +                       "chrX"=TRUE, "chrY"=TRUE)
>> Then get my gene info this way?
>>> genesInfo<- exons(ann, columns='gene_id')
>> How can I add also gene names or symbols?
> Again, it depends on what you want to do.  If you are dealing with just
> a TranscriptDb like this, then there are not gene symbols attached
> already.  So for that object yes, you would need to do it like that.
>
>
> Now if you wanted gene symbols they are pretty easy to get.  You can go
> and get gene symbols by loading an org package like this:
>
> library(org.Mm.eg.db)
> cols(org.Mm.eg.db)
>
> And then you could use select() to retrieve those (by using the gene
> IDs as keys from your TranscriptDb object).  But some people find this
> extra step to be inconvenient, so I have created another route.  And
> that is to use a OrganismDb object.  There is a package for this
> already that I will use to demo here:
>
> library(Mus.musculus)
> cols(Mus.musculus)
>
> You will notice that this kind of object has all the stuff you want in
> one place.  I suspect that you will find that to be a lot more
> convenient:
>
> This basically means that you can do the thing you were interested in
> like this:
>
> genesInfo <- exons(Mus.musculus, columns=c("GENEID","SYMBOL"))
>
> *But* in your case, you are using mm9, so you will want to make a
> custom object (the Mus.musculus object is based on mm10).  This is not
> hard to do, but it is an extra step.  You can read how to do it in
> section 2 of the following vignette:
>
> http://www.bioconductor.org/packages/2.12/bioc/vignettes/OrganismDbi/inst/doc/OrganismDbi.pdf
>
>
>
>>
>>
>>
>> Thank you again for your help.
>>
>> Ugo
>>
>>
>>
>>
>>> From: Marc Carlson <mcarlson at fhcrc.org>
>>> Date: Thu, 02 May 2013 15:52:38 -0700
>>> To: Ugo Borello <ugo.borello at inserm.fr>
>>> Cc: <bioconductor at r-project.org>
>>> Subject: Re: [BioC] MakeTranscriptDbFromGFF
>>>
>>> Hi Ugo,
>>>
>>> The 15 GB tarball you sent me to contains several different GTF files
>>> for genes.  I grabbed this one as it seemed to be the most recent:
>>>
>>> Mus_musculus/UCSC/mm9/Annotation/Archives/archive-2013-03-06-15-01-24/Genes/ge
>>> nes.gtf
>>>
>>> So looking at this file I can reproduce the problem you mentioned.  And
>>> it shows me two problems.  The 1st problem is that the only field that
>>> seems to contain any information about exon positions is called phase
>>> (and not "exon_number" as was in the code I see from before).  There is
>>> not actually any field called "exon_number" in this file.  Either way,
>>> one thing you can check is to make sure that the string you give here is
>>> the same as the appropriate field name that is used by the file.  There
>>> is no way to know this information in advance since GTF does not specify
>>> how to encode this information (and in fact the information is entirely
>>> optional).
>>>
>>> The second problem is that even "phase" can't work right now since the
>>> authors of this gtf file have decided to only associate the exon rank
>>> information only with CDS and never with exons features.  So there is
>>> not any actual 'exon' position information in this file, only
>>> information for CDS positions.  Now that I see people doing these files
>>> in this way, I plan to enhance the parser so that it can process files
>>> of this kind.
>>>
>>>
>>> Is there a reason why you wanted to use this file and not the data
>>> contained in this package here?
>>>
>>> http://www.bioconductor.org/packages/2.12/data/annotation/html/TxDb.Mmusculus.
>>> UCSC.mm9.knownGene.html
>>>
>>>
>>>    Marc
>>>
>>>
>>>
>>> On 05/02/2013 02:59 AM, Ugo Borello wrote:
>>>> Dear Marc
>>>> Sorry I was not precise on the origin of the gtf annotation file;  
>>>>  I got the
>>>> gtf file from here:
>>>> http://tophat.cbcb.umd.edu/igenomes.shtml
>>>>
>>>> And more precisely from the Mus musculus/UCSC/mm9 folder
>>>> Here the description of the content of the folder:
>>>> ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/README.txt
>>>>
>>>> I realized reading the README.txt file that actually the genes.gtf file I
>>>> used is the Ensembl annotation of the mm9 release.
>>>> So, I changed dataSource = "Ensembl" in the function call and I   
>>>> got the same
>>>> error message:
>>>> Error in data.frame(..., check.names = FALSE) :
>>>>    arguments imply differing number of rows: 541775, 0
>>>>
>>>> At the end of my previous email you have the result of calling:
>>>> annFile<- import.gff('genes.gtf', format='gtf', asRangedData=FALSE)
>>>>
>>>>
>>>> Thank you
>>>>
>>>> Ugo
>>>>
>>>>> From: Marc Carlson <mcarlson at fhcrc.org>
>>>>> Date: Wed, 01 May 2013 15:33:02 -0700
>>>>> To: <bioconductor at r-project.org>
>>>>> Subject: Re: [BioC] MakeTranscriptDbFromGFF
>>>>>
>>>>> Hi Ugo,
>>>>>
>>>>> Which UCSC file was it that you were trying to process?
>>>>>
>>>>>
>>>>>     Marc
>>>>>
>>>>>
>>>>>
>>>>> On 05/01/2013 02:21 AM, Ugo Borello wrote:
>>>>>> Good morning,
>>>>>>
>>>>>> I have a little problem creating a TranscriptDb object using   
>>>>>> the function
>>>>>> makeTranscriptDbFromGFF. I want to use this annotation to count the
>>>>>> overlaps
>>>>>> of my genomic alignments with genes.
>>>>>>
>>>>>>
>>>>>> I ran:
>>>>>>
>>>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
>>>>>> + exonRankAttributeName = "exon_number",
>>>>>> + chrominfo = chrominfo,
>>>>>> + dataSource = "UCSC",
>>>>>> + species = "Mus musculus")
>>>>>>
>>>>>> And I got this error message:
>>>>>> Error in data.frame(..., check.names = FALSE) :
>>>>>>     arguments imply differing number of rows: 541775, 0
>>>>>>
>>>>>>
>>>>>> "chrominfo" was (info retrieved from the fasta genome file):
>>>>>>
>>>>>>> chrominfo
>>>>>>      chrom    length is_circular
>>>>>> 1  chr10 129993255       FALSE
>>>>>> 2  chr11 121843856       FALSE
>>>>>> 3  chr12 121257530       FALSE
>>>>>> 4  chr13 120284312       FALSE
>>>>>> 5  chr14 125194864       FALSE
>>>>>> 6  chr15 103494974       FALSE
>>>>>> 7  chr16  98319150       FALSE
>>>>>> 8  chr17  95272651       FALSE
>>>>>> 9  chr18  90772031       FALSE
>>>>>> 10 chr19  61342430       FALSE
>>>>>> 11  chr1 197195432       FALSE
>>>>>> 12  chr2 181748087       FALSE
>>>>>> 13  chr3 159599783       FALSE
>>>>>> 14  chr4 155630120       FALSE
>>>>>> 15  chr5 152537259       FALSE
>>>>>> 16  chr6 149517037       FALSE
>>>>>> 17  chr7 152524553       FALSE
>>>>>> 18  chr8 131738871       FALSE
>>>>>> 19  chr9 124076172       FALSE
>>>>>> 20  chrM     16299        TRUE
>>>>>> 21  chrX 166650296       FALSE
>>>>>> 22  chrY  15902555       FALSE
>>>>>>
>>>>>>
>>>>>> I ran it again without the "exonRankAttributeName" argument and I got:
>>>>>>
>>>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
>>>>>> + chrominfo = chrominfo,
>>>>>> + dataSource = "UCSC",
>>>>>> + species = "Mus musculus")
>>>>>> extracting transcript information
>>>>>> Estimating transcript ranges.
>>>>>> Extracting gene IDs
>>>>>> Processing splicing information for gtf file.
>>>>>> Deducing exon rank from relative coordinates provided
>>>>>> Prepare the 'metadata' data frame ... metadata: OK
>>>>>> Error in .checkForeignKey(transcripts_tx_chrom, NA,   
>>>>>> "transcripts$tx_chrom",
>>>>>> :
>>>>>>     all the values in 'transcripts$tx_chrom' must be present in
>>>>>> 'chrominfo$chrom'
>>>>>> In addition: Warning message:
>>>>>> In .deduceExonRankings(exs, format = "gtf") :
>>>>>>     Infering Exon Rankings.  If this is not what you expected,   
>>>>>> then please
>>>>>> be
>>>>>> sure that you have provided a valid attribute for exonRankAttributeName
>>>>>>
>>>>>>
>>>>>> Without the "chrominfo" argument I got the same error message   
>>>>>> as the first
>>>>>> time:
>>>>>>
>>>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
>>>>>> + exonRankAttributeName = "exon_number",
>>>>>> + dataSource = "UCSC",
>>>>>> + species = "Mus musculus")
>>>>>>
>>>>>> Error in data.frame(..., check.names = FALSE) :
>>>>>>     arguments imply differing number of rows: 541775, 0
>>>>>>
>>>>>>
>>>>>> Finally when I eliminated both the "exonRankAttributeName" and the
>>>>>> "chrominfo" arguments it worked but the warning reminded me of the
>>>>>> "exonRankAttributeName" argument and the chromosome names are   
>>>>>> now different
>>>>>> from the ones in the genome file and there is no info on their length
>>>>>>
>>>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
>>>>>> + dataSource = "UCSC",
>>>>>> + species = "Mus musculus")
>>>>>> extracting transcript information
>>>>>> Estimating transcript ranges.
>>>>>> Extracting gene IDs
>>>>>> Processing splicing information for gtf file.
>>>>>> Deducing exon rank from relative coordinates provided
>>>>>> Prepare the 'metadata' data frame ... metadata: OK
>>>>>> Now generating chrominfo from available sequence names. No chromosome
>>>>>> length
>>>>>> information is available.
>>>>>> Warning messages:
>>>>>> 1: In .deduceExonRankings(exs, format = "gtf") :
>>>>>>     Infering Exon Rankings.  If this is not what you expected,   
>>>>>> then please
>>>>>> be
>>>>>> sure that you have provided a valid attribute for exonRankAttributeName
>>>>>> 2: In matchCircularity(chroms, circ_seqs) :
>>>>>>     None of the strings in your circ_seqs argument match your seqnames.
>>>>>>
>>>>>>> seqinfo(txdb)
>>>>>> Seqinfo of length 32
>>>>>> seqnames     seqlengths isCircular genome
>>>>>> chr13              <NA>      FALSE   <NA>
>>>>>> chr9               <NA>      FALSE   <NA>
>>>>>> chr6               <NA>      FALSE   <NA>
>>>>>> chrX               <NA>      FALSE   <NA>
>>>>>> chr17              <NA>      FALSE   <NA>
>>>>>> chr2               <NA>      FALSE   <NA>
>>>>>> chr7               <NA>      FALSE   <NA>
>>>>>> chr18              <NA>      FALSE   <NA>
>>>>>> chr8               <NA>      FALSE   <NA>
>>>>>> ...                 ...        ...    ...
>>>>>> chrY_random        <NA>      FALSE   <NA>
>>>>>> chrX_random        <NA>      FALSE   <NA>
>>>>>> chr5_random        <NA>      FALSE   <NA>
>>>>>> chr4_random        <NA>      FALSE   <NA>
>>>>>> chrY               <NA>      FALSE   <NA>
>>>>>> chr7_random        <NA>      FALSE   <NA>
>>>>>> chr17_random       <NA>      FALSE   <NA>
>>>>>> chr13_random       <NA>      FALSE   <NA>
>>>>>> chr1_random        <NA>      FALSE   <NA>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> What am I doing wrong in my original call to makeTranscriptDbFromGFF?
>>>>>>
>>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
>>>>>>                                  exonRankAttributeName = "exon_number",
>>>>>>                                  chrominfo = chrominfo,
>>>>>>                                  dataSource = "UCSC",
>>>>>>                                  species = "Mus musculus")
>>>>>>
>>>>>> Why am I getting this unfair error message?
>>>>>> Thank you for your help
>>>>>> Ugo
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> FYI, this is my annFile (My gtf  annotation file was downloaded together
>>>>>> with a fasta file containing the mouse genome from UCSC):
>>>>>>
>>>>>>> annFile
>>>>>> GRanges with 595632 ranges and 9 metadata columns:
>>>>>>                 seqnames               ranges strand   |   source
>>>>>> type
>>>>>> score     phase
>>>>>>                    <Rle>            <IRanges>  <Rle>   | <factor>
>>>>>> <factor>
>>>>>> <numeric> <integer>
>>>>>>          [1]        chr1   [3204563, 3207049]      -   |  unknown
>>>>>> exon
>>>>>> <NA>      <NA>
>>>>>>          [2]        chr1   [3206103, 3206105]      -   |  unknown
>>>>>> stop_codon
>>>>>> <NA>      <NA>
>>>>>>          [3]        chr1   [3206106, 3207049]      -   |  unknown
>>>>>> CDS
>>>>>> <NA>         2
>>>>>>          [4]        chr1   [3411783, 3411982]      -   |  unknown
>>>>>> CDS
>>>>>> <NA>         1
>>>>>>          [5]        chr1   [3411783, 3411982]      -   |  unknown
>>>>>> exon
>>>>>> <NA>      <NA>
>>>>>>          ...         ...                  ...    ... ...      ...
>>>>>> ...
>>>>>> ...       ...
>>>>>>     [595628] chrY_random [54422360, 54422362]      +   |  unknown
>>>>>> stop_codon
>>>>>> <NA>      <NA>
>>>>>>     [595629] chrY_random [58501955, 58502946]      +   |  unknown
>>>>>> exon
>>>>>> <NA>      <NA>
>>>>>>     [595630] chrY_random [58502132, 58502812]      +   |  unknown
>>>>>> CDS
>>>>>> <NA>         0
>>>>>>     [595631] chrY_random [58502132, 58502134]      +   |  unknown
>>>>>> start_codon
>>>>>> <NA>      <NA>
>>>>>>     [595632] chrY_random [58502813, 58502815]      +   |  unknown
>>>>>> stop_codon
>>>>>> <NA>      <NA>
>>>>>>                   gene_id  transcript_id    gene_name        p_id
>>>>>> tss_id
>>>>>>               <character>    <character>  <character> <character>
>>>>>> <character>
>>>>>>          [1]         Xkr4   NM_001011874         Xkr4       P2739
>>>>>> TSS1881
>>>>>>          [2]         Xkr4   NM_001011874         Xkr4       P2739
>>>>>> TSS1881
>>>>>>          [3]         Xkr4   NM_001011874         Xkr4       P2739
>>>>>> TSS1881
>>>>>>          [4]         Xkr4   NM_001011874         Xkr4       P2739
>>>>>> TSS1881
>>>>>>          [5]         Xkr4   NM_001011874         Xkr4       P2739
>>>>>> TSS1881
>>>>>>          ...          ...            ...          ...         ...
>>>>>> ...
>>>>>>     [595628] LOC100039753   NM_001017394 LOC100039753      P10196
>>>>>> TSS19491
>>>>>>     [595629] LOC100039614 NM_001160137_4 LOC100039614      P22060
>>>>>> TSS4342
>>>>>>     [595630] LOC100039614 NM_001160137_4 LOC100039614      P22060
>>>>>> TSS4342
>>>>>>     [595631] LOC100039614 NM_001160137_4 LOC100039614      P22060
>>>>>> TSS4342
>>>>>>     [595632] LOC100039614 NM_001160137_4 LOC100039614      P22060
>>>>>> TSS4342
>>>>>>     ---
>>>>>>     seqlengths:
>>>>>>              chr1        chr10        chr11        chr12 ...    
>>>>>> chrX_random
>>>>>> chrY  chrY_random
>>>>>>                NA           NA           NA           NA ...          NA
>>>>>> NA           NA
>>>>>>
>>>>>>
>>>>>>> sessionInfo()
>>>>>> R version 3.0.0 (2013-04-03)
>>>>>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>>>>>>
>>>>>> locale:
>>>>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>>>>>
>>>>>> attached base packages:
>>>>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>>>>>> base
>>>>>>
>>>>>> other attached packages:
>>>>>>    [1] rtracklayer_1.20.1     Rbowtie_1.0.2          Rsamtools_1.12.2
>>>>>> Biostrings_2.28.0
>>>>>>    [5] GenomicFeatures_1.12.0 AnnotationDbi_1.22.1   Biobase_2.20.0
>>>>>> GenomicRanges_1.12.2
>>>>>>    [9] IRanges_1.18.0         BiocGenerics_0.6.0
>>>>>>
>>>>>> loaded via a namespace (and not attached):
>>>>>>    [1] BiocInstaller_1.10.0 biomaRt_2.16.0       bitops_1.0-5
>>>>>> BSgenome_1.28.0
>>>>>>    [5] DBI_0.2-5            grid_3.0.0           hwriter_1.3
>>>>>> lattice_0.20-15
>>>>>>    [9] RCurl_1.95-4.1       RSQLite_0.11.3       ShortRead_1.18.0
>>>>>> stats4_3.0.0
>>>>>> [13] tools_3.0.0          XML_3.95-0.2         zlibbioc_1.6.0
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at r-project.org
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>