[BioC] MakeTranscriptDbFromGFF
Ugo Borello
ugo.borello at inserm.fr
Fri May 3 22:10:47 CEST 2013
Thank you very much, Marc.
I will work on it.
Ugo
Quoting Marc Carlson <mcarlson at fhcrc.org>:
> Hi Ugo,
>
>
> On 05/03/2013 04:00 AM, Ugo Borello wrote:
>> Dear Carl,
>> Thank you very much; it makes sense now.
>>
>> To quantify gene expression of my RNASeq samples I use your
>> TxDb.Mmusculus.UCSC.mm9.knownGene annotation together with the alignments to
>> the BSgenome.Mmusculus.UCSC.mm9 genome.
>> Now that I used the UCSC mm9 mouse genome from the Illumina iGenomes I
>> wanted to use their annotation.
>>
>>
>> Anyway, I have now a general question.
>> For the mouse mm9 genome (genome.fa) obtained from Illumina iGenomes
>>> genome<- scanFaIndex('genome.fa')
>>> seqlengths(genome)
>> chr10 chr11 chr12 chr13 chr14 chr15 chr16
>> chr17 chr18 chr19
>> 129993255 121843856 121257530 120284312 125194864 103494974 98319150
>> 95272651 90772031 61342430
>> chr1 chr2 chr3 chr4 chr5 chr6 chr7
>> chr8 chr9 chrM
>> 197195432 181748087 159599783 155630120 152537259 149517037 152524553
>> 131738871 124076172 16299
>> chrX chrY
>> 166650296 15902555
>>
>> For your TxDb.Mmusculus.UCSC.mm9.knownGene
>>> seqlengths( TxDb.Mmusculus.UCSC.mm9.knownGene)
>> chr1 chr2 chr3 chr4 chr5
>> chr6 chr7 chr8 chr9
>> 197195432 181748087 159599783 155630120 152537259
>> 149517037 152524553 131738871 124076172
>> chr10 chr11 chr12 chr13 chr14
>> chr15 chr16 chr17 chr18
>> 129993255 121843856 121257530 120284312 125194864
>> 103494974 98319150 95272651 90772031
>> chr19 chrX chrY chrM chr1_random
>> chr3_random chr4_random chr5_random chr7_random
>> 61342430 166650296 15902555 16299 1231697
>> 41899 160594 357350 362490
>> chr8_random chr9_random chr13_random chr16_random chr17_random
>> chrX_random chrY_random chrUn_random
>> 849593 449403 400311 3994 628739
>> 1785075 58682461 5900358
>>
>>
>> So. I want to filter out the '_random' stuff; is this the only and right way
>> to do it?
>
> It depends on whether you want to use the _random stuff. My impression
> of how people use this is that most people don't. So they would use
> isActiveSeq() to toggle those off.
>
>>> ann<- TxDb.Mmusculus.UCSC.mm9.knownGene
>>> isActiveSeq(ann)[seqlevels(ann)] <- FALSE
>>> isActiveSeq(ann) <- c("chr10"=TRUE, "chr11"=TRUE,
>> + "chr12"=TRUE,"chr13"=TRUE,
>> + "chr14"=TRUE,"chr15"=TRUE,
>> + "chr16"=TRUE, "chr17"=TRUE,
>> + "chr18"=TRUE, "chr19"=TRUE,
>> + "chr1"=TRUE, "chr2"=TRUE,
>> + "chr3"=TRUE, "chr4"=TRUE,
>> + "chr5"=TRUE, "chr6"=TRUE,
>> + "chr7"=TRUE, "chr8"=TRUE,
>> + "chr9"=TRUE, "chrM"=TRUE,
>> + "chrX"=TRUE, "chrY"=TRUE)
>> Then get my gene info this way?
>>> genesInfo<- exons(ann, columns='gene_id')
>> How can I add also gene names or symbols?
> Again, it depends on what you want to do. If you are dealing with just
> a TranscriptDb like this, then there are not gene symbols attached
> already. So for that object yes, you would need to do it like that.
>
>
> Now if you wanted gene symbols they are pretty easy to get. You can go
> and get gene symbols by loading an org package like this:
>
> library(org.Mm.eg.db)
> cols(org.Mm.eg.db)
>
> And then you could use select() to retrieve those (by using the gene
> IDs as keys from your TranscriptDb object). But some people find this
> extra step to be inconvenient, so I have created another route. And
> that is to use a OrganismDb object. There is a package for this
> already that I will use to demo here:
>
> library(Mus.musculus)
> cols(Mus.musculus)
>
> You will notice that this kind of object has all the stuff you want in
> one place. I suspect that you will find that to be a lot more
> convenient:
>
> This basically means that you can do the thing you were interested in
> like this:
>
> genesInfo <- exons(Mus.musculus, columns=c("GENEID","SYMBOL"))
>
> *But* in your case, you are using mm9, so you will want to make a
> custom object (the Mus.musculus object is based on mm10). This is not
> hard to do, but it is an extra step. You can read how to do it in
> section 2 of the following vignette:
>
> http://www.bioconductor.org/packages/2.12/bioc/vignettes/OrganismDbi/inst/doc/OrganismDbi.pdf
>
>
>
>>
>>
>>
>> Thank you again for your help.
>>
>> Ugo
>>
>>
>>
>>
>>> From: Marc Carlson <mcarlson at fhcrc.org>
>>> Date: Thu, 02 May 2013 15:52:38 -0700
>>> To: Ugo Borello <ugo.borello at inserm.fr>
>>> Cc: <bioconductor at r-project.org>
>>> Subject: Re: [BioC] MakeTranscriptDbFromGFF
>>>
>>> Hi Ugo,
>>>
>>> The 15 GB tarball you sent me to contains several different GTF files
>>> for genes. I grabbed this one as it seemed to be the most recent:
>>>
>>> Mus_musculus/UCSC/mm9/Annotation/Archives/archive-2013-03-06-15-01-24/Genes/ge
>>> nes.gtf
>>>
>>> So looking at this file I can reproduce the problem you mentioned. And
>>> it shows me two problems. The 1st problem is that the only field that
>>> seems to contain any information about exon positions is called phase
>>> (and not "exon_number" as was in the code I see from before). There is
>>> not actually any field called "exon_number" in this file. Either way,
>>> one thing you can check is to make sure that the string you give here is
>>> the same as the appropriate field name that is used by the file. There
>>> is no way to know this information in advance since GTF does not specify
>>> how to encode this information (and in fact the information is entirely
>>> optional).
>>>
>>> The second problem is that even "phase" can't work right now since the
>>> authors of this gtf file have decided to only associate the exon rank
>>> information only with CDS and never with exons features. So there is
>>> not any actual 'exon' position information in this file, only
>>> information for CDS positions. Now that I see people doing these files
>>> in this way, I plan to enhance the parser so that it can process files
>>> of this kind.
>>>
>>>
>>> Is there a reason why you wanted to use this file and not the data
>>> contained in this package here?
>>>
>>> http://www.bioconductor.org/packages/2.12/data/annotation/html/TxDb.Mmusculus.
>>> UCSC.mm9.knownGene.html
>>>
>>>
>>> Marc
>>>
>>>
>>>
>>> On 05/02/2013 02:59 AM, Ugo Borello wrote:
>>>> Dear Marc
>>>> Sorry I was not precise on the origin of the gtf annotation file;
>>>> I got the
>>>> gtf file from here:
>>>> http://tophat.cbcb.umd.edu/igenomes.shtml
>>>>
>>>> And more precisely from the Mus musculus/UCSC/mm9 folder
>>>> Here the description of the content of the folder:
>>>> ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/README.txt
>>>>
>>>> I realized reading the README.txt file that actually the genes.gtf file I
>>>> used is the Ensembl annotation of the mm9 release.
>>>> So, I changed dataSource = "Ensembl" in the function call and I
>>>> got the same
>>>> error message:
>>>> Error in data.frame(..., check.names = FALSE) :
>>>> arguments imply differing number of rows: 541775, 0
>>>>
>>>> At the end of my previous email you have the result of calling:
>>>> annFile<- import.gff('genes.gtf', format='gtf', asRangedData=FALSE)
>>>>
>>>>
>>>> Thank you
>>>>
>>>> Ugo
>>>>
>>>>> From: Marc Carlson <mcarlson at fhcrc.org>
>>>>> Date: Wed, 01 May 2013 15:33:02 -0700
>>>>> To: <bioconductor at r-project.org>
>>>>> Subject: Re: [BioC] MakeTranscriptDbFromGFF
>>>>>
>>>>> Hi Ugo,
>>>>>
>>>>> Which UCSC file was it that you were trying to process?
>>>>>
>>>>>
>>>>> Marc
>>>>>
>>>>>
>>>>>
>>>>> On 05/01/2013 02:21 AM, Ugo Borello wrote:
>>>>>> Good morning,
>>>>>>
>>>>>> I have a little problem creating a TranscriptDb object using
>>>>>> the function
>>>>>> makeTranscriptDbFromGFF. I want to use this annotation to count the
>>>>>> overlaps
>>>>>> of my genomic alignments with genes.
>>>>>>
>>>>>>
>>>>>> I ran:
>>>>>>
>>>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
>>>>>> + exonRankAttributeName = "exon_number",
>>>>>> + chrominfo = chrominfo,
>>>>>> + dataSource = "UCSC",
>>>>>> + species = "Mus musculus")
>>>>>>
>>>>>> And I got this error message:
>>>>>> Error in data.frame(..., check.names = FALSE) :
>>>>>> arguments imply differing number of rows: 541775, 0
>>>>>>
>>>>>>
>>>>>> "chrominfo" was (info retrieved from the fasta genome file):
>>>>>>
>>>>>>> chrominfo
>>>>>> chrom length is_circular
>>>>>> 1 chr10 129993255 FALSE
>>>>>> 2 chr11 121843856 FALSE
>>>>>> 3 chr12 121257530 FALSE
>>>>>> 4 chr13 120284312 FALSE
>>>>>> 5 chr14 125194864 FALSE
>>>>>> 6 chr15 103494974 FALSE
>>>>>> 7 chr16 98319150 FALSE
>>>>>> 8 chr17 95272651 FALSE
>>>>>> 9 chr18 90772031 FALSE
>>>>>> 10 chr19 61342430 FALSE
>>>>>> 11 chr1 197195432 FALSE
>>>>>> 12 chr2 181748087 FALSE
>>>>>> 13 chr3 159599783 FALSE
>>>>>> 14 chr4 155630120 FALSE
>>>>>> 15 chr5 152537259 FALSE
>>>>>> 16 chr6 149517037 FALSE
>>>>>> 17 chr7 152524553 FALSE
>>>>>> 18 chr8 131738871 FALSE
>>>>>> 19 chr9 124076172 FALSE
>>>>>> 20 chrM 16299 TRUE
>>>>>> 21 chrX 166650296 FALSE
>>>>>> 22 chrY 15902555 FALSE
>>>>>>
>>>>>>
>>>>>> I ran it again without the "exonRankAttributeName" argument and I got:
>>>>>>
>>>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
>>>>>> + chrominfo = chrominfo,
>>>>>> + dataSource = "UCSC",
>>>>>> + species = "Mus musculus")
>>>>>> extracting transcript information
>>>>>> Estimating transcript ranges.
>>>>>> Extracting gene IDs
>>>>>> Processing splicing information for gtf file.
>>>>>> Deducing exon rank from relative coordinates provided
>>>>>> Prepare the 'metadata' data frame ... metadata: OK
>>>>>> Error in .checkForeignKey(transcripts_tx_chrom, NA,
>>>>>> "transcripts$tx_chrom",
>>>>>> :
>>>>>> all the values in 'transcripts$tx_chrom' must be present in
>>>>>> 'chrominfo$chrom'
>>>>>> In addition: Warning message:
>>>>>> In .deduceExonRankings(exs, format = "gtf") :
>>>>>> Infering Exon Rankings. If this is not what you expected,
>>>>>> then please
>>>>>> be
>>>>>> sure that you have provided a valid attribute for exonRankAttributeName
>>>>>>
>>>>>>
>>>>>> Without the "chrominfo" argument I got the same error message
>>>>>> as the first
>>>>>> time:
>>>>>>
>>>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
>>>>>> + exonRankAttributeName = "exon_number",
>>>>>> + dataSource = "UCSC",
>>>>>> + species = "Mus musculus")
>>>>>>
>>>>>> Error in data.frame(..., check.names = FALSE) :
>>>>>> arguments imply differing number of rows: 541775, 0
>>>>>>
>>>>>>
>>>>>> Finally when I eliminated both the "exonRankAttributeName" and the
>>>>>> "chrominfo" arguments it worked but the warning reminded me of the
>>>>>> "exonRankAttributeName" argument and the chromosome names are
>>>>>> now different
>>>>>> from the ones in the genome file and there is no info on their length
>>>>>>
>>>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
>>>>>> + dataSource = "UCSC",
>>>>>> + species = "Mus musculus")
>>>>>> extracting transcript information
>>>>>> Estimating transcript ranges.
>>>>>> Extracting gene IDs
>>>>>> Processing splicing information for gtf file.
>>>>>> Deducing exon rank from relative coordinates provided
>>>>>> Prepare the 'metadata' data frame ... metadata: OK
>>>>>> Now generating chrominfo from available sequence names. No chromosome
>>>>>> length
>>>>>> information is available.
>>>>>> Warning messages:
>>>>>> 1: In .deduceExonRankings(exs, format = "gtf") :
>>>>>> Infering Exon Rankings. If this is not what you expected,
>>>>>> then please
>>>>>> be
>>>>>> sure that you have provided a valid attribute for exonRankAttributeName
>>>>>> 2: In matchCircularity(chroms, circ_seqs) :
>>>>>> None of the strings in your circ_seqs argument match your seqnames.
>>>>>>
>>>>>>> seqinfo(txdb)
>>>>>> Seqinfo of length 32
>>>>>> seqnames seqlengths isCircular genome
>>>>>> chr13 <NA> FALSE <NA>
>>>>>> chr9 <NA> FALSE <NA>
>>>>>> chr6 <NA> FALSE <NA>
>>>>>> chrX <NA> FALSE <NA>
>>>>>> chr17 <NA> FALSE <NA>
>>>>>> chr2 <NA> FALSE <NA>
>>>>>> chr7 <NA> FALSE <NA>
>>>>>> chr18 <NA> FALSE <NA>
>>>>>> chr8 <NA> FALSE <NA>
>>>>>> ... ... ... ...
>>>>>> chrY_random <NA> FALSE <NA>
>>>>>> chrX_random <NA> FALSE <NA>
>>>>>> chr5_random <NA> FALSE <NA>
>>>>>> chr4_random <NA> FALSE <NA>
>>>>>> chrY <NA> FALSE <NA>
>>>>>> chr7_random <NA> FALSE <NA>
>>>>>> chr17_random <NA> FALSE <NA>
>>>>>> chr13_random <NA> FALSE <NA>
>>>>>> chr1_random <NA> FALSE <NA>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> What am I doing wrong in my original call to makeTranscriptDbFromGFF?
>>>>>>
>>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
>>>>>> exonRankAttributeName = "exon_number",
>>>>>> chrominfo = chrominfo,
>>>>>> dataSource = "UCSC",
>>>>>> species = "Mus musculus")
>>>>>>
>>>>>> Why am I getting this unfair error message?
>>>>>> Thank you for your help
>>>>>> Ugo
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> FYI, this is my annFile (My gtf annotation file was downloaded together
>>>>>> with a fasta file containing the mouse genome from UCSC):
>>>>>>
>>>>>>> annFile
>>>>>> GRanges with 595632 ranges and 9 metadata columns:
>>>>>> seqnames ranges strand | source
>>>>>> type
>>>>>> score phase
>>>>>> <Rle> <IRanges> <Rle> | <factor>
>>>>>> <factor>
>>>>>> <numeric> <integer>
>>>>>> [1] chr1 [3204563, 3207049] - | unknown
>>>>>> exon
>>>>>> <NA> <NA>
>>>>>> [2] chr1 [3206103, 3206105] - | unknown
>>>>>> stop_codon
>>>>>> <NA> <NA>
>>>>>> [3] chr1 [3206106, 3207049] - | unknown
>>>>>> CDS
>>>>>> <NA> 2
>>>>>> [4] chr1 [3411783, 3411982] - | unknown
>>>>>> CDS
>>>>>> <NA> 1
>>>>>> [5] chr1 [3411783, 3411982] - | unknown
>>>>>> exon
>>>>>> <NA> <NA>
>>>>>> ... ... ... ... ... ...
>>>>>> ...
>>>>>> ... ...
>>>>>> [595628] chrY_random [54422360, 54422362] + | unknown
>>>>>> stop_codon
>>>>>> <NA> <NA>
>>>>>> [595629] chrY_random [58501955, 58502946] + | unknown
>>>>>> exon
>>>>>> <NA> <NA>
>>>>>> [595630] chrY_random [58502132, 58502812] + | unknown
>>>>>> CDS
>>>>>> <NA> 0
>>>>>> [595631] chrY_random [58502132, 58502134] + | unknown
>>>>>> start_codon
>>>>>> <NA> <NA>
>>>>>> [595632] chrY_random [58502813, 58502815] + | unknown
>>>>>> stop_codon
>>>>>> <NA> <NA>
>>>>>> gene_id transcript_id gene_name p_id
>>>>>> tss_id
>>>>>> <character> <character> <character> <character>
>>>>>> <character>
>>>>>> [1] Xkr4 NM_001011874 Xkr4 P2739
>>>>>> TSS1881
>>>>>> [2] Xkr4 NM_001011874 Xkr4 P2739
>>>>>> TSS1881
>>>>>> [3] Xkr4 NM_001011874 Xkr4 P2739
>>>>>> TSS1881
>>>>>> [4] Xkr4 NM_001011874 Xkr4 P2739
>>>>>> TSS1881
>>>>>> [5] Xkr4 NM_001011874 Xkr4 P2739
>>>>>> TSS1881
>>>>>> ... ... ... ... ...
>>>>>> ...
>>>>>> [595628] LOC100039753 NM_001017394 LOC100039753 P10196
>>>>>> TSS19491
>>>>>> [595629] LOC100039614 NM_001160137_4 LOC100039614 P22060
>>>>>> TSS4342
>>>>>> [595630] LOC100039614 NM_001160137_4 LOC100039614 P22060
>>>>>> TSS4342
>>>>>> [595631] LOC100039614 NM_001160137_4 LOC100039614 P22060
>>>>>> TSS4342
>>>>>> [595632] LOC100039614 NM_001160137_4 LOC100039614 P22060
>>>>>> TSS4342
>>>>>> ---
>>>>>> seqlengths:
>>>>>> chr1 chr10 chr11 chr12 ...
>>>>>> chrX_random
>>>>>> chrY chrY_random
>>>>>> NA NA NA NA ... NA
>>>>>> NA NA
>>>>>>
>>>>>>
>>>>>>> sessionInfo()
>>>>>> R version 3.0.0 (2013-04-03)
>>>>>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>>>>>>
>>>>>> locale:
>>>>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>>>>>
>>>>>> attached base packages:
>>>>>> [1] parallel stats graphics grDevices utils datasets methods
>>>>>> base
>>>>>>
>>>>>> other attached packages:
>>>>>> [1] rtracklayer_1.20.1 Rbowtie_1.0.2 Rsamtools_1.12.2
>>>>>> Biostrings_2.28.0
>>>>>> [5] GenomicFeatures_1.12.0 AnnotationDbi_1.22.1 Biobase_2.20.0
>>>>>> GenomicRanges_1.12.2
>>>>>> [9] IRanges_1.18.0 BiocGenerics_0.6.0
>>>>>>
>>>>>> loaded via a namespace (and not attached):
>>>>>> [1] BiocInstaller_1.10.0 biomaRt_2.16.0 bitops_1.0-5
>>>>>> BSgenome_1.28.0
>>>>>> [5] DBI_0.2-5 grid_3.0.0 hwriter_1.3
>>>>>> lattice_0.20-15
>>>>>> [9] RCurl_1.95-4.1 RSQLite_0.11.3 ShortRead_1.18.0
>>>>>> stats4_3.0.0
>>>>>> [13] tools_3.0.0 XML_3.95-0.2 zlibbioc_1.6.0
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at r-project.org
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
More information about the Bioconductor
mailing list