[BioC] MakeTranscriptDbFromGFF
Ugo Borello
ugo.borello at inserm.fr
Tue May 7 11:58:51 CEST 2013
Thank you very very much, Marc.
Ugo
> From: Marc Carlson <mcarlson at fhcrc.org>
> Date: Mon, 06 May 2013 15:00:09 -0700
> To: <bioconductor at r-project.org>
> Subject: Re: [BioC] MakeTranscriptDbFromGFF
>
> Hi Ugo,
>
> So I have good news and bad news regarding that data file you wanted to
> use from:
>
> http://tophat.cbcb.umd.edu/igenomes.shtml
>
> The good news is that I have modified makeTranscriptDbFromGFF() to
> notice whenever a file only has exon rank information for CDS and to do
> the right thing (and make use of that information) while issuing a
> warning that it is doing so. The bad news is that after looking more
> carefully at the data from the site listed above, there is no way to
> make that data into a transcriptome without inferring the exon ranks.
> At least not in its current state. The files on offer simply don't
> contain exon rank information. What I initially thought might be exon
> ranks is (after more careful inspection) something else, and so can't be
> used for that.
>
> So you probably should go with the advice from my last message and make
> a custom mm9 package that makes use of the UCSC transcriptome (or maybe
> one from BiomaRt depending on what you have aligned your data to). But
> I would not recommend using the data at that web site to make
> transcriptomes unless they can give you a file that has exon rank
> information.
>
>
> Marc
>
>
>
> On 05/06/2013 01:32 PM, Marc Carlson wrote:
>> Hi Ugo,
>>
>> This worked for me:
>>
>> library(OrganismDbi)
>>
>> gd <- list(join1 = c(GO.db="GOID", org.Mm.eg.db="GO"),
>>            join2 = c(org.Mm.eg.db="ENTREZID",
>>                      TxDb.Mmusculus.UCSC.mm9.knownGene="GENEID"))
>>
>> makeOrganismPackage(pkgname = "Mus.musculus.mm9",
>>                     graphData = gd,
>>                     organism = "Mus musculus",
>>                     version = "1.0",
>>                     maintainer = "You <maintainer at someplace.org>",
>>                     author = "You",
>>                     destDir = ".",
>>                     license = "Artistic-2.0")
>>
>>
>> Then I can do stuff like this (after I run R CMD INSTALL
>> Mus.musculus.mm9) :
>>
>> library(Mus.musculus.mm9)
>>
>> select(Mus.musculus.mm9,
>>        head(keys(Mus.musculus.mm9, keytype="ENTREZID")),
>>        c("SYMBOL","TXSTRAND"),
>>        keytype="ENTREZID")
>>
>> txs <- transcripts(Mus.musculus.mm9, columns="SYMBOL")
>>
>>
>> Etc.
>>
>>
>> As for your other question, that error means that the chromosome names
>> in your input file are not all present in the chrominfo that you have
>> used. Usually this means that there is a row with something like
>> chr8_random as its chromosome name in the input file. You don't have
>> that in your chrominfo table, so the database build stops, because it
>> has no way to know what chr8_random is.
>>
>> To fix it, you can either remove unwanted rows from your input file or
>> you can add new rows to your chrominfo object.
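
A minimal sketch of both fixes in base R, assuming the chromosome names from the input file are available as a character vector. The names gtf_chroms, missing_chroms, keep and chrominfo_full are hypothetical and used only for illustration, and the toy values stand in for the real file:

```r
# Toy stand-ins (hypothetical values) for the seqnames found in a GTF file
# and for a chrominfo data frame like the one shown in this thread.
gtf_chroms <- c("chr1", "chr1", "chr8_random", "chrY", "chrY_random")
chrominfo <- data.frame(chrom = c("chr1", "chrY"),
                        length = c(197195432, 15902555),
                        is_circular = c(FALSE, FALSE),
                        stringsAsFactors = FALSE)

# Names used in the file but absent from chrominfo: these trigger the error.
missing_chroms <- setdiff(unique(gtf_chroms), chrominfo$chrom)

# Option 1: flag the rows to keep (those whose chromosome is in chrominfo).
keep <- gtf_chroms %in% chrominfo$chrom

# Option 2: add the missing names to chrominfo (length unknown, hence NA).
chrominfo_full <- rbind(chrominfo,
                        data.frame(chrom = missing_chroms,
                                   length = NA_real_,
                                   is_circular = FALSE,
                                   stringsAsFactors = FALSE))
```

Whether a chrominfo row with an NA length is accepted by makeTranscriptDbFromGFF() is an assumption here; filling in the real lengths is the safer choice.
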
>>
>>
>> Marc
>>
>>
>>
>> On 05/04/2013 07:18 AM, Ugo Borello wrote:
>>> Dear Marc,
>>> I apologize for bothering you again, but I was intrigued by the
>>> custom Mus.musculus.mm9 object. It sounds very convenient indeed.
>>> So I tried, as you suggested:
>>>
>>> gd<-c(org.Mm.eg.db='SYMBOL',
>>> TxDb.Mmusculus.UCSC.mm9.knownGene="GENEID")
>>> ## I also tried to set org.Mm.eg.db='ENTREZID'
>>>
>>> destination <- tempfile()
>>> dir.create(destination)
>>> makeOrganismPackage(pkgname = "Mus.musculus.mm9",
>>>                     graphData = gd,
>>>                     organism = "Mus musculus",
>>>                     version = "1.0.0",
>>>                     maintainer = "Package Maintainer<maintainer at somewhere.com>",
>>>                     author = "SomeBody",
>>>                     destDir = destination,
>>>                     license = "Artistic-2.0")
>>>
>>> and I got:
>>> Error in cbind(pkgs, keys) :
>>> number of rows of matrices must match (see arg 2)
>>>
>>> I am surely missing something here. Any suggestions?
>>>
>>>
>>> And, coming back to makeTranscriptDbFromGFF.
>>> I obtained my gtf annotation file from UCSC using genePredToGtf, as
>>> the UCSC people suggest
>>> (http://genomewiki.ucsc.edu/index.php/Genes_in_gtf_or_gff_format)
>>>
>>> txdb <-makeTranscriptDbFromGFF(file = 'mm9.gtf', format = "gtf",
>>> + exonRankAttributeName = "exon_number",
>>> + chrominfo = chrominfo,
>>> + dataSource = "UCSC",
>>> + species = "Mus musculus")
>>> extracting transcript information
>>> Estimating transcript ranges.
>>> Extracting gene IDs
>>> Processing splicing information for gtf file.
>>> Prepare the 'metadata' data frame ... metadata: OK
>>> Error in .checkForeignKey(transcripts_tx_chrom, NA, "transcripts$tx_chrom", :
>>>   all the values in 'transcripts$tx_chrom' must be present in 'chrominfo$chrom'
>>>
>>> Why do I get this error now?
>>>
>>>> chrominfo
>>> chrom length is_circular
>>> 1 chr10 129993255 FALSE
>>> 2 chr11 121843856 FALSE
>>> 3 chr12 121257530 FALSE
>>> 4 chr13 120284312 FALSE
>>> 5 chr14 125194864 FALSE
>>> 6 chr15 103494974 FALSE
>>> 7 chr16 98319150 FALSE
>>> 8 chr17 95272651 FALSE
>>> 9 chr18 90772031 FALSE
>>> 10 chr19 61342430 FALSE
>>> 11 chr1 197195432 FALSE
>>> 12 chr2 181748087 FALSE
>>> 13 chr3 159599783 FALSE
>>> 14 chr4 155630120 FALSE
>>> 15 chr5 152537259 FALSE
>>> 16 chr6 149517037 FALSE
>>> 17 chr7 152524553 FALSE
>>> 18 chr8 131738871 FALSE
>>> 19 chr9 124076172 FALSE
>>> 20 chrM 16299 FALSE
>>> 21 chrX 166650296 FALSE
>>> 22 chrY 15902555 FALSE
>>>
>>> Thank you very much for your time and patience.
>>>
>>> Ugo
>>>
>>>
>>>
>>>
>>> P.S.
>>> This is the annotation
>>>> annFile3<- import.gff('mm9.gtf', format='gtf', asRangedData=FALSE)
>>>> annFile3
>>> GRanges with 962651 ranges and 9 metadata columns:
>>>             seqnames               ranges strand   |    source        type     score     phase     gene_id transcript_id
>>>                <Rle>            <IRanges>  <Rle>   |  <factor>    <factor> <numeric> <integer> <character>   <character>
>>>      [1]        chr1   [3195985, 3197398]      -   | knownGene        exon      <NA>      <NA>  uc007aet.1    uc007aet.1
>>>      [2]        chr1   [3203520, 3205713]      -   | knownGene        exon      <NA>      <NA>  uc007aet.1    uc007aet.1
>>>      [3]        chr1   [3204563, 3207049]      -   | knownGene        exon      <NA>      <NA>      Q5GH67    uc007aeu.1
>>>      [4]        chr1   [3206106, 3207049]      -   | knownGene         CDS      <NA>         2      Q5GH67    uc007aeu.1
>>>      [5]        chr1   [3411783, 3411982]      -   | knownGene        exon      <NA>      <NA>      Q5GH67    uc007aeu.1
>>>      ...         ...                  ...    ... ...       ...         ...       ...       ...         ...           ...
>>> [962647] chrY_random [58502132, 58502365]      +   | knownGene         CDS      <NA>         0      Q62458    uc012htl.1
>>> [962648] chrY_random [58502369, 58502946]      +   | knownGene        exon      <NA>      <NA>      Q62458    uc012htl.1
>>> [962649] chrY_random [58502369, 58502812]      +   | knownGene         CDS      <NA>         0      Q62458    uc012htl.1
>>> [962650] chrY_random [58502132, 58502134]      +   | knownGene start_codon      <NA>         0      Q62458    uc012htl.1
>>> [962651] chrY_random [58502813, 58502815]      +   | knownGene  stop_codon      <NA>         0      Q62458    uc012htl.1
>>> exon_number exon_id gene_name
>>> <numeric> <character> <character>
>>> [1] 1 uc007aet.1.1 <NA>
>>> [2] 2 uc007aet.1.2 <NA>
>>> [3] 1 uc007aeu.1.1 Q5GH67
>>> [4] 1 uc007aeu.1.1 Q5GH67
>>> [5] 2 uc007aeu.1.2 Q5GH67
>>> ... ... ... ...
>>> [962647] 1 uc012htl.1.1 Q62458
>>> [962648] 2 uc012htl.1.2 Q62458
>>> [962649] 2 uc012htl.1.2 Q62458
>>> [962650] 1 uc012htl.1.1 Q62458
>>> [962651] 1 uc012htl.1.1 Q62458
>>> ---
>>> seqlengths:
>>> chr1 chr10 chr11 chr12 chr13 ... chrX chrX_random chrY chrY_random
>>>   NA    NA    NA    NA    NA ...   NA          NA   NA          NA
>>>
>>>
>>> Quoting Marc Carlson <mcarlson at fhcrc.org>:
>>>
>>>> Hi Ugo,
>>>>
>>>>
>>>> On 05/03/2013 04:00 AM, Ugo Borello wrote:
>>>>> Dear Marc,
>>>>> Thank you very much; it makes sense now.
>>>>>
>>>>> To quantify gene expression of my RNASeq samples I use your
>>>>> TxDb.Mmusculus.UCSC.mm9.knownGene annotation together with the
>>>>> alignments to the BSgenome.Mmusculus.UCSC.mm9 genome.
>>>>> Now that I used the UCSC mm9 mouse genome from the Illumina iGenomes, I
>>>>> wanted to use their annotation.
>>>>>
>>>>>
>>>>> Anyway, I have now a general question.
>>>>> For the mouse mm9 genome (genome.fa) obtained from Illumina iGenomes
>>>>>> genome<- scanFaIndex('genome.fa')
>>>>>> seqlengths(genome)
>>>>>     chr10     chr11     chr12     chr13     chr14     chr15     chr16     chr17     chr18     chr19
>>>>> 129993255 121843856 121257530 120284312 125194864 103494974  98319150  95272651  90772031  61342430
>>>>>      chr1      chr2      chr3      chr4      chr5      chr6      chr7      chr8      chr9      chrM
>>>>> 197195432 181748087 159599783 155630120 152537259 149517037 152524553 131738871 124076172     16299
>>>>>      chrX      chrY
>>>>> 166650296  15902555
>>>>>
>>>>> For your TxDb.Mmusculus.UCSC.mm9.knownGene
>>>>>> seqlengths( TxDb.Mmusculus.UCSC.mm9.knownGene)
>>>>>         chr1         chr2         chr3         chr4         chr5         chr6         chr7         chr8         chr9
>>>>>    197195432    181748087    159599783    155630120    152537259    149517037    152524553    131738871    124076172
>>>>>        chr10        chr11        chr12        chr13        chr14        chr15        chr16        chr17        chr18
>>>>>    129993255    121843856    121257530    120284312    125194864    103494974     98319150     95272651     90772031
>>>>>        chr19         chrX         chrY         chrM  chr1_random  chr3_random  chr4_random  chr5_random  chr7_random
>>>>>     61342430    166650296     15902555        16299      1231697        41899       160594       357350       362490
>>>>>  chr8_random  chr9_random chr13_random chr16_random chr17_random  chrX_random  chrY_random chrUn_random
>>>>>       849593       449403       400311         3994       628739      1785075     58682461      5900358
>>>>>
>>>>>
>>>>> So, I want to filter out the '_random' stuff; is this the only (and
>>>>> the right) way to do it?
>>>>
>>>> It depends on whether you want to use the _random stuff. My impression
>>>> of how people use this is that most people don't. So they would use
>>>> isActiveSeq() to toggle those off.
>>>>
>>>>>> ann<- TxDb.Mmusculus.UCSC.mm9.knownGene
>>>>>> isActiveSeq(ann)[seqlevels(ann)] <- FALSE
>>>>>> isActiveSeq(ann) <- c("chr10"=TRUE, "chr11"=TRUE,
>>>>> + "chr12"=TRUE,"chr13"=TRUE,
>>>>> + "chr14"=TRUE,"chr15"=TRUE,
>>>>> + "chr16"=TRUE, "chr17"=TRUE,
>>>>> + "chr18"=TRUE, "chr19"=TRUE,
>>>>> + "chr1"=TRUE, "chr2"=TRUE,
>>>>> + "chr3"=TRUE, "chr4"=TRUE,
>>>>> + "chr5"=TRUE, "chr6"=TRUE,
>>>>> + "chr7"=TRUE, "chr8"=TRUE,
>>>>> + "chr9"=TRUE, "chrM"=TRUE,
>>>>> + "chrX"=TRUE, "chrY"=TRUE)
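
The same named logical vector could probably be built with a pattern match instead of being typed out. A minimal sketch in base R, assuming the names come from seqlevels(ann) and that every unwanted sequence ends in "_random"; lv and active are hypothetical names, and lv holds only a toy subset of the real sequence names:

```r
# Hypothetical subset of the sequence names reported by seqlevels(ann).
lv <- c("chr1", "chr2", "chrX", "chrY", "chrM",
        "chr1_random", "chrUn_random")

# TRUE for the main chromosomes, FALSE for anything ending in "_random".
active <- setNames(!grepl("_random$", lv), lv)

# With GenomicFeatures loaded and 'ann' a TranscriptDb, the vector would
# then be used like this (not run here):
#   isActiveSeq(ann) <- active
```
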
>>>>> Then get my gene info this way?
>>>>>> genesInfo<- exons(ann, columns='gene_id')
>>>>> How can I also add gene names or symbols?
>>>> Again, it depends on what you want to do. If you are dealing with just
>>>> a TranscriptDb like this, then there are not gene symbols attached
>>>> already. So for that object yes, you would need to do it like that.
>>>>
>>>>
>>>> Now if you wanted gene symbols they are pretty easy to get. You can go
>>>> and get gene symbols by loading an org package like this:
>>>>
>>>> library(org.Mm.eg.db)
>>>> cols(org.Mm.eg.db)
>>>>
>>>> And then you could use select() to retrieve those (by using the gene
>>>> IDs as keys from your TranscriptDb object). But some people find this
>>>> extra step to be inconvenient, so I have created another route. And
>>>> that is to use an OrganismDb object. There is a package for this
>>>> already that I will use to demo here:
>>>>
>>>> library(Mus.musculus)
>>>> cols(Mus.musculus)
>>>>
>>>> You will notice that this kind of object has all the stuff you want in
>>>> one place. I suspect that you will find that to be a lot more
>>>> convenient.
>>>>
>>>> This basically means that you can do the thing you were interested in
>>>> like this:
>>>>
>>>> genesInfo <- exons(Mus.musculus, columns=c("GENEID","SYMBOL"))
>>>>
>>>> *But* in your case, you are using mm9, so you will want to make a
>>>> custom object (the Mus.musculus object is based on mm10). This is not
>>>> hard to do, but it is an extra step. You can read how to do it in
>>>> section 2 of the following vignette:
>>>>
>>>> http://www.bioconductor.org/packages/2.12/bioc/vignettes/OrganismDbi/inst/doc/OrganismDbi.pdf
>>>>
>>>>
>>>>
>>>>
>>>>>
>>>>>
>>>>>
>>>>> Thank you again for your help.
>>>>>
>>>>> Ugo
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> From: Marc Carlson <mcarlson at fhcrc.org>
>>>>>> Date: Thu, 02 May 2013 15:52:38 -0700
>>>>>> To: Ugo Borello <ugo.borello at inserm.fr>
>>>>>> Cc: <bioconductor at r-project.org>
>>>>>> Subject: Re: [BioC] MakeTranscriptDbFromGFF
>>>>>>
>>>>>> Hi Ugo,
>>>>>>
>>>>>> The 15 GB tarball you sent me to contains several different GTF files
>>>>>> for genes. I grabbed this one as it seemed to be the most recent:
>>>>>>
>>>>>> Mus_musculus/UCSC/mm9/Annotation/Archives/archive-2013-03-06-15-01-24/Genes/genes.gtf
>>>>>>
>>>>>> So looking at this file I can reproduce the problem you mentioned, and
>>>>>> it shows me two problems. The first problem is that the only field that
>>>>>> seems to contain any information about exon positions is called "phase"
>>>>>> (and not "exon_number" as in the code I saw before). There is not
>>>>>> actually any field called "exon_number" in this file. Either way, one
>>>>>> thing you can check is that the string you give here is the same as the
>>>>>> appropriate field name used by the file. There is no way to know this
>>>>>> in advance, since GTF does not specify how to encode this information
>>>>>> (and in fact the information is entirely optional).
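
One way to check which attribute names a GTF file actually uses is to read them out of its ninth column. A minimal sketch in base R, assuming the usual tab-separated GTF layout with 'key "value";' pairs; the two lines are modeled on rows shown in this thread, and gtf_lines, attr_col, pairs and attr_names are hypothetical names:

```r
# Two GTF lines in the usual nine-column, tab-separated layout.
gtf_lines <- c(
  "chr1\tunknown\texon\t3204563\t3207049\t.\t-\t.\tgene_id \"Xkr4\"; transcript_id \"NM_001011874\";",
  "chr1\tunknown\tCDS\t3206106\t3207049\t.\t-\t2\tgene_id \"Xkr4\"; transcript_id \"NM_001011874\"; p_id \"P2739\";"
)

# The ninth column holds the 'key "value";' attribute pairs.
attr_col <- vapply(strsplit(gtf_lines, "\t", fixed = TRUE),
                   function(f) f[9], character(1))

# Split each attribute string on ';' and keep the first word of each pair.
pairs <- trimws(unlist(strsplit(attr_col, ";", fixed = TRUE)))
attr_names <- unique(sub(" .*$", "", pairs[nzchar(pairs)]))

# An exon-rank attribute is usable only if it appears among these names.
has_exon_number <- "exon_number" %in% attr_names
```

On a real file, readLines() with a modest n would supply gtf_lines; comment lines starting with "#" would need to be dropped first.
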
>>>>>>
>>>>>> The second problem is that even "phase" can't work right now, since the
>>>>>> authors of this GTF file have decided to associate the exon rank
>>>>>> information only with CDS features and never with exon features. So
>>>>>> there is not any actual 'exon' position information in this file, only
>>>>>> information for CDS positions. Now that I see people producing files
>>>>>> in this way, I plan to enhance the parser so that it can process files
>>>>>> of this kind.
>>>>>>
>>>>>>
>>>>>> Is there a reason why you wanted to use this file and not the data
>>>>>> contained in this package here?
>>>>>>
>>>>>> http://www.bioconductor.org/packages/2.12/data/annotation/html/TxDb.Mmusculus.UCSC.mm9.knownGene.html
>>>>>>
>>>>>>
>>>>>> Marc
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 05/02/2013 02:59 AM, Ugo Borello wrote:
>>>>>>> Dear Marc
>>>>>>> Sorry I was not precise on the origin of the gtf annotation file; I got
>>>>>>> the gtf file from here:
>>>>>>> http://tophat.cbcb.umd.edu/igenomes.shtml
>>>>>>>
>>>>>>> And more precisely from the Mus musculus/UCSC/mm9 folder.
>>>>>>> Here is the description of the content of the folder:
>>>>>>> ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/README.txt
>>>>>>>
>>>>>>> Reading the README.txt file, I realized that the genes.gtf file I
>>>>>>> used is actually the Ensembl annotation of the mm9 release.
>>>>>>> So, I changed dataSource = "Ensembl" in the function call and I got the
>>>>>>> same error message:
>>>>>>> Error in data.frame(..., check.names = FALSE) :
>>>>>>> arguments imply differing number of rows: 541775, 0
>>>>>>>
>>>>>>> At the end of my previous email you have the result of calling:
>>>>>>> annFile<- import.gff('genes.gtf', format='gtf', asRangedData=FALSE)
>>>>>>>
>>>>>>>
>>>>>>> Thank you
>>>>>>>
>>>>>>> Ugo
>>>>>>>
>>>>>>>> From: Marc Carlson <mcarlson at fhcrc.org>
>>>>>>>> Date: Wed, 01 May 2013 15:33:02 -0700
>>>>>>>> To: <bioconductor at r-project.org>
>>>>>>>> Subject: Re: [BioC] MakeTranscriptDbFromGFF
>>>>>>>>
>>>>>>>> Hi Ugo,
>>>>>>>>
>>>>>>>> Which UCSC file was it that you were trying to process?
>>>>>>>>
>>>>>>>>
>>>>>>>> Marc
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 05/01/2013 02:21 AM, Ugo Borello wrote:
>>>>>>>>> Good morning,
>>>>>>>>>
>>>>>>>>> I have a little problem creating a TranscriptDb object using the
>>>>>>>>> function makeTranscriptDbFromGFF. I want to use this annotation to
>>>>>>>>> count the overlaps of my genomic alignments with genes.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I ran:
>>>>>>>>>
>>>>>>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
>>>>>>>>> + exonRankAttributeName = "exon_number",
>>>>>>>>> + chrominfo = chrominfo,
>>>>>>>>> + dataSource = "UCSC",
>>>>>>>>> + species = "Mus musculus")
>>>>>>>>>
>>>>>>>>> And I got this error message:
>>>>>>>>> Error in data.frame(..., check.names = FALSE) :
>>>>>>>>> arguments imply differing number of rows: 541775, 0
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> "chrominfo" was (info retrieved from the fasta genome file):
>>>>>>>>>
>>>>>>>>>> chrominfo
>>>>>>>>> chrom length is_circular
>>>>>>>>> 1 chr10 129993255 FALSE
>>>>>>>>> 2 chr11 121843856 FALSE
>>>>>>>>> 3 chr12 121257530 FALSE
>>>>>>>>> 4 chr13 120284312 FALSE
>>>>>>>>> 5 chr14 125194864 FALSE
>>>>>>>>> 6 chr15 103494974 FALSE
>>>>>>>>> 7 chr16 98319150 FALSE
>>>>>>>>> 8 chr17 95272651 FALSE
>>>>>>>>> 9 chr18 90772031 FALSE
>>>>>>>>> 10 chr19 61342430 FALSE
>>>>>>>>> 11 chr1 197195432 FALSE
>>>>>>>>> 12 chr2 181748087 FALSE
>>>>>>>>> 13 chr3 159599783 FALSE
>>>>>>>>> 14 chr4 155630120 FALSE
>>>>>>>>> 15 chr5 152537259 FALSE
>>>>>>>>> 16 chr6 149517037 FALSE
>>>>>>>>> 17 chr7 152524553 FALSE
>>>>>>>>> 18 chr8 131738871 FALSE
>>>>>>>>> 19 chr9 124076172 FALSE
>>>>>>>>> 20 chrM 16299 TRUE
>>>>>>>>> 21 chrX 166650296 FALSE
>>>>>>>>> 22 chrY 15902555 FALSE
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I ran it again without the "exonRankAttributeName" argument and I got:
>>>>>>>>>
>>>>>>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
>>>>>>>>> + chrominfo = chrominfo,
>>>>>>>>> + dataSource = "UCSC",
>>>>>>>>> + species = "Mus musculus")
>>>>>>>>> extracting transcript information
>>>>>>>>> Estimating transcript ranges.
>>>>>>>>> Extracting gene IDs
>>>>>>>>> Processing splicing information for gtf file.
>>>>>>>>> Deducing exon rank from relative coordinates provided
>>>>>>>>> Prepare the 'metadata' data frame ... metadata: OK
>>>>>>>>> Error in .checkForeignKey(transcripts_tx_chrom, NA, "transcripts$tx_chrom", :
>>>>>>>>>   all the values in 'transcripts$tx_chrom' must be present in 'chrominfo$chrom'
>>>>>>>>> In addition: Warning message:
>>>>>>>>> In .deduceExonRankings(exs, format = "gtf") :
>>>>>>>>>   Infering Exon Rankings. If this is not what you expected, then please be
>>>>>>>>>   sure that you have provided a valid attribute for exonRankAttributeName
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Without the "chrominfo" argument I got the same error message as the
>>>>>>>>> first time:
>>>>>>>>>
>>>>>>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
>>>>>>>>> + exonRankAttributeName = "exon_number",
>>>>>>>>> + dataSource = "UCSC",
>>>>>>>>> + species = "Mus musculus")
>>>>>>>>>
>>>>>>>>> Error in data.frame(..., check.names = FALSE) :
>>>>>>>>> arguments imply differing number of rows: 541775, 0
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Finally, when I eliminated both the "exonRankAttributeName" and the
>>>>>>>>> "chrominfo" arguments it worked, but the warning reminded me of the
>>>>>>>>> "exonRankAttributeName" argument, the chromosome names are now
>>>>>>>>> different from the ones in the genome file, and there is no info on
>>>>>>>>> their length:
>>>>>>>>>
>>>>>>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
>>>>>>>>> + dataSource = "UCSC",
>>>>>>>>> + species = "Mus musculus")
>>>>>>>>> extracting transcript information
>>>>>>>>> Estimating transcript ranges.
>>>>>>>>> Extracting gene IDs
>>>>>>>>> Processing splicing information for gtf file.
>>>>>>>>> Deducing exon rank from relative coordinates provided
>>>>>>>>> Prepare the 'metadata' data frame ... metadata: OK
>>>>>>>>> Now generating chrominfo from available sequence names. No chromosome length
>>>>>>>>> information is available.
>>>>>>>>> Warning messages:
>>>>>>>>> 1: In .deduceExonRankings(exs, format = "gtf") :
>>>>>>>>>   Infering Exon Rankings. If this is not what you expected, then please be
>>>>>>>>>   sure that you have provided a valid attribute for exonRankAttributeName
>>>>>>>>> 2: In matchCircularity(chroms, circ_seqs) :
>>>>>>>>>   None of the strings in your circ_seqs argument match your seqnames.
>>>>>>>>>
>>>>>>>>>> seqinfo(txdb)
>>>>>>>>> Seqinfo of length 32
>>>>>>>>> seqnames seqlengths isCircular genome
>>>>>>>>> chr13 <NA> FALSE <NA>
>>>>>>>>> chr9 <NA> FALSE <NA>
>>>>>>>>> chr6 <NA> FALSE <NA>
>>>>>>>>> chrX <NA> FALSE <NA>
>>>>>>>>> chr17 <NA> FALSE <NA>
>>>>>>>>> chr2 <NA> FALSE <NA>
>>>>>>>>> chr7 <NA> FALSE <NA>
>>>>>>>>> chr18 <NA> FALSE <NA>
>>>>>>>>> chr8 <NA> FALSE <NA>
>>>>>>>>> ... ... ... ...
>>>>>>>>> chrY_random <NA> FALSE <NA>
>>>>>>>>> chrX_random <NA> FALSE <NA>
>>>>>>>>> chr5_random <NA> FALSE <NA>
>>>>>>>>> chr4_random <NA> FALSE <NA>
>>>>>>>>> chrY <NA> FALSE <NA>
>>>>>>>>> chr7_random <NA> FALSE <NA>
>>>>>>>>> chr17_random <NA> FALSE <NA>
>>>>>>>>> chr13_random <NA> FALSE <NA>
>>>>>>>>> chr1_random <NA> FALSE <NA>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> What am I doing wrong in my original call to
>>>>>>>>> makeTranscriptDbFromGFF?
>>>>>>>>>
>>>>>>>>> txdb <-makeTranscriptDbFromGFF(file = annFile, format = "gtf",
>>>>>>>>> exonRankAttributeName = "exon_number",
>>>>>>>>> chrominfo = chrominfo,
>>>>>>>>> dataSource = "UCSC",
>>>>>>>>> species = "Mus musculus")
>>>>>>>>>
>>>>>>>>> Why am I getting this unfair error message?
>>>>>>>>> Thank you for your help
>>>>>>>>> Ugo
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> FYI, this is my annFile (my gtf annotation file was downloaded together
>>>>>>>>> with a fasta file containing the mouse genome from UCSC):
>>>>>>>>>
>>>>>>>>>> annFile
>>>>>>>>> GRanges with 595632 ranges and 9 metadata columns:
>>>>>>>>>             seqnames               ranges strand   |   source        type     score     phase
>>>>>>>>>                <Rle>            <IRanges>  <Rle>   | <factor>    <factor> <numeric> <integer>
>>>>>>>>>      [1]        chr1   [3204563, 3207049]      -   |  unknown        exon      <NA>      <NA>
>>>>>>>>>      [2]        chr1   [3206103, 3206105]      -   |  unknown  stop_codon      <NA>      <NA>
>>>>>>>>>      [3]        chr1   [3206106, 3207049]      -   |  unknown         CDS      <NA>         2
>>>>>>>>>      [4]        chr1   [3411783, 3411982]      -   |  unknown         CDS      <NA>         1
>>>>>>>>>      [5]        chr1   [3411783, 3411982]      -   |  unknown        exon      <NA>      <NA>
>>>>>>>>>      ...         ...                  ...    ... ...      ...         ...       ...       ...
>>>>>>>>> [595628] chrY_random [54422360, 54422362]      +   |  unknown  stop_codon      <NA>      <NA>
>>>>>>>>> [595629] chrY_random [58501955, 58502946]      +   |  unknown        exon      <NA>      <NA>
>>>>>>>>> [595630] chrY_random [58502132, 58502812]      +   |  unknown         CDS      <NA>         0
>>>>>>>>> [595631] chrY_random [58502132, 58502134]      +   |  unknown start_codon      <NA>      <NA>
>>>>>>>>> [595632] chrY_random [58502813, 58502815]      +   |  unknown  stop_codon      <NA>      <NA>
>>>>>>>>>               gene_id  transcript_id    gene_name        p_id      tss_id
>>>>>>>>>           <character>    <character>  <character> <character> <character>
>>>>>>>>>      [1]         Xkr4   NM_001011874         Xkr4       P2739     TSS1881
>>>>>>>>>      [2]         Xkr4   NM_001011874         Xkr4       P2739     TSS1881
>>>>>>>>>      [3]         Xkr4   NM_001011874         Xkr4       P2739     TSS1881
>>>>>>>>>      [4]         Xkr4   NM_001011874         Xkr4       P2739     TSS1881
>>>>>>>>>      [5]         Xkr4   NM_001011874         Xkr4       P2739     TSS1881
>>>>>>>>>      ...          ...            ...          ...         ...         ...
>>>>>>>>> [595628] LOC100039753   NM_001017394 LOC100039753      P10196    TSS19491
>>>>>>>>> [595629] LOC100039614 NM_001160137_4 LOC100039614      P22060     TSS4342
>>>>>>>>> [595630] LOC100039614 NM_001160137_4 LOC100039614      P22060     TSS4342
>>>>>>>>> [595631] LOC100039614 NM_001160137_4 LOC100039614      P22060     TSS4342
>>>>>>>>> [595632] LOC100039614 NM_001160137_4 LOC100039614      P22060     TSS4342
>>>>>>>>> ---
>>>>>>>>> seqlengths:
>>>>>>>>> chr1 chr10 chr11 chr12 ... chrX_random chrY chrY_random
>>>>>>>>>   NA    NA    NA    NA ...          NA   NA          NA
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> sessionInfo()
>>>>>>>>> R version 3.0.0 (2013-04-03)
>>>>>>>>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>>>>>>>>>
>>>>>>>>> locale:
>>>>>>>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>>>>>>>>
>>>>>>>>> attached base packages:
>>>>>>>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods   base
>>>>>>>>>
>>>>>>>>> other attached packages:
>>>>>>>>>  [1] rtracklayer_1.20.1     Rbowtie_1.0.2          Rsamtools_1.12.2       Biostrings_2.28.0
>>>>>>>>>  [5] GenomicFeatures_1.12.0 AnnotationDbi_1.22.1   Biobase_2.20.0         GenomicRanges_1.12.2
>>>>>>>>>  [9] IRanges_1.18.0         BiocGenerics_0.6.0
>>>>>>>>>
>>>>>>>>> loaded via a namespace (and not attached):
>>>>>>>>>  [1] BiocInstaller_1.10.0 biomaRt_2.16.0       bitops_1.0-5         BSgenome_1.28.0
>>>>>>>>>  [5] DBI_0.2-5            grid_3.0.0           hwriter_1.3          lattice_0.20-15
>>>>>>>>>  [9] RCurl_1.95-4.1       RSQLite_0.11.3       ShortRead_1.18.0     stats4_3.0.0
>>>>>>>>> [13] tools_3.0.0          XML_3.95-0.2         zlibbioc_1.6.0
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Bioconductor mailing list
>>>>>>>>> Bioconductor at r-project.org
>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>>> Search the archives:
>>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor