[BioC] GenomicFeatures: Problem with makeTranscriptDbFromGFF
Katja Hebestreit
katjah at stanford.edu
Tue Apr 15 23:02:17 CEST 2014
Thank you so much for pointing that out!!
Unfortunately, there are a bunch of genes with exomes on different chromosomes. And this is also the case for newer versions of that .gtf file (refFlat from UCSC). Do you have an idea when you will fix that?
Cheers,
Katja
----- Original Message -----
From: "Martin Morgan" <mtmorgan at fhcrc.org>
To: katjah at stanford.edu, bioconductor at r-project.org
Sent: Tuesday, April 15, 2014 1:27:30 PM
Subject: Re: [BioC] GenomicFeatures: Problem with makeTranscriptDbFromGFF
On 04/15/2014 10:53 AM, Maintainer wrote:
> Dear Bioconductor maintainers,
>
> I still have a problem using makeTranscriptDbFromGFF:
>
> txdb <- makeTranscriptDbFromGFF(file="Data/mm9_test_noSpace.gtf",
> format="gtf")
> extracting transcript information
> Estimating transcript ranges.
> Error in res[i, ] <- .deduceTranscriptRangeData(subs[[i]]) :
> number of items to replace is not a multiple of replacement
>
If I
options(error=recover)
then
> xx = makeTranscriptDbFromGFF(file="mm9_test_noSpace.gtf", format="gtf")
extracting transcript information
Estimating transcript ranges.
Error in res[i, ] <- .deduceTranscriptRangeData(subs[[i]]) :
number of items to replace is not a multiple of replacement length
Enter a frame number, or 0 to exit
1: makeTranscriptDbFromGFF(file = "mm9_test_noSpace.gtf", format = "gtf")
2: .prepareGTFTables(gff, exonRankAttributeName)
3: .prepareGTFtranscripts(data)
4: .deduceTranscriptsFromGTF(transcripts)
Selection: 4
Called from: .deduceTranscriptsFromGTF(transcripts)
Browse[1]>
then look around a little, e.g.,
Browse[1]> .deduceTranscriptsFromGTF
## ...
I spot the offending line
for (i in seq_len(length(trns))) {
res[i, ] <- .deduceTranscriptRangeData(subs[[i]])
and look at subs[[i]] and .deduceTranscriptRangeData(subs[[i]])
Browse[1]> subs[[i]]
seqnames start end strand type gene_id transcript_id exon_rank
47369 chr2 36245431 36245515 + exon Mir684-1 Mir684-1 NA
147043 chr5 23763346 23763430 + exon Mir684-1 Mir684-1 NA
Browse[1]> .deduceTranscriptRangeData(subs[[i]])
[1] "Mir684-1" "chr2" "chr5" "+" "23763346" "36245515" "Mir684-1"
I guess the problem is that the exons are coming from two separate chromosomes,
and .deduceTranscriptRangeData does not handle that case successfully.
Hopefully the code will be made more robust / efficient, but in the mean time
you could edit your file to remove the troublesome Mir684-1.
Martin
> I uploaded the file here:
> https://www.dropbox.com/s/x5ne8qz8sqbxvrp/mm9_test_noSpace.gtf
>
> I tried to reduce the file further, in order to get a smaller example file and to check which lines cause the problem, but:
> mm9_test_noSpace.gtf is 200,000 lines long and it results in an error.
> When I am using the first 100,000 (or, last 100,000 lines, respectively) it works!
>
> head -n 100000 mm9_test_noSpace.gtf > mm9_test_test_noSpace.gtf
> --> works!
> tail -n 100000 mm9_test_noSpace.gtf > mm9_test_test_noSpace.gtf
> --> works!
>
> I have no idea what is going on. Also, the .gtf file is from UCSC.
>
> Do you have any idea what the problem is?
> Thank you so much!
> Katja
>
>
> ----- Forwarded Message -----
> From: "Michael Lawrence" <lawrence.michael at gene.com>
> To: "Katja Hebestreit" <katjah at stanford.edu>
> Cc: "Michael Lawrence" <lawrence.michael at gene.com>
> Sent: Monday, April 14, 2014 7:37:23 PM
> Subject: Re: [BioC] GenomicFeatures: Problem with makeTranscriptDbFromGFF
>
> Sorry, I hadn't tested the makeTranscriptDbFromGFF function. I didn't write
> that one, so the maintainers of GenomicFeatures will need to help now.
>
>
> On Mon, Apr 14, 2014 at 5:31 PM, Katja Hebestreit <katjah at stanford.edu>wrote:
>
>> Great! Thank you so much, Micheal!
>>
>> Unfortunately, I still have problems with this file:
>>
>> txdb <- makeTranscriptDbFromGFF(file="Data/mm9_test_noSpace.gtf",
>> format="gtf")
>> extracting transcript information
>> Estimating transcript ranges.
>> Error in res[i, ] <- .deduceTranscriptRangeData(subs[[i]]) :
>> number of items to replace is not a multiple of replacement
>>
>> To determine the gene lengths I wanted to use the exact .gtf file we used
>> for counting the reads per gene some time ago.
>>
>>
>> ----- Original Message -----
>> From: "Michael Lawrence" <lawrence.michael at gene.com>
>> To: "Katja Hebestreit" <katjah at stanford.edu>
>> Cc: "Vincent Carey" <stvjc at channing.harvard.edu>, "Michael Lawrence" <
>> lawrence.michael at gene.com>, "Rsamtools Maintainer" <
>> maintainer at bioconductor.org>, bioconductor at r-project.org
>> Sent: Monday, April 14, 2014 2:55:06 PM
>> Subject: Re: [BioC] GenomicFeatures: Problem with makeTranscriptDbFromGFF
>>
>> I've made rtracklayer 1.25.1 ignore whitespace after the last attribute.
>> After that change, both the truncated and full file parse successfully.
>> Btw, which genes are these? The UCSC genes are already available as a TxDb
>> package.
>>
>>
>> On Mon, Apr 14, 2014 at 2:36 PM, Katja Hebestreit <katjah at stanford.edu
>>> wrote:
>>
>>> Okay, I removed the whitspaces with
>>> sed 's/.$//' mm9_test.gtf > mm9_test_noSpace.gtf
>>>
>>> Here is the file:
>>> https://www.dropbox.com/s/x5ne8qz8sqbxvrp/mm9_test_noSpace.gtf
>>>
>>> But still, I get an error:
>>>
>>> txdb <- makeTranscriptDbFromGFF(file="Data/mm9_test_noSpace.gtf",
>>> format="gtf")
>>> extracting transcript information
>>> Estimating transcript ranges.
>>> Error in res[i, ] <- .deduceTranscriptRangeData(subs[[i]]) :
>>> number of items to replace is not a multiple of replacement
>>>
>>>
>>> I tried to reduce the file further, in order to get a smaller example
>> file
>>> and to check which lines cause the problem, but:
>>>
>>> mm9_test_noSpace.gtf is 200,000 lines long and it results in an error.
>>> When I am using the first 100,000 (or, last 100,000 lines, respectively)
>> it
>>> works!
>>>
>>> head -n 100000 mm9_test_noSpace.gtf > mm9_test_test_noSpace.gtf
>>> --> works!
>>> tail -n 100000 mm9_test_noSpace.gtf > mm9_test_test_noSpace.gtf
>>> --> works!
>>>
>>> I have no idea what is going on. Also, the .gtf file is from UCSC.
>>>
>>> Thanks again for helping with that annoying problem!
>>> Katja
>>>
>>>
>>> ----- Original Message -----
>>> From: "Vincent Carey" <stvjc at channing.harvard.ed
>>> To: "Katja Hebestreit" <katjah at stanford.edu>
>>> Cc: "Michael Lawrence" <lawrence.michael at gene.com>, "Rsamtools
>>> Maintainer" <maintainer at bioconductor.org>, bioconductor at r-project.org
>>> Sent: Monday, April 14, 2014 11:45:30 AM
>>> Subject: Re: [BioC] GenomicFeatures: Problem with makeTranscriptDbFromGFF
>>>
>>> remove the trailing whitespace at the end of every line
>>>
>>>
>>> On Mon, Apr 14, 2014 at 2:24 PM, Katja Hebestreit <katjah at stanford.edu
>>>> wrote:
>>>
>>>> You can download the file here:
>>>>
>>>> https://www.dropbox.com/s/04nck83jq6r91bc/mm9_test.gtf
>>>>
>>>> Using file I get the error:
>>>>
>>>> txdb <- makeTranscriptDbFromGFF(file="Data/mm9_test.gtf", format="gtf")
>>>> Error in .parse_attrCol(attrCol, file, colnames) :
>>>> Some attributes do not conform to 'tag value' format
>>>>
>>>> Thank you so much for helping!!
>>>> Katja
>>>>
>>>>
>>>> ----- Original Message -----
>>>> From: "Michael Lawrence" <lawrence.michael at gene.com>
>>>> To: "Katja Hebestreit" <katjah at stanford.edu>
>>>> Cc: "Michael Lawrence" <lawrence.michael at gene.com>,
>>>> bioconductor at r-project.org, "Rsamtools Maintainer" <
>>>> maintainer at bioconductor.org>
>>>> Sent: Monday, April 14, 2014 7:27:26 AM
>>>> Subject: Re: [BioC] GenomicFeatures: Problem with
>> makeTranscriptDbFromGFF
>>>>
>>>> Well, I copied the text and replaced the spaces with tabs as
>> appropriate
>>>> and everything seemed to work fine, so you might to attach that
>> fragment
>>> of
>>>> the file, just to be sure it isn't a formatting issue.
>>>>
>>>> Does import("file.gtf") work for you? If so, that should be good enough
>>> for
>>>> your use case.
>>>>
>>>> Michael
>>>>
>>>>
>>>> On Sun, Apr 13, 2014 at 10:14 PM, Katja Hebestreit <
>> katjah at stanford.edu
>>>>> wrote:
>>>>
>>>>> Actually, the error was not reproducible with the lines I attached.
>> But
>>>> it
>>>>> is reproducible with those lines (four additional lines):
>>>>>
>>>>> chr1 mm9_refFlat stop_codon 3206103 3206105 0.000000
>>> -
>>>>> . gene_id "Xkr4"; transcript_id "Xkr4";
>>>>> chr1 mm9_refFlat CDS 3206106 3207049 0.000000 -
>>> 2
>>>>> gene_id "Xkr4"; transcript_id "Xkr4";
>>>>> chr1 mm9_refFlat exon 3204563 3207049 0.000000 -
>>> .
>>>>> gene_id "Xkr4"; transcript_id "Xkr4";
>>>>> chr1 mm9_refFlat CDS 3411783 3411982 0.000000 -
>>> 1
>>>>> gene_id "Xkr4"; transcript_id "Xkr4";
>>>>> chr1 mm9_refFlat exon 3411783 3411982 0.000000 -
>>> .
>>>>> gene_id "Xkr4"; transcript_id "Xkr4";
>>>>> chr1 mm9_refFlat CDS 3660633 3661429 0.000000 -
>>> 0
>>>>> gene_id "Xkr4"; transcript_id "Xkr4";
>>>>> chr1 mm9_refFlat start_codon 3661427 3661429 0.000000
>>> -
>>>>> . gene_id "Xkr4"; transcript_id "Xkr4";
>>>>> chr1 mm9_refFlat exon 3660633 3661579 0.000000 -
>>> .
>>>>> gene_id "Xkr4"; transcript_id "Xkr4";
>>>>> chr1 mm9_refFlat stop_codon 4283062 4283064 0.000000
>>> -
>>>>> . gene_id "Rp1"; transcript_id "Rp1";
>>>>> chr1 mm9_refFlat CDS 4283065 4283093 0.000000 -
>>> 2
>>>>> gene_id "Rp1"; transcript_id "Rp1";
>>>>>
>>>>> Let me know if you like to get the entire file.
>>>>>
>>>>> Thank you!!
>>>>> Katja
>>>>>
>>>>> ----- Original Message -----
>>>>> From: "Michael Lawrence" <lawrence.michael at gene.com>
>>>>> To: "Katja Hebestreit" <katjah at stanford.edu>
>>>>> Cc: bioconductor at r-project.org, "Rsamtools Maintainer" <
>>>>> maintainer at bioconductor.org>
>>>>> Sent: Sunday, April 13, 2014 10:02:13 PM
>>>>> Subject: Re: [BioC] GenomicFeatures: Problem with
>>> makeTranscriptDbFromGFF
>>>>>
>>>>> On Sun, Apr 13, 2014 at 7:18 PM, Katja Hebestreit <
>> katjah at stanford.edu
>>>>>> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I get an error when I try to import my gff file:
>>>>>>
>>>>>> txdb <- makeTranscriptDbFromGFF(file="file.gtf", format="gtf")
>>>>>>
>>>>>> Error in .parse_attrCol(attrCol, file, colnames) :
>>>>>> Some attributes do not conform to 'tag value' format
>>>>>>
>>>>>> This is how my file looks like:
>>>>>>
>>>>>> chr1 mm9_refFlat stop_codon 3206103 3206105 0.000000
>>>> -
>>>>>> . gene_id "Xkr4"; transcript_id "Xkr4";
>>>>>> chr1 mm9_refFlat CDS 3206106 3207049 0.000000 -
>>>> 2
>>>>>> gene_id "Xkr4"; transcript_id "Xkr4";
>>>>>> chr1 mm9_refFlat exon 3204563 3207049 0.000000 -
>>>> .
>>>>>> gene_id "Xkr4"; transcript_id "Xkr4";
>>>>>> chr1 mm9_refFlat CDS 3411783 3411982 0.000000 -
>>>> 1
>>>>>> gene_id "Xkr4"; transcript_id "Xkr4";
>>>>>> chr1 mm9_refFlat exon 3411783 3411982 0.000000 -
>>>> .
>>>>>> gene_id "Xkr4"; transcript_id "Xkr4";
>>>>>> chr1 mm9_refFlat CDS 3660633 3661429 0.000000 -
>>>> 0
>>>>>> gene_id "Xkr4"; transcript_id "Xkr4";
>>>>>>
>>>>>> I have the feeling that this has something to do with the missing
>>> exon
>>>>>> rank information in my file. Is that true? Is there a way to import
>>>> this
>>>>>> file? All I want to do is to determine the gene lengths.
>>>>>>
>>>>>
>>>>> It is most likely as the error says: some of your attributes are
>>>> malformed.
>>>>> Is that the entire file listed above, or is there more? If you could
>>> get
>>>> me
>>>>> the file somehow I could diagnose the issue.
>>>>>
>>>>>
>>>>>>
>>>>>> Could anyone help? That would be awesome!
>>>>>> Cheers,
>>>>>> Katja
>>>>>>
>>>>>>
>>>>>> sessionInfo()
>>>>>> R version 3.1.0 (2014-04-10)
>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>>>
>>>>>> locale:
>>>>>> [1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C
>>>>>> [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8
>>>>>> [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8
>>>>>> [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C
>>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>>>>> [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
>>>>>>
>>>>>> attached base packages:
>>>>>> [1] parallel stats graphics grDevices utils datasets
>>> methods
>>>>>> [8] base
>>>>>>
>>>>>> other attached packages:
>>>>>> [1] GenomicFeatures_1.16.0 AnnotationDbi_1.25.19 Biobase_2.23.6
>>>>>> [4] GenomicRanges_1.16.0 GenomeInfoDb_0.99.32 IRanges_1.21.45
>>>>>> [7] BiocGenerics_0.9.3 BiocInstaller_1.14.1
>>>>>>
>>>>>> loaded via a namespace (and not attached):
>>>>>> [1] BatchJobs_1.2 BBmisc_1.5
>>> BiocParallel_0.6.0
>>>>>> [4] biomaRt_2.20.0 Biostrings_2.32.0 bitops_1.0-6
>>>>>> [7] brew_1.0-6 BSgenome_1.32.0
>> codetools_0.2-8
>>>>>> [10] DBI_0.2-7 digest_0.6.4 fail_1.2
>>>>>> [13] foreach_1.4.2 GenomicAlignments_1.0.0
>> iterators_1.0.7
>>>>>> [16] plyr_1.8.1 Rcpp_0.11.1 RCurl_1.95-4.1
>>>>>> [19] Rsamtools_1.16.0 RSQLite_0.11.4
>>> rtracklayer_1.24.0
>>>>>> [22] sendmailR_1.1-2 stats4_3.1.0 stringr_0.6.2
>>>>>> [25] tools_3.1.0 XML_3.98-1.1 XVector_0.4.0
>>>>>> [28] zlibbioc_1.10.0
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at r-project.org
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>
>>>>>
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>
>>
>
> ________________________________________________________________________
> devteam-bioc mailing list
> To unsubscribe from this mailing list send a blank email to
> devteam-bioc-leave at lists.fhcrc.org
> You can also unsubscribe or change your personal options at
> https://lists.fhcrc.org/mailman/listinfo/devteam-bioc
>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
More information about the Bioconductor
mailing list