[BioC] GenomicFeatures: Problem with makeTranscriptDbFromGFF

Katja Hebestreit katjah at stanford.edu
Mon Apr 14 23:36:20 CEST 2014


Okay, I removed the whitespace with
sed 's/.$//' mm9_test.gtf > mm9_test_noSpace.gtf 
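
Note that `sed 's/.$//'` deletes the last character of every line whether or not it is whitespace, so a line that already ends in `;` would lose its semicolon. A minimal sketch of a stricter variant, assuming a POSIX sed; the function name strip_trailing_ws is hypothetical and only used for illustration:

```shell
# strip_trailing_ws FILE
# Print FILE with trailing spaces, tabs and carriage returns removed from
# every line. Lines without trailing whitespace are left unchanged.
# (Hypothetical helper name, not part of any package.)
strip_trailing_ws() {
    sed 's/[[:space:]]*$//' "$1"
}
```

It could be used as `strip_trailing_ws mm9_test.gtf > mm9_test_noSpace.gtf`.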

Here is the file:
https://www.dropbox.com/s/x5ne8qz8sqbxvrp/mm9_test_noSpace.gtf

But still, I get an error:

txdb <- makeTranscriptDbFromGFF(file="Data/mm9_test_noSpace.gtf", format="gtf")
extracting transcript information
Estimating transcript ranges.
Error in res[i, ] <- .deduceTranscriptRangeData(subs[[i]]) : 
  number of items to replace is not a multiple of replacement


I tried to reduce the file further, in order to get a smaller example file and to check which lines cause the problem, but:

mm9_test_noSpace.gtf is 200,000 lines long and it results in an error. When I use only the first 100,000 lines (or only the last 100,000 lines), it works!

head -n 100000 mm9_test_noSpace.gtf > mm9_test_test_noSpace.gtf
--> works!
tail -n 100000 mm9_test_noSpace.gtf > mm9_test_test_noSpace.gtf
--> works!

I have no idea what is going on. Also, the .gtf file is from UCSC.
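
One possible explanation, not confirmed anywhere in this thread, is that some transcript_id value occurs on more than one chromosome or strand, with the copies falling into different halves of the file. The sketch below lists such IDs. It assumes a tab-separated GTF with `transcript_id "..."` in column 9; the function name multi_locus_ids is hypothetical:

```shell
# multi_locus_ids FILE
# Print every transcript_id that appears on more than one
# chromosome/strand combination, one ID per line, sorted.
# (Hypothetical helper; assumes tab-separated GTF, attributes in column 9.)
multi_locus_ids() {
    awk -F'\t' '
        match($9, /transcript_id "[^"]*"/) {
            # Drop the leading 15 characters (transcript_id ") and the
            # closing quote to get the bare ID.
            id = substr($9, RSTART + 15, RLENGTH - 16)
            key = id SUBSEP $1 SUBSEP $7
            # Count each distinct (id, chromosome, strand) triple once.
            if (!(key in seen)) { seen[key] = 1; n[id]++ }
        }
        END { for (id in n) if (n[id] > 1) print id }
    ' "$1" | sort
}
```

Running `multi_locus_ids mm9_test_noSpace.gtf` and getting no output would rule this hypothesis out.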

Thanks again for helping with that annoying problem!
Katja


----- Original Message -----
From: "Vincent Carey" <stvjc at channing.harvard.ed
To: "Katja Hebestreit" <katjah at stanford.edu>
Cc: "Michael Lawrence" <lawrence.michael at gene.com>, "Rsamtools Maintainer" <maintainer at bioconductor.org>, bioconductor at r-project.org
Sent: Monday, April 14, 2014 11:45:30 AM
Subject: Re: [BioC] GenomicFeatures: Problem with makeTranscriptDbFromGFF

remove the trailing whitespace at the end of every line
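
A quick way to check whether a file still contains trailing whitespace, sketched with standard grep; the helper name count_trailing_ws is made up for illustration:

```shell
# count_trailing_ws FILE
# Print the number of lines in FILE that end in a space, tab or
# carriage return. (Hypothetical helper name.)
count_trailing_ws() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # the count is 0, so "|| true" keeps the helper usable under "set -e".
    grep -c '[[:space:]]$' "$1" || true
}
```

A result of 0 means no line ends in whitespace.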


On Mon, Apr 14, 2014 at 2:24 PM, Katja Hebestreit <katjah at stanford.edu> wrote:

> You can download the file here:
>
> https://www.dropbox.com/s/04nck83jq6r91bc/mm9_test.gtf
>
> Using this file I get the error:
>
> txdb <- makeTranscriptDbFromGFF(file="Data/mm9_test.gtf", format="gtf")
> Error in .parse_attrCol(attrCol, file, colnames) :
>   Some attributes do not conform to 'tag value' format
>
> Thank you so much for helping!!
> Katja
>
>
> ----- Original Message -----
> From: "Michael Lawrence" <lawrence.michael at gene.com>
> To: "Katja Hebestreit" <katjah at stanford.edu>
> Cc: "Michael Lawrence" <lawrence.michael at gene.com>,
> bioconductor at r-project.org, "Rsamtools Maintainer" <
> maintainer at bioconductor.org>
> Sent: Monday, April 14, 2014 7:27:26 AM
> Subject: Re: [BioC] GenomicFeatures: Problem with makeTranscriptDbFromGFF
>
> Well, I copied the text and replaced the spaces with tabs as appropriate
> and everything seemed to work fine, so you might want to attach that fragment of
> the file, just to be sure it isn't a formatting issue.
>
> Does import("file.gtf") work for you? If so, that should be good enough for
> your use case.
>
> Michael
>
>
> On Sun, Apr 13, 2014 at 10:14 PM, Katja Hebestreit <katjah at stanford.edu> wrote:
>
> > Actually, the error was not reproducible with the lines I attached. But it
> > is reproducible with those lines (four additional lines):
> >
> > chr1    mm9_refFlat     stop_codon      3206103 3206105 0.000000        -       .       gene_id "Xkr4"; transcript_id "Xkr4";
> > chr1    mm9_refFlat     CDS     3206106 3207049 0.000000        -       2       gene_id "Xkr4"; transcript_id "Xkr4";
> > chr1    mm9_refFlat     exon    3204563 3207049 0.000000        -       .       gene_id "Xkr4"; transcript_id "Xkr4";
> > chr1    mm9_refFlat     CDS     3411783 3411982 0.000000        -       1       gene_id "Xkr4"; transcript_id "Xkr4";
> > chr1    mm9_refFlat     exon    3411783 3411982 0.000000        -       .       gene_id "Xkr4"; transcript_id "Xkr4";
> > chr1    mm9_refFlat     CDS     3660633 3661429 0.000000        -       0       gene_id "Xkr4"; transcript_id "Xkr4";
> > chr1    mm9_refFlat     start_codon     3661427 3661429 0.000000        -       .       gene_id "Xkr4"; transcript_id "Xkr4";
> > chr1    mm9_refFlat     exon    3660633 3661579 0.000000        -       .       gene_id "Xkr4"; transcript_id "Xkr4";
> > chr1    mm9_refFlat     stop_codon      4283062 4283064 0.000000        -       .       gene_id "Rp1"; transcript_id "Rp1";
> > chr1    mm9_refFlat     CDS     4283065 4283093 0.000000        -       2       gene_id "Rp1"; transcript_id "Rp1";
> >
> > Let me know if you would like to get the entire file.
> >
> > Thank you!!
> > Katja
> >
> > ----- Original Message -----
> > From: "Michael Lawrence" <lawrence.michael at gene.com>
> > To: "Katja Hebestreit" <katjah at stanford.edu>
> > Cc: bioconductor at r-project.org, "Rsamtools Maintainer" <
> > maintainer at bioconductor.org>
> > Sent: Sunday, April 13, 2014 10:02:13 PM
> > Subject: Re: [BioC] GenomicFeatures: Problem with makeTranscriptDbFromGFF
> >
> > On Sun, Apr 13, 2014 at 7:18 PM, Katja Hebestreit <katjah at stanford.edu> wrote:
> >
> > > Hello,
> > >
> > > I get an error when I try to import my gff file:
> > >
> > > txdb <- makeTranscriptDbFromGFF(file="file.gtf", format="gtf")
> > >
> > > Error in .parse_attrCol(attrCol, file, colnames) :
> > >   Some attributes do not conform to 'tag value' format
> > >
> > > This is what my file looks like:
> > >
> > > chr1    mm9_refFlat     stop_codon      3206103 3206105 0.000000        -       .       gene_id "Xkr4"; transcript_id "Xkr4";
> > > chr1    mm9_refFlat     CDS     3206106 3207049 0.000000        -       2       gene_id "Xkr4"; transcript_id "Xkr4";
> > > chr1    mm9_refFlat     exon    3204563 3207049 0.000000        -       .       gene_id "Xkr4"; transcript_id "Xkr4";
> > > chr1    mm9_refFlat     CDS     3411783 3411982 0.000000        -       1       gene_id "Xkr4"; transcript_id "Xkr4";
> > > chr1    mm9_refFlat     exon    3411783 3411982 0.000000        -       .       gene_id "Xkr4"; transcript_id "Xkr4";
> > > chr1    mm9_refFlat     CDS     3660633 3661429 0.000000        -       0       gene_id "Xkr4"; transcript_id "Xkr4";
> > >
> > > I have the feeling that this has something to do with the missing exon
> > > rank information in my file. Is that true? Is there a way to import this
> > > file? All I want to do is to determine the gene lengths.
> > >
> >
> > It is most likely as the error says: some of your attributes are malformed.
> > Is that the entire file listed above, or is there more? If you could get me
> > the file somehow I could diagnose the issue.
> >
> >
> > >
> > > Could anyone help? That would be awesome!
> > > Cheers,
> > > Katja
> > >
> > >
> > > sessionInfo()
> > > R version 3.1.0 (2014-04-10)
> > > Platform: x86_64-unknown-linux-gnu (64-bit)
> > >
> > > locale:
> > >  [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C
> > >  [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=de_DE.UTF-8
> > >  [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=de_DE.UTF-8
> > >  [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C
> > >  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> > > [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
> > >
> > > attached base packages:
> > > [1] parallel  stats     graphics  grDevices utils     datasets  methods
> > > [8] base
> > >
> > > other attached packages:
> > > [1] GenomicFeatures_1.16.0 AnnotationDbi_1.25.19  Biobase_2.23.6
> > > [4] GenomicRanges_1.16.0   GenomeInfoDb_0.99.32   IRanges_1.21.45
> > > [7] BiocGenerics_0.9.3     BiocInstaller_1.14.1
> > >
> > > loaded via a namespace (and not attached):
> > >  [1] BatchJobs_1.2           BBmisc_1.5              BiocParallel_0.6.0
> > >  [4] biomaRt_2.20.0          Biostrings_2.32.0       bitops_1.0-6
> > >  [7] brew_1.0-6              BSgenome_1.32.0         codetools_0.2-8
> > > [10] DBI_0.2-7               digest_0.6.4            fail_1.2
> > > [13] foreach_1.4.2           GenomicAlignments_1.0.0 iterators_1.0.7
> > > [16] plyr_1.8.1              Rcpp_0.11.1             RCurl_1.95-4.1
> > > [19] Rsamtools_1.16.0        RSQLite_0.11.4          rtracklayer_1.24.0
> > > [22] sendmailR_1.1-2         stats4_3.1.0            stringr_0.6.2
> > > [25] tools_3.1.0             XML_3.98-1.1            XVector_0.4.0
> > > [28] zlibbioc_1.10.0
> > >
> > > _______________________________________________
> > > Bioconductor mailing list
> > > Bioconductor at r-project.org
> > > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > > Search the archives:
> > > http://news.gmane.org/gmane.science.biology.informatics.conductor
> > >
> >
>