[Bioc-devel] parsing embedded FASTA data

Hervé Pagès hpages at fhcrc.org
Tue Mar 18 19:42:42 CET 2014


Hi,

On 03/18/2014 10:04 AM, Michael Lawrence wrote:
>
>
>
> On Tue, Mar 18, 2014 at 7:54 AM, Gabriel Becker <gmbecker at ucdavis.edu
> <mailto:gmbecker at ucdavis.edu>> wrote:
>
>     Or going the positive declarative route but arguably more
>     informative: skip.to.fasta or fasta.only
>
>
> skip.to.fasta might work. A different algorithm that would work for GFF3
> would be skip.to.pragma="##FASTA", which would skip until it hit a line
> matching "##FASTA".
>
>
>     I don't know the GFF format spec, are we guaranteed that there will
>     be only one embedded fasta file and that it will be contiguous
>     within the file?
>
>
> Yes, it is guaranteed that after a certain point in the file (that
> pragma), all data is FASTA formatted.

Thanks for the suggestions. I think I'll go for 'seek.first.rec', just
to keep it generic and not tied to the specifics of the GFF, FASTA, or
FASTQ formats.

H.

>
>     If not the skip.to._ terminology would not technically be correct.
>
>     ~G
>
>
>     On Mon, Mar 17, 2014 at 8:17 PM, Michael Lawrence
>     <lawrence.michael at gene.com <mailto:lawrence.michael at gene.com>> wrote:
>
>         For direct reading of the sequence, the skip.non.fasta idea
>         sounds good. An
>         alternative for the name would be "skip.to.first.record". Up to you.
>
>         Michael
>
>
>         On Mon, Mar 17, 2014 at 5:33 PM, Hervé Pagès <hpages at fhcrc.org
>         <mailto:hpages at fhcrc.org>> wrote:
>
>          > Hi Michael,
>          >
>          >
>          > On 03/17/2014 04:15 PM, Michael Lawrence wrote:
>          >
>          >> Hi Herve,
>          >>
>          >> What would be a clean way for rtracklayer to extract the
>         (optional) FASTA
>          >> data embedded in a GFF3 file and parse it as an XStringSet?
>         Is there a
>          >> low-level way to pass in-memory data to the parser in
>         Biostrings?
>          >>
>          >
>          > Not that it can be used here, but readDNAStringSet() has the
>         'skip' arg
>          > which is analogous to the 'skip' arg of read.table(), except
>         that, in
>          > the case of readDNAStringSet(), it needs to be specified as
>         the number
>          > of records (FASTA or FASTQ) to skip before beginning to read in
>          > records. So the assumption is that everything before the
>         first record
>          > to read is valid FASTA (or FASTQ). Which is of course not the
>         case
>          > with those GFF3 files with embedded FASTA data.
>          >
>          > However it would be easy to add another arg, say
>         'skip.non.fasta.lines',
>          > to automatically skip lines that don't look like the header
>         of a FASTA
>          > record (i.e. that don't start with '>').
>          >
>          >
>          >
>          >> In terms of the API, import,GFFFile could return a GRanges
>         with the
>          >> DNAStringSet in the metadata(). Or there could be a method for
>          >> readDNAStringSet on GFF3File that returns the DNAStringSet
>         directly.
>          >>
>          >
>          > The readDNAStringSet,GFF3File method seems cleaner than the
>         metadata()
>          > solution. It's also lower-level and would be needed behind
>         the scene by
>          > import,GFFFile, so I think it would make sense to start with it.
>          > Implementing readDNAStringSet,GFF3File will be trivial once
>         we have
>          > something like the 'skip.non.fasta' arg. Should I go for it?
>         Any better
>          > suggestion for the name of this arg?
>          >
>          > Thanks,
>          > H.
>          >
>          >
>          >> It turns out this functionality is useful when working with
>         microbial
>          >> genomes, where information tends to be passed around as
>         Genbank files. For
>          >> right now the easiest path seems to be to convert Genbank to
>         GFF, but a
>          >> Genbank parser in Bioc could be an eventual goal. It's a
>         very complex file
>          >> format.
>          >>
>          >> Michael
>          >>
>          >>         [[alternative HTML version deleted]]
>          >>
>          >> _______________________________________________
>          >> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>         mailing list
>          >> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>          >>
>          >>
>          > --
>          > Hervé Pagès
>          >
>          > Program in Computational Biology
>          > Division of Public Health Sciences
>          > Fred Hutchinson Cancer Research Center
>          > 1100 Fairview Ave. N, M1-B514
>          > P.O. Box 19024
>          > Seattle, WA 98109-1024
>          >
>          > E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>          > Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>          > Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>          >
>
>                  [[alternative HTML version deleted]]
>
>
>         _______________________________________________
>         Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>         mailing list
>         https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>
>
>
>     --
>     Gabriel Becker
>     Graduate Student
>     Statistics Department
>     University of California, Davis
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioc-devel mailing list