[Bioc-devel] parsing embedded FASTA data

Mon Mar 24 18:08:36 CET 2014

Hi Michael,

This is now supported in Biostrings 2.31.17 via the 'seek.first.rec'
argument discussed below.
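
For readers who want to see the idea outside of R, here is a minimal
sketch in Python of the behaviour discussed in the thread below: skip
everything up to the first FASTA record (optionally starting the search
after a "##FASTA" line) and then parse the records. It is only an
illustration and not the Biostrings implementation; the function name
read_embedded_fasta and its 'pragma' parameter are made up for this
example.

```python
def read_embedded_fasta(lines, pragma="##FASTA"):
    """Return {name: sequence} for FASTA records embedded in GFF3-style text.

    Hypothetical helper for illustration only (not part of Biostrings or
    rtracklayer). Lines before the first '>' header are ignored; if a
    pragma line is present, the search for headers starts after it.
    """
    lines = [ln.rstrip("\n") for ln in lines]

    # Start after the pragma line when there is one, else from the top.
    start = 0
    for i, ln in enumerate(lines):
        if ln.strip() == pragma:
            start = i + 1
            break

    records = {}
    name = None
    for ln in lines[start:]:
        if ln.startswith(">"):
            # Header line: begin a new record.
            name = ln[1:].strip()
            records[name] = []
        elif name is not None and ln.strip():
            # Sequence line belonging to the current record.
            records[name].append(ln.strip())
        # Anything seen before the first header is silently skipped.
    return {k: "".join(v) for k, v in records.items()}


if __name__ == "__main__":
    gff3 = [
        "##gff-version 3\n",
        "chr1\t.\tgene\t1\t8\t.\t+\t.\tID=g1\n",
        "##FASTA\n",
        ">chr1\n",
        "ACGT\n",
        "ACGT\n",
    ]
    print(read_embedded_fasta(gff3))
```

Seeking the first header rather than a format-specific pragma keeps the
logic generic, which is the reason given below for the argument name.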

Cheers,
H.

On 03/18/2014 11:42 AM, Hervé Pagès wrote:
> Hi,
>
> On 03/18/2014 10:04 AM, Michael Lawrence wrote:
>>
>>
>>
>> On Tue, Mar 18, 2014 at 7:54 AM, Gabriel Becker <gmbecker at ucdavis.edu>
>> wrote:
>>
>>     Or going the positive declarative route but arguably more
>>     informative: skip.to.fasta or fasta.only
>>
>>
>> skip.to.fasta might work. A different algorithm that would work for GFF3
>> would be skip.to.pragma="##FASTA", which would skip until it hit a line
>> matching "##FASTA".
>>
>>
>>     I don't know the GFF format spec, are we guaranteed that there will
>>     be only one embedded fasta file and that it will be contiguous
>>     within the file?
>>
>>
>> Yes, it is guaranteed that after a certain point in the file (that
>> pragma), all data is FASTA formatted.
>
> Thanks for the suggestions. I think I'll go for 'seek.first.rec', just
> to keep it generic and not tied to the specifics of the GFF, FASTA, or
> FASTQ formats.
>
> H.
>
>>
>>     If not, the skip.to._ terminology would not technically be correct.
>>
>>     ~G
>>
>>
>>     On Mon, Mar 17, 2014 at 8:17 PM, Michael Lawrence
>>     <lawrence.michael at gene.com> wrote:
>>
>>         For direct reading of the sequence, the skip.non.fasta idea
>>         sounds good. An alternative for the name would be
>>         "skip.to.first.record". Up to you.
>>
>>         Michael
>>
>>
>>         On Mon, Mar 17, 2014 at 5:33 PM, Hervé Pagès <hpages at fhcrc.org>
>>         wrote:
>>
>>          > Hi Michael,
>>          >
>>          >
>>          > On 03/17/2014 04:15 PM, Michael Lawrence wrote:
>>          >
>>          >> Hi Herve,
>>          >>
>>          >> What would be a clean way for rtracklayer to extract the
>>          >> (optional) FASTA data embedded in a GFF3 file and parse it
>>          >> as an XStringSet? Is there a low-level way to pass in-memory
>>          >> data to the parser in Biostrings?
>>          >>
>>          >
>>          > Not that it can be used here, but readDNAStringSet() has the
>>          > 'skip' arg, which is analogous to the 'skip' arg of
>>          > read.table(), except that, in the case of readDNAStringSet(),
>>          > it needs to be specified as the number of records (FASTA or
>>          > FASTQ) to skip before beginning to read in records. So the
>>          > assumption is that everything before the first record to read
>>          > is valid FASTA (or FASTQ), which is of course not the case
>>          > with those GFF3 files with embedded FASTA data.
>>          >
>>          > However, it would be easy to add another arg, say
>>          > 'skip.non.fasta.lines', to automatically skip lines that don't
>>          > look like the header of a FASTA record (i.e. that don't start
>>          > with '>').
>>          >
>>          >
>>          >
>>          >> In terms of the API, import,GFFFile could return a GRanges
>>          >> with the DNAStringSet in the metadata(). Or there could be a
>>          >> method for readDNAStringSet on GFF3File that returns the
>>          >> DNAStringSet directly.
>>          >>
>>          >
>>          > The readDNAStringSet,GFF3File method seems cleaner than the
>>          > metadata() solution. It's also lower-level and would be needed
>>          > behind the scenes by import,GFFFile, so I think it would make
>>          > sense to start with it. Implementing readDNAStringSet,GFF3File
>>          > will be trivial once we have something like the
>>          > 'skip.non.fasta' arg. Should I go for it? Any better
>>          > suggestion for the name of this arg?
>>          >
>>          > Thanks,
>>          > H.
>>          >
>>          >
>>          >> It turns out this functionality is useful when working with
>>          >> microbial genomes, where information tends to be passed around
>>          >> as Genbank files. For right now the easiest path seems to be
>>          >> to convert Genbank to GFF, but a Genbank parser in Bioc could
>>          >> be an eventual goal. It's a very complex file format.
>>          >>
>>          >> Michael
>>          >>
>>          >>
>>          >> _______________________________________________
>>          >> Bioc-devel at r-project.org mailing list
>>          >> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>          >>
>>          >>
>>          > --
>>          > Hervé Pagès
>>          >
>>          > Program in Computational Biology
>>          > Division of Public Health Sciences
>>          > Fred Hutchinson Cancer Research Center
>>          > 1100 Fairview Ave. N, M1-B514
>>          > P.O. Box 19024
>>          > Seattle, WA 98109-1024
>>          >
>>          > E-mail: hpages at fhcrc.org
>>          > Phone: (206) 667-5791
>>          > Fax: (206) 667-1319
>>          >
>>
>>
>>
>>
>>
>>     --
>>     Gabriel Becker
>>     Graduate Student
>>     Statistics Department
>>     University of California, Davis
>>
>>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319