[Bioc-devel] parsing embedded FASTA data
Hervé Pagès
hpages at fhcrc.org
Mon Mar 24 18:08:36 CET 2014
Hi Michael,
This is now supported in Biostrings 2.31.17.
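
A minimal sketch of the idea, not the actual implementation. It assumes the new argument landed under the 'seek.first.rec' name proposed in the quoted thread below. The read_embedded_fasta() helper and the GFF3 content are made up for illustration; the helper is plain base R and is not part of Biostrings.

```r
# Write a tiny GFF3 file with embedded FASTA data (made-up content).
gff3 <- tempfile(fileext = ".gff3")
writeLines(c(
  "##gff-version 3",
  "chr1\t.\tgene\t1\t8\t.\t+\t.\tID=gene1",
  "##FASTA",
  ">chr1",
  "ACGTACGT",
  ">chr2",
  "GGCC",
  "TTAA"
), gff3)

# Hypothetical base-R helper: ignore everything before the first line that
# starts with '>', then collect the FASTA records that follow.
read_embedded_fasta <- function(path) {
  lines <- readLines(path)
  first <- which(startsWith(lines, ">"))[1]
  if (is.na(first)) stop("no FASTA record found in ", path)
  lines <- lines[first:length(lines)]
  is_header <- startsWith(lines, ">")
  record_id <- cumsum(is_header)
  # Group the sequence lines by record, keeping records with no sequence lines.
  groups <- factor(record_id[!is_header], levels = seq_len(sum(is_header)))
  seqs <- vapply(split(lines[!is_header], groups), paste, character(1),
                 collapse = "")
  names(seqs) <- sub("^>", "", lines[is_header])
  seqs
}

seqs <- read_embedded_fasta(gff3)

# With Biostrings installed, the equivalent call would presumably be
# (assumption: the argument is named 'seek.first.rec'):
if (requireNamespace("Biostrings", quietly = TRUE)) {
  dna <- Biostrings::readDNAStringSet(gff3, seek.first.rec = TRUE)
}
```

The base-R helper only shows the "seek to the first record" logic on a small file; for real data the Biostrings reader is the one to use.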
Cheers,
H.
On 03/18/2014 11:42 AM, Hervé Pagès wrote:
> Hi,
>
> On 03/18/2014 10:04 AM, Michael Lawrence wrote:
>>
>>
>>
>> On Tue, Mar 18, 2014 at 7:54 AM, Gabriel Becker <gmbecker at ucdavis.edu> wrote:
>>
>> Or going the positive declarative route but arguably more
>> informative: skip.to.fasta or fasta.only
>>
>>
>> skip.to.fasta might work. A different algorithm that would work for GFF3
>> would be skip.to.pragma="##FASTA", which would skip until it hit a line
>> matching "##FASTA".
>>
>>
>> I don't know the GFF format spec, are we guaranteed that there will
>> be only one embedded fasta file and that it will be contiguous
>> within the file?
>>
>>
>> Yes, it is guaranteed that after a certain point in the file (that
>> pragma), all data is FASTA formatted.
>
> Thanks for the suggestions. I think I'll go for 'seek.first.rec', just
> to keep it generic and not tied to the specifics of the GFF, FASTA, or
> FASTQ formats.
>
> H.
>
>>
>> If not, the skip.to._ terminology would not technically be correct.
>>
>> ~G
>>
>>
>> On Mon, Mar 17, 2014 at 8:17 PM, Michael Lawrence <lawrence.michael at gene.com> wrote:
>>
>> For direct reading of the sequence, the skip.non.fasta idea sounds good.
>> An alternative for the name would be "skip.to.first.record". Up to you.
>>
>> Michael
>>
>>
>> On Mon, Mar 17, 2014 at 5:33 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
>>
>> > Hi Michael,
>> >
>> >
>> > On 03/17/2014 04:15 PM, Michael Lawrence wrote:
>> >
>> >> Hi Herve,
>> >>
>> >> What would be a clean way for rtracklayer to extract the (optional)
>> >> FASTA data embedded in a GFF3 file and parse it as an XStringSet?
>> >> Is there a low-level way to pass in-memory data to the parser in
>> >> Biostrings?
>> >>
>> >
>> > Not that it can be used here, but readDNAStringSet() has the 'skip' arg
>> > which is analogous to the 'skip' arg of read.table(), except that, in
>> > the case of readDNAStringSet(), it needs to be specified as the number
>> > of records (FASTA or FASTQ) to skip before beginning to read in
>> > records. So the assumption is that everything before the first record
>> > to read is valid FASTA (or FASTQ). Which is of course not the case
>> > with those GFF3 files with embedded FASTA data.
>> >
>> > However it would be easy to add another arg, say 'skip.non.fasta.lines',
>> > to automatically skip lines that don't look like the header of a FASTA
>> > record (i.e. that don't start with '>').
>> >
>> >
>> >
>> >> In terms of the API, import,GFFFile could return a GRanges with the
>> >> DNAStringSet in the metadata(). Or there could be a method for
>> >> readDNAStringSet on GFF3File that returns the DNAStringSet directly.
>> >>
>> >
>> > The readDNAStringSet,GFF3File method seems cleaner than the metadata()
>> > solution. It's also lower-level and would be needed behind the scene by
>> > import,GFFFile, so I think it would make sense to start with it.
>> > Implementing readDNAStringSet,GFF3File will be trivial once we have
>> > something like the 'skip.non.fasta' arg. Should I go for it? Any better
>> > suggestion for the name of this arg?
>> >
>> > Thanks,
>> > H.
>> >
>> >
>> >> It turns out this functionality is useful when working with microbial
>> >> genomes, where information tends to be passed around as Genbank files.
>> >> For right now the easiest path seems to be to convert Genbank to GFF,
>> >> but a Genbank parser in Bioc could be an eventual goal. It's a very
>> >> complex file format.
>> >>
>> >> Michael
>> >>
>> >> [[alternative HTML version deleted]]
>> >>
>> >> _______________________________________________
>> >> Bioc-devel at r-project.org mailing list
>> >> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>> >>
>> >>
>> > --
>> > Hervé Pagès
>> >
>> > Program in Computational Biology
>> > Division of Public Health Sciences
>> > Fred Hutchinson Cancer Research Center
>> > 1100 Fairview Ave. N, M1-B514
>> > P.O. Box 19024
>> > Seattle, WA 98109-1024
>> >
>> > E-mail: hpages at fhcrc.org
>> > Phone: (206) 667-5791
>> > Fax: (206) 667-1319
>> >
>>
>>
>>
>>
>> --
>> Gabriel Becker
>> Graduate Student
>> Statistics Department
>> University of California, Davis
>>
>>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319