[Bioc-devel] parsing embedded FASTA data

Hervé Pagès hpages at fhcrc.org
Tue Mar 18 01:33:44 CET 2014


Hi Michael,

On 03/17/2014 04:15 PM, Michael Lawrence wrote:
> Hi Herve,
>
> What would be a clean way for rtracklayer to extract the (optional) FASTA
> data embedded in a GFF3 file and parse it as an XStringSet? Is there a
> low-level way to pass in-memory data to the parser in Biostrings?

Not that it can be used here, but readDNAStringSet() has the 'skip' arg
which is analogous to the 'skip' arg of read.table(), except that, in
the case of readDNAStringSet(), it needs to be specified as the number
of records (FASTA or FASTQ) to skip before beginning to read in
records. So the assumption is that everything before the first record
to read is valid FASTA (or FASTQ). Which is of course not the case
with those GFF3 files with embedded FASTA data.

However it would be easy to add another arg, say 'skip.non.fasta.lines',
to automatically skip lines that don't look like the header of a FASTA
record (i.e. that don't start with '>').
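
To make the idea concrete, here is a rough sketch of the skipping logic,
written in Python only for illustration. It is not the Biostrings
implementation, and the function name read_embedded_fasta and the toy GFF3
lines are made up. It assumes that everything before the first line
starting with '>' can be ignored and that everything after it is plain
FASTA.

```python
def read_embedded_fasta(lines):
    """Return a list of (name, sequence) pairs, ignoring leading non-FASTA lines.

    Hypothetical helper: lines seen before the first '>' header
    (GFF3 directives, feature rows, the '##FASTA' marker) are skipped.
    """
    records = []          # list of [name, list_of_sequence_chunks]
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            # A header line opens a new record; keep the full header as the name.
            records.append([line[1:].strip(), []])
        elif records and line:
            # Sequence line belonging to the most recent header.
            records[-1][1].append(line)
        # Otherwise: no header seen yet (or blank line), so skip it.
    return [(name, "".join(chunks)) for name, chunks in records]


# Toy GFF3 content with an embedded FASTA section (made-up data).
gff3_lines = [
    "##gff-version 3",
    "ctg1\t.\tgene\t1\t8\t.\t+\t.\tID=gene1",
    "##FASTA",
    ">ctg1",
    "ACGT",
    "TTGA",
    ">ctg2",
    "GGCC",
]

seqs = read_embedded_fasta(gff3_lines)
```

With the toy input above, the two GFF3 header/feature lines and the
'##FASTA' marker are dropped and only the two sequence records are kept.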

>
> In terms of the API, import,GFFFile could return a GRanges with the
> DNAStringSet in the metadata(). Or there could be a method for
> readDNAStringSet on GFF3File that returns the DNAStringSet directly.

The readDNAStringSet,GFF3File method seems cleaner than the metadata()
solution. It's also lower-level and would be needed behind the scenes by
import,GFFFile, so I think it would make sense to start with it.
Implementing readDNAStringSet,GFF3File will be trivial once we have
something like the 'skip.non.fasta.lines' arg. Should I go for it? Any
better suggestion for the name of this arg?

Thanks,
H.

>
> It turns out this functionality is useful when working with microbial
> genomes, where information tends to be passed around as Genbank files. For
> right now the easiest path seems to be to convert Genbank to GFF, but a
> Genbank parser in Bioc could be an eventual goal. It's a very complex file
> format.
>
> Michael

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


