[Bioc-devel] Random access to sequences in fasta files

Thu Jan 29 17:15:07 CET 2015

Thanks Martin

This was meant as a feature request/discussion for Biostrings, which is why I posted it here. I thought the I/O capabilities of Biostrings were behind most other FASTA readers in Bioconductor...
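
For reference, here is a minimal sketch of the Rsamtools::FaFile route Martin mentions below, wrapped so that it resembles the interface I asked for. The helper name read_records, the file contents, and the choice to create a missing .fai index on the fly are assumptions made for illustration only. None of this is an existing Biostrings interface, and it assumes Rsamtools is installed.

```r
library(Rsamtools)  # Bioconductor package providing FaFile, indexFa, scanFaIndex, getSeq

# Hypothetical helper (not part of any package): 'files' is a character
# vector of FASTA paths, 'rec' a list with one vector of record indexes
# per file, mirroring the list form of the requested readXStringSet(files, rec).
read_records <- function(files, rec) {
  per_file <- Map(function(f, i) {
    # Assumption: build the samtools-style index if it is not there yet.
    if (!file.exists(paste0(f, ".fai"))) indexFa(f)
    fa <- FaFile(f)
    idx <- scanFaIndex(fa)   # GRanges with one range per FASTA record
    getSeq(fa, idx[i])       # reads only the selected records
  }, files, rec)
  do.call(c, unname(per_file))
}

# Two small made-up FASTA files so the sketch is self-contained.
files <- c(tempfile(fileext = ".fasta"), tempfile(fileext = ".fasta"))
writeLines(c(">a1", "ACGTACGT", ">a2", "GGGCCC", ">a3", "TTTTAAAA"), files[1])
writeLines(c(">b1", "CCCC", ">b2", "ATATAT"), files[2])

# Records 1 and 3 from the first file, record 2 from the second.
seqs <- read_records(files, list(c(1, 3), 2))
```

This still loops over files in R, so it is not the vectorised reader I had in mind, but only the requested records are read into memory.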

/Thomas

> On 29/01/2015 at 15:45, Martin Morgan <mtmorgan at fredhutch.org> wrote:
> 
>> On 01/29/2015 06:41 AM, Thomas Lin Pedersen wrote:
>> Hi
>> 
>> I’m asking whether there are any plans to support random-access reading of FASTA files, in the sense that the indexes of the sequences to read can be specified up front.
>> 
>> I’m working on a package for comparative microbial genomics, and it would be a huge speed improvement if thousands of sequences spread over as many files could be read in quickly. Currently the proper, vectorised approach requires all files to be read in at once and then subsetted, but this can result in XStringSets in the GB range just to access a few sequences. The slow, un-R way would be to loop through each file (or each sequence, using skip and nrec to read in only the relevant sequences). Ideally I’m looking for an interface like:
>> 
>> readXStringSet(files, rec)
>> 
>> Where rec is either a vector that would index into the XStringSet as if everything from files had been read in, or a list with the same length as files, containing the indexes of interest for each file.
> 
> Hi Thomas -- this should really be posted to support.bioconductor.org, but see Rsamtools::FaFile and rtracklayer::TwoBitFile access through getSeq.
> 
> Martin
> 
>> with best wishes
>> 
>> Thomas
> 
> 
> -- 
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
> 
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793