[Bioc-devel] Sequences from non-disk sources
alevchuk at gmail.com
Wed Aug 5 21:37:39 CEST 2009
Thank you Martin.
I checked, your method works:
s <- read.AAStringSet(file("stdin"), "fasta")
At the C level, a developer can read files non-sequentially, seeking
to arbitrary positions in the file. That would cause a C-level error
when the input comes from stdin, because stdin is implemented as a
sequential stream.
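To illustrate what I mean by sequential reading, here is a minimal sketch in
plain R. The helper count_fasta_records() and its chunk_size argument are
made up for this example and are not part of Biostrings; it only assumes
base R's readLines(). It makes a single forward pass and never seeks, so it
should behave the same on a disk file, a pipe, or stdin:

```r
# Hypothetical helper, not part of Biostrings: count FASTA records in a
# single forward pass, so no seeking is ever required.
count_fasta_records <- function(con, chunk_size = 1000L) {
  # Open the connection if the caller has not, and close it again on exit.
  if (!isOpen(con)) {
    open(con, "r")
    on.exit(close(con))
  }
  n <- 0L
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break               # end of stream reached
    n <- n + sum(substr(lines, 1L, 1L) == ">")   # header lines start with ">"
  }
  n
}

# An in-memory text connection stands in for a stream here.
demo <- textConnection(c(">seq1", "MKV", ">seq2", "LLT"))
n_records <- count_fasta_records(demo)
close(demo)
```

In a script run at the end of a pipeline, the same call would be
count_fasta_records(file("stdin")).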
It would make Biostrings more robust if an official argument
(e.g. ' "" ' for file) were documented under ?read.AAStringSet and
the other read functions, because C-level developers would then avoid
reading files non-sequentially.
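A rough sketch of how such an argument could be resolved at the R level is
below. This is only my assumption of one possible convention, not how
Biostrings behaves today; resolve_input() and its path argument are
hypothetical names. An empty string is mapped to file("stdin"), an existing
connection is passed through, and anything else is treated as a file name:

```r
# Hypothetical sketch, not Biostrings code: map the proposed "" convention
# onto R connections.
resolve_input <- function(path = "") {
  if (inherits(path, "connection")) {
    return(path)              # caller already supplied a connection
  }
  if (identical(path, "")) {
    return(file("stdin"))     # assumed convention: "" means standard input
  }
  file(path)                  # otherwise treat the value as a file name
}

# Usage: each call returns an unopened connection object.
from_stdin <- resolve_input("")
from_disk  <- resolve_input(tempfile(fileext = ".fasta"))
is_con <- c(inherits(from_stdin, "connection"),
            inherits(from_disk, "connection"))
close(from_stdin)
close(from_disk)
```

A reader built on top of this would only ever see a connection, which keeps
it on the sequential path.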
On Wed, Aug 5, 2009 at 9:59 AM, Martin Morgan<mtmorgan at fhcrc.org> wrote:
> Michael Lawrence <mflawren at fhcrc.org> writes:
>> On Mon, Aug 3, 2009 at 7:17 PM, Aleksandr Levchuk <alevchuk at gmail.com>wrote:
>>> Dear BioC developers,
>>> Some of my sequences come from non-disk sources:
>>> Other tools arranged as pipelines
>>> I was able to stream such sources into R without touching the disk:
>>> #!/usr/bin/env Rscript
>>> s <- read.AAStringSet("/dev/stdin", "fasta")
>>> # operate on s
>>> Assuming the above file is called my.R, I can run:
>>> chmod +x my.R
>>> cat my.fasta.gz | gzip -dc | ./my.R
>>> Very powerful and flexible.
>>> But I would like to eliminate my "hackish" /dev/stdin fifo approach.
> Hi Alex
> from ?stdin it would appear that your hackish approach is close to R's
> recommendation; file("stdin") is documented to access the C-level
> stdin. For other connections on linux it seems like one needs, e.g.,
> gzfile("/dev/stdin"); I don't know about other OS.
> The reason this works for things like read.AAStringSet is that at its
> root it uses R's built-in functions like 'scan', 'read.table', and
> 'readLines'. These make use of connections (the thing returned by
> file()) without any additional effort on the part of the package
> developer.
> Most package developers write parsers that are expecting a character
> string naming a file, and then using C's fopen or the like to connect
> to a simple file. This is partly because the C-level interface to
> 'connection' objects is not developer friendly. The two challenges are
> thus a) connections are not generally available for all parsers and b)
> developers are not likely to be in a position to implement them even
> if provided a good use case (and your use case is a really nice
> illustration that it would be useful for this to work). These are
> general statements, and there might be tweaks to existing code that
> would allow more flexible use of connections.
> If I'm mistaken and there really is an easy way to use connections in
> C, then please correct me!
>> What about R connections? There's a gzfile() connection that would handle
>> the case above, as well as network connections, url().
>> Just as an untested example:
>> s <- read.AAStringSet(gzfile("my.fasta.gz"), "fasta")
>>> I noticed that functions 'write.XStringSet' and 'write.XStringViews'
>>> have an officially documented way of writing to standard output.
>>> Would it be difficult to add an argument to the Biostrings read
>>> functions to allow reading sequences from standard input?
>>> Aleksandr Levchuk
>>> Bioinformatic Systems and Databases
>>> University of California, Riverside
>>> Institute for Integrative Genome Biology
>>> Bioc-devel at stat.math.ethz.ch mailing list
> Martin Morgan
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
Bioinformatic Systems and Databases
Cell Phone: (951) 368-0004
Institute for Integrative Genome Biology
University of California, Riverside