[Bioc-devel] Sequences from non-disk sources

Wed Aug 5 21:37:39 CEST 2009

Thank you Martin.

I checked, your method works:
  s <- read.AAStringSet(file("stdin"), "fasta")

At the C level, a developer can read files non-sequentially, seeking
to arbitrary positions in the file. That would cause a C-level error
when the input comes from stdin, because stdin is implemented as a
sequential stream.

It would make Biostrings more stable if an official argument
(e.g. "" for file) were documented under ?read.AAStringSet and the
other read functions, because C-level developers would then avoid
reading files non-sequentially.
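
To make the sequential style concrete, here is a minimal sketch at the R
level. It assumes nothing about how Biostrings works internally:
read_fasta_seq is a hypothetical helper, not a Biostrings function. It
relies only on readLines, so it never seeks and should work on any
connection, including file("stdin").

```r
# Hypothetical helper (not part of Biostrings): parse FASTA records from
# any R connection using sequential reads only, so no seeking is needed.
read_fasta_seq <- function(con) {
  # Open the connection if the caller passed an unopened one,
  # e.g. file("stdin"), and close it again when we are done.
  if (!isOpen(con)) {
    open(con, "r")
    on.exit(close(con))
  }

  lines <- readLines(con)          # one pass, front to back
  lines <- lines[nzchar(lines)]    # drop empty lines

  is_header <- startsWith(lines, ">")
  # Every header starts a new record; cumsum gives each line a record id.
  record <- cumsum(is_header)

  # Group the sequence lines by record id. Using explicit factor levels
  # keeps records that have a header but no sequence lines.
  groups <- split(
    lines[!is_header],
    factor(record[!is_header], levels = seq_len(sum(is_header)))
  )
  seqs <- vapply(groups, paste, character(1), collapse = "")
  names(seqs) <- sub("^>", "", lines[is_header])
  seqs
}
```

In a script fed through a pipe, the call would be
read_fasta_seq(file("stdin")); the same function accepts
gzfile("my.fasta.gz") or a textConnection for testing.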

Alex

On Wed, Aug 5, 2009 at 9:59 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> Michael Lawrence <mflawren at fhcrc.org> writes:
>
>> On Mon, Aug 3, 2009 at 7:17 PM, Aleksandr Levchuk <alevchuk at gmail.com> wrote:
>>
>>> Dear BioC developers,
>>>
>>> Some of my sequences come from non-disk sources:
>>>  Network
>>>  Un-compressors
>>>  Other tools arranged as pipelines
>>>
>>> I was able to stream such sources into R without touching the disk:
>>> =========================
>>> #!/usr/bin/env Rscript
>>>
>>> library(Biostrings)
>>> s <- read.AAStringSet("/dev/stdin", "fasta")
>>>
>>> #
>>> # operate on s
>>> #
>>> =========================
>>>
>>> Assuming the above file is called my.R, I can run:
>>>  chmod +x my.R
>>>  cat my.fasta.gz |  gzip -dc | ./my.R
>>>
>>> Very powerful and flexible.
>>>
>>>
>>> But I would like to eliminate my "hackish" /dev/stdin fifo approach.
>
> Hi Alex
>
> from ?stdin it would appear that your hackish approach is close to R's
> recommendation; file("stdin") is documented to access the C-level
> stdin.  For other connections on Linux it seems like one needs, e.g.,
> gzfile("/dev/stdin"); I don't know about other OS.
>
> The reason this works for things like read.AAStringSet is that at its
> root it uses R's built-in functions like 'scan', 'read.table', and
> 'readLines'. These make use of connections (the thing returned by
> file()) without any additional effort on the part of the package
> developer.
>
> Most package developers write parsers that expect a character
> string naming a file, and then use C's fopen or the like to connect
> to a simple file. This is partly because the C-level interface to
> 'connection' objects is not developer friendly. The two challenges are
> thus a) connections are not generally available for all parsers and b)
> developers are not likely to be in a position to implement them even
> if provided a good use case (and your use case is a really nice
> illustration that it would be useful for this to work). These are
> general statements, and there might be tweaks to existing code that
> would allow more flexible use of connections.
>
> If I'm mistaken and there really is an easy way to use connections in
> C, then please correct me!
>
> Martin
>
>>
>> What about R connections? There's a gzfile() connection that would handle
>> the case above, as well as network connections, url().
>>
>> Just as an untested example:
>> s <- read.AAStringSet(gzfile("my.fasta.gz"), "fasta")
>>
>>
>>>
>>> I noticed that functions 'write.XStringSet' and 'write.XStringViews'
>>> have an official documented way that allows writing to standard
>>> output.
>>>
>>>
>>> Would it be difficult to add an argument to the Biostrings read
>>> functions to allow reading sequences from standard input?
>>>
>>>
>>> Alex
>>>
>>> --
>>> ---------------------------------------------------------------
>>> Aleksandr Levchuk
>>> Bioinformatic Systems and Databases
>>>
>>> University of California, Riverside
>>> Institute for Integrative Genome Biology
>>>
>>> _______________________________________________
>>> Bioc-devel at stat.math.ethz.ch mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>
> --
> Martin Morgan
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>

-- 
---------------------------------------------------------------
Aleksandr Levchuk
Bioinformatic Systems and Databases
Cell Phone: (951) 368-0004

Institute for Integrative Genome Biology
University of California, Riverside