[Bioc-devel] Sequences from non-disk sources

Martin Morgan mtmorgan at fhcrc.org
Wed Aug 5 18:59:13 CEST 2009


Michael Lawrence <mflawren at fhcrc.org> writes:

> On Mon, Aug 3, 2009 at 7:17 PM, Aleksandr Levchuk <alevchuk at gmail.com> wrote:
>
>> Dear BioC developers,
>>
>> Some of my sequences come from non-disk sources:
>>  Network
>>  Un-compressors
>>  Other tools arranged as pipelines
>>
>> I was able to stream such sources into R without touching the disk:
>> =========================
>> #!/usr/bin/env Rscript
>>
>> library(Biostrings)
>> s <- read.AAStringSet("/dev/stdin", "fasta")
>>
>> #
>> # operate on s
>> #
>> =========================
>>
>> Assuming the above file is called my.R, I can run:
>>  chmod +x my.R
>>  cat my.fasta.gz |  gzip -dc | ./my.R
>>
>> Very powerful and flexible.
>>
>>
>> But I would like to eliminate my "hackish" /dev/stdin fifo approach.

Hi Alex

From ?stdin it appears that your hackish approach is close to R's
recommendation; file("stdin") is documented to access the C-level
stdin. For other connection types on Linux it seems one needs, e.g.,
gzfile("/dev/stdin"); I don't know about other operating systems.
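
A minimal sketch of the two forms mentioned above, using only base R.
The helper count_fasta_headers() and the script name count.R are
hypothetical names chosen for illustration, and the gzfile("/dev/stdin")
form is taken from the remark above rather than tested across operating
systems. The runnable part uses a temporary gzipped file so that it does
not depend on standard input.

```r
# Hypothetical helper: count FASTA header lines arriving on a connection.
count_fasta_headers <- function(con) {
  on.exit(close(con))        # release the connection when the function exits
  lines <- readLines(con)    # readLines() opens an unopened connection itself
  sum(startsWith(lines, ">"))
}

# Intended use inside a script (count.R is a hypothetical file name):
#   plain text on stdin:  cat my.fasta    | Rscript count.R
#     count_fasta_headers(file("stdin"))
#   gzipped data on stdin (Linux): cat my.fasta.gz | Rscript count.R
#     count_fasta_headers(gzfile("/dev/stdin"))

# Self-contained demonstration with a temporary gzipped file
tmp <- tempfile(fileext = ".fasta.gz")
gz <- gzfile(tmp, "w")
writeLines(c(">seq1", "MKV", ">seq2", "LLT"), gz)
close(gz)

n <- count_fasta_headers(gzfile(tmp))  # decompressed on the fly
unlink(tmp)
```

The function only ever sees a connection, so the same code serves plain
and compressed input; only the call that creates the connection changes.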

The reason this works for things like read.AAStringSet is that at its
root it uses R's built-in functions like 'scan', 'read.table', and
'readLines'. These accept connections (the objects returned by
file()) without any additional effort on the part of the package
developer.
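
To illustrate the point, here is a rough sketch of a reader built only
on readLines(). It is not the Biostrings implementation;
read_fasta_simple() is a hypothetical function written for this example.
Because readLines() accepts either a file name or a connection, the
reader inherits that flexibility without any extra code.

```r
# Hypothetical minimal FASTA reader; 'src' may be a file name or a connection.
read_fasta_simple <- function(src) {
  lines <- readLines(src)
  lines <- lines[nzchar(lines)]          # drop empty lines
  is_header <- startsWith(lines, ">")
  record <- cumsum(is_header)            # record index for every line
  ids <- sub("^>", "", lines[is_header])
  # Concatenate the sequence lines that belong to each record
  seqs <- vapply(seq_along(ids), function(i) {
    paste(lines[record == i & !is_header], collapse = "")
  }, character(1))
  names(seqs) <- ids
  seqs
}

fasta <- c(">seq1", "MKV", "LLT", ">seq2", "AAG")

# 1. From an in-memory text connection
con <- textConnection(fasta)
from_text <- read_fasta_simple(con)
close(con)

# 2. From a file name, with no change to the reader
tmp <- tempfile(fileext = ".fasta")
writeLines(fasta, tmp)
from_file <- read_fasta_simple(tmp)
unlink(tmp)
```

Both calls return the same named character vector; a gzfile() or url()
connection could be passed in the same way.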

Most package developers write parsers that expect a character
string naming a file, and then use C's fopen or the like to open
a simple file. This is partly because the C-level interface to
'connection' objects is not developer friendly. The two challenges are
thus a) connections are not generally supported by all parsers and b)
developers are not likely to be in a position to implement them even
if provided a good use case (and your use case is a really nice
illustration of why it would be useful for this to work). These are
general statements, and there might be tweaks to existing code that
would allow more flexible use of connections.

If I'm mistaken and there really is an easy way to use connections in
C, then please correct me!

Martin

>
> What about R connections? There's a gzfile() connection that would handle
> the case above, as well as network connections, url().
>
> Just as an untested example:
> s <- read.AAStringSet(gzfile("my.fasta.gz"), "fasta")
>
>
>>
>> I noticed that functions 'write.XStringSet' and 'write.XStringViews'
>> have an official documented way that allows writing to standard
>> output.
>>
>>
>> Would it be difficult to add an argument to the Biostrings read
>> functions to allow reading sequences from standard input?
>>
>>
>> Alex
>>
>> --
>> ---------------------------------------------------------------
>> Aleksandr Levchuk
>> Bioinformatic Systems and Databases
>>
>> University of California, Riverside
>> Institute for Integrative Genome Biology
>>
>> _______________________________________________
>> Bioc-devel at stat.math.ethz.ch mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793


