[BioC] readFastq, writeFastq, compressed files and unix pipes.
Martin Morgan
mtmorgan at fhcrc.org
Sat Dec 15 06:31:07 CET 2012
Hi Ivan --
On 12/13/2012 06:40 AM, Martin Morgan wrote:
> On 12/13/2012 01:22 AM, Ivan Gregoretti wrote:
>> Hello ShortRead developers,
>>
>> I recently tried to create an R script that could run as an ordinary
>> programme at the command line. It was something like
>>
>> $ R --vanilla --slave -f ./myscript.R --input=input.fastq.gz
>> --output=output.fastq.gz
>>
>> I tried and failed but I learned something in the process: readFastq()
>> and writeFastq() are not symmetrical.
>>
>> Specifically:
>>
>> 1) readFastq() can tell the difference between a plain text FASTQ file
>> and a gzipped FASTQ file (.gz). Unlike that, writeFastq() always
>> outputs in plain text regardless of the suffix passed.
>
> FastqStreamer is intended to work on R connections, and from the note in ?stdin
> + a little googling + tolerating a warning about an already open connection one
> can read a compressed file from stdin using
>
> fin <- gzcon(file("stdin", "rb"))
>
> (this seems to be the R paradigm, nothing special to fastq files) followed by
>
> strm <- FastqStreamer(fin)
> object <- yield(strm)
>
> see ?FastqStreamer for how many records are read at a time.
>
>> 2) writeFastq() understands unix pipes because it conveniently accepts
>> the file argument "/dev/stdout". Unlike that, readFastq() does not
>> accept the file argument "/dev/stdin", so, no pipes are possible.
>
> For me, writeFastq(object, "/dev/stdout") fails with
>
> > writeFastq(object, "/dev/stdout")
> Error: UserArgumentMismatch
> file '/dev/stdout' exists, but mode is not 'a'
>
> an easy way to get the full support of R's connections is
>
> setMethod(writeFastq, c("ShortReadQ", "connection"),
> function(object, file, mode="w", full = FALSE, ...)
> {
> outp <- character(length(object) * 4L)
> outp[c(TRUE, FALSE, full, FALSE)] <- as.character(id(object))
> outp[c(FALSE, TRUE, FALSE, FALSE)] <- as.character(sread(object))
> outp[c(FALSE, FALSE, TRUE, FALSE)] <- "+"
> outp[c(FALSE, FALSE, FALSE, TRUE)] <- as.character(quality(quality(object)))
> writeLines(outp, file)
> })
>
> which has obvious limitations. Writing a compressed stream to /dev/stdout could
> be arranged with
>
> fout <- gzcon(file("/dev/stdout", "wb"))
>
> (which seems like a hack; maybe there's a better R way?) and then
>
> writeFastq(object, fout)
> close(fout)
>
> and processing a file would be
>
> fin <- gzcon(file("stdin", "rb"))
> fout <- gzcon(file("/dev/stdout", "wb"))
>
> strm <- FastqStreamer(fin)
> while(length(object <- yield(strm))) {
> writeFastq(object, fout)
> }
> close(fin); close(fout)
>
> piping to stdout seems like it will be problematic, e.g., if some R command
> writes to stdout then it will be inserted in the output stream. Maybe better to
> use a named pipe (fifo)
>
>> It would be great if readFastq() and writeFastq() were equally smart.
>> Please consider adding these functionalities.
>>
>> Thank you,
>>
>> Ivan
>>
>>
>>
>>
>>
>> Ivan Gregoretti, PhD
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
More information about the Bioconductor
mailing list