[BioC] writeFastq to gzip compressed files
Harris A. Jaffee
hj at jhu.edu
Wed Sep 25 22:23:43 CEST 2013
I believe gzfile only applies when reading.
With gzcon, you may have to flush or close the connection (not sure which
one - I think your fout2), or your whole session, to get the file written.
On Sep 25, 2013, at 4:07 PM, Marcus Davy wrote:
> Thanks for the responses, Martin I will give your work-around a try.
>
>
> I did also try out the following but in both cases they failed, and only
> produced a zero byte sized file (equivalent to bash 'touch' utility on the
> command line) using OS X.
>
> fout <- gzfile("test.fq.gz", "wb")
> print(fout)
> writeFastq(rfq, fout)
>
> fout2 <- gzcon(file("test.fq.gz", "wb"))
> print(fout2)
> writeFastq(rfq, fout)
>
>
> cheers,
>
>
> Marcus
>
>
> On Thu, Sep 26, 2013 at 6:01 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>
>> On 09/25/2013 09:09 AM, Michael Lawrence wrote:
>>
>>> Ideally, it would be:
>>> writeFastq(rfq, gzfile(fout))
>>>
>>> But I don't think that works. In theory, it would be feasible to support
>>> this now that R has exposed its connection interface...
>>>
>>
>> That'll be an interesting learning experience, and I'll come up with a
>> proper implementation. Here's a work-around, With
>>
>> writeFastq_con(rfq, stdout())
>> yy = writeFastq_con(rfq, gzfile(tempfile()))
>> readFastq(yy)
>>
>> Martin
>>
>> .write1fastq <-
>> function(x, con, ..., full=FALSE)
>> {
>> i <- paste0("@", as.character(id(x)), "\n")
>> fill <- if (full)
>> paste0("\n+", as.character(id(x)), "\n")
>> else "\n+\n"
>> s <- as.character(sread(x))
>> q <- as.character(quality(quality(**x)))
>> cat(paste(i, s, fill, q, sep=""), file=con, sep="\n")
>> length(i)
>> }
>>
>> writeFastq_con <-
>> function(object, file, ..., chunkSize=1000000L)
>> ## object: ShortReadQ instance
>> ## file: connection, e.g., stdout(), gzfile(<...>, "wb")
>> ## ...: additional arguments, in particular
>> ## full=TRUE: logical(1), TRUE: replicate id tag on '+' line
>> ## chunkSize: size of each chunk processed for output, smaller
>> ## values use less memory but are slower
>>
>> {
>> if (!isOpen(file)) {
>> open(file, "wb")
>> on.exit(close(file))
>> }
>>
>> len <- length(object)
>> start <- 1L
>> tot <- 0L
>> while (start < len) {
>> end <- min(start + chunkSize - 1L, len)
>> tot <- tot + .write1fastq(object[seq(start, end)], file, ...)
>> start <- end + 1L
>> }
>> if (tot != len)
>> warning("output length does not equal input length")
>>
>> summary(file)$description
>>
>> }
>>
>>
>>> Michael
>>>
>>>
>>> On Wed, Sep 25, 2013 at 7:39 AM, Marcus Davy <mdavy86 at gmail.com> wrote:
>>>
>>> Hi,
>>>> as fastq files are often compressed now, I would like to be able to write
>>>> out to compressed 'fq.gz' files, rather than raw text.
>>>>
>>>> I have searched the BioC threads and found a similar topic at;
>>>>
>>>>
>>>> http://article.gmane.org/**gmane.science.biology.**
>>>> informatics.conductor/45306/**match=writeFastq<http://article.gmane.org/gmane.science.biology.informatics.conductor/45306/match=writeFastq>
>>>>
>>>> A poor example of what I would like to be able to achieve follows;
>>>>
>>>> example(readFastq)
>>>> print(rfq)
>>>> fout <- "test.fq"
>>>> writeFastq(rfq, fout)
>>>> system(paste("gzip", fout))
>>>>
>>>> but this does not compress the data as it is streaming out to file. I
>>>> would
>>>> like to be able to do something like below, so this is a potential
>>>> feature
>>>> request;
>>>>
>>>> fout <- "test.fq.gz"
>>>> writeFastq(rfq, fout, compress=TRUE)
>>>>
>>>> Is there an obvious way to compress using writeFastq that I am missing?
>>>>
>>>>
>>>> cheers,
>>>>
>>>> Marcus
>>>>
>>>>
>>>> sessionInfo()
>>>> R version 3.0.0 (2013-04-03)
>>>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>>>>
>>>> locale:
>>>> [1] C
>>>>
>>>> attached base packages:
>>>> [1] parallel splines stats graphics grDevices utils datasets
>>>> [8] methods base
>>>>
>>>> other attached packages:
>>>> [1] ShortRead_1.18.0 latticeExtra_0.6-24 RColorBrewer_1.0-5
>>>> [4] Rsamtools_1.12.3 lattice_0.20-15 Biostrings_2.28.0
>>>> [7] GenomicRanges_1.12.2 IRanges_1.18.0 BiocGenerics_0.6.0
>>>> [10] BiocInstaller_1.10.3 limma_3.16.7 Hmisc_3.12-2
>>>> [13] Formula_1.1-1 survival_2.37-4
>>>>
>>>> [[alternative HTML version deleted]]
>>>>
>>>> ______________________________**_________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.**science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>>>
>>>>
>>> [[alternative HTML version deleted]]
>>>
>>> ______________________________**_________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>> Search the archives: http://news.gmane.org/gmane.**
>>> science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>>
>>>
>>
>> --
>> Computational Biology / Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>>
>> Location: Arnold Building M1 B861
>> Phone: (206) 667-2793
>>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list