[BioC] qrqc with variable length of short reads? - readSeqFile could not handle a 2GB zipped file.

Sang Chul Choi schoi at cornell.edu
Wed Jun 6 20:14:00 CEST 2012


Thank you for offering help. Appended are the R script and its output mostly from sessionInfo() function.

Thank you,

SangChul

The R script:
========================================================================================
library(qrqc)
args <- commandArgs(trailingOnly = TRUE)
if (length(args) != 3)
{
  cat ("Rscript job-stat.R 1.fq.gz 1.fq.RData sanger\n")
  quit("yes")
}
fq.name <- args[1]
fq.quality <- args[3]
sessionInfo()
# fq.file <- readSeqFile(fq.name,quality=fq.quality,hash=FALSE)
# toplot <- qualPlot(fq.file)
# fq.plot <- args[2]
# save(list="toplot", file = fq.plot)
========================================================================================

Output from the R script above:
=========================================================================================
Loading required package: reshape
Loading required package: plyr

Attaching package: 'reshape'

The following object(s) are masked from 'package:plyr':

    rename, round_any

Loading required package: ggplot2
Loading required package: methods
Loading required package: Biostrings
Loading required package: BiocGenerics

Attaching package: 'BiocGenerics'

The following object(s) are masked from 'package:stats':

    xtabs

The following object(s) are masked from 'package:base':

    anyDuplicated, cbind, colnames, duplicated, eval, Filter, Find,
    get, intersect, lapply, Map, mapply, mget, order, paste, pmax,
    pmax.int, pmin, pmin.int, Position, rbind, Reduce, rep.int,
    rownames, sapply, setdiff, table, tapply, union, unique

Loading required package: IRanges

Attaching package: 'IRanges'

The following object(s) are masked from 'package:reshape':

    rename

The following object(s) are masked from 'package:plyr':

    compact, desc, rename

Loading required package: biovizBase
Loading required package: brew
Loading required package: xtable
Loading required package: Rsamtools
Loading required package: GenomicRanges
Loading required package: testthat
R Under development (unstable) (2012-04-01 r58897)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.iso885915       LC_NUMERIC=C                  
 [3] LC_TIME=en_US.iso885915        LC_COLLATE=en_US.iso885915    
 [5] LC_MONETARY=en_US.iso885915    LC_MESSAGES=en_US.iso885915   
 [7] LC_PAPER=C                     LC_NAME=C                     
 [9] LC_ADDRESS=C                   LC_TELEPHONE=C                
[11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C           

attached base packages:
[1] methods   stats     graphics  grDevices utils     datasets  base     

other attached packages:
 [1] qrqc_1.10.0         testthat_0.6        Rsamtools_1.8.5    
 [4] GenomicRanges_1.8.6 xtable_1.7-0        brew_1.0-6         
 [7] biovizBase_1.4.2    Biostrings_2.24.1   IRanges_1.14.3     
[10] BiocGenerics_0.2.0  ggplot2_0.9.1       reshape_0.8.4      
[13] plyr_1.7.1         

loaded via a namespace (and not attached):
 [1] AnnotationDbi_1.18.1  Biobase_2.16.0        biomaRt_2.12.0       
 [4] bitops_1.0-4.1        BSgenome_1.24.0       cluster_1.14.2       
 [7] colorspace_1.1-1      DBI_0.2-5             dichromat_1.2-4      
[10] digest_0.5.2          evaluate_0.4.2        GenomicFeatures_1.8.1
[13] grid_2.16.0           Hmisc_3.9-3           labeling_0.1         
[16] lattice_0.20-6        MASS_7.3-18           memoise_0.1          
[19] munsell_0.3           proto_0.3-9.2         RColorBrewer_1.0-5   
[22] RCurl_1.91-1          reshape2_1.2.1        RSQLite_0.11.1       
[25] rtracklayer_1.16.1    scales_0.2.1          stats4_2.16.0        
[28] stringr_0.6           tools_2.16.0          XML_3.9-4            
[31] zlibbioc_1.2.0       
=========================================================================================


On Jun 5, 2012, at 3:03 PM, Vince S. Buffalo wrote:

> Hi SangChul,
> 
> Can you attach your sessionInfo()? I will take a look into this issue.
> 
> best,
> Vince
> 
> On Tue, Jun 5, 2012 at 12:01 PM, Sang Chul Choi <schoi at cornell.edu> wrote:
> I have tried to tunn off the option when reading sequences of variable lengths in a gzipped FASTQ file (2GB) using readSeqFile. The computer has 16 GB memory, and it used up all of the memory, leaving R in "Dead" or not running any more.  Is there a way of sidestepping this problem?
> 
> Thank you,
> 
> SangChul
> 
> On Jun 1, 2012, at 4:55 PM, Vince Buffalo wrote:
> 
> > Hi SangChul,
> >
> > By default readSeqFile hashes a proportion of the reads to check against many being non-unique. Specify hash=FALSE to turn this off and your memory usage will decrease.
> >
> > Best,
> > Vince
> >
> > Sent from my iPhone
> >
> > On Jun 1, 2012, at 1:23 PM, Sang Chul Choi <schoi at cornell.edu> wrote:
> >
> >> Hi,
> >>
> >> I am using qrqc to plot base quality of a short read fastq file. When the FASTQ file has short reads of the same length, the readSeqFile could read in the FASTQ file (25 millions of 100bp reads) with a couple of GB of memory. I trimmed 3' end of the short reads, which would lead to short reads of variable length because of different base quality at the 3' end.  Then, I tried to read in this second FASTQ file of reads of variable length.  It used up all of the 16 GB memory, and not using CPUs at all.  It seems there are some efficient code in readSeqFile as mentioned in the readSeqFile help message.  It seems to fall apart when short reads are of different size.
> >>
> >> I wish to see how the trimming change the base-quality plots, and this is a problem.  I am wondering if there is a way of sidestepping this problem.
> >>
> >> Thank you,
> >>
> >> SangChul
> >> _______________________________________________
> >> Bioconductor mailing list
> >> Bioconductor at r-project.org
> >> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> 
> 
> 
> -- 
> Vince Buffalo
> Statistical Programmer
> Bioinformatics Core
> UC Davis Genome Center
> vincebuffalo.com twitter.com/vsbuffalo
> 



More information about the Bioconductor mailing list