[Bioc-sig-seq] Slow/hanged QA on Illumina Data

Fri Sep 11 00:16:06 CEST 2009

Pratap, Abhishek wrote:
> Hi Ivan
> 
> I suspected that, but I am not 100% sure. CPU usage for the R process fluctuates between 60% and 100%, and swap usage looks OK to me.
> 
> I remember there was some talk on the mailing list that the dev version (R/ShortRead) is a lot faster.

Hi Abhishek,

In my experience qa takes about 5 minutes per lane, so your job does
seem to be running a long time. The ... arguments to qa are passed to the
function that reads individual files (readAligned), so you can include a
verbose=TRUE argument for a bit more progress output. You might write a
short script along the lines of

  gcinfo(TRUE)   # report each garbage collection
  library(ShortRead)
  dirPath <- "some/directory"
  pattern <- "<some_pattern>"
  ## stop early unless the pattern matches exactly the expected files
  stopifnot(identical(list.files(dirPath, pattern), <files I'm expecting>))
  qa <- qa(dirPath, pattern, type=<my type>, verbose=TRUE)
  save(qa, file=<some file>)

Try running this from the command line

  R -f MyScript.R

The gcinfo(TRUE) call will cause R to print messages such as

Garbage collection 3 = 2+0+1 (level 0) ...
7.4 Mbytes of cons cells used (39%)
1.3 Mbytes of vectors used (21%)
Garbage collection 4 = 3+0+1 (level 0) ...
10.3 Mbytes of cons cells used (55%)

which indicate that R is busy managing its memory even before starting
to do real work. If you see many of these, give R more memory until it
quiets down:

  R --min-nsize=20M --min-vsize=4G -f MyScript.R

(These values are my best guess at what is appropriate; the 'M' is
'million' and the 'G' is 'giga'.)
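
To see what the gcinfo() messages correspond to, here is a minimal sketch in base R. Nothing in it is specific to ShortRead, and the list allocation is just a made-up workload to provoke a few collections. gc() summarizes the same two pools, cons cells (Ncells) and vector heap (Vcells), whose minimum sizes --min-nsize and --min-vsize set.

```r
## Turn on per-collection reporting; gcinfo() returns the previous setting.
old <- gcinfo(TRUE)

## Made-up workload: allocate many small objects so R has to collect.
x <- lapply(seq_len(1e5), function(i) c(i, i))

## Restore the previous reporting setting.
gcinfo(old)

## Summary of current usage; the rows are "Ncells" and "Vcells".
usage <- gc()
print(usage)
```

If collections are still reported constantly once the real job is under way, that suggests the starting sizes are too small for the data.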

qa() should be reading one file at a time, so the memory requirement is
set by the largest lane, measured as the product of reads and cycles.
You should be able to get a handle on the size of that lane by reading
it on its own with readAligned().
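
For a back-of-the-envelope figure, a sketch like the following may help. The function name estimate_lane_bytes and the read counts are made up for illustration, and the estimate assumes one byte per base call plus one byte per quality score. It ignores read identifiers, alignment columns, and any temporary copies, so treat it as a lower bound only.

```r
## Hypothetical helper: rough lower bound on bytes needed to hold one lane.
## Assumes one byte per base call plus one byte per quality score.
estimate_lane_bytes <- function(n_reads, n_cycles, bytes_per_position = 2) {
  n_reads * n_cycles * bytes_per_position
}

## Example with made-up numbers: 10 million reads of 75 cycles.
bytes <- estimate_lane_bytes(10e6, 75)
cat(sprintf("about %.1f GiB for bases and qualities alone\n", bytes / 2^30))
```

Actual usage during qa() is likely to be well above this number, so the object size measured after a real readAligned() call is the better guide.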

How many reads and cycles are there in your data?

Martin

> Thanks,
> -Abhi
> 
> -----Original Message-----
> From: Ivan Gregoretti [mailto:ivangreg at gmail.com] 
> Sent: Thursday, September 10, 2009 5:01 PM
> To: Pratap, Abhishek
> Cc: Martin Morgan; bioc-sig-sequencing at r-project.org
> Subject: Re: [Bioc-sig-seq] Slow/hanged QA on Illumina Data
> 
> It sounds like you may have run out of memory on your Linux box.
> 
> When I run qa() on my 16GB machine, it usually uses ~14GB just for
> this qa() process.
> 
> That is for 36 bases. Maybe, if you are running 75 bases, you just
> used all the RAM.
> 
> Is the processor running at 100%? Check by running 'top' at the command
> line. If it is, then you are good.
> 
> 'top' can also tell you if you are swapping wildly. (Swapping is when
> your machine runs out of RAM and starts storing data in a
> temporary location on your hard drive to avoid crashing.)
> 
> Ivan
> 
> 
> Ivan Gregoretti, PhD
> National Institute of Diabetes and Digestive and Kidney Diseases
> National Institutes of Health
> 5 Memorial Dr, Building 5, Room 205.
> Bethesda, MD 20892. USA.
> Phone: 1-301-496-1592
> Fax: 1-301-496-9878
> 
>> sessionInfo()
> R version 2.10.0 Under development (unstable) (2009-08-12 r49169)
> x86_64-unknown-linux-gnu
> 
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> 
> On Thu, Sep 10, 2009 at 4:35 PM, Pratap, Abhishek
> <APratap at som.umaryland.edu> wrote:
>> Hi Martin,
>>
>> I am noticing sluggish, or maybe hung, processing with the qa()
>> function in ShortRead. I know I have raised this question before.
>> Recently I updated my R to the dev version and installed the latest
>> Bioconductor. Currently I am trying to run qa() on 8 lanes of data for
>> 75 bp reads. The machine has 16 cores and 16 GB of RAM.
>>
>> The processing has been going on for two hours. Does it
>> usually take this long? I am not sure. Will using Rmpi help?
>>
>> Thanks,
>>
>> -Abhi
>>
>> sessionInfo()
>> R version 2.9.2 (2009-08-24)
>> x86_64-unknown-linux-gnu
>>
>> locale:
>> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] ShortRead_1.2.1   lattice_0.17-25   BSgenome_1.12.3   Biostrings_2.12.8
>> [5] IRanges_1.2.3
>>
>> loaded via a namespace (and not attached):
>> [1] Biobase_2.4.1 grid_2.9.2    hwriter_1.1
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>