[Bioc-sig-seq] Slow/hanged QA on Illumina Data

Fri Sep 11 16:56:51 CEST 2009

On UNIX/Linux, which seems to be the case here, you can follow your R process
externally by attaching with strace (or truss, trace -- whatever it might be
called), i.e. do "strace -p <pid>".  This should tell you which file is being
read at the moment, and how well.  You can also do "ls -l /proc/<pid>/fd" for
a snapshot of the open files.  Or try top or ps (with the right options).
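
A minimal sketch of the snapshot idea, assuming Linux with /proc mounted and
a procps-style ps.  The function name "snapshot" is made up for illustration,
and the ps column list is only one reasonable choice of "the right options":

```shell
#!/bin/sh
# snapshot PID: print CPU/memory usage and open files for one process.
# Assumption: Linux with /proc; "snapshot" is a hypothetical helper name.
snapshot() {
    pid=$1
    if [ ! -d "/proc/$pid" ]; then
        echo "no such process: $pid" >&2
        return 1
    fi
    # CPU %, memory %, resident and virtual size (KiB), state, elapsed time.
    # Fall through with a note if this ps does not accept these columns.
    ps -o pid,pcpu,pmem,rss,vsz,stat,etime -p "$pid" \
        || echo "ps not usable here" >&2
    # Resident and swapped-out memory as reported by the kernel
    # (VmSwap is only present on reasonably recent kernels).
    grep -E '^Vm(RSS|Swap):' "/proc/$pid/status"
    # Files the process currently has open.
    ls -l "/proc/$pid/fd"
}

# To watch the reads live instead of taking a snapshot (needs permission
# to attach to the process):
#   strace -p <pid> -e trace=read,openat
```

Call it as "snapshot <pid>" with the pid of the R process, a few seconds
apart.  If the open input file changes between calls, qa() is moving from
lane to lane; if VmSwap keeps growing, the machine is probably swapping.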

On Sep 10, 2009, at 6:16 PM, Martin Morgan wrote:

> Pratap, Abhishek wrote:
>> Hi Ivan
>>
>> I suspected it but am not 100% sure. The %CPU for my R process
>> fluctuates between 60 and 100, and swap usage looks OK to me.
>>
>> I remember there was some talk on the mailing list that dev  
>> version (R/ShortRead) is a lot faster.
>
> Hi Abhishek,
>
> My experience is in the 5 minutes / lane range for qa, so it would seem
> to be running a long time. The ... arguments to qa are passed to the
> function that reads individual files (readAligned), so you can include a
> verbose=TRUE argument for a bit more chat. You might write a short
> script along the lines of
>
>   gcinfo(TRUE)
>   library(ShortRead)
>   dirPath <- "some/directory"
>   pattern <- "<some_pattern>"
>   stopifnot(all(list.files(dirPath, pattern) == <files I'm expecting>))
>   qa <- qa(dirPath, pattern, type=<my type>, verbose=TRUE)
>   save(qa, file=<some file>)
>
> Try running this from the command line
>
>   R -f MyScript.R
>
> the gcinfo(TRUE) will cause R to start printing messages about
>
> Garbage collection 3 = 2+0+1 (level 0) ...
> 7.4 Mbytes of cons cells used (39%)
> 1.3 Mbytes of vectors used (21%)
> Garbage collection 4 = 3+0+1 (level 0) ...
> 10.3 Mbytes of cons cells used (55%)
>
> which indicates that R is busy managing its memory even before starting
> to do real work. So give R more memory until it quiets down
>
>   R --min-nsize=20M --min-vsize=4G -f MyScript.R
>
> (these values are my best guess at what is appropriate, the M is
> 'million', the 'G' Giga).
>
> qa() should be reading one file at a time, so the memory requirement is
> for the largest (product of reads and cycles) lane. You should be able
> to get a handle on the size of that using readAligned().
>
> How many reads and cycles are there in your data?
>
> Martin
>
>> Thanks,
>> -Abhi
>>
>> -----Original Message-----
>> From: Ivan Gregoretti [mailto:ivangreg at gmail.com]
>> Sent: Thursday, September 10, 2009 5:01 PM
>> To: Pratap, Abhishek
>> Cc: Martin Morgan; bioc-sig-sequencing at r-project.org
>> Subject: Re: [Bioc-sig-seq] Slow/hanged QA on Illumina Data
>>
>> It sounds like you may have run out of memory in your linux box.
>>
>> When I run qa() in my 16GB machine, it usually uses ~14GB just for
>> this qa() process.
>>
>> That is for 36 bases. Maybe, if you are running 75 bases, you just
>> used all the RAM.
>>
>> Is the processor running at 100%? Check it by issuing 'top' at the
>> command line. If it is, then you are good.
>>
>> 'top' can also tell you if you are swapping wildly. (Swapping is when
>> your machine runs out of RAM and starts storing data in a
>> temporary location on your hard drive to avoid crashing.)
>>
>> Ivan
>>
>>
>> Ivan Gregoretti, PhD
>> National Institute of Diabetes and Digestive and Kidney Diseases
>> National Institutes of Health
>> 5 Memorial Dr, Building 5, Room 205.
>> Bethesda, MD 20892. USA.
>> Phone: 1-301-496-1592
>> Fax: 1-301-496-9878
>>
>>> sessionInfo()
>> R version 2.10.0 Under development (unstable) (2009-08-12 r49169)
>> x86_64-unknown-linux-gnu
>>
>> locale:
>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> On Thu, Sep 10, 2009 at 4:35 PM, Pratap, Abhishek
>> <APratap at som.umaryland.edu> wrote:
>>> Hi  Martin
>>>
>>>
>>>
>>> I am seeing slow, or possibly hung, processing with the qa()
>>> function in ShortRead. I know I have raised this question before.
>>> Recently I updated my R to the dev version and installed the latest
>>> Bioconductor. Currently I am trying to run qa() on 8 lanes of data for
>>> 75 bp reads. The machine has 16 cores and 16 GB RAM.
>>>
>>>
>>>
>>> The processing has been running for two hours. Does it
>>> usually take this long? I am not sure. Will using Rmpi help?
>>>
>>>
>>>
>>>
>>>
>>> Thanks,
>>>
>>> -Abhi
>>>
>>>
>>>
>>> sessionInfo()
>>>
>>> R version 2.9.2 (2009-08-24)
>>>
>>> x86_64-unknown-linux-gnu
>>>
>>>
>>>
>>> locale:
>>>
>>> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
>>>
>>>
>>>
>>> attached base packages:
>>>
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>>
>>>
>>>
>>> other attached packages:
>>>
>>> [1] ShortRead_1.2.1   lattice_0.17-25   BSgenome_1.12.3   Biostrings_2.12.8
>>>
>>> [5] IRanges_1.2.3
>>>
>>>
>>>
>>> loaded via a namespace (and not attached):
>>>
>>> [1] Biobase_2.4.1 grid_2.9.2    hwriter_1.1
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> Bioc-sig-sequencing at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>