[Bioc-sig-seq] Slow/hanged QA on Illumina Data
Harris A. Jaffee
hj at jhu.edu
Fri Sep 11 16:56:51 CEST 2009
On UNIX/Linux, which seems to be the case here, you can follow your R process
externally by attaching with strace (or truss, trace -- whatever it might be
called), i.e. do "strace -p <pid>". This should tell you which file is being
read at the moment, and how well. You can also do "ls -l /proc/<pid>/fd" for
a snapshot of the open files. Or try top or ps (with the right options).
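
A minimal sketch of the /proc snapshot described above, assuming a Linux
/proc filesystem. The function names open_files and mem_status are
hypothetical helpers made up for this illustration, and the shell's own
PID ($$) stands in for the PID of the R process:

```shell
#!/bin/sh
# Assumption: Linux, where /proc/<pid>/fd and /proc/<pid>/status exist.

# open_files PID: list the files the process currently has open.
# Each entry is a symlink from a descriptor number to the open file.
open_files() {
    ls -l "/proc/$1/fd"
}

# mem_status PID: resident memory (VmRSS) and, where the kernel
# reports it, swapped-out memory (VmSwap) for the process.
mem_status() {
    grep -E '^(VmRSS|VmSwap):' "/proc/$1/status"
}

pid=$$   # stand-in: replace with the PID of the running R process
open_files "$pid"
mem_status "$pid"
```

Running the two functions a few times, some seconds apart, should show
whether the process has moved on to another input file and whether its
memory is still growing. Inspecting another user's process generally needs
the matching permissions.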
On Sep 10, 2009, at 6:16 PM, Martin Morgan wrote:
> Pratap, Abhishek wrote:
>> Hi Ivan
>>
>> I suspected it but am not 100% sure. The %CPU for my R process
>> fluctuates between 60 and 100, and swap usage looks OK to me.
>>
>> I remember there was some talk on the mailing list that dev
>> version (R/ShortRead) is a lot faster.
>
> Hi Abhishek,
>
> My experience is in the 5 minutes / lane range for qa, so it would seem
> to be running a long time. The ... arguments to qa are passed to the
> function that reads individual files (readAligned), so you can include a
> verbose=TRUE argument for a bit more chat. You might write a short
> script along the lines of
>
> gcinfo(TRUE)
> library(ShortRead)
> dirPath <- "some/directory"
> pattern <- "<some_pattern>"
> ## stop early unless the matched files are the ones expected
> stopifnot(all(list.files(dirPath, pattern) == <files I'm expecting>))
> qa <- qa(dirPath, pattern, type=<my type>, verbose=TRUE)
> save(qa, file=<some file>)
>
> Try running this from the command line
>
> R -f MyScript.R
>
> the gcinfo(TRUE) will cause R to start printing messages about
>
> Garbage collection 3 = 2+0+1 (level 0) ...
> 7.4 Mbytes of cons cells used (39%)
> 1.3 Mbytes of vectors used (21%)
> Garbage collection 4 = 3+0+1 (level 0) ...
> 10.3 Mbytes of cons cells used (55%)
>
> which indicates that R is busy managing its memory even before starting
> to do real work. So give R more memory until it quiets down
>
> R --min-nsize=20M --min-vsize=4G -f MyScript.R
>
> (these values are my best guess at what is appropriate, the M is
> 'million', the 'G' Giga).
>
> qa() should be reading one file at a time, so the memory requirement is
> for the largest (product of reads and cycles) lane. You should be able
> to get a handle on the size of that using readAligned().
>
> How many reads and cycles are there in your data?
>
> Martin
>
>> Thanks,
>> -Abhi
>>
>> -----Original Message-----
>> From: Ivan Gregoretti [mailto:ivangreg at gmail.com]
>> Sent: Thursday, September 10, 2009 5:01 PM
>> To: Pratap, Abhishek
>> Cc: Martin Morgan; bioc-sig-sequencing at r-project.org
>> Subject: Re: [Bioc-sig-seq] Slow/hanged QA on Illumina Data
>>
>> It sounds like you may have run out of memory in your linux box.
>>
>> When I run qa() in my 16GB machine, it usually uses ~14GB just for
>> this qa() process.
>>
>> That is for 36 bases. Maybe, if you are running 75 bases, you just
>> used all the RAM.
>>
>> Is the processor running at 100%? Check by issuing 'top' at the command
>> line. If it is, then you are good.
>>
>> 'top' can also tell you if you are swapping wildly. (Swapping is when
>> your machine runs out of RAM and starts storing data in a
>> temporary location on your hard drive to avoid crashing.)
>>
>> Ivan
>>
>>
>> Ivan Gregoretti, PhD
>> National Institute of Diabetes and Digestive and Kidney Diseases
>> National Institutes of Health
>> 5 Memorial Dr, Building 5, Room 205.
>> Bethesda, MD 20892. USA.
>> Phone: 1-301-496-1592
>> Fax: 1-301-496-9878
>>
>>> sessionInfo()
>> R version 2.10.0 Under development (unstable) (2009-08-12 r49169)
>> x86_64-unknown-linux-gnu
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>>
>>
>>
>>
>>
>>
>> On Thu, Sep 10, 2009 at 4:35 PM, Pratap, Abhishek
>> <APratap at som.umaryland.edu> wrote:
>>> Hi Martin
>>>
>>>
>>>
>>> I am noticing lethargic, or maybe hung, processing with the qa()
>>> function in ShortRead. I know I have raised this question before.
>>> Recently I updated my R to the dev version and installed the latest
>>> Bioconductor. Currently I am trying to run qa() on 8 lanes of data for
>>> 75 bp reads. The machine has 16 cores and 16 GB RAM.
>>>
>>>
>>>
>>> It has been two hours since the processing started. Does it
>>> usually take so long? I am not sure. Will using Rmpi help?
>>>
>>>
>>>
>>>
>>>
>>> Thanks,
>>>
>>> -Abhi
>>>
>>>
>>>
>>> sessionInfo()
>>>
>>> R version 2.9.2 (2009-08-24)
>>>
>>> x86_64-unknown-linux-gnu
>>>
>>>
>>>
>>> locale:
>>>
>>> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;
>>> LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;
>>> LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;
>>> LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
>>>
>>>
>>>
>>> attached base packages:
>>>
>>> [1] stats graphics grDevices utils datasets methods base
>>>
>>>
>>>
>>>
>>> other attached packages:
>>>
>>> [1] ShortRead_1.2.1 lattice_0.17-25 BSgenome_1.12.3 Biostrings_2.12.8
>>>
>>> [5] IRanges_1.2.3
>>>
>>>
>>>
>>> loaded via a namespace (and not attached):
>>>
>>> [1] Biobase_2.4.1 grid_2.9.2 hwriter_1.1
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> Bioc-sig-sequencing at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>