[BioC] DiffBind -error with dba.counts
    Gordon Brown 
    Gordon.Brown at cruk.cam.ac.uk
       
    Wed Sep 18 10:48:52 CEST 2013
    
    
  
Hi, Anitha,
It's almost certainly running out of memory, then.  If your reads are in
BAM format, you can try the "bLowMem" option of dba.count, which reduces the
memory usage significantly, at some cost in performance (though in your
case it should speed things up dramatically).  (The format of the peaks
doesn't matter, but the reads must be sorted, indexed BAM.)  From the
dba.count documentation:
"bLowMem: logical indicating that the low-memory options should be used
for counting (using 'summarizeOverlaps'). This option is slower but memory
use does not increase with the number of reads to count. If 'TRUE', all
read files must be BAM (.bam extension), with associated index files
(.bam.bai extension). 'insertLength' must be absent."
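Since the low-memory path needs an index next to every BAM file, a quick
check before calling dba.count can save a failed run.  This is only a
minimal sketch in base R: it assumes each index is named by appending
".bai" to the BAM file name, as the documentation quoted above describes.
"missing_bam_index" is a hypothetical helper, not part of DiffBind, and the
file names are the ones from the sample sheet in this thread.

```r
# Hypothetical helper (not part of DiffBind): return the BAM files that
# have no "<file>.bai" index sitting next to them.
missing_bam_index <- function(bam_files) {
  is_bam <- grepl("\\.bam$", bam_files)
  if (!all(is_bam)) {
    stop("not BAM files: ", paste(bam_files[!is_bam], collapse = ", "))
  }
  bam_files[!file.exists(paste0(bam_files, ".bai"))]
}

# File names taken from the sample sheet in this thread; adjust paths as needed.
bams <- c("M_meiocytes_H3K4me3.bam", "InM_input_meiocytes.bam",
          "S_seedling_H3K4me3.bam", "InS_input_seedling.bam")
missing_bam_index(bams)   # character(0) means every BAM has an index
```

If any files are listed, they can be indexed with "samtools index" or
with indexBam() from the Rsamtools package (after sorting, if needed).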
Also try "bParallel=FALSE".  By default dba.count runs as many parallel
threads for counting as there are processors in your computer;
"bParallel=FALSE" ensures that it only runs one at a time, hence using
much less memory.
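Putting the two suggestions together, the call might look like the sketch
below.  It assumes DiffBind is installed, that your version of dba.count
accepts the bLowMem and bParallel arguments as described in this thread,
and that B73.H3K4 is the object already built from your sample sheet.
"count_low_mem" is a hypothetical wrapper, for illustration only.

```r
# Hypothetical wrapper: count reads using the low-memory path, with
# parallel counting switched off.
# Assumes a DiffBind version whose dba.count() has the bLowMem and
# bParallel arguments discussed in this thread.
count_low_mem <- function(dba_obj, min_overlap = 2) {
  if (!requireNamespace("DiffBind", quietly = TRUE)) {
    stop("DiffBind is not installed")
  }
  DiffBind::dba.count(dba_obj,
                      minOverlap = min_overlap,
                      bLowMem    = TRUE,    # low-memory counting option
                      bParallel  = FALSE)   # do not count samples in parallel
}

# Usage, once the sample sheet has been read in with dba():
# B73.H3K4 <- count_low_mem(B73.H3K4)
```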
Hope this helps.  We plan that the next release will remove the
requirement that reads be BAM format for the bLowMem option.
Cheers,
 - Gord
On 2013-09-17 18:01, "Anitha Sundararajan" <asundara at ncgr.org> wrote:
>Hi Gordon
>
>Please see below the session info:
>
> > sessionInfo()
>R version 3.0.1 (2013-05-16)
>Platform: x86_64-apple-darwin10.8.0 (64-bit)
>
>locale:
>[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
>attached base packages:
>[1] parallel  stats     graphics  grDevices utils     datasets  methods   base
>
>other attached packages:
>[1] DiffBind_1.6.2       Biobase_2.20.1       GenomicRanges_1.12.5 IRanges_1.18.3       BiocGenerics_0.6.0   BiocInstaller_1.10.3
>
>loaded via a namespace (and not attached):
>  [1] amap_0.8-7         edgeR_3.2.4        gdata_2.13.2       gplots_2.11.3      gtools_3.0.0       limma_3.16.7       RColorBrewer_1.0-5 stats4_3.0.1
>  [9] tools_3.0.1        zlibbioc_1.6.0
>
>
>I have anywhere from 30-55 million reads for my samples. Yes, everything
>else on the machine does slow down quite a bit.
>
>I am running R locally now as we do not have R 3.0.1 installed on
>command line. Not sure if that matters.
>
>Thanks for all your help.
>
>Anitha
>
>On 9/17/13 3:05 AM, Gordon Brown wrote:
>> Hi, Anitha,
>>
>> What version of Bioconductor/DiffBind are you running, and how much
>> memory does your computer have?  Older versions of DiffBind use a *lot*
>> of memory in the counting stage, so if your computer is short on RAM, it
>> could easily run out of memory and start swapping to disk, which will
>> slow it down by orders of magnitude.  Does everything else on the
>> machine slow down as well?
>>
>> Can you pass along the output from the "sessionInfo()" command?
>>
>> And if possible, upgrade to the latest version of DiffBind (if you're
>> not there already) and try the "bLowMem" option on dba.count.
>>
>> Other than that, I can't think of any reason it should take hours,
>> unless you have *really* big data files.  How many reads are in them,
>> roughly?
>>
>>   - Gord
>>
>>
>> On 2013-09-16 21:21, "Anitha Sundararajan" <asundara at ncgr.org> wrote:
>>
>>> Sorry, I did try minOverlap=2 (I didn't correct the command when I
>>> wrote the email, my bad).
>>>
>>>
>>> On 9/16/13 1:59 PM, Anitha Sundararajan wrote:
>>>> Hi Gordon
>>>>
>>>> I am now trying to run both reps for each sample, despite their low
>>>> correlation.  When I try the
>>>>
>>>>> B73.H3K4=dba.count(B73.H3K4, minOverlap=3)
>>>> the R-session just freezes and there is no response for hours.  I am
>>>> not sure if there is anything wrong with any of my input files.  The
>>>> sample sheet gets read in fine without any errors.
>>>>
>>>> Just FYI, my bed file (from MACS2) looks like:
>>>>
>>>>
>>>> chr1    9128    9552    MACS_peak_1     105.25
>>>> chr1    9918    10127   MACS_peak_2     4.72
>>>> chr1    79482   79691   MACS_peak_3     5.10
>>>> chr1    86963   87514   MACS_peak_4     50.23
>>>> chr1    94579   94781   MACS_peak_5     5.10
>>>> chr1    103763  103997  MACS_peak_6     5.10
>>>> chr1    110722  111047  MACS_peak_7     97.69
>>>> chr1    144929  145568  MACS_peak_8     127.78
>>>> chr1    161344  162320  MACS_peak_9     136.89
>>>> chr1    222479  223058  MACS_peak_10    77.67
>>>> chr1    227130  227628  MACS_peak_11    17.02
>>>> chr1    263835  263971  MACS_peak_12    12.60
>>>> chr1    264068  264518  MACS_peak_13    58.01
>>>> chr1    264625  265056  MACS_peak_14    68.16
>>>> chr1    270509  271086  MACS_peak_15    47.15
>>>> chr1    277629  277789  MACS_peak_16    13.25
>>>>
>>>> Not sure if this is the problem?
>>>>
>>>> Thanks so much.
>>>>
>>>> Anitha
>>>>
>>>> On 9/16/13 3:51 AM, Gordon Brown wrote:
>>>>> Hi, Anitha,
>>>>>
>>>>> The basic problem is that you have two samples, but you're asking
>>>>> for a minOverlap of 3 (i.e. for peaks which occur in at least 3
>>>>> samples).  No locations can satisfy that criterion, so you end up
>>>>> with an empty set of peaks.
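To make the minOverlap point concrete, here is a toy illustration in base
R.  It is not DiffBind's actual code, and the occupancy values are made up;
"peaks_kept" is a hypothetical helper that just counts how many merged
peaks occur in at least the requested number of samples.

```r
# Toy occupancy table: one row per merged peak, one column per sample,
# TRUE where that sample has a peak there (values are made up).
occupancy <- cbind(meio.1 = c(TRUE, TRUE, FALSE, TRUE),
                   seed.1 = c(TRUE, FALSE, TRUE, TRUE))

# Number of peaks that occur in at least `min_overlap` samples.
peaks_kept <- function(occ, min_overlap) {
  sum(rowSums(occ) >= min_overlap)
}

peaks_kept(occupancy, 2)   # peaks present in both samples
peaks_kept(occupancy, 3)   # always 0 when there are only two samples
```

With two columns, no row can sum to 3, so the filtered peak set is empty
whatever the peaks look like.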
>>>>>
>>>>> The message is obscure, I will admit.  (It happens because DiffBind
>>>>> writes out the unified set of peaks and reads it back in, for tedious
>>>>> implementation reasons, and when it reads it back in, there are no
>>>>> peaks, hence "no lines available in input".)
>>>>>
>>>>> Try using minOverlap=2.   But... having said that, I'm not sure how
>>>>> useful DiffBind will be to you, without replicates.
>>>>>
>>>>> Cheers,
>>>>>
>>>>>    - Gord Brown
>>>>>
>>>>>
>>>>>
>>>>>> Message: 22
>>>>>> Date: Fri, 13 Sep 2013 12:21:02 -0600
>>>>>> From: Anitha Sundararajan <asundara at ncgr.org>
>>>>>> To: bioconductor at r-project.org
>>>>>> Subject: [BioC] DiffBind -error with dba.counts
>>>>>> Message-ID: <5233578E.3090701 at ncgr.org>
>>>>>> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>>>>>>
>>>>>> Hi
>>>>>>
>>>>>> I have been trying to use DiffBind to analyze our ChIP-seq data and
>>>>>> have been running into some errors repeatedly.
>>>>>>
>>>>>> I first created a samplesheet.csv describing my samples and it looks
>>>>>> like this:
>>>>>>
>>>>>> SampleID,Tissue,Factor,Condition,Replicate,bamReads,bamControl,Peaks,PeakCaller
>>>>>> meio.1,meiocytes,H3K4me3,N,1,M_meiocytes_H3K4me3.bam,InM_input_meiocytes.bam,meio.vs.in.rep1.def_peaks.bed,MACS
>>>>>> seed.1,seedlings,H3K4me3,N,1,S_seedling_H3K4me3.bam,InS_input_seedling.bam,seed.vs.in.rep1.def_peaks.bed,MACS
>>>>>>
>>>>>> I only have two samples (and their respective inputs) with one rep
>>>>>> each, and the peaks were called using MACS v2. The peak caller
>>>>>> generated .bed files, which were used in DiffBind.
>>>>>>
>>>>>>
>>>>>> I defined the working directory in R first.
>>>>>>
>>>>>> I then read the sample sheet in :
>>>>>>> H3K4.B73=dba(sampleSheet='samplesheet2.csv',peakFormat='bed')
>>>>>>> H3K4.B73
>>>>>> 2 Samples, 38870 sites in matrix (45304 total):
>>>>>>       ID    Tissue  Factor Condition Replicate Peak.caller Intervals
>>>>>> 1 meio.1 meiocytes H3K4me3         N         1        MACS     44124
>>>>>> 2 seed.1 seedlings H3K4me3         N         1        MACS     41596
>>>>>>
>>>>>> generated a plot,
>>>>>>> plot(H3K4.B73)
>>>>>> And then when I tried to run dba.count, it continuously fails on
>>>>>> me.  I went through the thread to find similar posts and could not
>>>>>> find a solution.  I tried the following command:
>>>>>>
>>>>>>> H3K4.B73=dba.count(H3K4.B73, minOverlap=3)
>>>>>> and this,
>>>>>>> H3K4.B73=dba.count(H3K4.B73, minOverlap=3, bLowMem=TRUE)
>>>>>>> H3K4.B73=dba.count(H3K4.B73, minOverlap=3, bLowMem=FALSE)
>>>>>> And they all failed.
>>>>>>
>>>>>> My error in all three cases is as follows:
>>>>>> Error in read.table(fn, skip = skipnum) : no lines available in input
>>>>>>
>>>>>> Please let me know if you have any insights on it.
>>>>>>
>>>>>> Thanks so much for your help in advance.
>>>>>>
>>>>>> Anitha Sundararajan Ph.D.
>>>>>> Research Scientist
>>>>>> National Center for Genome Resources
>>>>>> Santa Fe, NM 87505
>
    
    