[Bioc-sig-seq] ShortRead: puzzled with memory issues over readAligned, readFasta and qa

Martin Morgan mtmorgan at fhcrc.org
Wed Aug 26 15:41:48 CEST 2009


Hi Leonardo --

Leonardo Collado Torres wrote:
> Hello BioC-sig community,
> 
> Today I tried using a few read functions on the ShortRead package on two
> computers with our data for the first time and I'm a puzzled over it :S
> 
> First, due to the type of experiment, I want to use the chipseq package
> and its functions for my reads that I aligned with Bowtie:
> reads <- readAligned(dirPath=".", pattern="s4out.map$", type="Bowtie") #
> Map file created with Bowtie, 1 211 771 708 bytes
> Works fine on both computers, and it does let me do:
> cs <- as(reads, "GenomeData") # Need it so I can use the chipseq package
> rm(reads)
> Though I did try first to read the map file for all 3 lanes and it
> crashed. That one is 3 253 091 186 bytes big.

yes usually you'll want to read and 'pre-process' one lane at a time.

> pat <- "^s4unaligned"
> read.fa <- readFasta(".", pattern=pat) # s4unaligned is 98 272 233
> bytes, s6unaligned is 162 923 413
> # Get the alphabetByCycle, then use:
> rm(read.fa)
> However, this fails with s7unaligned whose size is 389 127 532 bytes. I

I guess you mean that you receive an out-of-memory error?

readFasta uses Biostrings::readFASTA, which is a 'general purpose'
reader. Probably your fasta file is very regular, with just 2 lines for
each read. In this case you can try

  fa = readLines("myfile.fasta")
  sr = ShortRead(sread=DNAStringSet(fa[c(FALSE, TRUE)]),
                 id=BStringSet(fa[c(TRUE, FALSE)]))

this will also work with a compressed file, replacing "myfile.fasta"
with gzfile("myfile.fasta.gz")

I've recently found that setting minimum memory limits for R can be a
big performance boost, at least the first time R reads large objects.
For instance

  % R --silent --min-nsize=13M --min-vsize=1G -e \
      "gcinfo(TRUE); fa = readLines('my.fasta'); gc()"

takes about 45 seconds, whereas without the memory limits it takes about
4.5 minutes. The output of gc() is

           used  (Mb) gc trigger   (Mb)  max used  (Mb)
Ncells 10101399 539.5   13252057  707.8  10660571 569.4
Vcells 76386505 582.8  141008614 1075.9 117938816 899.9

and the idea is to tune --min-nsize so that 'Ncells' 'gc trigger' is
larger than 'Ncells 'max used', and likewise for --min-vsize, and that
gcinfo(TRUE) doesn't report any garbage collection. see ?Memory and R --help

> created these files with Bowtie and the s7 tail looks like this (looks
> normal to me):
> "
>>HWUSI-EAS636:7:100:1790:1731#0/1
> AGAGTTCTACAGTCCGACGATCAGAGTTCTACAGTC
>>HWUSI-EAS636:7:100:1790:1921#0/1
> AGAGTTCTACAGTCCGACGATCAGAGTTCTACAGTC
>>HWUSI-EAS636:7:100:1790:365#0/1
> AGAGTTCTACAGTTCGTATGCCTTCTTCTGCTTGAA
>>HWUSI-EAS636:7:100:1790:1753#0/1
> ATCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAA
>>HWUSI-EAS636:7:100:1790:285#0/1
> ANTTCTACAGTCCGACGATCAGTTCTACAGTCCGAC
> "
> I know from these two threads
> http://article.gmane.org/gmane.comp.lang.r.sequencing/398/match=shortread
> and
> http://article.gmane.org/gmane.comp.lang.r.sequencing/368/match=shortread
> that you need around 5x memory, but ~400 mb (~2 GB in ram) should still
> run on the two computers I'm using. Plus, I can readAligned some ~1.2 GB
> files. Although the files are in different formats, why do I get stuck
> with the smaller files?
> 
> Then, if I use:
> file <- "s4.out.map" # Map file created with Bowtie, 1 211 771 708 bytes
> qa <- qa(".", pattern=file, type="Bowtie")
> report(qa, type="html")
> It works fine on my lap top but it hangs and dies on the Mac, which has
> more ram! I didn't expect this at all. Through qa I can get the alphabet

I'm not sure why this is, partly because I don't know what 'hangs and
dies' means. Is it failing during qa()? I'd check carefully that
list.files(".", pattern=file) returns exactly what you want;
"^s4.out.map$" is the pattern that matches an exact file name, from
beginning to end and nothing more.

> by cycle for the reads that got aligned, but not for those that didn't.
> Hence why I was using the readFasta function above (fine for s4 and s6
> but not s7). I want to repeat this once I get access to the fastq files :)
> 
> Any pointers will be greatly appreciated :)
> 
> Thanks,
> Leonardo
> 
> 
> 
> Computer info:
> 
> # Mac comp with 2 quads and 8 GB memory
> sessionInfo()
> R version 2.10.0 Under development (unstable) (2009-08-16 r49268)
> i386-apple-darwin9.7.0
> 
> locale:
> [1] C
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base   
> other attached packages:
> [1] ShortRead_1.3.25   lattice_0.17-25    BSgenome_1.13.10  
> Biostrings_2.13.32
> [5] IRanges_1.3.53  
> loaded via a namespace (and not attached):
> [1] Biobase_2.5.5 grid_2.10.0   hwriter_1.1
> 
> 
> # My laptop with 2 GB memory and 4 GB swap memory
> sessionInfo()
> R version 2.10.0 Under development (unstable) (2009-07-25 r48998)
> i686-pc-linux-gnu
> 
> locale:
> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C             [3]
> LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8   [5]
> LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8  [7]
> LC_PAPER=en_US.UTF-8       LC_NAME=C                [9]
> LC_ADDRESS=C               LC_TELEPHONE=C           [11]
> LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C     
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base   
> other attached packages:
> [1] ShortRead_1.3.22   lattice_0.17-25    BSgenome_1.13.10  
> Biostrings_2.13.29
> [5] IRanges_1.3.48  
> loaded via a namespace (and not attached):
> [1] Biobase_2.5.5 grid_2.10.0   hwriter_1.1



More information about the Bioc-sig-sequencing mailing list