[Bioc-sig-seq] ShortRead: puzzled with memory issues over readAligned, readFasta and qa

Wed Aug 26 08:10:00 CEST 2009

Hello BioC-sig community,

Today I tried using a few read functions on the ShortRead package on two 
computers with our data for the first time and I'm a puzzled over it :S

First, due to the type of experiment, I want to use the chipseq package 
and its functions for my reads that I aligned with Bowtie:
reads <- readAligned(dirPath=".", pattern="s4out.map$", type="Bowtie") # 
Map file created with Bowtie, 1 211 771 708 bytes
Works fine on both computers, and it does let me do:
cs <- as(reads, "GenomeData") # Need it so I can use the chipseq package
rm(reads)
Though I did try first to read the map file for all 3 lanes and it 
crashed. That one is 3 253 091 186 bytes big.

pat <- "^s4unaligned"
read.fa <- readFasta(".", pattern=pat) # s4unaligned is 98 272 233 
bytes, s6unaligned is 162 923 413
# Get the alphabetByCycle, then use:
rm(read.fa)
However, this fails with s7unaligned whose size is 389 127 532 bytes. I 
created these files with Bowtie and the s7 tail looks like this (looks 
normal to me):
"
 >HWUSI-EAS636:7:100:1790:1731#0/1
AGAGTTCTACAGTCCGACGATCAGAGTTCTACAGTC
 >HWUSI-EAS636:7:100:1790:1921#0/1
AGAGTTCTACAGTCCGACGATCAGAGTTCTACAGTC
 >HWUSI-EAS636:7:100:1790:365#0/1
AGAGTTCTACAGTTCGTATGCCTTCTTCTGCTTGAA
 >HWUSI-EAS636:7:100:1790:1753#0/1
ATCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAA
 >HWUSI-EAS636:7:100:1790:285#0/1
ANTTCTACAGTCCGACGATCAGTTCTACAGTCCGAC
"
I know from these two threads 
http://article.gmane.org/gmane.comp.lang.r.sequencing/398/match=shortread 
and 
http://article.gmane.org/gmane.comp.lang.r.sequencing/368/match=shortread 
that you need around 5x memory, but ~400 mb (~2 GB in ram) should still 
run on the two computers I'm using. Plus, I can readAligned some ~1.2 GB 
files. Although the files are in different formats, why do I get stuck 
with the smaller files?

Then, if I use:
file <- "s4.out.map" # Map file created with Bowtie, 1 211 771 708 bytes
qa <- qa(".", pattern=file, type="Bowtie")
report(qa, type="html")
It works fine on my lap top but it hangs and dies on the Mac, which has 
more ram! I didn't expect this at all. Through qa I can get the alphabet 
by cycle for the reads that got aligned, but not for those that didn't. 
Hence why I was using the readFasta function above (fine for s4 and s6 
but not s7). I want to repeat this once I get access to the fastq files :)

Any pointers will be greatly appreciated :)

Thanks,
Leonardo

Computer info:

# Mac comp with 2 quads and 8 GB memory
sessionInfo()
R version 2.10.0 Under development (unstable) (2009-08-16 r49268)
i386-apple-darwin9.7.0

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base    

other attached packages:
[1] ShortRead_1.3.25   lattice_0.17-25    BSgenome_1.13.10   
Biostrings_2.13.32
[5] IRanges_1.3.53   

loaded via a namespace (and not attached):
[1] Biobase_2.5.5 grid_2.10.0   hwriter_1.1

# My laptop with 2 GB memory and 4 GB swap memory
sessionInfo()
R version 2.10.0 Under development (unstable) (2009-07-25 r48998)
i686-pc-linux-gnu

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C             
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8   
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8  
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                
 [9] LC_ADDRESS=C               LC_TELEPHONE=C           
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C      

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base    

other attached packages:
[1] ShortRead_1.3.22   lattice_0.17-25    BSgenome_1.13.10   
Biostrings_2.13.29
[5] IRanges_1.3.48   

loaded via a namespace (and not attached):
[1] Biobase_2.5.5 grid_2.10.0   hwriter_1.1 

-- 
Leonardo Collado Torres, Bachelor in Genomic Sciences
Professor at LCG and member of Dr. Enrique Morett's lab
UNAM Campus Cuernavaca, Mexico

Homepage: http://www.lcg.unam.mx/~lcollado/
Phone: [52] (777) 313-28-05