[BioC] Rsubread package usage and speed questions

Tue May 29 17:33:34 CEST 2012

In case others run into the same issue, here is an update:

Wei suggested that the slowdown was caused by slow I/O in my virtual machine, and sure enough that was the cause. More precisely, I was accessing the data from a shared folder.

On a native Linux installation, the same code (below) ran in about 500 seconds, although on different hardware.
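
For anyone wanting to check whether a working directory has the same problem, here is a minimal sketch in base R that times a test write. The function name measure_write_rate and the 16 MB probe size are illustrative assumptions, not part of Rsubread, and operating-system caching can make the figure look better than the sustained rate really is.

```r
# Hypothetical helper (not part of Rsubread): time a test write into `dir`
# and return the approximate throughput in MB/s.
measure_write_rate <- function(dir, size_mb = 16) {
  stopifnot(dir.exists(dir))
  payload <- raw(size_mb * 1024^2)            # zero-filled buffer of size_mb MB
  target <- tempfile(pattern = "write_probe_", tmpdir = dir)
  on.exit(unlink(target), add = TRUE)         # always remove the probe file
  elapsed <- system.time({
    con <- file(target, open = "wb")
    writeBin(payload, con)
    close(con)
  })[["elapsed"]]
  size_mb / max(elapsed, 1e-6)                # guard against a zero timing
}

# Example: compare a local directory with the shared folder from this thread.
# measure_write_rate(tempdir())
# measure_write_rate("/mnt/hgfs/Z/myprj/")
```

A large gap between the two numbers would point to the shared folder rather than the aligner.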

Thanks to Wei for the help and for this fine package.

Wade

-----Original Message-----
From: Davis, Wade [mailto:davisjwa at health.missouri.edu] 
Sent: Sunday, May 27, 2012 9:01 AM
To: bioconductor at r-project.org
Subject: [BioC] Rsubread package usage and speed questions

Dear Wei Shi,
I've just started using Rsubread, and I had a few questions.

I've got a FASTQ file with about 6.3M reads. I built my reference index (mm10).

extdataDir<-"/mnt/hgfs/Z/"
setwd(extdataDir)
buildindex(basename="mm10_rsubread_index",reference="mm10.fa",memory=10000)

I then aligned using the following code:

#https://stat.ethz.ch/pipermail/bioconductor/2012-February/043552.html
#http://permalink.gmane.org/gmane.science.biology.informatics.conductor/34709

setwd("/mnt/hgfs/Z/myprj/")
align(
  index = "/mnt/hgfs/Z/mm10_rsubread_index",
  readfile1 = "cutadapt_1-Feb_ATCACG_L003_R1_001.fastq",
  output_file = "cutadapt_1-Feb_ATCACG_L003_R1_001.subread.sam",
  nthreads = 4,
  indels = 2,
  TH1 = 2)

Based on the screen output, it took about 7200 seconds in total, but the "aligning" portion was about 3900 s. The "saving the result" portion would then seem to have taken about 3300 s. This is consistent with the write speed I observed (~300 KB/s) and the SAM file size (1.16 GB).
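
The consistency check above can be reproduced with a few lines of R. This is only a rough sketch: the helper name estimate_write_seconds is made up for illustration, and decimal units (1 GB = 1e9 bytes, 1 KB = 1e3 bytes) are an assumption, since the post does not say which convention was used.

```r
# Hypothetical helper: expected time to write a file of `size_bytes`
# at a sustained rate of `rate_bytes_per_s`.
estimate_write_seconds <- function(size_bytes, rate_bytes_per_s) {
  stopifnot(size_bytes >= 0, rate_bytes_per_s > 0)
  size_bytes / rate_bytes_per_s
}

sam_size_bytes <- 1.16e9   # 1.16 GB SAM file (decimal units assumed)
write_rate     <- 300e3    # ~300 KB/s observed write speed

estimate_write_seconds(sam_size_bytes, write_rate)
```

Under these assumptions the estimate comes out near 3900 s, the same order as the roughly 3300 s inferred from the screen output, which is about as close as an eyeballed write speed allows.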

So my questions are:

1)      Does this alignment speed seem reasonable for this situation? Based on what I read on the mailing list, I was expecting it to be a little faster. (It does seem to be faster than novoalign at about 9500 s.) I am not complaining about your package; I just want to make sure I have the settings correct.

2)      Does the 'saving the result' time seem normal (as nebulous as that term is)? Is that step bound by disk write speed? This seems to take a very long time, so I suspect there is more going on than just writing to disk.

3)      Any recommendations/tweaks on the speed? I have about 50 files like this, and I was hoping to try it on much larger files and in parallel (different files).
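
For the multi-file case in question 3, one possible arrangement is sketched below. The helper names sam_name_for and align_all are hypothetical; the align() arguments are copied from the call earlier in this post (Rsubread 1.6.3) and may differ in other versions. Files are processed one after another here, on the assumption that several concurrent jobs would compete for the same disk if I/O is the limiting factor.

```r
# Hypothetical helper: derive the SAM output name from a FASTQ file name,
# following the naming used in the align() call above.
sam_name_for <- function(fastq) {
  sub("\\.fastq$", ".subread.sam", basename(fastq))
}

# Hypothetical wrapper: align every FASTQ file in `fastq_dir` in turn.
# Requires Rsubread; argument names are taken from the call in this post.
align_all <- function(fastq_dir, index, out_dir = fastq_dir, nthreads = 4) {
  fastq_files <- list.files(fastq_dir, pattern = "\\.fastq$", full.names = TRUE)
  for (f in fastq_files) {
    Rsubread::align(
      index = index,
      readfile1 = f,
      output_file = file.path(out_dir, sam_name_for(f)),
      nthreads = nthreads,
      indels = 2,
      TH1 = 2)
  }
  invisible(fastq_files)
}

# Example (paths from this post):
# align_all("/mnt/hgfs/Z/myprj/", index = "/mnt/hgfs/Z/mm10_rsubread_index")
```

Pointing out_dir at a local disk rather than a shared folder keeps the SAM output off the slow path.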

I'd be happy to send you the FASTQ file if you like (991 MB).

Another question, which may impact the speed: is it OK to use Rsubread to align sequences of varying lengths? I started with 50 bp single end reads, but I needed to trim some adapter sequences. Most reads are still 50bp, but there are some shorter sequences.

In case it matters, my hardware: 68 GB RAM with 2x 4-core 3.16 GHz Xeons running Ubuntu 12.04 in a VM.

My session info is given below.

Thanks,
Wade

sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] Rsubread_1.6.3

loaded via a namespace (and not attached):
[1] tools_2.15.0
