[BioC] ShortRead QA

Thu Jul 22 16:26:21 CEST 2010

I'm dealing with some Solexa/Illumina data with ShortRead for the first
time and had a couple of questions relating to QA:

1. Memory requirements: My data comprises 7 s_N_export.txt files, each
containing 10-20 million aligned reads. If I try to run qa() over the whole
directory my machine rapidly grinds to a halt. Running qa() on each file
individually keeps the machine responsive, but takes >1 hour per file. The
ShortRead vignette says evaluating a single lane can take 'several
minutes', so I'm wondering if anyone can offer any clues as to why I'm
struggling so much? The machine in question has 6GB of RAM - do I just need
more?

2. Read distribution: The QA results I'm getting for the 'read
distribution' section don't quite look like those presented in the example
ShortRead Solexa QA report. My interpretation is that this is because my
data is actually rather high quality, but I'd appreciate a second opinion. 

To quote from the ShortRead QA report: 

'Ideally, the cumulative proportion of reads will transition sharply from
low to high. Portions to the left of the transition might correspond
roughly to sequencing or sample processing errors, and correspond to reads
that are represented relatively infrequently [...]. Portions to the right
of the transition represent reads that are over-represented compared to
expectation.'

Typically the read distribution plots I'm seeing look like this:
http://dl.dropbox.com/u/419878/readOccurences.jpg

There is a sharp transition, but no portion to the left of it. I interpret
this as a good sign: most of the reads are seen a small number of times
(<10), and there are relatively few over-represented reads. Is there
anything in that plot that would worry more experienced heads?
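
To make clear what I think the plot shows, here is a minimal Python sketch
of the quantity as I understand it: for each occurrence count k, the
proportion of all reads whose sequence is seen k times or fewer. This is my
own reading of the report text, not ShortRead's code; the function name and
the toy reads are made up for illustration.

```python
from collections import Counter


def cumulative_read_proportion(reads):
    """Return [(k, proportion of reads whose sequence occurs <= k times)].

    Hypothetical helper for illustration; not part of ShortRead.
    """
    # sequence -> number of times that sequence was observed
    per_sequence = Counter(reads)
    # k -> number of distinct sequences observed exactly k times
    by_occurrence = Counter(per_sequence.values())

    total = len(reads)
    curve = []
    running = 0
    for k in sorted(by_occurrence):
        # k occurrences each for by_occurrence[k] distinct sequences
        running += k * by_occurrence[k]
        curve.append((k, running / total))
    return curve


# Toy data (invented): two singletons, one duplicate pair, and one
# sequence seen six times.
reads = ["ACGT", "TTGA"] + ["GGCC"] * 2 + ["AAAA"] * 6
curve = cumulative_read_proportion(reads)
for k, proportion in curve:
    print(k, proportion)
```

On this reading, a curve that is already high at small k means most reads
belong to sequences seen only a few times, which is how I am interpreting
my own plots.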

-- 
Alex Gutteridge