[BioC] ShortRead QA

Fri Jul 23 12:23:28 CEST 2010

Alex Gutteridge <alexg at ruggedtextile.com> writes:

> I'm dealing with some Solexa/Illumina data with ShortRead for the first
> time and had a couple of questions relating to QA:
>
> 1. Memory requirements: My data comprises 7 s_N_export.txt files. Each one
> comprises 10-20 million aligned reads. If I try to run qa() over the whole
> directory my machine rapidly grinds to a halt. Tackling each file
> individually keeps my machine running, but takes >1 hour for each one. The
> ShortRead vignette says evaluating a single lane can take 'several
> minutes', so I'm wondering if anyone can offer any clues as to why I'm
> struggling so much? The machine in question has 6GB of RAM - do I just need
> more?

It's the total number of bases that matters, rather than the number of
reads, but if these are 'long' reads then yes, memory is likely the
limiting factor. (We're hoping to take a better approach to qa() and
other input functions over the next release, though that doesn't help
you at the moment.)
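
To put rough numbers on 'total number of bases', here is a
back-of-the-envelope sketch in Python. It is not ShortRead code and
not how qa() works. It assumes one byte per base and one byte per
quality score, and the read length of 36 is a hypothetical value
because the original post doesn't give one. Actual usage in R will be
higher because of per-object overhead and intermediate copies.

```python
def min_bytes(n_reads, read_length, bytes_per_base=1, bytes_per_quality=1):
    """Lower bound on bytes needed to hold all bases and quality scores."""
    return n_reads * read_length * (bytes_per_base + bytes_per_quality)


def to_gib(n_bytes):
    """Convert a byte count to GiB."""
    return n_bytes / 2**30


READ_LENGTH = 36  # hypothetical; substitute the real read length

# The post describes 7 lanes of 10-20 million reads each.
for reads_per_lane in (10_000_000, 20_000_000):
    one_lane = min_bytes(reads_per_lane, READ_LENGTH)
    all_lanes = 7 * one_lane
    print(f"{reads_per_lane:>10} reads/lane: "
          f"{to_gib(one_lane):.2f} GiB per lane, "
          f"{to_gib(all_lanes):.2f} GiB for 7 lanes")
```

The point is only that the estimate scales with reads times read
length, so doubling the read length costs as much as doubling the
number of reads.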

> 2. Read distribution: The QA results I'm getting for the 'read
> distribution' section don't quite look like those presented in the example
> ShortRead Solexa QA report. My interpretation is that this is because my
> data is actually rather high quality, but I'd appreciate a second opinion. 
>
> To quote from the ShortRead QA report: 
>
> 'Ideally, the cumulative proportion of reads will transition sharply from
> low to high. Portions to the left of the transition might correspond
> roughly to sequencing or sample processing errors, and correspond to reads
> that are represented relatively infrequently [...]. Portions to the right
> of the transition represent reads that are over-represented compared to
> expectation.'
>
> Typically the read distribution plots I'm seeing look like this:
> http://dl.dropbox.com/u/419878/readOccurences.jpg
>
> There is a sharp transition, but no portion to the left. I interpret this
> as a good sign: most of the reads are seen a small number of times (<10),
> and there are relatively few over-represented reads. Is there anything
> there that would worry more experienced heads?

It depends a bit on what the data are for, but your interpretation
above is accurate, so if it's consistent with your expectations then
that's good.
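
For anyone wanting to see what kind of curve is being read here, a
minimal sketch in Python of a cumulative read-occurrence distribution.
This is an illustration under my own assumptions, not the code behind
the ShortRead report, and the toy reads are made up. For each
occurrence count k it gives the proportion of all reads whose sequence
was seen k times or fewer.

```python
from collections import Counter


def read_occurrence_curve(reads):
    """Return [(k, proportion of reads whose sequence occurs <= k times)]."""
    per_sequence = Counter(reads)              # sequence -> times seen
    by_count = Counter(per_sequence.values())  # k -> distinct sequences seen k times
    total = len(reads)
    curve, running = [], 0
    for k in sorted(by_count):
        # Each sequence seen k times contributes k reads to the total.
        running += k * by_count[k]
        curve.append((k, running / total))
    return curve


# Toy input, invented for illustration only.
reads = ["ACGT", "ACGT", "TTGA", "GGCC", "GGCC", "GGCC", "CATG"]
for k, proportion in read_occurrence_curve(reads):
    print(k, round(proportion, 3))
```

A curve that reaches a high proportion at small k, as in the plot
linked above, means most reads come from sequences seen only a few
times.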

Martin
-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793