[BioC] questions on edgeR package

Thu Jan 30 14:57:06 CET 2014

Hi,

On Thu, Jan 30, 2014 at 2:07 AM, Gordon K Smyth <smyth at wehi.edu.au> wrote:
[snip]
> I think the point is that Ming has already downloaded the v2 data, and the
> so-called "raw counts" turned out not to be counts.

I see ... I wasn't paying very close attention to this thread until my
last reply and hadn't realized that the OP has pretty much hashed
these things out.

> If you want to dig to find out what the "raw counts" are exactly, that would
> be a great service, because I am just guessing.  The TCGA documentation just
> says they are from RSEM.

After some digging: there are several pages that show the output of
the rna-seq pipeline, such as this one:

https://wiki.nci.nih.gov/display/TCGA/RNASeq+Version+2

These suggest (under "File field descriptions") that there is a column
in the gene-level summaries file called "raw_counts", described as "The
number of reads mapping to this gene". Digging around a bit more, you
find that this is really RSEM expected counts. This page was helpful:

https://confluence.broadinstitute.org/download/attachments/29790363/DESCRIPTION.txt?version=1&modificationDate=1363806109000

At the bottom of the page under the "Column Headers" section is where
you get the required detail. This thread on RSEM-users mailing list
also confirms:

https://groups.google.com/forum/#!topic/rsem-users/H1cswrvvmPs

The raw_count columns are really the expected counts from RSEM (the
authors of RSEM suggest that these rounded numbers are suitable
for edgeR / DESeq ;-)

Perhaps the OP would best use EBSeq for the TCGA "raw_counts" data,
or, as you suggest, limma::voom since these are actually the expected
counts.
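
For what it's worth, here is a minimal Python sketch of the check in
question: whether a column of values is integer-valued, and what
rounding the expected counts would look like. The function names and
the example values are made up for illustration; nothing here reads
the actual TCGA files.

```python
import math


def is_integer_valued(values, tol=1e-9):
    """Return True if every value is within tol of a whole number."""
    return all(abs(v - round(v)) <= tol for v in values)


def round_expected_counts(values):
    """Round non-negative expected counts to the nearest integer.

    Uses floor(v + 0.5) so that halves always round up, rather than
    Python's built-in round(), which rounds halves to the nearest even
    number.
    """
    return [int(math.floor(v + 0.5)) for v in values]


# Hypothetical values standing in for a "raw_count" column.
expected = [10.37, 3.62, 0.0, 125.0]
print(is_integer_valued(expected))      # fractional values present
print(round_expected_counts(expected))  # rounded for count-based tools
```

If the first call reports fractional values, the column holds expected
counts rather than true read counts.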

If you *really* want to use edgeR, the first link I pointed to
suggests that the "raw_count" column in "exon_quantification.txt"
holds the actual raw counts per exon (can the OP confirm that they
are actually integers?). If so, perhaps you could sum these up per gene
and then continue. I suspect you'd then be double counting
exon-spanning reads, which might be problematic. In principle, you
could then subtract the tallies in junction_quantification.txt to
account for that -- which all seems like a lot of work if
limma::voom and EBSeq will work just as well on the RSEM expected
counts.
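
To make the exon-summing idea concrete, here is a rough Python sketch.
Everything in it is an assumption for illustration: the exon and gene
identifiers, the exon-to-gene mapping, and the per-gene junction
tallies are hypothetical, and the correction assumes each
junction-spanning read was counted in exactly two exons, which may not
hold for the real files.

```python
from collections import defaultdict


def sum_exon_counts_per_gene(exon_counts, exon_to_gene, junction_counts=None):
    """Sum per-exon read counts into per-gene totals.

    exon_counts:     dict of exon id -> integer read count
    exon_to_gene:    dict of exon id -> gene id (hypothetical mapping)
    junction_counts: optional dict of gene id -> number of
                     junction-spanning reads, subtracted once per read
                     as a crude correction for double counting
    """
    totals = defaultdict(int)
    for exon_id, count in exon_counts.items():
        gene_id = exon_to_gene.get(exon_id)
        if gene_id is None:
            continue  # skip exons with no gene annotation
        totals[gene_id] += count

    if junction_counts:
        for gene_id, n_junction_reads in junction_counts.items():
            if gene_id in totals:
                # never let the correction push a total below zero
                totals[gene_id] = max(0, totals[gene_id] - n_junction_reads)

    return dict(totals)


# Made-up example: two exons of geneA share 5 junction-spanning reads.
exon_counts = {"exon1": 20, "exon2": 15, "exon3": 7}
exon_to_gene = {"exon1": "geneA", "exon2": "geneA", "exon3": "geneB"}
print(sum_exon_counts_per_gene(exon_counts, exon_to_gene))
print(sum_exon_counts_per_gene(exon_counts, exon_to_gene, {"geneA": 5}))
```

Reads that span more than two exons would still be over-counted under
this scheme, which is part of why it seems like a lot of work.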

Hope that helps,

-steve

-- 
Steve Lianoglou
Computational Biologist
Genentech