[BioC] EdgeR: replicated pools, yes or not?

Thu Apr 24 16:16:34 CEST 2014

I'm glad you are aware of the Kendziorski et al. paper, because it is the most applicable to the concept of biological vs. mathematical averaging. Also, I agree with Ryan. Several years ago I did exactly what he describes, comparing in silico pooling with actual pooling, along with extensive simulations. With RNA-Seq, the results agreed with Kendziorski's, with some slight differences due to the dynamic range of RNA-Seq vs. microarrays.

An additional benefit of in silico pooling / repeated technical measurements is that the design is more robust to technical problems (e.g., with one bad library prep, all you've lost is a fraction of your reads for that animal rather than all the reads from that animal).

Also, unless one does a carefully designed (and complex) experiment like Kendziorski's, the apparent gain in power from biological pooling is a complete mirage, because the within-group variance being measured is technical, not biological. Significance tests from such experiments therefore do not reflect what one is really after.
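To make the variance argument concrete, here is a minimal sketch under a simple additive model that is my own assumption and not taken from this thread: each animal's expression is a draw with biological standard deviation sigma_bio, a pool is the average of pool_size animals, and each measured sample adds independent technical noise with standard deviation sigma_tech. Under that model the variance between pooled samples is sigma_bio^2 / pool_size + sigma_tech^2, so the biological component is shrunk by the pool size while the technical component is untouched. All names and parameter values below are hypothetical.

```python
import random
import statistics


def between_sample_variance(pool_size, sigma_bio=1.0, sigma_tech=0.3,
                            n_samples=20000, seed=0):
    """Empirical variance between measured samples when each sample is a
    pool (average) of `pool_size` animals plus independent technical noise.
    Assumed additive model, for illustration only."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_samples):
        # Biological averaging: the pool reflects the mean of its animals.
        animals = [rng.gauss(0.0, sigma_bio) for _ in range(pool_size)]
        pool_mean = sum(animals) / pool_size
        # Technical noise is added once per measured sample.
        values.append(pool_mean + rng.gauss(0.0, sigma_tech))
    return statistics.variance(values)


def model_variance(pool_size, sigma_bio=1.0, sigma_tech=0.3):
    """Variance predicted by the assumed additive model."""
    return sigma_bio ** 2 / pool_size + sigma_tech ** 2


for k in (1, 3):
    print(f"pool size {k}: simulated {between_sample_variance(k):.3f}, "
          f"model {model_variance(k):.3f}")
```

With three animals per pool, the variance a test sees between samples is mostly no longer the animal-to-animal variance one would need for inference about individuals.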

Philosophically, one has to ask how meaningful such experiments are, especially if the ultimate goal is prediction at the individual level. 

Wade

-----Original Message-----
From: Ryan [mailto:rct at thompsonclan.org] 
Sent: Wednesday, April 23, 2014 12:06 PM
To: "Manuel J Gómez [guest]"
Cc: mjgomezr at cnic.es; bioconductor at r-project.org
Subject: Re: [BioC] EdgeR: replicated pools, yes or not?

Don't pool. You are throwing away information. If you're going to do 24 animals, you may as well use 24 barcodes. To see that a separate barcode for each animal provides strictly more information than pooling, note that once you have used separate barcodes, you could add the counts together to do in silico pooling and get the same result as if you had done pooling in vitro. In other words, you can get from separate barcodes to pooling by throwing away information.
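The "separate barcodes to pooling" direction can be sketched in a few lines. This is only an illustration with made-up counts; pool_in_silico is a hypothetical helper, and it assumes that each animal would have contributed comparably to a physical pool.

```python
def pool_in_silico(counts, pool_size):
    """Sum each gene's counts over consecutive groups of `pool_size` libraries.

    `counts` is a list of rows (one per gene), each a list of per-library
    read counts. Libraries are assumed to be ordered so that consecutive
    libraries belong to the same condition.
    """
    n_libs = len(counts[0])
    if n_libs % pool_size != 0:
        raise ValueError("number of libraries must be a multiple of pool_size")
    return [
        [sum(row[i:i + pool_size]) for i in range(0, n_libs, pool_size)]
        for row in counts
    ]


# Hypothetical counts: 2 genes x 6 separately barcoded animals.
separate = [
    [10, 12, 9, 30, 28, 35],
    [100, 95, 110, 98, 105, 102],
]
pooled = pool_in_silico(separate, 3)
print(pooled)  # [[31, 93], [305, 305]]
```

Going the other way, from the pooled table back to the per-animal counts, is not possible, which is the sense in which pooling throws information away.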

For a literature reference, try "Efficient experimental design and analysis strategies for the detection of differential expression using RNA-Sequencing." http://www.ncbi.nlm.nih.gov/pubmed/22985019

That publication doesn't directly address the issue of pooling multiple biological samples in the same barcode, but it does make clear that more biological replication results in a drastic improvement in results. You could simulate your described pooling scheme yourself: simply simulate 24 libraries in 2 groups with some number of true differentially expressed genes between them. Then pool them 3 at a time (by adding their counts together) to get the pooled dataset of 8 pooled libraries in 2 groups. Then perform the analysis on both datasets using your preferred tool and compute the ROC curve. I think you will find that pooling significantly diminishes your power to detect differential expression.
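A rough sketch of that simulation, using only the Python standard library, might look like the following. Everything in it is an assumption chosen for illustration: counts are drawn as a gamma-Poisson (negative binomial) mixture with a single dispersion, all libraries have the same expected depth, and a Welch t-statistic on log counts stands in for "your preferred tool". It is not edgeR, so swap in a real differential expression analysis before drawing conclusions.

```python
import math
import random
import statistics


def rpois(rng, lam):
    """Poisson draw by Knuth's method, in chunks to avoid exp() underflow."""
    total = 0
    while lam > 0:
        step = min(lam, 500.0)
        limit = math.exp(-step)
        k, p = 0, rng.random()
        while p > limit:
            k += 1
            p *= rng.random()
        total += k
        lam -= step
    return total


def simulate(n_genes=200, n_de=40, n_per_group=12, fold=1.5, disp=0.1, seed=1):
    """Genes x libraries count table; the first n_de genes are truly DE."""
    rng = random.Random(seed)
    counts, is_de = [], []
    for g in range(n_genes):
        base = rng.uniform(50, 200)
        de = g < n_de
        row = []
        for lib in range(2 * n_per_group):
            mu = base * (fold if de and lib >= n_per_group else 1.0)
            # Gamma-distributed rate gives negative binomial counts.
            row.append(rpois(rng, rng.gammavariate(1.0 / disp, mu * disp)))
        counts.append(row)
        is_de.append(de)
    return counts, is_de


def pool(counts, k):
    """In silico pooling: add counts of k consecutive libraries."""
    return [[sum(r[i:i + k]) for i in range(0, len(r), k)] for r in counts]


def welch_abs_t(row):
    """Stand-in test statistic: |Welch t| on log counts, groups = two halves."""
    half = len(row) // 2
    a = [math.log(x + 0.5) for x in row[:half]]
    b = [math.log(x + 0.5) for x in row[half:]]
    diff = abs(statistics.mean(a) - statistics.mean(b))
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    if se == 0:
        return math.inf if diff > 0 else 0.0
    return diff / se


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparisons (ties count half)."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


counts, is_de = simulate()            # 24 libraries, 12 per group
pooled = pool(counts, 3)              # 8 pooled libraries, 4 per group
auc_separate = auc([welch_abs_t(r) for r in counts], is_de)
auc_pooled = auc([welch_abs_t(r) for r in pooled], is_de)
print(f"AUC, 24 separate libraries: {auc_separate:.3f}")
print(f"AUC, 8 pools of 3:          {auc_pooled:.3f}")
```

The size of the gap between the two AUC values will depend on the dispersion, fold change, and number of genes chosen, so it is worth varying those (and the seed) rather than reading much into a single run.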

-Ryan Thompson

On Wed Apr 23 09:42:15 2014, "Manuel J Gómez [guest]" wrote:
>
> Hello,
>
> I would like to ask for your opinion on whether using replicated pools in the context of RNASeq experiments makes sense, or not.
>
> Let's say that we are interested in detecting genes that are differentially expressed between two genetic backgrounds (a certain KO mutant strain and the corresponding WT), in mouse liver.
>
> We could perform an RNASeq experiment using liver tissue from four KO and four WT with the same sex, age, and diet.
>
> We would have eight samples: four biological replicates for each of the two conditions to be compared.
>
> However, we decide to pool liver tissue from three animals to prepare each of the eight samples (we would therefore use 24 animals: 12 KO animals pooled to produce four KO samples, and 12 WT animals pooled to produce four WT samples).
>
> We would do it following the argument that pooling samples to build biological replicates reduces variation between replicates and increases the statistical power of the analysis, resulting in a more sensitive detection of genes that are differentially expressed between conditions.
>
> However, edgeR relies, precisely, on measuring biological variability to establish the statistical significance of differences in gene expression across conditions. Therefore, pooling samples to build biological replicates is not correct and we are, in fact, losing statistical power. We are unable to determine whether the observed differences in gene expression are significant or not.
>
> There are some publications dealing with this issue in the context of microarrays (for example, Kendziorski et al, 2005, "On the utility of pooling biological samples in microarray experiments", PNAS, 102:4252) but I have not found anything similar in the context of RNASeq and, more specifically, of the analysis of RNASeq data with edgeR.
>
> Any comment will be more than welcome, as well as any relevant references.
>
> Thanks a lot in advance.
>
>   -- output of sessionInfo():
>
> NA
>