> However, assuming that my budget allows me to sequence only a limited
> number of samples at a decent coverage (for example, 8 samples at 10
> million reads per sample), which of the following would be the preferred
> solution?
>
> a) using 8 samples obtained from 8 different animals (4 KO and 4 WT);
> b) using 8 samples (4 KO and 4 WT) obtained by pooling tissue from "n"
> animals (with the same genotype, obviously).
>

The preferred solution would be to take your 8 * n animals and sequence
them all individually using the same total amount of sequencing as you
would have used for the 8 pools. Each individual sample will have only 1/n
of the coverage, but that doesn't matter because you have still done the same
total amount of sequencing per condition. I read a paper showing that
increasing the number of biological replicates for an RNA-seq experiment
while holding constant the total amount of sequencing (and therefore
reducing the sequencing per replicate) continued to give gains in
statistical power up to at least 192 biological replicates (which was the
largest number they tested). This was in simulations, of course.
Unfortunately, I can't find the citation in my ever-growing library of
articles, but maybe someone else can supply it.

So, I'm not sure whether option a or b is better, but if you have the
capability to do b, then you probably also have the capability to do 8 * n
unpooled samples, which is unquestionably better than either a or b.
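
To put numbers on the budget argument, here is a minimal sketch of the
arithmetic. The function name and the choice of n = 5 are mine, purely for
illustration; the 4 samples per condition and 10 million reads per sample
come from your example.

```python
def split_budget(samples_per_condition, reads_per_sample, animals_per_pool):
    """Spread the per-condition read budget over individually sequenced animals.

    Hypothetical helper: the total reads per condition stay fixed, and every
    animal that would have gone into a pool gets its own library instead.
    """
    total_reads = samples_per_condition * reads_per_sample
    n_libraries = samples_per_condition * animals_per_pool
    return {
        "total_reads_per_condition": total_reads,
        "libraries_per_condition": n_libraries,
        "reads_per_library": total_reads // n_libraries,
    }


# 4 pools per condition at 10 million reads each, n = 5 animals per pool
design = split_budget(samples_per_condition=4,
                      reads_per_sample=10_000_000,
                      animals_per_pool=5)
print(design)
```

With n = 5 this gives 20 libraries per condition at 2 million reads each,
for the same 40 million reads per condition as the pooled design. The
library prep cost is of course not the same, which this sketch ignores.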


> I am pretty sure that if the unique difference between the two types of
> animal (or condition) is a specific mutation, solution (a) would be THE
> correct solution because it would imply using truly biological and
> independent replicates. Solution (b) would be not just less correct, but
> blatantly incorrect, because it would eliminate biological variation
> between replicates (specially if "n" is high), and having an estimation of
> that variation is necessary to establish the significance of the
> differences observed between conditions.
>

This is not necessarily a problem, although it might be. With the pooled
samples, your estimate of biological variability will be smaller, but you
will also have fewer degrees of freedom than you would if you did all the
samples separately instead of pooling. I don't know which of these effects
would dominate. So your significance estimates may not be any less accurate
or unbiased, but they will probably be less precise since you are working
with fewer observations.
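
The two effects can be illustrated with a deliberately simplified sketch.
Everything in it is an assumption on my part, not how edgeR or any real
RNA-seq model works: a measurement is treated as biological value plus
technical noise on an additive scale, every animal contributes equally to
its pool (so pooling n animals divides the biological variance by n), and
technical variance is inversely proportional to sequencing depth. The
variance values are made up.

```python
def pooled_design(pools_per_condition, animals_per_pool, bio_var, tech_var):
    """Toy variance and degrees of freedom when each library is a pool.

    Assumes equal contribution per animal, so the biological variance of a
    pool is bio_var / animals_per_pool. tech_var is the technical variance
    of one library at full depth.
    """
    per_library_var = bio_var / animals_per_pool + tech_var
    return {
        "per_library_var": per_library_var,
        "group_mean_var": per_library_var / pools_per_condition,
        # two-group comparison: (k - 1) residual df per condition
        "residual_df": 2 * (pools_per_condition - 1),
    }


def unpooled_design(pools_per_condition, animals_per_pool, bio_var, tech_var):
    """Toy variance and degrees of freedom when every animal is sequenced.

    Same animals and same total reads as the pooled design, so each library
    gets 1/animals_per_pool of the depth; technical variance is assumed to
    grow by that same factor.
    """
    n_libraries = pools_per_condition * animals_per_pool
    per_library_var = bio_var + tech_var * animals_per_pool
    return {
        "per_library_var": per_library_var,
        "group_mean_var": per_library_var / n_libraries,
        "residual_df": 2 * (n_libraries - 1),
    }


# Made-up variances, 4 pools per condition, 5 animals per pool
pooled = pooled_design(4, 5, bio_var=0.04, tech_var=0.01)
unpooled = unpooled_design(4, 5, bio_var=0.04, tech_var=0.01)
print(pooled)
print(unpooled)
```

Under these assumptions the variance of each condition mean comes out the
same for both designs, while the residual degrees of freedom go from 6
(pooled) to 38 (unpooled). Calling pooled_design with animals_per_pool=1
gives the 4 individual animals per condition of option (a). Real data will
not follow this toy model exactly, so treat it as intuition only.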

>
> I acknowledge that I am answering myself, but I keep finding examples in
> which pooling (in the sense that I am describing above) is not completely
> discouraged. For example, Churchill (in "Fundamentals of experimental
> design for cDNA microarrays", 2002, Nature Genetics 32) explains that "in a
> two-sample comparison, we could consider making two large pools of all
> available units and measuring each pool multiple times. This is a poor
> design, as it does not allow estimation of the between-pool variance. By
> pooling all the available samples together we have minimized the biological
> variance, but we have also eliminated all independent replication. It is
> better to use several pools and fewer technical replicates". Why does he
> write that it is better to use several pools? Wouldn't it be better to use
> no pools at all?
>

The considerations are different for microarrays. In sequencing, you can
divide up your available sequencing space into as many individual
replicates as you like. In microarrays, if you only have money to do 10
arrays, then you can only do 10 samples, so are forced to choose between 10
individuals or 10 pools.


> Similarly, a discussion in which pooling is not completely discouraged can
> be found in:
>
> http://seqanswers.com/forums/showthread.php?t=27905
>

The only place I see pooling not discouraged in that thread is the part
talking about 5 pools of 10 individuals each for 3 conditions vs 5
individuals each for 3 conditions. In that case Simon says that pooling is
acceptable because the money or labor costs of individually prepping 150
samples may be prohibitive. He still notes that prepping the samples
individually is the preferred solution if possible, and he notes that there
is a trade-off that must be considered for the few samples vs few pools
question. This echoes my answer above in this reply.

> Finally, pooling samples is often justified because of limited availability
> of RNA. In those cases pooling is mandatory, obviously. But if replicates
> have been constructed by pooling RNA from many tiny individual samples,
> shouldn't we have in mind that we have lost all information regarding
> biological variance, and that we will not be able to assess the significance
> of any differences observed between conditions?
>

You haven't lost *all* information about biological variance. There are
still different individuals going into each pool. For a concrete example,
when doing RNA-seq on C. elegans, a single worm doesn't provide sufficient
RNA, so each "sample" is actually a whole tank of worms all receiving the
same treatment, i.e. literally a pool of individuals. I have analyzed such
an experiment, and the dispersions as estimated by edgeR were on par with
the general guide values one would expect for genetically identical
individuals. As I said above, there are the balancing factors of reducing
variability and reducing degrees of freedom, and I'm not exactly sure how
they balance out.

-Ryan


