[BioC] EdgeR: replicated pools, yes or not?

Fri Apr 25 17:18:25 CEST 2014

Hi Ryan,

I would like to pop in just to emphasize something about the current 
economics of sequencing that clearly depends on the lab or sequencing 
facility you're using.

In our institute, and it sounds to me like Manuel is in a similar 
situation, the most expensive part of doing a proper RNA-seq experiment 
is the cost of each (barcoded) library. When you reply "if you have the 
capability to do b [8 pools], then you also probably have the capability 
to do 8 * n unpooled samples", you are clearly assuming that the "per 
lane" cost of sequencing will be the same, but you are overlooking the fact 
that many labs pay quite heavily for each library prep. For me, and 
surely for others, it is quite realistic to only have enough money for a 
limited number of library preps (say 8 or 12), even though we might have 
many more individuals (animals, plants, cell cultures, what-not) at 
almost no extra cost. In these cases, Manuel's question becomes quite 
relevant: should we pool many individuals into the fixed number of 
samples to be made into libraries, or should we try to make the 
libraries reflect as best as possible unique "individuals"? Of course 
when the individual provides too little RNA the question is moot, but 
what about cases like Manuel's where a single animal or tissue is enough 
for a library?
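
To make the trade-off concrete, here is a rough Python sketch of what 
pooling does when the number of libraries is fixed. It is only a toy 
model, not what edgeR does: I am assuming a single gene measured on a 
log-like scale, normally distributed biological and technical noise, 
and equal RNA contribution from every individual in a pool. The 
function names and the values of bio_sd and tech_sd are made up for 
illustration.

```python
import random
import statistics


def simulate_library_values(n_libraries, n_per_pool, bio_sd, tech_sd, rng):
    """Simulate one value per library for a single gene in one group.

    Each library is the average of n_per_pool individuals (biological
    noise) plus one draw of library-level technical noise.
    """
    values = []
    for _ in range(n_libraries):
        individuals = [rng.gauss(0.0, bio_sd) for _ in range(n_per_pool)]
        pool_mean = sum(individuals) / n_per_pool
        values.append(pool_mean + rng.gauss(0.0, tech_sd))
    return values


def between_library_variance(n_per_pool, bio_sd=0.4, tech_sd=0.1,
                             n_libraries=4, n_sims=20000, seed=1):
    """Average sample variance across libraries over many simulated runs."""
    rng = random.Random(seed)
    variances = [
        statistics.variance(
            simulate_library_values(n_libraries, n_per_pool,
                                    bio_sd, tech_sd, rng)
        )
        for _ in range(n_sims)
    ]
    return statistics.fmean(variances)


def expected_variance(n_per_pool, bio_sd=0.4, tech_sd=0.1):
    """Variance implied by the toy model: bio_sd^2 / n + tech_sd^2."""
    return bio_sd ** 2 / n_per_pool + tech_sd ** 2


if __name__ == "__main__":
    for n in (1, 5, 20):
        print(n, between_library_variance(n), expected_variance(n))
```

Under these assumptions the biological part of the between-library 
variance shrinks roughly as 1/n, while the technical part is 
untouched. The number of libraries, and hence the residual degrees of 
freedom, is the same for n = 1 and n = 5, which is exactly the 
situation I am describing.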

Best,

Cei

On 4/24/14 3:24 PM, Ryan Thompson wrote:
>> However, assuming that my budget allows me to sequence only a limited
>> number of samples at a decent coverage (for example, 8 samples at 10
>> million reads per sample), which of the following would be the preferred
>> solution?
>>
>> a) using 8 samples obtained from 8 different animals (4 KO and 4 WT);
>> b) using 8 samples (4 KO and 4 WT) obtained by pooling tissue from "n"
>> animals (with the same genotype, obviously).
>>
>
> The preferred solution would be to take your 8 * n animals and sequence
> them all individually using the same total amount of sequencing as you
> would have used for the 8 pools. Each individual sample will have n times
> less coverage, but that doesn't matter because you have still done the same
> total amount of sequencing per condition. I read a paper showing that
> increasing the number of biological replicates for an RNA-seq experiment
> while holding constant the total amount of sequencing (and therefore
> reducing the sequencing per replicate) continued to give gains in
> statistical power up to at least 192 biological replicates (which was the
> largest number they tested). This was in simulations, of course.
> Unfortunately, I can't find the citation in my ever-growing library of
> articles, but maybe someone else can supply it.
>
> So, I'm not sure whether option a or b is better, but if you have the
> capability to do b, then you also probably have the capability to do 8 * n
> unpooled samples, which is unquestionably better than either a or b.
>
>
>> I am pretty sure that if the unique difference between the two types of
>> animal (or condition) is a specific mutation, solution (a) would be THE
>> correct solution because it would imply using truly biological and
>> independent replicates. Solution (b) would be not just less correct, but
>> blatantly incorrect, because it would eliminate biological variation
>> between replicates (especially if "n" is high), and having an estimate of
>> that variation is necessary to establish the significance of the
>> differences observed between conditions.
>>
>
> This is not necessarily a problem, although it might be. With the pooled
> samples, your estimate of biological variability will be smaller, but you
> also have fewer degrees of freedom than you would if you did all the samples
> separately instead of pooling. I don't know which of these effects would
> dominate. So your significance estimates may not be any less accurate or
> unbiased, but they will probably be less precise since you are working with
> fewer observations.
>
>>
>> I acknowledge that I am answering myself, but I keep finding examples in
>> which pooling (in the sense that I am describing above) is not completely
>> discouraged. For example, Churchill (in "Fundamentals of experimental
>> design for cDNA microarrays", 2002, Nature Genetics 32) explains that "in a
>> two-sample comparison, we could consider making two large pools of all
>> available units and measuring each pool multiple times. This is a poor
>> design, as it does not allow estimation of the between-pool variance. By
>> pooling all the available samples together we have minimized the biological
>> variance, but we have also eliminated all independent replication. It is
>> better to use several pools and fewer technical replicates". Why does he
>> write that it is better to use several pools? Wouldn't it be better to use
>> no pools at all?
>>
>
> The considerations are different for microarrays. In sequencing, you can
> divide up your available sequencing space into as many individual
> replicates as you like. In microarrays, if you only have money to do 10
> arrays, then you can only do 10 samples, so are forced to choose between 10
> individuals or 10 pools.
>
>
>> Similarly, a discussion in which pooling is not completely discouraged can
>> be found in:
>>
>> http://seqanswers.com/forums/showthread.php?t=27905
>>
>
> The only place I see pooling not discouraged in that thread is the part
> talking about 5 pools of 10 individuals each for 3 conditions vs 5
> individuals each for 3 conditions. In that case Simon says that pooling is
> acceptable because the money or labor costs of individually prepping 150
> samples may be prohibitive. He still notes that prepping samples individually
> is the preferred solution if possible, and that there is a trade-off to be
> considered for the few samples vs few pools question. This echoes my answer
> above in this reply.
>
>> Finally, pooling samples is often justified because of limited availability
>> of RNA. In those cases pooling is mandatory, obviously. But if replicates
>> have been constructed by pooling RNA from many tiny individual samples,
>> shouldn't we have in mind that we have lost all information regarding
>> biological variance, and that we will not be able to assess the significance
>> of any differences observed between conditions?
>>
>
> You haven't lost *all* information about biological variance. There are
> still different individuals going into each pool. For a concrete example,
> when doing RNA-seq on C. elegans, a single worm doesn't provide sufficient
> RNA, so each "sample" is actually a whole tank of worms all receiving the
> same treatment, i.e. literally a pool of individuals. I have analyzed such
> an experiment, and the dispersions as estimated by edgeR were on par with
> the general guide values one would expect for genetically identical
> individuals. As I said above, there are the balancing factors of reducing
> variability and reducing degrees of freedom, and I'm not exactly sure how
> they balance out.
>
> -Ryan

-- 
Dr. Cei Abreu-Goodger
Profesor Investigador
Langebio CINVESTAV
Tel: (52) 462 166 3006
cei at langebio.cinvestav.mx
