[BioC] A question regarding the mean of M-values.

Fri Apr 29 11:51:00 CEST 2005

Hi Johan,

maybe you are thinking of taking the arithmetic mean of the *channel 
signals* and not the *fold changes*?

Take red-green spotted arrays with duplicates as an example. You then 
observe duplicated channel pairs (R1,G1) and (R2,G2) for a given gene. 
Taking the log-ratios M1=log2(R1/G1) and M2=log2(R2/G2) and then the 
arithmetic mean of the log-ratios gives

  M' = (M1+M2)/2
     = log2[sqrt(R1/G1)] + log2[sqrt(R2/G2)]
     = log2[sqrt(R1/G1)*sqrt(R2/G2)]

but also

  M' = log2[sqrt(R1*R2)/sqrt(G1*G2)]

that is, the log-ratio of the geometric mean of the channel signals.
Now, your "wonder" makes sense if you consider the arithmetic mean of 
the channel signals;

  M'' = log2[((R1+R2)/2) / ((G1+G2)/2)]
      = log2[(R1+R2)/(G1+G2)]
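
To make the difference concrete, here is a small numeric sketch in plain 
Python (not limma code). The intensities are made-up values chosen only 
for illustration, and the function names are my own. It checks that the 
mean of the log-ratios equals the log-ratio of the geometric means, and 
shows that the log-ratio of the arithmetic means gives another value.

```python
import math

def m_prime(r1, g1, r2, g2):
    """Arithmetic mean of the two log-ratios (M')."""
    return (math.log2(r1 / g1) + math.log2(r2 / g2)) / 2

def m_prime_geometric(r1, g1, r2, g2):
    """Log-ratio of the geometric means of the channel signals."""
    return math.log2(math.sqrt(r1 * r2) / math.sqrt(g1 * g2))

def m_double_prime(r1, g1, r2, g2):
    """Log-ratio of the arithmetic means of the channel signals (M'')."""
    return math.log2((r1 + r2) / (g1 + g2))

# Hypothetical duplicate spots: a bright one and a dim one.
R1, G1 = 8000.0, 1000.0   # fold change 8
R2, G2 = 400.0, 200.0     # fold change 2

print("M' (mean of log-ratios):        ", m_prime(R1, G1, R2, G2))
print("M' (ratio of geometric means):  ", m_prime_geometric(R1, G1, R2, G2))
print("M'' (ratio of arithmetic means):", m_double_prime(R1, G1, R2, G2))
```

With these numbers the first two agree up to rounding, while M'' is 
pulled towards the fold change of the brighter spot.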

Maybe you were heading towards this? Compare this with your suggestion, 
which I think Gordon made clear is not a good idea;

  M''' = log2[((R1/G1)+(R2/G2))/2]
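
Gordon's example (quoted below) can be replayed numerically. This is 
only a sketch in plain Python with my own function names. It compares 
the mean of the log-ratios with the log of the average fold change (your 
suggestion), once for red over green and once with the channels swapped.

```python
import math

def mean_of_log_ratios(ratios):
    """Average on the log2 scale (what M' does)."""
    return sum(math.log2(x) for x in ratios) / len(ratios)

def log_of_mean_ratio(ratios):
    """Average the fold changes first, then take log2."""
    return math.log2(sum(ratios) / len(ratios))

# Gordon's example: one replicate 10-fold up, one 10-fold down.
red_over_green = [10.0, 0.1]
# The same data expressed the other way around.
green_over_red = [1.0 / x for x in red_over_green]

forward_mean_log = mean_of_log_ratios(red_over_green)
reverse_mean_log = mean_of_log_ratios(green_over_red)
forward_log_mean = log_of_mean_ratio(red_over_green)
reverse_log_mean = log_of_mean_ratio(green_over_red)

print("mean of log-ratios, R/G:", forward_mean_log)
print("mean of log-ratios, G/R:", reverse_mean_log)
print("log of mean ratio,  R/G:", forward_log_mean)
print("log of mean ratio,  G/R:", reverse_log_mean)
```

The mean of the log-ratios is zero (up to rounding) in both directions, 
whereas the log of the average fold change is clearly positive in both 
directions. Swapping the channels should flip the sign of any sensible 
summary, and here it does not.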

So, what about M' and M''? Common and recommended practice is no doubt 
to use M', but M'' has been suggested too. Indeed, in some sense M'' is 
used when you calculate the log-ratio for a single spot/probe; the pixel 
intensities are averaged first, i.e. (R1+R2+...+Rn)/n and 
(G1+G2+...+Gn)/n, where n=#pixels and R1,...,Rn are pixel intensities. 
Similarly, when calculating probe-set summaries in Affymetrix, the probe 
signals are averaged by the arithmetic mean (or a weighted version of 
it). Related to this, it has been suggested that one should take the 
*median of pixel-intensity ratios* instead of the *ratio of median pixel 
intensities*, cf. M' and M'' [I don't think this is a good idea because 
of image alignment problems].
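
As a rough illustration of the pixel-level choice, here is a sketch in 
plain Python. The pixel intensities are invented, and shifting the green 
pixels by one position is only a crude stand-in for misaligned images, 
not a model of any real scanner or image-analysis software.

```python
from statistics import mean, median

# Hypothetical pixel intensities for one spot, same pixel order in both channels.
red_pixels = [1200.0, 1500.0, 900.0, 4000.0, 1100.0]
green_pixels = [600.0, 700.0, 500.0, 800.0, 550.0]

def ratio_of_means(red, green):
    """Summarize each channel by its mean, then divide."""
    return mean(red) / mean(green)

def ratio_of_medians(red, green):
    """Summarize each channel by its median, then divide."""
    return median(red) / median(green)

def median_of_ratios(red, green):
    """Divide pixel by pixel, then take the median of the ratios.
    This relies on pixel i in red matching pixel i in green."""
    return median(r / g for r, g in zip(red, green))

# Crude misalignment: the green image is off by one pixel.
shifted_green = green_pixels[1:] + green_pixels[:1]

print("ratio of means:               ", ratio_of_means(red_pixels, green_pixels))
print("ratio of medians:             ", ratio_of_medians(red_pixels, green_pixels))
print("median of ratios:             ", median_of_ratios(red_pixels, green_pixels))
print("ratio of medians, misaligned: ", ratio_of_medians(red_pixels, shifted_green))
print("median of ratios, misaligned: ", median_of_ratios(red_pixels, shifted_green))
```

The channel-wise summaries do not care about the order of the pixels, so 
the shift leaves them unchanged, while the median of pixel-wise ratios 
moves.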

I do not know of a detailed study comparing M' and M'', and again, it 
depends on what level you are working on and what error models you 
believe in. Thinking of replicates on the spotted arrays, the probe 
concentrations may vary between spots, but not between channels. Say you 
have, by chance (imperfection), concentration w1 in [0,1] for duplicate 
one and w2 for duplicate two, then

  M' = (log2[(w1*R1)/(w1*G1)] + log2[(w2*R2)/(w2*G2)])/2
     = (log2[R1/G1] + log2[R2/G2])/2

which illustrates one (the most common?) argument for using M', whereas

  M'' = log2[(w1*R1+w2*R2)/(w1*G1+w2*G2)]

gives different weights to different spots. On the other hand, one may 
argue that if w1 < w2, then the signal-to-noise ratio for duplicate 1 is 
lower than for duplicate 2, and therefore we should indeed down-weight 
that spot when taking the average; this is done in M'', but not in M'. 
For true dilution series, {wi;i=1,...,I} are not random errors, but 
spread out on purpose [which I strongly recommend]. Then, would you 
argue for M' or M''?
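
The effect of the spot concentrations can be sketched numerically. 
Again this is plain Python with invented signals and my own helper 
names; the weights 0.25 and 1.0 are arbitrary choices for w1 and w2.

```python
import math

def m_prime(spots):
    """Mean of per-spot log-ratios; spots is a list of (R, G) pairs."""
    return sum(math.log2(r / g) for r, g in spots) / len(spots)

def m_double_prime(spots):
    """Log-ratio of the summed channel signals."""
    total_r = sum(r for r, _ in spots)
    total_g = sum(g for _, g in spots)
    return math.log2(total_r / total_g)

def scale(spots, weights):
    """Apply a per-spot concentration factor to both channels of each spot."""
    return [(w * r, w * g) for (r, g), w in zip(spots, weights)]

# Hypothetical signals at full concentration: fold changes 4 and 2.
spots = [(4000.0, 1000.0), (2000.0, 1000.0)]

equal = scale(spots, [1.0, 1.0])
uneven = scale(spots, [0.25, 1.0])  # duplicate 1 at a quarter of the concentration

print("M'  equal / uneven:", m_prime(equal), m_prime(uneven))
print("M'' equal / uneven:", m_double_prime(equal), m_double_prime(uneven))
```

M' is the same in both cases because each w cancels within its own spot. 
M'' moves towards the log-ratio of duplicate 2, which now carries most 
of the signal.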

So, under certain circumstances, I do not think it is clear whether M' 
or M'' should be used, but use M' if you do not know.

Best wishes

Henrik Bengtsson

Gordon Smyth wrote:
> I know that "fold change" is an intuitive measure which non-mathematical 
> users like to relate things back to. Unfortunately, taking arithmetic 
> means of "fold changes" does not give sensible results. Here is a simple 
> example to show why:
> 
> Suppose you are comparing a cell line in stimulated and unstimulated 
> conditions, and you have two biological replicates. Suppose the first 
> replicate gives you 10-fold up regulation in the stimulated condition, 
> and the second replicate is 10-fold down regulated. The only sensible 
> conclusion here is that there is no systematic difference between the 
> stimulated and unstimulated conditions, but that there is a lot of 
> variability between the replicates. This is exactly what the log-ratio 
> analysis would tell you.
> 
> On the other hand, if you average the fold changes, you get nonsense 
> results. The two fold changes are:
> 
> 10 and 1/10
> 
> so the "average" fold change is a bit over 5. So you conclude that "on 
> average" the stimulation produces 5-fold up regulation. This is nonsense.
> 
> Worse still, if you compute the fold changes the other way around, you 
> make the opposite conclusion. A perfectly equivalent way to state the 
> results would be to say that the first replicate is 10 fold down in the 
> unstimulated condition and the second is 10 fold up. So the two fold 
> changes are:
> 
> 1/10 and 10
> 
> so the "average" fold change is again a bit over 5. But now you conclude 
> that the *unstimulated condition* gives a 5-fold change over the 
> stimulated condition. This is the opposite of what you concluded when 
> you expressed the fold changes the other way around.
> 
> It is necessary to express the fold changes on a log-ratio scale, so 
> that multiplicative changes become additive, before it makes any sense 
> to take arithmetic averages. There are a lot of good statisticians in 
> Stockholm -- why not talk to one of them about this?
> 
> Gordon
> 
>> Date: Thu, 28 Apr 2005 09:13:16 +0200
>> From: "Johan Lindberg" <johanl at biotech.kth.se>
>> Subject: RE: [BioC] A question regarding the mean of M-values.
>> To: <bioconductor at stat.math.ethz.ch>
>>
>>
>> Hi all.
>> I have encountered the same problem. In LIMMA it is possible to handle
>> two levels of replicates. You can use duplicateCorrelation for one level
>> (technical replicates or duplicate spots) and use the rest as biological
>> replicates to fit your model. But say that I have another level of
>> replicates. I have replicate spots, technical replicates and biological
>> replicates. I guess the right thing to do is to average over the
>> replicate spots and use duplicate correlation for the technical
>> replicates.
>> Here I started wondering since limma, when calculating a contrast
>> between two samples uses the arithmetic mean on the M-values which is
>> the same as taking the geometric mean on the fold-changes and then
>> taking the logarithm of that value, or ?!?
>>
>> Recall laws of logarithms:
>> log(xy) = log(x) + log(y)
>> log(x^n) = n*log(x)
>>
>> This means that if I take
>>
>> (log(M1)+log(M2)+log(M3))/3 this is the same as taking
>> log((M1*M2*M3)^(1/3)) which is the same as taking the geometric mean on
>> the fold changes and then taking the logarithm of that value.
>>
>> I wonder, can one motivate using geometric mean on expression data
>> instead of arithmetic? See
>> http://www.math.toronto.edu/mathnet/questionCorner/geomean.html
>> for a nice tip of when to use what mean...
>>
>> To me it seems like one should, if you want to take a mean of M-values
>> in an expression experiment, remove the logarithm, calculate the average
>> fold change and then use the logarithm of desire on that value.
>>
>> Comments appreciated to a guy with limited math-skills being out on deep
>> water....
>>
>> // Johan L
>>
>>
>>
>> -----Original Message-----
>> From: bioconductor-bounces at stat.math.ethz.ch
>> [mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of marcus
>> Sent: Wednesday, April 27, 2005 5:02 PM
>> To: bioconductor at stat.math.ethz.ch
>> Subject: [BioC] A question regarding the mean of M-values.
>>
>>
>> Hello all users.
>>
>> I have a question regarding the mean calculations of the M-values in
>> LIMMA.
>>
>> I guess that the fit$coeff is the mean of the M-values used for the
>> linear model. The fit$coeff has the mean value of the data derived from
>> a specific RNA source (as defined in the design matrix), and the value
>> in fit$coeff[1] is the same as mean(MS[1,1:2]) (if I for example had
>> Sample 1 on 2 arrays in my matrix containing the data).
>>
>> So...if you take the mean of two values (in the log 2 scale), for
>> example M = 8 and M = 1, the mean (and hence the fit$coef ?) will be 4.5.
>>
>> If you want to look at the foldchange I guess that 2^fit$coeff is
>> correctly calculated, so for the example it will be 2^4.5 = 22.6 times
>> upregulated.
>>
>> But if you look at the values independently, M=8 will give 2^8 = 256
>> times, and M=1 will give 2^1 = 2 times upregulation. The mean of these
>> values is (256 + 2) / 2 = 129 times.
>>
>> I know that the question is a bit naive, but how should one proceed when
>> taking the mean of logarithms, since the numbers are not related to each
>> other as normal numbers are? E.g. the number 8 is not twice the size of
>> 4 on a logarithmic scale, it is 10000 times more (on a log10 scale).
>>
>> So... how should I proceed when I want to take the average of log
>> values? Shouldn't I convert back to ratios (not on the log2 scale),
>> calculate the mean, and transform the result back, if I would like to
>> have an average M value?
>>
>> Regards
>>
>> Marcus