(a bit late, but better late than never)

Dr. Coombes is right, of course, and I ought to have mentioned this
earlier.  It is a thorny issue, especially with epidemiological studies.
 Cancer, less so.

For differences that are real, usually they are significant on the beta
value scale *and* the logistic scale.  Changes that are only significant on
the logit scale are often artifacts.  The logistic scale is much more
sensitive to technical artifacts, even if you damp the extremes by using a
huge offset.  There are intermediate transformations (probit,
arcsin(sqrt(x)), etc.) which are intermediate between proportional and
logistic, but they haven't found much traction.

There are a couple of things that do help and can be used to filter an
unbiased fashion.  I'm hoping to get them submitted in the very near
future.  I do not know if they will be as useful for CHARM data as they
seem to be for Illumina arrays, eRRBS, and WGSBS sequence data, but I guess
we shall find out.

regarding TCGA data:

TCGA methylation data is background corrected and dye bias equalized (for
the 450k samples, at least, and as batches are updated, 27k as well) but no
batch correction is done for the level 3 data.  In the case of multi-batch
tumors it is a good idea to run ComBat or (if you must) SVA to compensate.
 It's run through normalizeMethyLumiSet(methylumi.bgcorr(the.data)) for
level 2 and 3 data, and the IDAT files are provided as level 1, so anyone
who wants to reproduce things from scratch with a different preprocessing
strategy is welcome to do so.  Methylumi and minfi happen to have the same
IDAT parsing code these days, and I would not be surprised if they
eventually merged.  I wouldn't bother using anything else, especially for
large volumes of samples.

switching from 0.1% methylated to 99.9% methylated is probably a real
effect.  Switching from 1% to 3% across the board is probably technical
artifacts.   You will see this all the time on 450k data that hasn't been
(ahem) properly background corrected, and to a significant degree in 27k
data as well.  It's not limited to TCGA; I've seen plenty of data from
other centers that benefited significantly from being re-processed
sensibly.  In any event, I pushed for, and got, changes to policy so that
Illumina methylation data for TCGA is provided as raw IDAT files, and all
of the code used for processing is available either from the Bioconductor
package repository or from GitHub (in the case of the packaging pipeline
itself).  It's not perfect but at least it is 100% transparent.

one last thing:

The Illumina annotations have been updated to a FeatureDb (one for hg19,
and another available soon for hg18) which I would like to propose as the
standard for these arrays, as they cover both 27k and 450k features (along
with some information that very few people seem to know about each).  I
think it's as fast as what minfi uses and as comprehensive as the .db0
packages, while less confusing than either.  So anyone who wants to, please
try them.


On Fri, Aug 17, 2012 at 1:05 PM, Kevin R. Coombes <kevin.r.coombes@gmail.com
> wrote:

>  I understand all the statistical reasons for converting from methylation
> "beta values" to something logistic, and am frequently tempted to do this
> myself.
>
> But I think in the context of methylation that this advice should come
> with a warning: changes in levels near 0 and 1 may have a lot of leverage
> on the final results.  For example, we have done analyses on some of the
> TCGA data where we find "statistically significant differences in
> methylation between normal and tumor" where the mean beta values are 0.03
> and 0.08.  I find it hard to believe that this level of change in
> methylation has any kind of biological meaning.  In fact, I'm not even
> convinced that we can accurately measure this amount of change using the
> technology that TCGA is using (although I might well believe that such a
> change could result from batch effects, whether in the assay or in the data
> processing).
>
> I don't have any magic solution to fix this issue; it is intrinsic in the
> shape of the logistic curve. One might want to explore shrinking the beta
> values toward 0.5 (i.e., away from 0 and 1), but I can't offer any concrete
> advice on how well this might work in practice.
>
> Best,
>     Kevin
>
>
> On 8/17/2012 12:36 PM, Tim Triche, Jr. wrote:
>
> The reason to switch from a proportion (%, beta-value, whichever; anything
> measuring M / (M+U) where M and U are surrogates for methylated and
> unmethylated cytosines) to a fold-change (logit(proportion.methylated) or
> log2(M/U)) is that the latter is far more amenable to linear models, and
> roughly parallels the expected behavior in terms of expression changes on a
> log2 or log-fold-change scale.
>
> Furthermore, the range for logit(M/U) is -Infinity to +Infinity, which is
> appropriate when you are modeling something as having Gaussian error.
>  Something with a range of 0 to 1 is neither homoskedastic (which is to
> say, such a 0-1 measurement will have a variance that depends on the mean)
> nor unbounded (this turns out to be an issue when computing maximum
> likelihood estimates, for example, as values close to the boundary will
> cause problems).
>
> In any event, logit(% methylation) is equivalent to log(M/U) which is where
> I veered off course this morning.  My brain seems to have been a bit slow.
>
>
> On Fri, Aug 17, 2012 at 9:26 AM, zeynep özkeserli <zeynep.ozkeserli@gmail.com> wrote:
>
>
>  Dear Tim,
>
> Thank you for your answer. But to my understanding, if I could get this
> answer by undoing the logit function (I tought you were doing this), we
> should use inverse logit function. Which is exp(x)/(1+exp(x))
>
> And in my case it gives:
>
>
>  exp(-0.30427)/(1+exp(-0.30427))
>
>  [1] 0.424514
>
> Ok, this seems reasonable. And it makes sense how you put this into words.
> But if we could use this one as a methylation measure, why would the
> creators make things more complicated and convert the value to a logit
> value? So, again, to my understanding, I shall learn how to interpret the
> diff thing.
>
> Thank you again,
>
> Best :)
>
> Zeynep
>
> On Fri, Aug 17, 2012 at 6:29 PM, Tim Triche, Jr. <tim.triche@gmail.com> <tim.triche@gmail.com>wrote:
>
>
>  Perhaps "on average this region has an
>
> R> 1 - exp(-0.347)
> [1] 0.2931947
>
> approximately 29.3% relative decrease in cytosine methylation after
> treatment?"
>
>
>
> On Fri, Aug 17, 2012 at 1:56 AM, zeynep özkeserli <zeynep.ozkeserli@gmail.com> wrote:
>
>
>  Dear All, Dear Dr. Aryee and Dr. Carvalho,
>
> I have a question on interpreting the results of dmrFinder function.
>
> We have performed a CHARM analysis on the data we got from NimbleGen
> Promoter Medip Arrays. The data is obtained from each patient before and
> after treatment. And after performing CHARM analysis, we got some
> differentially methylated regions (DMRs).
>
> As the samples are before and after treatment results of the same
> patient,
> the samples are treated as paired samples.
>
> My question is about interpretation of the results:
>
> After running this:
>
> dmr1_2 <- dmrFinder(rawData, p = p, groups = grp,compare = c("to", "ts"),
> cutoff=0.995,paired=TRUE,pairs=pairs)
>
> to: before treatment
> ts: after treatment
>
> - For example I have found a DMR like this (I summerized the result for
> my
> question):
>
> chr 8, diff= -0.30427 and maxdiff=0.47935
>
> As the diff value is calculated like this:   average l (logit(percentage)
> methylation if l=NULL) difference within the DMR if paired=TRUE
>
> Is it true to say that: "The region has 0.30427 times the risk of being
> methylated in samples of after treatment compared to samples of before
> treatment."
>
> I know that it does not look meaningful to use the word "risk" when
> talking
> about something like that but I can not find a better way to say it
> truely. Is it possible to express it like a "0.30427 fold difference in
> methylation"? And also am I interpreting the "-" sign truely?
>
> Thank you for your help in advance,
>
> Best Regards,
>
> Zeynep
>
>         [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing listBioconductor@r-project.orghttps://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>  --
> *A model is a lie that helps you see the truth.*
> *
> *
> Howard Skipper<http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf> <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
>
>
>
> _______________________________________________
> Bioconductor mailing listBioconductor@r-project.orghttps://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>


-- 
*A model is a lie that helps you see the truth.*
*
*
Howard Skipper<http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>

	[[alternative HTML version deleted]]