[BioC] How to decide which distance metric to use for microarray data clustering?

Wed Oct 7 18:06:16 CEST 2009

On Wed, Oct 7, 2009 at 11:53 AM, Peng Yu <pengyu.ut at gmail.com> wrote:
> On Wed, Oct 7, 2009 at 10:04 AM, Sean Davis <seandavi at gmail.com> wrote:
>>
>>
>> On Wed, Oct 7, 2009 at 10:49 AM, Peng Yu <pengyu.ut at gmail.com> wrote:
>>>
>>> Besides the distance metrics, there are other things that may also be
>>> important. For example, multiple probesets map to the same gene. I can
>>> do clustering on probeset values or on averaged probeset values of
>>> genes. What factors should I consider when I make this kind of
>>> decision?
>>>
>>
>> It is generally best not to average probes.  You could choose one to be
>> representative of each gene, but averaging is not the best way to go.
>
> Is there any justification why it is not good to average probes?

It is pretty simple, actually.  Different probes for the same gene do
not measure the same thing.  In statistical terms, they are not drawn
from the same distribution.
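
A minimal sketch of the difference, in plain Python with made-up
numbers (the probeset IDs, gene names, and values below are all
hypothetical).  It assumes one probeset for a gene responds across
four samples while the other stays flat.  Picking the most variable
probeset as the representative is only one possible heuristic, not a
recommendation from this thread.

```python
from statistics import mean, pvariance

# Hypothetical toy data: probeset id -> (gene, expression in 4 samples).
probesets = {
    "ps_1": ("GENE_A", [5.0, 5.0, 9.0, 9.0]),  # responds in samples 3 and 4
    "ps_2": ("GENE_A", [7.0, 7.0, 7.0, 7.0]),  # flat across all samples
    "ps_3": ("GENE_B", [6.0, 8.0, 6.0, 8.0]),
}


def average_per_gene(probesets):
    """Average all probesets of a gene, sample by sample."""
    by_gene = {}
    for gene, values in probesets.values():
        by_gene.setdefault(gene, []).append(values)
    return {
        gene: [mean(column) for column in zip(*rows)]
        for gene, rows in by_gene.items()
    }


def most_variable_per_gene(probesets):
    """Pick, for each gene, the probeset with the largest variance."""
    best = {}
    for ps_id, (gene, values) in probesets.items():
        current = best.get(gene)
        if current is None or pvariance(values) > pvariance(probesets[current][1]):
            best[gene] = ps_id
    return best


averaged = average_per_gene(probesets)
representative = most_variable_per_gene(probesets)

print(averaged["GENE_A"])   # the averaged profile for GENE_A
print(representative)       # the chosen probeset per gene
```

In this toy case the averaged GENE_A profile shows half the change
that ps_1 shows on its own, because the flat probeset dilutes it.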

>>> bioDist says something about two popular metrics, but the description
>>> is distilled. I am wondering whether there are some more detailed
>>> comparisons between metrics.
>>
>> Often, the metrics produce highly compatible pictures of the data.  The
>> actual metric you will use may be directed somewhat by the goals of the
>> analysis but, at least for hierarchical clustering, I think it is difficult
>> to argue for one "best" or "recommended" metric.
>>
>> In practice, you may want to try a few to see how they behave on your data.
>
> If the results from different metrics differ, how do I decide
> which one I should use?

If you have a gold standard or another source of information about how
samples/genes should be grouped, you can justify your choice based on
whether the subjects that are supposed to be most similar actually are.  Lacking such
information, there are other techniques such as looking at the cluster
stability under resampling that might be useful to think about.
Others might have more concrete suggestions about how to go about
measuring clustering effectiveness; it is a research topic of its own.
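
As a rough illustration of the gold-standard check, here is a minimal
sketch in plain Python.  The profiles and labels are made up and were
deliberately constructed so that the two metrics disagree; the function
names and the nearest-neighbour score are assumptions for this example,
not something taken from bioDist.  It asks, for each sample, whether its
nearest neighbour under a given distance carries the same known label.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def correlation_distance(x, y):
    """1 - Pearson correlation; ignores offset and scale of the profiles."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


def nearest_neighbor_agreement(profiles, labels, dist):
    """Fraction of samples whose nearest neighbour shares their label."""
    hits = 0
    for i, profile in enumerate(profiles):
        others = [j for j in range(len(profiles)) if j != i]
        nearest = min(others, key=lambda j: dist(profile, profiles[j]))
        hits += labels[nearest] == labels[i]
    return hits / len(profiles)


# Hypothetical samples: group A rises, group B falls; within each group
# the second sample has the same shape at a higher overall intensity.
profiles = [
    [1.0, 2.0, 3.0, 4.0],
    [11.0, 12.0, 13.0, 14.0],
    [4.0, 3.0, 2.0, 1.0],
    [14.0, 13.0, 12.0, 11.0],
]
labels = ["A", "A", "B", "B"]  # the assumed gold standard

scores = {
    "euclidean": nearest_neighbor_agreement(profiles, labels, euclidean),
    "correlation": nearest_neighbor_agreement(profiles, labels, correlation_distance),
}
print(scores)
```

On this toy data correlation distance recovers the labels and Euclidean
distance does not, but that only reflects how the example was built.
With real data the same kind of score can favour either metric.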

Sean

>>> On Wed, Oct 7, 2009 at 12:35 AM, Tim Triche <tim.triche at gmail.com> wrote:
>>> > look at the bioDist package for some suggestions.
>>> >
>>> > the metric to use depends on your task.
>>> >
>>> >
>>> > On Tue, Oct 6, 2009 at 8:52 PM, Peng Yu <pengyu.ut at gmail.com> wrote:
>>> >>
>>> >> Hi,
>>> >>
>>> >> I am looking for the most appropriate distance metric for the
>>> >> clustering of a set of microarray data. I read Chapter 12 of
>>> >> Bioinformatics and Computational Biology Solutions Using R and
>>> >> Bioconductor, but I'm still not clear what the general guideline is
>>> >> for choosing an appropriate distance metric out of the many listed in
>>> >> that chapter. Could somebody let me know how to choose an appropriate
>>> >> distance metric?
>>> >>
>>> >> Regards,
>>> >> Peng
>>> >>
>>> >> _______________________________________________
>>> >> Bioconductor mailing list
>>> >> Bioconductor at stat.math.ethz.ch
>>> >> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> >> Search the archives:
>>> >> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>> >
>>> >
>>> >
>>> > --
>>> > Statisticians, like artists, have a bad habit of falling in love with
>>> > their models.
>>> > --George Box
>>> >
>>>