[BioC] Fwd: How to decide which distance metric to use for microarray data clustering?

Peng Yu pengyu.ut at gmail.com
Wed Oct 7 18:31:11 CEST 2009

On Wed, Oct 7, 2009 at 11:13 AM, Steve Lianoglou
<mailinglist.honeypot at gmail.com> wrote:
> Hi Peng,
> On Oct 7, 2009, at 11:54 AM, Peng Yu wrote:
>> On Wed, Oct 7, 2009 at 10:04 AM, Sean Davis <seandavi at gmail.com> wrote:
>>> On Wed, Oct 7, 2009 at 10:49 AM, Peng Yu <pengyu.ut at gmail.com> wrote:
>>>> Besides the distance metrics, there are other things that may also be
>>>> important. For example, multiple probesets map to the same gene. I can
>>>> do clustering on probeset values or on averaged probeset values of
>>>> genes. What factors should I consider when I make this kind of
>>>> decision?
>>> It is generally best not to average probes.  You could choose one to be
>>> representative of each gene, but averaging is not the best way to go.
>> Is there any justification why it is not good to average probes?
> There is a very informative discussion that touches on this topic on the BioC
> list from back in April 2009. I have it flagged with the intention of going
> back to it to work out some examples myself, but alas, haven't yet done so.
> Anyway, this is the thread:
> http://thread.gmane.org/gmane.science.biology.informatics.conductor/22758
> While I recommend you read the whole thing, if you go ~9 messages deep,
> you'll find a post by James MacDonald (April 24th) with the following
> comment:
> """Yes. You are missing the fact that the data from Affy probes usually are
> not normally distributed. In fact, it is not uncommon for a given
> probeset to have widely divergent intensity levels for its component
> probes. Because of the fact that the mean is not robust to outliers,
> people long ago abandoned methods based on a normal distribution."""

Then I can use the median instead of the mean across all the probesets
of a gene, right? But the choice of probeset level vs. gene level still
seems arbitrary to me. Is there a guideline on when probeset-level data
should be used and when gene-level data should be used?
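
To make the mean-vs-median question concrete, here is a minimal sketch of
collapsing probeset values to one value per gene. Everything in it is
hypothetical: the probeset IDs, gene names, intensities, and the helper
collapse_to_gene are made up for illustration and come from no Bioconductor
package. Note also that the quoted comment concerns probes within a probeset,
so applying the same reasoning to probesets within a gene is an assumption.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical log2 intensities for one sample: (probeset_id, gene, value).
# GENE_A has one probeset that diverges strongly from the other two.
probesets = [
    ("ps_1", "GENE_A", 8.0),
    ("ps_2", "GENE_A", 8.2),
    ("ps_3", "GENE_A", 12.4),
    ("ps_4", "GENE_B", 6.0),
    ("ps_5", "GENE_B", 6.5),
]


def collapse_to_gene(rows, summarize):
    """Group probeset values by gene and reduce each group with `summarize`."""
    by_gene = defaultdict(list)
    for _probeset_id, gene, value in rows:
        by_gene[gene].append(value)
    return {gene: summarize(values) for gene, values in by_gene.items()}


gene_mean = collapse_to_gene(probesets, mean)
gene_median = collapse_to_gene(probesets, median)

for gene in sorted(gene_mean):
    print(f"{gene}: mean={gene_mean[gene]:.2f} median={gene_median[gene]:.2f}")
```

For GENE_A the mean is pulled toward the divergent probeset while the median
stays with the two that agree. With only two probesets, as for GENE_B, the
median and the mean are identical, so the median only helps for genes that
have three or more probesets.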

