[BioC] Fwd: How to decide which distance metric to use for microarray data clustering?

Wed Oct 7 19:28:02 CEST 2009

Hi,

On Oct 7, 2009, at 12:31 PM, Peng Yu wrote:

> On Wed, Oct 7, 2009 at 11:13 AM, Steve Lianoglou
<snip>
>> There is a very informative discussion that touches on this topic on the
>> BioC list from back in April 2009. I have it flagged with the intention of
>> going back to it to work out some examples myself, but alas, haven't yet
>> done so.
>>
>> Anyway, this is the thread:
>>
>> http://thread.gmane.org/gmane.science.biology.informatics.conductor/22758
>>
>> While I recommend you read the whole thing, if you go ~9 messages deep,
>> you'll find a post by James MacDonald (April 24th) with the following
>> comment:
>>
>> """Yes. You are missing the fact that the data from Affy probes usually
>> are not normally distributed. In fact, it is not uncommon for a given
>> probeset to have widely divergent intensity levels for its component
>> probes. Because of the fact that the mean is not robust to outliers,
>> people long ago abandoned methods based on a normal distribution."""
>
> Then I can use median instead of mean for all the probesets of a gene,
> right?

I'm not sure that you'll get a direct answer to this question. It  
depends on what you're trying to do, right?

If you can appreciate what Sean mentioned earlier, and some of the  
things that came up in that thread I linked to, then you would be in a  
better position to (i) make a judgement call yourself, and (ii)  
justify it if someone wonders why you did what you did.
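
To make the robustness point from that quote concrete, here is a minimal
sketch in Python. The intensities are made up, and `summarize` is just a
hypothetical helper for illustration, not anything from a Bioconductor
package:

```python
from statistics import mean, median


def summarize(intensities):
    """Return (mean, median) for a list of log2 intensities."""
    return mean(intensities), median(intensities)


# Made-up log2 intensities, purely for illustration.
well_behaved = [7.9, 8.1, 8.0, 8.2, 7.8]
with_outlier = well_behaved + [13.5]  # one wildly divergent value

for label, values in [("no outlier", well_behaved),
                      ("one outlier", with_outlier)]:
    m, med = summarize(values)
    print(f"{label:12s} mean={m:.2f} median={med:.2f}")
```

The single outlier pulls the mean up by almost a full log2 unit, while the
median barely moves. That only tells you the median is more robust; it
doesn't tell you whether collapsing everything to one number per gene is
the right thing for your analysis.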

> But the choice of probeset level vs. gene level is still
> arbitrary to me.

Do you understand the difference between the two? Some of the figures
(and perhaps even the text) in this paper should help:

http://www.biomedcentral.com/1471-2105/7/276

Just fished out a sentence from the discussion section that you might  
find disheartening, given your hunt to find meaning in clustering:

"""For this reason, particular care must be taken when analysing  
expression data using correlation-based approaches"""

> Is there a guideline on when probeset level data
> should be used and when gene level data should be used?

There's a whole mess load of papers dealing with:

  1. microarrays
  2. their design
  3. the problems with their design
  4. how to normalize them considering (2) and (3)
  5. the flaws in papers dealing with (4)
  6. why a different type of microarray is needed (double vs. single
     channel)
  7. go to 2

  ... etc ....

Now imagine for a moment that the guideline you're asking for existed.
What kind of info would be in it? Perhaps equally important, given the
pseudo-list I made above: what info would you exclude?

I think you're looking for easy answers to difficult problems (e.g. "I
can just use the median, right?"). As I said before, I don't think
you'll get any, sorry[1]. As mentioned above, I'm guessing the best
you can do is to try to appreciate the issues involved in working with
microarray data and make an informed decision.

HTH,
-steve

[1] Although it would be great if some seasoned practitioner would
chime in to the contrary, at which point I'd gladly eat my hat.

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
   |  Memorial Sloan-Kettering Cancer Center
   |  Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact