[BioC] Hierarchical clustering and shrinking centroids...

Tue May 25 03:57:03 CEST 2004

Dear list members,

I have been unable to resolve this conceptual problem. 

I performed hierarchical clustering on 80 tumor samples, using a filtered
gene set (cv=0.04, at least 2 samples above a log level of 9), and obtained
several groups. Some of these clusters were definitely more stable than
others. Subsequently, based on visual inspection, and my knowledge of
the case outcomes, I arbitrarily classified one large cluster as 'good
prognosis' and other clusters as 'bad prognosis'. 
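
To make the filter concrete, here is a minimal Python sketch of one way to read it. It assumes that "cv=0.04" means keeping genes whose coefficient of variation is at least 0.04, and that the second criterion means at least 2 samples with a log-scale value above 9. The function name, parameter names and toy values are hypothetical and for illustration only; this is not the code actually used for the analysis.

```python
from statistics import mean, stdev

def passes_filter(log_values, min_cv=0.04, level=9.0, min_samples=2):
    """Return True if one gene's log-scale values pass both filters.

    Assumed reading of the filter: CV >= min_cv AND at least
    min_samples samples with a value above `level`.
    """
    m = mean(log_values)
    if m == 0:
        return False  # CV is undefined when the mean is zero
    cv = stdev(log_values) / abs(m)
    n_above = sum(1 for v in log_values if v > level)
    return cv >= min_cv and n_above >= min_samples

# Toy, made-up expression values (one list per gene, one value per sample).
genes = {
    "flat_low": [6.0, 6.1, 5.9, 6.0],
    "variable_high": [7.0, 9.5, 10.2, 6.5],
    "flat_high": [9.5, 9.5, 9.6, 9.5],
}
kept = [name for name, vals in genes.items() if passes_filter(vals)]
print(kept)
```

Only the gene that is both variable and above the level in two samples survives in this toy input.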

Using this classification, I did a supervised analysis using PAMR to
obtain a gene list. However, the cross-validated misclassification rate
for my good prognosis class is fairly low and stable (<0.05) throughout
the shrinking gene list, while the misclassification rate for my poor
prognosis class is considerably higher, and also fairly stable (approx
0.2). I examined the classification of my cases, and some 'poor
prognosis' cases seemed to be persistently recognized as 'good
prognosis' cases. Evidently, there is some problem with the
classification arising from the choice of algorithm. I have tried
k-nearest neighbour, and the same problem occurs. Looking again at the HC
tree, some of these good/bad prognosis genes are clustered together,
suggesting other genes 

I wonder how I may explain this. I suppose the clustering of these
cases is determined by genes other than those differentiating between
these two major groups. Naturally, validation by an independent set is
ideal, but I guess my question is more on this problem of
cross-validation. 
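
To illustrate the kind of per-class cross-validation error described above, here is a small self-contained Python sketch. It is not PAMR: it uses a plain nearest-centroid classifier with no shrinkage and leave-one-out cross-validation, and the two-gene data are invented so that one 'poor' case sits among the 'good' cases. All function names are hypothetical.

```python
from collections import defaultdict

def centroid(rows):
    """Per-gene mean over a list of samples."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def sq_dist(a, b):
    """Squared Euclidean distance between two samples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def loo_predictions(samples, labels):
    """Leave-one-out predictions from a plain nearest-centroid classifier."""
    preds = []
    for i, x in enumerate(samples):
        by_class = defaultdict(list)
        for j, (row, lab) in enumerate(zip(samples, labels)):
            if j != i:  # hold out sample i when building the centroids
                by_class[lab].append(row)
        cents = {lab: centroid(rows) for lab, rows in by_class.items()}
        preds.append(min(cents, key=lambda lab: sq_dist(x, cents[lab])))
    return preds

def per_class_error(labels, preds):
    """Fraction of misclassified samples within each true class."""
    totals, wrong = defaultdict(int), defaultdict(int)
    for y, p in zip(labels, preds):
        totals[y] += 1
        wrong[y] += int(y != p)
    return {lab: wrong[lab] / totals[lab] for lab in totals}

# Invented toy data: the last 'poor' case lies inside the 'good' group.
samples = [[0, 0], [0, 1], [1, 0], [1, 1],
           [5, 5], [5, 6], [6, 5], [1, 0.5]]
labels = ["good"] * 4 + ["poor"] * 4

preds = loo_predictions(samples, labels)
errors = per_class_error(labels, preds)
misclassified = [i for i, (y, p) in enumerate(zip(labels, preds)) if y != p]
print(errors, misclassified)
```

In this toy setup the 'good' class has no errors while the 'poor' class has an error rate of 0.25, and the same case (index 7) is the one misclassified. Listing the indices in this way shows whether the same cases fail every time.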

I would appreciate any advice, or pointers to any references for this!

Thanks.

Min-Han Tan
