[BioC] Hierarchical clustering and shrinking centroids...

Stephen Henderson s.henderson at ucl.ac.uk
Wed May 26 12:25:00 CEST 2004

Yes, I'm not sure why you started with the clustering either (though it
suggests that you are on the right track). You should run PAMR on the
samples' actual outcomes, not on whether they fall in the imperfect good
or bad cluster. Forgive me if I've misunderstood you.
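
To make the suggestion concrete, here is a minimal Python sketch of
leave-one-out cross-validation with labels taken from the recorded outcome
rather than from cluster membership. It uses a plain nearest-centroid rule,
not PAMR's shrunken-centroid method, and the expression values, labels and
function names are all invented for illustration.

```python
from collections import defaultdict
from math import dist


def centroid(rows):
    """Column-wise mean of a list of equal-length numeric rows."""
    return [sum(col) / len(col) for col in zip(*rows)]


def loo_nearest_centroid(samples, outcomes):
    """Leave-one-out predictions from a plain nearest-centroid rule."""
    predictions = []
    for i, held_out in enumerate(samples):
        # Group the remaining samples by their recorded outcome.
        by_class = defaultdict(list)
        for j, (row, label) in enumerate(zip(samples, outcomes)):
            if j != i:
                by_class[label].append(row)
        # Assign the held-out sample to the class with the closest centroid.
        predictions.append(
            min(by_class, key=lambda c: dist(held_out, centroid(by_class[c])))
        )
    return predictions


def per_class_error(outcomes, predictions):
    """Fraction of cases in each true class that were given the wrong label."""
    totals, wrong = defaultdict(int), defaultdict(int)
    for truth, guess in zip(outcomes, predictions):
        totals[truth] += 1
        wrong[truth] += truth != guess
    return {c: wrong[c] / totals[c] for c in totals}


# Toy expression values (two genes per sample); invented for illustration.
samples = [[1.0, 1.2], [0.9, 1.0], [1.1, 0.8], [3.0, 3.1], [2.9, 3.3], [1.0, 1.1]]
# Labels are the recorded outcomes, not cluster membership (also invented).
outcomes = ["good", "good", "good", "bad", "bad", "bad"]

predictions = loo_nearest_centroid(samples, outcomes)
errors = per_class_error(outcomes, predictions)
print(errors)
```

In this toy data the last sample has a bad outcome but an expression profile
like the good-outcome samples, so it is misclassified: the per-class error is
0 for 'good' and 1/3 for 'bad'. That is the same pattern described in the
original question, and it comes from the data rather than from the classifier.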

There is a useful guide to using classification on array data (using e1071
svm) under the short courses page on Bioconductor, the Heidelberg Course
Sept 2002. I found this helpful getting started in R. The guide to the ipred
package is also excellent.


PS: an error rate of 0.2 is reasonable, I think, for a tumour prognosis. No?

-----Original Message-----
From: Tom R. Fahland
To: Tan, MinHan; bioconductor
Sent: 5/26/04 12:52 AM
Subject: RE: [BioC] Hierarchical clustering and shrinking centroids...


I have been doing a lot of classification using PAMR, as well as LDA and
The overused phrase "the data is what it is" is valid here. I look at
highly correlated samples that misclassify, and they are usually the
same with different classification algorithms. Sometimes I don't get
really good stability with different gene lists either. HC clustering uses
simple correlation metrics, so starting from this can be problematic. I
know I really didn't answer anything, but thought sharing my experience
might help.
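
One way to check whether the same samples misclassify under different
algorithms is to intersect the sets of misclassified sample IDs. The Python
sketch below assumes each classifier's predictions are available as a
mapping from sample ID to label; the sample IDs, classifier names and labels
are hypothetical.

```python
def misclassified(truth, predicted):
    """Return the set of sample IDs whose predicted label differs from the truth."""
    return {sid for sid, label in truth.items() if predicted[sid] != label}


# Hypothetical sample IDs and labels, invented for illustration only.
truth = {"s1": "good", "s2": "good", "s3": "bad", "s4": "bad", "s5": "bad"}
predictions = {
    "classifier_a": {"s1": "good", "s2": "good", "s3": "good", "s4": "bad", "s5": "good"},
    "classifier_b": {"s1": "good", "s2": "bad", "s3": "good", "s4": "bad", "s5": "bad"},
}

# Misclassified sample IDs, per classifier.
errors = {name: misclassified(truth, pred) for name, pred in predictions.items()}

# Samples that every classifier gets wrong are candidates for a closer look.
persistent = set.intersection(*errors.values())
print(sorted(persistent))
```

Here only s3 is misclassified by both hypothetical classifiers. Samples like
that are more likely to reflect the data or the labels than the choice of
algorithm.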

-----Original Message-----
From: Tan, MinHan [mailto:MinHan.Tan at vai.org] 
Sent: Monday, May 24, 2004 18:57
To: bioconductor at stat.math.ethz.ch
Subject: [BioC] Hierarchical clustering and shrinking centroids...

Dear list members,
I have been unable to resolve this conceptual problem. 
I performed hierarchical clustering on a filtered sample (cv=0.04, at
least 2 samples > level of log 9) of 80 tumor samples, and obtained
several groups. Some of these clusters were definitely more stable than
others. Subsequently, based on visual inspection, and my knowledge of
the case outcomes, I arbitrarily classified one large cluster as 'good
prognosis' and other clusters as 'bad prognosis'. 
Using the classification obtained above, I did a supervised analysis
using PAMR to obtain a gene list. The misclassification rate during
cross-validation for my good prognosis class is fairly low and stable
(<0.05) throughout the shrinking gene list, but the misclassification
rate for my poor prognosis class is relatively higher, and also fairly
stable (approx 0.2). I examined the classification of my cases, and some
'poor prognosis' cases seemed to be persistently recognized as 'good
prognosis' cases. Evidently, there is some problem with the
classification arising from the choice of algorithm. I have tried
k-nearest neighbour, and the same problem occurs. Relooking at the HC
tree, some of these good/bad prognosis genes are clustered together,
suggesting other genes 

I wonder how I may explain this -  I suppose the clustering of these
cases is determined by genes other than those differentiating between
these two major groups. Naturally, validation by an independent set is
ideal, but I guess my question is more on this problem of
I would appreciate any advice, or pointers to any references for this!
Min-Han Tan


Bioconductor mailing list
Bioconductor at stat.math.ethz.ch

