[BioC] Analysing categorical CGH data

Mon Jul 17 15:57:00 CEST 2006

Hi all,

This is a fairly broad question about which methods are most appropriate for analysing a particular dataset and, by extension, which packages I need to apply them. If this is not the appropriate place for this question then I apologise; could someone please suggest where I might be more likely to find help?

The two main questions I am looking for answers to are:
1) Are there methods available for clustering categorical data of the type I describe below, and if so, what are they? Even just references to other resources would be a good start.
2) Following on from this, is correspondence analysis a reasonable approach, and have I applied it correctly?

What follows is a little long-winded, but I couldn't express it any more briefly.
I am attempting to analyse a set of CGH data from a series of colorectal cancer samples that I have downloaded from the Progenetix CGH database (found at http://www.progenetix.de/~pgscripts/progenetix/Aboutprogenetix.html). This dataset includes samples from colorectal adenomas, carcinomas and metastases. Each case (patient) is described in terms of gain or loss of individual chromosome bands using the 862 band resolution ISCN notation, with each band scored as 0 = no change, -1 = loss, 1 = gain or 2 = high level amplification. Essentially this results in a dataset with 440 observations (there are 440 cases) of 862 variables, where each variable can take 1 of 4 possible values. Additionally, each case falls into one of three possible categories: adenoma, primary tumour or metastasis.
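To make the layout concrete, here is a toy stand-in for the data in R. The sizes, object names (cgh, stage, cgh.df) and values are invented purely for illustration; this is not the real Progenetix data, and the factor coding is only one way I could imagine representing it.

```r
# Toy stand-in: 6 cases x 5 bands (the real dataset is 440 cases x 862 bands).
set.seed(1)
states <- c(-1, 0, 1, 2)  # loss, no change, gain, high level amplification
cgh <- matrix(sample(states, 6 * 5, replace = TRUE), nrow = 6,
              dimnames = list(paste0("case", 1:6), paste0("band", 1:5)))

# Hypothetical stage label for each case
stage <- factor(c("adenoma", "adenoma", "primary",
                  "primary", "metastasis", "metastasis"),
                levels = c("adenoma", "primary", "metastasis"))

# Treat each band as a categorical variable rather than a number
cgh.df <- as.data.frame(cgh)
cgh.df[] <- lapply(cgh.df, factor, levels = states,
                   labels = c("loss", "none", "gain", "amp"))

str(cgh.df)
table(stage)
```

Here the cases are rows and the bands are columns, i.e. one observation per row.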

I wish to perform cluster analysis on this dataset to identify changes that are associated with particular stages in the progression from adenoma-carcinoma-metastasis. What would be the most appropriate method for this and what packages supply said method?

So far I have tried hierarchical clustering using the pvclust R package, k-means clustering from the stats package and correspondence analysis from the made4 package. Are these methods valid for categorical data? I have tried searching the web and the mailing lists for this question without finding a satisfactory answer. It seems to be implied that they are only suitable for continuous data, but I could not find an explicit answer.
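As an illustration of the kind of alternative I am asking about, the sketch below clusters toy data using a simple mismatch distance (the proportion of bands at which two cases have different states) instead of a Euclidean distance. The data and object names are made up, and I do not know whether this is statistically appropriate for CGH data; that is part of the question.

```r
# Toy data: 6 cases x 5 bands, coded -1/0/1/2 as described above
set.seed(1)
cgh <- matrix(sample(c(-1, 0, 1, 2), 6 * 5, replace = TRUE), nrow = 6,
              dimnames = list(paste0("case", 1:6), paste0("band", 1:5)))

# Simple matching dissimilarity: fraction of bands where two cases differ
n <- nrow(cgh)
d <- matrix(0, n, n, dimnames = list(rownames(cgh), rownames(cgh)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- mean(cgh[i, ] != cgh[j, ])
  }
}
d <- as.dist(d)

# Average-linkage hierarchical clustering on that dissimilarity
hc <- hclust(d, method = "average")
groups <- cutree(hc, k = 2)  # k = 2 is an arbitrary choice for the toy example
print(groups)
```

As far as I can tell, daisy() in the cluster package with metric = "gower" computes an equivalent dissimilarity when the columns are factors, which would avoid the explicit loop on the full dataset.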

To try and get around this I attempted correspondence analysis, which I am led to believe from reading around is suitable for categorical data. However, this method is outside my current (fairly elementary) knowledge of statistics, so I wanted to confirm whether I am applying it correctly. I loaded my dataset as a data frame with cases as columns and chromosome bands as rows. I also loaded a class vector that categorised each case (column of the data frame) as 'adenoma', 'primary' or 'metastasis'. I then ran the analysis using

> library(made4)
> P.coa <- ord(Progenetix.coa, type="coa", classvec=Progenetix.class)
> plot(P.coa, classvec=Progenetix.class, arraycol=c("red","blue","yellow","green"))

This resulted in everything being clustered close to the origin, with the three classes on top of each other. This suggests that there are no informative clusters in the dataset. However, my concern is that the result is caused by me applying the wrong methods.
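One thing I am unsure about is the coding itself. Correspondence analysis is normally described for tables of non-negative counts, whereas my scores include -1 and are really labels rather than quantities. A possible alternative, sketched below on invented toy data, would be to expand each band into 0/1 indicator columns (one per state) before running the analysis. The object names here are hypothetical and I have not tried this on the real data.

```r
# Toy data: 6 cases x 5 bands, coded -1/0/1/2
set.seed(1)
states <- c(-1, 0, 1, 2)
state.names <- c("loss", "none", "gain", "amp")
cgh <- matrix(sample(states, 6 * 5, replace = TRUE), nrow = 6,
              dimnames = list(paste0("case", 1:6), paste0("band", 1:5)))

# Build one block of four 0/1 columns per band (1 = case has that state)
blocks <- lapply(colnames(cgh), function(b) {
  m <- outer(cgh[, b], states, "==") * 1
  colnames(m) <- paste(b, state.names, sep = ".")
  m
})
ind <- do.call(cbind, blocks)

# Drop states that never occur, since all-zero columns carry no information
ind <- ind[, colSums(ind) > 0, drop = FALSE]

# Every case has exactly one state per band, so each row sums to the band count
stopifnot(all(rowSums(ind) == ncol(cgh)))
dim(ind)
```

The resulting matrix contains only zeros and ones, so it could be passed to a correspondence analysis routine without any negative entries.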

Any advice anyone has the time to give will be much appreciated. I have deliberately not described my so far unsuccessful attempts in much detail to try and keep the length down. I can supply more detail on what I have tried if required. Thanks for your patience.

regards,
Richard

Dr Richard Birnie
Scientific Officer
Section of Pathology and Tumour Biology
Welcome Brenner Building, LIMM
St James University Hospital
Beckett St, Leeds, LS9 7TF
Tel:0113 3438624
e-mail: r.birnie at leeds.ac.uk