[BioC] Analysing categorical CGH data

aedin aedin at jimmy.harvard.edu
Fri Aug 18 23:30:57 CEST 2006


Dear Richard,
I was on holidays when you mailed BioC, and I only spotted your question 
today.

Correspondence analysis (CA) is not the best approach for analysing your 
factor table with 3 categories. It is designed for count data, and was 
originally developed for contingency tables (species x traits counts).

If you wish to apply an ordination (i.e. dimension reduction) method to 
your categorical data, you could use multiple correspondence analysis, 
which is available in the ade4 package as the function dudi.acm.

If you have made4 installed, it will have installed ade4 automatically.
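
A minimal sketch of what a call to dudi.acm might look like, assuming the 
data are arranged with cases as rows and chromosome bands as columns, and 
that each band is stored as a factor. The toy table `cgh`, its size and 
the level labels are invented for illustration; they are not the 
Progenetix data. The ade4 call is skipped if ade4 is not installed.

```r
# Toy stand-in for the CGH table: 20 cases x 6 bands (invented data).
set.seed(1)
states <- c(-1, 0, 1, 2)   # loss, no change, gain, high level amplification
codes  <- matrix(sample(states, 20 * 6, replace = TRUE), nrow = 20,
                 dimnames = list(paste0("case", 1:20), paste0("band", 1:6)))

# dudi.acm expects a data frame in which every column is a factor.
cgh <- as.data.frame(codes)
cgh[] <- lapply(cgh, factor, levels = states,
                labels = c("loss", "none", "gain", "amp"))

# Drop bands showing a single state, then drop unused levels, so that
# no category in the analysis is empty.
keep <- sapply(cgh, function(f) nlevels(droplevels(f)) > 1)
cgh  <- cgh[, keep, drop = FALSE]
cgh[] <- lapply(cgh, droplevels)

if (requireNamespace("ade4", quietly = TRUE)) {
  # scannf = FALSE skips the interactive scree plot; nf = 2 keeps two axes.
  acm <- ade4::dudi.acm(cgh, scannf = FALSE, nf = 2)
  print(head(acm$li))   # case coordinates on the first two axes
}
```

If the data frame is stored with bands as rows and cases as columns, as 
made4 expects, it would need to be transposed before the factor 
conversion, for example with as.data.frame(t(x)).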

Currently I have no wrappers between Bioconductor and the function 
dudi.acm in ade4; however, I could implement this as an extension to ord 
if you wish.

There are further methods available in ade4. If you can assign weights to 
the categories, there is fuzzy correspondence analysis (dudi.fca), or if 
you have a mix of quantitative and factor data, you can apply dudi.mix 
or dudi.hillsmith. See the ade4 manual for more details on these.
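
As a rough illustration of the mixed-data case, the sketch below passes a 
data frame with one quantitative column and two factor columns to 
dudi.mix. The variables `age`, `band1` and `band2` and their values are 
hypothetical and invented for the example. The ade4 call is skipped if 
ade4 is not installed.

```r
# Invented mixed table: one quantitative variable plus two factors.
set.seed(2)
n <- 15
mixed <- data.frame(
  age   = round(rnorm(n, mean = 65, sd = 8)),   # quantitative (hypothetical)
  band1 = factor(sample(c("loss", "none", "gain"), n, replace = TRUE)),
  band2 = factor(sample(c("none", "gain"), n, replace = TRUE))
)

if (requireNamespace("ade4", quietly = TRUE)) {
  # scannf = FALSE skips the interactive scree plot; nf = 2 keeps two axes.
  mix <- ade4::dudi.mix(mixed, scannf = FALSE, nf = 2)
  print(head(mix$li))   # row (case) coordinates on the first two axes
}
```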

I have not applied any of these approaches to CGH data myself, so I can't 
comment on how well they will work. However, I am glad to help if I can.

Regards
Aedin






Message: 3
Date: Mon, 17 Jul 2006 14:57:00 +0100
From: "Richard Birnie"
Subject: [BioC] Analysing categorical CGH data

Hi all,

This is actually a relatively broad question regarding what would be the 
most appropriate methods to analyse a particular dataset and, by 
extension, which packages I need to perform said methods. If this is not 
the appropriate place for this question then I apologise; could someone 
please suggest where I might be more likely to find help?

The two main questions I am looking for answers to are:

1) Are there methods available for clustering of categorical data of the 
type I describe below, and if so what are they? Even just references to 
other resources would be a good start.

2) Following on from this, is the application of correspondence analysis 
a reasonable approach, and have I done it correctly?

What follows is a little long-winded but I couldn't express it any more 
briefly.

I am attempting to analyse a set of CGH data from a series of colorectal 
cancer samples that I have downloaded from the Progenetix CGH database 
(found at 
http://www.progenetix.de/~pgscripts/progenetix/Aboutprogenetix.html). 
This dataset includes samples from colorectal adenomas, carcinomas and 
metastases. Each case (patient) is described in terms of gain or loss of 
individual chromosome bands using the 862 band resolution ISCN notation, 
such that each band is scored as 0 = no change, -1 = loss, 1 = gain or 
2 = high level amplification. Essentially this results in a dataset with 
440 observations (there are 440 cases) of 862 variables, where each 
variable can take 1 of 4 possible values. Additionally, each case falls 
into one of three possible categories: adenoma, primary tumour or 
metastasis.

I wish to perform cluster analysis on this dataset to identify changes 
that are associated with particular stages in the progression from 
adenoma to carcinoma to metastasis. What would be the most appropriate 
method for this, and what packages supply said method?

So far I have tried hierarchical clustering using the pvclust R package, 
k-means clustering from the stats package and correspondence analysis 
from the made4 package. Are these methods valid for categorical data? I 
have tried searching the web and the mailing lists for this question 
without finding a satisfactory answer. It seems to be implied that they 
are only suitable for continuous data, but I could not find an explicit 
answer.

To try and get around this I attempted correspondence analysis, which I 
am led to believe from reading around is suitable for categorical data. 
However, this method is outside my current (fairly elementary) knowledge 
of statistics, so I wanted to confirm whether I am applying it correctly. 
I loaded my dataset as a dataframe with cases as columns and chromosome 
bands as rows. I also loaded a class vector that categorised each case 
(column of dataframe) as 'adenoma', 'primary' or 'tumour'. I then ran the 
analysis using

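 > library(made4)
 > # Correspondence analysis with cases (columns) labelled by class
 > P.coa <- ord(Progenetix.coa, type="coa", classvec=Progenetix.class)
 > # Plot the ordination, colouring cases by class
 > plot(P.coa, classvec=Progenetix.class,
 +      arraycol=c("red","blue","yellow","green"))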


This resulted in everything being clustered close to the origin, with the 
three classes on top of each other. This suggests that there are no 
informative clusters in the data set; however, my concern is that this is 
caused by me applying the wrong methods.

Any advice anyone has the time to give will be much appreciated. I have 
deliberately not described my so far unsuccessful attempts in much 
detail to try and keep the length down. I can supply more detail on what 
I have tried if required. Thanks for your patience.

regards,
Richard

Dr Richard Birnie
Scientific Officer
Section of Pathology and Tumour Biology
Welcome Brenner Building, LIMM
St James University Hospital
Beckett St, Leeds, LS9 7TF
Tel:0113 3438624
e-mail: r.birnie at leeds.ac.uk


