[BioC] Analysing categorical CGH data
aedin
aedin at jimmy.harvard.edu
Fri Aug 18 23:30:57 CEST 2006
Dear Richard,
I was on holidays when you mailed BioC, and I only spotted your question
today.
CA is not the best approach for analysis of your factor table with 3
categories. It is designed for analysis of count data, originally for
analysis of contingency data (species x traits counts).
If you wish to apply an ordination (i.e. dimension reduction) method to
your categorical data, you could apply multiple correspondence analysis,
which is available in the ade4 package in the function dudi.acm.
If you have made4 installed, it will have installed ade4 automatically.
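As a rough illustration, here is a minimal sketch of how dudi.acm might be called. It uses simulated band calls rather than your Progenetix table, so the object names (calls, calls_df, stage) and the sizes are assumptions made up for the example. Note that dudi.acm expects cases as rows and a data frame in which every column is a factor, so a table stored with cases as columns would need transposing first.

```r
library(ade4)

set.seed(1)
n_cases <- 60   # hypothetical; the real dataset has 440 cases
n_bands <- 20   # hypothetical; the real dataset has 862 bands

# Simulated calls: -1 = loss, 0 = no change, 1 = gain, 2 = amplification
calls <- matrix(sample(c(-1, 0, 1, 2), n_cases * n_bands, replace = TRUE,
                       prob = c(0.25, 0.45, 0.20, 0.10)),
                nrow = n_cases,
                dimnames = list(paste0("case", seq_len(n_cases)),
                                paste0("band", seq_len(n_bands))))

# dudi.acm needs a data frame whose columns are all factors
calls_df <- as.data.frame(lapply(as.data.frame(calls), factor))
rownames(calls_df) <- rownames(calls)

# Multiple correspondence analysis, keeping two axes without prompting
mca <- dudi.acm(calls_df, scannf = FALSE, nf = 2)

# Hypothetical class labels, used only to colour the case scores
stage <- factor(sample(c("adenoma", "primary", "metastasis"),
                       n_cases, replace = TRUE))
s.class(mca$li, fac = stage, col = c("red", "blue", "darkgreen"))
```

The case coordinates are in mca$li and the category coordinates in mca$co. With random data like this the classes will overlap, which is expected.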
Currently I have no wrappers between Bioconductor and the function
dudi.acm in ade4; however, I could implement this as an extension to ord
if you wish.
There are further methods available in ade4. If you can assign weights
to the categories, there is fuzzy correspondence analysis (dudi.fca), or
if you have a mix of quantitative and factor data, you can apply
dudi.mix or dudi.hillsmith. See the ade4 manual for more details on
these.
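For the mixed case, a minimal sketch might look like the following. The data frame is simulated and the quantitative columns (age, n_altered) are hypothetical covariates invented for the example; they are not part of your dataset.

```r
library(ade4)

set.seed(2)
n_cases <- 60  # hypothetical sample size

# Two factor columns (band calls) plus two made-up quantitative columns
mixed <- data.frame(
  band1     = factor(sample(c("loss", "none", "gain"), n_cases, replace = TRUE)),
  band2     = factor(sample(c("loss", "none", "gain"), n_cases, replace = TRUE)),
  age       = rnorm(n_cases, mean = 65, sd = 10),
  n_altered = as.numeric(rpois(n_cases, lambda = 8))
)

# Both functions accept a data frame mixing numeric and factor columns
hs <- dudi.hillsmith(mixed, scannf = FALSE, nf = 2)
mx <- dudi.mix(mixed, scannf = FALSE, nf = 2)

head(hs$li)  # case scores on the first two axes
```

dudi.fca is left out of this sketch because it needs the table to be prepared as fuzzy variables first; the ade4 manual describes that step.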
I have not applied any of these approaches to CGH data myself, so I
can't comment on how well they will work. However, I am glad to help you
if I can.
Regards
Aedin
Message: 3
Date: Mon, 17 Jul 2006 14:57:00 +0100
From: "Richard Birnie"
Subject: [BioC] Analysing categorical CGH data
Hi all,
This is a relatively broad question about which methods would be most
appropriate for analysing a particular dataset and, by extension, which
packages I need to perform them. If this is not the appropriate place
for this question then I apologise; could someone please suggest where I
might be more likely to find help? The two main questions I am looking
for answers to are:
1) Are there methods available for clustering categorical data of the
type I describe below, and if so what are they? Even just references to
other resources would be a good start.
2) Following on from this, is the application of correspondence analysis
a reasonable approach, and have I done it correctly?
What follows is a little long-winded but I couldn't express it any more
briefly.
I am attempting to analyse a set of CGH data from a series of colorectal
cancer samples that I have downloaded from the Progenetix CGH database
(found at
http://www.progenetix.de/~pgscripts/progenetix/Aboutprogenetix.html).
This dataset includes samples from colorectal adenomas, carcinomas and
metastases. Each case (patient) is described in terms of gain or loss of
individual chromosome bands using the 862 band resolution ISCN notation,
such that each band is scored as 0 = no change, -1 = loss, 1 = gain or
2 = high level amplification. Essentially this results in a dataset with
440 observations (there are 440 cases) of 862 variables, where each
variable can take 1 of 4 possible values. Additionally, each case falls
into one of three possible categories: adenoma, primary tumour or
metastasis.
I wish to perform cluster analysis on this dataset to identify changes
that are associated with particular stages in the progression from
adenoma to carcinoma to metastasis. What would be the most appropriate
method for this, and which packages supply it?
So far I have tried hierarchical clustering using the pvclust R package,
k-means clustering from the stats package and correspondence analysis
from the made4 package. Are these methods valid for categorical data? I
have tried searching the web and the mailing lists for this question
without finding a satisfactory answer. It seems to be implied that they
are only suitable for continuous data, but I could not find an explicit
answer.
To try and get around this I attempted correspondence analysis, which I
am led to believe from reading around is suitable for categorical data.
However, this method is outside my current (fairly elementary) knowledge
of statistics, so I wanted to confirm that I am applying it correctly. I
loaded my dataset as a dataframe with cases as columns and chromosome
bands as rows. I also loaded a class vector that categorised each case
(column of dataframe) as 'adenoma', 'primary' or 'tumour'. I then ran
the analysis using
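library(made4)  # provides ord() and its plot method
# Correspondence analysis, with cases labelled by class
P.coa <- ord(Progenetix.coa, type="coa", classvec=Progenetix.class)
plot(P.coa, classvec=Progenetix.class,
     arraycol=c("red","blue","yellow","green"))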
This resulted in everything being clustered close to the origin, with
the three classes on top of each other. This suggests that there are no
informative clusters in the data set; however, my concern is that this
is caused by me applying the wrong methods.
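For illustration only, here is a minimal sketch of one way hierarchical clustering could be run on categorical band calls: replace Euclidean distance with a simple-matching dissimilarity (the proportion of bands at which two cases have different calls). The data are simulated, and the helper name mismatch_dist, the sizes and the choice of average linkage are all assumptions made for the example, not something taken from the analysis described above.

```r
set.seed(3)
n_cases <- 30  # hypothetical; the real dataset has 440 cases
n_bands <- 50  # hypothetical; the real dataset has 862 bands

# Simulated calls, cases as rows: -1 loss, 0 none, 1 gain, 2 amplification
calls <- matrix(sample(c(-1, 0, 1, 2), n_cases * n_bands, replace = TRUE,
                       prob = c(0.25, 0.45, 0.20, 0.10)),
                nrow = n_cases,
                dimnames = list(paste0("case", seq_len(n_cases)), NULL))

# Simple-matching dissimilarity: fraction of bands where two cases differ
mismatch_dist <- function(m) {
  n <- nrow(m)
  d <- matrix(0, n, n, dimnames = list(rownames(m), rownames(m)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      d[i, j] <- mean(m[i, ] != m[j, ])
    }
  }
  as.dist(d)
}

d <- mismatch_dist(calls)
hc <- hclust(d, method = "average")  # average linkage on the mismatch distances
groups <- cutree(hc, k = 3)          # cut the tree into three groups
table(groups)
```

The codes are treated purely as labels here, so a loss versus a gain counts the same as a gain versus an amplification. Whether that is appropriate depends on the question being asked.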
Any advice anyone has the time to give will be much appreciated. I have
deliberately not described my so far unsuccessful attempts in much
detail to try and keep the length down. I can supply more detail on what
I have tried if required. Thanks for your patience.
regards,
Richard
Dr Richard Birnie
Scientific Officer
Section of Pathology and Tumour Biology
Welcome Brenner Building, LIMM
St James University Hospital
Beckett St, Leeds, LS9 7TF
Tel:0113 3438624
e-mail: r.birnie at leeds.ac.uk