Bill.Venables at csiro.au
Wed Feb 20 00:58:46 CET 2008
Distance matrices are not usually an end in themselves but a means to
some other end. Rather than ask what is the best way to calculate such
a huge distance matrix, maybe the question you should ask yourself is
what are you going to do with it if ever you did manage to calculate it.
Maybe you can bypass the distance matrix calculation and get to the end
point by some other means. For example, if the eventual goal is
clustering, then perhaps something like clara() in the 'cluster' package
will do the job more effectively. It is designed to handle situations
of this kind.
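
As a rough sketch of what that might look like: the simulated matrix, the choice of k = 5 and the samples value below are illustrative assumptions, not taken from the original data. clara() clusters the rows of its input by working on subsamples, so the full pairwise distance matrix is never formed.

```r
library(cluster)

set.seed(1)
# Hypothetical stand-in for the filtered expression matrix:
# 2000 rows (observations to cluster) by 20 columns (variables).
expr <- matrix(rnorm(2000 * 20), nrow = 2000, ncol = 20)

# k = 5 is an arbitrary illustrative choice; metric may also be "manhattan".
fit <- clara(expr, k = 5, metric = "euclidean", samples = 50)

table(fit$clustering)   # cluster sizes
dim(fit$medoids)        # one medoid (row of expr) per cluster
```

Raising samples above its default of 5 costs more time but usually gives a more stable set of medoids.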
# *********** reading in data **********
data <- read.table("microarray.txt", header = TRUE, sep = "\t")
head(data)
dim(data)
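# *********** creating matrix and calculating variance across probesets ********
x <- 1:20000   # rows (probesets) to use
y <- 2:141     # columns holding the 140 samples
data.matrix <- data.matrix(data[, y])
variableprobe <- apply(data.matrix[x, ], 1, var)   # per-probeset variance
hist(variableprobe)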
# ************** filter out low variance *************
# keep the 10000 probesets with the highest variance
data.sub <- data.matrix[order(variableprobe, decreasing = TRUE), ][1:10000, ]
dim(data.sub)
# [1] 10000   140
What is the best way to calculate the distances between the samples
using the Euclidean or Manhattan distance metrics?
Any suggestions?
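
For the question as literally posed, a minimal sketch in base R, assuming the samples are the columns of the matrix as in data.sub above. The simulated matrix and its sample names are hypothetical stand-ins. dist() measures distances between rows, so the matrix is transposed first. With 140 samples the result is only 140 x 140. It is distances between the 10000 probesets that would produce a very large object.

```r
set.seed(42)
# Hypothetical stand-in for data.sub: probesets in rows, samples in columns.
expr <- matrix(rnorm(1000 * 14), nrow = 1000, ncol = 14,
               dimnames = list(NULL, paste0("sample", 1:14)))

# dist() works on rows, so transpose to get sample-to-sample distances.
d.euc <- dist(t(expr), method = "euclidean")
d.man <- dist(t(expr), method = "manhattan")

dim(as.matrix(d.euc))   # square matrix, one row and column per sample
```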
