[R] pvclust missing values problem
R.Birnie at leeds.ac.uk
Mon Jul 10 12:56:13 CEST 2006
I posted a question to this list last week and received no response. I am unsure whether this means no one knows the answer or that I posed the question badly, so I'm going to assume the latter and try again. I am new to R, so this may well be a very naive question. If there is something blindingly obvious that I am missing, or another resource I should have consulted, would someone be kind enough to point it out? Although my data come from biological experiments, I think my problem is with R rather than with the nature of the data, but I may be wrong.
I am attempting to use the pvclust package to do some hierarchical clustering on CGH data I have downloaded from the Progenetix database (http://www.progenetix.de/~pgscripts/progenetix/Aboutprogenetix.html). The data are in tab-delimited format: each column is a single sample and each row is a chromosome band. Some example dummy data are shown below.
band sample1 sample2 sample3 sample4
1p36_33 1 0 0 1
1p36_32 -1 0 -1 0
1p36_31 0 1 1 1
1p36_22 0 -1 -1 -1
etc., where 0 = no change, 1 = gain, -1 = loss
I have read this file into R using:
> ProgenetixCRC.all.noXY <- read.table("/home/marraydb/Progenetix/Data/CRCall_noXY.txt", header=TRUE, sep="\t", row.names="band")
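As a minimal sketch, the dummy table above can be read in the same way from an inline string rather than a file, which makes it easy to check how read.table() interprets the layout. The names `dummy_text` and `dummy` are made up for illustration, and the real file is not reproduced here.

```r
# Hypothetical inline copy of the dummy data shown above (tab-delimited).
dummy_text <- paste(
  "band\tsample1\tsample2\tsample3\tsample4",
  "1p36_33\t1\t0\t0\t1",
  "1p36_32\t-1\t0\t-1\t0",
  "1p36_31\t0\t1\t1\t1",
  "1p36_22\t0\t-1\t-1\t-1",
  sep = "\n"
)

# Same arguments as the real call, but reading from text= instead of a file.
dummy <- read.table(text = dummy_text, header = TRUE, sep = "\t",
                    row.names = "band")

str(dummy)     # one integer column per sample
dim(dummy)     # bands (rows) by samples (columns)
anyNA(dummy)   # TRUE only if some entry was read in as NA
```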
Based on the pvclust documentation, I came up with this:
> ProgenetixCRC.all.pvclust <- pvclust(ProgenetixCRC.all, method.dist="cor", method.hclust="average", use.cor="pairwise.complete.obs", nboot=1000)
This results in an error:
Error in hclust(distance, method = method.hclust) :
NA/NaN/Inf in foreign function call (arg 11)
Digging through the mailing list archives, I've discovered that this error means my dataset has missing values. This is very confusing, because I have checked and there are none: running is.na() over the data matrix returns FALSE for every value, which I take to mean that none of the values are NA. I tried various options for the use.cor argument, all with the same result.
Since I originally posted this question, I have tried changing method.dist to euclidean; in this form the function executes without any errors. That is not to say the results actually mean anything, of course. I am at a loss as to how to proceed, and any input from someone more experienced would be gratefully received. If there is some reason why I should not be doing this analysis this way in the first place, I'd appreciate having that pointed out too. I've tried not to put excess information in here, but if more is needed, let me know what and I'll post it.
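One possible explanation, offered as an assumption and not as a diagnosis of the real data: pvclust clusters the columns, and a correlation is undefined for any sample whose values are all identical (for example, a sample with no gains or losses, so every band is 0). In that situation the data themselves contain no NA, but the correlation matrix does, while a Euclidean distance is still defined for every pair. A small sketch with made-up data (the matrix `x` and its values are hypothetical):

```r
# Hypothetical data: 4 bands x 4 samples; sample3 is constant (all 0).
x <- cbind(sample1 = c(1, -1, 0, 0),
           sample2 = c(0, 0, 1, -1),
           sample3 = c(0, 0, 0, 0),
           sample4 = c(1, 0, 1, -1))

anyNA(x)                  # FALSE: no missing values in the data itself

# Correlation between columns. The constant column has zero standard
# deviation, so its correlations come back as NA (with a warning).
cc <- suppressWarnings(cor(x, use = "pairwise.complete.obs"))
anyNA(cc)                 # TRUE

# Euclidean distance between columns is defined for every pair.
anyNA(dist(t(x)))         # FALSE

# Flag zero-variance columns as candidates to inspect.
constant <- apply(x, 2, function(v) var(v, na.rm = TRUE) == 0)
names(which(constant))    # "sample3"
```

If a check like this flags columns in the real data, those samples could be looked at before calling pvclust with a correlation distance; whether dropping them or switching distance measure is appropriate depends on the analysis.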
I suspect the problem is me. However, if it really is the case that no one knows how to answer this, could anyone suggest another mailing list where I might get a better response? Would Bioconductor be a better option, for example?
Apologies for any offence caused by posting the same question but it's difficult for me to proceed until I get some kind of response, even if it is that this list is not the right place for this question.
Thanks for your patience,
Dr Richard Birnie
Section of Pathology and Tumour Biology
Welcome Brenner Building, LIMM
St James University Hospital
Beckett St, Leeds, LS9 7TF
e-mail: r.birnie at leeds.ac.uk