[BioC] problem with impute.knn in the impute package
He, Yiwen (NIH/CIT)
heyiwen at mail.nih.gov
Fri Apr 29 19:21:43 CEST 2005
Hi,
I have R version 2.0.1 and Bioconductor 1.5 on both PC and Unix. I was
trying to use the impute.knn function of the impute package on a dataset of
7332 genes and 3 arrays:
> library(impute)
> dim(dd)
[1] 7332 3
> is.matrix(dd)
[1] TRUE
> dd.imputed <- impute.knn(dd)
When run on a PC (Windows XP), R crashes after a few seconds. When run on a
Unix box, I see the following output:
Cluster size 7332 broken into 5667 1665
Cluster size 5667 broken into 4141 1526
Cluster size 4141 broken into 1796 2345
Cluster size 1796 broken into 840 956
Done cluster 840
Done cluster 956
Done cluster 1796
Then the R session closed. So the clustering started but was aborted
somewhere in the middle.
I searched the archive and found another report of the same problem, for a
dataset of 30000 x 2, but it received no answers.
I found some interesting things while playing around with the parameters and
data size:
1).
> impute.knn(dd, k=3)
works, but for k bigger than 3, R crashes as described.
2).
> dd2 <- cbind(dd,dd)
> dim(dd2)
[1] 7332 6
> impute.knn(dd2, k=8)
works, but for k bigger than 8, R crashes.
3).
> dd3 <- cbind(dd, dd, dd)
> dim(dd3)
[1] 7332 9
> impute.knn(dd3)
works (k defaults to 10), but
> impute.knn(dd3, k=17)
crashes R.
I also played around with other parameters but they didn't help.
My conclusion is that the number of neighbors (k) is critical here: the
largest k that works grows with the number of columns. However, it's not
obvious how to choose it from the data size.
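For anyone who wants to try this without my data, a rough sketch along the
following lines might serve as a starting point. The simulated matrix, the 5%
missing rate, and the names sim, sim2 and sim3 are assumptions made up for
illustration; they are not the real dataset, and I have not confirmed that
random data triggers the same crash.

```r
# Stand-in for dd: a 7332 x 3 numeric matrix (simulated, NOT the real data)
set.seed(1)
sim <- matrix(rnorm(7332 * 3), nrow = 7332, ncol = 3)

# Knock out roughly 5% of the entries at random (assumed missing rate)
n.missing <- round(0.05 * length(sim))
sim[sample(length(sim), n.missing)] <- NA

# Widen the matrix the same way as in findings 2) and 3)
sim2 <- cbind(sim, sim)
sim3 <- cbind(sim, sim, sim)

# Only call impute.knn when the impute package is installed
if (requireNamespace("impute", quietly = TRUE)) {
  # k = 3 was the largest value that worked for the 3-column data
  res <- impute::impute.knn(sim, k = 3)
}
```

Raising k one step at a time on sim, sim2 and sim3 should show whether the
largest usable k tracks the number of columns the way it did for dd.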
Can anybody help, or at least point me to the maintainer of the impute
package?
Thanks, Yiwen
Yiwen He
Contractor
Center for Information Technology
National Institutes of Health