[R] Correlated Columns in data frame
Nataraj
nataraj at biotech2.sastra.edu
Sat May 17 07:40:51 CEST 2008
Dear all,
I apologise for posting my query to the list again; my
previous mail to this list did not receive any reply.
To put it simply: could someone please give me code to
find the columns of a data frame whose correlation
coefficients are below a pre-determined threshold? (For
the detailed query, please see my previous message to this
list, pasted below.)
Thanks and regards,
B.Nataraj
Following is my previous message to this list, to which I
did not get any reply.
Dear all,
I am trying to remove correlated columns from a data frame, df.
I found some R code for this on the page
http://cheminfo.informatics.indiana.edu/~rguha/code/R/ of
Mr. Rajarshi Guha.
The code is:
#################
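r2test <- function(d, cutoff=0.8) {
  # cutoff is a threshold on r^2 and must lie in (0, 1]
  if (cutoff > 1 || cutoff <= 0) {
    stop("cutoff must satisfy 0 < cutoff <= 1")
  }
  if (!is.matrix(d) && !is.data.frame(d)) {
    stop("Must supply a data.frame or matrix")
  }
  # compare |r| against sqrt(cutoff), i.e. r^2 against cutoff
  r2cut <- sqrt(cutoff)
  cormat <- cor(d)
  # (row, col) positions of all entries with |r| above the threshold
  bad.idx <- which(abs(cormat) > r2cut, arr.ind=TRUE)
  # keep the lower triangle only: drops the diagonal and duplicate pairs
  bad.idx <- matrix(bad.idx[bad.idx[,1] > bad.idx[,2]], ncol=2)
  # per pair, take the first or second index depending on runif() > .5
  drop.idx <- ifelse(runif(nrow(bad.idx)) > .5,
                     bad.idx[,1], bad.idx[,2])
  if (length(drop.idx) == 0) {
    1:ncol(d)
  } else {
    (1:ncol(d))[-unique(drop.idx)]
  }
}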
############################################
The problem is that the code returns different output (i.e.
different column numbers) on different calls. I could not
work out from the code why this happens. I can follow the
logic of the code except for the line
********************************************
drop.idx <- ifelse(runif(nrow(bad.idx)) > .5, bad.idx[,1],
bad.idx [,2]);
****************************************
What does comparing runif(nrow(bad.idx)) with 0.5 mean here?
I am looking for help in understanding why the output
differs between function calls, as well as the meaning of
the line mentioned above.
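A minimal sketch of the quoted line run in isolation may help anyone answering. The bad.idx values below are made up for illustration and do not come from real data, and set.seed() is used only to show that the result depends on the random number stream.

```r
# Hypothetical pairs of highly correlated columns (row index > column index),
# standing in for the bad.idx matrix built inside r2test().
bad.idx <- matrix(c(3, 1,
                    4, 2,
                    5, 2), ncol = 2, byrow = TRUE)

# runif(n) draws n uniform random numbers between 0 and 1, one per pair,
# so `> .5` behaves like a coin flip for each row of bad.idx.
set.seed(1)  # fixing the seed makes the coin flips repeatable
coin <- runif(nrow(bad.idx)) > .5

# For each pair, take the first index where coin is TRUE, else the second.
drop.idx <- ifelse(coin, bad.idx[, 1], bad.idx[, 2])

# Repeating the draw with the same seed gives the same choice.
set.seed(1)
drop.again <- ifelse(runif(nrow(bad.idx)) > .5, bad.idx[, 1], bad.idx[, 2])
same.choice <- identical(drop.idx, drop.again)
```

If the run-to-run variation comes from this random choice, calling set.seed() before r2test() would be expected to make the returned columns repeatable. That is an assumption drawn from reading the code, not something stated on the page it came from.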
Thanks!
B.Nataraj