[R] Remove highly correlated variables from a data frame or matrix

Sat Nov 16 17:10:07 CET 2019

Hi Peter,

Thank you so much!!! I will use complete linkage clustering because the
Mendelian Randomization function
(https://cran.r-project.org/web/packages/MendelianRandomization/vignettes/Vignette_MR.pdf)
that I plan to use allows for correlations, but not ones as high as 0.9 or more.
I got 40 SNPs out of 246, so that is an improvement!

Regards,
Ana
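
For reference, the approach discussed in the quoted messages below can be sketched roughly as follows. This is only an illustration under stated assumptions, not the exact code that was run: calc.rho here is a simulated stand-in (absolute correlations of 20 made-up variables), the cut height 0.2 corresponds to the 0.8 threshold mentioned in the thread, and complete linkage is used as in the message above. The as.dist() call follows the correction given in the quoted reply, since hclust() expects a "dist" object rather than a plain matrix.

```r
# Simulated stand-in for calc.rho (hypothetical data, not the SNP matrix
# from the thread): 5 independent signals, each with 4 noisy copies.
set.seed(1)
n <- 200
base <- matrix(rnorm(n * 5), n, 5)
x <- base[, rep(1:5, each = 4)] + matrix(rnorm(n * 20, sd = 0.2), n, 20)
colnames(x) <- paste0("v", 1:20)
calc.rho <- abs(cor(x))

# hclust() needs a "dist" object, hence as.dist(); dissimilarity = 1 - correlation.
tree <- hclust(as.dist(1 - calc.rho), method = "complete")

# Cutting at h = 0.2 groups variables whose pairwise correlations are all >= 0.8.
clusts <- cutree(tree, h = 0.2)

# From each cluster keep the member with the highest summed correlation
# to the other members of that cluster.
representatives <- sapply(sort(unique(clusts)), function(cl) {
  inClust <- which(clusts == cl)
  rho1 <- calc.rho[inClust, inClust, drop = FALSE]
  inClust[which.max(colSums(rho1))]
})

rho.retained <- calc.rho[representatives, representatives]

# Largest remaining correlation between two different retained variables.
offdiag <- rho.retained[upper.tri(rho.retained)]
max(offdiag)
```

With complete linkage, every pair of variables inside a cluster has a correlation of at least 0.8, but representatives taken from different clusters are not guaranteed to be below the threshold, so the final max(offdiag) check is worth keeping when this is applied to real data.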

On Fri, Nov 15, 2019 at 8:01 PM Peter Langfelder
<peter.langfelder using gmail.com> wrote:
>
> Try hclust(as.dist(1-calc.rho), method = "average").
>
> Peter
>
> On Fri, Nov 15, 2019 at 10:02 AM Ana Marija <sokovic.anamarija using gmail.com> wrote:
> >
> > Hi Peter,
> >
> > Thank you for getting back to me and shedding light on this. I see
> > your point, doing Jim's method:
> >
> > > keeprows<-apply(calc.rho,1,function(x) return(sum(x>0.8)<3))
> > > ro246.lt.8<-calc.rho[keeprows,keeprows]
> > > ro246.lt.8[ro246.lt.8 == 1] <- NA
> > > (mmax <- max(abs(ro246.lt.8), na.rm=TRUE))
> > [1] 0.566
> >
> > Which is good in general; correlations in my matrix should not
> > exceed 0.8. I need to run Mendelian Randomization on it later on, so
> > I cannot have highly correlated SNPs in there. But with Jim's
> > method I am left with only 17 SNPs (out of 246), which means that
> > both SNPs of each highly correlated pair are removed, and it would be
> > good to keep one of them.
> >
> > I tried to run your code:
> > > tree = hclust(1-calc.rho, method = "average")
> > Error in if (is.na(n) || n > 65536L) stop("size cannot be NA nor
> > exceed 65536") :
> >   missing value where TRUE/FALSE needed
> >
> > Please advise.
> >
> > Thanks
> > Ana
> >
> > On Thu, Nov 14, 2019 at 7:37 PM Peter Langfelder
> > <peter.langfelder using gmail.com> wrote:
> > >
> > > I suspect that you want to identify which variables are highly
> > > correlated, and then keep only "representative" variables, i.e.,
> > > remove redundant ones. This is a somewhat risky procedure, but I have
> > > sometimes done such things myself to simplify large sets of
> > > highly related variables. If your threshold of 0.8 is approximate, you
> > > could simply use average linkage hierarchical clustering with
> > > dissimilarity = 1-correlation, cut the tree at the appropriate height
> > > (1-0.8=0.2), and from each cluster keep a single representative (e.g.,
> > > the one with the highest mean correlation with other members of the
> > > cluster). Something along these lines (untested)
> > >
> > > tree = hclust(1-calc.rho, method = "average")
> > > clusts = cutree(tree, h = 0.2)
> > > clustLevels = sort(unique(clusts))
> > > representatives = unlist(lapply(clustLevels, function(cl)
> > > {
> > >   inClust = which(clusts==cl);
> > >   rho1 = calc.rho[inClust, inClust, drop = FALSE];
> > >   repr = inClust[ which.max(colSums(rho1)) ]
> > >   repr
> > > }))
> > >
> > > The variable representatives now contains the indices of the variables
> > > you want to retain, so you could subset the calc.rho matrix as
> > > rho.retained = calc.rho[representatives, representatives]
> > >
> > > I haven't tested the code and it may contain bugs, but something along
> > > these lines should get you where you want to be.
> > >
> > > Oh, and depending on how strict you want to be with the remaining
> > > correlations, you could use complete linkage clustering (will retain
> > > more variables, some correlations will be above 0.8).
> > >
> > > Peter
> > >
> > > On Thu, Nov 14, 2019 at 10:50 AM Ana Marija <sokovic.anamarija using gmail.com> wrote:
> > > >
> > > > Hello,
> > > >
> > > > I have a data frame like this (a matrix):
> > > > head(calc.rho)
> > > >             rs9900318 rs8069906 rs9908521 rs9908336 rs9908870 rs9895995
> > > > rs56192520      0.903     0.268     0.327     0.327     0.327     0.582
> > > > rs3764410       0.928     0.276     0.336     0.336     0.336     0.598
> > > > rs145984817     0.975     0.309     0.371     0.371     0.371     0.638
> > > > rs1807401       0.975     0.309     0.371     0.371     0.371     0.638
> > > > rs1807402       0.975     0.309     0.371     0.371     0.371     0.638
> > > > rs35350506      0.975     0.309     0.371     0.371     0.371     0.638
> > > >
> > > > > dim(calc.rho)
> > > > [1] 246 246
> > > >
> > > > I would like to remove from this data all highly correlated variables,
> > > > with correlation more than 0.8
> > > >
> > > > I tried this:
> > > >
> > > > > data<- calc.rho[,!apply(calc.rho,2,function(x) any(abs(x) > 0.80))]
> > > > > dim(data)
> > > > [1] 246   0
> > > >
> > > > > But this removes everything.
> > > > >
> > > > > Can you please advise?
> > > > >
> > > > > Thanks
> > > > > Ana
> > > >
> > > > ______________________________________________
> > > > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > > and provide commented, minimal, self-contained, reproducible code.