Thanks for the reply. Sorry to be a pain, but could perhaps explain what
you mean by
"you can center each SNP variable at its mean to make the interaction
term uncorrelated with the main effects".
I'm coming from a web based programming background so often statistical
terminology doesn't click with me.
Also, I have never heard of a scores test before but some googling has
turned up the Lagrange multiplier test. Is this the one you mentioned.
Thanks again,
Davy.
On 12 March 2012 20:04, Thomas Lumley wrote:
> On Tue, Mar 13, 2012 at 7:11 AM, Davy wrote:
> > Dear useRs,
> >
> > First off, sorry about the long post. Figured it's better to give context
> > to get good answers (I hope!). Some time ago I wrote an R function that
> > will get all pairwise interactions of variables in a data frame. This
> > worked fine at the time, but now a colleague would like me to do this
> with
> > a much larger dataset. They don't know how many variables they are going
> to
> > have in the end but they are guessing approximately 2,500 - 3,000
> variables
> > with ~2000 observations for each. My function below is way too slow for
> > this (4 minutes for 100 variables). At the bottom of this post I have
> > included some timings for various numbers of variables and total numbers
> of
> > interactions. I have the results of calling Rprof() on the 100 variables
> > run of my function, so If anyone wants to take a look at it let me know.
> I
> > don't want to make a super long post any longer than it needs to be.
> >
> > What I'd like to know is if there is anything I can do to speed this
> > function up. I tried looking at going directly to glm.fit, but as far as
> I
> > understood, for that to be useful the computation of the design matrices
> > and all of that other stuff that I frankly don't understand, needs to be
> > the same for each model, which is not the case for my analysis, although
> > perhaps I am wrong about this. I also took a look at speedglm and it
> seems
> > to be for the analysis of large datasets. I think the problem is that, in
> > effect there are many datasets, so not sure if this package can help.
> >
> > Any ideas on how to make this run faster would be greatly appreciated. I
> am
> > planning on using parallelization to run the analysis in the end but I
> > don't know how many CPU's I am going to have access to but I'd say it
> won't
> > be more than 8. In addition I have been advised that the best way to
> > account for multiple testing here is by permutation test, which in this
> > context becomes almost unfeasible, since I would have to run each
> > interaction ~10,000 times.
> >
> > Thanks in advance, Cheers
> > Davy.
> >
> > getInteractions2 = function(data, fSNPcol, ccCol)
> > {
> > #fSNPcol is the number of the column that contains the first SNP
> > #ccCol is the number of the column that contains the outcome variable
> >
> > require(lmtest)
> >
> > a = data.frame()
> >
> > snps = names(data)[-1:-(fSNPcol-1)]
> >
> > names(data)[ccCol] = "PHENOTYPE"
> >
> > terms = as.data.frame(t(combn(snps,2)))
> >
> > attach(data)
> >
> > fit1 = c()
> >
> > fit2 = c()
> >
> > pval = c()
> >
> > for(i in 1:length(terms$V1))
> >
> > {
> >
> > fit1 =
> glm(PHENOTYPE~get(as.character(terms$V1[i]))+get(as.character(terms$V2[i])),family="binomial")
> >
> > fit2 =
> glm(PHENOTYPE~get(as.character(terms$V1[i]))+get(as.character(terms$V2[i]))+I(get(as.character(terms$V1[i]))*get(as.character(terms$V2[i]))),family="binomial")
> >
> > a = lrtest(fit1, fit2)
> >
> > pval = c(pval, a[2,"Pr(>Chisq)"])
> >
> > }
> >
> > detach(data)
> >
> > results = cbind(terms,pval)
> >
> > return(results)
> > }
> >
> > In the table below is the system.time results for increasing numbers of
> > variables being passed through the function. n is the number of
> variables,
> > and Ints is the number of pair-wise interactions given by that number of
> > variables.
> >
> > n Ints user.self sys.self elapsed
> >
> > time 10 45 1.20 0.00 1.30
> >
> > time 15 105 3.40 0.00 3.43
> >
> > time 20 190 6.62 0.00 6.85
> > ...
> >
> > time 90 4005 178.04 0.07 195.84
> >
> > time 95 4465 199.97 0.13 218.30
> >
> > time 100 4950 221.15 0.08 242.18
> >
> > Some code to reproduce a data frame in case you want to look at timings
> or
> > the Rprof() results. Please don't run this unless your machine is super
> > fast, or your prepared to wait for about 15-20 minutes.
> >
> > df = data.frame(paste("sid",1:2000,sep=""),rbinom(2000,1,.5))
> >
> > gtypes = matrix(nrow=2000, ncol=3000)
> >
> > gtypes = apply(gtypes,2,function(x){x=sample(0:2, 2000, replace=T);x})
> >
> > snps = paste("rs", 1000:3999,sep="")
> >
> > df = cbind(df,gtypes)
> >
> > names(df) = c("sid", "status", snps)
> >
> > times = c()
> > for(i in seq(10,100, by=5)){
> >
> > if(i==100){Rprof()}
> >
> > time = system.time((pvals = getInteractions2(df[,1:i], 3, 2)))
> >
> > print(time)
> >
> > times = rbind(times, time)
> >
> > if(i==100){Rprof(NULL)}
> > }
> >
> > numI = function(n){return(((n^2)-n)/2)}
> >
> > timings = cbind(seq(10,100,by=5), sapply(seq(10,100,by=5), numI),times)
> >
>
> The first step is to use glm.fit rather than glm. This leads to
> simpler as well as faster code in your context -- you feed it a design
> matrix rather than a formula. That's probably a factor of two or so.
>
> Using score tests rather than likelihood ratio tests would also save
> some time, but again only a factor of two or so.
>
>
> It looks as though you have SNP data. If the SNPs are not in linkage
> disequilibrium and the main effects are fairly small, you can center
> each SNP variable at its mean to make the interaction term
> uncorrelated with the main effects, and prescreen by just testing for
> correlation between the interaction variable and the phenotype. Then
> go back and fit the models for the few pairs that look promising.
>
> Note that in general you *cannot* do a permutation test to handle
> multiple testing for interactions
> [http://www.ncbi.nlm.nih.gov/pubmed/20384625]. In order for the
> permutation null hypothesis to be true you need additional structure,
> like zero main effects. You can do a parametric bootstrap, which is
> similar. Or just choose a threshold based on expected number of
> false positives rather than trying to get strong control of Type I
> error.
>
> -thomas
>
> --
> Thomas Lumley
> Professor of Biostatistics
> University of Auckland
>
--
David Kavanagh
Nephrology Research, Centre of Public Health, Queen's University Belfast, A
floor, Tower Block,
City Hospital, Lisburn Road, BT9 7AB, Belfast, United Kingdom
[[alternative HTML version deleted]]