[R] generalized linear regression - function glm - dismissed predictors - more information about simulated data

Thu Nov 18 20:55:03 CET 2010

... but do note the first line of the DESCRIPTION in the Help file:

"Compute the rank of matrix, a well-defined functional in theory,
somewhat ambigous in practice."

So as the Help file makes very clear, what you get can depend on how
you do it; and may be different than what you want; which will depend
on what the scientific context is.

In the present case, collinearity of predictors may not do any harm if
you are only concerned about predicting over the region (hypervolume)
of those predictors within the data at hand.  Unfortunately, this is
not usually the case, and the inherent ambiguiity can result in
poor(and biased) predictions outside the current predictor region.
Note that there is NO WAY to resolve this statistically -- you must
either get more data to disambiguate the confounded predictors or you
must bring in information (theory, prior results) from outside the
data.

Cheers,
Bert

P.S. Your results look fishy, even for a simulation. The extremely low
standard errors and rank deficiencies strongly suggest to me that your
methodology is faulty and that you are getting a bunch of baloney. I
would recommend seeking help from your local statistician.

On Thu, Nov 18, 2010 at 11:23 AM, David Winsemius
<dwinsemius at comcast.net> wrote:
>
> On Nov 18, 2010, at 1:31 PM, Christine SINOQUET wrote:
>
>> Hello,
>>
>> a) Thanks a lot for the answer relative to the dismissed coefficients.
>> However, the result below has been obtained in the R console and there was
>> no warning (please, see the full display below (***) if you wish).
>
> I guess my memory on that expectation was faulty.
>
>>
>>
>> "yi ~ X$V1 + X$V2 + X$V3 + X$V4 + X$V5 + X$V6 + X$V7 + X$V8 + X$V9 +
>> X$V10"
>>              Estimate   Std. Error       t value Pr(>|t|)
>> (Intercept)  0.018062134 5.624517e-17  3.211322e+14        0
>> X$V1        -0.011104627 3.084989e-17 -3.599567e+14        0
>> X$V2        -0.018536614 3.241635e-17 -5.718291e+14        0
>> X$V3         0.107341157 4.884358e-17  2.197651e+15        0
>> X$V4         0.003258493 3.286878e-17  9.913643e+13        0
>> X$V6        -0.051599957 4.203840e-17 -1.227448e+15        0
>> X$V8        -0.057207683 3.049835e-17 -1.875763e+15        0
>> X$V10        0.134643229 3.849911e-17  3.497308e+15        0
>>
>> b) Would it be wise to check if the predictors are colinear, prior to
>> running the glm function ? How could it be performed, using the R language?
>
> require(Matrix)
> rr <- rankMatrix(as.matrix(X))
> rr < length(X)
>
> --
> David.
>
>>
>> In additoon, in the present case, I would like to check for colinearity,
>> because I am puzzled by the absence of warning inthe R console.
>>
>> c) I now answer the question relative to the very low std errors and the
>> way I generated the simulated data :
>>
>> I ran the following code :
>>
>> *****************
>> drawRegressionCoefficientsAndUpdateY <- function (target,
>>                                                 marks,
>>                                                 nbInd){
>> # X (individuals x marks)
>> # y (nbTargets x individuals)
>>
>>
>> l <- length(marks)
>> sigma <- runif(1,0.03,0.08)
>> # Values close to 0 are excluded from c0 simulation interval so that the
>> effect of the predictors are detectable.
>> c01    <- runif(1,-2,-0.5)
>> c02    <- runif(1,0.5,2)
>> i <- runif(1,1:2)
>> if (i==1) c0 <- c01 else c0 <- c02
>>
>> coefficients <- c()
>> for(r in 1:l){
>>  coefficient <- runif(1,0.2+sigma,1+sigma)
>>  coefficients <- c(coefficients, coefficient)
>> }
>>
>> // TILL THAT POINT, I RELIED ON THE PROCESS DESCRIBED IN A THESIS.
>> // THERE WAS NO INDICATION RELATIVE TO THE "ADJUSTMENT" OF EPSILON.
>> epsilon <- rnorm(1, mean = 0, sd = 0.002)
>> for (ind in 1:nbInd){
>>  y[target,ind]<<- c0 + coefficients %*% X[ind,SNPs] + epsilon
>> }
>>
>> Thank you in advance for your kind help.
>>
>> C.S.
>> ------------------------------------------------
>> ------------------------------------------------
>>
>> "yi ~ X$V1 + X$V2 + X$V3 + X$V4 + X$V5 + X$V6 + X$V7 + X$V8 + X$V9 +
>> X$V10"
>>              Estimate   Std. Error       t value Pr(>|t|)
>> (Intercept)  0.018062134 5.624517e-17  3.211322e+14        0
>> X$V1        -0.011104627 3.084989e-17 -3.599567e+14        0
>> X$V2        -0.018536614 3.241635e-17 -5.718291e+14        0
>> X$V3         0.107341157 4.884358e-17  2.197651e+15        0
>> X$V4         0.003258493 3.286878e-17  9.913643e+13        0
>> X$V6        -0.051599957 4.203840e-17 -1.227448e+15        0
>> X$V8        -0.057207683 3.049835e-17 -1.875763e+15        0
>> X$V10        0.134643229 3.849911e-17  3.497308e+15        0
>>
>>
>> I am sure to have regressed the right number of variables, since I check
>> that the formula is correct:
>> "yi ~ X$V1 + X$V2 + X$V3 + X$V4 + X$V5 + X$V6 + X$V7 + X$V8 + X$V9 +
>> X$V10"
>>
>> Could somebody explain to me
>> 1) why there are mismatches between the "true" coefficients for
>> predictors 4 and 6
>> and
>> 2) why there is no information edited for predictors 5, 7 and 9 ?
>>
>> Thanks in advance for your kind help.
>>
>> C.S.
>> ------------------------------------------------
>> ------------------------------------------------
>> (***)
>> Full console display ;
>>
>> > inputoutputdir="/home/sinoquet/recherche/mcmc/output/sim_sat_02_10_10/"
>> >  inputfilesnps="snps.txt"
>> >  inputfilepheno="pheno.txt"
>> >
>> >  X <<- read.table(file=paste(inputoutputdir,inputfilesnps,sep=""),h=F) #
>> > data.frame
>> >  #genotype file (individuals x SNPs); code: 0/1/2 : number of minor
>> > alleles
>> >
>> >  y <<- read.table(file=paste(inputoutputdir,inputfilepheno,sep=""),h=F)
>> > # data.frame
>> >
>> >  #fit <- glm(yi ~ X$V1 + X$V2 + X$V3 + X$V4 + X$V5 + X$V6 + X$V7 + X$V8
>> > + X$V9 + X$V10, family = gaussian)
>> >  #coefficients( summary(fit))
>> >
>> >  #*****************************
>> >  # BEGIN check real solution:
>> >  target <- 1
>> >  inputfilepredictors="predictor_descript.txt"
>> >  f0 <- file(paste(inputoutputdir,inputfilepredictors,sep=""), open =
>> > "r")
>> >  # format of file f0:
>> >  #target <predictors>
>> >  #1 1 2 3 4 5 6 7 8 9 10
>> >  line <- readLines(f0, n = 1) # header
>> >  line <- readLines(f0, n = 1)
>> >  v <- unlist(strsplit(line, split=" "))
>> >  predictorsForThisTarget <- v[-1] # dismiss target number
>> >  print(paste("!",v,"!",sep=" "))
>> [1] "! 1 !"  "! 1 !"  "! 2 !"  "! 3 !"  "! 4 !"  "! 5 !"  "! 6 !"  "! 7 !"
>> [9] "! 8 !"  "! 9 !"  "! 10 !"
>> >
>> >  nbPredictors <- length(predictorsForThisTarget)
>> >  ch<-paste("X$V",predictorsForThisTarget[1],sep="")
>> >  for (i in 2:nbPredictors){
>> +   item<-paste("X$V",predictorsForThisTarget[i],sep="")
>> +   ch<-paste(ch,item,sep=" + ")
>> +  }
>> >
>> >  formulaString <- paste("yi",ch,sep=" ~ ")
>> >  print(formulaString)
>> [1] "yi ~ X$V1 + X$V2 + X$V3 + X$V4 + X$V5 + X$V6 + X$V7 + X$V8 + X$V9 +
>> X$V10"
>> >  formula <- as.formula(formulaString)
>> >  yi <- unlist(y[target,])
>> >  fit <- glm(formula, family = gaussian)
>> >  print("*************BEGIN SOLUTION****************")
>> [1] "*************BEGIN SOLUTION****************"
>> >  print(coefficients(summary(fit)))
>>               Estimate   Std. Error       t value Pr(>|t|)
>> (Intercept)  0.018062134 5.624517e-17  3.211322e+14        0
>> X$V1        -0.011104627 3.084989e-17 -3.599567e+14        0
>> X$V2        -0.018536614 3.241635e-17 -5.718291e+14        0
>> X$V3         0.107341157 4.884358e-17  2.197651e+15        0
>> X$V4         0.003258493 3.286878e-17  9.913643e+13        0
>> X$V6        -0.051599957 4.203840e-17 -1.227448e+15        0
>> X$V8        -0.057207683 3.049835e-17 -1.875763e+15        0
>> X$V10        0.134643229 3.849911e-17  3.497308e+15        0
>> >  print("*************END SOLUTION****************")
>> [1] "*************END SOLUTION****************"
>> >  # END check real solution:
>> >  #*****************************
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
> David Winsemius, MD
> West Hartford, CT
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

-- 
Bert Gunter
Genentech Nonclinical Biostatistics