[R] two cols in a data frame are the same factor

Andres Legarra legarra at gmail.com
Thu Mar 20 09:25:00 CET 2008


Hi,
I am afraid you misunderstood. I do not have repeated records; rather,
for every record I have two simultaneously present, possibly different,
instantiations of an explanatory variable.

My data is as follows :

yield haplo1 haplo2
100  A B
151  B A
212  A A

So I have one effect (haplo), but two copies of it affect each "yield".
If I use lm() I get:
> a=data.frame(yield=c(100,151,212),haplo1=c("A","B","A"),haplo2=c("B","A","A"))
> lm(yield ~ -1 + haplo1 + haplo2, data = a)

Call:
lm(formula = yield ~ -1 + haplo1 + haplo2, data = a)

Coefficients:
haplo1A  haplo1B  haplo2B
    212      151     -112


But I get different coefficients for the two "A"s (in fact one was set
to 0) and the two "B"s. That is, the model has four unknowns, but in
my example I have just two!
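
A minimal sketch of where the extra unknowns come from, assuming the same
data frame a as above (stringsAsFactors = TRUE is added only so the code
behaves the same across R versions). lm() builds separate dummy columns
for each factor column, so nothing ties the haplo1 and haplo2 effects
together:

```r
a <- data.frame(yield  = c(100, 151, 212),
                haplo1 = c("A", "B", "A"),
                haplo2 = c("B", "A", "A"),
                stringsAsFactors = TRUE)

# Design matrix that lm() builds for yield ~ -1 + haplo1 + haplo2
mm <- model.matrix(~ -1 + haplo1 + haplo2, data = a)

# One column per level of haplo1, plus the non-reference level of haplo2
colnames(mm)   # "haplo1A" "haplo1B" "haplo2B"
```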

A least-squares solution is simple to do by hand:

> X=matrix(c(1,1,2,1,1,0),ncol=2) #the incidence matrix: copies of A and B per record
> X
     [,1] [,2]
[1,]    1    1
[2,]    1    1
[3,]    2    0
> solve(crossprod(X,X),crossprod(X,a$yield))
      [,1]
[1,] 106.0
[2,]  19.5

where [1,] is the solution for A and [2,] is the solution for B.

This is not difficult to do by hand, but this is only a simple case, and
I lose all the machinery in lm().
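
One possible way to keep the lm() machinery is to build the copy-count
incidence matrix from the two columns and pass it to lm() as a matrix
term. This is only a sketch under assumptions: copy_counts() is a
hypothetical helper written for this example, not an existing R function,
and both columns are assumed to share one set of haplotype labels.

```r
a <- data.frame(yield  = c(100, 151, 212),
                haplo1 = c("A", "B", "A"),
                haplo2 = c("B", "A", "A"))

# Common set of levels, so both columns are coded identically
lev <- sort(unique(c(as.character(a$haplo1), as.character(a$haplo2))))

# Hypothetical helper: one column per level; each entry is the number
# of copies (0, 1 or 2) of that level carried by the record
copy_counts <- function(h1, h2, lev) {
  one_hot <- function(h) outer(as.character(h), lev, "==") * 1
  X <- one_hot(h1) + one_hot(h2)
  colnames(X) <- lev
  X
}

X <- copy_counts(a$haplo1, a$haplo2, lev)

# No intercept: one coefficient per haplotype, shared by both columns
fit <- lm(yield ~ 0 + X, data = a)
coef(fit)   # named XA and XB
```

Because X counts copies, every row sums to 2, so an intercept would be
collinear with the columns of X; either drop the intercept as here or
remove one level. The usual tools (summary(), anova(), predict()) then
work on fit as on any other lm object.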

Thank you
Andres

On Wed, Mar 19, 2008 at 6:57 PM, Michael Dewey <info at aghmed.fsnet.co.uk> wrote:
> At 09:11 18/03/2008, Andres Legarra wrote:
>  >Dear all,
>  >I have a data set (QTL detection) where I have two cols of factors in
>  >the data frame that correspond logically (in my model) to the same
>  >factor. In fact these are haplotype classes.
>  >Another real-life example would be family gas consumption as a
>  >function of car company (e.g. Ford, GM, and Honda), assuming 2 cars
>  >per family.
>
>  Unless I completely misunderstand this it looks like you have the
>  dataset in wide format when you really wanted it in long format (to
>  use the terminology of ?reshape). Then you would fit a model allowing
>  for the clustering by family.
>
>
>
>
>  >An artificial example follows:
>  >set.seed(1234)
>  >L3 <- LETTERS[1:3]
>  >(d <- data.frame(y = rnorm(10), fac = sample(L3, 10, replace = TRUE),
>  >                 fac1 = sample(L3, 10, replace = TRUE)))
>  >
>  >  lm(y ~ fac+fac1,data=d)
>  >
>  >and I get:
>  >
>  >Coefficients:
>  >(Intercept)         facB         facC        fac1B        fac1C
>  >      0.3612      -0.9359      -0.2004      -2.1376      -0.5438
>  >
>  >However, to respect my model, I need to constrain effects in fac and
>  >fac1 to be the same, i.e. facB=fac1B and facC=fac1C. There are
>  >logically just 4 unknowns (average,A,B,C).
>  >With continuous covariates one might do y ~ I(cov1+cov2), but this is
>  >not the case.
>  >
>  >Is there any trick to do that?
>  >Thanks,
>  >
>  >Andres Legarra
>  >INRA-SAGA
>  >Toulouse, France
>
>  Michael Dewey
>  http://www.aghmed.fsnet.co.uk
>
>


