[R] [FORGED] Regression with factors ?
David Winsemius
dwinsemius at comcast.net
Wed Jul 13 18:56:44 CEST 2016
> On Jul 13, 2016, at 8:01 AM, David Winsemius <dwinsemius at comcast.net> wrote:
>
>
>> On Jul 13, 2016, at 6:48 AM, stn021 <stn021 at gmail.com> wrote:
>>
>> Hello,
>>
>> so here a numerical example in R-code. Code is appended below.
>>
>> The output should be
>> 1) the numerical values of the abilities of the persons
>> 2) the multiplyer
>>
>>
>> Please note that
>>
>> 1) I have used non-linear optimization to solve this problem and got
>> the expected result, though not with R but other software.
>>
>> 2) I have applied lm() to this problem, even before I posted the
>> question. I am well aware of the syntax of formulas. I my last posting
>> I wrote the formula "freehand" so I made the previously mentioned
>> errors. Sorry about that.
>>
>>
>>
>> Unfortunately the formulas with I() as well as multiplying variables
>> before running R does not work here. I() does not apply to factors (R
>> tells me) and multiplying in advance also works only for continuous
>> variables, not for factors, because there is no known numerical value
>> to multiply.
>>
>> The latter is actually what my question is about, along with the
>> question on how to get R to treat two columns as two instances of the
>> same factor.
>>
>>
>> Just to be sure I used R to check if the data really counts as a
>> factor according to R-terminology. It really is a factor, see code
>> below.
>>
>>
>>
>> This is the code for generating the example-data:
>>
>> # --------------------------------------------------------------- #
>> pnames = c( "alice" , "bob" , "charlie" , "don" , "eve" , "freddy"
>> , "grace" , "henry" )
>> pcount = length( pnames )
>>
>> # abilities = runif( pcount )
>> abilities = (1:pcount) / 10
>>
>> persons = data.frame( name = pnames , ability = abilities )
>> persons
>>
>> # random subset of possible combinations and extra df
>> combinations = combn( nrow( persons ) , 2 ) ;
>> combinations = cbind( combinations,combinations,combinations,combinations )
>> combinations = combinations[ , runif(ncol(combinations))<0.5 ]
>> ccount = ncol( combinations )
>>
>> observed_data = data.frame(
>> idx1 = combinations[1,]
>> , idx2 = combinations[2,]
>> , p1 = ( persons$name[ combinations[1,] ] )
>> , p2 = ( persons$name[ combinations[2,] ] )
>> )
>>
>> abilities_data = data.frame(
>> a1 = persons$ability[ combinations[1,] ]
>> , a2 = persons$ability[ combinations[2,] ]
>> )
>>
>> # y = result of cooperation of each pair
>> multiplyer = runif(1) + 1
>> offset = 1
>> cat( "multiplyer = " , multiplyer , "\n" )
>> cat( "offset = " , offset , "\n" )
>>
>> y0 = multiplyer * ( offset - ( abilities_data$a1 - abilities_data$a2 ) ^ 2 )
>> noise = .05 * rnorm( ccount )
>>
>> # check variables are really factors :
>> str( observed_data$p1 )
>> dput( observed_data$p1 )
>>
>> observed_data = data.frame( y = round( y0+noise,3 ) , observed_data )
>> observed_data
>>
>> # --------------------------------------------------------------- #
>
> Is this what is intended?
>
>> observed_data$p1ab <- persons$ability[ match(observed_data$p1, persons$name) ]
>> observed_data$p2ab <- persons$ability[ match(observed_data$p2, persons$name) ]
>> head(observed_data)
> y idx1 idx2 p1 p2 p1ab p2ab
> 1 1.149 1 6 alice freddy 0.1 0.6
> 2 1.006 1 7 alice grace 0.1 0.7
> 3 1.529 2 3 bob charlie 0.2 0.3
> 4 1.404 2 5 bob eve 0.2 0.5
> 5 1.205 2 6 bob freddy 0.2 0.6
> 6 1.187 2 7 bob grace 0.2 0.7
>
>
>> lm( y ~ I( (p1ab -p2ab)^2 ), data=observed_data)
>
> Call:
> lm(formula = y ~ I((p1ab - p2ab)^2), data = observed_data)
>
> Coefficients:
> (Intercept) I((p1ab - p2ab)^2)
> 1.506 -1.435
>
>> separate_term <- lm( y ~ I( (p1ab -p2ab)^2 ), data=observed_data)
>> summary(separate_term)
>
> Call:
> lm(formula = y ~ I((p1ab - p2ab)^2), data = observed_data)
>
> Residuals:
> Min 1Q Median 3Q Max
> -0.116249 -0.030996 0.002633 0.032765 0.136282
>
> Coefficients:
> Estimate Std. Error t value Pr(>|t|)
> (Intercept) 1.50589 0.01067 141.08 <2e-16 ***
> I((p1ab - p2ab)^2) -1.43527 0.05863 -24.48 <2e-16 ***
> ---
> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> Residual standard error: 0.05304 on 44 degrees of freedom
> Multiple R-squared: 0.9316, Adjusted R-squared: 0.93
> F-statistic: 599.2 on 1 and 44 DF, p-value: < 2.2e-16
>
> You could also have compared 2 models differing only with rest to the includion of an interaction term that was the squared difference in abilities:
>
>> full <- lm( y ~ p1ab + p2ab + I( (p1ab -p2ab)^2 ), data=observed_data)
>> reduced <- lm( y ~ p1ab + p2ab , data=observed_data)
>> anova(full,reduced)
> Analysis of Variance Table
>
> Model 1: y ~ p1ab + p2ab + I((p1ab - p2ab)^2)
> Model 2: y ~ p1ab + p2ab
> Res.Df RSS Df Sum of Sq F Pr(>F)
> 1 42 0.11823
> 2 43 0.17315 -1 -0.05492 19.509 6.892e-05 ***
I also tested whether "nesting" the I( ) calls would succeed in giving you any results from what I thought was a rather odd construction
>>>> y = f * ( o - ( p1 - p2 )^2 )
> separate_term <- lm( y ~ I( (p1ab -p2ab)^2 ), data=observed_data)
> separate_term
Call:
lm(formula = y ~ I((p1ab - p2ab)^2), data = observed_data)
Coefficients:
(Intercept) I((p1ab - p2ab)^2)
1.506 -1.435
> separate_term2 <- lm( y ~ I( 1- I( (p1ab -p2ab)^2 )), data=observed_data)
> separate_term2
Call:
lm(formula = y ~ I(1 - I((p1ab - p2ab)^2)), data = observed_data)
Coefficients:
(Intercept) I(1 - I((p1ab - p2ab)^2))
0.07062 1.43527
The sign of the "interaction term" just gets reversed. I don't see that any offset term got calculated for the "1" term. Offsets can be included but I think this particular form where you are asking for two different coefficients, they might not be separable in the linear model framework.
--
david.
>
>
> --
> David
>
>>
>>
>> 2016-07-11 19:16 GMT+02:00 Jeff Newmiller <jdnewmil at dcn.davis.ca.us>:
>>> Your clarification is promising. A reproducible example is always preferred, though never a guarantee. I expect to be somewhat preoccupied this week so responses may be rather delayed, but the less setup we have to the more likely that someone on the list will tackle it.
>>>
>>> Re an answer: If you can make the example simple enough that you can tell us what the right numerical result will be, we will have a better chance of understanding what you are after. E.g. if you start with a solution and use it to create sample input data with then you don't need to actually solve it to illustrate what you are after. [1]
>>>
>>> Note that I am not aware of any package dedicated to this type of problem, so unless someone else responds otherwise then you will likely have to use bootstrapping or your own statistical analysis (Bayesian?) of the result.
>>>
>>> [1] http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
>>> --
>>> Sent from my phone. Please excuse my brevity.
>>>
>>> On July 11, 2016 7:28:41 AM PDT, stn021 <stn021 at gmail.com> wrote:
>>>> Hello,
>>>>
>>>> thank you for the replies. Sorry about the html-email, I forgot.
>>>> Should be OK with this email.
>>>>
>>>>
>>>> Don't be fooled be the apparent simplicity of the problem. I have
>>>> tried to reduce it to only a single relatively simple question.
>>>>
>>>> The idea here is to model cooperation of two persons. The model is
>>>> about one specific aspect of that cooperation, namely that two persons
>>>> with similar abilities may be able to produce better results that two
>>>> very different persons.
>>>>
>>>> That is only one part of the model with other parts modeling for
>>>> example the fact that of course two persons with a higher degree of
>>>> ability will produce better results per se.
>>>>
>>>>
>>>> It is not classic regression with factors. That can be easily done by
>>>> something like lm( y ~ (p1-p2)^2 ).
>>>>
>>>> This expands to lm( y ~ p1^2 - 2*p1*p2 + p2^2 ). This contains a
>>>> multiplicagtions and for lm() this implies interactions between the
>>>> factor-levels and produces one parameter for each combination of
>>>> factor-levels that occurs in the data. That is not what the question
>>>> is about.
>>>>
>>>> Also p1 and p2 are different levels of the same factor, while for lm()
>>>> it would be two different factors with different levels.
>>>>
>>>>
>>>> As for the sensical part: this has a real world application therefore
>>>> it makes sense.
>>>>
>>>> Also it is not so difficult to solve with non-linear optimization. I
>>>> was hoping to be able to use R for that purpose because then the
>>>> results could easily be checked with statistical tests.
>>>>
>>>> So my question is not "how to solve" but "how to solve with R".
>>>>
>>>>
>>>> As for the excess degrees of freedom, in real observations there would
>>>> of course be added noise due to either random variations or factors
>>>> not included in the model. So to generate a more reality-conforming
>>>> example I could add some random normal-distributed noise to the
>>>> dependent variable y. I previously left that part out because to me it
>>>> did not seem relevant.
>>>>
>>>>
>>>> Would you like me to make a complete example dataset with more records
>>>> and noise ?
>>>>
>>>>
>>>> The answer I look for would be the numerical values of the
>>>> factor-levels and numerical values for the multiplier (f) and the
>>>> offset (o), with p1 and p2 given as names (here: persons) and y given
>>>> as some level of achievement they reach by cooperating.
>>>>
>>>> y = f * ( o - ( p1 - p2 )^2 )
>>>>
>>>> Is that what you meant by "answer" ?
>>>>
>>>>
>>>> THX
>>>> stefan
>>>>
>>>>
>>>>
>>>>
>>>> 2016-07-10 2:27 GMT+02:00 Jeff Newmiller <jdnewmil at dcn.davis.ca.us>:
>>>>>
>>>>> I have seen less sensical questions.
>>>>>
>>>>> It would be nice if the example were a bit more complete (as in it
>>>> should have excess degrees of freedom and an answer) and less like a
>>>> homework problem (which are off topic here). It would of course also be
>>>> helpful if the OP were to conform to the Posting Guide, particularly in
>>>> respect to using plain text email.
>>>>>
>>>>> It looks like the kind of nonlinear optimization problem that
>>>> evolutionary algorithms are often applied to. It doesn't look (to me)
>>>> like a typical problem that factors get applied to in formulas though,
>>>> because multiple instances of the same factor variable are present.
>>>>> --
>>>>> Sent from my phone. Please excuse my brevity.
>>>>>
>>>>> On July 9, 2016 4:59:30 PM PDT, Rolf Turner <r.turner at auckland.ac.nz>
>>>> wrote:
>>>>>> On 09/07/16 20:52, stn021 wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> I would like to analyse a model like this:
>>>>>>>
>>>>>>> y = 1 * ( 1 - ( x1 - x2 ) ^ 2 )
>>>>>>>
>>>>>>> x1 and x2 are not continuous variables but factors, so the
>>>>>> observation
>>>>>>> contain the level.
>>>>>>> Its numerical value is unknown and is to be estimated with the
>>>> model.
>>>>>>>
>>>>>>>
>>>>>>> The observations look like this:
>>>>>>>
>>>>>>> y x1 x2
>>>>>>> 0.96 Alice Bob
>>>>>>> 0.84 Alice Charlie
>>>>>>> 0.96 Bob Charlie
>>>>>>> 0.64 Dave Alice
>>>>>>> etc.
>>>>>>>
>>>>>>> Each person has a numerical value. Here for example Alice = 0.2
>>>> and
>>>>>> Bob =
>>>>>>> 0.4
>>>>>>>
>>>>>>> Then y = 0.96 = 1* ( 1- ( 0.2-0.4 ) ^ 2 ) , see first observation.
>>>>>>>
>>>>>>> How can this be done in R ?
>>>>>>
>>>>>>
>>>>>> This question makes about as little sense as it is possible to
>>>> imagine.
>>>>>>
>>>>>> cheers,
>>>>>>
>>>>>> Rolf Turner
>>>>>
>>>
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
> David Winsemius
> Alameda, CA, USA
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
David Winsemius
Alameda, CA, USA
More information about the R-help
mailing list