[R] Help with factor levels and reference level

Bert Gunter gunter.berton at gene.com
Sat Jun 7 07:40:07 CEST 2014


... But I think the OP should consult his/her local statistician and
not do this.

1. Categorizing a continuous variable generally loses information.
While there may be times when this may be appropriate, generally
continuous variables should be modeled as such.

2. You also lose the ordering with the categorization to letters,
probably a further loss of potentially useful information. At the
least, it should probably be modeled as an ordered factor (?ordered),
which is not what  has been done.

Just because one can easily do this does not mean that one should.

... with the obvious caveat that I am just guessing, having no
knowledge of the context.

Cheers,
Bert



Bert Gunter
Genentech Nonclinical Biostatistics
(650) 467-7374

"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
Clifford Stoll




On Fri, Jun 6, 2014 at 8:35 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>
> On Jun 6, 2014, at 11:16 AM, Nwinters wrote:
>
>> I have a variable coded in Stata as follows:
>> **
>> *gen sat_pm25cat_=.
>> replace sat_pm25cat_= 1 if (sat_pm25>=4 & sat_pm25<=7.1 & sat_pm25!=.)
>> replace sat_pm25cat_= 2 if (sat_pm25>=7.1 & sat_pm25<=10)
>> replace sat_pm25cat_= 3 if (sat_pm25>=10.1 & sat_pm25<=11.3)
>> replace sat_pm25cat_= 4 if (sat_pm25>=11.4 & sat_pm25<=12.1)
>> replace sat_pm25cat_= 5 if (sat_pm25>=12.2 & sat_pm25<=17.1)
>
> Apparently Stata handles overlapping definitions somehow. (7.1 items would be ambiguously define.)  I suspect you can duplicate that intended effect with:
>
> sat_pm25cat_  <- findInterval(sat_pm25, c(4, 7.1 ,10, 11.4,12.2, 17.1) )
>
>
>>
>> gen satpm25catR= "A" if sat_pm25cat_==1
>> replace satpm25catR= "B" if sat_pm25cat_==2
>> replace satpm25catR= "C" if sat_pm25cat_==3
>> replace satpm25catR= "D" if sat_pm25cat_==4
>> replace satpm25catR= "E" if sat_pm25cat_==5
>
>
> satpm25catR <- factor( LETTERS[1:5][ sat_pm25cat_ ] )
>
>
>> ***
>>
>> my model for R is:
>> ##
>> *glm.PM25linB <-glm(leuk ~ satpm25catR + sex + ageR, data=leuk,
>> family=binomial, epsilon=1e-15, maxit=1000)*
>> ##
>>
>> In the summary, satpm25catR is being reported as all levels:
>>
>> <http://r.789695.n4.nabble.com/file/n4691823/Screen_Shot_2014-06-06_at_2.png>
>>
>> *What I want is to make "A" the reference level, how do I do this??*
>
> It would be the reference level by default since factors are sorted lexically.
>
> --
> David Winsemius
> Alameda, CA, USA
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



More information about the R-help mailing list