[R] biserial correlation with pkg polycor

John Fox jfox at mcmaster.ca
Mon Oct 29 16:57:52 CET 2007

```Dear Tom,

I'm afraid that you're confusing the point-biserial correlation with the
biserial correlation. It's the latter that polyserial() in the polycor
package computes; the biserial correlation is a special case when the
(ordered) categorical variable has just two categories.

Generally, biserial correlations will be larger than point-serial
correlations, since the former estimates the correlation in an underlying
bivariate normal distribution that has been binned. I haven't looked closely
at your example, but suspect that the difference in sign is just due to the
arbitrary specification of which category is higher than the other. As for a
reference, ?polyserial gives one.

I hope this helps,
John

--------------------------------
John Fox, Professor
Department of Sociology
McMaster University
Hamilton, Ontario
905-525-9140x23604
http://socserv.mcmaster.ca/jfox
--------------------------------

> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of Tom Willems
> Sent: Monday, October 29, 2007 11:01 AM
> To: r-help at r-project.org
> Subject: [R] biserial correlation with pkg polycor
>
> Dear R-ussers,
>
> While looking for a way to calculate the association between
> a countinuous and a binary variable, i found a procedure
> called point biserial corralation.
> Me, not being a mathematicion, i did my very best to
> understand what it was all about, and then i found a easily
> understandable paper (by steve
> simon) on ow to calculate this. ref ##
> http://www.childrens-mercy.org/stats/definitions/biserial.htm
> polycor package in R.
> Now i'm having troubles with the fact that the polycor pkg
> never gives me the same output as the manuals aplication of
> the formula.
>
> In the example below found, manualy  r(biserial) = 0.49
> between fb an age, and ussing function polyserial (polycor
> pkg)  r(biserial) =-0.8591.
> This is a rather big difference, no due to abriviation or
> flootingpoints.
>
> Is there someone whom is familiar with biserial correlation,
> and the appropriate way to calculate it?
>
> Kind regards,
> Tom.
>
> here is the example, at the end is the R file.
>
> 1e I create the input
>
> > library(abind)
> > library (polycor)
> > ### data input
> > no <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8)
> fb <- c(19,
> > 30, 20, 19, 29, 25, 21, 24, 50, 25, 21, 17, 15, 14, 14, 22,
> 17)
> > ss <- c(14, 41, 18, 11, 16, 24, 18, 21, 37, 17, 10, 16, 22, 12, 14,
> > 12,
> 18)
> > age <-c('elderly', 'elderly', 'elderly', 'elderly', 'elderly',
> 'elderly', 'elderly', 'elderly', 'elderly', 'young', 'young',
> 'young', 'young', 'young', 'young', 'young', 'young')
> >
> > dataset <- data.frame(no,fb,ss,age)
> > dataset <- subset(dataset,select=c(fb:age))
> > nrow(dataset)
> [1] 17
> > data_eld <- subset(dataset,age=='elderly',select=fb)
> > data_young <- subset(dataset,age=='young',select=fb)
> >
>
> here i calculate the R_bis (biserial corelation) manualy
>
> > ### point biserial correlation
> >
> >  fb <- subset(dataset,select=fb)
> >  fb0 <- subset(dataset,age!='elderly',select=fb)
> >  fb1 <- subset(dataset,age=='elderly',select=fb)
> >  meanfb0 <- mean(fb0,na.rm=T)
> >  meanfb1 <- mean(fb1,na.rm=T)
> >  sdfb<- sd(dataset\$fb,na.rm=T)
> >
> >  ss <- subset(dataset, select=ss)
> >  ss0 <- subset(dataset,age!='elderly',select=ss)
> >  ss1 <- subset(dataset,age=='elderly',select=ss)
> >  meanss0 <- mean(ss0,na.rm=T)
> >  meanss1 <- mean(ss1,na.rm=T)
> >  sdss<- sd(dataset\$ss,na.rm=T)
> >  age <- subset(dataset,select=age)
> >   n <- nrow(dataset)
> >
> this is the formula from  ref ##
> http://www.childrens-mercy.org/stats/definitions/biserial.htm
>
> > > R_bis <- function(x,x1,x0,n){  p <- (nrow(x1)/n)
> >+        (((mean(x1,na.rm=T)-mean(x0,na.rm=T))/sd(x,na.rm=T))
> *sqrt(p*(1-p)))  }
>
> this is the corrected formula from  ref ##
> http://en.wikipedia.org/wiki/Point-biserial_correlation_coefficient
> >> R_bis2 <- function(x,x1,x0,n){
> +        ((mean(x1,na.rm=T)-mean(x0,na.rm=T))/sd(x,na.rm=T))
> * (sqrt(
> (nrow(x1)*nrow(x0))/(n*(n-1))))}
> >
> >>  R_bis(fb,fb1,fb0,n)
> >       fb
> >0.4798873
>
> result in paper was 0.49
>
> >>   R_bis2(fb,fb1,fb0,n)
> >       fb
> >0.4946565
>
> equals result in paper  0.49
>
> Then i use the polycor package,
> function hetcor will give all the different correlation ressults
>
>
> >> hetcor(dataset\$fb,dataset\$ss,dataset\$age ,ML=TRUE)
> >
> >Maximum-Likelihood Estimates
> >
> >Correlations/Type of Correlation:
>  >           dataset\$fb dataset.ss dataset.age
> >dataset\$fb           1    Pearson  Polyserial
> >dataset.ss       0.703          1  Polyserial
> >dataset.age    -0.8591    -0.6685           1
> >
> >Standard Errors:
> >           dataset\$fb dataset.ss
> >dataset\$fb
> >dataset.ss      0.1215
> >dataset.age     0.1106     0.2497
> >
> >n = 17
> >
> >P-values for Tests of Bivariate Normality:
> >            dataset\$fb dataset.ss
> >dataset\$fb
> >dataset.ss      0.1782
> >dataset.age     0.4269     0.4034
> > hetcor(dataset,ML=TRUE)
> >
> >Maximum-Likelihood Estimates
> >
> >Correlations/Type of Correlation:
>  >        fb      ss        age
> >fb        1 Pearson Polyserial
> >ss    0.703       1 Polyserial
> >age -0.8591 -0.6685          1
> >
> >Standard Errors:
>  >       fb     ss
> >fb
> >ss  0.1215
> >age 0.1106 0.2497
> >
> >n = 17
> >
> >P-values for Tests of Bivariate Normality:
> >       fb     ss
> >fb
> >ss  0.1782
> >age 0.4269 0.4034
>
> here a quick two step method is ussed to calculate the polyserial
> correlation
>
> > >polyserial(dataset\$fb,dataset\$age)
> >[1] -0.6205737
> >> polyserial(dataset\$fb,dataset\$age, ML=TRUE, std.err=TRUE)
>
> same method  as in hetcor, only for indecated variables
>
> >Polyserial Correlation, ML est. = -0.8591 (0.1106)
> >Test of bivariate normality: Chisquare = 4.91, df = 5, p = 0.4269
> >
> >               1
> >Threshold 0.1811
> >Std.Err.  0.1849
> >
>
>
> > ### for side to side (ss)   incase no 9 is an outlier in
> fb, this will
> not be the case in ss
> >
> >> R_bis(ss,ss1,ss0,n)
> >       ss
> >0.4153681
>
> result in paper was 0.43
>
> >> R_bis2(ss,ss1,ss0,n)
> >       ss
> >0.4281516
>
> equals result in paper  0.43
>
> >> polyserial(dataset\$ss,dataset\$age)
> >[1] -0.5371397
>
> >> polyserial(dataset\$ss,dataset\$age, ML=TRUE, std.err=TRUE)
> >
> >Polyserial Correlation, ML est. = -0.6685 (0.2497)
> >Test of bivariate normality: Chisquare = 5.103, df = 5, p = 0.4034
> >
> >               1
> >Threshold 0.1504
> >Std.Err.  0.2583
> ##############################################################
> ###################
>
> ### R-file
> ##############################################################
> ###################
>
> library(abind)
> library (polycor)
> ### data input
> no <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8)
> fb <- c(19, 30, 20, 19, 29, 25, 21, 24, 50, 25, 21, 17, 15,
> 14, 14, 22,
> 17)
> ss <- c(14, 41, 18, 11, 16, 24, 18, 21, 37, 17, 10, 16, 22,
> 12, 14, 12,
> 18)
> age <-c('elderly', 'elderly', 'elderly', 'elderly',
> 'elderly', 'elderly',
> 'elderly', 'elderly', 'elderly', 'young', 'young', 'young', 'young',
> 'young', 'young', 'young', 'young')
>
> dataset <- data.frame(no,fb,ss,age)
> dataset <- subset(dataset,select=c(fb:age))
> nrow(dataset)
> data_eld <- subset(dataset,age=='elderly',select=fb)
> data_young <- subset(dataset,age=='young',select=fb)
>
> ### point biserial correlation
>
>  fb <- subset(dataset,select=fb)
>  fb0 <- subset(dataset,age!='elderly',select=fb)
>  fb1 <- subset(dataset,age=='elderly',select=fb)
>  meanfb0 <- mean(fb0,na.rm=T)
>  meanfb1 <- mean(fb1,na.rm=T)
>  sdfb<- sd(dataset\$fb,na.rm=T)
>
>  ss <- subset(dataset, select=ss)
>  ss0 <- subset(dataset,age!='elderly',select=ss)
>  ss1 <- subset(dataset,age=='elderly',select=ss)
>  meanss0 <- mean(ss0,na.rm=T)
>  meanss1 <- mean(ss1,na.rm=T)
>  sdss<- sd(dataset\$ss,na.rm=T)
>  age <- subset(dataset,select=age)
>   n <- nrow(dataset)
>
>  R_bis <- function(x,x1,x0,n){  p <- (nrow(x1)/n)
>        (((mean(x1,na.rm=T)-mean(x0,na.rm=T))/sd(x,na.rm=T))
> *sqrt(p*(1-p)))  }
> R_bis2 <- function(x,x1,x0,n){
>        ((mean(x1,na.rm=T)-mean(x0,na.rm=T))/sd(x,na.rm=T))  * (sqrt(
> (nrow(x1)*nrow(x0))/(n*(n-1))))}
>
>  R_bis(fb,fb1,fb0,n)
>   R_bis2(fb,fb1,fb0,n)
>
>
> hetcor(dataset\$fb,dataset\$ss,dataset\$age ,ML=TRUE)
> hetcor(dataset,ML=TRUE)
>
> polyserial(dataset\$fb,dataset\$age)
> polyserial(dataset\$fb,dataset\$age, ML=TRUE, std.err=TRUE)
>
> ### for side to side (ss)   incase no 9 is an outlier in fb,
> this will not
> be the case in ss
>
> R_bis(ss,ss1,ss0,n)
> R_bis2(ss,ss1,ss0,n)
>
> polyserial(dataset\$ss,dataset\$age)
> polyserial(dataset\$ss,dataset\$age, ML=TRUE, std.err=TRUE)
>
> ## END
> ##############################################################
> ############
>
> Willems Tom
>
> E-mail: towil at var.fgov.be
>
>
>
>
> 	[[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help