[R] Questions about Probit Analysis

Sun Oct 31 20:14:00 CET 2010

Dear All,
I have some questions about probit regressions.
I saw a nice introduction at

http://bit.ly/bU9xL5

and I mainly have two questions.

(1) The first is mostly about data manipulation. Consider the following
snippet:

##################################################

mydata <- read.csv(url("http://www.ats.ucla.edu/stat/r/dae/binary.csv"))
names(mydata) <- c("outcome","x1","x2","x3")

# probit regression of the binary outcome on all three regressors
myprobit <- glm(outcome ~ x1 + x2 + as.factor(x3),
                family = binomial(link = "probit"), data = mydata)

print(summary(myprobit))

# now regress the outcome on x1 only

myprobit2 <- glm(outcome ~ x1, family = binomial(link = "probit"),
                 data = mydata)

print(summary(myprobit2))

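# express the data in terms of counts: rows are x1 values, columns are
# the outcomes 0 and 1

md <- t(table(mydata$outcome, mydata$x1))

# create a new data frame with one row per distinct x1 value

mydatanew <- data.frame(x1 = as.numeric(row.names(md)))

mydatanew$successes <- as.numeric(md[, 2])  # number of outcomes equal to 1

mydatanew$failures <- as.numeric(md[, 1])   # number of outcomes equal to 0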

########################################################################

where I first carry out a probit regression of the binary outcome (i.e.
taking only 0/1 as values) on 3 regressors, then simply regress the
outcome on the x1 variable alone.

Finally, I generate the data frame mydatanew (see some of its entries below)

 > mydatanew
     x1 successes failures
1  220         0        1
2  300         1        2
3  340         1        3
4  360         0        4
5  380         0        8
[...................]

where for every value of x1 I count the number of 0 and 1 outcomes
(namely the number of failures and the number of successes). This is
equivalent to having the full list of x1 values with an associated 0/1
outcome (I have simply counted them), hence it is all the information I
need to perform the probit regression of the binary outcome on x1 again,
but the data format is now different. How can I actually feed mydatanew
to R to repeat the probit regression on x1 only?
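
To make the equivalence concrete, here is a minimal sketch of what I have in mind. It assumes that glm() accepts a two-column response of the form cbind(successes, failures) for the binomial family, and it uses synthetic data (not the UCLA file) so that it runs on its own. The names fit_binary and fit_counts are mine, chosen only for illustration.

```r
# Synthetic stand-in for mydata (assumption: NOT the UCLA data set)
set.seed(1)
x1 <- sample(seq(220, 800, by = 20), 400, replace = TRUE)
outcome <- rbinom(400, 1, pnorm(-2 + 0.003 * x1))
mydata <- data.frame(outcome = outcome, x1 = x1)

# Probit fit on the individual 0/1 records
fit_binary <- glm(outcome ~ x1, family = binomial(link = "probit"),
                  data = mydata)

# Same counting step as in the snippet above
md <- t(table(mydata$outcome, mydata$x1))
mydatanew <- data.frame(x1 = as.numeric(row.names(md)),
                        successes = as.numeric(md[, 2]),
                        failures = as.numeric(md[, 1]))

# Probit fit on the counts: the response is a two-column matrix
# whose columns are (number of successes, number of failures)
fit_counts <- glm(cbind(successes, failures) ~ x1,
                  family = binomial(link = "probit"), data = mydatanew)

# Compare the estimated coefficients side by side
print(cbind(binary = coef(fit_binary), counts = coef(fit_counts)))
```

If this is right, the two fits should give the same coefficients up to numerical tolerance, while the residual deviance is expected to differ because the two data formats imply different saturated models.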

(2) This is a bit more conceptual. Suppose you have a set of products
A, B, C, D, E, F. Each product has a list of features: x_A for product A,
x_B for B, etc.
Each customer has his or her own set of attributes (age, sex, income,
etc.), which I call x_cust. Finally, the customer is presented with two
products (e.g. A and D; combinations may vary, and I call each
combination of two products a scenario) and asked which one he or she
would like to buy. Bottom line: the data are in the format

1 x_A x_cust
0 x_D x_cust

meaning that a certain customer chose product A against product D; similarly

1 x_C x_cust
0 x_B x_cust

would mean that the customer choosing between C and B finally selected
C. Every customer needs to choose a product in a variety of different
scenarios. How would you analyze this kind of data? Is there any way I
can express, in my probit analysis, the fact that my binary outcome (buy
this product or not) always arises from the comparison of two products
only (customers are never given a choice between more than two products
in a given scenario)? Or should I simply run my probit regression on
my 0/1 outcome without any extra worry (as in the snippet above)?
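
To illustrate what I mean by taking the pairing into account, here is one possible sketch, not necessarily the right model. It collapses each scenario to a single row and regresses "was the first product chosen?" on the difference of the two products' features, with no intercept. Everything in it is hypothetical: the features price and quality, their values, the coefficients used to simulate the choices, and the variable names. Customer attributes are left out because anything constant within a scenario cancels in the difference unless it is interacted with the product features.

```r
set.seed(2)

# Hypothetical features for products A..F (assumption: two numeric features)
products <- data.frame(price   = c(1.0, 1.5, 2.0, 2.5, 3.0, 3.5),
                       quality = c(0.2, 0.9, 0.4, 1.5, 1.0, 1.8),
                       row.names = c("A", "B", "C", "D", "E", "F"))

# Each scenario shows two distinct products: 'first' and 'second'
n <- 2000
first <- sample(rownames(products), n, replace = TRUE)
second <- unname(vapply(first,
                        function(p) sample(setdiff(rownames(products), p), 1),
                        character(1)))

# Feature differences (first product minus second product)
scen <- data.frame(d_price   = products[first, "price"] -
                               products[second, "price"],
                   d_quality = products[first, "quality"] -
                               products[second, "quality"])

# Simulated choices: 1 if the first product is chosen, 0 otherwise
# (the coefficients -0.8 and 1.2 are arbitrary, for illustration only)
scen$first_chosen <- rbinom(n, 1,
                            pnorm(-0.8 * scen$d_price + 1.2 * scen$d_quality))

# Probit on the differences, without intercept so that swapping the
# order of the two products just flips the outcome
fit_pair <- glm(first_chosen ~ 0 + d_price + d_quality,
                family = binomial(link = "probit"), data = scen)
print(summary(fit_pair))
```

Note that this sketch treats all scenarios as independent, whereas in my data the same customer answers several scenarios, so it ignores any correlation between the choices of one customer.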

Many thanks,

Lorenzo