[RsR] Is it acceptable to develop a model using a fitting procedure and then use a different fitting procedure to predict from it?

Tue Oct 1 09:07:40 CEST 2013

This has *nothing* to do with robust statistics,
so please do *not* post to this dedicated mailing list!

Martin Maechler, ETH Zurich

On Tue, Oct 1, 2013 at 4:15 AM, Agus Camacho <agus.camacho using gmail.com> wrote:
> Dear list, I have the following problem.
>
> I want to predict the value of a variable in nature, call it "ecological
> success(ES)".
>
> For doing that, I made some virtual simulations with which I reached a
> "best model" explaining ecological success. In the simulation, ES is a
> categorical variable, so I use a multinomial model with the glmulti
> function to iteratively find the best combination of factors that explain
> that variable in the simulations.
>
> The model appears in R like:  ES~var1+var2:var3, so I have main terms alone
> and also interactions.
>
> I can now fit this best model using the function "multinom" from nnet
> package and get the coefficients for each term in the model. Something like:
>
> M=multinom(ES~var1+var2:var3,data)
>
> Now, in order to predict the values in nature I would naturally use the
> function predict from the same package and real data to feed the model,
> like:
>
> predict(M, realdata)
>
> However, this gives me categorical values. Is there a statistically
> valid way to obtain a continuous output? This is important because it
> gives me more power to discriminate differences in ES among species.
> However, simulating a continuous ES is nearly impossible for my system.
>
> For example, would it be valid to use a fitting function that assumes that
> ES is continuous at some point in the process (i.e. while selecting the
> best model, or while estimating its coefficients)?
>
> Here is a reproducible example:
>
> ES = as.factor(sample(c("0","1","2"), 100, replace=TRUE,
>                       prob=c(0.1, 0.2, 0.65)))
> var1=  dnorm(1:100, mean = 30, sd = 20, log = FALSE)
> var2=  as.numeric(ES)-var1
> var3= (as.numeric(ES)-var1)/var2
> simulation=data.frame(cbind(ES,var1,var2,var3))
>
> require(glmulti)
> require(nnet)
> multi.multi=function(formula, data){
>   # to compare models with different factors use true ML not REML
>   multinom(paste(deparse(formula)), data = data)
> }
> # find best model for ES in the simulation (may take days or not converge)
> M=glmulti(
>   ES~var1*var2*var3,
>   data=simulation, name = "glmulti.analysis",
>   intercept = TRUE, marginality = FALSE,
>   level = 2, minsize = 0, maxsize = -1, minK = -1, maxK = -1,
>   fitfunction=multi.multi,
>   method = "g", crit = "aic", confsetsize = 100,includeobjects=TRUE
> )
>
> # determine the coefficients for the best model
> M=multinom(ES~var1*var2*var3, data=simulation)
> summary(M)
>
> #"real data"
> var1=  dnorm(1:3, mean = 30, sd = 20, log = FALSE)
> var2= dnorm(1:3, mean = 10, sd = 20, log = FALSE)
> var3= dnorm(1:3, mean = 250, sd = 20, log = FALSE)
> realdata=data.frame(cbind(var1,var2,var3))
>
> # gives a lot of 1s, but I want to discriminate ES more finely
> d=predict(M, realdata)
>
> # Would it be correct to use a permutation fitting in glmulti like:
>
> require(lmPerm)
>
> multi.multi=function(formula, data){
>   lmp(paste(deparse(formula)), data = data)
> }
>
> # if not, would it be correct to use a permutation procedure for fitting
> # the best obtained model?
>
> M=lmp(ES~var1*var2*var3, data=simulation)
> # this gives a continuous ES output, but with a warning
> d=predict(M, realdata)
>
> Thanks in advance!!
>
> --
> Agustín Camacho Guerrero.
> Doutor em Zoologia.
> Laboratório de Herpetologia, Departamento de Zoologia, Instituto de
> Biociências, USP.
> Rua do Matão, trav. 14, nº 321, Cidade Universitária,
> São Paulo - SP, CEP: 05508-090, Brasil.
>
>
> _______________________________________________
> R-SIG-Robust using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-robust
>
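
On the question of a continuous output from the multinomial fit: a minimal sketch, not a definitive answer. predict() on a multinom object accepts type = "probs", which returns the fitted probability of each ES class instead of the single most likely class, so the same model can be used for fitting and for prediction. The data below (sim, newdata, the seed and the two predictors) are made up for illustration and are not the poster's simulation.

```r
library(nnet)  # recommended package, shipped with standard R installations

set.seed(1)    # assumption: seed and toy data are illustrative only
n <- 200
sim <- data.frame(
  ES   = factor(sample(c("0", "1", "2"), n, replace = TRUE,
                       prob = c(0.15, 0.20, 0.65))),
  var1 = rnorm(n),
  var2 = rnorm(n)
)

# Fit the multinomial model; trace = FALSE silences the optimizer log
fit <- multinom(ES ~ var1 + var2, data = sim, trace = FALSE)

# Hypothetical "real" observations to predict for
newdata <- data.frame(var1 = c(-1, 0, 1), var2 = c(0.5, 0, -0.5))

# Default behaviour: one predicted class label per row (categorical)
classes <- predict(fit, newdata)

# Continuous alternative: one probability per class and per row
probs <- predict(fit, newdata, type = "probs")
print(probs)
print(rowSums(probs))  # each row of probabilities sums to 1
```

Each row of probs is a continuous summary of ES for one observation, which may discriminate between species more finely than the class label alone. Whether that is adequate for the ecological question is a separate judgment.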