[R] R: Optimization

Wed Sep 6 04:24:23 CEST 2006

<comments in line> 

Simone Vincenzi wrote:
> Thanks for the help.
> I thought about taking the logarithms of the expression but to cope with the
> problem of zero values both in yield and the values of the environmental
> variables another transformation is needed (I simply add one to the yield
> value and to the environmental variables values). 
Do you have 0 yield in any cases where the values of all the 
environmental variables were non-zero?   I didn't see any 0's in the 
data you have below.  I would not worry about 0's unless they actually 
occur.  If you have a larger data set with some 0's then I suggest you 
consider the cases where the 0's actually occur. 

      * If every 0 yield occurs when at least one of the environmental 
variables is 0, you can safely delete all cases with 0 yield, because 
they provide zero information to estimate any of the parameters in your 
model.  I would rerun the model without the 1's, as they make no sense 
to me. 

      * Otherwise, you need to evaluate the noise in the model.  In 
other words, the equation you wrote will not fit exactly.  The standard 
regression model (linear or nonlinear) assumes that y = f(X, b) + e, 
where "X" here is a vector of A, B, ..., b is a vector of the parameters 
to be estimated, b0, b1, ..., in the model I wrote, and the errors "e" 
are normally distributed with constant variance.  If you have cases 
where the yield is zero but none of the X's are, you have problems with 
this assumption no matter what.  You could use "nls" with the model as 
you specified it.  However, to get starting values for "nls" I suggest 
you run "lm" on the logarithms, adding 1 if you like, as you suggested. 
> I follow the suggestion
> kindly provided by Spencer Graves but I found the following problems, which
> depend probably on my understanding:
> 1) I don't understand why it is needed to test the no-constant model (which
> provides a better fitting of the model). Can I simply assume a no-constant
> model? 
>   
I was assuming yield would be a number between 0 and 1, similar to A, B, 
... .  If your inputs A, B, .., are all between 0 and 1 but yield is 
not. I suggest you carefully examine the source for that equation, 
because it makes no physical sense to me.  Your regression results below 
report that the intercept is not significantly different from 0.  
However, I would suspect that might be just an accident.  If you change 
units from kg/m2 to psi or something else, you should get a 
statistically significant intercept. 

      * If yield is between 0 and 1, then the yield equation you wrote 
makes physical sense.  And then it makes sense to test whether b0 = 0, 
because that is a test for whether there are other environmental that 
impact yield that are not in the model. 

      * Similarly, the comparisons of fit0 and fit.0 plus fit1 and fit.1 
are designed to test for other environmental variable(s) not in the 
model that affect yield but are correlated somewhat with the variables 
you already have. 

      For more, I suggest you consult a statistician.  There must be 
several at Uni Parma. 

      Hope this helps. 
      Spencer Graves
> 2) Here comes the big problem. I don't understand what it is done with the
> models fit.0 and fit.1. In both cases I have to leave out one variable from
> the linear model but I don't understand why in this way I would test the
> constraint that all the coefficients should sum to 1. And it is not possible
> to apply anova(fit0,fit.0) and anova(fit1,fit.1) because I have different
> response variables. 
>
> In conclusion, I don't actually know how to proceed and if in R this kind of
> analysis is possible.
> Any help and any further explanation would be very appreciated. Below I
> report part of the data frame I'm using for the analysis. The environmental
> variables (Salinity, Hydrodynamism, Sediment, Oxygen, Chlorophyll and
> Bathymetry) values have been transformed using suitability function (and
> thus are bounded between 0 and 1) while Yield is in kg/m2.
>
>    Sedi Sal Bathy Chl Hydro Oxy  Yield
> 1  1.00 1.00 0.51   1 0.17 0.75 1.5
> 2  0.50 0.95 1.00   1 0.09 0.94 0.4
> 3  0.50 1.00 0.17   1 0.44 0.90 1.8
> 4  1.00 0.98 1.00   1 0.10 0.89 4.5
> 5  0.13 0.84 0.73   1 0.16 0.84 0.4
> 6  0.50 0.90 0.91   1 0.22 0.84 0.4
> 7  0.13 0.75 1.00   1 0.14 0.86 0.2
> 8  0.13 0.84 0.75   1 0.10 0.83 0.3
> 9  0.13 0.78 0.97   1 0.06 0.84 0.5
> 10 0.13 0.87 0.70   1 0.45 0.85 1.0
> 11 1.00 0.77 1.00   1 0.19 0.86 1.5
> 12 1.00 0.94 0.81   1 0.47 0.86 3.0
> 13 1.00 0.93 1.00   1 0.45 0.89 2.5
> 14 0.50 1.00 1.00   1 0.54 0.84 4.0
> 15 0.50 1.00 1.00   1 0.25 0.88 2.2
> 16 1.00 1.00 0.56   1 0.25 0.90 5.0
> 17 1.00 0.90 0.56   1 0.40 0.90 1.5
> 18 0.50 0.97 1.00   1 0.22 0.95 1.0
> 19 0.54 0.96 1.00   1 0.18 0.91 0.3
> 20 1.00 0.97 0.33   1 0.39 0.90 3.0
>
>
> And here the results of the fitted models so far.
>
> summary(fit0)
>
> Call:
> lm(formula = log(Yield + 1) ~ log(Sal + 1) + log(Bathy + 1) + log(Chl + 
>     1) + log(Hydro + 1) + log(Oxy + 1) + log(Sedi + 1) - 1, data = data.df)
>
> Residuals:
>      Min       1Q   Median       3Q      Max 
> -1.05485 -0.23759  0.01331  0.18692  1.23803 
>
> Coefficients:
>               Estimate Std. Error t value Pr(>|t|)    
> log(Sal + 1)   5.07193    1.00584   5.042 1.98e-06 ***
> log(Bathy + 1)  0.05804    0.20684   0.281 0.779561    
> log(Chl + 1)  -8.53941    2.11720  -4.033 0.000106 ***
> log(Hydro + 1)  1.73835    0.28815   6.033 2.56e-08 ***
> log(Oxy + 1)   4.19951    1.98459   2.116 0.036750 *  
> log(Sedi + 1)  1.15953    0.14807   7.831 4.51e-12 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
>
> Residual standard error: 0.3735 on 103 degrees of freedom
> Multiple R-Squared: 0.9148,     Adjusted R-squared: 0.9098 
> F-statistic: 184.2 on 6 and 103 DF,  p-value: < 2.2e-16 
>
>   
>> summary(fit1)
>>     
>
> Call:
> lm(formula = log(Yield + 1) ~ log(Sal + 1) + log(Bathy + 1) + log(Chl + 
>     1) + log(Hydro + 1) + log(Oxy + 1) + log(Sedi + 1), data = (data.df))
>
> Residuals:
>       Min        1Q    Median        3Q       Max 
> -1.057766 -0.227720  0.006146  0.192200  1.222543 
>
> Coefficients:
>               Estimate Std. Error t value Pr(>|t|)    
> (Intercept)   -4.49400    4.76603  -0.943   0.3479    
> log(Sal + 1)   5.13499    1.00861   5.091 1.63e-06 ***
> log(Bathy + 1)  0.06685    0.20716   0.323   0.7476    
> log(Chl + 1)  -2.48909    6.75718  -0.368   0.7134    
> log(Hydro + 1)  1.70674    0.29025   5.880 5.22e-08 ***
> log(Oxy + 1)   4.63758    2.03928   2.274   0.0251 *  
> log(Sedi + 1)  1.14005    0.14959   7.621 1.34e-11 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
>
> Residual standard error: 0.3737 on 102 degrees of freedom
> Multiple R-Squared: 0.7589,     Adjusted R-squared: 0.7447 
> F-statistic: 53.51 on 6 and 102 DF,  p-value: < 2.2e-16 
>
>   
>> summary(fit.0)
>>     
>
> Call:
> lm(formula = log((Yield + 1)/(Sedi + 1)) ~ log((Sal + 1)/(Sedi + 
>     1)) + log((Bathy + 1)/(Sedi + 1)) + log((Chl + 1)/(Sedi + 
>     1)) + log((Hydro + 1)/(Sedi + 1)) + log((Oxy + 1)/(Sedi + 
>     1)) - 1, data = subset(data.df))
>
> Residuals:
>     Min      1Q  Median      3Q     Max 
> -1.5762 -0.2591  0.1065  0.5023  1.4533 
>
> Coefficients:
>                            Estimate Std. Error t value Pr(>|t|)    
> log((Sal + 1)/(Sedi + 1))    3.0170     1.5304   1.971  0.05134 .  
> log((Bathy + 1)/(Sedi + 1))  -0.3982     0.3139  -1.268  0.20748    
> log((Chl + 1)/(Sedi + 1))    8.8137     2.3941   3.681  0.00037 ***
> log((Hydro + 1)/(Sedi + 1))   0.3309     0.4066   0.814  0.41760    
> log((Oxy + 1)/(Sedi + 1))  -12.5242     2.1881  -5.724 1.01e-07 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
>
> Residual standard error: 0.5766 on 104 degrees of freedom
> Multiple R-Squared: 0.523,      Adjusted R-squared: 0.5001 
> F-statistic:  22.8 on 5 and 104 DF,  p-value: 2.179e-15 
>
>
>   
>> summary(fit.1)
>>     
>
> Call:
> lm(formula = log((Yield + 1)/(Sedi + 1)) ~ log((Sal + 1)/(Sedi + 
>     1)) + log((Bathy + 1)/(Sedi + 1)) + log((Chl + 1)/(Sedi + 
>     1)) + log((Hydro + 1)/(Sedi + 1)) + log((Oxy + 1)/(Sedi + 
>     1)), data = subset(data.df))
>
> Residuals:
>      Min       1Q   Median       3Q      Max 
> -1.05552 -0.23441  0.01263  0.18355  1.24473 
>
> Coefficients:
>                             Estimate Std. Error t value Pr(>|t|)    
> (Intercept)                  1.84916    0.15474  11.950  < 2e-16 ***
> log((Sal + 1)/(Sedi + 1))    5.03863    1.00977   4.990 2.46e-06 ***
> log((Bathy + 1)/(Sedi + 1))   0.05279    0.20767   0.254   0.7999    
> log((Chl + 1)/(Sedi + 1))  -10.96682    2.27270  -4.825 4.86e-06 ***
> log((Hydro + 1)/(Sedi + 1))   1.74631    0.28980   6.026 2.64e-08 ***
> log((Oxy + 1)/(Sedi + 1))    3.95938    1.98206   1.998   0.0484 *  
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
>
> Residual standard error: 0.3751 on 103 degrees of freedom
> Multiple R-Squared: 0.5606,     Adjusted R-squared: 0.5393 
> F-statistic: 26.28 on 5 and 103 DF,  p-value: < 2.2e-16
>
>   
>> anova(fit1,fit0)
>>     
> Analysis of Variance Table
>
>   Res.Df     RSS  Df Sum of Sq      F Pr(>F)
> 1    102 14.2409                            
> 2    103 14.3650  -1   -0.1241 0.8891 0.3479
>
> Thanks for the help
>
> Simone Vincenzi
>
>
>
>
>
>
>
> _________________________________________
> Simone Vincenzi, PhD Student 
> Department of Environmental Sciences
> University of Parma
> Parco Area delle Scienze, 33/A, 43100 Parma, Italy
> Phone: +39 0521 905696
> Fax: +39 0521 906611
> e.mail: svincenz at nemo.unipr.it 
>
>
> -----Messaggio originale-----
> Da: Spencer Graves [mailto:spencer.graves at pdf.com] 
> Inviato: lunedì 4 settembre 2006 21.31
> A: Simone Vincenzi
> Cc: r-help at stat.math.ethz.ch
> Oggetto: Re: [R] Optimization
>
>       Have you considered talking logarithms of the expression you 
> mentioned: 
>
>       log(Yield) = a1*log(A)+b1*log(B)+c2*log(C)+...
>
> where a1 = a/(a+b+...), etc.  This model has two constraints not present 
> in ordinary least squares:  First, the intercept is assumed to be zero.  
> Second, the coefficients in this log formulation must sum to 1.  If I 
> were you, I might use something like "lm" to test them both. 
>
>       To explain how, I'll modify the notation, replacing A by X1, B by 
> X2, ..., up to Xkm1 (= X[k-1]) and Xk for k different environmental 
> variables.  Then I might try something like the following: 
>
>       fit0 <- lm(log(Yield) ~ log(X1) + ... + log(Xk)-1 )
>       fit1 <- lm(log(Yield) ~ log(X1) + ... + log(Xk) )
>       fit.1 <- lm(log(Yield/Xk) ~ log(X1/Xk) + ... + log(Xkm1/Xk) )
>       fit.0 <- lm(log(Yield/Xk) ~ log(X1/Xk) + ... + log(Xkm1/Xk)-1 )
>
>       anova(fit1, fit0) would test the no-constant model, and if I 
> haven't made a mistake in this, anova(fit0, fit.0) and anova(fit1, 
> fit.1) would test the constraint that all the coefficients should sum to 
> 1. 
>
>       If you would like further help from this listserve, please provide 
> commented, minimal, self-contained, reproducible code to help potential 
> respondents understand your question and concerns (as suggested in the 
> posting guide "www.R-project.org/posting-guide.html"). 
>
>       Hope this helps. 
>       Spencer Graves
>
> Simone Vincenzi wrote:
>   
>> Dear R-list,
>> I'm trying to estimate the relative importance of 6 environmental
>>     
> variables
>   
>> in determining clam yield. To estimate clam yield a previous work used the
>> function Yield = (A^a*B^b*C^c...)^1/(a+b+c+...) where A,B,C... are the
>> values of the environmental variables and the weights a,b,c... have not
>>     
> been
>   
>> calibrated on data but taken from literature. Now I'd like to estimate the
>> weights a,b,c... by using a dataset with 110 observations of yield and
>> values of the environmental variables. I'm wondering if it is feasible or
>>     
> if
>   
>> the number of observation is too low, if some data transformation is
>>     
> needed
>   
>> and which R function is the most appropriate to try to estimate the
>>     
> weights.
>   
>> Any help would be greatly appreciated.
>>
>> Simone Vincenzi 
>>
>> _________________________________________
>> Simone Vincenzi, PhD Student 
>> Department of Environmental Sciences
>> University of Parma
>> Parco Area delle Scienze, 33/A, 43100 Parma, Italy
>> Phone: +39 0521 905696
>> Fax: +39 0521 906611
>> e.mail: svincenz at nemo.unipr.it 
>>
>>
>>
>> --
>>
>> ______________________________________________
>> R-help at stat.math.ethz.ch mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>>     
> http://www.R-project.org/posting-guide.html
>   
>> and provide commented, minimal, self-contained, reproducible code.
>>   
>>     
>
>