[R] add an automatized linear regression in a function

Fri May 4 18:45:51 CEST 2012

On 04-05-2012 11:00, jeff6868 <geoffrey_klein at etu.u-bourgogne.fr> wrote:
> Date: Thu, 3 May 2012 06:45:59 -0700 (PDT)
> From: jeff6868<geoffrey_klein at etu.u-bourgogne.fr>
> To:r-help at r-project.org
> Subject: [R] add an automatized linear regression in a function
> Message-ID:<1336052759474-4606047.post at n4.nabble.com>
> Content-Type: text/plain; charset=us-ascii
>
> Dear R users,
>
> For the moment, I have a script and a function which calculate correlation
> matrices between all my data files. For each file, it then picks the best
> correlated file and uses it to fill the missing data in the analysed file:
> the data from the best correlated file are put automatically into the gaps
> of the first file (my files contain missing values, NAs). If the best
> correlated file doesn't contain data there, it takes the data from the
> second best correlated file.
> The problem is that, for the moment, it takes the raw data from the best
> correlated file.
>
> So I need to adapt these raw data to the file that is going to be filled.
> I'd therefore like to automate the calculation of a linear regression
> between the two files (after the selection of the best or second best
> correlated data file).
> Instead of taking the raw data from the best correlated file to fill the
> first one, it should take the estimated data from the regression (in order
> to have more precise filled data).
> The idea is to run lm() between these two files, extract the coefficients
> of the fitted straight line, calculate the estimated data for the whole
> file (NAs included), and finally fill the gaps with these estimates. I hope
> you've understood my problem.
> Here's the function:
>
> process.all<- function(df.list, mat){
>          f<- function(station)
>               na.fill(df.list[[ station ]], df.list[[ max.cor[station] ]])
>
>          g<- function(station){
>          x<- df.list[[station]]
>          if(any(is.na(x$data))){
>                  mat[row(mat) == col(mat)]<- -Inf
>                  nas<- which(is.na(x$data))
>                  ord<- order(mat[station, ], decreasing = TRUE)[-c(1, ncol(mat))]
>                  for(i in nas){
>                          for(y in ord){
>                                  if(!is.na(df.list[[y]]$data[i])){
>                                          x$data[i]<- df.list[[y]]$data[i]
>                                          break
>                                  }
>                          }
>                  }
>          }
>          x
>      }
>
>          n<- length(df.list)
>          nms<- names(df.list)
>          max.cor<- sapply(seq.int(n), get.max.cor, corhiver2008capt1)
>          df.list<- lapply(seq.int(n), f)
>          df.list<- lapply(seq.int(n), g)
>          names(df.list)<- nms
>          df.list
>      }
>
> I succeeded with a small data.frame I created, but I don't know how to do
> it in this particular case.
> Thanks a lot for your help!
>
Statistically speaking, I don't believe in what you want, but a solution 
could be

na.fill <- function(x, y){
     i <- is.na(x$data)
     xx <- y$data
     new <- data.frame(xx=xx)
     x$data[i] <- predict(lm(x$data~xx, na.action=na.exclude), new)[i]
     x
}
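
As a quick check of na.fill() above, here is a minimal sketch on made-up data. The data frames a and b are hypothetical; only the column name 'data' follows the convention used in this thread. Station a is an exact linear function of b (a = 2 * b + 1), so the two filled values should come out close to 5 and 11.

```r
# na.fill() as defined above, repeated so the example runs on its own
na.fill <- function(x, y){
    i <- is.na(x$data)                      # positions of the gaps in x
    xx <- y$data                            # predictor: the donor station
    new <- data.frame(xx = xx)              # newdata covering every row
    x$data[i] <- predict(lm(x$data ~ xx, na.action = na.exclude), new)[i]
    x
}

# Hypothetical toy stations: b is complete, a has two gaps
b <- data.frame(data = c(1, 2, 3, 4, 5, 6))
a <- data.frame(data = 2 * b$data + 1)
a$data[c(2, 5)] <- NA

filled <- na.fill(a, b)
filled$data
```

Only the positions that were NA in a are overwritten; the observed values are left as they were.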

and in process.all, change function g() to

     g <- function(station){
         x <- df.list[[station]]
         if(any(is.na(x$data))){
             mat[row(mat) == col(mat)] <- -Inf
             nas <- which(is.na(x$data))
             ord <- order(mat[station, ], decreasing = TRUE)[-c(1, ncol(mat))]
             for(y in ord){
                 if(all(!is.na(df.list[[y]]$data[nas]))){
                     xx <- df.list[[y]]$data
                     new <- data.frame(xx=xx)
                     x$data[nas] <- predict(lm(x$data~xx, na.action=na.exclude), new)[nas]
                     break
                 }
             }
         }
         x
     }
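
To make the donor selection in g() easier to follow, here is a small self-contained sketch of the same idea on plain vectors. The helper fill.from.best(), the vectors and the correlation values are all hypothetical and not part of the original script. It uses the coefficients from coef() directly, as described in the question, rather than predict(); for a simple regression the two should give the same estimates. Note that, like g() above, it only accepts a donor that has data at every gap, so the gaps stay NA if no candidate qualifies.

```r
# Hypothetical helper: fill the gaps in 'target' from the first candidate,
# taken in order of decreasing correlation, that has data at every gap.
fill.from.best <- function(target, candidates, cors){
    nas <- which(is.na(target))
    if(length(nas) == 0) return(target)
    ord <- order(cors, decreasing = TRUE)
    for(j in ord){
        donor <- candidates[[j]]
        if(all(!is.na(donor[nas]))){
            fit <- lm(target ~ donor)       # incomplete rows are dropped by default
            cf <- unname(coef(fit))         # cf[1] = intercept, cf[2] = slope
            target[nas] <- cf[1] + cf[2] * donor[nas]
            break
        }
    }
    target
}

# Made-up data: s1 is the better correlated station but is missing at one
# of the gaps, so the second candidate s2 ends up being used.
target <- c(3, NA, 7, 9, NA, 13)
candidates <- list(s1 = c(1, NA, 3, 4, 5, 6),
                   s2 = c(10, 20, 30, 40, 50, 60))
cors <- c(s1 = 0.99, s2 = 0.95)             # assumed correlations, for illustration

filled <- fill.from.best(target, candidates, cors)
filled
```

Here target equals 0.2 * s2 + 1 exactly, so the two gaps should be filled with values close to 5 and 11.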

Hope this helps,

Rui Barradas