[R] add an automatized linear regression in a function
Rui Barradas
ruipbarradas at sapo.pt
Fri May 4 18:45:51 CEST 2012
On 04-05-2012 11:00, jeff6868 <geoffrey_klein at etu.u-bourgogne.fr> wrote:
> Date: Thu, 3 May 2012 06:45:59 -0700 (PDT)
> From: jeff6868<geoffrey_klein at etu.u-bourgogne.fr>
> To:r-help at r-project.org
> Subject: [R] add an automatized linear regression in a function
> Message-ID:<1336052759474-4606047.post at n4.nabble.com>
> Content-Type: text/plain; charset=us-ascii
>
> Dear R users,
>
> At the moment I have a script and a function that compute correlation
> matrices between all my data files. For each file, the function picks the
> best-correlated file and uses it to fill the missing values (NAs) in the
> file being analysed: the data from the best-correlated file are copied
> straight into the gaps. If the best-correlated file has no data at a gap,
> the function takes the data from the second-best-correlated file instead.
> The problem is that, for now, it copies the raw data from the
> best-correlated file.
>
> So I need to adapt these raw data to the file that is going to be filled.
> I'd therefore like to automate the calculation of a linear regression
> between the two files (after the best or second-best correlated file has
> been selected).
> Instead of taking the raw data from the best-correlated file to fill the
> first one, the function should fill it with the values estimated by the
> regression (in order to get more precise filled data).
> The idea is to run lm() between these two files, extract the
> coefficients of the fitted line, calculate the estimated data for the
> whole file (NAs included), and finally fill the gaps with these
> estimates. I hope you've understood my problem.
> Here's the function:
>
> process.all <- function(df.list, mat){
>     f <- function(station)
>         na.fill(df.list[[ station ]], df.list[[ max.cor[station] ]])
>
>     g <- function(station){
>         x <- df.list[[station]]
>         if(any(is.na(x$data))){
>             mat[row(mat) == col(mat)] <- -Inf
>             nas <- which(is.na(x$data))
>             ord <- order(mat[station, ], decreasing = TRUE)[-c(1, ncol(mat))]
>             for(i in nas){
>                 for(y in ord){
>                     if(!is.na(df.list[[y]]$data[i])){
>                         x$data[i] <- df.list[[y]]$data[i]
>                         break
>                     }
>                 }
>             }
>         }
>         x
>     }
>
>     n <- length(df.list)
>     nms <- names(df.list)
>     max.cor <- sapply(seq.int(n), get.max.cor, corhiver2008capt1)
>     df.list <- lapply(seq.int(n), f)
>     df.list <- lapply(seq.int(n), g)
>     names(df.list) <- nms
>     df.list
> }
>
> I succeeded with a small data.frame I created, but I don't know how to do
> it in this particular case.
> Thanks a lot for your help!
>
Statistically speaking, I don't believe in what you want to do, but a
solution could be:
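na.fill <- function(x, y){
    i <- is.na(x$data)              # gaps in the file to be filled
    xx <- y$data                    # predictor: the correlated file
    new <- data.frame(xx = xx)
    # regress x on y using the complete pairs, then predict at the gaps
    x$data[i] <- predict(lm(x$data ~ xx, na.action = na.exclude), new)[i]
    x
}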
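
As a minimal sketch of what the predict() call in na.fill() does, the same estimates can be obtained by extracting the intercept and slope from the fit and applying them by hand, which is the procedure described in the question. The vectors `target` and `donor` below are made-up toy data, not taken from the thread.

```r
# Hypothetical toy series: 'target' has gaps, 'donor' is complete.
target <- c(1, 2, NA, 4, 5, NA)
donor  <- c(3.1, 4.9, 7.0, 9.1, 10.9, 13.0)

# lm() drops incomplete pairs by default, so only rows 1, 2, 4, 5 are fitted.
fit <- lm(target ~ donor)
b <- coef(fit)                      # b[1] = intercept, b[2] = slope

# Estimated values for every row, gaps included.
estimate <- unname(b[1] + b[2] * donor)

# The same values obtained through predict() with the donor as new data.
via_predict <- unname(predict(fit, data.frame(donor = donor)))

# Fill only the gaps; observed values stay untouched.
filled <- ifelse(is.na(target), estimate, target)
print(filled)
```

Either route gives the same numbers; predict() simply saves writing out the line equation.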
and in process.all, change function g() to
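g <- function(station){
    x <- df.list[[station]]
    if(any(is.na(x$data))){
        # the station must never be chosen as its own predictor
        mat[row(mat) == col(mat)] <- -Inf
        nas <- which(is.na(x$data))
        # candidates by decreasing correlation, dropping the first and
        # the last entry (the last is the station itself, set to -Inf)
        ord <- order(mat[station, ], decreasing = TRUE)[-c(1, ncol(mat))]
        for(y in ord){
            # use the first candidate that has data at every gap
            if(all(!is.na(df.list[[y]]$data[nas]))){
                xx <- df.list[[y]]$data
                new <- data.frame(xx = xx)
                x$data[nas] <- predict(lm(x$data ~ xx,
                    na.action = na.exclude), new)[nas]
                break
            }
        }
    }
    x
}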
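
To see the fallback in action, here is a small self-contained sketch under stated assumptions: the three stations are invented, and `fill.station()` and its "donor" attribute are hypothetical names used only for illustration. Unlike g(), it tries every other station in order of correlation rather than skipping the first one. The best-correlated station (s2) is missing at one of the gaps, so the regression is fitted on the next candidate (s3).

```r
# Hypothetical toy stations; only the 'data' column is used, as in the thread.
s1 <- data.frame(data = c(1, 2, NA, 4, 5, 6, NA, 8))
s2 <- data.frame(data = c(3, 5, NA, 9, 11, 13, 15, 17))              # NA at a gap of s1
s3 <- data.frame(data = c(0.4, 1.1, 1.4, 2.1, 2.4, 3.1, 3.4, 4.1))   # complete
df.list <- list(s1 = s1, s2 = s2, s3 = s3)

# Correlation matrix computed on pairwise complete observations.
mat <- cor(sapply(df.list, function(d) d$data), use = "pairwise.complete.obs")

fill.station <- function(station, df.list, mat){
    x <- df.list[[station]]
    nas <- which(is.na(x$data))
    if(length(nas) == 0) return(x)
    mat[row(mat) == col(mat)] <- -Inf        # never pick the station itself
    ord <- order(mat[station, ], decreasing = TRUE)
    ord <- ord[-length(ord)]                 # the station itself is sorted last
    for(y in ord){
        xx <- df.list[[y]]$data
        if(all(!is.na(xx[nas]))){            # donor must cover every gap
            fit <- lm(x$data ~ xx, na.action = na.exclude)
            x$data[nas] <- predict(fit, data.frame(xx = xx))[nas]
            attr(x, "donor") <- y            # record which station was used
            break
        }
    }
    x
}

filled <- fill.station(1, df.list, mat)
print(filled$data)
```

Note that a candidate is accepted only if it has data at every gap; if no station qualifies, the gaps are left as NA.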
Hope this helps,
Rui Barradas