# [R] Q: combine 2 data frames with missing values

Gabor Grothendieck ggrothendieck at gmail.com
Mon Aug 20 14:51:05 CEST 2007

```Try this:

Lines <- "case    var1    var2    var3   var4
1       9       9       13      11
2       15      9       15      13
3       na      na      12      9
4       8       6       na      na
5       14      10      na      na
6       20      15      17      15
"

# replace with DF <- read.table("myfile.dat", header = TRUE, na.strings = "na")

DF1 <- DF[-1]
kor <- cor(DF1, use = "pairwise")
kor

lm(var1 ~ var2, DF) # a sample regression

# mycoef calculates kth coefficient in regression of
# ith variable on jth variable
mycoef <- function(i, j, k) coef(lm(DF1[c(i, j)]))[k]

idx <- 1:ncol(DF1)
names(idx) <- names(DF1)

intercepts <- outer(idx, idx, Vectorize(mycoef), 1)
names(dimnames(intercepts)) <- c("y", "x")
intercepts

slopes <- outer(idx, idx, Vectorize(mycoef), 2)
names(dimnames(slopes)) <- c("y", "x")
slopes

# another approach to the above
# mycoef1 is like mycoef but has only one argument
# and outputs all coefs, not just a specified one
mycoef1 <- function(idx) coef(lm(DF1[idx]))
out <- t(apply(expand.grid(y = idx, x = idx), 1, mycoef1))
colnames(out) <- c("y", "x", "intercept", "slope")
out

# To perform SQL operations on data frames
# and also ?sqldf for many examples
library(sqldf)
sqldf("select avg(var1), avg(var2), avg(var3), avg(var4) from DF1")
colMeans(DF1, na.rm = TRUE)  # same

On 8/20/07, Tom Willems <Tom.Willems at var.fgov.be> wrote:
> hello R ussers,
>
> i have the same problem with my data,
> for aal the different variables, i have the same number of cases, but the
> are often out of detectionlimits so they produce "na's" .
> so the data looks like this:
>
> case    var1    var2    var3    var4 ...
> 1       9       9       13      11
> 2       15      9       15      13
> 3       na      na      12      9
> 4       8       6       na      na
> 5       14      10      na      na
> 6       20      15      17      15      ..
> ..
>
> What i would like to do for data exploration, is to compare each possible
> pair of variables, get their correlation coefficient, the intercept and
> the slope of regression line. yet for every variable the messurements are
> lnked thruogh theyr case. it is the same sample just a diferent test.
>
> Now  i select a subsets  of variables out of the original dataset, and use
>  :
>          value_x1 = subset(dataset_1,select=lg_value)
>          value_y1 =subset(dataset_2,select=lg_value)
>
> Then i to mold an lm model, inorder to get estimates for the slope ans
> intercept
>        model_1 <- lm (value_y1[,1]~ value_x1[,1]  )
>
> This is what R tell's me:
>                        "Error in model.frame(formula, rownames,
> variables, varnames, extras, extranames,  :
>                                  variable lengths differ (found for
> 'value_x1[, 1]')"
>
> Is there perhapes a way of binding the selected subsets together, still
> linked to their case, so that the na's can be discarded by R automaticaly?
> I have been trying to use SQLiteDF and the other sql func's of R, but i
> don't realy understand them.
> If someone out there knows how to use sql, in R, i d be delited if he or
> she could explain it to me, more understandible then the manuals i find on
> the web.
> Here is what io would want sql to do .
>
>
> My data is in columns, one column holds all the case numbers, one the
> messured values, one all the testtypes and one the timeperiod and then one
> column for the lab's that preformed the test. is is stored in a txt file.
> So it is a long 5 column data table.
> Now is it possible to make a cross table holding the case nr's, and
> timeperiod in 2 column's, and then have a different column for every test?
> so if there are 4 tests and 4 lab's, it would give 16 columns.
> I've tryed it in access, but it gave me andless loops of repeated values.
> and creating new data files is dangerous, 'litle mistakes made while
> copying ' or manipultaions made to one file and not to the other'.
> .
>
> kind regards,
> Tom
>
>
>