[R] Q: combine 2 data frames with missing values
ggrothendieck at gmail.com
Mon Aug 20 14:51:05 CEST 2007
Lines <- "case var1 var2 var3 var4
1 9 9 13 11
2 15 9 15 13
3 na na 12 9
4 8 6 na na
5 14 10 na na
6 20 15 17 15
# replace with DF <- read.table("myfile.dat", header = TRUE, na.strings = "na")
DF <- read.table(textConnection(Lines), header = TRUE, na.strings = "na")
DF1 <- DF[-1]
kor <- cor(DF1, use = "pairwise")
lm(var1 ~ var2, DF) # a sample regression
# mycoef calculates kth coefficient in regression of
# ith variable on jth variable
mycoef <- function(i, j, k) coef(lm(DF1[c(i, j)]))[k]
idx <- 1:ncol(DF1)
names(idx) <- names(DF1)
intercepts <- outer(idx, idx, Vectorize(mycoef), 1)
names(dimnames(intercepts)) <- c("y", "x")
slopes <- outer(idx, idx, Vectorize(mycoef), 2)
names(dimnames(slopes)) <- c("y", "x")
# another approach to the above
# mycoef1 is like mycoef but has only one argument
# and outputs all coefs, not just a specified one
mycoef1 <- function(idx) coef(lm(DF1[idx]))
out <- t(apply(expand.grid(y = idx, x = idx), 1, mycoef1))
colnames(out) <- c("y", "x", "intercept", "slope")
# To perform SQL operations on data frames
# see sqldf home page at http://sqldf.googlecode.com
# and also ?sqldf for many examples
sqldf("select avg(var1), avg(var2), avg(var3), avg(var4) from DF1")
colMeans(DF1, na.rm = TRUE) # same
On 8/20/07, Tom Willems <Tom.Willems at var.fgov.be> wrote:
> hello R ussers,
> i have the same problem with my data,
> for aal the different variables, i have the same number of cases, but the
> are often out of detectionlimits so they produce "na's" .
> so the data looks like this:
> case var1 var2 var3 var4 ...
> 1 9 9 13 11
> 2 15 9 15 13
> 3 na na 12 9
> 4 8 6 na na
> 5 14 10 na na
> 6 20 15 17 15 ..
> What i would like to do for data exploration, is to compare each possible
> pair of variables, get their correlation coefficient, the intercept and
> the slope of regression line. yet for every variable the messurements are
> lnked thruogh theyr case. it is the same sample just a diferent test.
> Now i select a subsets of variables out of the original dataset, and use
> value_x1 = subset(dataset_1,select=lg_value)
> value_y1 =subset(dataset_2,select=lg_value)
> Then i to mold an lm model, inorder to get estimates for the slope ans
> model_1 <- lm (value_y1[,1]~ value_x1[,1] )
> This is what R tell's me:
> "Error in model.frame(formula, rownames,
> variables, varnames, extras, extranames, :
> variable lengths differ (found for
> 'value_x1[, 1]')"
> Is there perhapes a way of binding the selected subsets together, still
> linked to their case, so that the na's can be discarded by R automaticaly?
> I have been trying to use SQLiteDF and the other sql func's of R, but i
> don't realy understand them.
> If someone out there knows how to use sql, in R, i d be delited if he or
> she could explain it to me, more understandible then the manuals i find on
> the web.
> Here is what io would want sql to do .
> My data is in columns, one column holds all the case numbers, one the
> messured values, one all the testtypes and one the timeperiod and then one
> column for the lab's that preformed the test. is is stored in a txt file.
> So it is a long 5 column data table.
> Now is it possible to make a cross table holding the case nr's, and
> timeperiod in 2 column's, and then have a different column for every test?
> so if there are 4 tests and 4 lab's, it would give 16 columns.
> I've tryed it in access, but it gave me andless loops of repeated values.
> and creating new data files is dangerous, 'litle mistakes made while
> copying ' or manipultaions made to one file and not to the other'.
> kind regards,
> Disclaimer: click here
> [[alternative HTML version deleted]]
> R-help at stat.math.ethz.ch mailing list
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
More information about the R-help