[R] Problems with merge

Don MacQueen macq at llnl.gov
Thu Oct 7 16:46:29 CEST 2004


At 10:44 AM +0530 10/6/04, Vikas Rawal wrote:
>This issue has been discussed on this list before but the solutions 
>offered are not satisfactory. So I thought I would raise it again.
>
>I want to merge two datasets which have three common variables. 
>These variables DO NOT have the same names in both files. In 
>addition, there are two variables with the same names that do not 
>necessarily hold exactly the same data. That is, there could be some 
>discrepancy between the two datasets when it comes to these 
>variables. I do not want them to be used when I merge the datasets.
>
>The problem is that R allows you to use by.x and by.y to specify 
>only one variable in the x dataset and one variable in the y dataset 
>to merge on. Otherwise, if you do not specify anything, it matches on 
>all the variables that have common names. This is very 
>problematic. In my case, the variables I want to match on do 
>not have the same names in the two datasets, and the ones that have 
>the same names must not be used to match.
>
>One approach will be to change names of variables and then merge. 
>But that is not elegant, to say the least.
>
>If nothing else works, that is what I shall have to do. There again 
>we have a problem. How do I change the name of a particular 
>column? One solution suggested somewhere in the archives of the list 
>is to use
>
>names(data.frame)=c(list of column names)
>
>But this requires you to list all the variable names. That can 
>obviously be cumbersome when you have a large number of variables. 
>What would be the syntax if I want to change just one column name?

It's not that hard to figure out the syntax, using functions like 
match(), intersect(), setdiff() and friends. Here is a suggestion:

mydf <- rename(mydf,from='oldvarname',to='newvarname')

where the rename function is this:

rename <- function (data, from = "", to = "", info = TRUE)
{
     # Rename columns of 'data': each name in 'from' becomes the
     # corresponding name in 'to'; all other columns are left alone.
     dsn <- deparse(substitute(data))
     dfn <- names(data)
     if (length(from) != length(to))
         stop("from and to not same length")
     if (length(dfn) < length(to))
         stop("too many new names")
     # positions in names(data) of the columns to be renamed
     chng <- match(from, dfn)
     if (any(is.na(chng)))
         stop("some of the from names not found in ", dsn)
     if (length(to) != length(unique(to)))
         stop("New names not unique")
     dfn.new <- dfn
     dfn.new[chng] <- to
     if (info) {
         # report the old and new names, one column per renamed variable
         cat("\nChanging in", dsn)
         tmp <- rbind(from, to)
         dimnames(tmp) <- list(c("From:", "To:"), rep("", length(from)))
         print(tmp, quote = FALSE)
     }
     names(data) <- dfn.new
     invisible(data)
}

'from' and 'to' can be character vectors, and they must be of the same length.

It wouldn't be hard to modify it to *not* receive and return the 
entire dataframe, but I found it more convenient to use this way.
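
A minimal sketch of such a variant, assuming it takes only the vector of names and returns the modified vector. The name rename.names and the example data frame are hypothetical, made up for illustration:

```r
# Hypothetical helper: operates on a character vector of names only,
# so the data frame itself is never passed in or returned.
rename.names <- function(nms, from, to) {
    stopifnot(length(from) == length(to))
    pos <- match(from, nms)          # positions of the old names
    if (any(is.na(pos)))
        stop("some of the from names not found")
    nms[pos] <- to                   # replace only the matched names
    nms
}

# Made-up data frame for illustration
mydf <- data.frame(id = 1:3,
                   oldvarname = c(2.5, 3.1, 4.7),
                   other = c("a", "b", "c"))
names(mydf) <- rename.names(names(mydf),
                            from = "oldvarname", to = "newvarname")
names(mydf)
```

The trade-off is that the caller has to write names(mydf) on both sides of the assignment.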

Also, I wrote that function a long time ago, when I had a lot less 
experience than I do now (just in case anyone notices some obvious 
room for improvement!)
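
For the merge itself, renaming may not be needed: by.x and by.y in merge() can each be a vector of column names, paired up in order. A sketch with made-up data frames (every column name below is hypothetical) that matches on three differently named keys and leaves the two same-named columns out of the match:

```r
# Made-up data; all column names are hypothetical.
x <- data.frame(vill = c(1, 1, 2), hh = c(1, 2, 1),
                yr = c(2001, 2001, 2001),
                income = c(100, 200, 300), size = c(4, 5, 6))
y <- data.frame(village = c(1, 2, 1), household = c(2, 1, 1),
                year = c(2001, 2001, 2001),
                income = c(210, 300, 100), size = c(5, 6, 3))

# Match only on the three key columns, which are named
# differently in x and y; by.x[i] is paired with by.y[i].
m <- merge(x, y,
           by.x = c("vill", "hh", "yr"),
           by.y = c("village", "household", "year"))
names(m)
```

Because income and size are not listed as keys, they are not used for matching; merge() keeps both copies and distinguishes them with the suffixes .x and .y, which makes any discrepancies between the two datasets easy to inspect afterwards.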

>
>Vikas
>


-- 
--------------------------------------
Don MacQueen
Environmental Protection Department
Lawrence Livermore National Laboratory
Livermore, CA, USA



