[R] Lining up x-y datasets based on values of x

Marc Schwartz marc_schwartz at comcast.net
Thu Feb 1 23:00:11 CET 2007


Christos,

Hmmmm... according to the Value section in ?merge:

A data frame. The rows are by default lexicographically sorted on the
common columns, but for sort=FALSE are in an unspecified order.


Looking at the code, while there is a lot of time spent on matching
things, the key sorting code seems to be near the end of the function:

          if (sort) 
            res <- res[if (all.x || all.y) 
                do.call("order", x[, 1:l.b, drop = FALSE])
            else sort.list(bx[m$xi]), , drop = FALSE]

I wonder if you could create a local version of merge(), say my.merge(),
without that code and without breaking things. A quick glance suggests
that as long as you are not merging on the rownames, I think that you
might be OK. You would want to test that hypothesis however.
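
As a point of comparison, here is a minimal sketch of a different route
that skips merge() entirely and lines the datasets up with match(). It is
not the modified merge() described above. The helper name align.xy and the
output column names y.a and y.b are made up for illustration, and the
sketch assumes that each data frame has its x values in the first column
(named V1, as in the thread), its y values in the second column, and no
duplicated x values.

```r
# Hypothetical helper: line up two x-y data frames on a shared x column
# without calling merge(). Assumes x values are unique within each data
# frame and that the y values sit in the second column.
align.xy <- function(a, b, by = "V1") {
  # Union of all x values, sorted once
  x <- sort(union(a[[by]], b[[by]]))
  out <- data.frame(x)
  names(out) <- by
  # match() returns the row position of each x in a dataset (NA if absent),
  # so indexing with it fills the gaps with NA
  out$y.a <- a[match(x, a[[by]]), 2]
  out$y.b <- b[match(x, b[[by]]), 2]
  out
}

# Small made-up example
a <- data.frame(V1 = c(1, 2, 4), V2 = c(10, 20, 40))
b <- data.frame(V1 = c(2, 3, 4), V2 = c(200, 300, 400))
res <- align.xy(a, b)
```

Note that match() compares x values exactly, so two datasets whose x
values differ only by floating-point noise would not line up without
rounding first.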

HTH,

Marc

On Thu, 2007-02-01 at 16:48 -0500, Christos Hatzis wrote:
> [Sorry I meant to reply to the list]
> 
> Thanks, Marc.
> 
> That's what I have done.
> However, there seems to be a penalty from using merge repeatedly as it
> appears to internally re-sort the datasets.  In my case the datasets are
> long (~35K rows) and already sorted so this step adds considerable and
> unnecessary overhead.  There doesn't seem to be an option for disabling
> sorting. Setting 'sort=F' only affects sorting of the final data.frame.
> 
> > system.time(merge(nmr.spectra.serum[[1]], nmr.spectra.serum[[2]], 
> > by="V1", all=T, sort=T))
> [1] 6.96 0.00 7.24   NA   NA
> > system.time(merge(nmr.spectra.serum[[1]], nmr.spectra.serum[[2]], 
> > by="V1", all=T, sort=F))
> [1] 6.82 0.00 7.14   NA   NA
> > 
> 
> I was wondering if perhaps there is a parallel between this problem and
> methods for lining up time-series data, since such data are also usually
> sorted on the time dimension. 
> 
> -Christos  
> 
> -----Original Message-----
> From: Marc Schwartz [mailto:marc_schwartz at comcast.net] 
> Sent: Thursday, February 01, 2007 4:21 PM
> To: christos at nuverabio.com
> Cc: r-help at stat.math.ethz.ch
> Subject: Re: [R] Lining up x-y datasets based on values of x
> 
> On Thu, 2007-02-01 at 15:45 -0500, Christos Hatzis wrote:
> > Thanks Marc and Phil.
> > 
> > My dataset actually consists of 50+ individual files, so I will have 
> > to do this one column at a time in a loop...
> > I might look into SQL and outer joins as an alternative to avoid looping.
> > 
> > Thanks again.
> > -Christos
> 
> If the files conform to some naming convention and/or are all located in a
> common sub-directory, you can use list.files() to get the file names into a
> vector.  If not, you could use file.choose() interactively.
> 
> Then use either a for() loop or sapply() to loop over the filenames, read
> them into data frames using read.table() and merge them together in the
> same loop.
> 
> When it comes to basic data manipulation like this, loops are not a bad
> thing. The overhead of a loop is typically outweighed by the file I/O and
> related considerations.
> 
> HTH,
> 
> Marc
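
The file-reading loop described in the quoted message could be sketched
roughly as follows. The function name read.and.merge, the ".txt" pattern
and the demo directory are assumptions made for illustration; the sketch
also assumes that each file holds two whitespace-separated columns (x, then
y) with no header line. Two small temporary files stand in for the real
data.

```r
# Hypothetical helper: read each file and merge it into a running result,
# keeping every x value that appears in any file (all = TRUE).
read.and.merge <- function(files, by = "V1") {
  merged <- NULL
  for (f in files) {
    d <- read.table(f)               # assumes two columns, no header
    names(d) <- c(by, basename(f))   # label the y column with its file name
    merged <- if (is.null(merged)) d else merge(merged, d, by = by, all = TRUE)
  }
  merged
}

# Demo: write two small files into a temporary directory
demo.dir <- file.path(tempdir(), "spectra-demo")
dir.create(demo.dir, showWarnings = FALSE)
write.table(data.frame(c(1, 2, 4), c(10, 20, 40)),
            file.path(demo.dir, "a.txt"),
            row.names = FALSE, col.names = FALSE)
write.table(data.frame(c(2, 3, 4), c(200, 300, 400)),
            file.path(demo.dir, "b.txt"),
            row.names = FALSE, col.names = FALSE)

# list.files() collects the file names into a character vector
files <- list.files(demo.dir, pattern = "\\.txt$", full.names = TRUE)
merged <- read.and.merge(files)
```

Each pass through the loop calls merge() once, so the sorting overhead
discussed at the top of this message is paid once per file.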


