[R] Lining up x-y datasets based on values of x
Marc Schwartz
marc_schwartz at comcast.net
Thu Feb 1 23:00:11 CET 2007
Christos,
Hmmmm... according to the Value section in ?merge:
A data frame. The rows are by default lexicographically sorted on the
common columns, but for sort=FALSE are in an unspecified order.
Looking at the code, while there is a lot of time spent on matching
things, the key sort() code seems to be near the end of the function:
    if (sort)
        res <- res[if (all.x || all.y)
                       do.call("order", x[, 1:l.b, drop = FALSE])
                   else sort.list(bx[m$xi]), , drop = FALSE]
I wonder if you could create a local version of merge(), say my.merge(),
without that code and without breaking things. A quick glance suggests
that you might be OK as long as you are not merging on the rownames.
You would want to test that hypothesis, however.
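
One way to test that hypothesis is to compare the output of the modified
function against base merge() on the same inputs, after putting both
results in a common row order. The following is only a sketch: the helper
same_after_sorting() and the toy data frames are made up for illustration,
a single "by" column is assumed, and my.merge() is stood in for by
merge(..., sort = FALSE) rather than by an edited copy of the source.

```r
# Compare a candidate merge function against base merge() on the same inputs.
# Both results are put in a common row order first, because an unsorted
# merge returns rows in an unspecified order.
# Assumption: 'by' names a single key column.
same_after_sorting <- function(candidate, x, y, by) {
  ref <- merge(x, y, by = by, all = TRUE, sort = TRUE)
  alt <- candidate(x, y, by = by, all = TRUE)
  ref <- ref[order(ref[[by]]), , drop = FALSE]
  alt <- alt[order(alt[[by]]), names(ref), drop = FALSE]
  rownames(ref) <- NULL
  rownames(alt) <- NULL
  isTRUE(all.equal(ref, alt))
}

# Stand-in for the hypothetical my.merge(): base merge() with sort = FALSE
my.merge <- function(x, y, ...) merge(x, y, ..., sort = FALSE)

# Toy data (hypothetical): partially overlapping x values in column V1
x <- data.frame(V1 = c(1, 2, 4), a = c(10, 20, 40))
y <- data.frame(V1 = c(2, 3, 4), b = c(200, 300, 400))

ok <- same_after_sorting(my.merge, x, y, by = "V1")
```

Running the same check on a pair of the real datasets would be a more
convincing test than the toy data above.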
HTH,
Marc
On Thu, 2007-02-01 at 16:48 -0500, Christos Hatzis wrote:
> [Sorry I meant to reply to the list]
>
> Thanks, Marc.
>
> That's what I have done.
> However, there seems to be a penalty from using merge repeatedly as it
> appears to internally re-sort the datasets. In my case the datasets are
> long (~35K rows) and already sorted so this step adds considerable and
> unnecessary overhead. There doesn't seem to be an option for disabling
> sorting. Setting 'sort=F' only affects sorting of the final data.frame.
>
> > system.time(merge(nmr.spectra.serum[[1]], nmr.spectra.serum[[2]],
> > by="V1", all=T, sort=T))
> [1] 6.96 0.00 7.24 NA NA
> > system.time(merge(nmr.spectra.serum[[1]], nmr.spectra.serum[[2]],
> > by="V1", all=T, sort=F))
> [1] 6.82 0.00 7.14 NA NA
> >
>
> I was wondering if perhaps there is a parallel between this problem and
> methods for lining up time-series data, since such data are also usually
> sorted on the time dimension.
>
> -Christos
>
> -----Original Message-----
> From: Marc Schwartz [mailto:marc_schwartz at comcast.net]
> Sent: Thursday, February 01, 2007 4:21 PM
> To: christos at nuverabio.com
> Cc: r-help at stat.math.ethz.ch
> Subject: Re: [R] Lining up x-y datasets based on values of x
>
> On Thu, 2007-02-01 at 15:45 -0500, Christos Hatzis wrote:
> > Thanks Marc and Phil.
> >
> > My dataset actually consists of 50+ individual files, so I will have
> > to do this one column at a time in a loop...
> > I might look into SQL and outer joins as an alternative to avoid looping.
> >
> > Thanks again.
> > -Christos
>
> If the files conform to some naming convention and/or are all located in a
> common sub-directory, you can use list.files() to get the file names into a
> vector. If not, you could use file.choose() interactively.
>
> Then use either a for() loop or sapply() to loop over the filenames, read
> them into data frames using read.table() and merge them together in the
> same loop.
>
> When it comes to basic data manipulation like this, loops are not a bad
> thing. The overhead of a loop is typically outweighed by the file I/O and
> related considerations.
>
> HTH,
>
> Marc
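
The list.files() / read.table() / merge() loop described in the quoted
message could look roughly like the sketch below. The directory, the file
names (s1.txt, s2.txt, s3.txt) and the two-column, header-less file layout
are assumptions made so that the example is self-contained; they are not
taken from the original data.

```r
# Hypothetical setup: three small two-column files in a fresh temp directory
demo_dir <- tempfile("spectra_demo")
dir.create(demo_dir)
write.table(data.frame(c(1, 2, 3), c(0.1, 0.2, 0.3)),
            file.path(demo_dir, "s1.txt"), row.names = FALSE, col.names = FALSE)
write.table(data.frame(c(2, 3, 4), c(1.2, 1.3, 1.4)),
            file.path(demo_dir, "s2.txt"), row.names = FALSE, col.names = FALSE)
write.table(data.frame(c(1, 3, 4), c(2.1, 2.3, 2.4)),
            file.path(demo_dir, "s3.txt"), row.names = FALSE, col.names = FALSE)

# Collect the file names into a character vector
files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)

# Read each file and merge it into the running result on the x column (V1)
merged <- NULL
for (f in files) {
  d <- read.table(f)                               # columns default to V1, V2
  names(d)[2] <- sub("\\.txt$", "", basename(f))   # name the y column after the file
  if (is.null(merged)) {
    merged <- d
  } else {
    merged <- merge(merged, d, by = "V1", all = TRUE)  # keep x values from both sides
  }
}

unlink(demo_dir, recursive = TRUE)  # remove the temporary files
```

With all = TRUE, x values missing from a given file show up as NA in that
file's column, so every dataset stays lined up on V1.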