[R] Lining up x-y datasets based on values of x

Prof Brian Ripley ripley at stats.ox.ac.uk
Fri Feb 2 00:34:44 CET 2007


On Thu, 1 Feb 2007, Marc Schwartz wrote:

> Christos,
>
> Hmmmm....according to the Value section in ?merge:
>
> A data frame. The rows are by default lexicographically sorted on the
> common columns, but for sort=FALSE are in an unspecified order.

There is also a sort in the .Internal code.  But I am not buying 
that this is a major part of the time without detailed evidence from 
profiling.  Sorting 35k numbers should take a few milliseconds, and 
less if they are already sorted.

> x <- rnorm(35000)
> system.time(y <- sort(x, method="quick"))
[1] 0.003 0.001 0.004 0.000 0.000
> system.time(sort(y, method="quick"))
[1] 0.002 0.000 0.001 0.000 0.000
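
To illustrate the kind of profiling evidence meant here, a minimal sketch on simulated data follows. The data frames a and b are made up for the example (they are not the NMR spectra from the thread), and the sampling interval is an arbitrary choice. Rprof() and summaryRprof() are the standard R profiling tools; the summary call is wrapped in tryCatch() because a fast run may record no samples at all.

```r
set.seed(1)
n <- 35000

# Two simulated x-y datasets, already sorted on V1, sharing most x values
a <- data.frame(V1 = sort(sample(50000, n)), y1 = rnorm(n))
b <- data.frame(V1 = sort(sample(50000, n)), y2 = rnorm(n))

# Profile a single merge() call
prof_file <- tempfile()
Rprof(prof_file, interval = 0.005)
m <- merge(a, b, by = "V1", all = TRUE)
Rprof(NULL)

# Time spent in each function itself; NULL if nothing was sampled
prof <- tryCatch(summaryRprof(prof_file)$by.self, error = function(e) NULL)
if (!is.null(prof)) print(head(prof))
```

If sorting really dominated, functions such as order or sort.list would be expected near the top of that table.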



> Looking at the code, while there is a lot of time spent on matching
> things, the key sort() code seems to be near the end of the function:
>
>          if (sort)
>            res <- res[if (all.x || all.y)
>                do.call("order", x[, 1:l.b, drop = FALSE])
>            else sort.list(bx[m$xi]), , drop = FALSE]
>
> I wonder if you could create a local version of merge(), say my.merge(),
> without that code and without breaking things. A quick glance suggests
> that as long as you are not merging on the rownames, I think that you
> might be OK. You would want to test that hypothesis however.
>
> HTH,
>
> Marc
>
> On Thu, 2007-02-01 at 16:48 -0500, Christos Hatzis wrote:
>> [Sorry I meant to reply to the list]
>>
>> Thanks, Marc.
>>
>> That's what I have done.
>> However, there seems to be a penalty from using merge repeatedly as it
>> appears to internally re-sort the datasets.  In my case the datasets are
>> long (~35K rows) and already sorted so this step adds considerable and
>> unnecessary overhead.  There doesn't seem to be an option for disabling
>> sorting. Setting 'sort=F' only affects sorting of the final data.frame.
>>
>>> system.time(merge(nmr.spectra.serum[[1]], nmr.spectra.serum[[2]],
>>> by="V1", all=T, sort=T))
>> [1] 6.96 0.00 7.24   NA   NA
>>> system.time(merge(nmr.spectra.serum[[1]], nmr.spectra.serum[[2]],
>>> by="V1", all=T, sort=F))
>> [1] 6.82 0.00 7.14   NA   NA
>>>
>>
>> I was wondering if perhaps there is a parallel between this problem and
>> methods for lining up time-series data, since such data are also usually
>> sorted on the time dimension.
>>
>> -Christos
>>
>> -----Original Message-----
>> From: Marc Schwartz [mailto:marc_schwartz at comcast.net]
>> Sent: Thursday, February 01, 2007 4:21 PM
>> To: christos at nuverabio.com
>> Cc: r-help at stat.math.ethz.ch
>> Subject: Re: [R] Lining up x-y datasets based on values of x
>>
>> On Thu, 2007-02-01 at 15:45 -0500, Christos Hatzis wrote:
>>> Thanks Marc and Phil.
>>>
>>> My dataset actually consists of 50+ individual files, so I will have
>>> to do this one column at a time in a loop...
>>> I might look into SQL and outer joins as an alternative to avoid looping.
>>>
>>> Thanks again.
>>> -Christos
>>
>> If the files conform to some naming convention and/or are all located in a
>> common sub-directory, you can use list.files() to get the file names into a
>> vector.  If not, you could use file.choose() interactively.
>>
>> Then use either a for() loop or sapply() to loop over the filenames, read
>> them into data frames using read.table() and merge them together in the
>> same loop.
>>
>> When it comes to basic data manipulation like this, loops are not a bad
>> thing. The overhead of a loop is typically outweighed by the file I/O and
>> related considerations.
>>
>> HTH,
>>
>> Marc
>

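For lining up many x-y datasets without calling merge() repeatedly, one possible sketch is to take the union of the x values once and fill each y column with match(). The helper align_on_x() below is hypothetical (it is not part of base R), as are the output column names y1, y2, and so on. It assumes the x values are unique within each dataset, compare exactly equal across datasets, and that each data frame has a single y column besides the key.

```r
# Hypothetical helper: align several x-y data frames on the union of
# their x values, using match() instead of repeated merge() calls.
align_on_x <- function(dfs, key = "V1") {
  # All distinct x values across the datasets, sorted once
  x_all <- sort(unique(unlist(lapply(dfs, function(d) d[[key]]))))
  out <- data.frame(x_all)
  names(out) <- key
  for (i in seq_along(dfs)) {
    d <- dfs[[i]]
    idx <- match(x_all, d[[key]])          # NA where this dataset lacks the x
    val_col <- setdiff(names(d), key)[1]   # first non-key column holds y
    out[[paste0("y", i)]] <- d[[val_col]][idx]
  }
  out
}

# Small made-up datasets in the V1/V2 layout that read.table() produces
d1 <- data.frame(V1 = c(1, 2, 4), V2 = c(10, 20, 40))
d2 <- data.frame(V1 = c(2, 3, 4), V2 = c(200, 300, 400))
d3 <- data.frame(V1 = c(1, 4, 5), V2 = c(-1, -4, -5))

aligned <- align_on_x(list(d1, d2, d3))
print(aligned)
```

The list passed to align_on_x() could be built with lapply() over the result of list.files() and read.table(). Whether this is actually faster than merge() on the real data would need to be timed.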
-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
