[R] Lining up x-y datasets based on values of x

Fri Feb 2 06:35:29 CET 2007

Christos,

At least on my system, pulling the data frames out of a list does not
appear to increase the merge time:

DF.X <- data.frame(X = 35000:1, Y = runif(35000))
DF.Y <- data.frame(X = 35000:1, Y = runif(35000))

> system.time(DF.XY <- merge(DF.X, DF.Y, by = "X", all = TRUE))
[1] 0.238 0.012 0.256 0.000 0.000

compared to:

DF.list <- list(DF.X, DF.Y)

> str(DF.list)
List of 2
 $ :'data.frame':	35000 obs. of  2 variables:
  ..$ X: int [1:35000] 35000 34999 34998 34997 34996 34995 34994 34993 34992 34991 ...
  ..$ Y: num [1:35000] 0.720 0.855 0.216 0.817 0.534 ...
 $ :'data.frame':	35000 obs. of  2 variables:
  ..$ X: int [1:35000] 35000 34999 34998 34997 34996 34995 34994 34993 34992 34991 ...
  ..$ Y: num [1:35000] 0.68090 0.00694 0.64235 0.15728 0.27436 ...

> system.time(DF.XY.L <- merge(DF.list[[1]], DF.list[[2]], by = "X", all = TRUE))
[1] 0.251 0.005 0.262 0.000 0.000
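
One difference from your data is that X above is an integer column,
whereas chemical shifts would be non-integer. As a rough sketch only, the
same test could be repeated with a double key. The grid, the 143-point
offset (chosen to leave 32625 common x values, as in your example) and the
names DF.A and DF.B are made up for illustration, and no particular timing
is claimed:

```r
# Hypothetical data: non-integer x values standing in for chemical shifts
set.seed(1)
n <- 32768
grid <- seq(-2, 12, length.out = n + 143)

# Two x axes cut from the same grid, so shared values match exactly
DF.A <- data.frame(V1 = grid[1:n], V2 = runif(n))
DF.B <- data.frame(V1 = grid[144:(n + 143)], V2 = runif(n))

# Same merge call as above, but keyed on a double column
timing <- system.time(DF.AB <- merge(DF.A, DF.B, by = "V1", all = TRUE))
timing

# Rows = union of the two x grids; x values present in only one
# data frame get NA in the other data frame's V2 column
dim(DF.AB)
```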

So I am still confuzzled as to why it is taking 13 seconds on your
system.  I am missing something here.

However, I did note that using merge.zoo() appears to be helpful.
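
A minimal sketch of the merge.zoo() approach, assuming the zoo package is
installed and that each spectrum is stored as a zoo series indexed by its
x values. The data and the names z1, z2 and z.xy are invented for
illustration and are not your actual spectra:

```r
# Sketch only: requires the zoo package (install.packages("zoo"))
library(zoo)

set.seed(42)
n <- 35000

# Hypothetical x grids: two numeric axes that mostly overlap,
# offset from each other by 100 points
grid <- seq(0, 10, length.out = n + 100)
x1 <- grid[1:n]
x2 <- grid[101:(n + 100)]

# Each series is indexed (ordered) by its own x values
z1 <- zoo(runif(n), order.by = x1)
z2 <- zoo(runif(n), order.by = x2)

# all = TRUE keeps the union of x values and fills the gaps with NA
z.xy <- merge(z1, z2, all = TRUE)

dim(z.xy)
head(z.xy)
```

The aligned x values can then be recovered with index(z.xy) and the two
y columns with coredata(z.xy).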

Regards,

Marc

On Thu, 2007-02-01 at 23:36 -0500, Christos Hatzis wrote:
> Marc,
> 
> The data structure is a list of data frames generated from read.table:
> 
> > class(nmr.spectra.serum)
> [1] "list"
> > class(nmr.spectra.serum[[1]])
> [1] "data.frame" 
> > dim(nmr.spectra.serum[[1]])
> [1] 32768     2
> 
> Converting the data.frames to matrices does not have much of an effect on
> timing.
> 
> -Christos
> 
> -----Original Message-----
> From: Marc Schwartz [mailto:marc_schwartz at comcast.net] 
> Sent: Thursday, February 01, 2007 11:06 PM
> To: christos at nuverabio.com
> Cc: 'Prof Brian Ripley'; r-help at stat.math.ethz.ch
> Subject: Re: [R] Lining up x-y datasets based on values of x
> 
> On Thu, 2007-02-01 at 22:46 -0500, Christos Hatzis wrote:
> > Marc,
> > 
> > I don't think the issue is duplicates in the matching columns.  The 
> > data were generated by an instrument (NMR spectrometer), processed by 
> > the instrument's software through an FFT transform and other 
> > transformations and finally reported as a sequence of chemical shift (x)
> > vs intensity (y) pairs.
> > So all x values are unique.  For the example that I reported earlier:
> > 
> > > length(nmr.spectra.serum[[1]]$V1)
> > [1] 32768
> > > length(unique(nmr.spectra.serum[[1]]$V1))
> > [1] 32768
> > > length(nmr.spectra.serum[[2]]$V1)
> > [1] 32768
> > > length(unique(nmr.spectra.serum[[2]]$V1))
> > [1] 32768
> > 
> > And most of the x-values are common:
> > > sum(nmr.spectra.serum[[1]]$V1 %in% nmr.spectra.serum[[2]]$V1)
> > [1] 32625
> > 
> > For this reason, merge is probably overkill for this problem and my 
> > initial thought was to align the datasets through some simple 
> > index-shifting operation.
> > 
> > Profiling of the merge code in my case shows that most of the time is 
> > spent on data frame subsetting operations and on internal merge and 
> > rbind calls secondarily (if I read the summary output correctly).  So 
> > even if most of the time in the internal merge function is spent on 
> > sorting (haven't checked the source code), this is in the worst case a 
> > rather minor effect, as suggested by Prof. Ripley.
> >   
> > > Rprof("merge.out")
> > > zz <- merge(nmr.spectra.serum[[1]], nmr.spectra.serum[[2]], by="V1", all=TRUE, sort=TRUE)
> > > Rprof(NULL)
> > > summaryRprof("merge.out")
> > 

<snip>