[R] Lining up x-y datasets based on values of x
Marc Schwartz
marc_schwartz at comcast.net
Fri Feb 2 06:35:29 CET 2007
Christos,
At least on my system, keeping the data frames in a list does not appear
to increase the timing of merge():
DF.X <- data.frame(X = 35000:1, Y = runif(35000))
DF.Y <- data.frame(X = 35000:1, Y = runif(35000))
> system.time(DF.XY <- merge(DF.X, DF.Y, by = "X", all = TRUE))
[1] 0.238 0.012 0.256 0.000 0.000
compared to:
DF.list <- list(DF.X, DF.Y)
> str(DF.list)
List of 2
$ :'data.frame': 35000 obs. of 2 variables:
..$ X: int [1:35000] 35000 34999 34998 34997 34996 34995 34994 34993 34992 34991 ...
..$ Y: num [1:35000] 0.720 0.855 0.216 0.817 0.534 ...
$ :'data.frame': 35000 obs. of 2 variables:
..$ X: int [1:35000] 35000 34999 34998 34997 34996 34995 34994 34993 34992 34991 ...
..$ Y: num [1:35000] 0.68090 0.00694 0.64235 0.15728 0.27436 ...
> system.time(DF.XY.L <- merge(DF.list[[1]], DF.list[[2]], by = "X", all = TRUE))
[1] 0.251 0.005 0.262 0.000 0.000
So I am still confuzzled as to why it is taking 13 seconds on your
system. I am missing something here.
However, I did note that using merge.zoo() from the zoo package appears
to be helpful.
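Since the x values are unique within each spectrum, another possibility is
to skip merge() altogether and line the y values up by position with
match(). What follows is only a rough sketch under that assumption, not
something I have timed on your data. The helper name align_xy() and the toy
data frames are made up for illustration, and it assumes columns named X and
Y as in my example above. Like merge() and %in%, it relies on exact equality
of the x values.

```r
# Hypothetical helper (not from the thread): align two x-y data frames on
# the union of their X values, assuming X is unique within each data frame
# and that both have columns named X and Y.
align_xy <- function(a, b) {
  x.all <- sort(union(a$X, b$X))        # every x seen in either dataset
  data.frame(
    X   = x.all,
    Y.x = a$Y[match(x.all, a$X)],       # NA where x is absent from a
    Y.y = b$Y[match(x.all, b$X)]        # NA where x is absent from b
  )
}

# Small illustrative data: the two x ranges overlap only partially
DF.A <- data.frame(X = 5:1, Y = c(0.5, 0.4, 0.3, 0.2, 0.1))
DF.B <- data.frame(X = 7:3, Y = c(7, 6, 5, 4, 3))

DF.AB <- align_xy(DF.A, DF.B)
print(DF.AB)
```

The result has one row per distinct x, sorted by x, with NA where a dataset
has no value at that x, which is the same layout that merge(..., all = TRUE)
produces for these inputs.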
Regards,
Marc
On Thu, 2007-02-01 at 23:36 -0500, Christos Hatzis wrote:
> Marc,
>
> The data structure is a list of data frames generated from read.table:
>
> > class(nmr.spectra.serum)
> [1] "list"
> > class(nmr.spectra.serum[[1]])
> [1] "data.frame"
> > dim(nmr.spectra.serum[[1]])
> [1] 32768 2
>
> Converting the data.frames to matrices does not have much of an effect on
> timing.
>
> -Christos
>
> -----Original Message-----
> From: Marc Schwartz [mailto:marc_schwartz at comcast.net]
> Sent: Thursday, February 01, 2007 11:06 PM
> To: christos at nuverabio.com
> Cc: 'Prof Brian Ripley'; r-help at stat.math.ethz.ch
> Subject: Re: [R] Lining up x-y datasets based on values of x
>
> On Thu, 2007-02-01 at 22:46 -0500, Christos Hatzis wrote:
> > Marc,
> >
> > I don't think the issue is duplicates in the matching columns. The
> > data were generated by an instrument (NMR spectrometer), processed by
> > the instrument's software through an FFT transform and other
> > transformations and finally reported as a sequence of chemical shift (x)
> > vs intensity (y) pairs.
> > So all x values are unique. For the example that I reported earlier:
> >
> > > length(nmr.spectra.serum[[1]]$V1)
> > [1] 32768
> > > length(unique(nmr.spectra.serum[[1]]$V1))
> > [1] 32768
> > > length(nmr.spectra.serum[[2]]$V1)
> > [1] 32768
> > > length(unique(nmr.spectra.serum[[2]]$V1))
> > [1] 32768
> >
> > And most of the x-values are common
> > > sum(nmr.spectra.serum[[1]]$V1 %in% nmr.spectra.serum[[2]]$V1)
> > [1] 32625
> >
> > For this reason, merge is probably an overkill for this problem and my
> > initial thought was to align the datasets through some simple
> > index-shifting operation.
> >
> > Profiling of the merge code in my case shows that most of the time is
> > spent on data frame subsetting operations and on internal merge and
> > rbind calls secondarily (if I read the summary output correctly). So
> > even if most of the time in the internal merge function is spent on
> > sorting (haven't checked the source code), this is in the worst case a
> > rather minor effect, as suggested by Prof. Ripley.
> >
> > > Rprof("merge.out")
> > > zz <- merge(nmr.spectra.serum[[1]], nmr.spectra.serum[[2]], by="V1",
> > all=T, sort=T)
> > > Rprof(NULL)
> > > summaryRprof("merge.out")
> >
<snip>