[R] Lining up x-y datasets based on values of x
Marc Schwartz
marc_schwartz at comcast.net
Fri Feb 2 05:06:07 CET 2007
On Thu, 2007-02-01 at 22:46 -0500, Christos Hatzis wrote:
> Marc,
>
> I don't think the issue is duplicates in the matching columns. The data
> were generated by an instrument (NMR spectrometer), processed by the
> instrument's software through an FFT transform and other transformations and
> finally reported as a sequence of chemical shift (x) vs intensity (y) pairs.
> So all x values are unique. For the example that I reported earlier:
>
> > length(nmr.spectra.serum[[1]]$V1)
> [1] 32768
> > length(unique(nmr.spectra.serum[[1]]$V1))
> [1] 32768
> > length(nmr.spectra.serum[[2]]$V1)
> [1] 32768
> > length(unique(nmr.spectra.serum[[2]]$V1))
> [1] 32768
>
> And most of the x-values are common
> > sum(nmr.spectra.serum[[1]]$V1 %in% nmr.spectra.serum[[2]]$V1)
> [1] 32625
>
> For this reason, merge is probably overkill for this problem and my
> initial thought was to align the datasets through some simple index-shifting
> operation.
>
> Profiling of the merge code in my case shows that most of the time is spent
> on data frame subsetting operations and on internal merge and rbind calls
> secondarily (if I read the summary output correctly). So even if most of
> the time in the internal merge function is spent on sorting (haven't checked
> the source code), this is in the worst case a rather minor effect, as
> suggested by Prof. Ripley.
>
> > Rprof("merge.out")
> > zz <- merge(nmr.spectra.serum[[1]], nmr.spectra.serum[[2]], by="V1",
> >             all=TRUE, sort=TRUE)
> > Rprof(NULL)
> > summaryRprof("merge.out")
>
> $by.self
>                        self.time self.pct total.time total.pct
> merge.data.frame            6.56     50.0      11.84      90.2
> [.data.frame                2.42     18.4       3.68      28.0
> merge                       1.28      9.8      13.12     100.0
> rbind                       1.24      9.5       1.36      10.4
> names<-.default             1.16      8.8       1.16       8.8
> row.names<-.data.frame      0.12      0.9       0.18       1.4
> duplicated.default          0.12      0.9       0.12       0.9
> make.unique                 0.10      0.8       0.10       0.8
> data.frame                  0.02      0.2       0.04       0.3
> *                           0.02      0.2       0.02       0.2
> is.na                       0.02      0.2       0.02       0.2
> match                       0.02      0.2       0.02       0.2
> order                       0.02      0.2       0.02       0.2
> unclass                     0.02      0.2       0.02       0.2
> [                           0.00      0.0       3.68      28.0
> do.call                     0.00      0.0       1.18       9.0
> names<-                     0.00      0.0       1.16       8.8
> row.names<-                 0.00      0.0       0.18       1.4
> any                         0.00      0.0       0.14       1.1
> duplicated                  0.00      0.0       0.12       0.9
> cbind                       0.00      0.0       0.04       0.3
> as.vector                   0.00      0.0       0.02       0.2
> seq                         0.00      0.0       0.02       0.2
> seq.default                 0.00      0.0       0.02       0.2
>
> $by.total
>                        total.time total.pct self.time self.pct
> merge                       13.12     100.0      1.28      9.8
> merge.data.frame            11.84      90.2      6.56     50.0
> [.data.frame                 3.68      28.0      2.42     18.4
> [                            3.68      28.0      0.00      0.0
> rbind                        1.36      10.4      1.24      9.5
> do.call                      1.18       9.0      0.00      0.0
> names<-.default              1.16       8.8      1.16      8.8
> names<-                      1.16       8.8      0.00      0.0
> row.names<-.data.frame       0.18       1.4      0.12      0.9
> row.names<-                  0.18       1.4      0.00      0.0
> any                          0.14       1.1      0.00      0.0
> duplicated.default           0.12       0.9      0.12      0.9
> duplicated                   0.12       0.9      0.00      0.0
> make.unique                  0.10       0.8      0.10      0.8
> data.frame                   0.04       0.3      0.02      0.2
> cbind                        0.04       0.3      0.00      0.0
> *                            0.02       0.2      0.02      0.2
> is.na                        0.02       0.2      0.02      0.2
> match                        0.02       0.2      0.02      0.2
> order                        0.02       0.2      0.02      0.2
> unclass                      0.02       0.2      0.02      0.2
> as.vector                    0.02       0.2      0.00      0.0
> seq                          0.02       0.2      0.00      0.0
> seq.default                  0.02       0.2      0.00      0.0
>
> $sampling.time
> [1] 13.12
>
>
> Thanks again for your time in looking into this.
> -Christos

Christos,

Thanks for the follow-up. I thought I had something, but apparently not.

Question: What is the actual structure of the nmr.spectra.serum objects?
The [[ ]] indexing that you are using suggests they are not simple
two-column objects, which may be at least partially the source of the
[.data.frame overhead.
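
For illustration only, here is a minimal sketch of how the structure could
be inspected and how two spectra might be lined up on plain vectors with
match(), which sidesteps data frame subsetting. The objects s1 and s2 are
made-up stand-ins (assumed to be two-column data frames with x in V1 and y
in V2); the real nmr.spectra.serum elements may well be structured
differently, and this has not been timed against merge() on the real data.

```r
# Hypothetical stand-ins for nmr.spectra.serum[[1]] and [[2]];
# the real objects are not available here.
s1 <- data.frame(V1 = c(0.1, 0.2, 0.3, 0.5), V2 = c(10, 20, 30, 50))
s2 <- data.frame(V1 = c(0.2, 0.3, 0.4, 0.5), V2 = c(21, 31, 41, 51))

# Look at the actual structure before assuming anything about it
str(s1)

# Union of the x values from both spectra, in increasing order
x <- sort(union(s1$V1, s2$V1))

# match() gives the position of each x in the original vectors
# (NA where that x is absent), so the y values line up by x
y1 <- s1$V2[match(x, s1$V1)]
y2 <- s2$V2[match(x, s2$V1)]

# Assemble the result once, at the end
aligned <- data.frame(V1 = x, V2.x = y1, V2.y = y2)
print(aligned)
```

Like merge(), this relies on exact equality of the x values, so it only
applies if the shared chemical shifts are bit-for-bit identical in both
spectra.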
Thanks,
Marc