[R] Lining up x-y datasets based on values of x

Fri Feb 2 05:06:07 CET 2007

On Thu, 2007-02-01 at 22:46 -0500, Christos Hatzis wrote:
> Marc,
> 
> I don't think the issue is duplicates in the matching columns.  The data
> were generated by an instrument (NMR spectrometer), processed by the
> instrument's software through an FFT transform and other transformations and
> finally reported as a sequence of chemical shift (x) vs intensity (y) pairs.
> So all x values are unique.  For the example that I reported earlier:
> 
> > length(nmr.spectra.serum[[1]]$V1)
> [1] 32768
> > length(unique(nmr.spectra.serum[[1]]$V1))
> [1] 32768
> > length(nmr.spectra.serum[[2]]$V1)
> [1] 32768
> > length(unique(nmr.spectra.serum[[2]]$V1))
> [1] 32768
> 
> And most of the x-values are common
> > sum(nmr.spectra.serum[[1]]$V1 %in% nmr.spectra.serum[[2]]$V1)
> [1] 32625
> 
> For this reason, merge is probably overkill for this problem and my
> initial thought was to align the datasets through some simple index-shifting
> operation. 
> 
> Profiling of the merge code in my case shows that most of the time is spent
> on data frame subsetting operations and, secondarily, on internal merge and
> rbind calls (if I read the summary output correctly).  So even if most of
> the time in the internal merge function is spent on sorting (haven't checked
> the source code), this is in the worst case a rather minor effect, as
> suggested by Prof. Ripley.
>   
> > Rprof("merge.out")
> > zz <- merge(nmr.spectra.serum[[1]], nmr.spectra.serum[[2]], by="V1",
> all=T, sort=T)
> > Rprof(NULL)
> > summaryRprof("merge.out")
> 
> $by.self
>                        self.time self.pct total.time total.pct
> merge.data.frame            6.56     50.0      11.84      90.2
> [.data.frame                2.42     18.4       3.68      28.0
> merge                       1.28      9.8      13.12     100.0
> rbind                       1.24      9.5       1.36      10.4
> names<-.default             1.16      8.8       1.16       8.8
> row.names<-.data.frame      0.12      0.9       0.18       1.4
> duplicated.default          0.12      0.9       0.12       0.9
> make.unique                 0.10      0.8       0.10       0.8
> data.frame                  0.02      0.2       0.04       0.3
> *                           0.02      0.2       0.02       0.2
> is.na                       0.02      0.2       0.02       0.2
> match                       0.02      0.2       0.02       0.2
> order                       0.02      0.2       0.02       0.2
> unclass                     0.02      0.2       0.02       0.2
> [                           0.00      0.0       3.68      28.0
> do.call                     0.00      0.0       1.18       9.0
> names<-                     0.00      0.0       1.16       8.8
> row.names<-                 0.00      0.0       0.18       1.4
> any                         0.00      0.0       0.14       1.1
> duplicated                  0.00      0.0       0.12       0.9
> cbind                       0.00      0.0       0.04       0.3
> as.vector                   0.00      0.0       0.02       0.2
> seq                         0.00      0.0       0.02       0.2
> seq.default                 0.00      0.0       0.02       0.2
> 
> $by.total
>                        total.time total.pct self.time self.pct
> merge                       13.12     100.0      1.28      9.8
> merge.data.frame            11.84      90.2      6.56     50.0
> [.data.frame                 3.68      28.0      2.42     18.4
> [                            3.68      28.0      0.00      0.0
> rbind                        1.36      10.4      1.24      9.5
> do.call                      1.18       9.0      0.00      0.0
> names<-.default              1.16       8.8      1.16      8.8
> names<-                      1.16       8.8      0.00      0.0
> row.names<-.data.frame       0.18       1.4      0.12      0.9
> row.names<-                  0.18       1.4      0.00      0.0
> any                          0.14       1.1      0.00      0.0
> duplicated.default           0.12       0.9      0.12      0.9
> duplicated                   0.12       0.9      0.00      0.0
> make.unique                  0.10       0.8      0.10      0.8
> data.frame                   0.04       0.3      0.02      0.2
> cbind                        0.04       0.3      0.00      0.0
> *                            0.02       0.2      0.02      0.2
> is.na                        0.02       0.2      0.02      0.2
> match                        0.02       0.2      0.02      0.2
> order                        0.02       0.2      0.02      0.2
> unclass                      0.02       0.2      0.02      0.2
> as.vector                    0.02       0.2      0.00      0.0
> seq                          0.02       0.2      0.00      0.0
> seq.default                  0.02       0.2      0.00      0.0
> 
> $sampling.time
> [1] 13.12
> 
> 
> Thanks again for your time in looking into this.
> -Christos

Christos,

Thanks for the follow-up.  Thought I had something, but apparently not.

Question: What is the actual structure of the nmr.spectra.serum objects?
The indexing you use (nmr.spectra.serum[[1]]$V1) suggests they are not
simple two-column objects, which may be at least partially the source of
the [.data.frame overhead.
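
As a rough sketch of what I have in mind: str() would show the structure,
and if each element really is a plain two-column data frame with unique
x values, the alignment could be done with match() on bare vectors rather
than through merge().  The toy data frames s1 and s2 and the column name
V2 below are made up for illustration, since I have not seen the real
objects; only V1 appears in your output.  Like %in%, match() compares the
x values exactly.

```r
# Hypothetical stand-ins for two spectra (x = V1, y = V2); not the real data.
s1 <- data.frame(V1 = c(1, 2, 3, 4), V2 = c(10, 20, 30, 40))
s2 <- data.frame(V1 = c(2, 3, 4, 5), V2 = c(21, 31, 41, 51))
spectra <- list(s1, s2)

# Inspect one element: class, column names and column types.
str(spectra[[1]])

# Union of all x values, sorted (comparable to merge(..., all = TRUE, sort = TRUE)).
x <- sort(union(s1$V1, s2$V1))

# match() gives the position of each x in a spectrum, or NA if it is absent;
# indexing the y vector with those positions fills the gaps with NA.
aligned <- data.frame(V1 = x,
                      y1 = s1$V2[match(x, s1$V1)],
                      y2 = s2$V2[match(x, s2$V1)])
aligned
```

This assumes the x values are unique within each spectrum, as you have
already checked.  It works on plain vectors throughout, so it should
sidestep most of the [.data.frame calls that show up in the profile.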

Thanks,

Marc