[R] Lining up x-y datasets based on values of x

Fri Feb 2 05:36:47 CET 2007

Marc,

The data structure is a list of data frames generated from read.table:

> class(nmr.spectra.serum)
[1] "list"
> class(nmr.spectra.serum[[1]])
[1] "data.frame" 
> dim(nmr.spectra.serum[[1]])
[1] 32768     2

Converting the data.frames to matrices does not have much of an effect on
timing.
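
One possible reason, sketched below on synthetic data rather than the real spectra (m1, m2 and res are made-up names): merge() has no matrix method, so matrices go through merge.default(), which converts both arguments with as.data.frame() and then calls the data frame method anyway.

```r
# Synthetic stand-ins for two spectra (not the real NMR data)
set.seed(1)
x  <- seq(0, 10, length.out = 1000)
m1 <- cbind(V1 = x[1:990],   V2 = rnorm(990))
m2 <- cbind(V1 = x[11:1000], V2 = rnorm(990))

# No merge method exists for matrices, so this dispatches to merge.default(),
# which coerces both inputs with as.data.frame() before merging
res <- merge(m1, m2, by = "V1", all = TRUE)

class(res)   # the result is a data frame even though the inputs were matrices
dim(res)
```

If that is what happens here, the [.data.frame and names<- costs in the profile would be expected to remain after the conversion.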

-Christos

-----Original Message-----
From: Marc Schwartz [mailto:marc_schwartz at comcast.net] 
Sent: Thursday, February 01, 2007 11:06 PM
To: christos at nuverabio.com
Cc: 'Prof Brian Ripley'; r-help at stat.math.ethz.ch
Subject: Re: [R] Lining up x-y datasets based on values of x

On Thu, 2007-02-01 at 22:46 -0500, Christos Hatzis wrote:
> Marc,
> 
> I don't think the issue is duplicates in the matching columns.  The 
> data were generated by an instrument (NMR spectrometer), processed by 
> the instrument's software through an FFT transform and other 
> transformations and finally reported as a sequence of chemical shift
> (x) vs intensity (y) pairs.
> So all x values are unique.  For the example that I reported earlier:
> 
> > length(nmr.spectra.serum[[1]]$V1)
> [1] 32768
> > length(unique(nmr.spectra.serum[[1]]$V1))
> [1] 32768
> > length(nmr.spectra.serum[[2]]$V1)
> [1] 32768
> > length(unique(nmr.spectra.serum[[2]]$V1))
> [1] 32768
> 
> And most of the x-values are common to both spectra:
> > sum(nmr.spectra.serum[[1]]$V1 %in% nmr.spectra.serum[[2]]$V1)
> [1] 32625
> 
> For this reason, merge is probably overkill for this problem and my 
> initial thought was to align the datasets through some simple 
> index-shifting operation.
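
A minimal sketch of one such alignment, using match() on the sorted union of the x values instead of merge(). The helper name align_spectra and its arguments are hypothetical, and it assumes the x values are unique within each spectrum and compare exactly equal across spectra, as the %in% count above suggests for most points.

```r
# Hypothetical helper: align a list of two-column (x, y) data frames on the
# sorted union of their x values, filling positions with no match with NA
align_spectra <- function(spectra, xcol = 1, ycol = 2) {
  all_x <- sort(unique(unlist(lapply(spectra, function(s) s[[xcol]]))))
  # For each spectrum, look up every x in the union; unmatched x give NA
  ys <- lapply(spectra, function(s) s[[ycol]][match(all_x, s[[xcol]])])
  out <- data.frame(x = all_x, do.call(cbind, ys))
  names(out) <- c("x", paste0("y", seq_along(spectra)))
  out
}

# Synthetic example (not the real NMR data)
s1 <- data.frame(V1 = c(1, 2, 3, 4), V2 = c(10, 20, 30, 40))
s2 <- data.frame(V1 = c(2, 3, 4, 5), V2 = c(200, 300, 400, 500))
aligned <- align_spectra(list(s1, s2))
aligned
```

This works on plain vectors until the final data.frame() call, so it avoids the repeated data frame subsetting that shows up in the merge profile. Whether it is actually faster on the 32768-point spectra would need to be timed.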
> 
> Profiling of the merge code in my case shows that most of the time is 
> spent on data frame subsetting operations and on internal merge and 
> rbind calls secondarily (if I read the summary output correctly).  So 
> even if most of the time in the internal merge function is spent on 
> sorting (haven't checked the source code), this is in the worst case a 
> rather minor effect, as suggested by Prof. Ripley.
>   
> > Rprof("merge.out")
> > zz <- merge(nmr.spectra.serum[[1]], nmr.spectra.serum[[2]], by="V1",
> >             all=T, sort=T)
> > Rprof(NULL)
> > summaryRprof("merge.out")
> 
> $by.self
>                        self.time self.pct total.time total.pct
> merge.data.frame            6.56     50.0      11.84      90.2
> [.data.frame                2.42     18.4       3.68      28.0
> merge                       1.28      9.8      13.12     100.0
> rbind                       1.24      9.5       1.36      10.4
> names<-.default             1.16      8.8       1.16       8.8
> row.names<-.data.frame      0.12      0.9       0.18       1.4
> duplicated.default          0.12      0.9       0.12       0.9
> make.unique                 0.10      0.8       0.10       0.8
> data.frame                  0.02      0.2       0.04       0.3
> *                           0.02      0.2       0.02       0.2
> is.na                       0.02      0.2       0.02       0.2
> match                       0.02      0.2       0.02       0.2
> order                       0.02      0.2       0.02       0.2
> unclass                     0.02      0.2       0.02       0.2
> [                           0.00      0.0       3.68      28.0
> do.call                     0.00      0.0       1.18       9.0
> names<-                     0.00      0.0       1.16       8.8
> row.names<-                 0.00      0.0       0.18       1.4
> any                         0.00      0.0       0.14       1.1
> duplicated                  0.00      0.0       0.12       0.9
> cbind                       0.00      0.0       0.04       0.3
> as.vector                   0.00      0.0       0.02       0.2
> seq                         0.00      0.0       0.02       0.2
> seq.default                 0.00      0.0       0.02       0.2
> 
> $by.total
>                        total.time total.pct self.time self.pct
> merge                       13.12     100.0      1.28      9.8
> merge.data.frame            11.84      90.2      6.56     50.0
> [.data.frame                 3.68      28.0      2.42     18.4
> [                            3.68      28.0      0.00      0.0
> rbind                        1.36      10.4      1.24      9.5
> do.call                      1.18       9.0      0.00      0.0
> names<-.default              1.16       8.8      1.16      8.8
> names<-                      1.16       8.8      0.00      0.0
> row.names<-.data.frame       0.18       1.4      0.12      0.9
> row.names<-                  0.18       1.4      0.00      0.0
> any                          0.14       1.1      0.00      0.0
> duplicated.default           0.12       0.9      0.12      0.9
> duplicated                   0.12       0.9      0.00      0.0
> make.unique                  0.10       0.8      0.10      0.8
> data.frame                   0.04       0.3      0.02      0.2
> cbind                        0.04       0.3      0.00      0.0
> *                            0.02       0.2      0.02      0.2
> is.na                        0.02       0.2      0.02      0.2
> match                        0.02       0.2      0.02      0.2
> order                        0.02       0.2      0.02      0.2
> unclass                      0.02       0.2      0.02      0.2
> as.vector                    0.02       0.2      0.00      0.0
> seq                          0.02       0.2      0.00      0.0
> seq.default                  0.02       0.2      0.00      0.0
> 
> $sampling.time
> [1] 13.12
> 
> 
> Thanks again for your time in looking into this.
> -Christos

Christos,

Thanks for the follow-up.  Thought I had something, but apparently not.

Question: What is the actual structure of the nmr.spectra.serum objects?
The indexing approach that you are using suggests they are not simple
two-column objects, which may be at least partially the source of the
[.data.frame overhead.
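
Something along these lines would answer it. This is only a sketch on a made-up list (spectra, is_df and n_cols are placeholder names), to be run against nmr.spectra.serum instead:

```r
# Synthetic stand-in for a list of spectra read with read.table()
spectra <- list(
  data.frame(V1 = c(1.0, 1.1, 1.2), V2 = c(5, 6, 7)),
  data.frame(V1 = c(1.1, 1.2, 1.3), V2 = c(8, 9, 10))
)

# Column names and types of the first element
str(spectra[[1]])

# Check every element, not just the first
is_df  <- vapply(spectra, is.data.frame, logical(1))
n_cols <- vapply(spectra, ncol, integer(1))

# TRUE only if every element is a data frame with exactly two columns
all(is_df) && all(n_cols == 2L)
```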

Thanks,

Marc