[R] Lining up x-y datasets based on values of x
Christos Hatzis
christos at nuverabio.com
Fri Feb 2 04:46:27 CET 2007
Marc,
I don't think the issue is duplicates in the matching columns. The data
were generated by an instrument (an NMR spectrometer), processed by the
instrument's software through an FFT and other transformations, and
finally reported as a sequence of chemical shift (x) vs. intensity (y) pairs.
So all x values are unique. For the example that I reported earlier:
> length(nmr.spectra.serum[[1]]$V1)
[1] 32768
> length(unique(nmr.spectra.serum[[1]]$V1))
[1] 32768
> length(nmr.spectra.serum[[2]]$V1)
[1] 32768
> length(unique(nmr.spectra.serum[[2]]$V1))
[1] 32768
And most of the x values are common:
> sum(nmr.spectra.serum[[1]]$V1 %in% nmr.spectra.serum[[2]]$V1)
[1] 32625
For this reason, merge is probably overkill for this problem, and my
initial thought was to align the datasets through some simple index-shifting
operation (something along the lines of the sketch below).
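Just to illustrate the kind of thing I have in mind (a rough sketch only;
align.spectra is a made-up helper, and I am assuming each spectrum is a data
frame with V1 = chemical shift and V2 = intensity):

# Rough sketch of the index-matching idea (align.spectra is a made-up name).
# Assumes each spectrum has columns V1 (chemical shift) and V2 (intensity).
align.spectra <- function(s1, s2) {
    x <- sort(union(s1$V1, s2$V1))            # combined grid of chemical shifts
    data.frame(V1 = x,
               y1 = s1$V2[match(x, s1$V1)],   # NA where a shift is absent in s1
               y2 = s2$V2[match(x, s2$V1)])   # NA where a shift is absent in s2
}
# zz <- align.spectra(nmr.spectra.serum[[1]], nmr.spectra.serum[[2]])

Working with match() on the plain vectors would avoid most of the data frame
handling that merge does.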
Profiling the merge call in my case shows that most of the time is spent
on data frame subsetting operations, and secondarily on the internal merge and
rbind calls (if I read the summary output correctly). So even if most of
the time in the internal merge function is spent on sorting (I haven't checked
the source code), it is at worst a rather minor effect, as suggested by
Prof. Ripley.
> Rprof("merge.out")
> zz <- merge(nmr.spectra.serum[[1]], nmr.spectra.serum[[2]],
+             by="V1", all=T, sort=T)
> Rprof(NULL)
> summaryRprof("merge.out")
$by.self
self.time self.pct total.time total.pct
merge.data.frame 6.56 50.0 11.84 90.2
[.data.frame 2.42 18.4 3.68 28.0
merge 1.28 9.8 13.12 100.0
rbind 1.24 9.5 1.36 10.4
names<-.default 1.16 8.8 1.16 8.8
row.names<-.data.frame 0.12 0.9 0.18 1.4
duplicated.default 0.12 0.9 0.12 0.9
make.unique 0.10 0.8 0.10 0.8
data.frame 0.02 0.2 0.04 0.3
* 0.02 0.2 0.02 0.2
is.na 0.02 0.2 0.02 0.2
match 0.02 0.2 0.02 0.2
order 0.02 0.2 0.02 0.2
unclass 0.02 0.2 0.02 0.2
[ 0.00 0.0 3.68 28.0
do.call 0.00 0.0 1.18 9.0
names<- 0.00 0.0 1.16 8.8
row.names<- 0.00 0.0 0.18 1.4
any 0.00 0.0 0.14 1.1
duplicated 0.00 0.0 0.12 0.9
cbind 0.00 0.0 0.04 0.3
as.vector 0.00 0.0 0.02 0.2
seq 0.00 0.0 0.02 0.2
seq.default 0.00 0.0 0.02 0.2
$by.total
total.time total.pct self.time self.pct
merge 13.12 100.0 1.28 9.8
merge.data.frame 11.84 90.2 6.56 50.0
[.data.frame 3.68 28.0 2.42 18.4
[ 3.68 28.0 0.00 0.0
rbind 1.36 10.4 1.24 9.5
do.call 1.18 9.0 0.00 0.0
names<-.default 1.16 8.8 1.16 8.8
names<- 1.16 8.8 0.00 0.0
row.names<-.data.frame 0.18 1.4 0.12 0.9
row.names<- 0.18 1.4 0.00 0.0
any 0.14 1.1 0.00 0.0
duplicated.default 0.12 0.9 0.12 0.9
duplicated 0.12 0.9 0.00 0.0
make.unique 0.10 0.8 0.10 0.8
data.frame 0.04 0.3 0.02 0.2
cbind 0.04 0.3 0.00 0.0
* 0.02 0.2 0.02 0.2
is.na 0.02 0.2 0.02 0.2
match 0.02 0.2 0.02 0.2
order 0.02 0.2 0.02 0.2
unclass 0.02 0.2 0.02 0.2
as.vector 0.02 0.2 0.00 0.0
seq 0.02 0.2 0.00 0.0
seq.default 0.02 0.2 0.00 0.0
$sampling.time
[1] 13.12
Thanks again for your time in looking into this.
-Christos
-----Original Message-----
From: Marc Schwartz [mailto:marc_schwartz at comcast.net]
Sent: Thursday, February 01, 2007 9:59 PM
To: Prof Brian Ripley
Cc: r-help at stat.math.ethz.ch; christos at nuverabio.com
Subject: Re: [R] Lining up x-y datasets based on values of x
On Thu, 2007-02-01 at 23:34 +0000, Prof Brian Ripley wrote:
> On Thu, 1 Feb 2007, Marc Schwartz wrote:
>
> > Christos,
> >
> > Hmmmm....according to the Value section in ?merge:
> >
> > A data frame. The rows are by default lexicographically sorted on
> > the common columns, but for sort=FALSE are in an unspecified order.
>
> There is also a sort in the .Internal code. But I am not buying that
> this is a major part of the time without detailed evidence from
> profiling. Sorting 35k numbers should take a few milliseconds, and
> less if they are already sorted.
>
> > x <- rnorm(35000)
> > system.time(y <- sort(x, method="quick"))
> [1] 0.003 0.001 0.004 0.000 0.000
> > system.time(sort(y, method="quick"))
> [1] 0.002 0.000 0.001 0.000 0.000
Having had a chance to mock up some examples, I would have to agree with
Prof. Ripley on this point.
Presuming that we are not missing something about the nature of Christos'
data sets, here are 4 examples, with rows sorted in ascending order,
descending order, reversed sort order and random order. In theory, the
descending order example should, I believe, represent a worst case scenario,
since reverse sorting a sorted list is typically slowest. However, note that
there is not much time variation below, and running each of the examples
several times did not result in material differences across runs.
1. Ascending order
DF.X <- data.frame(X = 1:35000, Y = runif(35000))
DF.Y <- data.frame(X = 1:35000, Y = runif(35000))
> system.time(DF.XY <- merge(DF.X, DF.Y, by = "X", all = TRUE))
[1] 0.249 0.004 0.264 0.000 0.000
2. Descending order
DF.X <- data.frame(X = 35000:1, Y = runif(35000))
DF.Y <- data.frame(X = 35000:1, Y = runif(35000))
> system.time(DF.XY <- merge(DF.X, DF.Y, by = "X", all = TRUE))
[1] 0.300 0.007 0.309 0.000 0.000
3. Reversed sort order
DF.X <- data.frame(X = 35000:1, Y = runif(35000))
DF.Y <- data.frame(X = 1:35000, Y = runif(35000))
> system.time(DF.XY <- merge(DF.X, DF.Y, by = "X", all = TRUE))
[1] 0.236 0.008 0.245 0.000 0.000
4. Random order
DF.X <- data.frame(X = sample(35000), Y = runif(35000))
DF.Y <- data.frame(X = sample(35000), Y = runif(35000))
> system.time(DF.XY <- merge(DF.X, DF.Y, by = "X", all = TRUE))
[1] 0.339 0.016 0.357 0.000 0.000
Spending some time profiling the descending order example, we get:
> summaryRprof()
$by.self
self.time self.pct total.time total.pct
"duplicated.default" 0.16 38.1 0.16 38.1
"match" 0.08 19.0 0.08 19.0
"sort.list" 0.08 19.0 0.08 19.0
"[.data.frame" 0.04 9.5 0.24 57.1
"merge.data.frame" 0.02 4.8 0.42 100.0
"names.default" 0.02 4.8 0.02 4.8
"seq_len" 0.02 4.8 0.02 4.8
"merge" 0.00 0.0 0.42 100.0
"[" 0.00 0.0 0.24 57.1
"any" 0.00 0.0 0.18 42.9
"duplicated" 0.00 0.0 0.18 42.9
"cbind" 0.00 0.0 0.04 9.5
"data.frame" 0.00 0.0 0.04 9.5
"data.row.names" 0.00 0.0 0.02 4.8
"names" 0.00 0.0 0.02 4.8
"row.names<-" 0.00 0.0 0.02 4.8
"row.names<-.data.frame" 0.00 0.0 0.02 4.8
$by.total
total.time total.pct self.time self.pct
"merge.data.frame" 0.42 100.0 0.02 4.8
"merge" 0.42 100.0 0.00 0.0
"[.data.frame" 0.24 57.1 0.04 9.5
"[" 0.24 57.1 0.00 0.0
"any" 0.18 42.9 0.00 0.0
"duplicated" 0.18 42.9 0.00 0.0
"duplicated.default" 0.16 38.1 0.16 38.1
"match" 0.08 19.0 0.08 19.0
"sort.list" 0.08 19.0 0.08 19.0
"cbind" 0.04 9.5 0.00 0.0
"data.frame" 0.04 9.5 0.00 0.0
"names.default" 0.02 4.8 0.02 4.8
"seq_len" 0.02 4.8 0.02 4.8
"data.row.names" 0.02 4.8 0.00 0.0
"names" 0.02 4.8 0.00 0.0
"row.names<-" 0.02 4.8 0.00 0.0
"row.names<-.data.frame" 0.02 4.8 0.00 0.0
$sampling.time
[1] 0.42
The above suggests that a meaningful amount of time is spent in checking for
and dealing with duplicates in the common ('by') columns. To that
end:
DF.X <- data.frame(X = sample(10000, 35000, replace = TRUE), Y = runif(35000))
DF.Y <- data.frame(X = sample(10000, 35000, replace = TRUE), Y = runif(35000))
> system.time(DF.XY <- merge(DF.X, DF.Y, by = "X", all = TRUE))
[1] 3.316 0.148 3.502 0.000 0.000
So, it would seem that introducing duplicate values into vectors of the same
size (35,000 values drawn from only 10,000 possibilities) does indeed
materially affect the time required:
> summaryRprof()
$by.self
self.time self.pct total.time total.pct
"duplicated.default" 0.86 27.6 0.86 27.6
"make.unique" 0.76 24.4 0.76 24.4
"[.data.frame" 0.72 23.1 1.70 54.5
"data.frame" 0.18 5.8 0.38 12.2
"row.names<-.data.frame" 0.14 4.5 0.36 11.5
"names<-.default" 0.14 4.5 0.14 4.5
"merge.data.frame" 0.08 2.6 3.12 100.0
"order" 0.06 1.9 0.06 1.9
"rbind" 0.04 1.3 0.44 14.1
"[[<-.data.frame" 0.04 1.3 0.04 1.3
"match" 0.04 1.3 0.04 1.3
"unclass" 0.04 1.3 0.04 1.3
"unlist" 0.02 0.6 0.02 0.6
"merge" 0.00 0.0 3.12 100.0
"[" 0.00 0.0 1.70 54.5
"any" 0.00 0.0 0.86 27.6
"duplicated" 0.00 0.0 0.86 27.6
"cbind" 0.00 0.0 0.38 12.2
"row.names<-" 0.00 0.0 0.36 11.5
"do.call" 0.00 0.0 0.18 5.8
"names<-" 0.00 0.0 0.14 4.5
"data.row.names" 0.00 0.0 0.10 3.2
"[[<-" 0.00 0.0 0.04 1.3
$by.total
total.time total.pct self.time self.pct
"merge.data.frame" 3.12 100.0 0.08 2.6
"merge" 3.12 100.0 0.00 0.0
"[.data.frame" 1.70 54.5 0.72 23.1
"[" 1.70 54.5 0.00 0.0
"duplicated.default" 0.86 27.6 0.86 27.6
"any" 0.86 27.6 0.00 0.0
"duplicated" 0.86 27.6 0.00 0.0
"make.unique" 0.76 24.4 0.76 24.4
"rbind" 0.44 14.1 0.04 1.3
"data.frame" 0.38 12.2 0.18 5.8
"cbind" 0.38 12.2 0.00 0.0
"row.names<-.data.frame" 0.36 11.5 0.14 4.5
"row.names<-" 0.36 11.5 0.00 0.0
"do.call" 0.18 5.8 0.00 0.0
"names<-.default" 0.14 4.5 0.14 4.5
"names<-" 0.14 4.5 0.00 0.0
"data.row.names" 0.10 3.2 0.00 0.0
"order" 0.06 1.9 0.06 1.9
"[[<-.data.frame" 0.04 1.3 0.04 1.3
"match" 0.04 1.3 0.04 1.3
"unclass" 0.04 1.3 0.04 1.3
"[[<-" 0.04 1.3 0.00 0.0
"unlist" 0.02 0.6 0.02 0.6
$sampling.time
[1] 3.12
I would also point out the following:
> str(DF.XY)
'data.frame': 124704 obs. of 3 variables:
$ X : int 1 1 1 1 1 1 1 1 1 1 ...
$ Y.x: num 0.7233 0.7233 0.7233 0.0577 0.0577 ...
$ Y.y: num 0.805 0.742 0.324 0.805 0.742 ...
Note that the resultant data frame is NOT 35,000 rows as a consequence of
the multiple matches. Presumably, this gets worse with each subsequent
merge back to the prior result. For example:
> system.time(DF.XY2 <- merge(DF.XY, DF.Y, by = "X", all = TRUE))
[1] 12.270 1.173 13.624 0.000 0.000
> str(DF.XY2)
'data.frame': 557116 obs. of 4 variables:
$ X : int 1 1 1 1 1 1 1 1 1 1 ...
$ Y.x: num 0.00213 0.00213 0.00213 0.85017 0.85017 ...
$ Y.y: num 0.324 0.324 0.324 0.324 0.324 ...
$ Y : num 0.742 0.324 0.805 0.742 0.324 ...
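To see where those row counts come from: for a key that appears n1 times in
DF.X and n2 times in DF.Y, merge() emits n1 * n2 rows for that key, and with
all = TRUE a key absent from one side still contributes its rows once, filled
with NA. A quick sanity check along those lines (just a sketch, reusing the
objects above) would be:

keys <- sort(union(DF.X$X, DF.Y$X))
n1 <- table(factor(DF.X$X, levels = keys))   # per-key counts in DF.X
n2 <- table(factor(DF.Y$X, levels = keys))   # per-key counts in DF.Y
# treat a zero count as 1, since all = TRUE keeps unmatched rows (NA-filled)
sum(pmax(as.integer(n1), 1) * pmax(as.integer(n2), 1))  # should equal nrow(DF.XY)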
I may be extrapolating beyond known data here, but Christos, the above
suggests that your data sets have some proportion of duplicate values in the
columns that you are using for matching.
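A quick way to check would be something along these lines (just a sketch;
substitute your own data frame and matching column for DF and X):

any(duplicated(DF$X))   # TRUE if the matching column contains duplicates
sum(duplicated(DF$X))   # how many duplicated entries there are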
HTH,
Marc Schwartz