[R] Lining up x-y datasets based on values of x
Marc Schwartz
marc_schwartz at comcast.net
Fri Feb 2 03:58:48 CET 2007
On Thu, 2007-02-01 at 23:34 +0000, Prof Brian Ripley wrote:
> On Thu, 1 Feb 2007, Marc Schwartz wrote:
>
> > Christos,
> >
> > Hmmmm....according to the Value section in ?merge:
> >
> > A data frame. The rows are by default lexicographically sorted on the
> > common columns, but for sort=FALSE are in an unspecified order.
>
> There is also a sort in the .Internal code. But I am not buying
> that this is a major part of the time without detailed evidence from
> profiling. Sorting 35k numbers should take a few milliseconds, and
> less if they are already sorted.
>
> > x <- rnorm(35000)
> > system.time(y <- sort(x, method="quick"))
> [1] 0.003 0.001 0.004 0.000 0.000
> > system.time(sort(y, method="quick"))
> [1] 0.002 0.000 0.001 0.000 0.000
Having had a chance to mock up some examples, I would have to agree with
Prof. Ripley on this point.
Presuming that we are not missing something about the nature of
Christos' data sets, here are 4 examples, with rows sorted in ascending
order, descending order, reversed sort order and random order. In
theory, the descending order example should, I believe, represent a
worst cast scenario, since reverse sorting a sorted list is typically
slowest. However, note that there is not much time variation below and
running each of the examples several times resulted in material
differences across runs.
1. Ascending order
DF.X <- data.frame(X = 1:35000, Y = runif(35000))
DF.Y <- data.frame(X = 1:35000, Y = runif(35000))
> system.time(DF.XY <- merge(DF.X, DF.Y, by = "X", all = TRUE))
[1] 0.249 0.004 0.264 0.000 0.000
2. Descending order
DF.X <- data.frame(X = 35000:1, Y = runif(35000))
DF.Y <- data.frame(X = 35000:1, Y = runif(35000))
> system.time(DF.XY <- merge(DF.X, DF.Y, by = "X", all = TRUE))
[1] 0.300 0.007 0.309 0.000 0.000
3. Reversed sort order
DF.X <- data.frame(X = 35000:1, Y = runif(35000))
DF.Y <- data.frame(X = 1:35000, Y = runif(35000))
> system.time(DF.XY <- merge(DF.X, DF.Y, by = "X", all = TRUE))
[1] 0.236 0.008 0.245 0.000 0.000
4. Random order
DF.X <- data.frame(X = sample(35000), Y = runif(35000))
DF.Y <- data.frame(X = sample(35000), Y = runif(35000))
> system.time(DF.XY <- merge(DF.X, DF.Y, by = "X", all = TRUE))
[1] 0.339 0.016 0.357 0.000 0.000
Spending some time looking at profiling the descending order example, we
get:
> summaryRprof()
$by.self
self.time self.pct total.time total.pct
"duplicated.default" 0.16 38.1 0.16 38.1
"match" 0.08 19.0 0.08 19.0
"sort.list" 0.08 19.0 0.08 19.0
"[.data.frame" 0.04 9.5 0.24 57.1
"merge.data.frame" 0.02 4.8 0.42 100.0
"names.default" 0.02 4.8 0.02 4.8
"seq_len" 0.02 4.8 0.02 4.8
"merge" 0.00 0.0 0.42 100.0
"[" 0.00 0.0 0.24 57.1
"any" 0.00 0.0 0.18 42.9
"duplicated" 0.00 0.0 0.18 42.9
"cbind" 0.00 0.0 0.04 9.5
"data.frame" 0.00 0.0 0.04 9.5
"data.row.names" 0.00 0.0 0.02 4.8
"names" 0.00 0.0 0.02 4.8
"row.names<-" 0.00 0.0 0.02 4.8
"row.names<-.data.frame" 0.00 0.0 0.02 4.8
$by.total
total.time total.pct self.time self.pct
"merge.data.frame" 0.42 100.0 0.02 4.8
"merge" 0.42 100.0 0.00 0.0
"[.data.frame" 0.24 57.1 0.04 9.5
"[" 0.24 57.1 0.00 0.0
"any" 0.18 42.9 0.00 0.0
"duplicated" 0.18 42.9 0.00 0.0
"duplicated.default" 0.16 38.1 0.16 38.1
"match" 0.08 19.0 0.08 19.0
"sort.list" 0.08 19.0 0.08 19.0
"cbind" 0.04 9.5 0.00 0.0
"data.frame" 0.04 9.5 0.00 0.0
"names.default" 0.02 4.8 0.02 4.8
"seq_len" 0.02 4.8 0.02 4.8
"data.row.names" 0.02 4.8 0.00 0.0
"names" 0.02 4.8 0.00 0.0
"row.names<-" 0.02 4.8 0.00 0.0
"row.names<-.data.frame" 0.02 4.8 0.00 0.0
$sampling.time
[1] 0.42
The above suggests that a meaningful amount of time is spent in checking
for and dealing with duplicates in the common ('by') columns. To that
end:
DF.X <- data.frame(X = sample(10000, 35000, replace = TRUE), Y = runif(35000))
DF.Y <- data.frame(X = sample(10000, 35000, replace = TRUE), Y = runif(35000))
> system.time(DF.XY <- merge(DF.X, DF.Y, by = "X", all = TRUE))
[1] 3.316 0.148 3.502 0.000 0.000
So, it would seem that introducing duplicate values in the same sized
vector space does indeed materially affect the time required:
> summaryRprof()
$by.self
self.time self.pct total.time total.pct
"duplicated.default" 0.86 27.6 0.86 27.6
"make.unique" 0.76 24.4 0.76 24.4
"[.data.frame" 0.72 23.1 1.70 54.5
"data.frame" 0.18 5.8 0.38 12.2
"row.names<-.data.frame" 0.14 4.5 0.36 11.5
"names<-.default" 0.14 4.5 0.14 4.5
"merge.data.frame" 0.08 2.6 3.12 100.0
"order" 0.06 1.9 0.06 1.9
"rbind" 0.04 1.3 0.44 14.1
"[[<-.data.frame" 0.04 1.3 0.04 1.3
"match" 0.04 1.3 0.04 1.3
"unclass" 0.04 1.3 0.04 1.3
"unlist" 0.02 0.6 0.02 0.6
"merge" 0.00 0.0 3.12 100.0
"[" 0.00 0.0 1.70 54.5
"any" 0.00 0.0 0.86 27.6
"duplicated" 0.00 0.0 0.86 27.6
"cbind" 0.00 0.0 0.38 12.2
"row.names<-" 0.00 0.0 0.36 11.5
"do.call" 0.00 0.0 0.18 5.8
"names<-" 0.00 0.0 0.14 4.5
"data.row.names" 0.00 0.0 0.10 3.2
"[[<-" 0.00 0.0 0.04 1.3
$by.total
total.time total.pct self.time self.pct
"merge.data.frame" 3.12 100.0 0.08 2.6
"merge" 3.12 100.0 0.00 0.0
"[.data.frame" 1.70 54.5 0.72 23.1
"[" 1.70 54.5 0.00 0.0
"duplicated.default" 0.86 27.6 0.86 27.6
"any" 0.86 27.6 0.00 0.0
"duplicated" 0.86 27.6 0.00 0.0
"make.unique" 0.76 24.4 0.76 24.4
"rbind" 0.44 14.1 0.04 1.3
"data.frame" 0.38 12.2 0.18 5.8
"cbind" 0.38 12.2 0.00 0.0
"row.names<-.data.frame" 0.36 11.5 0.14 4.5
"row.names<-" 0.36 11.5 0.00 0.0
"do.call" 0.18 5.8 0.00 0.0
"names<-.default" 0.14 4.5 0.14 4.5
"names<-" 0.14 4.5 0.00 0.0
"data.row.names" 0.10 3.2 0.00 0.0
"order" 0.06 1.9 0.06 1.9
"[[<-.data.frame" 0.04 1.3 0.04 1.3
"match" 0.04 1.3 0.04 1.3
"unclass" 0.04 1.3 0.04 1.3
"[[<-" 0.04 1.3 0.00 0.0
"unlist" 0.02 0.6 0.02 0.6
$sampling.time
[1] 3.12
I would also point out the following:
> str(DF.XY)
'data.frame': 124704 obs. of 3 variables:
$ X : int 1 1 1 1 1 1 1 1 1 1 ...
$ Y.x: num 0.7233 0.7233 0.7233 0.0577 0.0577 ...
$ Y.y: num 0.805 0.742 0.324 0.805 0.742 ...
Note that the resultant data frame is NOT 35,000 rows as a consequence
of the multiple matches. Presumably, this gets worse with each
subsequent merge back to the prior result. For example:
> system.time(DF.XY2 <- merge(DF.XY, DF.Y, by = "X", all = TRUE))
[1] 12.270 1.173 13.624 0.000 0.000
> str(DF.XY2)
'data.frame': 557116 obs. of 4 variables:
$ X : int 1 1 1 1 1 1 1 1 1 1 ...
$ Y.x: num 0.00213 0.00213 0.00213 0.85017 0.85017 ...
$ Y.y: num 0.324 0.324 0.324 0.324 0.324 ...
$ Y : num 0.742 0.324 0.805 0.742 0.324 ...
I may be extrapolating beyond known data here, but Christos, the above
suggests that your data sets have some proportion of duplicate values in
the columns that you are using for matching.
HTH,
Marc Schwartz
More information about the R-help
mailing list