[R] Efficient means to link two data frames
Ben Tupper
btupper at bigelow.org
Thu Sep 29 16:10:20 CEST 2016
Hi,
I have a solution based upon findInterval() which depends upon the ordered nature of the 'Day' column. I can't speak to whether or not it is efficient but it is handy. I love the findInterval() function but have often wished it works with look up tables in descending order. This function, find_interval(), is my first reasonable pass at working with descending order.
https://gist.github.com/btupper/5fc6cc4e7d86f39f9e4f
## start
source("find_interval.R")
x <- structure(list(Day = c(1L, 2L, 3L, 5L, 6L, 7L, 8L, 9L, 10L, 11L,
13L, 14L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L,
27L, 28L, 29L, 30L), Value = c(76, 116, 111, 103, 114, 99, 128,
96, 81, 84, 81, 108, 109, 106, 125, 128, 92, 90, 83, 89, 76,
89, 101, 93, 98, 77, 92)), .Names = c("Day", "Value"), row.names = c(NA,
-27L), class = "data.frame")
TESTDAYS <- data.frame(TestDay = c(4, 11, 15))
x <- x[rev(1:nrow(x)),]
ix <- find_interval(TESTDAYS[,'TestDay'], x[,'Day'])
TESTDAYS[,'TestValue'] <- x[ix,'Value']
TESTDAYS
# TestDay TestValue
# 1 4 103
# 2 11 84
# 3 15 109
### end
Will that do the trick with your large dataset?
Ben
> On Sep 29, 2016, at 9:38 AM, Dennis Fisher <fisher at plessthan.com> wrote:
>
> R 3.3.1
> OS X
>
> Colleagues,
>
> I have two large data frames that I am trying to link efficiently. A small example is as follows:
>
> structure(list(Day = c(1L, 2L, 3L, 5L, 6L, 7L, 8L, 9L, 10L, 11L,
> 13L, 14L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L,
> 27L, 28L, 29L, 30L), Value = c(76, 116, 111, 103, 114, 99, 128,
> 96, 81, 84, 81, 108, 109, 106, 125, 128, 92, 90, 83, 89, 76,
> 89, 101, 93, 98, 77, 92)), .Names = c("Day", "Value"), row.names = c(NA,
> -27L), class = "data.frame")
>
> which becomes:
> Day Value
> 1 1 76
> 2 2 116
> 3 3 111
> 4 5 103
> 5 6 114
> 6 7 99
> 7 8 128
> 8 9 96
> 9 10 81
> 10 11 84
> 11 13 81
> 12 14 108
> 13 16 109
> 14 17 106
> 15 18 125
> 16 19 128
> 17 20 92
> 18 21 90
> 19 22 83
> 20 23 89
> 21 24 76
> 22 25 89
> 23 26 101
> 24 27 93
> 25 28 98
> 26 29 77
> 27 30 92
>
> The second dataframe is merely:
> TESTDAYS <- data.frame(TestDay = c(4, 11, 15))
>
> For each row in the second dataframe, I would like to identify the first row in the first dataframe in which Day is >= TestDay.
> For example, for TestDay == 4, Day would equal 5. I would then append the corresponding “Value” in the TestValue column
> The result would be:
> TestDay TestValue
> 1 4 103
> 2 11 84
> 3 15 109
>
> I can accomplish this with brute force but I suspect that there is some clever day to vectorize this. Any help would be appreciated.
>
> Dennis
>
> Dennis Fisher MD
> P < (The "P Less Than" Company)
> Phone / Fax: 1-866-PLessThan (1-866-753-7784)
> www.PLessThan.com <http://www.plessthan.com/>
>
>
>
>
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
Ben Tupper
Bigelow Laboratory for Ocean Sciences
60 Bigelow Drive, P.O. Box 380
East Boothbay, Maine 04544
http://www.bigelow.org
More information about the R-help
mailing list