[R] Best way to do temporal joins in R?
Gabor Grothendieck
ggrothendieck at gmail.com
Tue Mar 17 13:13:16 CET 2009
I am assuming that for each species and station_id that you want
the value in that temp record whose date/time is closest to the
date/time in the species record along with the identifying information
(species, station_id) and date/time of the species. That interpretation
does give the same answer as in the fused data set you posted.
First we read in temp and use chron to convert the date/times to
chron. Similarly we do that for species.
Then we define a function which measures the "distance" between
two date/times and we define another function f which takes a
species rowname and merges that row with temp. Finally
we call lapply that function to species.
library(chron)
temp <- read.csv("temperature_data_Rexample.csv")
temp$dt <- as.chron(paste(temp$date, temp$hour), "%Y%m%d %H%M")
species <- read.csv("species_data_Rexample.csv")
ds <- species$Date_Sampled
species$dt <- chron(sub(" .*", "", ds), gsub("[apm]+$|^.* ", "", ds)) +
(regexpr("pm", ds) > 0)/2 # add half a day if its pm
mydist <- function(x, y) abs(as.numeric(x-y))
f <- function(r) {
s <- species[r, ]
out <- by(temp, temp$station_id, function(x) {
imin <- which.min(mydist(x$dt, s$dt))
data.frame(Species = s$Species, Date = s$dt,
station_id = x[imin, "station_id"], value = x[imin, "value"])
})
do.call(rbind, out)
}
do.call(rbind, lapply(rownames(species), f))
Result of last line is:
Species Date station_id value
1 SpeciesB (06/23/08 13:55:11) ANH 2.25
2 SpeciesA (06/23/08 13:43:11) ANH 2.25
3 SpeciesC (06/23/08 13:55:11) ANH 2.25
4 SpeciesB (06/23/08 13:55:11) BDT 3.82
5 SpeciesA (06/23/08 13:43:11) BDT 3.90
6 SpeciesC (06/23/08 13:55:11) BDT 3.82
On Mon, Mar 16, 2009 at 7:41 PM, Jonathan Greenberg
<greenberg at ucdavis.edu> wrote:
> Weird -- the email was sent through my gmail account, looks like the .csvs
> got intercepted somewhere along the way. At any rate, I placed them on a
> website:
>
> http://cstars.ucdavis.edu/~jongreen/temp/temporal_join_R/
>
> --j
>
> Gabor Grothendieck wrote:
>>
>> There was nothing attached.
>>
>> On Mon, Mar 16, 2009 at 3:11 PM, Jonathan Greenberg
>> <greenberg at ucdavis.edu> wrote:
>>
>>>
>>> Sorry for the immediate follow-up, but Phil Spector correctly reminded me
>>> this is a lot easier for the community I provide some sample data, so I'm
>>> attaching 3 small CSVs to this email:
>>>
>>> species_data_Rexample.csv contains the "field data" (which species was
>>> ID'd
>>> and what time it was ID'd),
>>> temperature_data_Rexample.csv contains the date, time, station ID and the
>>> temperature "value"
>>>
>>> I'd like a dataframe which contains for each unique line in
>>> species_data_Rexample.csv, a series of lines, one per station, and the
>>> temperature of the nearest time stamp, or an interpolated value (weighted
>>> average would be fine, but so would just grabbing the nearest value), so
>>> for
>>> this example I'd like something that looks like the csv
>>> "fused_data_Rexample.csv"
>>>
>>> Thanks!
>>>
>>> --j
>>>
>>> Jonathan Greenberg wrote:
>>>
>>>>
>>>> I've been playing with zoo a bit, and it seems ok except it doesn't
>>>> support non-unique time stamps when performing joins. I have two
>>>> databases
>>>> which contain a dataframe of a Date object (with the time, not just
>>>> MM/DD/YY), e.g.:
>>>>
>>>> DB 1:
>>>> UniqueID,Date1,Data 1,Data 2
>>>>
>>>> DB 2:
>>>> Date2, Station, Data 3
>>>>
>>>> We'll say Station can contain three values: A,B and C
>>>>
>>>> DB 1 may have some repeat times, and DB 2 definitely has them --
>>>> although
>>>> each Date, Station combo is unique (this DB contains weather data
>>>> collected
>>>> on the half-hour or fifteen minute interval at a set of stations). I'd
>>>> like
>>>> DB2's station and Data3 to be joined with DB1 based on the nearest time
>>>> stamp (interpolating Data3 or not).
>>>>
>>>> Ideally, I'd like a fused database such that I get for each uniqueID in
>>>> DB1:
>>>>
>>>> UniqueID,Date,Data1,Data2,Station,Data3
>>>>
>>>> Thoughts? Hints?
>>>>
>>>> --j
>>>>
>>>>
>>>>
>>>>
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>>
>>>
>
> --
>
> Jonathan A. Greenberg, PhD
> Postdoctoral Scholar
> Center for Spatial Technologies and Remote Sensing (CSTARS)
> University of California, Davis
> One Shields Avenue
> The Barn, Room 250N
> Davis, CA 95616
> Cell: 415-794-5043
> AIM: jgrn307, MSN: jgrn307 at hotmail.com, Gchat: jgrn307
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
More information about the R-help
mailing list