[R] Best way to do temporal joins in R?

Gabor Grothendieck ggrothendieck at gmail.com
Tue Mar 17 14:01:50 CET 2009


In thinking about this some more an SQL solution would be a
bit easier.  Try this where species and temp are from
the prior post (i.e. after processing with chron).

The first statement finds the minimum distances for
each species and station_id combination.  The
second finds all combinations of a record in species
with a record in temp, i.e. outer join.  The last sql
statement merges those two over Species, station_id
and dist.  Note that value is a keyword in sql so
its name is automatically transformed to value__1 to
avoid a conflict.  At the end we fix up the dt column
so its chron again.

library(sqldf)

mins <- sqldf("select s.Species, s.dt, t.station_id, min(abs(s.dt - t.dt)) dist
	from species2 s, temp2 t group by species, station_id")

every <- sqldf("select s.Species, s.dt, t.station_id, t.value__1,
	abs(s.dt - t.dt) dist
	from species2 s, temp2 t")

merged <- sqldf("select Species, e.dt, station_id, value__1
	from mins join every e
	using(Species, station_id, dist)")

merged <- transform(merged, dt = chron(dt))


On Tue, Mar 17, 2009 at 8:13 AM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
> I am assuming that for each species and station_id that you want
> the value in that temp record whose date/time is closest to the
> date/time in the species record along with the identifying information
> (species, station_id) and date/time of the species.   That interpretation
> does give the same answer as in the fused data set you posted.
>
> First we read in temp and use chron to convert the date/times to
> chron.  Similarly we do that for species.
>
> Then we define a function which measures the "distance" between
> two date/times and we define another function f which takes a
> species rowname and merges that row with temp.  Finally
> we call lapply that function to species.
>
> library(chron)
>
> temp <- read.csv("temperature_data_Rexample.csv")
> temp$dt <- as.chron(paste(temp$date, temp$hour), "%Y%m%d %H%M")
>
> species <- read.csv("species_data_Rexample.csv")
> ds <- species$Date_Sampled
> species$dt <- chron(sub(" .*", "", ds), gsub("[apm]+$|^.* ", "", ds)) +
>        (regexpr("pm", ds) > 0)/2  # add half a day if its pm
>
> mydist <- function(x, y) abs(as.numeric(x-y))
>
> f <- function(r) {
>        s <- species[r, ]
>        out <- by(temp, temp$station_id, function(x) {
>                imin <- which.min(mydist(x$dt, s$dt))
>                data.frame(Species = s$Species, Date = s$dt,
>                        station_id = x[imin, "station_id"], value = x[imin, "value"])
>        })
>        do.call(rbind, out)
> }
>
> do.call(rbind, lapply(rownames(species), f))
>
> Result of last line is:
>
>   Species                Date station_id value
> 1 SpeciesB (06/23/08 13:55:11)        ANH  2.25
> 2 SpeciesA (06/23/08 13:43:11)        ANH  2.25
> 3 SpeciesC (06/23/08 13:55:11)        ANH  2.25
> 4 SpeciesB (06/23/08 13:55:11)        BDT  3.82
> 5 SpeciesA (06/23/08 13:43:11)        BDT  3.90
> 6 SpeciesC (06/23/08 13:55:11)        BDT  3.82
>
>
>
> On Mon, Mar 16, 2009 at 7:41 PM, Jonathan Greenberg
> <greenberg at ucdavis.edu> wrote:
>> Weird -- the  email was sent through my gmail account, looks like the .csvs
>> got intercepted somewhere along the way.  At any rate, I placed them on a
>> website:
>>
>> http://cstars.ucdavis.edu/~jongreen/temp/temporal_join_R/
>>
>> --j
>>
>> Gabor Grothendieck wrote:
>>>
>>> There was nothing attached.
>>>
>>> On Mon, Mar 16, 2009 at 3:11 PM, Jonathan Greenberg
>>> <greenberg at ucdavis.edu> wrote:
>>>
>>>>
>>>> Sorry for the immediate follow-up, but Phil Spector correctly reminded me
>>>> this is a lot easier for the community I provide some sample data, so I'm
>>>> attaching 3 small CSVs to this email:
>>>>
>>>> species_data_Rexample.csv contains the "field data" (which species was
>>>> ID'd
>>>> and what time it was ID'd),
>>>> temperature_data_Rexample.csv contains the date, time, station ID and the
>>>> temperature "value"
>>>>
>>>> I'd like a dataframe which contains for each unique line in
>>>> species_data_Rexample.csv, a series of lines, one per station, and the
>>>> temperature of the nearest time stamp, or an interpolated value (weighted
>>>> average would be fine, but so would just grabbing the nearest value), so
>>>> for
>>>> this example I'd like something that looks like the csv
>>>> "fused_data_Rexample.csv"
>>>>
>>>> Thanks!
>>>>
>>>> --j
>>>>
>>>> Jonathan Greenberg wrote:
>>>>
>>>>>
>>>>> I've been playing with zoo a bit, and it seems ok except it doesn't
>>>>> support non-unique time stamps when performing joins.  I have two
>>>>> databases
>>>>> which contain a dataframe of a Date object (with the time, not just
>>>>> MM/DD/YY), e.g.:
>>>>>
>>>>> DB 1:
>>>>> UniqueID,Date1,Data 1,Data 2
>>>>>
>>>>> DB 2:
>>>>> Date2, Station, Data 3
>>>>>
>>>>> We'll say Station can contain three values: A,B and C
>>>>>
>>>>> DB 1 may have some repeat times, and DB 2 definitely has them --
>>>>> although
>>>>> each Date, Station combo is unique (this DB contains weather data
>>>>> collected
>>>>> on the half-hour or fifteen minute interval at a set of stations).  I'd
>>>>> like
>>>>> DB2's station and Data3 to be joined with DB1 based on the nearest time
>>>>> stamp (interpolating Data3 or not).
>>>>>
>>>>> Ideally, I'd like a fused database such that I get for each uniqueID in
>>>>> DB1:
>>>>>
>>>>> UniqueID,Date,Data1,Data2,Station,Data3
>>>>>
>>>>> Thoughts?  Hints?
>>>>>
>>>>> --j
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>>
>>>>
>>
>> --
>>
>> Jonathan A. Greenberg, PhD
>> Postdoctoral Scholar
>> Center for Spatial Technologies and Remote Sensing (CSTARS)
>> University of California, Davis
>> One Shields Avenue
>> The Barn, Room 250N
>> Davis, CA 95616
>> Cell: 415-794-5043
>> AIM: jgrn307, MSN: jgrn307 at hotmail.com, Gchat: jgrn307
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>



More information about the R-help mailing list