[R] Best way to do temporal joins in R?
Gabor Grothendieck
ggrothendieck at gmail.com
Tue Mar 17 14:01:50 CET 2009
In thinking about this some more an SQL solution would be a
bit easier. Try this where species and temp are from
the prior post (i.e. after processing with chron).
The first statement finds the minimum distances for
each species and station_id combination. The
second finds all combinations of a record in species
with a record in temp, i.e. outer join. The last sql
statement merges those two over Species, station_id
and dist. Note that value is a keyword in sql so
its name is automatically transformed to value__1 to
avoid a conflict. At the end we fix up the dt column
so its chron again.
library(sqldf)
mins <- sqldf("select s.Species, s.dt, t.station_id, min(abs(s.dt - t.dt)) dist
from species2 s, temp2 t group by species, station_id")
every <- sqldf("select s.Species, s.dt, t.station_id, t.value__1,
abs(s.dt - t.dt) dist
from species2 s, temp2 t")
merged <- sqldf("select Species, e.dt, station_id, value__1
from mins join every e
using(Species, station_id, dist)")
merged <- transform(merged, dt = chron(dt))
On Tue, Mar 17, 2009 at 8:13 AM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
> I am assuming that for each species and station_id that you want
> the value in that temp record whose date/time is closest to the
> date/time in the species record along with the identifying information
> (species, station_id) and date/time of the species. That interpretation
> does give the same answer as in the fused data set you posted.
>
> First we read in temp and use chron to convert the date/times to
> chron. Similarly we do that for species.
>
> Then we define a function which measures the "distance" between
> two date/times and we define another function f which takes a
> species rowname and merges that row with temp. Finally
> we call lapply that function to species.
>
> library(chron)
>
> temp <- read.csv("temperature_data_Rexample.csv")
> temp$dt <- as.chron(paste(temp$date, temp$hour), "%Y%m%d %H%M")
>
> species <- read.csv("species_data_Rexample.csv")
> ds <- species$Date_Sampled
> species$dt <- chron(sub(" .*", "", ds), gsub("[apm]+$|^.* ", "", ds)) +
> (regexpr("pm", ds) > 0)/2 # add half a day if its pm
>
> mydist <- function(x, y) abs(as.numeric(x-y))
>
> f <- function(r) {
> s <- species[r, ]
> out <- by(temp, temp$station_id, function(x) {
> imin <- which.min(mydist(x$dt, s$dt))
> data.frame(Species = s$Species, Date = s$dt,
> station_id = x[imin, "station_id"], value = x[imin, "value"])
> })
> do.call(rbind, out)
> }
>
> do.call(rbind, lapply(rownames(species), f))
>
> Result of last line is:
>
> Species Date station_id value
> 1 SpeciesB (06/23/08 13:55:11) ANH 2.25
> 2 SpeciesA (06/23/08 13:43:11) ANH 2.25
> 3 SpeciesC (06/23/08 13:55:11) ANH 2.25
> 4 SpeciesB (06/23/08 13:55:11) BDT 3.82
> 5 SpeciesA (06/23/08 13:43:11) BDT 3.90
> 6 SpeciesC (06/23/08 13:55:11) BDT 3.82
>
>
>
> On Mon, Mar 16, 2009 at 7:41 PM, Jonathan Greenberg
> <greenberg at ucdavis.edu> wrote:
>> Weird -- the email was sent through my gmail account, looks like the .csvs
>> got intercepted somewhere along the way. At any rate, I placed them on a
>> website:
>>
>> http://cstars.ucdavis.edu/~jongreen/temp/temporal_join_R/
>>
>> --j
>>
>> Gabor Grothendieck wrote:
>>>
>>> There was nothing attached.
>>>
>>> On Mon, Mar 16, 2009 at 3:11 PM, Jonathan Greenberg
>>> <greenberg at ucdavis.edu> wrote:
>>>
>>>>
>>>> Sorry for the immediate follow-up, but Phil Spector correctly reminded me
>>>> this is a lot easier for the community I provide some sample data, so I'm
>>>> attaching 3 small CSVs to this email:
>>>>
>>>> species_data_Rexample.csv contains the "field data" (which species was
>>>> ID'd
>>>> and what time it was ID'd),
>>>> temperature_data_Rexample.csv contains the date, time, station ID and the
>>>> temperature "value"
>>>>
>>>> I'd like a dataframe which contains for each unique line in
>>>> species_data_Rexample.csv, a series of lines, one per station, and the
>>>> temperature of the nearest time stamp, or an interpolated value (weighted
>>>> average would be fine, but so would just grabbing the nearest value), so
>>>> for
>>>> this example I'd like something that looks like the csv
>>>> "fused_data_Rexample.csv"
>>>>
>>>> Thanks!
>>>>
>>>> --j
>>>>
>>>> Jonathan Greenberg wrote:
>>>>
>>>>>
>>>>> I've been playing with zoo a bit, and it seems ok except it doesn't
>>>>> support non-unique time stamps when performing joins. I have two
>>>>> databases
>>>>> which contain a dataframe of a Date object (with the time, not just
>>>>> MM/DD/YY), e.g.:
>>>>>
>>>>> DB 1:
>>>>> UniqueID,Date1,Data 1,Data 2
>>>>>
>>>>> DB 2:
>>>>> Date2, Station, Data 3
>>>>>
>>>>> We'll say Station can contain three values: A,B and C
>>>>>
>>>>> DB 1 may have some repeat times, and DB 2 definitely has them --
>>>>> although
>>>>> each Date, Station combo is unique (this DB contains weather data
>>>>> collected
>>>>> on the half-hour or fifteen minute interval at a set of stations). I'd
>>>>> like
>>>>> DB2's station and Data3 to be joined with DB1 based on the nearest time
>>>>> stamp (interpolating Data3 or not).
>>>>>
>>>>> Ideally, I'd like a fused database such that I get for each uniqueID in
>>>>> DB1:
>>>>>
>>>>> UniqueID,Date,Data1,Data2,Station,Data3
>>>>>
>>>>> Thoughts? Hints?
>>>>>
>>>>> --j
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>>
>>>>
>>
>> --
>>
>> Jonathan A. Greenberg, PhD
>> Postdoctoral Scholar
>> Center for Spatial Technologies and Remote Sensing (CSTARS)
>> University of California, Davis
>> One Shields Avenue
>> The Barn, Room 250N
>> Davis, CA 95616
>> Cell: 415-794-5043
>> AIM: jgrn307, MSN: jgrn307 at hotmail.com, Gchat: jgrn307
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
More information about the R-help
mailing list