[R] Help with aggregate and cor
James Marca
jmarca at translab.its.uci.edu
Wed Mar 10 19:39:55 CET 2010
Hi Ista,
Many thanks, the plyr package was just what I needed.
Because I did such a bad job with my question (no data, etc etc), here
is my current solution:
First, I grabbed my data from PostgreSQL as follows:
library('RPostgreSQL')
m <- dbDriver("PostgreSQL")
con <-
dbConnect(m,user="user",password="yourpassword",host="super.secret.host.edu",dbname="yourdb")
rs <- dbSendQuery(con, "select vds_id, to_char(ts,'Dy') as dow, date_trunc('hour'::text, ts) as tshour, n1,o1, n2,o2, n3,o3, n4,o4, n5,o5 from data_raw where vds_id=1201087 and (ts>'Mar 1, 2007' and ts<'Apr 1, 2007 00:00:00')")
df.i1 <- fetch(rs, n = -1)
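(As an aside, the two-step dbSendQuery()/fetch() can be collapsed into a single dbGetQuery() call. A minimal sketch, run against an in-memory SQLite table standing in for data_raw, since the PostgreSQL connection details above are site-specific; the RPostgreSQL connection works the same way through the DBI interface:)

```r
# Sketch: dbGetQuery() sends the query and fetches all rows in one call.
# Demonstrated on a tiny fake data_raw table in SQLite so the example
# stands alone; substitute the RPostgreSQL connection in real use.
library(DBI)
library(RSQLite)

con2 <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con2, "data_raw",
             data.frame(vds_id = c(1201087, 1201087, 9999999),
                        n1 = c(6, 8, 3),
                        o1 = c(0.0373, 0.0486, 0.0200)))

df.demo <- dbGetQuery(con2, "select * from data_raw where vds_id = 1201087")
nrow(df.demo)  # 2: only the two rows matching the vds_id filter

dbDisconnect(con2)
```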
This time period grabs 88,000-odd records. The data look like this:
summary(df.i1)
vds_id dow tshour
Min. :1201087 Length:88070 Min. :2007-03-01 00:00:00
1st Qu.:1201087 Class :character 1st Qu.:2007-03-08 19:00:00
Median :1201087 Mode :character Median :2007-03-16 15:00:00
Mean :1201087 Mean :2007-03-16 13:51:31
3rd Qu.:1201087 3rd Qu.:2007-03-24 08:00:00
Max. :1201087 Max. :2007-03-31 23:00:00
ts n1 o1
Min. :2007-03-01 00:00:30 Min. : 0.000 Min. :0.00000
1st Qu.:2007-03-08 19:09:37 1st Qu.: 3.000 1st Qu.:0.01620
Median :2007-03-16 15:08:45 Median : 8.000 Median :0.04860
Mean :2007-03-16 14:21:18 Mean : 8.147 Mean :0.05024
3rd Qu.:2007-03-24 08:25:52 3rd Qu.:12.000 3rd Qu.:0.07330
Max. :2007-03-31 23:59:30 Max. :35.000 Max. :0.79440
n2 o2 n3 o3
Min. : 0.000 Min. :0.00000 Min. : 0.000 Min. :0.00000
1st Qu.: 4.000 1st Qu.:0.02160 1st Qu.: 3.000 1st Qu.:0.01940
Median : 8.000 Median :0.04580 Median : 6.000 Median :0.03900
Mean : 7.268 Mean :0.04584 Mean : 5.682 Mean :0.04178
3rd Qu.:10.000 3rd Qu.:0.06370 3rd Qu.: 8.000 3rd Qu.:0.05910
Max. :27.000 Max. :0.77750 Max. :27.000 Max. :0.63060
n4 o4 n5 o5
Min. : 0.000 Min. :0.00000 Min. : 0.000 Min. :0.00000
1st Qu.: 2.000 1st Qu.:0.01450 1st Qu.: 1.000 1st Qu.:0.00640
Median : 5.000 Median :0.03400 Median : 3.000 Median :0.02040
Mean : 5.418 Mean :0.03811 Mean : 3.706 Mean :0.03085
3rd Qu.: 8.000 3rd Qu.:0.05510 3rd Qu.: 6.000 3rd Qu.:0.04530
Max. :28.000 Max. :0.59930 Max. :27.000 Max. :0.66210
df.i1[1,]
vds_id dow tshour ts n1 o1 n2 o2 n3 o3 n4 o4 n5 o5
1 1201087 Thu 2007-03-01 2007-03-01 00:00:30 6 0.0373 5 0.0291 2 0.0217 2 0.0109 1 0.0086
The 'n_i' values are vehicle counts in lane i on a freeway in 30 seconds,
and the 'o_i' are occupancy values for the lane over those 30 seconds,
ranging from 0 (nothing over the sensor) to 1 (somebody parked over
the sensor). I want to isolate just those time periods when n and o
are highly correlated, as I'm exploring whether that indicates free flow
traffic conditions.
My final function, after hacking around a bit last night, looks like this:
cor.dat <- function(df) {
  # cor.test() returns an "htest" list for each lane; c() flattens the
  # five results into one long named vector, hence names like
  # "l1.estimate" used below
  cor.l1 <- cor.test(df$n1, df$o1)
  cor.l2 <- cor.test(df$n2, df$o2)
  cor.l3 <- cor.test(df$n3, df$o3)
  cor.l4 <- cor.test(df$n4, df$o4)
  cor.l5 <- cor.test(df$n5, df$o5)
  c(l1 = cor.l1, l2 = cor.l2, l3 = cor.l3, l4 = cor.l4, l5 = cor.l5)
}
(I'm not really sure of the best way to handle that return value, but
at least this works for what I want; see below.)
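(One way to tidy that return value, an untested-against-the-real-data sketch with a function name, cor.est, that I'm inventing here: keep only the five correlation estimates and return a one-row data frame, so that plyr's ddply() can stack the hourly results into a single data frame instead of a list.)

```r
# Sketch: return just the estimate for each lane as a one-row data
# frame with columns l1..l5; ddply() then binds the per-hour rows into
# a data frame keyed by tshour.
cor.est <- function(df) {
  est <- sapply(1:5, function(i) {
    cor(df[[paste("n", i, sep = "")]], df[[paste("o", i, sep = "")]])
  })
  names(est) <- paste("l", 1:5, sep = "")
  as.data.frame(as.list(est))
}

# quick check on fake, perfectly correlated data (lane 5 reversed):
demo <- data.frame(n1 = 1:10, o1 = 1:10, n2 = 1:10, o2 = 1:10,
                   n3 = 1:10, o3 = 1:10, n4 = 1:10, o4 = 1:10,
                   n5 = 1:10, o5 = 10:1)
cor.est(demo)  # l1..l4 are 1, l5 is -1

# with the real data:
# library(plyr)
# hourly.est <- ddply(df.i1, "tshour", cor.est)
```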
Run that function hourly with plyr:
output.hourly <- dlply(df.i1,"tshour",cor.dat)
(Because I used c(...), the output is ugly, but I can pull pieces out
with a small helper:)
get.cor <- function(l, w) {
  l[w]
}
So to get the "estimate" of the parameter for lane 1:
g <- lapply(output.hourly,get.cor,"l1.estimate")
g[1]
$`2007-03-01 00:00:00`
$`2007-03-01 00:00:00`$l1.estimate
cor
0.9845006
plot(unlist(g))
etc.
Super ugly, but I'm slowly remembe-R-ing.
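(A slightly less ugly extraction: sapply() simplifies straight to a named numeric vector, skipping the lapply()/unlist() pair. Sketched here on a mocked-up stand-in for output.hourly, with made-up estimate values, so the example stands alone:)

```r
# Sketch: pull one component out of every per-hour result with
# sapply(), which returns a plain named numeric vector ready for
# plot(). output.hourly is mocked with two fake hours here; the real
# one comes from the dlply() call above.
output.hourly <- list(
  "2007-03-01 00:00:00" = c(l1.estimate = 0.9845006, l1.p.value = 1e-10),
  "2007-03-01 01:00:00" = c(l1.estimate = 0.9512000, l1.p.value = 2e-09)
)

g <- sapply(output.hourly, function(x) x[["l1.estimate"]])
g  # named numeric vector, one lane-1 estimate per hour

# plot(g, type = "l")
```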
Again thanks a lot for the tip.
Regards,
James
On Tue, Mar 09, 2010 at 10:24:05PM -0500, Ista Zahn wrote:
> Hi James,
> It would really help if you gave us a sample of the data you are
> working with. The following is not tested, because I don't have your
> data and am too lazy to construct a similar example dataset for you,
> but it might get you started.
>
> You can try using a for loop along the lines of
>
> output <- data.frame(obsfivemin = obsfivemin,
>                      fivemin.cor = vector(length = length(obsfivemin)))
> for (f in fivemin){
>   output$fivemin.cor[obsfivemin==f] <- cor(df$v[obsfivemin==f],
>                                            df$o[obsfivemin==f])
> }
>
> Or you can try with the plyr package something like
>
> cor.dat <- function(df) {
> cor(df[,c("v", "o")])
> }
>
> library(plyr)
> dlply(df, "obsfivemin", cor.dat)
>
> Good luck,
> Ista
>
>
> On Tue, Mar 9, 2010 at 9:36 PM, James Marca <jmarca at translab.its.uci.edu> wrote:
> > Hello,
> >
> > I do not understand the correct way to approach the following problem
> > in R.
> >
> > I have observations of pairs of variables, v1, o1, v2, o2, etc,
> > observed every 30 seconds. What I would like to do is compute the
> > correlation matrix, but not for all my data, just for, say 5 minutes
> > or 1 hour chunks.
> >
> > In sql, what I would say is
> >
> > select id, date_trunc('hour'::text, ts) as tshour, corr(n1,o1) as corr1
> > from raw30s
> > where id = 1201087 and
> > (ts between 'Mar 1, 2007' and 'Apr 1, 2007')
> > group by id,tshour order by id,tshour;
> >
> >
> > I've pulled data from PostgreSQL into R, and have a dataframe
> > containing a timestamp column, v, and o (both numeric).
> >
> > I created an grouping index for every 5 minutes along these lines:
> >
> > obsfivemin <- trunc(obsts,units="hours")
> > +( floor( (obsts$min / 5 ) ) * 5 * 60 )
> >
> > (where obsts is the sql timestamp converted into a DateTime object)
> >
> > Then I tried aggregate(df,by=obsfivemin,cor), but that seemed to pass
> > just a single column at a time to cor, not the entire data frame. It
> > worked for mean and sum, but not cor.
> >
> > In desperation, I tried looping over the different 5 minute levels and
> > computing cor, but I'm so R-clueless I couldn't even figure out how to
> > assign to a variable inside of that loop!
> >
> > code such as
> >
> > for (f in fivemin){
> > output[f] <- cor(df[grouper==f,]); }
> >
> > failed, as I couldn't figure out how to initialize output so that
> > output[f] would accept the output of cor.
> >
> > Any help or steering towards the proper R-way would be appreciated.
> >
> > Regards,
> >
> > James Marca
> >
> > --
> > This message has been scanned for viruses and
> > dangerous content by MailScanner, and is
> > believed to be clean.
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
>
>
>
> --
> Ista Zahn
> Graduate student
> University of Rochester
> Department of Clinical and Social Psychology
> http://yourpsyche.org
>
>
>
--
James E. Marca, PhD
Researcher
Institute of Transportation Studies
AIRB Suite 4000
University of California
Irvine, CA 92697-3600
jmarca at translab.its.uci.edu
(949) 824-6287