[R] relation in aggregated data

Wed Jul 7 16:24:31 CEST 2010

Dear all

My question is more about statistics than about R, although it can be 
demonstrated in R. It concerns the pros and cons of looking for a 
relationship in aggregated data. I have two variables that may be related, 
and I measure them regularly over some period (say, a year), but I cannot 
measure them at the same time (i.e. I cannot measure x together with the 
corresponding value of y; usually I have 3 or more values of x and only 
one value of y per day).

I can compute aggregated values (say, quarterly). My questions are:

1.      Is such an approach sound? Can I use it?
2.      What problems could arise?
3.      Is there any other method for inspecting variables that may be 
related but cannot be measured directly at the same time?
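
For concreteness, the measurement situation and the quarterly aggregation 
described above could look roughly like the sketch below. The data are 
simulated, and all names (xdat, ydat, agg, the column names) are 
hypothetical, introduced only for illustration.

```r
# Simulated stand-in for the situation described above; all names and
# values here are hypothetical.
set.seed(1)
days <- seq(as.Date("2009-01-01"), as.Date("2009-12-31"), by = "day")

# x: three measurements per day; y: one measurement per day
xdat <- data.frame(day = rep(days, each = 3), x = rnorm(3 * length(days)))
ydat <- data.frame(day = days, y = rnorm(length(days)))

# quarters() labels each date with its quarter ("Q1" to "Q4")
xdat$quarter <- quarters(xdat$day)
ydat$quarter <- quarters(ydat$day)

# Quarterly means of each variable, computed separately
xq <- tapply(xdat$x, xdat$quarter, mean)
yq <- tapply(ydat$y, ydat$quarter, mean)

# Only four (x, y) pairs remain for the whole year
agg <- data.frame(quarter = names(xq),
                  x = as.vector(xq),
                  y = as.vector(yq[names(xq)]))
agg
```

The same pattern works with any other grouping variable; the coarser the 
grouping, the fewer pairs are left to judge the relationship from.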

My opinion is that inspecting aggregated values is not very sound and that 
there are many traps, especially when there are only a few aggregated 
values. Below you can see my examples.

If you have some opinion on this issue, please let me know.

Best regards
Petr

Let us have a relationship between x and y:

set.seed(555)
x <- rnorm(120)
y <- 5*x+3+rnorm(120)
plot(x, y)

As the plot shows, there is a clear relationship. Now I make a factor for 
aggregation.

fac <- rep(1:4,each=30)

xprum <- tapply(x, fac, mean)
yprum <- tapply(y, fac, mean)
plot(xprum, yprum)
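
A plot of only four points is hard to judge by eye, so as a sketch the 
strength of the relationship before and after aggregation can also be 
compared numerically. The names cor_raw and cor_agg are introduced here 
for illustration only; the data are regenerated so that the block runs on 
its own.

```r
# Same simulated data as in the example above
set.seed(555)
x <- rnorm(120)
y <- 5 * x + 3 + rnorm(120)
fac <- rep(1:4, each = 30)
xprum <- tapply(x, fac, mean)
yprum <- tapply(y, fac, mean)

# Correlation of the 120 raw pairs and of the 4 aggregated pairs
cor_raw <- cor(x, y)
cor_agg <- cor(as.vector(xprum), as.vector(yprum))

# Spread of the raw x values compared with the spread of the group means:
# averaging 30 values shrinks the range that the four points cover
c(cor_raw = cor_raw, cor_agg = cor_agg,
  sd_x = sd(x), sd_xprum = sd(xprum))
```

Whatever value cor_agg takes, it rests on four points only, so it carries 
very little evidence either way.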

The relationship is completely gone. Now let us make some other fake data:

xn <- runif(120)*rep(1:4, each=30)
yn <- runif(120)*rep(1:4, each=30)
plot(xn,yn)

There is no visible relationship: xn and yn are independent of each other, 
but both depend on the aggregation factor.

xprumn <- tapply(xn, fac, mean)
yprumn <- tapply(yn, fac, mean)
plot(xprumn, yprumn)

Here you can see a perfect relationship, which is due only to the 
aggregation factor.
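
One way to check that the relationship comes only from the aggregation 
factor is to remove the group means and look at what is left. The sketch 
below does this with ave(); the names cor_means, cor_within and 
cor_by_group are introduced here for illustration. The data are 
regenerated with a fresh seed so that the block runs on its own, so the 
values differ from those plotted above.

```r
set.seed(555)  # arbitrary seed, used here only for reproducibility
fac <- rep(1:4, each = 30)
xn <- runif(120) * fac
yn <- runif(120) * fac

# Correlation of the four group means (what the aggregated plot shows)
cor_means <- cor(as.vector(tapply(xn, fac, mean)),
                 as.vector(tapply(yn, fac, mean)))

# Subtract each group's mean, leaving only within-group variation
xn_within <- xn - ave(xn, fac)
yn_within <- yn - ave(yn, fac)
cor_within <- cor(xn_within, yn_within)

# Correlation inside each group separately
cor_by_group <- sapply(split(data.frame(xn, yn), fac),
                       function(d) cor(d$xn, d$yn))

cor_means
cor_within
cor_by_group
```

If the relationship in the means is driven by the grouping alone, 
cor_within and the entries of cor_by_group should scatter around zero 
while cor_means stays high.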