[R] acf gives correlations > 1

(Ted Harding) Ted.Harding at manchester.ac.uk
Mon Sep 14 19:21:02 CEST 2009


This is quite a nasty one! I've been having a look at it, and I think
it would be useful to go back to first principles.

Given a series u[1:N], the correlation coefficient between u[1:(N-k)]
and u[(k+1):N] is

c(k) = cov(u[1:(N-k)],u[(k+1):N])/sqrt(var(u[1:(N-k)])*var(u(k+1):N]))

in which, in particular, the means of u[1:(N-k)] and u[(k+1):N] are
subtracted from their respective series in caluclating the variances
and the covariance. This definition can be adopted for the serial
correlation of lag k.

However, variants of this definition are often adopted for the sake
of "better estimation" (in various senses). These include subtracting
the mean M of the entire series from u[1:(N-k)] and u[(k+1):N], and
then forming sum( (u[1:(N-k)]-M)*( u[(k+1):N-M) ) in calculating
the covariance, and using var(u[1:N]) in the denominator instead
of the above denominator.

But only the above definition of c(k) is guaranteed to yield a
result between -1 and 1; variants can theoretically give a result
outside this range.

This is unlikely to happen for a long series -- indeed, there is
unlikely to be much difference between the various ways of doing
it in a long series. But in your case you do not have long series.
I have read your series into a vector called x, of length N=6940.

For example, the set of values i such that neither x[i] nor x[i+1]
is NA is:

which((!is.na(x[1:(N-1)])&(!is.na(x[2:N]))))
 [1] 3238 4604 4605 4606 4932 4933 4985 4991 4992 4993 4994
[12] 5331 5353 5429 5478 5479 5527 5554 5555 5569 5583 5597
[23] 5625 5639 5653 5681 5682 5683 5905 5947 5977 5978 6043
[34] 6044 6045 6241 6283 6297 6301 6311 6395 6396 6397 6398

only 44 in length; and the lengths of sequences of succesive integers
in the above are:

1 3 2 1 4 1 1 1 2 1 2 1 1 1 1 1 1 3 1 1 2 3 1 1 1 1 1 4

 Now let

  ix.1 <- which((!is.na(x[1:(N-1)])&(!is.na(x[2:N]))))
  x1.0 <- x[ix.1]
  x1.1 <- x[ix.1+1]

Now we can calculate a correlation of lag 1 as

  c(1) = cov(x1.0,x1.1)/sqrt(var(x1.0)*var(x1.1)) = 0.8511088

which is nicely between 0 and 1. This is calculated using the basic
definition of c(k) above, not any "variant".

NOW: What variant would we like to play with?
[1]: Subtract the mean of all the x's in x1.0 and x1.1
  ix <- union(ix.1,(ix.1+1)); n <- length(ix)
  M <- mean(x[ix])
  v1.0 <- x1.0 - M ; v1.1 <- x1.1 - M
  sum(v1.0*v1.1)/sqrt(sum(v1.0^2)*sum(v1.1^2)) = 0.8523563
  which is not very different.
[2]: Subtract the mean of all non-NA x's in x
  M <- mean(x,na.rm=TRUE)
  v1.0 <- x1.0 - M ; v1.1 <- x1.1 - M
  sum(v1.0*v1.1)/sqrt(sum(v1.0^2)*sum(v1.1^2)) = 0.8621987
  which is a bit more different
[3]: Use the covariance of x1.0,x1.1 as in c(k), but use the variance
  of all the non-NA x's
  V <- var(x,na.rm=TRUE)
  cov(x1.0,x1.1)/V = 1.196581
**This is beginning to look like what R's acf() does! But not quite:
  acf at lag 1 gave 1.258, at lag 2 1.192
  However, it makes the point ...

I too would be unhappy with results, purporting to be correlations,
which so much exceeded 1. I would normally think of using the above
method for c(k), along the lines of

  Ck <- function(x,k){
    ix.1 <- which((!is.na(x[1:(N-k)])&(!is.na(x[(k+1):N]))))
    x1.0 <- x[ix.1]
    x1.1 <- x[ix.1+k]
    cov(x1.0,x1.1)/sqrt(var(x1.0)*var(x1.1))
  }
  for(k in (1:10)){ acfs[k]<-Ck(x,k) }
  cbind((1:10),acfs)
  #         acfs
  #  1 0.8511088
  #  2 0.5881168
  #  3 0.6543933
  #  4 0.9046573
  #  5 0.9392939
  #  6 0.8076299
  #  7 0.8605418
  #  8 0.9431999
  #  9 0.9667194
  # 10 0.9477807

(Note): I've also had a look at the code in acf(), but it's not
clear to me exactly what it does in a case like yours!

Hoping this is of some help,
Ted.

On 14-Sep-09 13:53:01, Steve Jones wrote:
> I misunderstood what the help page was saying - I thought that the
> missing values were what constituted the invalid function. Clearly I
> was mistaken.
> 
> Looks like I'll have to compute the acf myself. Shame, but such is
> life.
> 
> Thanks for the input, everyone!
> Steve.
> 
> Berwin A Turlach wrote:
>> G'day Steve,
>> 
>> On Mon, 14 Sep 2009 13:44:56 +0200
>> Steve Jones <steve at squaregoldfish.co.uk> wrote:
>> 
>>> Apologies for the missing data. It can be downloaded from here
>>> (22Kb): http://www.squaregoldfish.co.uk/sekrett/series.csv
>> 
>> Well, the Details section of acf's help page states:
>> 
>>  By default, no missing values are allowed.  If the 'na.action'
>>  function passes through missing values (as 'na.pass' does), the
>>  covariances are computed from the complete cases.  This means
>>  that the estimate computed may well not be a valid autocorrelation
>>  sequence, and may contain missing values.  [...]
>> 
>> And you have seem to have a massive amount of missing data:
>> 
>> R> dat <-
>> scan(url("http://www.squaregoldfish.co.uk/sekrett/series.csv"))
>> Read 6940 items
>> R> mean(!is.na(dat))
>> [1] 0.02881844
>> 
>> And, not surprisingly, an even smaller proportion of consecutive, 
>> non-missing observations.
>> 
>> R> mean(!is.na(dat[-1]) & !is.na(dat[-length(dat)]))
>> [1] 0.006340971
>> 
>> You can find out which formulae are used exactly by acf by studying
>> the
>> code, but this might give you an idea about what is going on:
>> 
>> R> ind <- !is.na(dat)
>> R> (mu <- mean(dat[ind]))  ## too lazy for mean(dat, na.rm=TRUE)
>> [1] 373.5165
>> R> (sig2 <- var(dat[ind])) ## ditto
>> [1] 463.4041
>> R> ind <- which(!is.na(dat[-1]) & !is.na(dat[-length(dat)]))
>> R> sum( (dat[ind]-mu) * (dat[ind+1] - mu)) / length(ind)
>> [1] 593.3041
>> R> sum( (dat[ind]-mu) * (dat[ind+1] - mu)) / length(ind) / sig2
>> [1] 1.280317
>> 
>> HTH
>> Cheers,
>>      Berwin

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 14-Sep-09                                       Time: 18:20:53
------------------------------ XFMail ------------------------------




More information about the R-help mailing list