[R-SIG-Finance] How to clean errors in yahoo historical quotes?

Marc Delvaux mdelvaux at gmail.com
Mon Nov 17 21:49:31 CET 2008


This is a follow-up question to the previous thread on time gaps.

In the first thread, I identified gaps in the time series.  I was
focusing on GENZ because I was seeing aberrant results when working
with that time series.  I first detected the time gaps and assumed
they were the cause.  While the code provided by Josh Ulrich works
beautifully to get compatible series, I still got aberrant results
when working with GENZ :-(

This has now been traced to errors in the series, probably related to
a problem with the adjustment algorithm (see below).  Thanks to Jeff
Ryan, I am also able to compare with the same data as reported by
Google, which in this specific instance is not affected.

The general question is then: given that downloaded data can in
general be affected by errors, how should they be cleaned?  I can
see ways to do that, especially by direct observation and manual
cleaning, but again I don't want to reinvent the wheel.  Also, is it
worth contacting Yahoo to have the series cleaned at the source (gut
feeling is no)?  And yes, I understand both Yahoo and Google data are
free and so come with no guarantee.
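
To make the "direct observation" idea concrete, here is a minimal sketch in base R of one way to flag isolated one-day spikes automatically.  The helper flag_spikes() and its tolerance are my own assumptions for illustration, not functions from TTR or quantmod; the closes are the Yahoo values quoted further down in this message.

```r
# Adjusted closes for GENZ, copied from the Yahoo output quoted in this post
dates <- as.Date(c("2001-04-24", "2001-04-25", "2001-04-26", "2001-04-27",
                   "2001-04-30", "2001-05-01", "2001-05-02", "2001-05-03",
                   "2001-05-04", "2001-05-07", "2001-05-08"))
close <- c(50.525, 51.765, 52.685, 53.755, 27.245, 55.515,
           54.080, 51.315, 50.900, 26.315, 52.980)

# flag_spikes() is a hypothetical helper, not part of any package.
# A point is flagged when its log price differs by more than `tol` from
# BOTH neighbours while the two neighbours agree with each other.
# tol = 0.3 (roughly a 35% move) is an arbitrary choice for this sketch.
flag_spikes <- function(x, tol = 0.3) {
  n    <- length(x)
  lx   <- log(x)
  prev <- c(NA, lx[-n])   # previous observation
  nxt  <- c(lx[-1], NA)   # next observation
  bad  <- abs(lx - prev) > tol &
          abs(lx - nxt)  > tol &
          abs(nxt - prev) < tol
  bad[is.na(bad)] <- FALSE  # end points cannot be judged this way
  bad
}

bad <- flag_spikes(close)
dates[bad]                  # the suspect dates

# One possible repair: blank the suspect values and leave the choice of
# replacement (second source, interpolation, ...) as an explicit later step.
clean <- close
clean[bad] <- NA
```

Requiring the two neighbours to agree means a genuine level shift, which persists on the following days, is not flagged; only a value that jumps away and comes straight back is.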


First, the data as downloaded from Yahoo via getYahooData in package
TTR (the corresponding Yahoo chart is OK, BTW):

> genz['2001-04-24::2001-05-08']
             Open   High    Low  Close  Volume Unadj.Close Div Split Adj.Div
2001-04-24 51.900 52.000 50.225 50.525 2839400      101.05  NA    NA      NA
2001-04-25 50.495 53.345 50.495 51.765 4161600      103.53  NA    NA      NA
2001-04-26 51.875 53.255 51.145 52.685 1956000      105.37  NA    NA      NA
2001-04-27 52.880 55.000 52.750 53.755 3028800      107.51  NA    NA      NA
2001-04-30 27.025 27.585 26.545 27.245 3396800       54.49  NA    NA      NA   << not very likely
2001-05-01 54.625 55.750 52.860 55.515 2792000      111.03  NA    NA      NA
2001-05-02 55.540 55.625 51.875 54.080 3466600      108.16  NA    NA      NA
2001-05-03 51.375 51.835 51.100 51.315 5412600      102.63  NA    NA      NA
2001-05-04 49.000 51.515 48.750 50.900 4066400      101.80  NA    NA      NA
2001-05-07 25.485 26.495 25.375 26.315 3185500       52.63  NA    NA      NA   << not very likely
2001-05-08 52.800 53.275 52.110 52.980 1884000      105.96  NA    NA      NA
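
As a quick check on the rows above, the sketch below (base R, values copied from the Close and Unadj.Close columns) computes the implied adjustment factor for each day.  The variable names are mine, and the interpretation in the closing note is only one possible reading, not something confirmed by Yahoo.

```r
# Close and Unadj.Close, copied from the Yahoo output above
close <- c(50.525, 51.765, 52.685, 53.755, 27.245, 55.515,
           54.080, 51.315, 50.900, 26.315, 52.980)
unadj <- c(101.05, 103.53, 105.37, 107.51, 54.49, 111.03,
           108.16, 102.63, 101.80, 52.63, 105.96)

# Implied adjustment factor on each day
adj_factor <- close / unadj
round(adj_factor, 3)

# Day-over-day change of the unadjusted close, to see where it jumps
round(diff(log(unadj)), 2)
```

The factor comes out the same on every row, including the two suspect ones, while Unadj.Close itself drops to about half on those days.  That would be consistent with the raw price for those two days having already been divided before the factor was applied, rather than with the factor being wrong.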

Then the data as downloaded from Google via getSymbols in package quantmod:

> GENZ['2001-04-24::2001-05-08']
           GENZ.Open GENZ.High GENZ.Low GENZ.Close GENZ.Volume
2001-04-24     51.90     52.00    50.22      50.52     5678800
2001-04-25     50.50     53.34    50.50      51.76     8323400
2001-04-26     51.88     53.26    51.14      52.68     3912200
2001-04-27     52.88     55.00    52.75      53.76     6057400
2001-04-30     54.05     55.17    53.09      54.48     6793600  << I can believe this one
2001-05-01     54.62     55.75    52.86      55.52     5584000
2001-05-02     55.54     55.62    51.88      54.08     6933200
2001-05-03     51.38     51.84    51.10      51.32    10825200
2001-05-04     49.00     51.52    48.75      50.90     8132800
2001-05-07     50.97     53.00    50.75      52.64     6371000
2001-05-08     52.80     53.28    52.11      52.98     3768200
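
With two sources in hand, one cleaning approach is to cross-check them and flag the days on which they disagree.  The following is a minimal sketch in base R using the closes from the two listings above; the 1% tolerance and the variable names are assumptions made for illustration.

```r
dates <- as.Date(c("2001-04-24", "2001-04-25", "2001-04-26", "2001-04-27",
                   "2001-04-30", "2001-05-01", "2001-05-02", "2001-05-03",
                   "2001-05-04", "2001-05-07", "2001-05-08"))

# Closes copied from the Yahoo and Google outputs in this post
yahoo_close  <- c(50.525, 51.765, 52.685, 53.755, 27.245, 55.515,
                  54.080, 51.315, 50.900, 26.315, 52.980)
google_close <- c(50.52, 51.76, 52.68, 53.76, 54.48, 55.52,
                  54.08, 51.32, 50.90, 52.64, 52.98)

# Relative disagreement between the two sources, day by day
rel_diff <- abs(yahoo_close / google_close - 1)

# 1% is an arbitrary tolerance: wide enough to ignore rounding
# differences, narrow enough to catch a halved price
suspect <- rel_diff > 0.01

data.frame(date   = dates[suspect],
           yahoo  = yahoo_close[suspect],
           google = google_close[suspect],
           ratio  = round(yahoo_close[suspect] / google_close[suspect], 2))
```

This only shows that the sources disagree, not which one is right; here the neighbouring days make the Google values the plausible ones.  Note also that the Yahoo volumes are half the Google volumes on every row, good or bad, so the volume column does not help to single out the bad days.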


