[RsR] Outliers in the market model that's used to estimate `beta' of a stock

Thu Sep 18 17:36:16 CEST 2008

In continuation of the discussion on `Winsorisation' that has taken
place on r-sig-finance today, I thought I'd present all of you with an
interesting dataset and a question.

The data are daily stock returns of the large Indian software firm
`Infosys' (symbol `INFY' on NASDAQ). There are a large number of
observations of daily returns (i.e. percentage changes of the
adjusted stock price).

Load the data in --

    print(load(url("http://www.mayin.org/ajayshah/tmp/infosys_mm.rda")))
    str(x)
    summary(x)
    apply(x[, c("rj", "rM")], 2, sd)   # per-column sd; sd() on a whole data frame fails in current R

The name `rj' is used for returns on Infosys, and `rM' is used for
returns on the stock market index (Nifty). Three observations in this
dataset are really weird: two in rj and one in rM.

    weird.rj <- c(1896,2395)
    weird.rM <- 2672
    x[weird.rj,]
    x[weird.rM,]

As you can see, these observations are quite remarkable given the
small standard deviations that we saw above. There is absolutely no
measurement error here. These things actually happened.

Now consider a typical application: using this to estimate a market
model. The goal here is to estimate the coefficient of a regression of
rj on rM.
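
For reference, the regression can be written out as below. The symbols
alpha, epsilon and the time index t are notation added here for
illustration; only rj, rM and `beta' come from the post. The second
line is the usual OLS slope, which shows why a single extreme rM value
can carry a lot of weight: it enters the denominator squared.

```latex
r_{j,t} = \alpha + \beta \, r_{M,t} + \epsilon_t
\qquad
\hat{\beta}_{\mathrm{OLS}} =
  \frac{\sum_t (r_{M,t} - \bar{r}_M)(r_{j,t} - \bar{r}_j)}
       {\sum_t (r_{M,t} - \bar{r}_M)^2}
```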

    # A regression with all obs
    summary(lm(rj ~ rM, data=x))

    # Drop the weird rj --
    summary(lm(rj ~ rM, data=x[-weird.rj,]))

    # Drop the weird rM --
    summary(lm(rj ~ rM, data=x[-weird.rM,]))

    # Drop both kinds of weird observations --
    summary(lm(rj ~ rM, data=x[-c(weird.rM,weird.rj),]))

    # Robust regressions
    library(MASS)
    summary(rlm(rj ~ rM, data=x))
    summary(rlm(rj ~ rM, method="MM", data=x))
    library(robust)
    summary(lmRob(rj ~ rM, data=x))
    library(quantreg)
    summary(rq(rj ~ rM, tau=0.5, data=x))

So you see, we have a variety of different estimates for the slope
(which is termed `beta' in finance). What value would you trust the
most?
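
To put the slopes side by side, something like the following sketch
may help. The helper compare_betas() is hypothetical, not part of any
package, and it assumes a data frame with numeric columns rj and rM.
The data in the example is simulated stand-in data, not the Infosys
series, so the numbers it prints say nothing about the real beta.

```r
library(MASS)  # rlm(); MASS ships with standard R installations

# Hypothetical helper: gather the slope on rM from several fits.
# `d` is assumed to be a data frame with numeric columns rj and rM;
# `drop` is an integer vector of row numbers left out of one OLS fit.
compare_betas <- function(d, drop = integer(0)) {
  slope <- function(fit) unname(coef(fit)["rM"])
  # d[-integer(0), ] would return zero rows, so guard the empty case
  kept <- if (length(drop) > 0) d[-drop, ] else d
  c(ols_all     = slope(lm(rj ~ rM, data = d)),
    ols_dropped = slope(lm(rj ~ rM, data = kept)),
    rlm_huber   = slope(rlm(rj ~ rM, data = d)),
    rlm_mm      = slope(rlm(rj ~ rM, data = d, method = "MM")))
}

# Simulated stand-in data (NOT the Infosys series): true slope 1.2,
# with two extreme rj values and one extreme rM value planted by hand.
set.seed(1)
n <- 500
sim <- data.frame(rM = rnorm(n, sd = 1.5))
sim$rj <- 1.2 * sim$rM + rnorm(n, sd = 2)
sim$rj[c(100, 200)] <- c(-40, 35)
sim$rM[300] <- -12

betas <- compare_betas(sim, drop = c(100, 200, 300))
print(round(betas, 3))
```

With the dataset loaded earlier, the equivalent call would be
compare_betas(x, drop = c(weird.rj, weird.rM)), assuming x is a data
frame.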

And, would winsorisation using either my code
(https://stat.ethz.ch/pipermail/r-sig-finance/2008q3/002921.html) or
Patrick Burns' code
(https://stat.ethz.ch/pipermail/r-sig-finance/2008q3/002923.html) be a
good idea here?
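
For readers who have not followed the linked posts, here is a generic
quantile-based winsoriser as a minimal sketch. It is not the code from
either post; the function names, the cutoff p = 0.01 and the choice to
clamp both series are assumptions made for illustration.

```r
# Hypothetical winsoriser: clamp values outside the [p, 1 - p] sample
# quantiles to those quantiles. Nothing is dropped; extreme values are
# pulled in to the cutoffs.
winsorise <- function(v, p = 0.01) {
  stopifnot(is.numeric(v), p >= 0, p < 0.5)
  q <- quantile(v, probs = c(p, 1 - p), na.rm = TRUE, names = FALSE)
  pmin(pmax(v, q[1]), q[2])
}

# Refit the market model on winsorised copies of both series.
# `d` is assumed to be a data frame with numeric columns rj and rM.
winsorised_beta <- function(d, p = 0.01) {
  w <- data.frame(rj = winsorise(d$rj, p), rM = winsorise(d$rM, p))
  unname(coef(lm(rj ~ rM, data = w))["rM"])
}

# Small check on a toy vector: the two extreme values get clamped.
toy <- winsorise(c(-50, 1:9, 50), p = 0.1)
print(toy)
```

On the dataset loaded earlier this would be called as
winsorised_beta(x), again assuming x is a data frame. Note that the
result depends on the arbitrary cutoff p, which is part of what makes
the question hard.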

I'm instinctively unhappy with any scheme based on discarding
observations that I'm absolutely sure have no measurement error. We
have to model the weirdness of this data generating process, not
ignore it.

-- 
Ajay Shah                                      http://www.mayin.org/ajayshah  
ajayshah using mayin.org                             http://ajayshahblog.blogspot.com
<*(:-? - wizard who doesn't know the answer.