[R-sig-finance] Fitting Distribution Mixs with Incomplete Data

Thu May 18 15:45:30 CEST 2006

Dear All,
Suppose you are given some data to fit to a mixture of Gaussian
distributions (a weighted sum of three distributions).
You can do that using e.g. the mix package
(http://www.math.mcmaster.ca/peter/mix/mix.html)
which gave me very good results on some data I generated artificially.
See e.g. the code:

# R code used to fit a sum of three normal distributions
rm(list=ls())
library(mixdist)
set.seed(123)
NN<-10000
three<-3 #number of gaussian sequences I want to generate
mysample<-matrix(nrow=NN,ncol=three)
sdvec<-seq(1:three)
sdvec[1]<-4
sdvec[2]<-3
sdvec[3]<-3.5
meanvec<-seq(1:three)
meanvec[1]<-20
meanvec[2]<-50
meanvec[3]<-70

for (i in 1:three)
{
 mysample[ ,i] <-rnorm(NN,meanvec[i],sdvec[i])
}
dim(mysample)<-c(3*NN,1)
plot(density(mysample,kern="gaussian"),lwd=2,col=300)
## The commented lines are parts of the 1st version of this script

#x<-matrix(ncol=2,nrow=three*NN)
#x[ ,1]<-mysample[]
#x[ ,2]<-1
y<-as.data.frame(mysample)
mixgroup(y)
z<-mixgroup(y,breaks=100)
plot(z)
#ini<-read.table("gaupar.TXT")
ini<-mixparam(mu = c(24, 45, 73), sigma = c(5, 4, 4),pi=c(0.4,0.3,0.3))
mymix<-mix(z,ini,"norm")
summary(mymix)
plot(mymix)
print("So far so good")

However, you can be in a situation in which you cannot sample the
distribution that easily (e.g. you do not have stock quotations
below/above a certain price) and then the same tool does not work that
well. Consider for instance the same case as above but removing all
the data below 15 and above 65:

# R code used to fit a sum of three normal distributions
rm(list=ls())
library(mixdist)
set.seed(123)
NN<-10000
three<-3 #number of gaussian sequences I want to generate
mysample<-matrix(nrow=NN,ncol=three)
sdvec<-seq(1:three)
sdvec[1]<-4
sdvec[2]<-3
sdvec[3]<-3.5
meanvec<-seq(1:three)
meanvec[1]<-20
meanvec[2]<-50
meanvec[3]<-70

for (i in 1:three)
{
 mysample[ ,i] <-rnorm(NN,meanvec[i],sdvec[i])
}
dim(mysample)<-c(3*NN,1)
##  now I am going to remove some data and see if I can still
meaningfully optimize the
## distribution
N3=3*NN
mycount=0
for (i in 1:N3)
{if (mysample[i]>22 & mysample[i]<67)
{
mycount=mycount+1
if (mycount ==1 )
{
mysample2<-cbind(mysample[i])
}
else
{
mysample2<-cbind(mysample2,mysample[i])
}

}

}

plot(density(mysample,kern="gaussian"),lwd=2,col=300)
## The commented lines are parts of the 1st version of this script

#x<-matrix(ncol=2,nrow=three*NN)
#x[ ,1]<-mysample[]
#x[ ,2]<-1
# y<-as.data.frame(mysample)
mysample3<-mysample2[1:15000]
y<-as.data.frame(mysample3)
mixgroup(y)
z<-mixgroup(y,breaks=100)
plot(z)
#ini<-read.table("gaupar.TXT")
ini<-mixparam(mu = c(24, 45, 73), sigma = c(5, 4, 4),pi=c(0.4,0.3,0.3))
mymix<-mix(z,ini,"norm")
summary(mymix)
plot(mymix)
print("So far so good")

Then the mixdist completely misses the means of the 2 external modes.
I am sure that, apart from the combination of three Gaussians, this
situation cannot be "new" at all.
I am thinking about some iteration technique (e.g. you fit the data as
well as you can with mixdist to start with, then according to your
fitted parameters you generate the "missing"  results, then you re-fit
the whole new set of data (real ones and those freshly generated) and
so on hoping to get some convergent results).
But it seems a bit cumbersome (furthermore, how many of them should I
generate?) plus I am sure that solving this problem from scratch is
re-inventing the wheel.
Sorry for the long email.
Kind Regards

Lorenzo Isella