[R] estimation problem

Thu May 3 17:45:45 CEST 2012

On Thu, May 03, 2012 at 03:08:00PM +0200, Kehl Dániel wrote:
> Dear List-members,
> 
> I have to estimate the mean, or the sum, of a population that for 
> some reason contains a huge number of zeros.
> I cannot give real data but I constructed a toy example as follows
> 
> N1 <- 100000
> N2 <- 3000
> x1 <- rep(0,N1)
> x2 <- rnorm(N2,300,100)
> x <- c(x1,x2)
> 
> n <- 1000
> 
> x_sample <- sample(x,n,replace=FALSE)
> 
> I want to estimate the sum of x based on x_sample, knowing only the 
> total N = N1 + N2 and not N1 and N2 themselves.
> The sample mean has a huge standard deviation, so I am looking for a 
> better estimator.

Hi.

I do not know the exact answer, but here is an observation.
If the question is restated as estimating the mean of the nonzero values, then
a natural estimate is mean(x_sample[x_sample != 0]). Its standard deviation in
your setting can be estimated by simulation:

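  res <- numeric(1000)                        # one estimate per replication
  for (i in seq_along(res)) {
      x_sample <- sample(x, n, replace=FALSE) # draw a fresh sample of size n
      res[i] <- mean(x_sample[x_sample != 0]) # mean of the nonzero values only
  }
  sd(res)                                     # spread of the estimate across replications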

  [1] 18.72677 # this varies with the seed a bit

This cannot be improved much, because the estimate rests on a very small
effective sample. On average the sample contains N2/(N1+N2)*n = 29.1 nonzero
values, and these have standard deviation 100, so the standard deviation of
their mean should be close to 100/sqrt(29.1) = 18.5376.
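
The arithmetic above can be checked with a short, self-contained sketch. It
rebuilds the toy population from the original question and compares the rough
value 100/sqrt(k) with a simulated standard deviation. The seed and the names
k_expected and sd_approx are assumptions introduced here for illustration only.

```r
# Toy population from the original question (not real data)
set.seed(1)                      # arbitrary seed, only for reproducibility
N1 <- 100000
N2 <- 3000
n  <- 1000
x  <- c(rep(0, N1), rnorm(N2, 300, 100))

# Expected number of nonzero values in a sample of size n
k_expected <- N2 / (N1 + N2) * n

# Rough standard deviation of the mean of k_expected values with sd 100
sd_approx <- 100 / sqrt(k_expected)

# Simulated standard deviation of the mean of the nonzero sampled values
res <- replicate(1000, {
  s <- sample(x, n, replace = FALSE)
  mean(s[s != 0])
})

print(c(expected_nonzero = k_expected,
        sd_approx        = sd_approx,
        sd_simulated     = sd(res)))
```

The simulated value is usually slightly above the approximation, since the
number of nonzero values varies from sample to sample.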

Petr Savicky.