# [R] Simulating data and imputation

Sarah s1327720 at student.rug.nl
Wed Dec 29 14:46:24 CET 2010

```
Hi,

I wrote a script in order to simulate data, which I will use for evaluating
missing data and imputation. However, I'm having trouble with the last part
of my script, in which a dataframe is constructed without missing values.

This is my script:
y1 <- rnorm(10,0,3)
y2 <- rnorm(10,3,3)
y3 <- rnorm(10,3,3)
y4 <- rnorm(10,6,3)
y <- c(y1,y2,y3,y4)
a1 <-rep(1,20)
a2 <-rep(2,20)
a <- c(a1,a2)
b1 <- gl(2,10,20)
b2 <- gl(2,10,20)
b <- c(b1,b2)
x1 <- 1+2*y1+ rnorm(10,0,8)
x2 <- 1+2*y2+ rnorm(10,0,8)
x3 <- 1+2*y3+ rnorm(10,0,8)
x4 <- 1+2*y4+ rnorm(10,0,8)
x <- c(x1,x2,x3,x4)
#Create missing data dependent on factor A:
mar.y <- rep(NA,40)
df <- data.frame(y=y, mar.y=mar.y, a=a, b=b, x=x)
for (j in 1:40)
{
  # Create missingness at random dependent on A:
  df$mar.y[which(df$a==1)] <- replicate(length(which(df$a==1)),
                                        rbinom(1,1,0.20))
  df$mar.y[which(df$a==2)] <- replicate(length(which(df$a==2)),
                                        rbinom(1,1,0.10))
}
if (length(which(df$mar.y==0))>34) {
  df <- df[sample(which(df$mar.y==0),34), ]
} else {
  df <- df[c(which(df$mar.y==0),
             sample(which(df$mar.y==1),34-length(which(df$mar.y==0)))), ]
}

(I would like the total number of randomly removed values to be 15% of the
total sample size, which in this case is 6 values. In other scripts I'm
using different values.)

At this point, I would like to impute the missing values. However, my
dataframe only contains the 34 'observed' values (which seemed okay at the
beginning of my study). Now, I would like my dataframe to contain the 34
observed values (mar.y=0) AND the 6 'missing' or deleted values (mar.y=1).
Unfortunately, the missing values are removed from the data set by the
subsetting with 'sample', so imputation is not possible at the moment (i.e.,
there are no NAs to impute).

Does anyone know how to rewrite the last bit of the script (the if...else
part) in order to keep the 6 'deleted/missing' values in the data set, and
give them a value mar.y=1 (or NA, or any other value), together with the 34
'observed' ones (mar.y=0)? That way, I can impute the missing values in my
data set.
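
# A minimal sketch of the kind of result described above: all 40 rows are
# kept and exactly 6 of them are flagged as missing. It is a standalone toy
# example, not the script above. It swaps the rbinom draws for weighted
# sampling of a fixed number of rows, which is an assumption about what is
# wanted. The names sim, n.miss, p.miss, miss.idx and y.obs are hypothetical.
set.seed(1)                                     # only for reproducibility
sim <- data.frame(y = rnorm(40, 3, 3), a = rep(1:2, each = 20))
n.miss <- round(0.15 * nrow(sim))               # 15% of 40 rows = 6
p.miss <- ifelse(sim$a == 1, 0.20, 0.10)        # weight depends on factor A
miss.idx <- sample(nrow(sim), n.miss, prob = p.miss)
sim$mar.y <- 0                                  # 0 = observed
sim$mar.y[miss.idx] <- 1                        # 1 = missing
sim$y.obs <- ifelse(sim$mar.y == 1, NA, sim$y)  # y with NAs, for imputation
table(sim$a, sim$mar.y)                         # missing counts per level of A
# The complete y stays in the data frame, so imputed values of y.obs can
# later be compared with the true values.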