[R] Help with simulation of unbalanced clustered data
Chao Liu
psychaoliu using gmail.com
Wed Dec 16 15:56:12 CET 2020
Thank you for the reminder, Jeff. I am new to R-help, so please bear
with me. This is not homework, and here is a reproducible example. The
number of observations per cluster does not match the conditions in my
original post; I am only using it to convey the idea.
> y <- rnorm(20)
> x <- rnorm(20)
> z <- rep(1:5, 4)
> w <- rep(1:4, each = 5)
> dd <- data.frame(id = z, cluster = w, x = x, y = y)  # this is a balanced dataset
> dd
   id cluster           x           y
 1  1       1  0.30003855  0.65325768
 2  2       1 -1.00563626 -0.12270866
 3  3       1  0.01925927 -0.41367651
 4  4       1 -1.07742065 -2.64314895
 5  5       1  0.71270333 -0.09294102
 6  1       2  1.08477509  0.43028470
 7  2       2 -2.22498770  0.53539884
 8  3       2  1.23569346 -0.55527835
 9  4       2 -1.24104450  1.77950291
10  5       2  0.45476927  0.28642442
11  1       3  0.65990264  0.12631586
12  2       3 -0.19988983  1.27226678
13  3       3 -0.64511396 -0.71846622
14  4       3  0.16532102 -0.45033862
15  5       3  0.43881870  2.39745248
16  1       4  0.88330282  0.01112919
17  2       4 -2.05233698  1.63356842
18  3       4 -1.63637927 -1.43850664
19  4       4  1.43040234 -0.19051680
20  5       4  1.04662885  0.37842390
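A quick way to confirm that a layout like this is balanced is to tabulate the cluster sizes. This is only a minimal sketch: it rebuilds the same 4 x 5 layout, and the seed is an assumption added for repeatability, so the random values will not match the ones printed above.

```r
set.seed(1)  # assumed seed, only so the sketch is repeatable

# Same layout as above: 4 clusters with 5 observations each
dd <- data.frame(id = rep(1:5, 4), cluster = rep(1:4, each = 5),
                 x = rnorm(20), y = rnorm(20))

sizes <- table(dd$cluster)  # number of observations per cluster
sizes

# TRUE when every cluster has the same number of observations
balanced <- length(unique(as.vector(sizes))) == 1
balanced
```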
After randomly adding and deleting some rows, the unbalanced data might
look like this:
   id cluster      x       y
 1  1       1  0.895  -0.659
 2  2       1 -0.160  -0.366
 3  1       2 -0.528  -0.294
 4  2       2 -0.919   0.362
 5  3       2 -0.901  -0.467
 6  1       3  0.275   0.134
 7  2       3  0.423   0.534
 8  3       3  0.929  -0.953
 9  4       3   1.67   0.668
10  5       3  0.286  0.0872
11  1       4 -0.373  -0.109
12  2       4  0.289   0.299
13  3       4  -1.43  -0.677
14  4       4 -0.884    1.70
15  5       4   1.12   0.386
16  1       5 -0.723   0.247
17  2       5  0.463   -2.59
18  3       5  0.234   0.893
19  4       5 -0.313   -1.96
20  5       5  0.848 -0.0613
Here is what I tried:
dd[-sample(which(dd$cluster == sample(unique(dd$cluster), round(0.2 * length(unique(dd$cluster))))),
           round(0.5 * length(which(dd$cluster == sample(unique(dd$cluster), round(0.2 * length(unique(dd$cluster)))))))), ]
I know it is very inefficient. Also, it only deletes rows at random and
does nothing to add rows, so the total number of observations is not
preserved. Thank you for your help!
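For reference, the procedure described in the original post (start with 33 observations in each of 20 clusters, then remove 60 rows with unequal removal probabilities across clusters) could be sketched roughly as follows. This is one possible reading, not a definitive solution: the seed, the object names (`full`, `w`, `drop`, `unbal`), the uniform draw for the weights, and the choice of 5 clusters that are never thinned are all assumptions made for illustration.

```r
set.seed(2020)                       # assumed seed, for repeatability only

n_clusters <- 20                     # number of clusters
n_target   <- 30                     # target average cluster size
n_start    <- round(1.1 * n_target)  # 10% more per cluster: 33

# Balanced starting data: 20 clusters x 33 observations = 660 rows
full <- data.frame(
  cluster = rep(seq_len(n_clusters), each = n_start),
  id      = rep(seq_len(n_start), times = n_clusters),
  x       = rnorm(n_clusters * n_start),
  y       = rnorm(n_clusters * n_start)
)

n_drop <- nrow(full) - n_clusters * n_target  # 660 - 600 = 60 rows to remove

# Hypothetical non-uniform removal weights per cluster:
# 5 clusters get weight 0 (never thinned), the rest get unequal weights.
w <- runif(n_clusters)
w[sample.int(n_clusters, 5)] <- 0

# Each row inherits its cluster's weight; draw 60 rows without replacement.
drop  <- sample.int(nrow(full), size = n_drop, prob = w[full$cluster])
unbal <- full[-drop, ]

table(unbal$cluster)  # unequal cluster sizes
nrow(unbal)           # 600
```

Because exactly `n_drop` rows are removed, the total stays at 600 and the average cluster size is 30, while individual clusters end up with different sizes. With very uneven weights a single cluster could lose most of its rows, so a minimum cluster size may need to be enforced separately.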
Best,
Liu
On Wed, Dec 16, 2020 at 8:50 AM Jeff Newmiller <jdnewmil using dcn.davis.ca.us>
wrote:
> This is R-help, not R-do-my-work-for-me. It is also not a homework help
> line. The Posting Guide is required reading. Assuming this is not homework,
> since each step in your problem definition can be mapped to a fairly basic
> operation in R (the sample function and indexing being key tools), you
> should be showing your work with a reproducible example that illustrates
> where you are stuck or why the result you are getting does not exhibit the
> desired properties.
>
> On December 15, 2020 6:48:12 PM PST, Chao Liu <psychaoliu using gmail.com>
> wrote:
> >Dear R experts,
> >
> >I want to simulate some unbalanced clustered data. The number of clusters
> >is 20 and the average number of observations per cluster is 30. However, I
> >would like to first create clustered data with 10% more observations per
> >cluster than specified (i.e., 33 rather than 30). I then want to randomly
> >exclude an appropriate number of observations (i.e., 60) to arrive at the
> >specified average number of observations per cluster (i.e., 30). The
> >probability of excluding an observation should not be uniform across
> >clusters (i.e., some clusters have no cases removed and others have more
> >excluded). In the end I should therefore still have 600 observations in
> >total. How can I do that in R? Thank you for your help!
> >
> >Best,
> >
> >Liu