[R] Randomly drop a percent of data from a data.frame
arun
smartpink111 at yahoo.com
Sat Aug 17 00:32:05 CEST 2013
Hi,
Suppose the dataset had an odd number of columns:
set.seed(6458)
data2<- data.frame(x1=rnorm(5),x2=rnorm(5),x3=rnorm(5))
n<- prod(dim(data2))
n
#[1] 15
dummy<- rep(F,n/2)
dummy[sample(1:(n/2),n*.2)]<-T
dummy
#[1] TRUE FALSE TRUE FALSE FALSE FALSE TRUE
data2[,c("x2", "x3")][matrix(dummy, nc = 2)] <- NA
#Error in `[<-.data.frame`(`*tmp*`, matrix(dummy, nc = 2), value = NA) :
# unsupported matrix index in replacement
#In addition: Warning message:
#In matrix(dummy, nc = 2) :
# data length [7] is not a sub-multiple or multiple of the number of rows [4]
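The failure can be traced step by step. Below is a minimal sketch, assuming the same n = 15 as above (the names `len` and `m` are only for illustration): rep() truncates the fractional count 7.5 to 7, so matrix() builds a 4 x 2 mask (with a recycling warning) that cannot index the 5 x 2 block of columns.

```r
n <- 15                            # 5 rows x 3 columns
len <- length(rep(FALSE, n / 2))
len                                # 7, not 7.5: rep() truncates a fractional count
# matrix() needs ceiling(7 / 2) = 4 rows and recycles one value, hence the warning
m <- suppressWarnings(matrix(rep(FALSE, n / 2), ncol = 2))
dim(m)                             # 4 2, which does not match the 5 x 2 block x2, x3
```

So the mask length has to be computed from the columns being indexed, not from half of the total cell count.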
I might do:
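n1 <- 2 * nrow(data2)  # number of cells in the 2 target columns (x2, x3)
dummy <- rep(FALSE, n1)
# n1 * .2 = 2, i.e. 20% of the cells in x2 and x3, not 20% of all 15 cells
dummy[sample(1:n1, n1 * .2)] <- TRUE
data2[, c("x2", "x3")][matrix(dummy, ncol = 2)] <- NA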
data2
# x1 x2 x3
#1 -0.55899744 0.6622481 -0.3305958
#2 0.12776368 NA NA
#3 -1.09734838 0.2069539 -0.6997853
#4 0.75919499 -0.5683809 0.4752002
#5 -0.03063141 -0.7549605 2.6038635
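Whether "20%" refers to all cells or only to the cells in the chosen columns changes the count (3 of 15 versus 2 of 10 for this data). A possible way to make that choice explicit is sketched below. The function `drop_cells` and its argument `relative_to` are hypothetical names made up for this illustration, not something from base R or from this thread.

```r
# Hypothetical helper: set a fraction of cells to NA, taking the cells
# only from the columns named in `cols`.
drop_cells <- function(df, cols, frac, relative_to = c("all", "cols")) {
  relative_to <- match.arg(relative_to)
  n_pool <- nrow(df) * length(cols)     # cells that are eligible for removal
  n_base <- if (relative_to == "all") prod(dim(df)) else n_pool
  k <- round(frac * n_base)             # whole number of cells to drop
  stopifnot(k <= n_pool)                # cannot drop more cells than the pool holds
  mask <- matrix(FALSE, nrow = nrow(df), ncol = length(cols))
  mask[sample.int(n_pool, k)] <- TRUE   # linear (column-major) positions in the mask
  df[, cols][mask] <- NA
  df
}

set.seed(1)
d <- data.frame(x1 = rnorm(5), x2 = rnorm(5), x3 = rnorm(5))
out_all  <- drop_cells(d, c("x2", "x3"), 0.2)                        # 3 NAs
out_cols <- drop_cells(d, c("x2", "x3"), 0.2, relative_to = "cols")  # 2 NAs
```

Because the mask is built with nrow(df) rows and one column per entry in `cols`, it fits the indexed block whether the data frame has an odd or an even number of columns.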
A.K.
________________________________
From: Richard Kwock <richardkwock at gmail.com>
To: arun <smartpink111 at yahoo.com>
Cc: Christopher Desjardins <cddesjardins at gmail.com>; R help <r-help at r-project.org>
Sent: Friday, August 16, 2013 5:55 PM
Subject: Re: [R] Randomly drop a percent of data from a data.frame
Try this:
data <- data.frame(x1=rnorm(5),x2=rnorm(5),x3=rnorm(5),x4=rnorm(5))
data <- round(data,digits=3)
# get the total number of cells
n <- prod(dim(data))
# set up a logical vector with one element per cell of x3 and x4 (n/2 = 10)
dummy <- rep(FALSE, n/2)
# flag 20% of all cells (n*.2 = 4) for removal
dummy[sample(1:(n/2), n*.2)] <- TRUE
# 5x2 dummy matrix of TRUE and FALSE
matrix(dummy, ncol = 2)
# replace the TRUE positions in x3 and x4 with NA
data[,c("x3", "x4")][matrix(dummy, ncol = 2)] <- NA
data
# x1 x2 x3 x4
#1 -1.310 0.659 NA 0.510
#2 -3.003 -0.004 NA NA
#3 0.584 0.310 NA -0.087
#4 1.644 -2.792 -0.390 -0.382
#5 -1.791 0.840 1.137 0.820
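To see how the logical matrix lines up with the two columns, here is a small deterministic sketch (the data frame `d` and the `mask` values are invented for illustration). matrix() fills column by column, so the first three values of the vector map to the first selected column and the last three to the second.

```r
d <- data.frame(a = 1:3, b = 4:6, c = 7:9)
# column-major fill: column 1 of the mask is TRUE, FALSE, FALSE;
# column 2 is FALSE, FALSE, TRUE
mask <- matrix(c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE), ncol = 2)
d[, c("b", "c")][mask] <- NA
d
#   a  b  c
# 1 1 NA  7
# 2 2  5  8
# 3 3  6 NA
```

The mask must have the same dimensions as the block it indexes, here 3 rows by 2 columns.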
Richard
On Fri, Aug 16, 2013 at 2:34 PM, arun <smartpink111 at yahoo.com> wrote:
>Hi,
>Maybe this helps:
>#data1 (changed `data` to `data1`)
>set.seed(6245)
> data1 <- data.frame(x1=rnorm(5),x2=rnorm(5),x3=rnorm(5),x4=rnorm(5))
> data1<- round(data1,digits=3)
>
>data2<- data1
>
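>data1[,3:4] <- lapply(data1[,3:4], function(x) {
>  # keep only the values that match a random 80% sample of the x3/x4 values;
>  # the sample is redrawn for each column, so the number of NAs can vary
>  x1 <- match(x, sample(unlist(data1[,3:4]), round(0.8*length(unlist(data1[,3:4])))))
>  x[is.na(x1)] <- NA
>  x
>})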
> data1
># x1 x2 x3 x4
>#1 0.482 1.320 NA -0.142
>#2 -0.753 -0.041 -0.063 0.886
>#3 0.028 -0.256 -0.069 0.354
>#4 -0.086 0.475 0.244 0.781
>#5 0.690 -0.181 1.274 1.633
>
>
>#or
>data2[,3:4] <- lapply(data2[,3:4], function(x) {
>  x1 <- match(x, sample(unlist(data2[,3:4]), round(0.8*length(unlist(data2[,3:4])))))
>  x[is.na(x1)] <- NA
>  x
>})
> data2
># x1 x2 x3 x4
>#1 0.482 1.320 -0.859 -0.142
>#2 -0.753 -0.041 NA NA
>#3 0.028 -0.256 -0.069 0.354
>#4 -0.086 0.475 0.244 0.781
>#5 0.690 -0.181 1.274 1.633
>A.K.
>
>
>
>
>----- Original Message -----
>From: Christopher Desjardins <cddesjardins at gmail.com>
>To: "r-help at r-project.org" <r-help at r-project.org>
>Cc:
>Sent: Friday, August 16, 2013 3:02 PM
>Subject: [R] Randomly drop a percent of data from a data.frame
>
>Hi,
>I have the following data.
>
>> set.seed(6245)
>> data <- data.frame(x1=rnorm(5),x2=rnorm(5),x3=rnorm(5),x4=rnorm(5))
>> round(data,digits=3)
> x1 x2 x3 x4
>1 0.482 1.320 -0.859 -0.142
>2 -0.753 -0.041 -0.063 0.886
>3 0.028 -0.256 -0.069 0.354
>4 -0.086 0.475 0.244 0.781
>5 0.690 -0.181 1.274 1.633
>
>What I would like to do is drop 20% of the data. But I want this 20% to
>only come from dropping data from x3 and x4. It doesn't have to be even,
>i.e. I don't need to drop 2 from x3 and 2 from x4 or make sure only one
>observation has missing data on only one variable. I just want to drop 20%
>of the data through x3 and x4 only. In other words,
>
> x1 x2 x3 x4
>1 0.482 1.320 -0.859 NA
>2 -0.753 -0.041 -0.063 0.886
>3 0.028 -0.256 NA 0.354
>4 -0.086 0.475 NA 0.781
>5 0.690 -0.181 NA 1.633
>
>OR
>
> x1 x2 x3 x4
>1 0.482 1.320 NA -0.142
>2 -0.753 -0.041 -0.063 0.886
>3 0.028 -0.256 NA NA
>4 -0.086 0.475 0.244 NA
>5 0.690 -0.181 1.274 1.633
>
>OR
>
> x1 x2 x3 x4
>1 0.482 1.320 -0.859 -0.142
>2 -0.753 -0.041 -0.063 NA
>3 0.028 -0.256 -0.069 NA
>4 -0.086 0.475 0.244 NA
>5 0.690 -0.181 1.274 NA
>
>ETC. are all fine.
>
>Any ideas how I can do this?
>Chris
>
>______________________________________________
>R-help at r-project.org mailing list
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.
>