[R] sampling rows with values never sampled before
Daniel Nordlund
djnordlund at frontier.com
Mon Jun 22 20:19:04 CEST 2015
On 6/22/2015 9:42 AM, C W wrote:
> Hello R list,
>
> I have a question about sampling unique coordinate values.
>
> Here's what my data looks like
>
>> dat <- cbind(x1 = rep(1:5, 3), x2 = rep(c(3.7, 2.9, 5.2), each=5))
>> dat
>       x1  x2
>  [1,]  1 3.7
>  [2,]  2 3.7
>  [3,]  3 3.7
>  [4,]  4 3.7
>  [5,]  5 3.7
>  [6,]  1 2.9
>  [7,]  2 2.9
>  [8,]  3 2.9
>  [9,]  4 2.9
> [10,]  5 2.9
> [11,]  1 5.2
> [12,]  2 5.2
> [13,]  3 5.2
> [14,]  4 5.2
> [15,]  5 5.2
>
>
> If I sampled (1, 3.7), then, I don't want (1, 2.9) or (2, 3.7).
>
> I want to avoid either the first or second coordinate repeated. It leads
> to undefined matrix inversion.
>
> I thought of using sampling(), but not sure about applying it to a data
> frame.
>
> Thanks in advance,
>
> Mike
>
I am not sure you gave us enough information to solve your real world
problem. But I have a few comments and a potential solution.
1. In your example the unique values in x1 are completely crossed
with the unique values in x2.
2. Since you don't want duplicates of either number, the maximum
number of samples that you can take is the smaller of the two counts of
unique values in x1 and x2 (in this case x2, with 3 unique values).
3. Sample without replacement from the smaller set of unique values first.
4. Sample without replacement from the larger set second.
> x <- 1:5
> xx <- c(3.7, 2.9, 5.2)
> s2 <- sample(xx,2, replace=FALSE)
> s1 <- sample(x,2, replace=FALSE)
> samp <- cbind(s1,s2)
>
> samp
     s1  s2
[1,]  5 3.7
[2,]  1 5.2
>
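As a rough sketch, the steps above could be wrapped in a function that works on dat directly. The name sample_crossed is made up for illustration, and the function assumes the data are completely crossed as in point 1; otherwise it can return a pair that is not a row of dat.

```r
# Hypothetical helper (not from the original post): assumes every unique
# value in column 1 occurs together with every unique value in column 2.
sample_crossed <- function(dat, size) {
  u1 <- unique(dat[, 1])
  u2 <- unique(dat[, 2])
  # the sample can be no larger than the smaller set of unique values
  max_size <- min(length(u1), length(u2))
  stopifnot(size <= max_size)
  # sample positions rather than values, which avoids sample()'s special
  # handling of a single numeric value
  cbind(x1 = u1[sample.int(length(u1), size)],
        x2 = u2[sample.int(length(u2), size)])
}

dat <- cbind(x1 = rep(1:5, 3), x2 = rep(c(3.7, 2.9, 5.2), each = 5))
set.seed(1)
samp <- sample_crossed(dat, 3)
samp
```

Each column is drawn without replacement, so no value can repeat within x1 or within x2.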
Your actual data is probably larger, and the unique values in each
vector may not be completely crossed, in which case the task is a little
harder. In that case, you could remove values from your data as you
sample. This may not be efficient, but it will work.
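smpl <- function(dat, size){
  mysamp <- numeric(0)
  for(i in seq_len(size)) {
    # draw one of the remaining rows at random
    s <- dat[sample(nrow(dat),1),]
    mysamp <- rbind(mysamp,s, deparse.level=0)
    # remove every row that shares either coordinate with the draw;
    # drop=FALSE keeps dat a matrix even when only one row is left
    dat <- dat[!(dat[,1]==s[1] | dat[,2]==s[2]), , drop=FALSE]
  }
  mysamp
}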
This is just an example of how you might approach your real world
problem. There is no error checking, and for large samples it may not
scale well.
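If you want some minimal checking, one possible variant of the row-removal idea is sketched below. The name smpl_checked and the data in dat2 are invented for illustration. It stops with a warning once no eligible rows remain, so it may return fewer rows than requested.

```r
# Hypothetical variant with basic guards (names are illustrative only)
smpl_checked <- function(dat, size) {
  dat <- as.matrix(dat)
  # start with an empty matrix that has the same columns as dat
  picked <- matrix(numeric(0), nrow = 0, ncol = ncol(dat),
                   dimnames = list(NULL, colnames(dat)))
  for (i in seq_len(size)) {
    if (nrow(dat) == 0) {
      warning("ran out of eligible rows after ", nrow(picked), " draws")
      break
    }
    # draw one remaining row, keeping it as a one-row matrix
    s <- dat[sample.int(nrow(dat), 1), , drop = FALSE]
    picked <- rbind(picked, s)
    # keep only rows that differ from the draw in both coordinates
    keep <- dat[, 1] != s[1, 1] & dat[, 2] != s[1, 2]
    dat <- dat[keep, , drop = FALSE]
  }
  picked
}

# example data in which the two columns are not completely crossed
dat2 <- cbind(x1 = c(1, 1, 2, 3, 3, 4),
              x2 = c(3.7, 2.9, 2.9, 5.2, 3.7, 5.2))
set.seed(42)
out <- smpl_checked(dat2, 2)
out
```

Because each draw is random, this greedy approach is not guaranteed to reach the largest possible sample when the data are not completely crossed.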
Hope this is helpful,
Dan
--
Daniel Nordlund
Bothell, WA USA