[R] sampling rows with values never sampled before
Jon Skoien
jon.skoien at jrc.ec.europa.eu
Tue Jun 23 10:04:26 CEST 2015
If df is the data.frame with values and you want nn samples, then this
is a slightly different approach:
# example data.frame:
df = data.frame(a1 = sample(1:20, 50, replace = TRUE),
                a2 = sample(seq(0.1, 10, length.out = 30), 50, replace = TRUE),
                a3 = sample(seq(0.3, 20, length.out = 20), 50, replace = TRUE))
nrow = dim(df)[1] # 50
ncol = dim(df)[2] # 3
# start by randomizing the order in your data.frame
randomOrder = sample(1:nrow, nrow, replace = FALSE)
dff = df[randomOrder,]
# find and remove all duplicates from all columns. With this you will
# only keep the first instance of any unique value:
rem = NULL
for (ic in 1:ncol) rem = c(rem, which(duplicated(dff[, ic])))
if (length(rem) > 0) dff = dff[-unique(rem),]
# Reduce to the length you need (nn is the number of samples you want)
if (dim(dff)[1] > nn) res = dff[1:nn,] else res = dff
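
The steps above can also be collected into one function. This is only a sketch: the name sample_unique_rows and its arguments are made up for illustration, and it assumes df is a data.frame with at least one column.

```r
# Hypothetical helper wrapping the steps above; the name and arguments are
# illustrative and not part of any package.
sample_unique_rows <- function(df, nn) {
  # Shuffle the rows so that the "first instance" kept below is random
  dff <- df[sample(nrow(df)), , drop = FALSE]
  # Flag every row that repeats an earlier value in any column
  dup <- Reduce(`|`, lapply(dff, duplicated))
  dff <- dff[!dup, , drop = FALSE]
  # Keep at most nn rows (fewer if not enough rows survive)
  head(dff, nn)
}

set.seed(1)
df <- data.frame(a1 = sample(1:20, 50, replace = TRUE),
                 a2 = sample(seq(0.1, 10, length.out = 30), 50, replace = TRUE),
                 a3 = sample(seq(0.3, 20, length.out = 20), 50, replace = TRUE))
res <- sample_unique_rows(df, nn = 5)
```

Note that the result can have fewer than nn rows, because a row is dropped as soon as any of its values has appeared in an earlier row of the shuffled data.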
I am not sure how this scales if you have really big data, or whether
you could run into FAQ 7.31 problems (floating-point values that look
equal but are not, so duplicated() would miss them), depending on how
you fill your data.frame.
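
To illustrate the FAQ 7.31 concern, here is a small sketch of how duplicated() can miss values that print the same, and one possible workaround. Rounding to 8 digits is an arbitrary assumption; pick a precision that suits your data.

```r
# 0.1 * 3 prints as 0.3 but is not stored as exactly the same number
x <- c(0.3, 0.1 * 3, 0.3)

# Only the exact repeat (third element) is flagged
dup_exact <- duplicated(x)

# Rounding first (8 digits is an assumed tolerance) treats near-equal
# values as equal, so the second element is flagged as well
dup_rounded <- duplicated(round(x, 8))
```

If your columns come from arithmetic rather than being typed in or read from a file, applying round() before duplicated() in the loop above may be safer.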
Cheers,
Jon
On 6/23/2015 12:13 AM, C W wrote:
> Hi Jean,
>
> Thanks!
>
> Daniel,
> Yes, you are absolutely right. I want sampled vectors to be as different
> as possible.
>
> I added a little more to the earlier data set.
> x1 x2 x3
> [1,] 1 3.7 2.1
> [2,] 2 3.7 5.3
> [3,] 3 3.7 6.2
> [4,] 4 3.7 8.9
> [5,] 5 3.7 4.1
> [6,] 1 2.9 2.1
> [7,] 2 2.9 5.3
> [8,] 3 2.9 6.2
> [9,] 4 2.9 8.9
> [10,] 5 2.9 4.1
> [11,] 1 5.2 2.1
> [12,] 2 5.2 5.3
> [13,] 3 5.2 6.2
> [14,] 4 5.2 8.9
> [15,] 5 5.2 4.1
>
> If I sampled rows 1, 6, 11, solving the system of equations will not be
> possible. So, I am avoiding "similar vectors".
>
> Thanks,
>
> Mike
>
>
> On Mon, Jun 22, 2015 at 2:19 PM, Daniel Nordlund <djnordlund at frontier.com>
> wrote:
>
>> On 6/22/2015 9:42 AM, C W wrote:
>>
>>> Hello R list,
>>>
>>> I have a question about sampling unique coordinate values.
>>>
>>> Here's what my data looks like
>>>
>>> dat <- cbind(x1 = rep(1:5, 3), x2 = rep(c(3.7, 2.9, 5.2), each=5))
>>> dat
>>> x1 x2
>>> [1,] 1 3.7
>>> [2,] 2 3.7
>>> [3,] 3 3.7
>>> [4,] 4 3.7
>>> [5,] 5 3.7
>>> [6,] 1 2.9
>>> [7,] 2 2.9
>>> [8,] 3 2.9
>>> [9,] 4 2.9
>>> [10,] 5 2.9
>>> [11,] 1 5.2
>>> [12,] 2 5.2
>>> [13,] 3 5.2
>>> [14,] 4 5.2
>>> [15,] 5 5.2
>>>
>>>
>>> If I sampled (1, 3.7), then, I don't want (1, 2.9) or (2, 3.7).
>>>
>>> I want to avoid either the first or second coordinate repeated. It leads
>>> to undefined matrix inversion.
>>>
>>> I thought of using sample(), but not sure about applying it to a data
>>> frame.
>>>
>>> Thanks in advance,
>>>
>>> Mike
>>>
>>>
>> I am not sure you gave us enough information to solve your real world
>> problem. But I have a few comments and a potential solution.
>>
>> 1. In your example the unique values in x1 are completely crossed with
>> the unique values in x2.
>> 2. Since you don't want duplicates of either number, the maximum
>> number of samples that you can take is the minimum number of unique values
>> in either vector, x1 or x2 (in this case x2 with 3 unique values).
>> 3. Sample without replacement from the smallest set of unique values first.
>> 4. Sample without replacement from the larger set second.
>>
>> x <- 1:5
>> xx <- c(3.7, 2.9, 5.2)
>> s2 <- sample(xx,2, replace=FALSE)
>> s1 <- sample(x,2, replace=FALSE)
>> samp <- cbind(s1,s2)
>>
>> samp
>> s1 s2
>> [1,] 5 3.7
>> [2,] 1 5.2
>> Your actual data is probably larger, and the unique values in each vector
>> may not be completely crossed, in which case the task is a little harder.
>> In that case, you could remove values from your data as you sample. This
>> may not be efficient, but it will work.
>>
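>> smpl <- function(dat, size){
>>   mysamp <- numeric(0)
>>   for(i in 1:size) {
>>     if (nrow(dat) == 0) break  # no admissible rows left
>>     s <- dat[sample(nrow(dat),1),]
>>     mysamp <- rbind(mysamp,s, deparse.level=0)
>>     # drop rows sharing either coordinate; drop=FALSE keeps dat a matrix
>>     dat <- dat[!(dat[,1]==s[1] | dat[,2]==s[2]), , drop=FALSE]
>>   }
>>   mysamp
>> }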
>>
>>
>> This is just an example of how you might approach your real world
>> problem. There is no error checking, and for large samples it may not
>> scale well.
>>
>>
>> Hope this is helpful,
>>
>> Dan
>>
>> --
>> Daniel Nordlund
>> Bothell, WA USA
>>
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
--
Jon Olav Skøien
Joint Research Centre - European Commission
Institute for Environment and Sustainability (IES)
Climate Risk Management Unit
Via Fermi 2749, TP 100-01, I-21027 Ispra (VA), ITALY
jon.skoien at jrc.ec.europa.eu
Tel: +39 0332 789205
Disclaimer: Views expressed in this email are those of the individual and do not necessarily represent official views of the European Commission.