[R] Help with matching rows

Petr Savicky savicky at praha1.ff.cuni.cz
Thu Apr 21 09:34:29 CEST 2011


On Wed, Apr 20, 2011 at 10:09:26PM -0400, gary engstrom wrote:
> Dear Sir,
> 
> Please excuse my awkwardness as I am new to R and computers, but would kindly
> appreciate help.
> {
> a <- sample (1:10,100,replace=T )
> b <-sample(10:20,100,replace=T)
> c <- sample(20:30,100,replace=T)
> d <- sample(30:40,100,replace=T)
> e <- sample(40:50,100,replace=T)
> }
> d1 <- a
> d2 <- b
> d3 <-c
> d4 <- d
> d5 <- e
> 
> data.frame(d1,d2,d3,d4,d5)
> dd <- data.frame(d1,d2,d3,d4,d5)
> dd
> sd(d1)
> summary(d1)
> sd(d2)
> summary(d2)
> sd(d3)
> summary(d3)
> sd(d4)
> summary(d4)
> sd(d5)
> summary(d5)
> I am a beginner to R and am trying to learn statistical
> probability. I have started Dr. Levine and Dr Kerns books.
> So far from the usual sources, I haven't found the answers
> to the following questions and would greatly appreciate
> any assistance that anyone might kindly share.
> If I run this code, how do I look for duplicate rows and how can

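See ?duplicated . Applied to a data frame, duplicated() returns a
logical vector, which is TRUE for each row identical to an earlier row.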
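
As a small illustration of duplicated() on a data frame, here is a
sketch using a made-up five-row example (the values are invented for
illustration and are not from your data):

```r
# Made-up data frame: row 3 repeats row 1, and row 5 repeats row 2.
dd <- data.frame(d1 = c(1, 2, 1, 3, 2),
                 d2 = c(10, 11, 10, 12, 11))

dup <- duplicated(dd)   # TRUE for each row that repeats an earlier row
dup
any(dup)                # is there at least one duplicated row?
sum(dup)                # number of rows that repeat an earlier row
dd[dup, ]               # the repeated rows themselves

# All rows involved in a duplication, including the first occurrences
all_dup <- dd[dup | duplicated(dd, fromLast = TRUE), ]
all_dup
```

The same calls should work unchanged on the data frame dd built from
d1, ..., d5.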

>  I adjust the SD of the sample function to make the chances
> of a duplicate row occur more often ?

A simple way to increase the number of duplicated rows is to
reduce the space from which the rows are drawn.

The following simulation estimates the probability of having at
least one duplicated row using your original code.

  m <- 10000      # number of simulated data frames
  count <- 0      # how many of them contain a duplicated row
  for (i in 1:m) {
      d1 <- sample(1:10,100,replace=TRUE)
      d2 <- sample(10:20,100,replace=TRUE)
      d3 <- sample(20:30,100,replace=TRUE)
      d4 <- sample(30:40,100,replace=TRUE)
      d5 <- sample(40:50,100,replace=TRUE)
      dd <- data.frame(d1,d2,d3,d4,d5)
      if (any(duplicated(dd))) {
          count <- count + 1
      }
  }
  count/m         # estimated probability

I obtained

  [1] 0.035

This probability can also be computed exactly. The number of
possible rows is the product of the sizes of the sets from which
each component is chosen, which is 10*11^4. Writing N for this
number, the k-th row drawn (counting from k = 0) differs from the
k distinct rows before it with probability 1 - k/N, so all 100
rows are distinct with probability prod(1 - (0:99)/N). Hence, the
probability of having at least one duplicated row among 100 rows
drawn from the uniform distribution is

  N <- 10*11^4 # the number of all possible rows
  1 - prod(1 - (0:99)/N)
  [1] 0.03325143
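
The same computation can be wrapped in a small function, so that
other values of N are easy to try. This is only a sketch: the name
p_dup is made up here, and the comparison with the usual
birthday-problem approximation 1 - exp(-n*(n-1)/(2*N)) is an addition
for cross-checking, not part of the calculation above.

```r
# Exact probability of at least one repeated row among n rows drawn
# independently and uniformly from N possible rows.
# (p_dup is a hypothetical helper name.)
p_dup <- function(N, n = 100) {
  1 - prod(1 - (0:(n - 1)) / N)
}

N <- 10 * 11^4                       # number of possible rows, as above
exact <- p_dup(N)

# Birthday-problem approximation, reasonable when n is small relative to N
approx <- 1 - exp(-100 * 99 / (2 * N))

c(exact = exact, approx = approx)
```

The two values should agree closely when the number of rows is small
relative to N.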

If the sample space is reduced to 8^5 using

    d1 <- sample(1:8,100,replace=TRUE)
    d2 <- sample(11:18,100,replace=TRUE)
    d3 <- sample(21:28,100,replace=TRUE)
    d4 <- sample(31:38,100,replace=TRUE)
    d5 <- sample(41:48,100,replace=TRUE)

then the probability of having at least one duplicated row
increases to

  N <- 8^5
  1 - prod(1 - (0:99)/N)
  [1] 0.1403373
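
This value can be cross-checked by simulation, in the same way as
for the original code. The following is a sketch only: the seed is
arbitrary, and m is smaller than above to keep the run short, so
the estimate will be correspondingly rougher.

```r
set.seed(1)     # arbitrary seed, used only for reproducibility
m <- 2000       # number of simulated data frames (an assumption; larger is more accurate)

# For each repetition, draw 100 rows from the reduced space of 8^5 rows
# and record whether any row repeats an earlier one.
has_dup <- replicate(m, {
  dd <- data.frame(d1 = sample(1:8,   100, replace = TRUE),
                   d2 = sample(11:18, 100, replace = TRUE),
                   d3 = sample(21:28, 100, replace = TRUE),
                   d4 = sample(31:38, 100, replace = TRUE),
                   d5 = sample(41:48, 100, replace = TRUE))
  any(duplicated(dd))
})

est <- mean(has_dup)   # estimated probability of at least one duplicated row
est
```

The estimate should be near the exact value, up to the sampling error
of the simulation.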

Hope this helps.

Petr Savicky.


