[BioC] Fisher-Yates algorithm for DNA shuffling ?

Robert Castelo robert.castelo at upf.edu
Tue Jun 23 15:58:53 CEST 2009


dear list,

i'm interested in using some proper DNA shuffling procedure like the
Fisher-Yates algorithm:

R. A. Fisher and F. Yates, Example 12, Statistical Tables, London, 1938.

Richard Durstenfeld, Algorithm 235: Random permutation, CACM 7(7):420,
July 1964.

i've been googling bioconductor and the r-project for this, and
particularly looking at the Biostrings package, but I've failed to find
anything. just in case i'm missing some implementation of this DNA
shuffling in R, i'd like to ask the list whether anybody knows of that
procedure being implemented in R or, even better, in a bioconductor
package such that would work with DNAString objects.

the reason for that is that i would like to simulate random DNA strings
preserving nucleotide composition in order to calculate empirical
p-values for motif scanning.

if it is not yet implemented i would like to suggest the package
maintainers of Biostrings or any other biological sequence
infrastructure package in Bioconductor to have it implemented in C for
having maximum speed with this kind of simulations. just in case it
eases my request here goes some simple R code implementing the
Fisher-Yates shuffling algorithm:

fy <- function(seq) {
  seq <- unlist(strsplit(seq, ""))
  n <- length(seq)
  i <- n
  while (i > 0) {
    j <- floor(runif(1)*i+1)
    if (i != j) {
      tmp <- seq[i]
      seq[i] <- seq[j]
      seq[j] <- tmp
    }
    i <- i - 1
  }
  paste(seq, collapse="")
}

and this is an example of how to use it:

# generate some random DNA sequence of length 3
seq <- paste(sample(c("A","C","G","T"), size=3, replace=TRUE), collapse="")
seq
[1] "ATG"

# shuffle the sequence 10,000 times and store all the permutations
shuffledseqs <- c()
for (i in 1:10000) shuffledseqs <- c(shuffledseqs, fy(seq))

# verify that indeed the permutations occur uniformly at random
base::table(shuffledseqs)/10000
shuffledseqs
   AGT    ATG    GAT    GTA    TAG    TGA 
0.1693 0.1682 0.1678 0.1629 0.1610 0.1708 
 1/6
[1] 0.1666667


thanks!
robert.



More information about the Bioconductor mailing list