[BioC] Fisher-Yates algorithm for DNA shuffling ?

Thu Jun 25 17:14:33 CEST 2009

hi Sean,

On Tue, 2009-06-23 at 19:46 -0400, Sean Davis wrote:
> 
> 
> 2009/6/23 Hervé Pagès <hpages at fhcrc.org>
>         Hi Robert,
>         
>         
>         Robert Castelo wrote:
>                 dear list,
>                 
>                 i'm interested in using some proper DNA shuffling
>                 procedure like the
>                 Fisher-Yates algorithm:
>                 
>                 R. A. Fisher and F. Yates, Example 12, Statistical
>                 Tables, London, 1938.
>                 
>                 Richard Durstenfeld, Algorithm 235: Random
>                 permutation, CACM 7(7):420,
>                 July 1964.
>         
>         
>         The Fisher & Yates or Durstenfeld algos are just particular
>         implementations
>         of a ramdom permutation generator.
>         R has the sample() function to generate a random permutation.
>         The man page
>         doesn't specify what implementation is used but does that
>         really matter?
>         
>         
>                 
>                 i've been googling bioconductor and the r-project for
>                 this, and
>                 particularly looking at the Biostrings package, but
>                 I've failed to find
>                 anything. just in case i'm missing some implementation
>                 of this DNA
>                 shuffling in R, i'd like to ask the list whether
>                 anybody knows of that
>                 procedure being implemented in R or, even better, in a
>                 bioconductor
>                 package such that would work with DNAString objects.
>                 
>                 the reason for that is that i would like to simulate
>                 random DNA strings
>                 preserving nucleotide composition in order to
>                 calculate empirical
>                 p-values for motif scanning.
> 
> I guess it is obvious, but note that using random sequences based on
> single nucleotide frequencies may give you a more significant p-value
> (that is, it may be anti-conservative) than if you model the
> background more robustly using several nucleotides and taking into
> account genome differences in base content.  There are a fair number
> of papers describing HMMs or other models that attempt to capture the
> underlying background model for motif scanning.

thanks for pointing that out, i'd like definitely to try one of those
approaches. so far, i'm shuffling dinucleotides and my p-values haven't
changed much (a little more conservative, but not much more, the bulk of
more significant ones is below more or less the same level).

i assume because you haven't say it, that there's none of those more
sophisticated possibly-HMM-based approaches implemented in bioconductor,
right? i think it could be nice to have something like that
next-to/within-in the matchPattern() or countPattern() functions from
Biostrings, but this is just one opinion of an end-user and i don't
pretend to start a discussion on software design and requirements.

i've googled a bit and found lots of links to papers including keywords
like "HMM", "background" and "motif scanning" but it's difficult for me
to tell which one looks more promising to have some algorithm properly
described or, even better, some available implementation. do you have
any particular suggestion?

thanks!
robert.

> Sean
>  
>         
>         
>         Generate a random DNA string (of length 1000) with specific
>         nucleotide probabilities:
>         
>          x <- paste(sample(c("A", "C", "G", "T"), 1000, replace=TRUE,
>                     prob=c(0.2, 0.55, 0.1, 0.15)), collapse="")
>          x <- DNAString(x)
>         
>         Then:
>         
>          > alphabetFrequency(x, baseOnly=TRUE)
>              A     C     G     T other
>            214   531   110   145     0
>         
>         If you really want to shuffle a given DNAString object x0 in
>         order to
>         preserve its nucleotide composition:
>         
>          x1 <- x0[sample(length(x0))]
>         
>         For example, shuffling the 'x' obtained above gives:
>         
>          x1 <- x[sample(length(x))]
>         
>          > alphabetFrequency(x1, baseOnly=TRUE)
>              A     C     G     T other
>            214   531   110   145     0
>         
>         Performance:
>         
>          > system.time(for (i in 1:10000) {y <- x[sample(length(x))]})
>             user  system elapsed
>            5.760   0.008   5.764
>         
>         Everything is already here and probably fast enough. No need
>         to
>         re-implement anything.
>         
>         Cheers,
>         
>         
>         H.
>         
>         
>                 
>                 if it is not yet implemented i would like to suggest
>                 the package
>                 maintainers of Biostrings or any other biological
>                 sequence
>                 infrastructure package in Bioconductor to have it
>                 implemented in C for
>                 having maximum speed with this kind of simulations.
>                 just in case it
>                 eases my request here goes some simple R code
>                 implementing the
>                 Fisher-Yates shuffling algorithm:
>                 
>                 fy <- function(seq) {
>                  seq <- unlist(strsplit(seq, ""))
>                  n <- length(seq)
>                  i <- n
>                  while (i > 0) {
>                    j <- floor(runif(1)*i+1)
>                    if (i != j) {
>                      tmp <- seq[i]
>                      seq[i] <- seq[j]
>                      seq[j] <- tmp
>                    }
>                    i <- i - 1
>                  }
>                  paste(seq, collapse="")
>                 }
>                 
>                 and this is an example of how to use it:
>                 
>                 # generate some random DNA sequence of length 3
>                 seq <- paste(sample(c("A","C","G","T"), size=3,
>                 replace=TRUE), collapse="")
>                 seq
>                 [1] "ATG"
>                 
>                 # shuffle the sequence 10,000 times and store all the
>                 permutations
>                 shuffledseqs <- c()
>                 for (i in 1:10000) shuffledseqs <- c(shuffledseqs,
>                 fy(seq))
>                 
>                 # verify that indeed the permutations occur uniformly
>                 at random
>                 base::table(shuffledseqs)/10000
>                 shuffledseqs
>                   AGT    ATG    GAT    GTA    TAG    TGA 0.1693 0.1682
>                 0.1678 0.1629 0.1610 0.1708  1/6
>                 [1] 0.1666667
>                 
>                 
>                 thanks!
>                 robert.
>                 
>                 _______________________________________________
>                 Bioconductor mailing list
>                 Bioconductor at stat.math.ethz.ch
>                 https://stat.ethz.ch/mailman/listinfo/bioconductor
>                 Search the archives:
>                 http://news.gmane.org/gmane.science.biology.informatics.conductor
>         
>         
>         -- 
>         Hervé Pagès
>         
>         Program in Computational Biology
>         Division of Public Health Sciences
>         Fred Hutchinson Cancer Research Center
>         1100 Fairview Ave. N, M2-B876
>         P.O. Box 19024
>         Seattle, WA 98109-1024
>         
>         E-mail: hpages at fhcrc.org
>         Phone:  (206) 667-5791
>         Fax:    (206) 667-1319
>         
>         
>         
>         _______________________________________________
>         Bioconductor mailing list
>         Bioconductor at stat.math.ethz.ch
>         https://stat.ethz.ch/mailman/listinfo/bioconductor
>         Search the archives:
>         http://news.gmane.org/gmane.science.biology.informatics.conductor
>         
>