[Bioc-sig-seq] Local Context

Hervé Pagès hpages at fhcrc.org
Sat Jun 26 02:07:47 CEST 2010


Hi Jim,

On 06/21/2010 04:56 PM, James Bullard wrote:
> Hi,
>
> This must be readily doable and I am just missing the obvious -- I am
> trying, for a long sequence, to classify each position as coming from a
> particular context. Something like running nucleotide table. The things
> that I have thought about doing are:
>
> .) Making Views representing my sliding windows, but then I don't see an
> obvious thing except nucleotide frequency, but that doesn't really solve
> the problem.

Making Views representing a sliding windows on a very long sequence like
e.g. Human chr1 generates a big object: 2 integers per position -> 2GB!
And then there is the problem that looping over the views in R is not an
option. So yes, as Patrick mentioned, there is the
letterFrequencyInSlidingView() function that addresses the particular
case of extracting the nucleotide frequencies for each view (it's
optimized, all written in C, and avoids the cost of actually
generating the Views object) but we don't have a general and
efficient mechanism for applying an arbitrary function to the
sliding views.

However, if one wants to compute the frequency of an arbitrary pattern
along the sliding views, something like this can be done:

   library(Biostrings)

   countPatternInSlidingView <- function(pattern, subject, view.width)
   {
     view.width <- as.integer(view.width)
     if (length(subject) < view.width)
         stop("length(subject) < view.width")
     nviews <- length(subject) - view.width + 1L
     if (nchar(pattern) > view.width)
         return(integer(nviews))
     hits <- matchPattern(pattern, subject, algo="naive-exact")
     NLP <- function(total_length, points)
     {
       idx <- logical(total_length)
       idx[points] <- TRUE
       NLP <- cumsum(idx)
       return(NLP)
     }
     ## 'NLP0' has the length of 'subject'
     NLP0 <- NLP(length(subject), start(hits))
     ## 'd1' is guaranteed to be >= 0
     d1 <- view.width - nchar(pattern)
     NLP1 <- NLP0[-seq_len(d1)]
     length(NLP1) <- nviews
     NLP2 <- c(0L, NLP0)
     length(NLP2) <- nviews
     return(NLP1 - NLP2)
   }

Then:
   > countPatternInSlidingView("AT", DNAString("ATATTGATA"),
                               view.width=4)
   [1] 2 1 1 0 1 1

It's not as fast as letterFrequencyInSlidingView() but is still ok:

   > library(BSgenome.Hsapiens.UCSC.hg18)
   > chr1 <- unmasked(Hsapiens$chr1)
   > system.time(TAACCCfreq <- countPatternInSlidingView("TAACCC",
                                               chr1, view.width=40))
      user  system elapsed
    14.250   5.470  21.099
   > TAACCCfreq[1:20]
    [1] 6 5 6 6 6 6 6 5 6 6 6 6 6 5 6 6 6 6 6 5

Not sure this helps for the problem you have though...

>
> .) Using nucleotideFrequencyAt, however, I am not sure that this function
> does what I want. Since it takes an XStringSet type object rather than a
> DNAString type object.

If you've defined your sliding views on your sequence e.g.:

   > v <- Views(DNAString("ACCCGTCGTGTTGAAT"), start=1:13, width=4)
   > v
     Views on a 16-letter DNAString subject
   subject: ACCCGTCGTGTTGAAT
   views:
        start end width
    [1]     1   4     4 [ACCC]
    [2]     2   5     4 [CCCG]
    [3]     3   6     4 [CCGT]
    [4]     4   7     4 [CGTC]
    [5]     5   8     4 [GTCG]
    [6]     6   9     4 [TCGT]
    [7]     7  10     4 [CGTG]
    [8]     8  11     4 [GTGT]
    [9]     9  12     4 [TGTT]
   [10]    10  13     4 [GTTG]
   [11]    11  14     4 [TTGA]
   [12]    12  15     4 [TGAA]
   [13]    13  16     4 [GAAT]

Then you can call nucleotideFrequencyAt() on it:

   > nucleotideFrequencyAt(v, c(1,4))
     A C G T
   A 0 1 0 0
   C 0 1 2 1
   G 0 0 2 2
   T 2 0 0 2

What you get here are the frequencies for the dinucleotides
obtained by picking the 1st and last nucleotides of each
view. Again, I'm not sure this is what you want...

Can you be more precise?

Thanks,
H.

>
> thanks, jim
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing



More information about the Bioc-sig-sequencing mailing list