[BioC] how to operate on a DNAStringSet object

Sat Mar 23 04:24:29 CET 2013

Hello Steve and Herve,

Thanks very much! Holy cow! Randomize a string set with endoapply(xx, 
sample). This is the essence of my love/hate relationship with 
programming. I've spent my share of quality time with the IRanges 
vignette, and I still would have never thought of that 21 character one-
liner!

I did run across your previous post Steve about unname, it was what 
prompted my experimentation. I'm also glad to know that my intition of 
how to sew things back together might involve some kind of Views 
function (as Herve suggests). Anyway, thanks for the suggestions and 
solutions. Enormously helpful. The application involves searching ChIP 
Seq peaks (as represented by a DNAStringSet) for cis-regulatory 
elements, and scrambing the sequences is one method of creating a 
background, and the methods below are easy, fast, and work great.

-Chris

On Fri, Mar 22, 2013 at 9:18 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
> Hi Chris, Steve,
>
> I tried Steve's 'endoapply(xx, sample)' and was surprised/disappointed
> that it's about 10x slower than doing lapply() :-/
> The reason for that is that it was actually calling the "endoapply"
> method for List objects, which is very inefficient.
>
> I just added an "endoapply" method for XVectorList objects 
(DNAStringSet
> objects are just particular XVectorList objects) that is as fast as
> doing lapply():
>
>   > library(Biostrings)
>   > library(hgu95av2probe)
>   > probes <- DNAStringSet(hgu95av2probe)
>   > system.time(probes2 <- endoapply(probes, sample))
>      user  system elapsed
>   389.928   0.188 390.935
>
> Still not particularly fast but 10x faster than before.
>
> This is in Biostrings 2.27.12 (devel).
>
> A faster (but more complicated) way to go is to use a 3-step approach:
> (1) unlist 'probes' (produces a single big DNAString object), (2)
> operate on the unlisted object, and (3) relist it:
>
>   (1) Unlist:
>
>         unlisted <- unlist(probes)
>
>   (2) Shuffle the nucleotides in 'unlisted'. One complication is that
>       the shuffling must not move nucleotides across the boundaries
>       of the probes. This requires a little bit of extra work:
>
>         probes_width <- width(probes)
>         shuffling_idx <- IntegerList(lapply(probes_width, sample))
>         offsets <- cumsum(c(0L, probes_width[-length(probes_width)]))
>         shuffling_idx <- unlist(shuffling_idx + offsets)
>
>         ## Sanity check.
>         stopifnot(identical(sort(shuffling_idx), seq_along(unlisted)))
>
>         unlisted2 <- unlisted[shuffling_idx]
>
>   (3) Relist. Unfortunately, there is no "relist' method for DNAString
>       objects at the moment (I still need to add one). In the mean
>       time we can do:
>
>         probes2b <- as(successiveViews(unlisted2, width(probes)),
>                        "DNAStringSet")
>
> With this method, it takes only about 10 sec on my laptop to shuffle
> the nucleotides of all the probes in hgu95av2probe. If you are careful
> to set the RNG seed to the same value (via set.seed) before you run
> endoapply() and the 3-step approach, you'll end up with exactly the 
same
> result i.e. 'identical(as.character(probes2b), as.character(probes2))'
> will be TRUE.
>
> Cheers,
> H.
>
>
> On 03/21/2013 02:07 PM, Steve Lianoglou wrote:
>>
>> And upon one more second's worth of inspection, the endoapply
>> suggestion actually is a for-loop under the covers, so you won't be
>> buying time ... I guess the lapply will go faster ...
>>
>> -steve
>>
>> On Thu, Mar 21, 2013 at 5:05 PM, Steve Lianoglou
>> <mailinglist.honeypot at gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> On Thu, Mar 21, 2013 at 4:48 PM, Chris Seidel <seidel at phaget4.org> 
wrote:
>>> [aggressive clipping]
>>>
>>>> What's odd, is that this actually works:
>>>>
>>>> DNAStringSet(do.call(c,unlist(myRandomizedseqs)))
>>>>
>>>> *IF* the sequences are NOT NAMED.
>>>
>>>
>>> This (or similar things) have come up before on the ML, but I don't
>>> have time to search for it right now. I posted a suggestion that I 
use
>>> "unname" defensively to sidestep these corner cases. Perhaps that 
will
>>> help you find the thread when searching the archives. In any event,
>>> you could do:
>>>
>>> R> DNAStringSet(do.call(c, unname(unlist(...))))
>>>
>>> Now that I look at your example, I think the thread I'm talking 
about
>>> might have been slightly different, but I guess this should still 
work
>>> in your case.
>>>
>>>> How does one operate on the sequences of a DNAStringSet object 
without
>>>> getting a list back, or without a for loop? I'm sure there's some
>>>> elegant one-liner that completely escapes me.
>>>
>>>
>>> To randomize the sequences, you could do:
>>>
>>> R> xx <- DNAStringSet(c("GATACA", "GATCCTAA"))
>>> R> endoapply(xx, sample)
>>>    A DNAStringSet instance of length 2
>>>      width seq
>>> [1]     6 ACGATA
>>> [2]     8 GTCATAAC
>>>
>>> Where did that come from, right?
>>>
>>> Note that a DNAStringSet is an IRanges::Vector, and you'll find lots
>>> of things in the IRangesOverview vignette, which at first might seem
>>> like to long/detailed to read, but will be worth your time.
>>>
>>> Not sure how fast this will be on large XStringSet object, though. 
You
>>> may not buy yourself more speed than the for loop, but can't test 
that
>>> right now. Perhaps lapply(DNAstringSet, sample) might be faster, but
>>> I'll leave this as an exercise for the reader.
>>>
>>> HTH,
>>> -steve
>>>
>>> --
>>> Steve Lianoglou
>>> Defender of The Thesis
>>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center