[BioC] how to operate on a DNAStringSet object
Chris Seidel
seidel at phaget4.org
Sat Mar 23 04:24:29 CET 2013
Hello Steve and Herve,
Thanks very much! Holy cow! Randomize a string set with endoapply(xx,
sample). This is the essence of my love/hate relationship with
programming. I've spent my share of quality time with the IRanges
vignette, and I still would have never thought of that 21 character one-
liner!
I did run across your previous post Steve about unname, it was what
prompted my experimentation. I'm also glad to know that my intition of
how to sew things back together might involve some kind of Views
function (as Herve suggests). Anyway, thanks for the suggestions and
solutions. Enormously helpful. The application involves searching ChIP
Seq peaks (as represented by a DNAStringSet) for cis-regulatory
elements, and scrambing the sequences is one method of creating a
background, and the methods below are easy, fast, and work great.
-Chris
On Fri, Mar 22, 2013 at 9:18 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
> Hi Chris, Steve,
>
> I tried Steve's 'endoapply(xx, sample)' and was surprised/disappointed
> that it's about 10x slower than doing lapply() :-/
> The reason for that is that it was actually calling the "endoapply"
> method for List objects, which is very inefficient.
>
> I just added an "endoapply" method for XVectorList objects
(DNAStringSet
> objects are just particular XVectorList objects) that is as fast as
> doing lapply():
>
> > library(Biostrings)
> > library(hgu95av2probe)
> > probes <- DNAStringSet(hgu95av2probe)
> > system.time(probes2 <- endoapply(probes, sample))
> user system elapsed
> 389.928 0.188 390.935
>
> Still not particularly fast but 10x faster than before.
>
> This is in Biostrings 2.27.12 (devel).
>
> A faster (but more complicated) way to go is to use a 3-step approach:
> (1) unlist 'probes' (produces a single big DNAString object), (2)
> operate on the unlisted object, and (3) relist it:
>
> (1) Unlist:
>
> unlisted <- unlist(probes)
>
> (2) Shuffle the nucleotides in 'unlisted'. One complication is that
> the shuffling must not move nucleotides across the boundaries
> of the probes. This requires a little bit of extra work:
>
> probes_width <- width(probes)
> shuffling_idx <- IntegerList(lapply(probes_width, sample))
> offsets <- cumsum(c(0L, probes_width[-length(probes_width)]))
> shuffling_idx <- unlist(shuffling_idx + offsets)
>
> ## Sanity check.
> stopifnot(identical(sort(shuffling_idx), seq_along(unlisted)))
>
> unlisted2 <- unlisted[shuffling_idx]
>
> (3) Relist. Unfortunately, there is no "relist' method for DNAString
> objects at the moment (I still need to add one). In the mean
> time we can do:
>
> probes2b <- as(successiveViews(unlisted2, width(probes)),
> "DNAStringSet")
>
> With this method, it takes only about 10 sec on my laptop to shuffle
> the nucleotides of all the probes in hgu95av2probe. If you are careful
> to set the RNG seed to the same value (via set.seed) before you run
> endoapply() and the 3-step approach, you'll end up with exactly the
same
> result i.e. 'identical(as.character(probes2b), as.character(probes2))'
> will be TRUE.
>
> Cheers,
> H.
>
>
> On 03/21/2013 02:07 PM, Steve Lianoglou wrote:
>>
>> And upon one more second's worth of inspection, the endoapply
>> suggestion actually is a for-loop under the covers, so you won't be
>> buying time ... I guess the lapply will go faster ...
>>
>> -steve
>>
>> On Thu, Mar 21, 2013 at 5:05 PM, Steve Lianoglou
>> <mailinglist.honeypot at gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> On Thu, Mar 21, 2013 at 4:48 PM, Chris Seidel <seidel at phaget4.org>
wrote:
>>> [aggressive clipping]
>>>
>>>> What's odd, is that this actually works:
>>>>
>>>> DNAStringSet(do.call(c,unlist(myRandomizedseqs)))
>>>>
>>>> *IF* the sequences are NOT NAMED.
>>>
>>>
>>> This (or similar things) have come up before on the ML, but I don't
>>> have time to search for it right now. I posted a suggestion that I
use
>>> "unname" defensively to sidestep these corner cases. Perhaps that
will
>>> help you find the thread when searching the archives. In any event,
>>> you could do:
>>>
>>> R> DNAStringSet(do.call(c, unname(unlist(...))))
>>>
>>> Now that I look at your example, I think the thread I'm talking
about
>>> might have been slightly different, but I guess this should still
work
>>> in your case.
>>>
>>>> How does one operate on the sequences of a DNAStringSet object
without
>>>> getting a list back, or without a for loop? I'm sure there's some
>>>> elegant one-liner that completely escapes me.
>>>
>>>
>>> To randomize the sequences, you could do:
>>>
>>> R> xx <- DNAStringSet(c("GATACA", "GATCCTAA"))
>>> R> endoapply(xx, sample)
>>> A DNAStringSet instance of length 2
>>> width seq
>>> [1] 6 ACGATA
>>> [2] 8 GTCATAAC
>>>
>>> Where did that come from, right?
>>>
>>> Note that a DNAStringSet is an IRanges::Vector, and you'll find lots
>>> of things in the IRangesOverview vignette, which at first might seem
>>> like to long/detailed to read, but will be worth your time.
>>>
>>> Not sure how fast this will be on large XStringSet object, though.
You
>>> may not buy yourself more speed than the for loop, but can't test
that
>>> right now. Perhaps lapply(DNAstringSet, sample) might be faster, but
>>> I'll leave this as an exercise for the reader.
>>>
>>> HTH,
>>> -steve
>>>
>>> --
>>> Steve Lianoglou
>>> Defender of The Thesis
>>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
More information about the Bioconductor
mailing list