[Bioc-devel] Iterating over BSgenomeViews returns DNAString instead of BSgenomeViews

Thu Apr 13 10:50:51 CEST 2017

You could eventually point your student to MaskedXString and
oligonucleotideFrequency(). You can mask the repeats and then just run
the latter to count the N-mers. Comparing their original code to the
code based on existing high-level utilities might be a useful
exercise.
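
A minimal sketch of that approach, assuming a toy sequence and made-up
mask coordinates (in real use the mask would come from the repeat
annotation, and the k-mer width from the package's own settings):

```r
library(Biostrings)

# Toy sequence: the run of T's stands in for a repeat region
# (hypothetical data, for illustration only).
dna <- DNAString("ACGTACGTTTTTTTTTACGTACGA")

# Assigning a mask turns the DNAString into a MaskedDNAString.
# Positions 9-16 are the assumed repeat coordinates.
masked <- dna
masks(masked) <- Mask(mask.width = length(dna), start = 9, end = 16)

# Count 3-mers; the masked region is skipped.
freq <- oligonucleotideFrequency(masked, width = 3)

# Show only the 3-mers that actually occur.
freq[freq > 0]
```

Because the counting happens in one call on the masked sequence, there is
no explicit loop over views.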

Michael

On Wed, Apr 12, 2017 at 8:24 PM, Pariksheet Nanda
<pariksheet.nanda at uconn.edu> wrote:
> On Fri, Apr 7, 2017 at 1:13 AM, Hervé Pagès <hpages at fredhutch.org> wrote:
>>
>> This is the expected behavior.
>>
>> Some background: BSgenomeViews are list-like objects where the *list
>> elements* (i.e. the elements one extracts with [[) are the DNA
>> sequences from the views
> --snip--
>> The important difference is that with [[ I get a DNAString object
>> (the content of the view) and with [ I get a BSgenomeViews object
>> of length 1.
>
> Thank you, Hervé!
>
> I was failing to make the connection with the `[[` accessor.
>
>
> On Fri, Apr 7, 2017 at 1:16 AM, Michael Lawrence <lawrence.michael at gene.com>
> wrote:
>>
>> I'm curious as to why you are looping over the views in the first
>> place. Maybe we could arrive at a vectorized solution, which is often
>> but not always simpler and faster.
>
> Hi Michael!
>
> Broad background is I'm acculturating an undergraduate student to writing a
> Bioconductor package and applying software engineering practices of version
> control, unit testing, documenting, dependency setup and validation in a
> different environment on our university HPC cluster, etc.  The student also
> came along to LibrePlanet to better understand the culture of software
> freedom :o)  The package goal is to use Biostrings to look for repeating
> DNA sequences of a fixed kmer size and subset to portions of the genome
> without repeats (an aligner can do this ofc, but the goal is to teach R and
> engineering practices).
>
> I appreciate your thoughtfulness for vectorizing the code to best use
> BSgenomeViews, but please don't spend more than 10 minutes as I have to
> balance changes to the code with the student's learning and coding "voice"
> and may not do proper justice to more of your effort.  My slowness to
> reply came from getting the project further along to be more understandable.
> Here is the line, which I've updated as Hervé suggested to use seq_along():
> https://github.com/coregenomics/kmap/blob/4adaed6b8007e8ea39f39ff57a42a821445d3d46/R/BiostringsProjectNEW.R#L185
> (I'm having a hard time thinking of how to summarize a small example out
> of context).
> Although in that line ranges_hits() is only operating on single indices,
> ranges_hits() was written to process groups of indices to reduce
> multi-processor communication.  Generating such sets of indices would
> involve applying width() to the views inside mappable() to break it into
> chunks of, say, a million bases for matchPDict().  Again, I'm linking to
> the code for anything that stands out at you, but I will feel bad if you
> spend a lot of time on it.
>
>
>> H.
>
>> Michael
>
> Pariksheet
>