[Bioc-devel] Patch for Biostrings::letterFrequencyInSlidingView over XStringSet

Steve Lianoglou mailinglist.honeypot at gmail.com
Thu Feb 17 20:47:45 CET 2011


I've got a suit at home for special occasions, and I'm wondering if I
should put it on so I can start lobbying to get my patch into
Biostrings? :-)

Seriously though ... I think it's a worthwhile addition to Biostrings
as it seems like a natural extension to the
Biostrings::letterFrequencyInSlidingView function so you can perform
this operation over a large number of events (sequences) very quickly.

Perhaps I can provide a motivating example?

I'm working with sequence data, and I needed to remove reads that seem
to be a result of an artifact. I rigged up a "score" for this artifact
by calculating the frequency of A's in a sliding window in the
neighborhood of where my reads align.

So. Imagine that `v` is a DNAStringSet that consists of the genomic
sequences in the neighborhoods of my alignments. Looping over these
sequence elements in R and calculating the nucleotide frequency there
takes a long time. Below I'm just showing the timing for 200 such
regions:

R> system.time(scores <- lapply(v[1:200],
letterFrequencyInSlidingView, 10, 'A'))
  user  system elapsed
 2.197   0.002   2.199

Running this code for all of the reads in my NGS experiment would be
prohibitively slow.

When I implemented the C code to loop over the same set of 200
sequences, the picture/timing changed drastically:

R> system.time(scores2 <- letterFrequencyInSlidingView(v[1:200], 10, 'A'))
  user  system elapsed
 0.011   0.000   0.011

Furthermore, the results of these calculations are identical:

R> all(mapply(identical, scores, scores2))
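
To make the idea concrete, here is a minimal sketch in plain C of what
"looping in C" over a set of sequences can look like. It is not the code
in the patch and it uses none of the Biostrings internals: the function
names (sliding_letter_count, sliding_letter_count_set), the use of plain
NUL-terminated strings, and the incremental window update are all
assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of Biostrings): count occurrences of
 * `letter` in every window of `width` characters of `seq`.
 * Writes (len - width + 1) counts into `out` and returns that number,
 * or 0 if the sequence is shorter than the window. */
size_t sliding_letter_count(const char *seq, size_t len, size_t width,
                            char letter, int *out)
{
    assert(seq != NULL && out != NULL);
    if (width == 0 || len < width)
        return 0;

    /* Count the letter in the first window directly. */
    int count = 0;
    for (size_t i = 0; i < width; i++)
        count += (seq[i] == letter);
    out[0] = count;

    /* Slide by one: drop the character that leaves the window,
     * add the character that enters it. */
    size_t nwin = len - width + 1;
    for (size_t i = 1; i < nwin; i++) {
        count -= (seq[i - 1] == letter);
        count += (seq[i + width - 1] == letter);
        out[i] = count;
    }
    return nwin;
}

/* Hypothetical set-level loop: apply the helper to each of `nseq`
 * NUL-terminated sequences. out[k] must have room for
 * strlen(seqs[k]) - width + 1 ints; nwin[k] receives the number of
 * windows actually written for sequence k. */
void sliding_letter_count_set(const char *const *seqs, size_t nseq,
                              size_t width, char letter,
                              int **out, size_t *nwin)
{
    for (size_t k = 0; k < nseq; k++)
        nwin[k] = sliding_letter_count(seqs[k], strlen(seqs[k]),
                                       width, letter, out[k]);
}
```

In this sketch the whole set is handled inside one C call rather than
one call per sequence from R. I'd guess that avoiding the per-element
R-level dispatch is where most of the speedup above comes from, though
I haven't profiled it.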

So ... that's just one example of how it's practically useful. I
guess others will find their own ways to use the function (or not), but
there are many functions in Biostrings that work quickly/efficiently
over XStringSets, and I think (obviously) letterFrequencyInSlidingView
should, too.

I'd be happy to groom the code in any way that seems fit to get this
into Biostrings trunk, if that's what's preventing it from being applied.

Thanks,
-steve

On Fri, Feb 11, 2011 at 2:22 AM, Steve Lianoglou
<mailinglist.honeypot at gmail.com> wrote:
> Hi,
>
> I recently needed to have letterFrequencyInSlidingView work on
> multiple strings at once.
>
> I initially iterated over my XStringSet in R and passed each element
> to the letterFrequencyInSlidingView method, but this was really slow
> over large XStringSet objects.
>
> I bit the bullet and wrote the loop in C as the
> "XStringSet_letterFrequencyInSlidingView" you'll see defined in this
> diff.
>
> That function basically calls a slightly modified
> XString_letterFrequencyInSlidingView, which is also added as
> _XString_letterFrequencyInSlidingView.
>
> Slight changes to the documentation, etc. are also included in this
> patch, which is against svn revision 52588, in order to make it "well
> rounded."
>
> If the "powers that be" deem this a worthy addition, could you please
> apply this, or some cleaned version of it? This was my first real
> foray into modifying any C code in these large/well-established
> libraries, so ... I did a bit of hunting and pecking, and you'll
> surely have an opinion of how to do it better.
>
> The way it is implemented now, the
> XString_letterFrequencyInSlidingView is no longer being called
> directly and I guess should be removed if this is the appropriate
> style. The XStringSet_letterFrequencyInSlidingView is always delegated
> to, similar to how there is only an XStringSet_letterFrequency,
> and no XString_letterFrequency.
>
> The patch is attached.
>
> If attachments don't come through (can't remember if they get
> stripped), you can also find it here:
> http://cbio.mskcc.org/~lianos/files/bioconductor/biostrings.diff
>
> Thanks,
> -steve
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
>  | Memorial Sloan-Kettering Cancer Center
>  | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact
>



-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
