[R] use sliding window to count substrings found in large string

Gabor Grothendieck ggrothendieck at gmail.com
Wed Jul 7 18:50:31 CEST 2010


On Wed, Jul 7, 2010 at 12:25 PM, Immanuel <mane.desk at googlemail.com> wrote:
> Hello everyone,
>
>
> I'm looking for advice on how to do some tests on strings.
> What I want to do is the following:
>
> (just an example; the real strings/sequences are about 200-400 characters long)
> Given a set of strings:
>
> String1 abcdefgh
> String2 bcdefgop
>
> use a sliding window of size x to create a vector of all subsequences
> of size x found in the set (order matters!).
>
> Now create, for every string in the set, a vector containing the counts
> of how often each subsequence was found in that particular string.
>
> It would be great if someone could give me a rough outline of how to
> start and which functions to use.
> I read through the man pages and googled a lot, but still don't know
> how to approach this.
>

Try this:

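# generate a random input string of length n over the letters a-e
set.seed(123)
n <- 300
lets <- paste(sample(letters[1:5], n, replace = TRUE), collapse = "")

# extract every k-length window (start positions 1..n-k+1, end positions k..n;
# substring() is vectorized over both) and count how often each one occurs
k <- 3
table(substring(lets, 1:(n-k+1), k:n))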
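The code above counts windows in a single string. For the set of strings in the question, one possible extension is sketched below. It is only a sketch: the helper name kmers and the objects windows, vocab and counts are made up for this illustration and are not part of the original reply. It builds the common vector of subsequences across the whole set, then counts each one per string, so subsequences absent from a string get a zero.

```r
# Hypothetical helper (name chosen for this sketch): all k-length windows of one string
kmers <- function(s, k) {
  n <- nchar(s)
  if (n < k) return(character(0))   # guard: 1:(n-k+1) would count downwards
  substring(s, 1:(n - k + 1), k:n)
}

# the two example strings from the question
strings <- c(String1 = "abcdefgh", String2 = "bcdefgop")
k <- 3

# windows per string, then every distinct subsequence found anywhere in the set
windows <- lapply(strings, kmers, k = k)
vocab <- sort(unique(unlist(windows)))

# one row per string, one column per subsequence;
# match() + tabulate() keeps zero counts for subsequences a string does not contain
counts <- t(sapply(windows, function(w)
  tabulate(match(w, vocab), nbins = length(vocab))))
colnames(counts) <- vocab
counts
```

Each row of counts is the count vector for one string, aligned to the same columns, so rows can be compared directly. This assumes the set yields at least two distinct subsequences; with only one, sapply() returns a plain vector and the result would need reshaping.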


