[R] use sliding window to count substrings found in large string
Gabor Grothendieck
ggrothendieck at gmail.com
Wed Jul 7 18:50:31 CEST 2010
On Wed, Jul 7, 2010 at 12:25 PM, Immanuel <mane.desk at googlemail.com> wrote:
> Hello together,
>
>
> I'm looking for advice on how to do some tests on strings.
> What I want to do is the following:
>
> (just an example; the real strings/sequences are about 200-400 characters long)
> given a set of strings:
>
> String1 abcdefgh
> String2 bcdefgop
>
> use a sliding window of size x to create a vector of all subsequences
> of size x found in the set (order matters!).
>
> Now create, for every string in the set, a vector containing the counts
> of how often each subsequence was found in that particular string.
>
> It would be great if someone could give me a rough outline of how to
> start and which functions to use.
> I did read through the man pages and googled a lot, but I still don't know
> how to approach this.
>
Try this:
# generate a random input string of length n from the letters a-e
set.seed(123)
n <- 300
lets <- paste(sample(letters[1:5], n, replace = TRUE), collapse = "")
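# get rolling k-length sequences and count them:
# substring() is vectorized over its start and end positions, so starts
# 1..(n-k+1) paired with ends k..n give every k-length window;
# table() then counts how often each distinct window occurs
k <- 3
table(substring(lets, 1:(n-k+1), k:n))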
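
The snippet above counts windows within a single string. For the set of strings in the question, one possible extension is sketched below. It is only a sketch under stated assumptions: the helper name kmers and the variables strings, windows, vocab and counts are illustrative names, not part of the original reply, and the two input strings are the examples from the question.

```r
# Hypothetical helper (name is an assumption): all k-length windows of one string
kmers <- function(s, k) {
  n <- nchar(s)
  if (n < k) return(character(0))   # no window fits in a string shorter than k
  substring(s, 1:(n - k + 1), k:n)
}

# Example strings taken from the question
strings <- c(String1 = "abcdefgh", String2 = "bcdefgop")
k <- 3

# Windows per string, then the sorted set of distinct windows across all strings
windows <- lapply(strings, kmers, k = k)
vocab <- sort(unique(unlist(windows)))

# One row per string, one column per subsequence; windows that are absent
# from a string get a count of 0 because tabulate() fills every bin
counts <- t(vapply(
  windows,
  function(w) tabulate(match(w, vocab), nbins = length(vocab)),
  integer(length(vocab))
))
colnames(counts) <- vocab
counts
```

Each row of counts is then the count vector for one string, with columns in the same order for every string, so rows can be compared directly. For strings of a few hundred characters this should be fast enough as written.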