[R] use sliding window to count substrings found in large string

David Winsemius dwinsemius at comcast.net
Wed Jul 7 21:24:50 CEST 2010


On Jul 7, 2010, at 1:26 PM, Gabor Grothendieck wrote:

> On Wed, Jul 7, 2010 at 1:25 PM, Gabor Grothendieck
> <ggrothendieck at gmail.com> wrote:
>> On Wed, Jul 7, 2010 at 1:15 PM, Immanuel <mane.desk at googlemail.com> wrote:
>>> Hey,
>>>
>>> big help, thanks!
>>> One little question remains, if I create
>>> more than one string and table ...
>>> ---------------------
>>>
>>> # generate an input string n long
>>> set.seed(123)
>>> n <- 300
>>> lets_1 <- paste(sample(letters[1:5], n, replace = TRUE), collapse = "")
>>> lets_2 <- paste(sample(letters[1:5], n, replace = TRUE), collapse = "")
>>>
>>>
>>> # get rolling k-length sequences and count
>>> k <- 3
>>> table_1 <-table(substring(lets_1, 1:(n-k+1), k:n))
>>> table_2 <-table(substring(lets_2, 1:(n-k+1), k:n))
>>> -----------------------
>>>
>>> is it possible to manipulate table_1 so that it contains zero entries
>>> for all the substrings found in table_2 but not in table_1?
>>>
>>> best regards
>>> Immanuel
>>>
>>
>> Turn them into factors with the appropriate levels before counting
>> them with table:
>>
>> # generate an input string n long
>> set.seed(123)
>> n <- 300
>> lets_1 <- paste(sample(letters[1:5], n, replace = TRUE), collapse = "")
>> lets_2 <- paste(sample(letters[1:5], n, replace = TRUE), collapse = "")
>>
>> # get rolling k-length sequences and count
>> k <- 3
>> s1 <- substring(lets_1, 1:(n-k+1), k:n)
>> s2 <- substring(lets_2, 1:(n-k+1), k:n)
>> levs <- sort(unique(union(s1, s2)))
>> table(factors(s1, levs))
>> table(factors(s2, levs))
>>
>
> That should be factor, not factors:
>
> table(factor(s1, levs))
> table(factor(s2, levs))

That approach has many advantages and is surely the preferred one. Mine
has only the advantage that it (slavishly) executes the OP's
instructions and illustrates that named vectors can be appended to
one-dimensional table objects (since they are basically 1-d arrays).

 > "%w/o%" <- function(x,y) x[!x %in% y] #--  x without y  ... from the help page for match
 > extras <- rep(0, length(names(table_1) %w/o% ft2)  )
 > names(extras) <-names(table_1) %w/o% ft2
 > extras
aaa ace bcc cab cbd dba dbc dee ede
   0   0   0   0   0   0   0   0   0
 > t3 <- c(table_2, extras)
 > t3
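
For anyone who wants to run the idea end to end, here is a minimal
self-contained sketch of the same padding trick, applied in the
direction the OP asked for (zeros added to table_1 for substrings that
only occur in table_2). It is only an illustration under some
assumptions: it uses setdiff() in place of %w/o%, and the names
missing_1, padded_1 and via_factor are made up for this example. Note
that c() returns a plain named vector rather than a table object.

```r
# Sketch: pad table_1 with zero counts for substrings seen only in lets_2.
# missing_1, padded_1 and via_factor are illustrative names, not from the thread.
set.seed(123)
n <- 300
k <- 3
lets_1 <- paste(sample(letters[1:5], n, replace = TRUE), collapse = "")
lets_2 <- paste(sample(letters[1:5], n, replace = TRUE), collapse = "")

# rolling k-length substrings and their counts
s1 <- substring(lets_1, 1:(n - k + 1), k:n)
s2 <- substring(lets_2, 1:(n - k + 1), k:n)
table_1 <- table(s1)
table_2 <- table(s2)

# substrings present in table_2 but absent from table_1
missing_1 <- setdiff(names(table_2), names(table_1))

# named vector of zeros, appended to the 1-d table with c()
extras <- setNames(rep(0L, length(missing_1)), missing_1)
padded_1 <- c(table_1, extras)
padded_1 <- padded_1[order(names(padded_1))]

# cross-check against the factor-based approach
levs <- sort(union(s1, s2))
via_factor <- table(factor(s1, levels = levs))
```

The cross-check at the end should give the same counts under the same
names as padded_1, which is one more reason to prefer the factor
approach: it gets there in a single step and keeps the table class.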

-- 
David Winsemius, MD
West Hartford, CT


