[R] Basic data structures

Mon Aug 11 02:45:23 CEST 2008

Suppose I want to have a regexp match against a string, and return all
the matching substrings in a vector of strings.

   regexp <- "[ab]+"
   strlist <- c( "abc", "dbabddadd", "aaa" )
   matches <- gregexpr(regexp,strlist)

With this input, I'd want to return list( list("ab"), list("bab", "a"),
list("aaa") ).

Now the matches object prints out as

   [[1]]
   [1] 1
   attr(,"match.length")
   [1] 2

   [[2]]
   [1] 2 7
   attr(,"match.length")
   [1] 3 1

   [[3]]
   [1] 1
   attr(,"match.length")
   [1] 3

which, if I'm interpreting this correctly, means that it is a list
(not an atomic vector, because atomic vectors can only hold elements
of one basic type) of three elements, each of which is a vector of
integers (the matching positions) with an attribute match.length (the
lengths of the corresponding matches), which is in turn a vector of
integers.

===
Question: is there a more compact standard print format for this? It's
a bit disconcerting that printing the 2x2 list
list(list(1,2),list(3,4)) takes 16 lines while the corresponding 2x2
array takes 2 lines! (I guess that arrays are "more native").

Here is one way:

> (mat <- t(sapply(matches,function(x)
+        list(start.index=`attributes<-`(x,NULL),
+             match.length=attr(x,"match.length")))))
     start.index match.length
[1,] 1           2           
[2,] Integer,2   Integer,2   
[3,] 1           3  

The result is a 3x2 matrix of mode "list". Each cell holds an integer
vector: single brackets return the cell wrapped in a one-element list,
while double brackets return the vector itself:

> mat[2,1]
$start.index
[1] 2 7

> mat[[2,1]]
[1] 2 7
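
On the question of a more compact print format, str() may also be
worth a try. The sketch below rebuilds the matches object from the
original post and asks for a structural summary instead of the full
print. It also checks the list-versus-vector point: as far as base R
is concerned, a plain list counts as a (generic) vector.

```r
regexp  <- "[ab]+"
strlist <- c("abc", "dbabddadd", "aaa")
matches <- gregexpr(regexp, strlist)

# Compact structural summary: element types, values and attributes
str(matches)

# A plain list is itself a (generic) vector in R
is.list(matches)    # TRUE
is.vector(matches)  # TRUE
```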

also, see below...

===
Now, matches[[1]], the first element of matches, describes the matches
in the first string.  To extract those strings, I can write

   substr( strlist[[1]],
           matches[[1]],
           attr(matches[[1]],"match.length")+matches[[1]]-1 )

which correctly gives "ab".

Question: This looks awfully clumsy; is there some more idiomatic way
to do this, in particular to refer to the match.length attribute
without using a quoted string or the attr function?
attributes(matches[[1]])$match.length and
attributes(matches[[1]])[[1]] work, but seem even clumsier.

Check out the gsubfn package - I'm still learning it myself, but it may
provide the functionality you seek. For instance, I believe what you
are trying to accomplish is 

> library(gsubfn)
> strapply(strlist,regexp,identity)
or 
> strapply(strlist,regexp,c)
[[1]]
[1] "ab"

[[2]]
[1] "bab" "a"  

[[3]]
[1] "aaa"
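
If your R is recent enough to have regmatches() in base (it was added
in a later release than the one this thread was written against, so
treat its availability as an assumption), the same extraction can be
sketched without an add-on package:

```r
regexp  <- "[ab]+"
strlist <- c("abc", "dbabddadd", "aaa")

# regmatches() pairs each string with its gregexpr() match data and
# returns a list holding one character vector per input string
found <- regmatches(strlist, gregexpr(regexp, strlist))

found[[2]]   # "bab" "a"
```

This hides the match.length attribute entirely, which avoids the
quoted-string access that looked clumsy above.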

===
Question: R uses names like xxx.yyy in many places.  Is this just a
convention to represent spaces (the way most languages use "_"), or is
there some semantics attached to "."?

In many examples that I have seen, programmers have used "." in place
of the traditional "_" because "_" used to be an assignment operator
in earlier versions of R. Now "_" is no longer an assignment operator
and is also permitted in variable names.

The "." notation also plays a role in R's implementation of OOP.
R has two object-oriented approaches: S3 and S4. In both approaches,
methods are associated with generic functions rather than with the
object itself (which I understand is similar to Lisp's CLOS). For S3,
the name function.objectclass denotes the method of the generic
"function" that is applied to objects of class "objectclass".

For instance, print() is a generic function:
> print
function (x, ...) 
UseMethod("print")
<environment: namespace:base>

If you want to define a method for a particular class of objects, 
you can use the xxx.yyy syntax:
> print.regexp <- function(x, ...) {   # same arguments as the generic
+   for(i in seq_along(x))
+     cat(i,":", x[[i]], "| match.length =",
+         attr(x[[i]],"match.length"),"\n")
+   invisible(x)                       # print methods return x invisibly
+ }

> class(matches) <- "regexp"
> print(matches)
1 : 1 | match.length = 2 
2 : 2 7 | match.length = 3 1 
3 : 1 | match.length = 3 

You can assign a class (or classes) to each object; this information
is used to decide which method a generic function dispatches to. S3
does not check that an object's class is consistent with its
attributes; S4 is a more formal implementation of OOP in R.
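
To make the dispatch rule and the lack of checking concrete, here is
a small sketch. The generic describe() and its methods are made-up
names for illustration only and are not part of R or any package.

```r
# A hypothetical generic: UseMethod() picks describe.<class> from the
# class of the first argument and falls back to describe.default
describe <- function(x, ...) UseMethod("describe")
describe.default <- function(x, ...) "something else"
describe.regexp  <- function(x, ...) "a set of regexp matches"

obj <- structure(list(), class = "regexp")
describe(obj)   # dispatches to describe.regexp
describe(42)    # no describe.numeric, so describe.default is used

# S3 does no validation: any object can be labelled "regexp"
bogus <- structure("not match data", class = "regexp")
describe(bogus) # still dispatches to describe.regexp
```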

Check out
(S3)
http://www-128.ibm.com/developerworks/linux/library/l-r3.html
(S4)
http://developer.r-project.org/howMethodsWork.pdf

The first reference also mentions how to implement infinite sequences
in R - which may answer part of your question below.

===
Question: Is it good practice in R to treat a string as a vector of
characters so that R's powerful vector operations can be used on it?
How would I do that?

I'm sure it can be done by defining your own objects and methods, but
it's not done out-of-the-box (that I'm aware of). I believe the most
common string operations used by R users are extraction and
concatenation; these are effectively achieved by substr(), substring()
and paste(), rather than "[", c(), or "+", as you seem to have figured
out. In my experience, R's standard objects and functions for
string-like objects are immediately convenient for manipulating file
and variable names but not necessarily for hard-core text processing.
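
That said, one low-tech way to get the vector operations is to split
the string into one-character elements first and paste it back
together afterwards. A minimal sketch, using one of the strings from
the original post:

```r
s <- "dbabddadd"

# Split into a character vector with one element per character
chars <- strsplit(s, "")[[1]]

# Ordinary vector operations now apply
positions <- which(chars %in% c("a", "b"))    # 2 3 4 7
reversed  <- paste(rev(chars), collapse = "") # "ddaddbabd"
counts    <- table(chars)                     # frequency of each character
```

Whether this counts as good practice probably depends on the job; for
pattern matching the regexp functions are likely the better fit.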

===
Now suppose I want to list *all* the matches in matches[[2]].  I try:

   substr( strlist[[2]],
           matches[[2]],
           attr(matches[[2]],"match.length")+matches[[2]]-1 )

but only get the first one, so it seems that the recycling rule for
vectors doesn't apply here (same thing with [2] instead of [[2]]).
Where does recycling apply and not apply?

I don't know if there's a hard rule for that (though I usually expect
that recycling works for mathematical operators and plotting
functions), but in this case I hope the strapply() function above will
solve your problem. Otherwise, an inelegant way would be to use Map()
or mapply():

> mapply(function(x,y) substr( strlist[[2]],x,y),
+     matches[[2]],
+     attr(matches[[2]],"match.length")+matches[[2]]-1)
[1] "bab" "a" 
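
Another base-R option is substring(). As I read the help page,
substr() returns a result of the same length as its first argument,
which is why only the first match comes back, while substring()
recycles its arguments to the length of the longest one. A sketch of
the difference, reusing the objects from the original post:

```r
regexp  <- "[ab]+"
strlist <- c("abc", "dbabddadd", "aaa")
matches <- gregexpr(regexp, strlist)

first <- matches[[2]]                            # start positions: 2 7
last  <- first + attr(first, "match.length") - 1 # end positions:   4 7

# substr(): result is as long as the text argument (one string here)
one_match <- substr(strlist[[2]], first, last)      # "bab"

# substring(): the single string is recycled over both position pairs
all_matches <- substring(strlist[[2]], first, last) # "bab" "a"
```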

===
Question: Is there some operator (using promises?) to make
strlist[[2]] into a (lazy) infinite vector/list?

Like an iterator? There is some mention of infinite sequences in the
IBM DeveloperWorks article above, but I've personally never tried
implementing one in R.
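
For what it's worth, a closure can stand in for a lazy, endlessly
recycled vector. This is only a rough sketch: make_cycler() and
next_char() are names invented here, and it produces one element per
call rather than behaving like a real vector.

```r
# Hypothetical helper: returns a function that hands out the elements
# of x one at a time, wrapping around to the start when it runs out
make_cycler <- function(x) {
  i <- 0L
  function() {
    i <<- i %% length(x) + 1L   # advance the position, wrapping at the end
    x[[i]]
  }
}

next_char <- make_cycler(strsplit("dbabddadd", "")[[1]])

# Pull 11 elements from a 9-character string; the last two wrap around
first_eleven <- vapply(seq_len(11), function(k) next_char(), character(1))
paste(first_eleven, collapse = "")   # "dbabddadddb"
```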

Hope this helps,
Satoshi