[R] Basic data structures

Stavros Macrakis macrakis at alum.mit.edu
Sun Aug 10 23:00:07 CEST 2008


I'm new to R and very excited about its possibilities.  But I'm
struggling with some very simple things, probably because I haven't
found the correct documentation.  Here's a simple example which
illustrates several of my problems.

Suppose I want to have a regexp match against a string, and return all
the matching substrings in a vector of strings.

   regexp <- "[ab]+"
   strlist <- c( "abc", "dbabddadd", "aaa" )
   matches <- gregexpr(regexp,strlist)

With this input, I'd want to return list( list("ab"), list("bab", "a"),
list("aaa") ).

Now the matches object prints out as

   [[1]]
   [1] 1
   attr(,"match.length")
   [1] 2

   [[2]]
   [1] 2 7
   attr(,"match.length")
   [1] 3 1

   [[3]]
   [1] 1
   attr(,"match.length")
   [1] 3

which, if I'm interpreting this correctly, means that it is a list
(not an atomic vector, because those can only hold atomic elements) of
three elements, each of which is a vector of integers (the matching
positions) with an attribute match.length (the lengths of the
corresponding matches), which is in turn a vector of integers.
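
A minimal sketch for checking that interpretation, using only base R
functions (is.list, is.integer, attr). The variable names repeat those
from the example above so that the block runs on its own:

```r
regexp  <- "[ab]+"
strlist <- c("abc", "dbabddadd", "aaa")
matches <- gregexpr(regexp, strlist)

is.list(matches)                     # is the result a list?
length(matches)                      # one element per input string
is.integer(matches[[2]])             # match positions are stored as integers
attr(matches[[2]], "match.length")   # lengths of the matches in the 2nd string
```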

Question: is there a more compact standard print format for this? It's
a bit disconcerting that printing the 2x2 list
list(list(1,2),list(3,4)) takes 16 lines while the corresponding 2x2
array takes 2 lines! (I guess that arrays are "more native").
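
For comparison, str() is a base R function that prints a condensed
summary of an object's structure. Whether it counts as the "standard"
compact format asked about here is an assumption. A small sketch that
compares how many lines each form produces:

```r
nested <- list(list(1, 2), list(3, 4))

# capture.output() collects what would have been printed, one string per line
printed_lines <- capture.output(print(nested))
str_lines     <- capture.output(str(nested))

length(printed_lines)   # lines used by the default print method
length(str_lines)       # lines used by str()
str(nested)             # show the condensed form itself
```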

Now, matches[[1]], the first element of matches, describes the matches
in the first string.  To extract those strings, I can write

   substr( strlist[[1]],
           matches[[1]],
           attr(matches[[1]],"match.length")+matches[[1]]-1 )

which correctly gives "ab".

Question: This looks awfully clumsy; is there some more idiomatic way
to do this, in particular to refer to the match.length attribute
without using a quoted string or the attr function?
attributes(matches[[1]])$match.length and
attributes(matches[[1]])[[1]] work, but seem even clumsier.

Question: R uses names like xxx.yyy in many places.  Is this just a
convention to represent spaces (the way most languages use "_"), or is
there some semantics attached to "."?

Question: Is it good practice in R to treat a string as a vector of
characters so that R's powerful vector operations can be used on it?
How would I do that?
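
One way to experiment with this, offered as a sketch rather than as
established good practice: strsplit() with an empty split pattern
breaks a string into a character vector of one-character strings, which
the usual vector operations then accept.

```r
s <- "dbabddadd"

# strsplit() returns a list with one element per input string; take the first
chars <- strsplit(s, "")[[1]]

length(chars)                        # 9 single-character strings
which(chars %in% c("a", "b"))        # positions holding "a" or "b": 2 3 4 7
paste(rev(chars), collapse = "")     # reassemble in reverse: "ddaddbabd"
```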

Now suppose I want to list *all* the matches in matches[[2]].  I try:

   substr( strlist[[2]],
           matches[[2]],
           attr(matches[[2]],"match.length")+matches[[2]]-1 )

but only get the first one, so it seems that the recycling rule for
vectors doesn't apply here (same thing with [2] instead of [[2]]).
Where does recycling apply and not apply?
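
The behaviour described above can be reproduced in isolation. The
sketch below also calls substring(), a separate base R function not
used in the original example, whose result is as long as its longest
argument. Treating it as the remedy here is a suggestion, not a claim
about the intended idiom.

```r
s      <- "dbabddadd"
starts <- c(2L, 7L)                 # match positions reported for this string
ends   <- starts + c(3L, 1L) - 1L   # start + match.length - 1

substr(s, starts, ends)      # result has the length of s: "bab"
substring(s, starts, ends)   # recycles s over both ranges: "bab" "a"
```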

Question: Is there some operator (using promises?) to make
strlist[[2]] into a (lazy) infinite vector/list?

Now suppose I want to list *all* the matches in all the strings.  How
would I do that?  The naive way, substr(strlist,matches, ...) doesn't
work, partly because the attr function doesn't distribute over lists
(I see why it can't, but...).
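
A sketch of one possible approach, under the assumption that walking
over the strings and their match data in parallel is acceptable:
mapply() with SIMPLIFY = FALSE pairs each string with its element of
matches, and substring() extracts every match. The helper name
extract_matches is hypothetical, and the result is a list of character
vectors rather than a list of lists.

```r
regexp  <- "[ab]+"
strlist <- c("abc", "dbabddadd", "aaa")
matches <- gregexpr(regexp, strlist)

# Hypothetical helper: all substrings of `s` described by match data `m`
extract_matches <- function(s, m) {
  starts <- as.integer(m)                      # drop attributes, keep positions
  if (starts[1] == -1L) return(character(0))   # gregexpr reports -1 for no match
  lens <- attr(m, "match.length")
  substring(s, starts, starts + lens - 1L)
}

# Apply the helper to each (string, match data) pair without simplifying
result <- mapply(extract_matches, strlist, matches,
                 SIMPLIFY = FALSE, USE.NAMES = FALSE)
result
```

R releases from 2.14.0 onward also provide regmatches() in base, which
takes the same two arguments, as in regmatches(strlist, matches).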

Thanks in advance for your patience with these very elementary questions,

      -s

Stavros Macrakis, Cambridge, MA


