[R] Basic data structures
Gabor Grothendieck
ggrothendieck at gmail.com
Mon Aug 11 02:03:15 CEST 2008
Try this:
regexp <- "[ab]+"
strlist <- c( "abc", "dbabddadd", "aaa" )
library(gsubfn)
s <- strapply(strlist, regexp)
s
# compactly show the first few elements of each component
str(s)
See gsubfn home page at
http://gsubfn.googlecode.com
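
For comparison, here is a minimal base R sketch of the same extraction. It assumes an R version in which regmatches() is available in base; if it is not available in your installation, the strapply call above is the one to use. The name s2 is only for illustration.

```r
# Base R sketch; assumes regmatches() exists in the installed R.
regexp <- "[ab]+"
strlist <- c("abc", "dbabddadd", "aaa")

# gregexpr() records where every match starts and how long it is;
# regmatches() uses that information to pull out the matched text.
m <- gregexpr(regexp, strlist)
s2 <- regmatches(strlist, m)

# One character vector of matches per input string.
str(s2)
```

The result is a list with one character vector per input string, which is the same shape that strapply returns here.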
On Sun, Aug 10, 2008 at 5:00 PM, Stavros Macrakis <macrakis at alum.mit.edu> wrote:
> I'm new to R and very excited about its possibilities. But I'm
> struggling with some very simple things, probably because I haven't
> found the correct documentation. Here's a simple example which
> illustrates several of my problems.
>
> Suppose I want to have a regexp match against a string, and return all
> the matching substrings in a vector of strings.
>
> regexp <- "[ab]+"
> strlist <- c( "abc", "dbabddadd", "aaa" )
> matches <- gregexpr(regexp,strlist)
>
> With this input, I'd want to return list( list("ab"), list("bab", "a"),
> list("aaa") ).
>
> Now the matches object prints out as
>
> [[1]]
> [1] 1
> attr(,"match.length")
> [1] 2
>
> [[2]]
> [1] 2 7
> attr(,"match.length")
> [1] 3 1
>
> [[3]]
> [1] 1
> attr(,"match.length")
> [1] 3
>
> which, if I'm interpreting this correctly, means that it is a list
> (not a vector, because vectors can only have atomic elements) of three
> elements, each of which is a vector of integers (the matching
> positions) with an attribute match.length (the length of the
> corresponding match), which is in turn a vector of integers.
>
> Question: is there a more compact standard print format for this? It's
> a bit disconcerting that printing the 2x2 list
> list(list(1,2),list(3,4)) takes 16 lines while the corresponding 2x2
> array takes 2 lines! (I guess that arrays are "more native").
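
On the print format: str() is one compact option for nested lists. A small sketch that compares how many lines print() and str() produce for the list in question (the variable names are only for illustration):

```r
x <- list(list(1, 2), list(3, 4))

# Capture what each function would write to the console.
print_lines <- capture.output(print(x))
str_lines <- capture.output(str(x))

# str() gives a nested outline, one line per element or sub-list.
str(x)

# Compare the number of output lines.
length(print_lines)
length(str_lines)
```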
>
> Now, matches[[1]], the first element of matches, describes the matches
> in the first string. To extract those strings, I can write
>
> substr( strlist[[1]],
> matches[[1]],
> attr(matches[[1]],"match.length")+matches[[1]]-1 )
>
> which correctly gives "ab".
>
> Question: This looks awfully clumsy; is there some more idiomatic way
> to do this, in particular to refer to the match.length attribute
> without using a quoted string or the attr function?
> attributes(matches[[1]])$match.length and
> attributes(matches[[1]])[[1]] work, but seem even clumsier.
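
As far as I know, attr(x, "name") is the usual way to read an attribute, so one way to make the call site less clumsy is to wrap it in a small accessor. A sketch, where match_length() is a hypothetical helper and not a base R function:

```r
regexp <- "[ab]+"
strlist <- c("abc", "dbabddadd", "aaa")
matches <- gregexpr(regexp, strlist)

# match_length() is a hypothetical helper, not part of base R.
match_length <- function(m) attr(m, "match.length")

# The extraction for the first string, written with the helper.
m1 <- matches[[1]]
first <- substr(strlist[[1]], m1, m1 + match_length(m1) - 1)
first
```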
>
> Question: R uses names like xxx.yyy in many places. Is this just a
> convention to represent spaces (the way most languages use "_"), or is
> there some semantics attached to "."?
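
On the "." question: in most names the dot is just another character, but S3 method dispatch does give it meaning, because a call to a generic looks for a function named generic.class. A minimal sketch, in which the generic describe() and the class "temperature" are made up for illustration:

```r
# A generic: UseMethod() dispatches on the class of x.
describe <- function(x, ...) UseMethod("describe")

# Fallback used when no class-specific method exists.
describe.default <- function(x, ...) "something else"

# Method chosen for objects whose class is "temperature".
describe.temperature <- function(x, ...) paste(unclass(x), "degrees")

t1 <- structure(21.5, class = "temperature")
describe(t1)  # dispatches to describe.temperature
describe(1)   # dispatches to describe.default
```

Names such as match.length or data.frame are not methods in this sense; there the dot is only a word separator.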
>
> Question: Is it good practice in R to treat a string as a vector of
> characters so that R's powerful vector operations can be used on it?
> How would I do that?
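
On treating a string as a vector of characters: strsplit() with an empty split pattern breaks a string into single-character strings, and paste() with collapse puts them back together. A short sketch (variable names are only for illustration):

```r
s <- "dbabddadd"

# strsplit() returns a list with one element per input string,
# so take the first element to get the character vector.
chars <- strsplit(s, "")[[1]]

# Vector operations now work element by element.
is_ab <- chars %in% c("a", "b")
sum(is_ab)  # how many characters are "a" or "b"

# Reverse the characters and join them back into one string.
rejoined <- paste(rev(chars), collapse = "")
rejoined
```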
>
> Now suppose I want to list *all* the matches in matches[[2]]. I try:
>
> substr( strlist[[2]],
> matches[[2]],
> attr(matches[[2]],"match.length")+matches[[2]]-1 )
>
> but only get the first one, so it seems that the recycling rule for
> vectors doesn't apply here (same thing with [2] instead of [[2]]).
> Where does recycling apply and not apply?
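
On recycling here: substr() returns a result the same length as its first argument, so a single string gives a single substring. substring() recycles the string to the length of the start and stop vectors. A sketch of the difference (variable names are only for illustration):

```r
s <- "dbabddadd"
m <- gregexpr("[ab]+", s)[[1]]

# Plain integer vectors of start and stop positions.
start <- as.vector(m)
stop_pos <- start + attr(m, "match.length") - 1

# substr(): result has length(s) == 1, so only the first match.
one <- substr(s, start, stop_pos)

# substring(): s is recycled over both positions, so every match.
every <- substring(s, start, stop_pos)

one
every
```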
>
> Question: Is there some operator (using promises?) to make
> strlist[[2]] into a (lazy) infinite vector/list?
>
> Now suppose I want to list *all* the matches in all the strings. How
> would I do that? The naive way, substr(strlist,matches, ...) doesn't
> work, partly because the attr operator doesn't distribute over lists
> (I see why it can't, but...).
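
To do this over all the strings with the gregexpr result directly, one option is to walk the strings and the match objects in parallel with mapply(). A sketch, where extract() is a hypothetical helper; it assumes every string has at least one match (gregexpr reports -1 otherwise, which this helper would turn into an empty string rather than no match):

```r
regexp <- "[ab]+"
strlist <- c("abc", "dbabddadd", "aaa")
matches <- gregexpr(regexp, strlist)

# extract() is a hypothetical helper: all matched substrings of one string.
extract <- function(string, m) {
  substring(string, m, m + attr(m, "match.length") - 1)
}

# Pair each string with its own match object; keep the result as a list.
all_matches <- mapply(extract, strlist, matches,
                      SIMPLIFY = FALSE, USE.NAMES = FALSE)
str(all_matches)
```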
>
> Thanks in advance for your patience with these very elementary questions,
>
> -s
>
> Stavros Macrakis, Cambridge, MA