[Rd] error handling in strcapture

William Dunlap wdunlap at tibco.com
Tue Oct 4 23:37:00 CEST 2016


strcapture also does not catch the cases where the number of capture
expressions does not match the number of entries in proto.  I think all of
the following should give an error about the mismatch.

> strcapture("(.)(.)", c("ab", "cde", "fgh", "ij", "lm"), proto=list(A="",B="",C=""))
   A  B  C
1  a  b cd
2  d fg  f
3 ij  i  j
4  l  m ab
Warning message:
In matrix(as.character(unlist(str)), ncol = ntokens, byrow = TRUE) :
  data length [15] is not a sub-multiple or multiple of the number of rows [4]
> strcapture("(.)(.)(.)", c("abc", "def", "ghi", "jkl", "mno"), proto=list(A="",B=""))
    A   B
1   a   b
2 def   d
3   f ghi
4   h   i
5   j   k
6 mno   m
7   o abc
Warning message:
In matrix(as.character(unlist(str)), ncol = ntokens, byrow = TRUE) :
  data length [20] is not a sub-multiple or multiple of the number of rows [7]
> strcapture("(.)(.)(.)", c("abc", "def"), proto=list(A=""))
  A
1 a
2 c
3 d
4 f
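
A check along the following lines could catch the mismatch whenever at least one input string matches. This is only a sketch: check_ncapture() is a hypothetical helper, not part of utils and not the actual fix, and it assumes that regexec() returns one position for the whole match plus one per capture group.

```r
# Hypothetical helper (not part of utils): compare the number of capture
# groups seen in the first successful match with length(proto).
check_ncapture <- function(pattern, x, proto) {
  m <- regexec(pattern, x)
  # A successful match has a non-NA first position that is not -1
  matched <- vapply(m, function(mi) !is.na(mi[1L]) && mi[1L] != -1L,
                    logical(1L))
  if (!any(matched)) {
    # No matches at all: the group count cannot be inferred this way
    return(invisible(NA))
  }
  # Each match holds the whole match followed by one entry per group
  ngroups <- length(m[[which(matched)[1L]]]) - 1L
  if (ngroups != length(proto)) {
    stop(sprintf("pattern has %d capture group(s) but proto has %d entries",
                 ngroups, length(proto)))
  }
  invisible(TRUE)
}

# First example from above: 2 groups against 3 proto entries.
# The error is caught so that its message can be inspected.
msg <- tryCatch(
  check_ncapture("(.)(.)", c("ab", "cde", "fgh", "ij", "lm"),
                 proto = list(A = "", B = "", C = "")),
  error = function(e) conditionMessage(e)
)
```

When nothing matches, the sketch returns NA instead of failing, since the input then gives no way to count the groups.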


Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Tue, Oct 4, 2016 at 2:21 PM, Michael Lawrence <lawrence.michael at gene.com> wrote:

> Hi Bill,
>
> This is a bug in regexec() and I will commit a fix.
>
> Thanks for the report,
> Michael
>
> On Tue, Oct 4, 2016 at 1:40 PM, William Dunlap <wdunlap at tibco.com> wrote:
> > I noticed a problem in the strcapture from R-devel (2016-09-27 r71386), when
> > the text contains a missing value and perl=TRUE.
> >
> > {
> >       # NA in text input should map to row of NA's in output, without warning
> >       r9p <- strcapture(perl = TRUE, "(.).* ([[:digit:]]+)",
> >                         c("One 1", NA, "Fifty 50"),
> >                         data.frame(Initial=factor(), Number=numeric()))
> >       e9p <- structure(list(Initial = structure(c(2L, NA, 1L),
> >                                                 .Label = c("F", "O"),
> >                                                 class = "factor"),
> >                            Number = c(1, NA, 50)),
> >                       row.names = c(NA, -3L),
> >                       class = "data.frame")
> >       all.equal(e9p, r9p)
> >   }
> > #Error in if (any(ind)) { : missing value where TRUE/FALSE needed
> >
> >
> > Bill Dunlap
> > TIBCO Software
> > wdunlap tibco.com
> >
> > On Wed, Sep 21, 2016 at 2:32 PM, Michael Lawrence
> > <lawrence.michael at gene.com> wrote:
> >>
> >> The new behavior is that it yields NAs when the pattern does not match
> >> (like strptime) and for empty captures in a matching pattern it yields
> >> the empty string, which is consistent with regmatches().
> >>
> >> Michael
> >>
> >> On Wed, Sep 21, 2016 at 2:21 PM, William Dunlap <wdunlap at tibco.com> wrote:
> >> > If there are any matches then strcapture can see if the pattern has the
> >> > same number of capture expressions as the prototype has columns and give
> >> > an error if not.  That seems appropriate.
> >> >
> >> > If there are no matches, then there is no easy way to see if the prototype
> >> > is compatible with the pattern, so should strcapture just assume the best
> >> > and fill in the prototype with NA's?
> >> >
> >> > Should there be warnings?  This is kind of like strptime(), which silently
> >> > gives NA's when the format does not match the text input.
> >> >
> >> >
> >> > Bill Dunlap
> >> > TIBCO Software
> >> > wdunlap tibco.com
> >> >
> >> > On Wed, Sep 21, 2016 at 2:10 PM, Michael Lawrence
> >> > <lawrence.michael at gene.com> wrote:
> >> >>
> >> >> Hi Bill,
> >> >>
> >> >> Thanks, another good suggestion. strcapture() now returns NAs for
> >> >> non-matches. It's nice to have someone kicking the tires on that
> >> >> function.
> >> >>
> >> >> Michael
> >> >>
> >> >> On Wed, Sep 21, 2016 at 12:11 PM, William Dunlap via R-devel
> >> >> <r-devel at r-project.org> wrote:
> >> >> > Michael, thanks for looking at my first issue with utils::strcapture.
> >> >> >
> >> >> > Another issue is how it deals with lines that don't match the pattern.
> >> >> > Currently it gives an error:
> >> >> >
> >> >> > > strcapture("(.+) (.+)", c("One 1", "noSpaceInLine", "Three 3"), proto=list(Name="", Number=0))
> >> >> > Error in strcapture("(.+) (.+)", c("One 1", "noSpaceInLine", "Three 3"), :
> >> >> >   number of matches does not always match ncol(proto)
> >> >> >
> >> >> > First, isn't the 'number of matches' the number of parenthesized
> >> >> > subpatterns in the regular expression?  I thought that if the entire
> >> >> > pattern matches then the subpatterns without matches would be
> >> >> > shown as matches at position 0 with length 0.  Hence either the
> >> >> > pattern is compatible with the prototype or it isn't; it does not
> >> >> > depend on the text input.  E.g.,
> >> >> >
> >> >> > > regexec("^(([[:alpha:]]+)|([[:digit:]]+))$", c("Twelve", "12", "Z280"))
> >> >> > [[1]]
> >> >> > [1] 1 1 1 0
> >> >> > attr(,"match.length")
> >> >> > [1] 6 6 6 0
> >> >> > attr(,"useBytes")
> >> >> > [1] TRUE
> >> >> >
> >> >> > [[2]]
> >> >> > [1] 1 1 0 1
> >> >> > attr(,"match.length")
> >> >> > [1] 2 2 0 2
> >> >> > attr(,"useBytes")
> >> >> > [1] TRUE
> >> >> >
> >> >> > [[3]]
> >> >> > [1] -1
> >> >> > attr(,"match.length")
> >> >> > [1] -1
> >> >> > attr(,"useBytes")
> >> >> > [1] TRUE
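
The same point can be seen from the extracted substrings rather than the match positions. The snippet below is only an illustrative sketch: it reuses the pattern and input quoted above, and the variable names (parts, n_pieces) are made up for the example.

```r
x <- c("Twelve", "12", "Z280")
m <- regexec("^(([[:alpha:]]+)|([[:digit:]]+))$", x)

# regmatches() turns the positions into substrings: the whole match
# first, then one element per parenthesized subpattern
parts <- regmatches(x, m)

# Matching inputs all yield the same number of pieces (1 + number of
# groups), whichever alternative matched; a group that did not take part
# shows up as an empty string.  A non-matching input yields no pieces.
n_pieces <- lengths(parts)
```

Here the piece count for a matching input depends only on the pattern, so a single successful match is enough to compare against the prototype.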
> >> >> >
> >> >> > Second, an error message like 'some lines were bad' is not very helpful.
> >> >> > Should it put NA's in all the columns of the current output row if the
> >> >> > input line didn't match the pattern and perhaps warn the user that there
> >> >> > were problems?  The user could then look for rows of NA's to see where
> >> >> > the problems were.
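
Looking for the all-NA rows could be done along these lines. This is a sketch that assumes a version of strcapture() which fills a row with NA's for input that does not match (rather than stopping with an error); the names res and bad are made up for the example, and a data.frame prototype is used in place of the list above.

```r
x <- c("One 1", "noSpaceInLine", "Three 3")
proto <- data.frame(Name = character(), Number = numeric())

# Assumes strcapture() returns a row of NA's for non-matching input
res <- strcapture("(.+) (.+)", x, proto)

# Rows in which every column is NA mark the input lines that failed
bad <- unname(which(rowSums(is.na(res)) == ncol(res)))

if (length(bad) > 0L) {
  message("no match for input line(s): ", paste(bad, collapse = ", "))
}

# The offending input text itself
x[bad]
```

An input line that matches but produces NA in every column for some other reason would also be flagged, so this is a heuristic, not an exact test.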
> >> >> >
> >> >> > Bill Dunlap
> >> >> > TIBCO Software
> >> >> > wdunlap tibco.com
> >> >> >
> >> >> > ______________________________________________
> >> >> > R-devel at r-project.org mailing list
> >> >> > https://stat.ethz.ch/mailman/listinfo/r-devel
> >> >
> >> >
> >
> >
>
