[Rd] error handling in strcapture

Michael Lawrence lawrence.michael at gene.com
Tue Oct 4 23:21:50 CEST 2016


Hi Bill,

This is a bug in regexec() and I will commit a fix.

Thanks for the report,
Michael
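
For anyone following along, a rough sketch of a check for the reported case: on a build that includes the fix, the perl = TRUE and default engines should agree on input containing NA. The character prototype and the variable names below are assumptions made for illustration; this is not the committed fix or its regression test.

```r
# Input from the report below: the middle element is a missing value.
x <- c("One 1", NA, "Fifty 50")
pattern <- "(.).* ([[:digit:]]+)"

# Illustrative prototype; the report used factor() for the first column.
proto <- data.frame(Initial = character(), Number = numeric())

tre  <- strcapture(pattern, x, proto)               # default regex engine
pcre <- strcapture(pattern, x, proto, perl = TRUE)  # the case that failed in r71386

# On a fixed build both calls should return the same data frame,
# with an all-NA second row.
print(identical(tre, pcre))
print(pcre)
```

On an affected build the perl = TRUE call stops with the error shown in the report instead of returning a data frame.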

On Tue, Oct 4, 2016 at 1:40 PM, William Dunlap <wdunlap at tibco.com> wrote:
> I noticed a problem in strcapture() in R-devel (2016-09-27 r71386) when
> the text contains a missing value and perl=TRUE.
>
> {
>       # NA in text input should map to row of NA's in output, without warning
>       r9p <- strcapture(perl = TRUE, "(.).* ([[:digit:]]+)",
>                         c("One 1", NA, "Fifty 50"),
>                         data.frame(Initial=factor(), Number=numeric()))
>       e9p <- structure(list(Initial = structure(c(2L, NA, 1L),
>                                                 .Label = c("F", "O"),
>                                                 class = "factor"),
>                             Number = c(1, NA, 50)),
>                        row.names = c(NA, -3L),
>                        class = "data.frame")
>       all.equal(e9p, r9p)
>   }
> #Error in if (any(ind)) { : missing value where TRUE/FALSE needed
>
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
>
> On Wed, Sep 21, 2016 at 2:32 PM, Michael Lawrence
> <lawrence.michael at gene.com> wrote:
>>
>> The new behavior is that it yields NAs when the pattern does not match
>> (like strptime()), and for empty captures in a matching pattern it yields
>> the empty string, which is consistent with regmatches().
>>
>> Michael
>>
>> On Wed, Sep 21, 2016 at 2:21 PM, William Dunlap <wdunlap at tibco.com> wrote:
>> > If there are any matches then strcapture can see if the pattern has the
>> > same number of capture expressions as the prototype has columns and give
>> > an error if not.  That seems appropriate.
>> >
>> > If there are no matches, then there is no easy way to see if the
>> > prototype is compatible with the pattern, so should strcapture just
>> > assume the best and fill in the prototype with NA's?
>> >
>> > Should there be warnings?  This is kind of like strptime(), which
>> > silently gives NA's when the format does not match the text input.
>> >
>> >
>> > Bill Dunlap
>> > TIBCO Software
>> > wdunlap tibco.com
>> >
>> > On Wed, Sep 21, 2016 at 2:10 PM, Michael Lawrence
>> > <lawrence.michael at gene.com> wrote:
>> >>
>> >> Hi Bill,
>> >>
>> >> Thanks, another good suggestion. strcapture() now returns NAs for
>> >> non-matches. It's nice to have someone kicking the tires on that
>> >> function.
>> >>
>> >> Michael
>> >>
>> >> On Wed, Sep 21, 2016 at 12:11 PM, William Dunlap via R-devel
>> >> <r-devel at r-project.org> wrote:
>> >> > Michael, thanks for looking at my first issue with utils::strcapture.
>> >> >
>> >> > Another issue is how it deals with lines that don't match the pattern.
>> >> > Currently it gives an error
>> >> >
>> >> >> strcapture("(.+) (.+)", c("One 1", "noSpaceInLine", "Three 3"),
>> >> >            proto=list(Name="", Number=0))
>> >> > Error in strcapture("(.+) (.+)", c("One 1", "noSpaceInLine", "Three 3"),  :
>> >> >   number of matches does not always match ncol(proto)
>> >> >
>> >> > First, isn't the 'number of matches' the number of parenthesized
>> >> > subpatterns in the regular expression?  I thought that if the entire
>> >> > pattern matches then the subpatterns without matches would be
>> >> > shown as matches at position 0 with length 0.  Hence either the
>> >> > pattern is compatible with the prototype or it isn't; it does not
>> >> > depend on the text input.  E.g.,
>> >> >
>> >> >> regexec("^(([[:alpha:]]+)|([[:digit:]]+))$", c("Twelve", "12", "Z280"))
>> >> > [[1]]
>> >> > [1] 1 1 1 0
>> >> > attr(,"match.length")
>> >> > [1] 6 6 6 0
>> >> > attr(,"useBytes")
>> >> > [1] TRUE
>> >> >
>> >> > [[2]]
>> >> > [1] 1 1 0 1
>> >> > attr(,"match.length")
>> >> > [1] 2 2 0 2
>> >> > attr(,"useBytes")
>> >> > [1] TRUE
>> >> >
>> >> > [[3]]
>> >> > [1] -1
>> >> > attr(,"match.length")
>> >> > [1] -1
>> >> > attr(,"useBytes")
>> >> > [1] TRUE
>> >> >
>> >> > Second, an error message like 'some lines were bad' is not very helpful.
>> >> > Should it put NA's in all the columns of the current output row if the
>> >> > input line didn't match the pattern and perhaps warn the user that there
>> >> > were problems?  The user could then look for rows of NA's to see where
>> >> > the problems were.
>> >> >
>> >> > Bill Dunlap
>> >> > TIBCO Software
>> >> > wdunlap tibco.com
>> >> >
>> >> > ______________________________________________
>> >> > R-devel at r-project.org mailing list
>> >> > https://stat.ethz.ch/mailman/listinfo/r-devel
>> >
>> >
>
>
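
The behavior described in the thread (a row of NAs when the pattern does not match, and an empty string for an empty capture in a matching pattern) can be sketched as follows. This assumes a build of R where strcapture() already returns NAs for non-matches; the prototypes and column names are illustrative only.

```r
# Prototype with one character and one numeric column (names are illustrative).
proto <- data.frame(Name = character(), Number = numeric())

# "noSpaceInLine" does not match the pattern at all, so its row should be all NA.
res <- strcapture("(.+) (.+)", c("One 1", "noSpaceInLine", "Three 3"), proto)
print(res)

# "(a*)" may match zero characters: for "b" the pattern matches with an empty
# first capture, while "c" does not match at all.
empty <- strcapture("(a*)(b)", c("aab", "b", "c"),
                    data.frame(A = character(), B = character()))
print(empty)
```

Rows where every column is NA mark the inputs that did not match, so something like which(is.na(res$Number)) is one way to locate them.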


