[Rd] error handling in strcapture

Michael Lawrence lawrence.michael at gene.com
Wed Sep 21 23:32:45 CEST 2016


With the new behavior, strcapture() yields NAs when the pattern does not
match (like strptime()). For empty captures in a matching pattern it yields
the empty string, which is consistent with regmatches().
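
A rough sketch of both cases, assuming R >= 3.4.0 (where utils::strcapture()
is available). The data frame prototypes and the column names whole, alpha
and digit are illustrative choices, not taken from the calls quoted below:

```r
# Sketch only: assumes R >= 3.4.0, where utils::strcapture() exists.
x <- c("One 1", "noSpaceInLine", "Three 3")

# The non-matching element ("noSpaceInLine") gives a row of NAs, not an error.
res <- strcapture("(.+) (.+)", x,
                  proto = data.frame(Name = character(), Number = numeric(),
                                     stringsAsFactors = FALSE))
print(res)

# Rows that failed to match can be located by looking for all-NA rows.
bad <- which(rowSums(!is.na(res)) == 0)

# A group that takes no part in an otherwise successful match gives "".
# The column names here are hypothetical, chosen for illustration.
y <- c("Twelve", "12", "Z280")
res2 <- strcapture("^(([[:alpha:]]+)|([[:digit:]]+))$", y,
                   proto = data.frame(whole = character(),
                                      alpha = character(),
                                      digit = character(),
                                      stringsAsFactors = FALSE))
print(res2)
```

An all-NA row means the whole pattern failed on that input, while "" means
the pattern matched but that particular group captured nothing.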

Michael

On Wed, Sep 21, 2016 at 2:21 PM, William Dunlap <wdunlap at tibco.com> wrote:
> If there are any matches then strcapture can see if the pattern has the same
> number of capture expressions as the prototype has columns and give an
> error if not.  That seems appropriate.
>
> If there are no matches, then there is no easy way to see if the prototype
> is compatible with the pattern, so should strcapture just assume the best
> and fill in the prototype with NA's?
>
> Should there be warnings?  This is kind of like strptime(), which silently
> gives NA's when the format does not match the text input.
>
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
>
> On Wed, Sep 21, 2016 at 2:10 PM, Michael Lawrence
> <lawrence.michael at gene.com> wrote:
>>
>> Hi Bill,
>>
>> Thanks, another good suggestion. strcapture() now returns NAs for
>> non-matches. It's nice to have someone kicking the tires on that
>> function.
>>
>> Michael
>>
>> On Wed, Sep 21, 2016 at 12:11 PM, William Dunlap via R-devel
>> <r-devel at r-project.org> wrote:
>> > Michael, thanks for looking at my first issue with utils::strcapture.
>> >
>> > Another issue is how it deals with lines that don't match the pattern.
>> > Currently it gives an error
>> >
>> >> strcapture("(.+) (.+)", c("One 1", "noSpaceInLine", "Three 3"),
>> > proto=list(Name="", Number=0))
>> > Error in strcapture("(.+) (.+)", c("One 1", "noSpaceInLine", "Three 3"),  :
>> >   number of matches does not always match ncol(proto)
>> >
>> > First, isn't the 'number of matches' the number of parenthesized
>> > subpatterns in the regular expression?  I thought that if the entire
>> > pattern matches then the subpatterns without matches would be
>> > shown as matches at position 0 with length 0.  Hence either the
>> > pattern is compatible with the prototype or it isn't, it does not depend
>> > on the text input.  E.g.,
>> >
>> >> regexec("^(([[:alpha:]]+)|([[:digit:]]+))$", c("Twelve", "12", "Z280"))
>> > [[1]]
>> > [1] 1 1 1 0
>> > attr(,"match.length")
>> > [1] 6 6 6 0
>> > attr(,"useBytes")
>> > [1] TRUE
>> >
>> > [[2]]
>> > [1] 1 1 0 1
>> > attr(,"match.length")
>> > [1] 2 2 0 2
>> > attr(,"useBytes")
>> > [1] TRUE
>> >
>> > [[3]]
>> > [1] -1
>> > attr(,"match.length")
>> > [1] -1
>> > attr(,"useBytes")
>> > [1] TRUE
>> >
>> > Second, an error message like 'some lines were bad' is not very helpful.
>> > Should it put NA's in all the columns of the current output row if the
>> > input line didn't match the pattern and perhaps warn the user that there
>> > were problems?  The user could then look for rows of NA's to see where
>> > the problems were.
>> >
>> > Bill Dunlap
>> > TIBCO Software
>> > wdunlap tibco.com