[Rd] strsplit and the empty string

Wacek Kusnierczyk Waclaw.Marcin.Kusnierczyk at idi.ntnu.no
Wed Jun 18 14:45:36 CEST 2008


Hello,

I am wondering about the behaviour of strsplit.  When the pattern
matches the beginning of the search string, the empty string is added to
the result, but that's not the case when the pattern matches the end of
the search string:

strsplit(" hello dolly ", " ")
[[1]]
[1] ""      "hello" "dolly"

The man page for strsplit explains the algorithm:

"
 The algorithm applied to each input string is


         repeat {
             if the string is empty
                 break.
             if there is a match
                 add the string to the left of the match to the output.
                 remove the match and all to the left of it.
             else
                 add the string to the output.
                 break.
         }

     Note that this means that if there is a match at the beginning of
     a (non-empty) string, the first element of the output is '""', but
     if there is a match at the end of the string, the output is the
     same as with the match removed.
"
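
To make the quoted pseudocode concrete, here is a minimal sketch that
transcribes it literally into R.  It is not the actual implementation of
strsplit; split_sketch is a hypothetical helper, and it assumes a single
input string and a fixed, non-empty separator (no regular expressions).

```r
# Hypothetical helper (not part of R): a literal transcription of the
# pseudocode quoted from ?strsplit, for one string and a fixed separator.
split_sketch <- function(x, sep) {
  out <- character(0)
  repeat {
    if (nchar(x) == 0) break                  # "if the string is empty, break"
    pos <- regexpr(sep, x, fixed = TRUE)      # first match, or -1 if none
    start <- as.integer(pos)
    len <- attr(pos, "match.length")
    if (start > 0) {
      out <- c(out, substr(x, 1, start - 1))  # string to the left of the match
      x <- substr(x, start + len, nchar(x))   # remove the match and all to its left
    } else {
      out <- c(out, x)                        # no match: add the rest of the string
      break
    }
  }
  out
}

split_sketch(" hello dolly ", " ")
# "" "hello" "dolly"
```

In this reading of the pseudocode, the leading "" comes from the empty
string to the left of the first match, and no trailing "" is produced
because the emptiness test at the top of the loop is reached before the
else branch.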

I do not see how this algorithm specifies that there should be no empty
string at the end of the output if the pattern matches the end of the
input string.
If the pattern matches (the second 'if' above), the string to the left of
the match is added to the output, and the match and everything to its left
are removed from the input -- which after this step is the empty string;
in the next step, there is no match (the 'else' above), so the rest of
the input string (= the empty string) *should* be added, but that is not
what happens.

I think that the implementation of the algorithm (and the explanation
that "if there is a match at the end of the string, the output is the
same as with the match removed") is both unintuitive (I see no good
reason for including the empty string at the beginning but not at the
end of the output; no other language I know does it this way) and
actually wrong with respect to the algorithm.

Any opinions?  What was the rationale for this design?

vQ




-- 
-------------------------------------------------------------------------------
Wacek Kusnierczyk, MD PhD

Email: waku at idi.ntnu.no
Phone: +47 73591875, +47 72574609

Department of Computer and Information Science (IDI)
Faculty of Information Technology, Mathematics and Electrical Engineering (IME)
Norwegian University of Science and Technology (NTNU)
Sem Saelands vei 7, 7491 Trondheim, Norway
Room itv303

Bioinformatics & Gene Regulation Group
Department of Cancer Research and Molecular Medicine (IKM)
Faculty of Medicine (DMF)
Norwegian University of Science and Technology (NTNU)
Laboratory Center, Erling Skjalgsons gt. 1, 7030 Trondheim, Norway
Room 231.05.060


