[Rd] grep() and factors

Marc Schwartz (via MN) mschwartz at mn.rr.com
Tue Jun 6 00:15:03 CEST 2006


On Mon, 2006-06-05 at 13:45 -0700, Bill Dunlap wrote:
> On Mon, 5 Jun 2006, Marc Schwartz (via MN) wrote:
> 
> > Based upon an offlist communication this morning, I am somewhat confused
> > (more than I usually am on most Monday mornings...) about the use of
> > grep() with factors as the 'x' argument.
> >  ...
> > > grep("[a-z]", letters)
> >  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
> > [23] 23 24 25 26
> >
> > > grep("[a-z]", factor(letters))
> > numeric(0)
> 
> I was recently surprised by this also.  In addition, if
> R's grep did support factors in this way, what sort of
> object (factor or character) should it return when value=T?
> I recently changed Splus's grep to return a character vector in
> that case.
> 
>    Splus> grep("[def]", letters[26:1])
>    [1] 21 22 23
>    Splus>  grep("[def]", factor(letters[26:1], levels=letters[26:1]))
>    [1] 21 22 23
>    Splus> grep("[def]", letters[26:1], value=T)
>    [1] "f" "e" "d"
>    Splus> grep("[def]", factor(letters[26:1], levels=letters[26:1]), value=T)
>    [1] "f" "e" "d"
>    Splus> class(.Last.value)
>    [1] "character"
> 
> R does this when grepping an integer vector.
>    R> grep("1", 0:11, value=T)
>    [1] "1"  "10" "11"
> help(grep) says it returns "the matching elements themselves", but
> doesn't say if "themselves" means before or after the conversion to
> character.

Bill,

My first inclination for the return value when used on a factor would be
the indexed factor elements where grep() would otherwise simply return
the indices. This would also maintain the factor levels from the
original source factor since "[".factor would normally retain these when
drop = FALSE.

For example:

# Return the indexed values as would otherwise be done
# in grep() if the factor to character coercion takes place:
# Use the same indices 21:23 as above

> factor(letters[26:1], levels = letters[26:1])[21:23]
[1] f e d
Levels: z y x w v u t s r q p o n m l k j i h g f e d c b a
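
To make the role of 'drop' concrete, here is a minimal sketch using only base R (the object name 'f' is just for illustration). It contrasts the default drop = FALSE, which keeps all of the original levels, with drop = TRUE, which keeps only the levels actually present in the subset:

```r
# Same factor as in the example above; 'f' is an illustrative name
f <- factor(letters[26:1], levels = letters[26:1])

kept    <- f[21:23]               # default drop = FALSE: all 26 levels retained
dropped <- f[21:23, drop = TRUE]  # unused levels are discarded

nlevels(kept)     # 26
levels(dropped)   # "f" "e" "d"
```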



From my reading of the C code in do_grep() in character.c (again, if I
have read it correctly), when 'value = TRUE' the code appears to first
get the matching indices and then, in a for() loop, build the returned
vector from the source vector's elements at those indices. So this
should not be a problem philosophically.
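
As a rough R-level analogue of that two-step logic (this is only a sketch of the idea, not the actual C code in do_grep(); the function name grep_value_sketch is hypothetical), one could match on the character representation and then index the original object, so that a factor comes back as a factor:

```r
# Hypothetical helper: NOT part of base R and not a transcription of do_grep()
grep_value_sketch <- function(pattern, x) {
  idx <- grep(pattern, as.character(x))  # step 1: indices of matching elements
  x[idx]                                 # step 2: index the original vector
}

f <- factor(letters[26:1], levels = letters[26:1])
res_sketch <- grep_value_sketch("[def]", f)

class(res_sketch)         # "factor"
as.character(res_sketch)  # "f" "e" "d"
nlevels(res_sketch)       # 26
```

The explicit as.character() is there so the sketch does not depend on how grep() itself treats a factor argument.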

However, given your example of the coercion of integers, perhaps with
grep() at least, consistent behavior would dictate that return values
are always character vectors. These could then be coerced manually back
to a factor, using the original levels, as may be required:

> factor.letters <- factor(letters[26:1], levels=letters[26:1])
> factor.letters
 [1] z y x w v u t s r q p o n m l k j i h g f e d c b a
Levels: z y x w v u t s r q p o n m l k j i h g f e d c b a

> grep("[def]", as.character(factor.letters))
[1] 21 22 23

> res <- grep("[def]", as.character(factor.letters), value = TRUE)
> res
[1] "f" "e" "d"

> factor(res, levels = levels(factor.letters))
[1] f e d
Levels: z y x w v u t s r q p o n m l k j i h g f e d c b a

Which of course is the same result I proposed initially above.
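
If the character-vector return value were adopted, the manual round trip above could be wrapped in a small convenience function. The following is only a sketch under that assumption; the name grep_factor is hypothetical and nothing like it exists in base R:

```r
# Hypothetical wrapper around the manual coercion steps shown above
grep_factor <- function(pattern, x) {
  stopifnot(is.factor(x))
  # Match against the labels, returning the matching labels as character
  hits <- grep(pattern, as.character(x), value = TRUE)
  # Rebuild a factor with the original levels (and orderedness) preserved
  factor(hits, levels = levels(x), ordered = is.ordered(x))
}

factor.letters <- factor(letters[26:1], levels = letters[26:1])
out <- grep_factor("[def]", factor.letters)

as.character(out)  # "f" "e" "d"
nlevels(out)       # 26
```

Passing ordered = is.ordered(x) is a design choice so that an ordered factor stays ordered; drop it if a plain factor is always wanted.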

I could be convinced either way. The concern, of course, is that (given
the offlist replies I have received today) even experienced users are
getting bitten because the current behavior conflicts with their
intuitive expectations, which are at least loosely supported by the
documentation.

HTH,

Marc Schwartz


