[Rd] grep() and factors

Sean Davis sdavis2 at mail.nih.gov
Tue Jun 6 00:29:27 CEST 2006


Marc Schwartz (via MN) wrote:
> On Mon, 2006-06-05 at 13:45 -0700, Bill Dunlap wrote:
> 
>>On Mon, 5 Jun 2006, Marc Schwartz (via MN) wrote:
>>
>>
>>>Based upon an offlist communication this morning, I am somewhat confused
>>>(more than I usually am on most Monday mornings...) about the use of
>>>grep() with factors as the 'x' argument.
>>> ...
>>>
>>>>grep("[a-z]", letters)
>>>
>>> [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
>>>[23] 23 24 25 26
>>>
>>>
>>>>grep("[a-z]", factor(letters))
>>>
>>>numeric(0)
>>
>>I was recently surprised by this also.  In addition, if
>>R's grep did support factors in this way, what sort of
>>object (factor or character) should it return when value=T?
>>I recently changed Splus's grep to return a character vector in
>>that case.
>>
>>   Splus> grep("[def]", letters[26:1])
>>   [1] 21 22 23
>>   Splus>  grep("[def]", factor(letters[26:1], levels=letters[26:1]))
>>   [1] 21 22 23
>>   Splus> grep("[def]", letters[26:1], value=T)
>>   [1] "f" "e" "d"
>>   Splus> grep("[def]", factor(letters[26:1], levels=letters[26:1]), value=T)
>>   [1] "f" "e" "d"
>>   Splus> class(.Last.value)
>>   [1] "character"
>>
>>R does this when grepping an integer vector.
>>   R> grep("1", 0:11, value=T)
>>   [1] "1"  "10" "11"
>>help(grep) says it returns "the matching elements themselves", but
>>doesn't say if "themselves" means before or after the conversion to
>>character.
> 
> 
> Bill,
> 
> My first inclination for the return value when used on a factor would be
> the indexed factor elements where grep() would otherwise simply return
> the indices. This would also maintain the factor levels from the
> original source factor since "[".factor would normally retain these when
> drop = FALSE.
> 
> For example:
> 
> # Return the indexed values as would otherwise be done
> # in grep() if the factor to character coercion takes place:
> # Use the same indices 21:23 as above
> 
> 
>>factor(letters[26:1], levels = letters[26:1])[21:23]
> 
> [1] f e d
> Levels: z y x w v u t s r q p o n m l k j i h g f e d c b a
> 
> 
> 
> From my read of the C code in do_grep() in character.c (again, if
> correct), when 'value = TRUE', the C code appears to first get the
> indices and then build the returned vector from the indexed values from
> the source vector in a for() loop. So this should not be a problem
> philosophically.
> 
> However, given your example of the coercion of integers, perhaps with
> grep() at least, consistent behavior would dictate that return values
> are always character vectors. These could then be coerced manually back
> to a factor, using the original levels, as may be required:
> 
> 
>>factor.letters <- factor(letters[26:1], levels=letters[26:1])
>>factor.letters
> 
>  [1] z y x w v u t s r q p o n m l k j i h g f e d c b a
> Levels: z y x w v u t s r q p o n m l k j i h g f e d c b a
> 
> 
>>grep("[def]", as.character(factor.letters))
> 
> [1] 21 22 23
> 
> 
>>res <- grep("[def]", as.character(factor.letters), value = TRUE)
>>res
> 
> [1] "f" "e" "d"
> 
> 
>>factor(res, levels = levels(factor.letters))
> 
> [1] f e d
> Levels: z y x w v u t s r q p o n m l k j i h g f e d c b a
> 
> Which of course is the same result I proposed initially above.
> 
> I could be convinced either way. The concern of course being that (given
> the offlist replies I have received today) even experienced users are
> getting bitten by the current behavior versus their intuitive
> expectations, which are at least loosely supported by the documentation.

I'll chime in on-list to say that I have had the same experience of
expecting grep() to coerce its argument to text.  Setting aside the
question of return values, I think of grep (not equivalent to the Unix
command, I understand, but it does have the same name) as operating on
"text", not the factor levels themselves.  Not a big deal, but it does
lead to bugs that are sometimes hard to track down if one is not careful
to wrap the argument in as.character() every time.
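
For what it's worth, here is a minimal sketch of the defensive habit I
mean, reusing the factor from Marc's example.  The wrapper name
grep_chr() is made up for illustration and is not part of R; it only
makes the as.character() step explicit, so the result should not depend
on whether grep() coerces factors itself.

```r
# Hypothetical helper (not in base R): always match against the
# character labels of 'x', whatever type 'x' happens to be.
grep_chr <- function(pattern, x, ...) {
  grep(pattern, as.character(x), ...)
}

f <- factor(letters[26:1], levels = letters[26:1])

# Positions of the matching elements
idx <- grep_chr("[def]", f)

# The matching labels, returned as a character vector
val <- grep_chr("[def]", f, value = TRUE)

# If a factor is wanted back, rebuild it with the original levels
val_f <- factor(val, levels = levels(f))
```

The last step is the same manual conversion back to a factor that Marc
shows above.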

Sean


