[Rd] grep() and factors
Sean Davis
sdavis2 at mail.nih.gov
Tue Jun 6 00:29:27 CEST 2006
Marc Schwartz (via MN) wrote:
> On Mon, 2006-06-05 at 13:45 -0700, Bill Dunlap wrote:
>
>>On Mon, 5 Jun 2006, Marc Schwartz (via MN) wrote:
>>
>>
>>>Based upon an offlist communication this morning, I am somewhat confused
>>>(more than I usually am on most Monday mornings...) about the use of
>>>grep() with factors as the 'x' argument.
>>> ...
>>>
>>>>grep("[a-z]", letters)
>>>
>>> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
>>>[23] 23 24 25 26
>>>
>>>
>>>>grep("[a-z]", factor(letters))
>>>
>>>numeric(0)
>>
>>I was recently surprised by this also. In addition, if
>>R's grep did support factors in this way, what sort of
>>object (factor or character) should it return when value=T?
>>I recently changed Splus's grep to return a character vector in
>>that case.
>>
>> Splus> grep("[def]", letters[26:1])
>> [1] 21 22 23
>> Splus> grep("[def]", factor(letters[26:1], levels=letters[26:1]))
>> [1] 21 22 23
>> Splus> grep("[def]", letters[26:1], value=T)
>> [1] "f" "e" "d"
>> Splus> grep("[def]", factor(letters[26:1], levels=letters[26:1]), value=T)
>> [1] "f" "e" "d"
>> Splus> class(.Last.value)
>> [1] "character"
>>
>>R does this when grepping an integer vector.
>> R> grep("1", 0:11, value=T)
>> [1] "1" "10" "11"
>>help(grep) says it returns "the matching elements themselves", but
>>doesn't say if "themselves" means before or after the conversion to
>>character.
>
>
> Bill,
>
> My first inclination for the return value when used on a factor would be
> the indexed factor elements where grep() would otherwise simply return
> the indices. This would also maintain the factor levels from the
> original source factor since "[".factor would normally retain these when
> drop = FALSE.
>
> For example:
>
> # Return the indexed values as would otherwise be done
> # in grep() if the factor to character coercion takes place:
> # Use the same indices 21:23 as above
>
>
>>factor(letters[26:1], levels = letters[26:1])[21:23]
>
> [1] f e d
> Levels: z y x w v u t s r q p o n m l k j i h g f e d c b a
>
>
>
> From my read of the C code in do_grep() in character.c (again, if
> correct), when 'value = TRUE', the C code appears to first get the
> indices and then build the returned vector from the indexed values from
> the source vector in a for() loop. So this should not be a problem
> philosophically.
>
> However, given your example of the coercion of integers, perhaps with
> grep() at least, consistent behavior would dictate that return values
> are always character vectors. These could then be coerced manually back
> to a factor, using the original levels, as may be required:
>
>
>>factor.letters <- factor(letters[26:1], levels=letters[26:1])
>>factor.letters
>
> [1] z y x w v u t s r q p o n m l k j i h g f e d c b a
> Levels: z y x w v u t s r q p o n m l k j i h g f e d c b a
>
>
>>grep("[def]", as.character(factor.letters))
>
> [1] 21 22 23
>
>
>>res <- grep("[def]", as.character(factor.letters), value = TRUE)
>>res
>
> [1] "f" "e" "d"
>
>
>>factor(res, levels = levels(factor.letters))
>
> [1] f e d
> Levels: z y x w v u t s r q p o n m l k j i h g f e d c b a
>
> Which of course is the same result I proposed initially above.
>
> I could be convinced either way. The concern of course being that (given
> the offlist replies I have received today) even experienced users are
> getting bitten by the current behavior versus their intuitive
> expectations, which are at least loosely supported by the documentation.

I'll chime in on-list to say that I have had the same experience of
expecting grep() to coerce its argument to text. Leaving aside the
question of return values, I think of grep (not equivalent to the Unix
command, I understand, but it does share the name) as operating on
"text", not the factor levels themselves. Not a big deal, but it can
lead to hard-to-track bugs if one is not careful to wrap factors in
as.character() every time.
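
To make the as.character habit concrete, here is a minimal sketch of a wrapper that does the coercion explicitly, so the result does not depend on how grep() itself treats a factor. The name grep_text is hypothetical and not part of R; the data reuse Marc's reversed-letters factor from above.

```r
# grep_text() is a hypothetical helper, not a base R function.
# It coerces 'x' to character before matching, so a factor is always
# searched by its labels, whatever grep() would do with it directly.
grep_text <- function(pattern, x, ...) {
    grep(pattern, as.character(x), ...)
}

# Same factor as in the examples quoted above
f <- factor(letters[26:1], levels = letters[26:1])

idx <- grep_text("[def]", f)                # 21 22 23
val <- grep_text("[def]", f, value = TRUE)  # "f" "e" "d"

# Indexing the factor with the matches keeps the original levels
f[idx]
```

With value = TRUE this returns a character vector, as in Bill's Splus example; indexing the original factor with the returned indices gives the factor result Marc described.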
Sean