[R] Trim trailing space from data.frame factor variables

Thu Aug 16 18:52:46 CEST 2007

On Thu, 16 Aug 2007, Marc Schwartz wrote:

> The easiest way might be to modify the lapply() call as follows:
>
> d[] <- lapply(d, function(x) if (is.factor(x)) factor(sub(" +$", "", x)) else x)
>
>> str(d)
> 'data.frame':   60 obs. of  3 variables:
> $ x: Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
> $ y: num  7.01 8.33 5.48 6.51 5.61 ...
> $ f: Factor w/ 3 levels "lev1","lev2",..: 1 1 1 1 1 1 1 1 1 1 ...
>
>
> This way the coercion back to a factor takes place within the loop as
> needed.
>
> Note that I also meant to type sub() and not grep() below. The default
> behavior for both is to return a character vector (if 'value = TRUE' in
> grep()). There is not an argument to override that behavior.

I would have thought the thing to do was to apply sub() to the levels:

chfactor <- function(x) { levels(x) <- sub(" +$", "", levels(x)); x }

d[] <- lapply(d, function(x) if (is.factor(x)) chfactor(x) else x)

This has the advantage of not losing the order of the levels.  It will
merge levels that differ only in the number of trailing spaces, which
is probably what you want.
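
As a small illustration of both points (a sketch only; the factor below is
made up for the purpose and is not taken from the data in this thread):

```r
# Hypothetical factor: levels in a deliberate non-alphabetical order,
# two of which ("low " and "low") differ only by trailing spaces.
f <- factor(c("low ", "high  ", "mid", "low"),
            levels = c("low ", "mid", "high  ", "low"))

chfactor <- function(x) { levels(x) <- sub(" +$", "", levels(x)); x }

# Editing the levels keeps their order and merges the duplicates.
g <- chfactor(f)
levels(g)        # "low" "mid" "high"
as.character(g)  # "low" "high" "mid" "low"

# For comparison, rebuilding via factor() re-sorts the levels.
h <- factor(sub(" +$", "", f))
levels(h)        # "high" "low" "mid"
```

Assigning a levels vector that contains duplicates is what does the
merging here; the underlying integer codes are remapped accordingly.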

> HTH,
>
> Marc
>
>
> On Thu, 2007-08-16 at 19:19 +0300, Lauri Nikkinen wrote:
>> Thanks Marc! What would be the easiest way to coerce char-variables
>> back to factor-variables? Is there a way to prevent the coercion in
>> d[] <- lapply(d, function(x) if ( is.factor(x)) sub(" +$", "", x) else
>> x) ?
>>
>>
>>
>> -Lauri
>>
>>
>>
>> 2007/8/16, Marc Schwartz <marc_schwartz at comcast.net>:
>>         On Thu, 2007-08-16 at 17:54 +0300, Lauri Nikkinen wrote:
>>        > Hi folks,
>>        >
>>        > I would like to trim the trailing spaces in my factor variables
>>        > using lapply (described in this post by Marc Schwartz:
>>        > http://tolstoy.newcastle.edu.au/R/e2/help/07/08/22826.html)
>>        > but the code is not functioning (in this example there is only
>>        > one factor with trailing spaces):
>>
>>         Ayep....as I noted in that post, it was untested....my error.
>>
>>         The problem is that by using ifelse() as I did, the test for the
>>         column being a factor returns a single result, not one result per
>>         element. Hence, the appropriate conditional code is only performed
>>         on the first element in each column, rather than being vectorized
>>         on the entire column.
>>
>>        > y1 <- rnorm(20) + 6.8
>>        > y2 <- rnorm(20) + (1:20* 1.7 + 1)
>>        > y3 <- rnorm(20) + (1:20*6.7 + 3.7)
>>        > y <- c(y1,y2,y3)
>>        > x <- gl(5,12)
>>        > f <- gl(3,20, labels=paste("lev", 1:3, "   ", sep=""))
>>        > d <- data.frame (x=x,y=y, f=f)
>>        > str(d)
>>        >
>>        > d[] <- lapply(d, function(x) ifelse(is.factor(x), sub(" +$", "", x), x))
>>        > str(d)
>>        >
>>        > How should I modify this?
>>
>>         Try this instead:
>>
>>         d[] <- lapply(d, function(x) if (is.factor(x)) sub(" +$", "", x) else x)
>>
>>        > str(d)
>>         'data.frame':   60 obs. of  3 variables:
>>         $ x: chr  "1" "1" "1" "1" ...
>>         $ y: num  6.70 4.42 8.03 4.90 6.98 ...
>>         $ f: chr  "lev1" "lev1" "lev1" "lev1" ...
>>
>>         Note that by using grep(), the factors are coerced to character
>>         vectors as expected. You would need to coerce back to factors if
>>         you need them as such.
>>
>>         HTH,
>>
>>         Marc Schwartz
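
For reference, the ifelse() behaviour described in the quoted message can
be reproduced with a short sketch. The data frame here is a made-up
three-row example, not the one from the original post:

```r
# Hypothetical data: one numeric column, one factor with trailing spaces.
d <- data.frame(y = c(1.5, 2.5, 3.5),
                f = factor(c("a  ", "b ", "c")))

# ifelse() returns a result shaped like its test, and is.factor()
# gives a single TRUE/FALSE, so only the first element survives.
bad <- ifelse(is.factor(d$f), sub(" +$", "", d$f), d$f)
length(bad)   # 1

# if ... else evaluates the chosen branch in full.
good <- if (is.factor(d$f)) sub(" +$", "", d$f) else d$f
length(good)  # 3
```

Inside d[] <- lapply(...), the length-one result would then be recycled
down the whole column.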

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595