[R] Trim trailing space from data.frame factor variables

Thu Aug 16 18:59:33 CEST 2007

If the problem is with the levels of the factor, why not change
them directly?

> d = data.frame(a=1:5,
+     b=c('one ','two','three ','three ','two'))
> d$b
[1] one    two    three  three  two
Levels: one  three  two
> levels(d$b) = sub(' +$','',levels(d$b))
> d$b
[1] one   two   three three two
Levels: one three two
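
The same idea can be applied to every factor column of a data frame.
What follows is only a minimal sketch: the helper name trim_factor_levels
is made up for illustration, and stringsAsFactors = TRUE is spelled out
because R 4.0.0 and later no longer convert strings to factors by default.

```r
# Build the example with an explicit factor column; since R 4.0.0,
# data.frame() keeps strings as character unless stringsAsFactors = TRUE.
d <- data.frame(a = 1:5,
                b = c("one ", "two", "three ", "three ", "two"),
                stringsAsFactors = TRUE)

# trim_factor_levels is a hypothetical helper name, not part of base R.
# It strips trailing spaces from the levels of each factor column and
# leaves all other columns untouched.
trim_factor_levels <- function(df) {
  for (nm in names(df)) {
    if (is.factor(df[[nm]])) {
      levels(df[[nm]]) <- sub(" +$", "", levels(df[[nm]]))
    }
  }
  df
}

d <- trim_factor_levels(d)
levels(d$b)
```

Because only the levels are edited, the column stays a factor throughout
and nothing has to be coerced back afterwards.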

                                        - Phil Spector
                                          Statistical Computing Facility
                                          Department of Statistics
                                          UC Berkeley
                                          spector at stat.berkeley.edu

On Thu, 16 Aug 2007, Marc Schwartz wrote:

> The easiest way might be to modify the lapply() call as follows:
>
> d[] <- lapply(d, function(x) if (is.factor(x)) factor(sub(" +$", "", x)) else x)
>
> > str(d)
> 'data.frame':   60 obs. of  3 variables:
> $ x: Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
> $ y: num  7.01 8.33 5.48 6.51 5.61 ...
> $ f: Factor w/ 3 levels "lev1","lev2",..: 1 1 1 1 1 1 1 1 1 1 ...
>
>
> This way the coercion back to a factor takes place within the loop as
> needed.
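
One difference between rebuilding with factor() and editing the levels
directly may matter if the level order is meaningful. The sketch below
uses a toy factor of my own (not the poster's data) to illustrate it.

```r
# A factor whose levels are in a deliberate, non-alphabetical order.
f <- factor(c("low ", "high ", "low "), levels = c("low ", "high "))

# Rebuilding with factor() derives the levels from the trimmed values,
# so they come back in sorted order.
rebuilt <- factor(sub(" +$", "", f))
levels(rebuilt)      # "high" "low"

# Editing the levels in place keeps the original order.
relabelled <- f
levels(relabelled) <- sub(" +$", "", levels(relabelled))
levels(relabelled)   # "low" "high"
```

The values are the same either way; only the order of the levels (and
hence the underlying integer codes) differs.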
>
> Note that I also meant to type sub() and not grep() below. Both return
> a character vector (grep() only if 'value = TRUE'), and there is no
> argument to override that behavior.
>
> HTH,
>
> Marc
>
>
> On Thu, 2007-08-16 at 19:19 +0300, Lauri Nikkinen wrote:
>> Thanks Marc! What would be the easiest way to coerce char-variables
>> back to factor-variables? Is there a way to prevent the coercion in
>> the following call?
>>
>> d[] <- lapply(d, function(x) if (is.factor(x)) sub(" +$", "", x) else x)
>>
>>
>>
>> -Lauri
>>
>>
>>
>> 2007/8/16, Marc Schwartz <marc_schwartz at comcast.net>:
>>         On Thu, 2007-08-16 at 17:54 +0300, Lauri Nikkinen wrote:
>>        > Hi folks,
>>        >
>>        > I would like to trim the trailing spaces in my factor variables
>>        > using lapply (described in this post by Marc Schwartz:
>>        > http://tolstoy.newcastle.edu.au/R/e2/help/07/08/22826.html)
>>        > but the code is not functioning (in this example there is only
>>        > one factor with trailing spaces):
>>
>>         Ayep....as I noted in that post, it was untested....my error.
>>
>>         The problem is that by using ifelse() as I did, the test for the
>>         column being a factor returns a single result, not one result
>>         per element. Hence, the appropriate conditional code is only
>>         performed on the first element in each column, rather than
>>         being vectorized on the entire column.
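
To see the difference concretely, here is a small sketch on a toy factor
(my own example, not the data from the original post):

```r
f <- factor(c("a ", "b ", "c "))

# is.factor(f) is a single TRUE, so ifelse() returns a result of length
# one: only the first element of the 'yes' argument is used.
r1 <- ifelse(is.factor(f), sub(" +$", "", f), f)
length(r1)   # 1

# if/else evaluates the chosen branch once, on the whole vector.
r2 <- if (is.factor(f)) sub(" +$", "", f) else f
length(r2)   # 3
```

Inside d[] <- lapply(...), that length-one result is then recycled down
the whole column, which is why every row ends up with the same value.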
>>
>>        > y1 <- rnorm(20) + 6.8
>>        > y2 <- rnorm(20) + (1:20* 1.7 + 1)
>>        > y3 <- rnorm(20) + (1:20*6.7 + 3.7)
>>        > y <- c(y1,y2,y3)
>>        > x <- gl(5,12)
>>        > f <- gl(3,20, labels=paste("lev", 1:3, "   ", sep=""))
>>        > d <- data.frame (x=x,y=y, f=f)
>>        > str(d)
>>        >
>>        > d[] <- lapply(d, function(x) ifelse(is.factor(x), sub(" +$", "", x), x))
>>        > str(d)
>>        >
>>        > How should I modify this?
>>
>>         Try this instead:
>>
>>         d[] <- lapply(d, function(x) if (is.factor(x)) sub(" +$", "", x) else x)
>>
>>        > str(d)
>>         'data.frame':   60 obs. of  3 variables:
>>         $ x: chr  "1" "1" "1" "1" ...
>>         $ y: num  6.70 4.42 8.03 4.90 6.98 ...
>>         $ f: chr  "lev1" "lev1" "lev1" "lev1" ...
>>
>>         Note that by using grep(), the factors are coerced to character
>>         vectors as expected. You would need to coerce back to factors
>>         if you need them as such.
>>
>>         HTH,
>>
>>         Marc Schwartz
>>
>