[R] Trim trailing space from data.frame factor variables

Marc Schwartz marc_schwartz at comcast.net
Thu Aug 16 18:29:57 CEST 2007


The easiest way might be to modify the lapply() call as follows:

d[] <- lapply(d, function(x) if (is.factor(x)) factor(sub(" +$", "", x)) else x)

> str(d)
'data.frame':   60 obs. of  3 variables:
 $ x: Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ y: num  7.01 8.33 5.48 6.51 5.61 ...
 $ f: Factor w/ 3 levels "lev1","lev2",..: 1 1 1 1 1 1 1 1 1 1 ...


This way the coercion back to a factor takes place within the loop as
needed.
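
An alternative that was not proposed in this thread is to trim the level labels themselves, so the column never stops being a factor. The following is only a sketch under that assumption: trim_levels() is a hypothetical helper name, and the small data frame is made up for illustration.

```r
# Small made-up data frame with one numeric column and one factor whose
# level labels carry trailing spaces.
d <- data.frame(
  y = c(1.5, 2.5, 3.5, 4.5),
  f = factor(c("lev1   ", "lev1   ", "lev2   ", "lev3   "))
)

# Hypothetical helper (not from the thread): strip trailing spaces from the
# level labels only; non-factor columns are returned unchanged.
trim_levels <- function(x) {
  if (is.factor(x)) {
    levels(x) <- sub(" +$", "", levels(x))
  }
  x
}

d[] <- lapply(d, trim_levels)
str(d)
```

One difference worth keeping in mind: factor(sub(...)) rebuilds the factor, so levels are re-sorted and unused levels are dropped, whereas assigning to levels() keeps the existing level order and any unused levels.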

Note that I also meant to type sub() and not grep() below. Both return a
character vector (grep() only if 'value = TRUE'), and neither has an
argument to override that behavior.
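
A quick way to see the coercion, using a made-up two-element factor:

```r
f <- factor(c("lev1   ", "lev2   "))

# sub() works on the character representation of the factor,
# so the result is a plain character vector.
trimmed <- sub(" +$", "", f)
class(trimmed)

# Wrapping the result in factor() restores the factor class.
refactored <- factor(trimmed)
class(refactored)
```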

HTH,

Marc


On Thu, 2007-08-16 at 19:19 +0300, Lauri Nikkinen wrote:
> Thanks Marc! What would be the easiest way to coerce char-variables
> back to factor-variables? Is there a way to prevent the coercion in
> the following call?
>
> d[] <- lapply(d, function(x) if (is.factor(x)) sub(" +$", "", x) else x)
> 
>  
> 
> -Lauri
> 
> 
> 
> 2007/8/16, Marc Schwartz <marc_schwartz at comcast.net>: 
>         On Thu, 2007-08-16 at 17:54 +0300, Lauri Nikkinen wrote:
>         > Hi folks,
>         >
>         > I would like to trim the trailing spaces in my factor variables
>         > using lapply (described in this post by Marc Schwartz:
>         > http://tolstoy.newcastle.edu.au/R/e2/help/07/08/22826.html)
>         > but the code is not functioning (in this example there is only
>         > one factor with trailing spaces):
>         
>         Ayep....as I noted in that post, it was untested....my error.
>         
>         The problem is that by using ifelse() as I did, the test for the
>         column being a factor returns a single result, not one result per
>         element. Hence, the appropriate conditional code is only performed
>         on the first element in each column, rather than being vectorized
>         on the entire column.
>         
>         > y1 <- rnorm(20) + 6.8
>         > y2 <- rnorm(20) + (1:20* 1.7 + 1)
>         > y3 <- rnorm(20) + (1:20*6.7 + 3.7)
>         > y <- c(y1,y2,y3)
>         > x <- gl(5,12)
>         > f <- gl(3,20, labels=paste("lev", 1:3, "   ", sep=""))
>         > d <- data.frame (x=x,y=y, f=f)
>         > str(d)
>         >
>         > d[] <- lapply(d, function(x) ifelse(is.factor(x), sub(" +$", "", x), x))
>         > str(d)
>         >
>         > How should I modify this?
>         
>         Try this instead: 
>         
>         d[] <- lapply(d, function(x) if (is.factor(x)) sub(" +$", "", x) else x)
>         
>         > str(d)
>         'data.frame':   60 obs. of  3 variables:
>         $ x: chr  "1" "1" "1" "1" ... 
>         $ y: num  6.70 4.42 8.03 4.90 6.98 ...
>         $ f: chr  "lev1" "lev1" "lev1" "lev1" ...
>         
>         Note that by using grep(), the factors are coerced to character
>         vectors as expected. You would need to coerce back to factors if
>         you need them as such.
>         
>         HTH,
>         
>         Marc Schwartz
>
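
The ifelse() issue described in the quoted message can be reproduced on a single made-up factor. This is only an illustrative sketch; the variable names are not from the thread.

```r
f <- factor(c("a  ", "b  ", "c  "))

# is.factor(f) is a single TRUE, and ifelse() returns a result shaped like
# its test, so only the first element of the 'yes' value survives.
wrong <- ifelse(is.factor(f), sub(" +$", "", f), f)
length(wrong)

# if () ... else picks a whole branch, so the full trimmed vector comes back.
right <- if (is.factor(f)) sub(" +$", "", f) else f
length(right)
```

ifelse() is meant for element-wise tests such as ifelse(x > 0, "pos", "neg"), while a single yes/no question about the whole column calls for if () ... else.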


