[R] Trim trailing space from data.frame factor variables
Phil Spector
spector at stat.Berkeley.EDU
Thu Aug 16 18:59:33 CEST 2007
If the problem is with the levels of the factor, why not change
them directly?
> d = data.frame(a=1:5,
+ b=c('one ','two','three ','three ','two'))
> d$b
[1] one two three three two
Levels: one three two
> levels(d$b) = sub(' +$','',levels(d$b))
> d$b
[1] one two three three two
Levels: one three two
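The same idea extends to every factor column of a data frame. Below is a
minimal sketch, not a definitive implementation. The helper name
trim_factor_levels is hypothetical (it is not part of base R), and
stringsAsFactors = TRUE is spelled out because the data.frame() default
changed to FALSE in R 4.0.0.

```r
# Hypothetical helper: strip trailing spaces from the levels of every
# factor column, leaving non-factor columns untouched.
trim_factor_levels <- function(df) {
  for (nm in names(df)) {
    if (is.factor(df[[nm]])) {
      # Assigning to levels() renames the levels in place; levels that
      # become identical after trimming are merged.
      levels(df[[nm]]) <- sub(" +$", "", levels(df[[nm]]))
    }
  }
  df
}

d <- data.frame(a = 1:5,
                b = c("one ", "two", "three ", "three ", "two"),
                stringsAsFactors = TRUE)
d <- trim_factor_levels(d)
levels(d$b)   # "one" "three" "two"
```

Because only the levels are rewritten, the columns stay factors and no
coercion back from character is needed.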
- Phil Spector
Statistical Computing Facility
Department of Statistics
UC Berkeley
spector at stat.berkeley.edu
On Thu, 16 Aug 2007, Marc Schwartz wrote:
> The easiest way might be to modify the lapply() call as follows:
>
> d[] <- lapply(d, function(x) if (is.factor(x)) factor(sub(" +$", "", x)) else x)
>
> > str(d)
> 'data.frame': 60 obs. of 3 variables:
> $ x: Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
> $ y: num 7.01 8.33 5.48 6.51 5.61 ...
> $ f: Factor w/ 3 levels "lev1","lev2",..: 1 1 1 1 1 1 1 1 1 1 ...
>
>
> This way the coercion back to a factor takes place within the loop as
> needed.
>
> Note that I also meant to type sub() and not grep() below. The default
> behavior for both is to return a character vector (if 'value = TRUE' in
> grep()). There is not an argument to override that behavior.
>
> HTH,
>
> Marc
>
>
> On Thu, 2007-08-16 at 19:19 +0300, Lauri Nikkinen wrote:
>> Thanks Marc! What would be the easiest way to coerce char-variables
>> back to factor-variables? Is there a way to prevent the coercion in
>> d[] <- lapply(d, function(x) if (is.factor(x)) sub(" +$", "", x) else x) ?
>>
>>
>>
>> -Lauri
>>
>>
>>
>> 2007/8/16, Marc Schwartz <marc_schwartz at comcast.net>:
>> On Thu, 2007-08-16 at 17:54 +0300, Lauri Nikkinen wrote:
>> > Hi folks,
>> >
>> > I would like to trim the trailing spaces in my factor variables using
>> > lapply (described in this post by Marc Schwartz:
>> > http://tolstoy.newcastle.edu.au/R/e2/help/07/08/22826.html) but the
>> > code is not functioning (in this example there is only one factor with
>> > trailing spaces):
>>
>> Ayep....as I noted in that post, it was untested....my error.
>>
>> The problem is that by using ifelse() as I did, the test for the column
>> being a factor returns a single result, not one result per element.
>> Hence, the appropriate conditional code is only performed on the first
>> element in each column, rather than being vectorized on the entire
>> column.
>>
>> > y1 <- rnorm(20) + 6.8
>> > y2 <- rnorm(20) + (1:20* 1.7 + 1)
>> > y3 <- rnorm(20) + (1:20*6.7 + 3.7)
>> > y <- c(y1,y2,y3)
>> > x <- gl(5,12)
>> > f <- gl(3,20, labels=paste("lev", 1:3, " ", sep=""))
>> > d <- data.frame (x=x,y=y, f=f)
>> > str(d)
>> >
>> > d[] <- lapply(d, function(x) ifelse(is.factor(x), sub(" +$", "", x), x))
>> > str(d)
>> >
>> > How should I modify this?
>>
>> Try this instead:
>>
>> d[] <- lapply(d, function(x) if (is.factor(x)) sub(" +$", "", x) else x)
>>
>> > str(d)
>> 'data.frame': 60 obs. of 3 variables:
>> $ x: chr "1" "1" "1" "1" ...
>> $ y: num 6.70 4.42 8.03 4.90 6.98 ...
>> $ f: chr "lev1" "lev1" "lev1" "lev1" ...
>>
>> Note that by using grep(), the factors are coerced to character vectors
>> as expected. You would need to coerce back to factors if you need them
>> as such.
>>
>> HTH,
>>
>> Marc Schwartz
>>
>
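The ifelse() pitfall described in the quoted thread above can be seen on a
single factor. This is only an illustrative sketch; the vector f and the
names bad and good are made up for the example.

```r
f <- factor(c("lev1 ", "lev1 ", "lev2 "))

# is.factor(f) is a single TRUE, so ifelse() returns a result of length
# one: only the first element of the "yes" vector is used.
bad <- ifelse(is.factor(f), sub(" +$", "", f), f)
length(bad)   # 1

# if/else evaluates the condition once and returns the whole vector;
# wrapping the result in factor() keeps the column a factor.
good <- if (is.factor(f)) factor(sub(" +$", "", f)) else f
length(good)  # 3
levels(good)  # "lev1" "lev2"
```

The shape of ifelse()'s result follows its test argument, which is why a
scalar test such as is.factor(x) belongs in a plain if/else.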