[R] Character manipulation using "strsplit" & vectorization

David Winsemius dwinsemius at comcast.net
Tue Sep 8 14:19:14 CEST 2009


On Sep 8, 2009, at 12:39 AM, Steven Kang wrote:

> Dear R users,
>
>
> Suppose I have a data set with inconsistent names for a field.
>
> I would like to map these to a single consistent name,
>
> e.g.
>
> "University of New Jersey", "New Jersey Uni", "New Jersey University"
> (3 different inconsistent names) to "The University of New Jersey"
> (consistent name)
>
> Below is an arbitrary data set produced from "state.name" (a built-in
> data set in R) and the associated script.
>
>
> d <- as.data.frame(c(state.name[30:40],
>                      paste(state.name[30:40], "University", sep=" "),
>                      paste("Th University of", state.name[30:40], sep=" "),
>                      paste("University o", state.name[30:40], sep=" ")))
> da <- sapply(d, as.character)   # factor to character transformation
>
> spl <- strsplit(da, " ")   # splitting components
>
> dd <- character(dim(da)[1])   # initializing empty vector
> for (i in 1:dim(da)[1]) {
>     if (sum(c("New", "Jersey", "University") %in% spl[[i]]) >= 3)
>         dd[i] <- "The University of New Jersey"
>     else if (sum(c("New", "Mexico", "University") %in% spl[[i]]) >= 3)
>         dd[i] <- "The University of New Mexico"
>     else if (sum(c("New", "York") %in% spl[[i]]) >= 2)
>         dd[i] <- "The University of New York"
>     else if (sum(c("North", "Carolina") %in% spl[[i]]) >= 2)
>         dd[i] <- "The University of North Carolina"
> }
>
> Note: above shows only partial (if/else if) conditions.

The if (cond) { } else { } construct is for program control rather
than revision of vectors. You should consider using the vectorised
ifelse(cond, val1, val2) construct instead.
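
As a rough illustration of that suggestion (a sketch, not tested against
your real data), the loop could be replaced by nested ifelse() calls whose
conditions are built with grepl(). The names x, has_uni, is_nj, is_nm,
is_ny and is_nc are made up for this example, and matching on the phrase
"New Jersey" rather than on separate words is an assumption that happens
to agree with the loop on this sample data.

```r
# Sketch only: the sample data rebuilt as a plain character vector.
x <- c(state.name[30:40],
       paste(state.name[30:40], "University"),
       paste("Th University of", state.name[30:40]),
       paste("University o", state.name[30:40]))

# Each condition is a logical vector with one element per name.
# fixed = TRUE asks for a literal substring match, not a regular expression.
has_uni <- grepl("University", x, fixed = TRUE)
is_nj <- grepl("New Jersey", x, fixed = TRUE) & has_uni
is_nm <- grepl("New Mexico", x, fixed = TRUE) & has_uni
is_ny <- grepl("New York", x, fixed = TRUE)        # loop did not require "University"
is_nc <- grepl("North Carolina", x, fixed = TRUE)  # loop did not require "University"

# Nested ifelse() mirrors the if / else if chain, applied to all elements at once.
dd <- ifelse(is_nj, "The University of New Jersey",
      ifelse(is_nm, "The University of New Mexico",
      ifelse(is_ny, "The University of New York",
      ifelse(is_nc, "The University of North Carolina", ""))))

table(dd)
```

Names that match none of the conditions are left as "", as in the loop.
With many target names a lookup table would probably scale better than a
long ifelse() chain.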

>
> Q1: The above "for" loop works fine (but is very slow on large data
> sets), so I would like to explore whether there is an alternative
> VECTORIZATION method that may speed up the process.
>
>
> Q2: Also, is there another way to extract a string from a phrase
> without using "%in%"?

Many grep-ish functions are available; they are vectorised regular
expression "machines".

?grep will show quite a few.
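
A few of them in action, as a minimal sketch. The vector x and the result
names are made up for illustration, and regmatches() assumes a reasonably
recent version of R.

```r
x <- c("ac dc", "dc", "New Jersey University")

# grepl() returns one TRUE/FALSE per element, so no strsplit() or %in% is needed.
has_ac <- grepl("ac", x, fixed = TRUE)
has_ac       # TRUE FALSE FALSE

# grep() returns the positions of the matching elements instead.
idx <- grep("ac", x, fixed = TRUE)
idx          # 1

# \\b marks a word boundary, which avoids partial matches such as "tack".
whole_word <- grepl("\\bac\\b", c("ac dc", "tack"), perl = TRUE)
whole_word   # TRUE FALSE

# regexpr() plus regmatches() pulls the matched text out of the strings that match.
state <- regmatches(x, regexpr("New [A-Za-z]+", x))
state        # "New Jersey"
```

Note that a plain substring test is looser than the %in% test on split
words, so word boundaries are worth adding when partial matches matter.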


>
> e.g.
>> "ac" %in% unlist(strsplit("ac dc", " "))
> [1] TRUE
>
-- 

David Winsemius, MD
Heritage Laboratories
West Hartford, CT



