[R] Character manipulation using "strsplit" & vectorization
David Winsemius
dwinsemius at comcast.net
Tue Sep 8 14:19:14 CEST 2009
On Sep 8, 2009, at 12:39 AM, Steven Kang wrote:
> Dear R users,
>
>
> Suppose I have a data set with inconsistent names for a field.
>
> I would like to make these names consistent.
>
> i.e
>
> "University of New Jersey", "New Jersey Uni", "New Jersey University"
> (3 different inconsistent names) to "The University of New Jersey"
> (consistent name)
>
> Below is an arbitrary data set produced from "state.name" (a built-in
> data set in R) and the associated script.
>
>
> d <- as.data.frame(c(state.name[30:40],
>   paste(state.name[30:40], "University", sep=" "),
>   paste("Th University of", state.name[30:40], sep=" "),
>   paste("University o", state.name[30:40], sep=" ")))
> da <- sapply(d, as.character) # factor to character transformation
>
> spl <- strsplit(da, " ") # splitting components
>
> dd <- character(dim(da)[1]) # initializing empty vector
> for (i in 1:dim(da)[1]) {
> if (sum(c("New", "Jersey", "University") %in% spl[[i]]) >= 3)
> dd[i] <- "The University of New Jersey"
> else if (sum(c("New", "Mexico", "University") %in% spl[[i]]) >= 3)
> dd[i] <- "The University of New Mexico"
> else if (sum(c("New", "York") %in% spl[[i]]) >= 2)
> dd[i] <- "The University of New York"
> else if (sum(c("North", "Carolina") %in% spl[[i]]) >= 2)
> dd[i] <- "The University of North Carolina"
> }
>
> Note: above shows only partial (if/else if) conditions.
The if (cond) { } else { } construct is for program control rather
than for revising vectors. You should consider using the vectorized
x <- ifelse(cond, val1, val2) construct.
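
A minimal sketch of that idea, assuming a plain character vector rather than your data frame. The names x, has_all and out are hypothetical, and only three of your conditions are shown. Each test is a logical vector as long as x, so nested ifelse() calls recode every element without an explicit loop:

```r
# Hypothetical sample input standing in for the real field.
x <- c("New Jersey University", "University of New Mexico",
       "New York", "Ohio")

spl <- strsplit(x, " ")   # split each name into its words

# has_all() is an illustrative helper, not a base R function:
# TRUE for each element whose words include every entry of `words`.
has_all <- function(words) {
  vapply(spl, function(w) all(words %in% w), logical(1))
}

# Nested ifelse(): the first matching condition wins,
# unmatched names are passed through unchanged.
out <- ifelse(has_all(c("New", "Jersey")), "The University of New Jersey",
       ifelse(has_all(c("New", "Mexico")), "The University of New Mexico",
       ifelse(has_all(c("New", "York")),   "The University of New York",
              x)))
print(out)
```

With many conditions the nesting gets hard to read, so this is probably only worth it for a handful of cases.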
>
> Q1: The above "for" loop works fine (but is very slow on large data
> sets), so I would like to explore whether there is an alternative
> VECTORIZATION method that may speed up the process.
>
>
> Q2: Also, is there another way to extract a string from a phrase
> without using "%in%"?
Many grep-like functions are available; they are vectorised regular
expression "machines" (grepl, sub, gsub and regexpr, among others).
?grep will show quite a few.
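
As a rough sketch of how grepl() could replace the row-by-row loop: the lookup vector canon and the names x and out below are hypothetical, and the patterns are simplified (they match on the state name alone and do not require "University", unlike your first two conditions). The loop runs once per pattern, not once per row, and each grepl() call is vectorised over the whole column:

```r
# Test data built the same way as in the question, kept as a character vector.
x <- c(state.name[30:33],
       paste(state.name[30:33], "University"),
       paste("University of", state.name[30:33]))

# Hypothetical lookup table: pattern -> consistent name.
canon <- c("New Jersey"     = "The University of New Jersey",
           "New Mexico"     = "The University of New Mexico",
           "New York"       = "The University of New York",
           "North Carolina" = "The University of North Carolina")

out <- x   # names that match no pattern are left unchanged
for (p in names(canon)) {
  # grepl() returns one TRUE/FALSE per element of x
  out[grepl(p, x, fixed = TRUE)] <- canon[[p]]
}
print(unique(out))
```

Adding another institution then means adding one entry to canon and leaves the loop untouched.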
>
> i.e
>> "ac" %in% unlist(strsplit("ac dc", " "))
> [1] TRUE
>
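
For that example, a sketch of a whole-word test with grepl(), assuming "word" means what the regular expression engine treats as one (see ?regex for \\b). The vector x and the result names are illustrative only:

```r
# Illustrative phrases: "ac" as a separate word, and embedded in longer words.
x <- c("ac dc", "acdc", "dc ac", "back")

# \\b marks a word boundary, so only a stand-alone "ac" matches.
by_grepl <- grepl("\\bac\\b", x)

# The strsplit() / %in% approach from the question, applied to every element.
by_in <- vapply(strsplit(x, " "), function(w) "ac" %in% w, logical(1))

print(by_grepl)
print(identical(by_grepl, by_in))
```

The two agree on these inputs; they can differ where words are separated by punctuation and not by a single space.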
--
David Winsemius, MD
Heritage Laboratories
West Hartford, CT