[R] Split strings based on multiple patterns
ggrothendieck at gmail.com
Sat Oct 15 13:50:28 CEST 2016
Replace newlines and colons with a space, since they seem to be junk.
Then generate a pattern that matches any of the attribute names, replace
each match with a comma, and finally read what is left into a data frame
using the attributes as column names.
(I have indented each line of code below by 2 spaces so if any line
starts before that then it's been wrapped around by the email and
needs to be adjusted.)
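  attributes <-
  c("Water temp", "Waterbody type", "Water pH", "Conductivity",
  "Water color", "Water turbidity", "Manmade", "Permanence", "Max water depth",
  "Primary substrate", "Evidence of cattle grazing",
  "Shoreline Emergent Veg(%)", "Fish present", "Fish species")
  # newlines and colons become spaces
  ugly2 <- gsub("[:\n]", " ", ugly)
  # punctuation in the names becomes "." so it matches any one character
  pat <- paste(gsub("([[:punct:]])", ".", attributes), collapse = "|")
  # each attribute name becomes a comma
  ugly3 <- gsub(pat, ",", ugly2)
  # the first field is empty (text before the first comma), so drop it
  dd <- read.table(text = ugly3, sep = ",", strip.white = TRUE,
    col.names = c("", attributes))[-1]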
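
To see the steps in isolation, here is a minimal self-contained sketch on
made-up toy data. The names attrs and x and the two records are invented for
illustration and are not from the original data. It assumes every record
contains every attribute, in the same order, and that no value contains a
comma; otherwise the columns would not line up.

```r
# Hypothetical toy input: three attributes, two records
attrs <- c("Water temp", "Veg(%)", "Fish present")
x <- c("Water temp:14: F Veg(%): 1-25: Fish present: yes:",
       "Water temp:20: F Veg(%): 26-50: Fish present: no:")

# Step 1: colons and newlines become spaces
x2 <- gsub("[:\n]", " ", x)

# Step 2: build an alternation of the attribute names; punctuation is
# replaced by "." so regex metacharacters such as "(" need no escaping
pat <- paste(gsub("[[:punct:]]", ".", attrs), collapse = "|")

# Step 3: every attribute name becomes a comma
x3 <- gsub(pat, ",", x2)

# Step 4: parse as comma-separated text; the empty first field is dropped
dd <- read.table(text = x3, sep = ",", strip.white = TRUE,
                 col.names = c("", attrs),
                 stringsAsFactors = FALSE)[-1]

print(dd)
```

Note that read.table's default check.names = TRUE turns the column names
into syntactic ones, so "Water temp" becomes Water.temp.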
On Fri, Oct 14, 2016 at 7:16 PM, Joe Ceradini <joeceradini at gmail.com> wrote:
> I unfortunately inherited a dataframe with a column that has many fields
> smashed together. My goal is to split the strings in the column into
> separate columns based on patterns.
> Example of what I'm working with:
> ugly <- c("Water temp:14: F Waterbody type:Permanent Lake/Pond: Water
> Conductivity:Unkwn: Water color: Clear: Water turbidity: clear:
> Manmade:no Permanence:permanent: Max water depth: <3: Primary
> substrate: Silt/Mud: Evidence of cattle grazing: none:
> Shoreline Emergent Veg(%): 1-25: Fish present: yes: Fish species: unkwn: no
> amphibians observed")
> Far as I can tell, there is not a single pattern that would work for
> splitting this string. Splitting on ":" is close but not quite consistent.
> Each of these attributes should be a separate column:
> attributes <- c("Water temp", "Waterbody type", "Water pH", "Conductivity",
> "Water color", "Water turbidity", "Manmade", "Permanence", "Max water
> depth", "Primary substrate", "Evidence of cattle grazing", "Shoreline
> Emergent Veg(%)", "Fish present", "Fish species")
> So, conceptually, I want to do something like this, where the string is
> split for each of the patterns in attributes. However, strsplit only uses
> the 1st value of attributes
> strsplit(ugly, attributes)
> Should I loop through the values of "attributes"?
> Is there an argument in strsplit I'm missing that will do what I want?
> Different approach altogether?
> Thanks! Happy Friday.
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
email: ggrothendieck at gmail.com