[R] Split strings based on multiple patterns (plain text)
Joe Ceradini
joeceradini at gmail.com
Sat Oct 15 22:32:18 CEST 2016
Thank you David Wolfskill, David Winsemius, and Gabor! All very
helpful and interesting fixes for the problem (compiled below)! Now I
will see which one works best on the 944 rows that each have a cell of
smooshed attributes...the attribute names should be the same in all
the rows, if there is any mercy :)
Joe Ceradini
University of Wyoming
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
On 10/14/16, David Wolfskill <david at catwhisker.org> wrote:
> Happy Friday, indeed.
>
> It seems to me that the data need a bit of cleanup before attempting to
> parse -- for example, that "F" looks to be improperly delimited by ':'
> on either side. I can't tell from a single example if that's typical
> (either for that field, or for random fields throughout the complete
> dataset). On the off-chance it's the former, here's a bit of exercise
> that may lead you a bit closer to a solution:
>
> First, starting with "ugly":
>
>> ugly <- c("Water temp:14: F Waterbody type:Permanent Lake/Pond: Water
>> pH:Unkwn: Conductivity:Unkwn: Water color: Clear: Water turbidity: clear:
>> Manmade:no Permanence:permanent: Max water depth: <3: Primary substrate:
>> Silt/Mud: Evidence of cattle grazing: none: Shoreline Emergent Veg(%):
>> 1-25: Fish present: yes: Fish species: unkwn: no amphibians observed")
>> ugly
> [1] "Water temp:14: F Waterbody type:Permanent Lake/Pond: Water pH:Unkwn:
> Conductivity:Unkwn: Water color: Clear: Water turbidity: clear: Manmade:no
> Permanence:permanent: Max water depth: <3: Primary substrate: Silt/Mud:
> Evidence of cattle grazing: none: Shoreline Emergent Veg(%): 1-25: Fish
> present: yes: Fish species: unkwn: no amphibians observed"
>
> # First, see what a naive strsplit() does:
>
>> strsplit(ugly, ":")
> [[1]]
> [1] "Water temp" "14"
> [3] " F Waterbody type" "Permanent Lake/Pond"
> [5] " Water pH" "Unkwn"
> [7] " Conductivity" "Unkwn"
> [9] " Water color" " Clear"
> [11] " Water turbidity" " clear"
> [13] " Manmade" "no Permanence"
> [15] "permanent" " Max water depth"
> [17] " <3" " Primary substrate"
> [19] " Silt/Mud" " Evidence of cattle grazing"
> [21] " none" " Shoreline Emergent Veg(%)"
> [23] " 1-25" " Fish present"
> [25] " yes" " Fish species"
> [27] " unkwn" " no amphibians observed"
>
> # OK; let's fix the "F":
>
>> ugly1 <- sub(": F ", "F: ", ugly)
>> ugly1
> [1] "Water temp:14F: Waterbody type:Permanent Lake/Pond: Water pH:Unkwn:
> Conductivity:Unkwn: Water color: Clear: Water turbidity: clear: Manmade:no
> Permanence:permanent: Max water depth: <3: Primary substrate: Silt/Mud:
> Evidence of cattle grazing: none: Shoreline Emergent Veg(%): 1-25: Fish
> present: yes: Fish species: unkwn: no amphibians observed"
>
> # Now, that substring "Manmade:no Permanence:permanent:" is problematic;
> # the " " in there should apparently be ": " -- but we can't just do that
> # to all " " substrings, because that would also affect
> # "Permanence:permanent: Max water depth: <3:" -- the difference, though,
> # is that the one we don't want to change contains ": ", so let's change
> # those. I'm assuming(!) that we don't really care about leading or
> # trailing spaces in the fields:
>
>> ugly2 <- gsub(" *: *", ":", ugly1)
>> ugly2
> [1] "Water temp:14F:Waterbody type:Permanent Lake/Pond:Water
> pH:Unkwn:Conductivity:Unkwn:Water color:Clear:Water
> turbidity:clear:Manmade:no Permanence:permanent:Max water depth:<3:Primary
> substrate:Silt/Mud:Evidence of cattle grazing:none:Shoreline Emergent
> Veg(%):1-25:Fish present:yes:Fish species:unkwn:no amphibians observed"
>
> # Now that " " shows up like a sore thumb. Just to make the point even
> # clearer, try the "naive" strsplit on what we have:
>
>> strsplit(ugly2, ":")
> [[1]]
> [1] "Water temp" "14F"
> [3] "Waterbody type" "Permanent Lake/Pond"
> [5] "Water pH" "Unkwn"
> [7] "Conductivity" "Unkwn"
> [9] "Water color" "Clear"
> [11] "Water turbidity" "clear"
> [13] "Manmade" "no Permanence"
> [15] "permanent" "Max water depth"
> [17] "<3" "Primary substrate"
> [19] "Silt/Mud" "Evidence of cattle grazing"
> [21] "none" "Shoreline Emergent Veg(%)"
> [23] "1-25" "Fish present"
> [25] "yes" "Fish species"
> [27] "unkwn" "no amphibians observed"
>
>>
>
> # Note element [14]: that's the one we need to fix. I'll assume(!)
> # that that sort of thing may occur just about anywhere, so let's just
> # whack 'em all:
>
>> ugly3 <- gsub(" ", ":", ugly2)
>> ugly3
> [1] "Water temp:14F:Waterbody type:Permanent Lake/Pond:Water
> pH:Unkwn:Conductivity:Unkwn:Water color:Clear:Water
> turbidity:clear:Manmade:no:Permanence:permanent:Max water depth:<3:Primary
> substrate:Silt/Mud:Evidence of cattle grazing:none:Shoreline Emergent
> Veg(%):1-25:Fish present:yes:Fish species:unkwn:no amphibians observed"
>
> # Again, check a naive strsplit():
>
>> strsplit(ugly3, ":")
> [[1]]
> [1] "Water temp" "14F"
> [3] "Waterbody type" "Permanent Lake/Pond"
> [5] "Water pH" "Unkwn"
> [7] "Conductivity" "Unkwn"
> [9] "Water color" "Clear"
> [11] "Water turbidity" "clear"
> [13] "Manmade" "no"
> [15] "Permanence" "permanent"
> [17] "Max water depth" "<3"
> [19] "Primary substrate" "Silt/Mud"
> [21] "Evidence of cattle grazing" "none"
> [23] "Shoreline Emergent Veg(%)" "1-25"
> [25] "Fish present" "yes"
> [27] "Fish species" "unkwn"
> [29] "no amphibians observed"
>
>>
>
> # OK; not what we want, but it's a lot closer. Now, watch this:
>
>> ugly4 <- gsub("([^:]*:[^:]*): *", "\\1\001", ugly3, perl = TRUE)
>> strsplit(ugly4, "\001")
> [[1]]
> [1] "Water temp:14F" "Waterbody type:Permanent Lake/Pond"
> [3] "Water pH:Unkwn" "Conductivity:Unkwn"
> [5] "Water color:Clear" "Water turbidity:clear"
> [7] "Manmade:no" "Permanence:permanent"
> [9] "Max water depth:<3" "Primary substrate:Silt/Mud"
> [11] "Evidence of cattle grazing:none" "Shoreline Emergent Veg(%):1-25"
> [13] "Fish present:yes" "Fish species:unkwn"
> [15] "no amphibians observed"
>
>>
>
> # At this point, at least elements [1] - [14] are each of the form
> # "tag:value", and thus, readily parsable. Element [15] appears to be
> # a somewhat-random comment; I suppose you could check for elements that
> # lack a (single) ':' and treat them "specially"....
>
> I hope that helps. Good luck!
>
> Peace,
> david
> --
> David H. Wolfskill david at catwhisker.org
> Those who would murder in the name of God or prophet are blasphemous
> cowards.
>
> See http://www.catwhisker.org/~david/publickey.gpg for my public key.
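
The last two steps of the message above (pair up "tag:value", then treat colon-less leftovers specially) can be sketched as follows. This is only an illustration, not David's code beyond the one gsub() call: the names `pieces`, `rec` and `comment` are made up here, and the sketch starts from `ugly3` as it was printed rather than recomputing it. With the single-spaced text as it appears in this post, gsub(" ", ":", ugly2) would also split multi-word tags such as "Water temp".

```r
# ugly3 as printed in the message above, rebuilt without line wraps.
ugly3 <- paste0(
  "Water temp:14F:Waterbody type:Permanent Lake/Pond:Water pH:Unkwn:",
  "Conductivity:Unkwn:Water color:Clear:Water turbidity:clear:",
  "Manmade:no:Permanence:permanent:Max water depth:<3:",
  "Primary substrate:Silt/Mud:Evidence of cattle grazing:none:",
  "Shoreline Emergent Veg(%):1-25:Fish present:yes:Fish species:unkwn:",
  "no amphibians observed")

# David's pairing step: mark the end of every "tag:value:" with \001,
# then split on that marker.
ugly4  <- gsub("([^:]*:[^:]*): *", "\\1\001", ugly3, perl = TRUE)
pieces <- strsplit(ugly4, "\001", fixed = TRUE)[[1]]

# Pieces that contain a ':' are tag:value pairs; anything else is
# kept aside as a free-text comment (an assumption about the data).
has_tag <- grepl(":", pieces, fixed = TRUE)
rec <- setNames(sub("^[^:]*:", "", pieces[has_tag]),   # values
                sub(":.*$", "", pieces[has_tag]))      # tags as names
comment <- pieces[!has_tag]

print(rec)
print(comment)
```

`rec` is a named character vector, so a single field can be pulled out with, for example, rec[["Max water depth"]].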
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
On 10/15/16, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> Replace newlines and colons with a space since they seem to be junk,
> generate a pattern to replace the attributes with a comma and do the
> replacement and finally read in what is left into a data frame using
> the attributes as column names.
>
> (I have indented each line of code below by 2 spaces so if any line
> starts before that then it's been wrapped around by the email and
> needs to be adjusted.)
>
>   attributes <-
>     c("Water temp", "Waterbody type", "Water pH", "Conductivity",
>       "Water color", "Water turbidity", "Manmade", "Permanence",
>       "Max water depth", "Primary substrate",
>       "Evidence of cattle grazing", "Shoreline Emergent Veg(%)",
>       "Fish present", "Fish species")
>
>   ugly2 <- gsub("[:\n]", " ", ugly)
>
>   pat <- paste(gsub("([[:punct:]])", ".", attributes), collapse = "|")
>   ugly3 <- gsub(pat, ",", ugly2)
>
>   dd <- read.table(text = ugly3, sep = ",", strip.white = TRUE,
>     col.names = c("", attributes))[-1]
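
Since the real data has many rows, here is a hedged sketch of Gabor's approach applied to a character vector with more than one element. The second row is invented for illustration, and the extra read.table() arguments (quote, comment.char, colClasses, check.names) are additions that are not in the message above. It also assumes that no value contains a comma.

```r
attributes <- c("Water temp", "Waterbody type", "Water pH", "Conductivity",
                "Water color", "Water turbidity", "Manmade", "Permanence",
                "Max water depth", "Primary substrate",
                "Evidence of cattle grazing", "Shoreline Emergent Veg(%)",
                "Fish present", "Fish species")

# Row 1 is the example from the thread; row 2 is made up.
ugly <- c(
  paste("Water temp:14: F Waterbody type:Permanent Lake/Pond: Water pH:Unkwn:",
        "Conductivity:Unkwn: Water color: Clear: Water turbidity: clear:",
        "Manmade:no Permanence:permanent: Max water depth: <3:",
        "Primary substrate: Silt/Mud: Evidence of cattle grazing: none:",
        "Shoreline Emergent Veg(%): 1-25: Fish present: yes:",
        "Fish species: unkwn: no amphibians observed"),
  paste("Water temp:20: F Waterbody type:Stream: Water pH:7:",
        "Conductivity:Unkwn: Water color: Brown: Water turbidity: turbid:",
        "Manmade:yes Permanence:seasonal: Max water depth: >3:",
        "Primary substrate: Gravel: Evidence of cattle grazing: heavy:",
        "Shoreline Emergent Veg(%): 26-50: Fish present: no:",
        "Fish species: none:"))

# The comma is used as the field separator below, so it must not
# already occur in the data (assumption).
stopifnot(!any(grepl(",", ugly, fixed = TRUE)))

flat <- gsub("[:\n]", " ", ugly)                  # colons/newlines are junk
pat  <- paste(gsub("([[:punct:]])", ".", attributes), collapse = "|")
csv  <- gsub(pat, ",", flat)                      # attribute name -> ","

dd <- read.table(text = csv, sep = ",", strip.white = TRUE,
                 quote = "", comment.char = "",   # keep ' " and # literal
                 colClasses = "character",        # no type guessing
                 check.names = FALSE,             # keep names with spaces
                 col.names = c("lead", attributes))[-1]

str(dd)
```

Any trailing free text, such as "no amphibians observed", ends up inside the last column ("Fish species") with this approach.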
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
On 10/15/16, David Winsemius <dwinsemius at comcast.net> wrote:
>
>> On Oct 14, 2016, at 6:53 PM, Joe Ceradini <joeceradini at gmail.com> wrote:
>>
>> Hopefully this looks better. I did not realize gmail default was html.
>>
>> I have a dataframe with a column that has many fields smashed together.
>> I need to split the strings in the column into separate columns based
>> on patterns.
>>
>> Example of a string that needs to be split:
>>
>> ugly <- c("Water temp:14: F Waterbody type:Permanent Lake/Pond: Water
>> pH:Unkwn: Conductivity:Unkwn: Water color: Clear: Water turbidity:
>> clear: Manmade:no Permanence:permanent: Max water depth: <3: Primary
>> substrate: Silt/Mud: Evidence of cattle grazing: none: Shoreline
>> Emergent Veg(%): 1-25: Fish present: yes: Fish species: unkwn: no
>> amphibians observed")
>> ugly
>>
>> Far as I can tell, there is not a single pattern that would work for
>> splitting. Splitting on ":" is close, but not quite right. Each of the
>> below attributes should be in a separate column, and are present in
>> the string (above) that needs to be split:
>>
>> attributes <- c("Water temp", "Waterbody type", "Water pH",
>> "Conductivity", "Water color", "Water turbidity", "Manmade",
>> "Permanence", "Max water depth", "Primary substrate", "Evidence of
>> cattle grazing", "Shoreline Emergent Veg(%)", "Fish present", "Fish
>> species")
>>
>> Conceptually, I want to use the vector of attributes to split the
>> string. However, strsplit only uses the 1st value of the attributes
>> object:
>>
>> strsplit(ugly, attributes).
>
> I tried this:
>
> strsplit( ugly, split=paste0(attributes, collapse="|") )
>
> And noticed some of the attributes were not actually splitting, so went back
> and did the data entry after making sure that there were no "\n"'s in the
> middle of attribute names:
>
> dput(attributes)
> c("Water temp", "Waterbody type", "Water pH", "Conductivity",
> "Water color", "Water turbidity", "Manmade", "Permanence",
> "Max water depth", "Primary substrate", "Evidence of cattle grazing",
> "Shoreline Emergent Veg(%)", "Fish present", "Fish species")
>
> strsplit( ugly, split=paste0(attributes, collapse="|") )
> [[1]]
> [1] ""
> [2] ":14: F "
> [3] ":Permanent Lake/Pond: Water\npH:Unkwn: "
> [4] ":Unkwn: "
> [5] ": Clear: "
> [6] ":\nclear: "
> [7] ":no "
> [8] ":permanent: "
> [9] ": <3: Primary\nsubstrate: Silt/Mud: Evidence of cattle grazing: none: Shoreline\nEmergent Veg(%): 1-25: "
> [10] ": yes: Fish species: unkwn: no\namphibians observed"
>
>>
>> Should I loop through the values of "attributes"?
>> Is there an argument in strsplit I'm missing that will do what I want?
>
> I don't think strsplit has such an argument. There may be packages that will
> support this. Perhaps the gsubfn package?
>
>
>> Different approach altogether?
>>
>> Thanks! Happy Friday.
>> Joe
>>
>
> David Winsemius
> Alameda, CA, USA
>
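
Two things seem to get in the way of the strsplit() call above: the line breaks inside the wrapped string, and the "(" and ")" in "Shoreline Emergent Veg(%)", which a regular expression reads as a group rather than as literal characters. A minimal base-R sketch that works around both follows. The \Q...\E quoting needs perl = TRUE, and the names `flat`, `found` and `values` are made up for this illustration.

```r
attributes <- c("Water temp", "Waterbody type", "Water pH", "Conductivity",
                "Water color", "Water turbidity", "Manmade", "Permanence",
                "Max water depth", "Primary substrate",
                "Evidence of cattle grazing", "Shoreline Emergent Veg(%)",
                "Fish present", "Fish species")

# The example string with the line breaks from the original post.
ugly <- paste(
  "Water temp:14: F Waterbody type:Permanent Lake/Pond: Water",
  "pH:Unkwn: Conductivity:Unkwn: Water color: Clear: Water turbidity:",
  "clear: Manmade:no Permanence:permanent: Max water depth: <3: Primary",
  "substrate: Silt/Mud: Evidence of cattle grazing: none: Shoreline",
  "Emergent Veg(%): 1-25: Fish present: yes: Fish species: unkwn: no",
  "amphibians observed", sep = "\n")

# 1. Collapse newlines and runs of blanks so multi-word names can match.
flat <- gsub("[[:space:]]+", " ", ugly)

# 2. Quote every attribute name literally and join with "|".
pat <- paste0("\\Q", attributes, "\\E", collapse = "|")

# 3. Split on the names, and record which names were actually found.
parts <- strsplit(flat, pat, perl = TRUE)[[1]]
found <- regmatches(flat, gregexpr(pat, flat, perl = TRUE))[[1]]

# parts[1] is whatever precedes the first name ("" here); drop it and
# trim the leftover colons and blanks from both ends of each value.
values <- gsub("^[: ]+|[: ]+$", "", parts[-1])
stopifnot(length(values) == length(found))
names(values) <- found

print(values)
```

The "F" that belongs to the temperature and the trailing comment stay attached to their neighbouring values ("14: F" and "unkwn: no amphibians observed"), so they would still need the kind of cleanup discussed earlier in the thread.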