[R] How to remove square brackets, etc. from address strings?

Sat May 26 00:19:04 CEST 2012

Part of your problem is that your regexes contain literal spaces, so
they only match text that has spaces in exactly those positions.
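
To illustrate the point about spaces, here is a minimal sketch. The input string and the pattern with spaces are made up for the example and are not taken from your linked scripts:

```r
# Hypothetical input, shortened from the kind of address string in question
x <- "[Kruegel, Ute] Univ Leipzig, Leipzig, Germany"

# Pattern with literal spaces: only matches "[ ... ]" with a space
# right after "[" and right before "]"
with_spaces <- gsub("\\[ .*? \\]", "", x)

# Same pattern without the spaces
without_spaces <- gsub("\\[.*?\\]", "", x)

with_spaces     # unchanged, the brackets are still there
without_spaces  # " Univ Leipzig, Leipzig, Germany"
```

A space inside a regex is an ordinary character like any other, so it has to be present in the text for the pattern to match.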

A small reproducible example would be more useful. I'm not feeling
inclined to wade through all your linked files on Friday evening, but
see if this helps:

> testdata <- "[Engel, Kathrin M. Y.; Schroeck, Kristin; Schoeneberg, Torsten; Schulz, Angela] Univ Leipzig, Fac Med, Inst Biochem, Leipzig, New Zealand; [Teupser, Daniel; Holdt, Lesca Miriam; Thiery, Joachim] Univ Leipzig, Fac Med, Inst Lab Med Clin Chem & Mol Diagnost, Leipzig, USA; [Toenjes, Anke; Kern, Matthias; Blueher, Matthias; Stumvoll, Michael] Univ Leipzig, Fac Med, Dept Internal Med, Leipzig, Germany; [Dietrich, Kerstin; Kovacs, Peter] Univ Leipzig, Fac Med, Interdisciplinary Ctr Clin Res, Leipzig, Germany; [Kruegel, Ute] Univ Leipzig, Fac Med, Rudolf Boehm Inst Pharmacol & Toxicol, Leipzig, Germany; [Scheidt, Holger A.; Schiller, Juergen; Huster, Daniel] Univ Leipzig, Fac Med, Inst Med Phys & Biophys, Leipzig, Germany; [Brockmann, Gudrun A.] Humboldt Univ, Inst Anim Sci, D-10099 Berlin, Germany; [Augustin, Martin] Ingenium Pharmaceut AG, Martinsried, Germany"
> results <- gsub("\\[.*?\\]", "", testdata)
> results <- unlist(strsplit(results, ";"))
> results <- sapply(results, function(x)sub("^.*, ([A-Za-z ]*)$", "\\1", x))
> names(results) <- NULL
> results
[1] "New Zealand" "USA"         "Germany"     "Germany"     "Germany"
[6] "Germany"     "Germany"     "Germany"
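
If you need to do this for every record, the same steps could be wrapped in a function along these lines. This is only a sketch: extract_countries() is a name I made up, it takes everything after the last comma rather than restricting to letters, and trimws() needs R 3.2.0 or later.

```r
# extract_countries() is a hypothetical helper, not part of any package
extract_countries <- function(address) {
  # Drop the bracketed author lists; non-greedy, so each [...] goes separately
  no_authors <- gsub("\\[.*?\\]", "", address)
  # One element per affiliation
  parts <- unlist(strsplit(no_authors, ";", fixed = TRUE))
  # Keep whatever follows the last comma and strip surrounding whitespace
  trimws(sub("^.*,", "", parts))
}

# Shortened version of the test string above
short <- "[Brockmann, Gudrun A.] Humboldt Univ, Inst Anim Sci, D-10099 Berlin, Germany; [Augustin, Martin] Ingenium Pharmaceut AG, Martinsried, Germany"
countries <- extract_countries(short)
countries
```

Since sub() is vectorized there is no need for sapply() here, and the result carries no names to clear. An affiliation without any comma comes back unchanged, so you may want to check for those. For a whole column, something like lapply(yourdata$yourcolumn, extract_countries) should give one character vector per record.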

Sarah

On Fri, May 25, 2012 at 4:31 PM, Sabina Arndt <sabina.arndt at hotmail.de> wrote:
> Hello r-help members,
>
> the solutions which Sarah Goslee and arun sent to me in such a prompt and
> helpful manner work well with the examples I cut from the data.frame I'm
> analyzing. Thank you very much for that!
> I incorporated them into my R-script and discovered that it still doesn't
> work properly, unfortunately. I have no idea why that's the case.
> You see, I want to extract country names from the contents of tab-delimited
> text files. This is an example of the data I'm using:
> http://pastebin.com/mYZNDXg6
> This is the script I'm using to import the data:
> http://pastebin.com/Z10UUH3z (It requires the text files to be in a folder
> which doesn't contain any other .txt files.)
> This is the script I'm using to extract the country names:
> http://pastebin.com/G37fuPba
> This is the string that's in the relevant field of the first record I'm
> working on:
>
> [Engel, Kathrin M. Y.; Schroeck, Kristin; Schoeneberg, Torsten; Schulz,
> Angela] Univ Leipzig, Fac Med, Inst Biochem, Leipzig, Germany; [Teupser,
> Daniel; Holdt, Lesca Miriam; Thiery, Joachim] Univ Leipzig, Fac Med, Inst
> Lab Med Clin Chem & Mol Diagnost, Leipzig, Germany; [Toenjes, Anke; Kern,
> Matthias; Blueher, Matthias; Stumvoll, Michael] Univ Leipzig, Fac Med, Dept
> Internal Med, Leipzig, Germany; [Dietrich, Kerstin; Kovacs, Peter] Univ
> Leipzig, Fac Med, Interdisciplinary Ctr Clin Res, Leipzig, Germany;
> [Kruegel, Ute] Univ Leipzig, Fac Med, Rudolf Boehm Inst Pharmacol & Toxicol,
> Leipzig, Germany; [Scheidt, Holger A.; Schiller, Juergen; Huster, Daniel]
> Univ Leipzig, Fac Med, Inst Med Phys & Biophys, Leipzig, Germany;
> [Brockmann, Gudrun A.] Humboldt Univ, Inst Anim Sci, D-10099 Berlin,
> Germany; [Augustin, Martin] Ingenium Pharmaceut AG, Martinsried, Germany
>
> This is the incorrect result my extraction script gives me for the first
> record:
>
>> C1s[1]
>  [1] "[ENGEL,  KATHRIN M. Y." "KRISTIN"                "TORSTEN"
>  [4] "GERMANY"                "DANIEL"                 "LESCA MIRIAM"
>  [7] "GERMANY"                "ANKE"                   "MATTHIAS"
> [10] "MATTHIAS"               "GERMANY"                "KERSTIN"
> [13] "GERMANY"                "GERMANY"                "[SCHEIDT,  HOLGER A."
> [16] "JUERGEN"                "GERMANY"                "HUMBOLDT"
> [19] "GERMANY"
>
> For some reason the first and sixth pairs of the eight square brackets are
> not removed ... Do you understand why?
> Instead I'd like to get this result, though:
>
>> C1s[1]
>  [1] "GERMANY"        "GERMANY"        "GERMANY"
>  [4] "GERMANY"        "GERMANY"        "GERMANY"
>  [7] "GERMANY"        "GERMANY"
>
> What am I doing wrong? What are the errors in my R-script?
> Would anybody be so kind as to take a look and help me out, please?
> Thank you very much in advance!
>
> Faithfully yours,
>
> Sabina Arndt
>

-- 
Sarah Goslee
http://www.functionaldiversity.org