[R] How to remove square brackets, etc. from address strings?

Sun May 27 18:29:48 CEST 2012

Hello r-help members,

I'm very grateful for the reply which Sarah Goslee sent to me in such a 
prompt and helpful manner.
It took me some time, but with a few amendments her suggestion now works 
not only for an example but for my entire data file as well:

 > results
   [1] "GERMANY"         "GERMANY"         "GERMANY"        "GERMANY"
   [5] "GERMANY"         "GERMANY"         "GERMANY"        "GERMANY"
...

Thank you very much for that, dear Sarah!

All these names actually belong to the very first record, though, which 
contains eight addresses instead of only one:

 > testdata[1]
   [1] "[Engel, Kathrin M. Y.; Schroeck, Kristin; Schoeneberg, Torsten; 
Schulz, Angela] Univ Leipzig, Fac Med, Inst Biochem, Leipzig, Germany; 
[Teupser, Daniel; Holdt, Lesca Miriam; Thiery, Joachim] Univ Leipzig, 
Fac Med, Inst Lab Med Clin Chem & Mol Diagnost, Leipzig, Germany; 
[Toenjes, Anke; Kern, Matthias; Blueher, Matthias; Stumvoll, Michael] 
Univ Leipzig, Fac Med, Dept Internal Med, Leipzig, Germany; [Dietrich, 
Kerstin; Kovacs, Peter] Univ Leipzig, Fac Med, Interdisciplinary Ctr 
Clin Res, Leipzig, Germany; [Kruegel, Ute] Univ Leipzig, Fac Med, Rudolf 
Boehm Inst Pharmacol & Toxicol, Leipzig, Germany; [Scheidt, Holger A.; 
Schiller, Juergen; Huster, Daniel] Univ Leipzig, Fac Med, Inst Med Phys 
& Biophys, Leipzig, Germany; [Brockmann, Gudrun A.] Humboldt Univ, Inst 
Anim Sci, D-10099 Berlin, Germany; [Augustin, Martin] Ingenium 
Pharmaceut AG, Martinsried, Germany"
 > results[1]
   [1] "GERMANY"

How can I put the country names back into their original lines / order?
This is an example of the correct result I'd like to receive:

 > results[1]
   [1] "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY"

How can I achieve this result?
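
One possible way to keep the countries grouped by record, sketched here 
under assumptions rather than as a definitive answer: wrap the steps from 
Sarah's reply in a function and call it through lapply(), so that each 
record yields its own vector. The helper name extract_countries and the 
shortened two-record input are made up for illustration; the real input 
would be the full testdata vector.

```r
# Hypothetical helper: one upper-case country name per address in a record.
extract_countries <- function(record) {
  # Drop the bracketed author lists, including the semicolons inside them
  no_authors <- gsub("\\[[^]]*\\]", "", record)
  # The remaining semicolons separate the addresses
  addresses <- trimws(strsplit(no_authors, ";", fixed = TRUE)[[1]])
  # Keep only the text after the last comma of each address
  toupper(sub("^.*, *", "", addresses))
}

# Made-up, shortened records (assumption: one string per record)
records <- c(
  "[Engel, Kathrin M. Y.; Schroeck, Kristin] Univ Leipzig, Fac Med, Leipzig, Germany; [Augustin, Martin] Ingenium Pharmaceut AG, Martinsried, Germany",
  "[Doe, Jane] Example Univ, Auckland, New Zealand"
)

# lapply() returns a list, so each record keeps its own vector of countries
results <- lapply(records, extract_countries)
results[[1]]
results[[2]]
```

Because the result is a list, results[[1]] holds all countries of the 
first record, however many addresses that record contains.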

I think counting the semicolons outside the square brackets, i.e. the ones 
that come after a "]" and before the next "[", would help here, but 
unfortunately I'm not sure how to do that. These semicolons directly follow 
the country names, e.g.: "... Germany; [..."
Adding 1 to their count gives the number of addresses in each record / line.
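
A minimal sketch of this counting idea, assuming each record is a single 
string and that semicolons inside square brackets only separate author 
names. The helper name count_addresses and the shortened example record 
are hypothetical.

```r
# Hypothetical helper: number of addresses in one record.
count_addresses <- function(record) {
  # Remove the bracketed author lists so only the separating semicolons remain
  no_authors <- gsub("\\[[^]]*\\]", "", record)
  # gregexpr() returns -1 when nothing matches, so count only positive positions
  hits <- gregexpr(";", no_authors, fixed = TRUE)[[1]]
  sum(hits > 0) + 1
}

# Made-up, shortened record with three addresses
record <- "[Engel, Kathrin M. Y.; Schroeck, Kristin] Univ Leipzig, Leipzig, Germany; [Kruegel, Ute] Univ Leipzig, Leipzig, Germany; [Augustin, Martin] Ingenium Pharmaceut AG, Martinsried, Germany"

n <- count_addresses(record)
n  # two semicolons outside brackets, plus one
```

Counts obtained this way for every record could then serve as group sizes 
when regrouping a flat vector of country names, for example with rep() 
and split().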

Thank you very much in advance!

Faithfully yours,

Sabina Arndt

On 26.05.2012 00:19, Sarah Goslee wrote:
> Part of your problem is that your regexes have spaces in them, so
> that's what you're matching.
>
> A small reproducible example would be more useful. I'm not feeling
> inclined to wade through all your linked files on Friday evening, but
> see if this helps:
>
>> testdata <- "[Engel, Kathrin M. Y.; Schroeck, Kristin; Schoeneberg, Torsten; Schulz, Angela] Univ Leipzig, Fac Med, Inst Biochem, Leipzig, New Zealand; [Teupser, Daniel; Holdt, Lesca Miriam; Thiery, Joachim] Univ Leipzig, Fac Med, Inst Lab Med Clin Chem & Mol Diagnost, Leipzig, USA; [Toenjes, Anke; Kern, Matthias; Blueher, Matthias; Stumvoll, Michael] Univ Leipzig, Fac Med, Dept Internal Med, Leipzig, Germany; [Dietrich, Kerstin; Kovacs, Peter] Univ Leipzig, Fac Med, Interdisciplinary Ctr Clin Res, Leipzig, Germany; [Kruegel, Ute] Univ Leipzig, Fac Med, Rudolf Boehm Inst Pharmacol & Toxicol, Leipzig, Germany; [Scheidt, Holger A.; Schiller, Juergen; Huster, Daniel] Univ Leipzig, Fac Med, Inst Med Phys & Biophys, Leipzig, Germany; [Brockmann, Gudrun A.] Humboldt Univ, Inst Anim Sci, D-10099 Berlin, Germany; [Augustin, Martin] Ingenium Pharmaceut AG, Martinsried, Germany"
>> results <- gsub("\\[.*?\\]", "", testdata)
>> results <- unlist(strsplit(results, ";"))
>> results <- sapply(results, function(x) sub("^.*, ([A-Za-z ]*)$", "\\1", x))
>> names(results) <- NULL
>> results
> [1] "New Zealand" "USA"         "Germany"     "Germany"     "Germany"
>     "Germany"     "Germany"     "Germany"
>
>
> Sarah
>
> On Fri, May 25, 2012 at 4:31 PM, Sabina Arndt <sabina.arndt at hotmail.de> wrote:
>> Hello r-help members,
>>
>> the solutions which Sarah Goslee and arun sent to me in such a prompt and
>> helpful manner work well with the examples I cut from the data.frame I'm
>> analyzing. Thank you very much for that!
>> I incorporated them into my R-script and discovered that it still doesn't
>> work properly, unfortunately. I have no idea why that's the case.
>> You see, I want to extract country names from the contents of tab-delimited
>> text files. This is an example of the data I'm using:
>> http://pastebin.com/mYZNDXg6
>> This is the script I'm using to import the data:
>> http://pastebin.com/Z10UUH3z (It requires the text files to be in a folder
>> which doesn't contain any other .txt files.)
>> This is the script I'm using to extract the country names:
>> http://pastebin.com/G37fuPba
>> This is the string that's in the relevant field of the first record I'm
>> working on:
>>
>> [Engel, Kathrin M. Y.; Schroeck, Kristin; Schoeneberg, Torsten; Schulz,
>> Angela] Univ Leipzig, Fac Med, Inst Biochem, Leipzig, Germany; [Teupser,
>> Daniel; Holdt, Lesca Miriam; Thiery, Joachim] Univ Leipzig, Fac Med, Inst
>> Lab Med Clin Chem & Mol Diagnost, Leipzig, Germany; [Toenjes, Anke; Kern,
>> Matthias; Blueher, Matthias; Stumvoll, Michael] Univ Leipzig, Fac Med, Dept
>> Internal Med, Leipzig, Germany; [Dietrich, Kerstin; Kovacs, Peter] Univ
>> Leipzig, Fac Med, Interdisciplinary Ctr Clin Res, Leipzig, Germany;
>> [Kruegel, Ute] Univ Leipzig, Fac Med, Rudolf Boehm Inst Pharmacol & Toxicol,
>> Leipzig, Germany; [Scheidt, Holger A.; Schiller, Juergen; Huster, Daniel]
>> Univ Leipzig, Fac Med, Inst Med Phys & Biophys, Leipzig, Germany;
>> [Brockmann, Gudrun A.] Humboldt Univ, Inst Anim Sci, D-10099 Berlin,
>> Germany; [Augustin, Martin] Ingenium Pharmaceut AG, Martinsried, Germany
>>
>> This is the incorrect result my extraction script gives me for the first
>> record:
>>
>>> C1s[1]
>>   [1] "[ENGEL,  KATHRIN M. Y." "KRISTIN"                "TORSTEN"
>>   [4] "GERMANY"                "DANIEL"                 "LESCA MIRIAM"
>>   [7] "GERMANY"                "ANKE"                   "MATTHIAS"
>> [10] "MATTHIAS"               "GERMANY"                "KERSTIN"
>> [13] "GERMANY"                "GERMANY"                "[SCHEIDT,  HOLGER A."
>> [16] "JUERGEN"                "GERMANY"                "HUMBOLDT"
>> [19] "GERMANY"
>>
>> For some reason the first and sixth pair of the eight square brackets are
>> not removed ... Do you understand why?
>> Instead I'd like to get this result, though:
>>
>>> C1s[1]
>>   [1] "GERMANY"        "GERMANY"        "GERMANY"
>>   [4] "GERMANY"        "GERMANY"        "GERMANY"
>>   [7] "GERMANY"        "GERMANY"
>>
>> What am I doing wrong? What are the errors in my R-script?
>> Would anybody be so kind as to take a look and help me out, please?
>> Thank you very much in advance!
>>
>> Faithfully yours,
>>
>> Sabina Arndt