[R] How to remove square brackets, etc. from address strings?

Sun May 27 19:04:28 CEST 2012

Hello,

I haven't been following this thread, but this looks like a regular 
expression problem.
In the code below, I've created a 'testdata' variable based on your post.

# create a vector with two elements.
x <- "[Engel, Kathrin M. Y.; Schroeck, ... etc ..."  # the full address string from your post goes here
y <- gsub("Germany", "Portugal", x)
testdata <- c(x, y)

# 's' is a list of character vectors; the final word of each element is a
# country
s <- strsplit(testdata, ";[[:space:]]+\\[")
lapply(s, function(x) sapply(strsplit(x, "[[:blank:]]"), tail, 1))
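
For a self-contained run, here is a minimal sketch of the same idea on a shortened, made-up record with only two addresses. The names 'short' and 'countries' are placeholders of mine and are not in the original post.

```r
# Shortened sample record with two addresses (illustrative, not the full data).
short <- paste0(
  "[Engel, Kathrin M. Y.; Schroeck, Kristin] Univ Leipzig, Fac Med, Leipzig, Germany; ",
  "[Augustin, Martin] Ingenium Pharmaceut AG, Martinsried, Germany"
)
# Second record: the same text with the country swapped.
testdata <- c(short, gsub("Germany", "Portugal", short))

# Split each record at the "; [" that separates two addresses.
# Semicolons inside the brackets are never followed by "[", so they are left alone.
s <- strsplit(testdata, ";[[:space:]]+\\[")

# Keep the last blank-separated word of every address, one vector per record.
countries <- lapply(s, function(x) sapply(strsplit(x, "[[:blank:]]"), tail, 1))
print(countries)
```

Because only the last word is kept, a multi-word country such as "New Zealand" would come out as "Zealand"; taking everything after the last comma instead would avoid that.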

If this isn't it, sorry for the intrusion.

Rui Barradas

On 27-05-2012 17:29, Sabina Arndt wrote:
> Hello r-help members,
>
> I'm very grateful for the reply which Sarah Goslee sent to me in such 
> a prompt and helpful manner.
> It took me some time, but with a few amendments her suggestion now 
> works not only for an example but for my entire data file as well:
>
> > results
>   [1] "GERMANY"         "GERMANY"         "GERMANY"        "GERMANY"
>   [5] "GERMANY"         "GERMANY"         "GERMANY"        "GERMANY"
> ...
>
> Thank you very much for that, dear Sarah!
>
> All these names actually belong to the very first record, though, 
> which contains eight addresses instead of only one:
>
> > testdata[1]
>   [1] "[Engel, Kathrin M. Y.; Schroeck, Kristin; Schoeneberg, Torsten; 
> Schulz, Angela] Univ Leipzig, Fac Med, Inst Biochem, Leipzig, Germany; 
> [Teupser, Daniel; Holdt, Lesca Miriam; Thiery, Joachim] Univ Leipzig, 
> Fac Med, Inst Lab Med Clin Chem & Mol Diagnost, Leipzig, Germany; 
> [Toenjes, Anke; Kern, Matthias; Blueher, Matthias; Stumvoll, Michael] 
> Univ Leipzig, Fac Med, Dept Internal Med, Leipzig, Germany; [Dietrich, 
> Kerstin; Kovacs, Peter] Univ Leipzig, Fac Med, Interdisciplinary Ctr 
> Clin Res, Leipzig, Germany; [Kruegel, Ute] Univ Leipzig, Fac Med, 
> Rudolf Boehm Inst Pharmacol & Toxicol, Leipzig, Germany; [Scheidt, 
> Holger A.; Schiller, Juergen; Huster, Daniel] Univ Leipzig, Fac Med, 
> Inst Med Phys & Biophys, Leipzig, Germany; [Brockmann, Gudrun A.] 
> Humboldt Univ, Inst Anim Sci, D-10099 Berlin, Germany; [Augustin, 
> Martin] Ingenium Pharmaceut AG, Martinsried, Germany"
> > results[1]
>   [1] "GERMANY"
>
> How can I put the country names back into their original lines / order?
> This is an example of the correct result I'd like to receive:
>
> > results[1]
>   [1] "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY" 
> "GERMANY" "GERMANY"
>
> How can I achieve this result?
>
> I think counting the semicolons outside square brackets - i.e. the 
> ones that come after a "]" and before a "[" - would be helpful in this 
> regard, but I'm not sure how to do that, unfortunately. These semicolons 
> directly follow the country names, e.g.: "... Germany; [..."
> If I add 1 to their count, I get the number of addresses for each 
> record / line.
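
The counting idea in the paragraph above could be sketched roughly like this. It assumes that every address after the first one starts with "[", as in the sample record; 'recs', 'hits', 'n_sep' and 'n_addr' are made-up names and the two strings are shortened stand-ins for real records.

```r
# Two made-up records: the first holds two addresses, the second holds one.
recs <- c(
  "[Engel, Kathrin; Schulz, Angela] Univ Leipzig, Leipzig, Germany; [Augustin, Martin] Ingenium Pharmaceut AG, Martinsried, Germany",
  "[Brockmann, Gudrun A.] Humboldt Univ, Inst Anim Sci, D-10099 Berlin, Germany"
)

# Positions of every ";" that is followed (after optional whitespace) by "[".
hits <- gregexpr(";[[:space:]]*\\[", recs)

# gregexpr() returns -1 when there is no match, so count only positive positions.
n_sep <- sapply(hits, function(m) sum(m > 0))

# Number of addresses per record = separators + 1.
n_addr <- n_sep + 1
print(n_addr)
```

The same numbers should also fall out of lengths(strsplit(recs, ";[[:space:]]+\\[")), since splitting at the separators yields one piece per address.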
>
> Thank you very much in advance!
>
> Faithfully yours,
>
> Sabina Arndt
>
>
> On 26.05.2012 00:19, Sarah Goslee wrote:
>> Part of your problem is that your regexes have spaces in them, so
>> that's what you're matching.
>>
>> A small reproducible example would be more useful. I'm not feeling
>> inclined to wade through all your linked files on Friday evening, but
>> see if this helps:
>>
>>> testdata <- "[Engel, Kathrin M. Y.; Schroeck, Kristin; Schoeneberg, 
>>> Torsten; Schulz, Angela] Univ Leipzig, Fac Med, Inst Biochem, 
>>> Leipzig, New Zealand; [Teupser, Daniel; Holdt, Lesca Miriam; Thiery, 
>>> Joachim] Univ Leipzig, Fac Med, Inst Lab Med Clin Chem & Mol 
>>> Diagnost, Leipzig, USA; [Toenjes, Anke; Kern, Matthias; Blueher, 
>>> Matthias; Stumvoll, Michael] Univ Leipzig, Fac Med, Dept Internal 
>>> Med, Leipzig, Germany; [Dietrich, Kerstin; Kovacs, Peter] Univ 
>>> Leipzig, Fac Med, Interdisciplinary Ctr Clin Res, Leipzig, Germany; 
>>> [Kruegel, Ute] Univ Leipzig, Fac Med, Rudolf Boehm Inst Pharmacol & 
>>> Toxicol, Leipzig, Germany; [Scheidt, Holger A.; Schiller, Juergen; 
>>> Huster, Daniel] Univ Leipzig, Fac Med, Inst Med Phys & Biophys, 
>>> Leipzig, Germany; [Brockmann, Gudrun A.] Humboldt Univ, Inst Anim 
>>> Sci, D-10099 Berlin, Germany; [Augustin, Martin] Ingenium Pharmaceut 
>>> AG, Martinsried, Germany"
>>> results <- gsub("\\[.*?\\]", "", testdata)
>>> results <- unlist(strsplit(results, ";"))
>>> results <- sapply(results, function(x) sub("^.*, ([A-Za-z ]*)$", "\\1", x))
>>> names(results) <- NULL
>>> results
>> [1] "New Zealand" "USA"         "Germany"     "Germany"     "Germany"
>>     "Germany"     "Germany"     "Germany"
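
To keep the countries grouped by record instead of flattening them with unlist(), a variation on the steps above might look like the sketch below. This is not the code from the thread: it swaps the final sub() pattern for "everything after the last comma", and 'recs', 'no_authors', 'addresses' and 'by_record' are hypothetical names used on shortened, made-up records.

```r
# Two made-up records: the first holds two addresses, the second holds one.
recs <- c(
  "[Engel, Kathrin; Schulz, Angela] Univ Leipzig, Leipzig, New Zealand; [Augustin, Martin] Ingenium Pharmaceut AG, Martinsried, Germany",
  "[Brockmann, Gudrun A.] Humboldt Univ, Inst Anim Sci, D-10099 Berlin, USA"
)

# Drop every "[...]" author block; ".*?" is lazy, so each block is removed separately.
no_authors <- gsub("\\[.*?\\]", "", recs, perl = TRUE)

# Only address separators remain as ";", so this gives one list element per record.
addresses <- strsplit(no_authors, ";")

# The country is taken to be whatever follows the last comma of an address.
by_record <- lapply(addresses, function(a) trimws(sub("^.*,", "", a)))
print(by_record)
```

Since strsplit() is applied to the whole vector without unlist(), the result stays aligned with the input: by_record[[i]] holds the countries of record i, and toupper() could be applied afterwards if upper-case names are wanted.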
>>
>>
>> Sarah
>>
>> On Fri, May 25, 2012 at 4:31 PM, Sabina Arndt <sabina.arndt at hotmail.de> wrote:
>>> Hello r-help members,
>>>
>>> the solutions which Sarah Goslee and arun sent to me in such a prompt and
>>> helpful manner work well with the examples I cut from the data.frame I'm
>>> analyzing. Thank you very much for that!
>>> I incorporated them into my R-script and discovered that it still doesn't
>>> work properly, unfortunately. I have no idea why that's the case.
>>> You see, I want to extract country names from the contents of tab-delimited
>>> text files. This is an example of the data I'm using:
>>> http://pastebin.com/mYZNDXg6
>>> This is the script I'm using to import the data:
>>> http://pastebin.com/Z10UUH3z (It requires the text files to be in a folder
>>> which doesn't contain any other .txt files.)
>>> This is the script I'm using to extract the country names:
>>> http://pastebin.com/G37fuPba
>>> This is the string that's in the relevant field of the first record I'm
>>> working on:
>>>
>>> [Engel, Kathrin M. Y.; Schroeck, Kristin; Schoeneberg, Torsten; Schulz,
>>> Angela] Univ Leipzig, Fac Med, Inst Biochem, Leipzig, Germany; [Teupser,
>>> Daniel; Holdt, Lesca Miriam; Thiery, Joachim] Univ Leipzig, Fac Med, Inst
>>> Lab Med Clin Chem & Mol Diagnost, Leipzig, Germany; [Toenjes, Anke; Kern,
>>> Matthias; Blueher, Matthias; Stumvoll, Michael] Univ Leipzig, Fac Med, Dept
>>> Internal Med, Leipzig, Germany; [Dietrich, Kerstin; Kovacs, Peter] Univ
>>> Leipzig, Fac Med, Interdisciplinary Ctr Clin Res, Leipzig, Germany;
>>> [Kruegel, Ute] Univ Leipzig, Fac Med, Rudolf Boehm Inst Pharmacol & Toxicol,
>>> Leipzig, Germany; [Scheidt, Holger A.; Schiller, Juergen; Huster, Daniel]
>>> Univ Leipzig, Fac Med, Inst Med Phys & Biophys, Leipzig, Germany;
>>> [Brockmann, Gudrun A.] Humboldt Univ, Inst Anim Sci, D-10099 Berlin,
>>> Germany; [Augustin, Martin] Ingenium Pharmaceut AG, Martinsried, Germany
>>>
>>> This is the incorrect result my extraction script gives me for the first
>>> record:
>>>
>>>> C1s[1]
>>>   [1] "[ENGEL,  KATHRIN M. Y." "KRISTIN"                "TORSTEN"
>>>   [4] "GERMANY"                "DANIEL"                 "LESCA MIRIAM"
>>>   [7] "GERMANY"                "ANKE"                   "MATTHIAS"
>>> [10] "MATTHIAS"               "GERMANY"                "KERSTIN"
>>> [13] "GERMANY"                "GERMANY"                "[SCHEIDT,  HOLGER A."
>>> [16] "JUERGEN"                "GERMANY"                "HUMBOLDT"
>>> [19] "GERMANY"
>>>
>>> For some reason the first and sixth pair of the eight square brackets are
>>> not removed ... Do you understand why?
>>> Instead I'd like to get this result, though:
>>>
>>>> C1s[1]
>>>   [1] "GERMANY"        "GERMANY"        "GERMANY"
>>>   [4] "GERMANY"        "GERMANY"        "GERMANY"
>>>   [7] "HUMBOLDT"        "GERMANY"
>>>
>>> What am I doing wrong? What are the errors in my R-script?
>>> Would anybody be so kind as to take a look and help me out, please?
>>> Thank you very much in advance!
>>>
>>> Faithfully yours,
>>>
>>> Sabina Arndt