[R] How to remove square brackets, etc. from address strings?

Mon May 28 01:04:11 CEST 2012

Hello,

On 27-05-2012 22:12, Sabina Arndt wrote:
> Hello r-help members,
>
> thank you very much for your reply, Rui Barradas.
>
> Unfortunately, I'm not sure if I understand it correctly: I don't know 
> how to create the vector's second element y that way. The pattern you 
> used has to be extracted from the address strings first. This is more 
> complex as I'd tried to explain in my previous posts. It finally seems 
> to work now.

Your data file has more than one line. I've called it "sabrina.txt" and 
then processed it with:

x <- readLines("sabrina.txt")

s <- strsplit(x, ";[[:space:]]\\[")
r <- lapply(s, function(x) sapply(strsplit(x, "[[:blank:]]"), tail, 1))

length(r)
[1] 21

So a vector 'y' and 19 others would have had to be created.
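
To see what the two strsplit() calls do, here is a minimal sketch on a 
made-up one-line input. The string 'demo' and the names 'parts' and 
'countries' are only for illustration; they are not from your file.

```r
# Hypothetical input line with two addresses (not from the real data file)
demo <- "[Doe, Jane; Roe, Max] Univ A, Dept X, Leipzig, Germany; [Poe, Ann] Univ B, Lisbon, Portugal"

# Step 1: split between addresses, i.e. at a semicolon followed by one
# whitespace character and an opening square bracket
parts <- strsplit(demo, ";[[:space:]]\\[")[[1]]

# Step 2: split each address on blanks and keep only the last word
countries <- sapply(strsplit(parts, "[[:blank:]]"), tail, 1)
countries
```

Note that keeping only the last blank-separated word would shorten a 
two-word country such as "New Zealand" to "Zealand".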

> Do you happen to have any idea on how I could put the country names 
> back into their original lines / order, though?

r[[21]] <- NULL
r[[20]] <- r[[20]][ -length(r[[20]]) ]
r1 <- lapply(r, function(x) x[nchar(x) > 0])
# use logical indexing: -which() would drop every element if nothing matched
country.list <- r1[ !sapply(r1, is.null) ]
# clean up
rm(s, r, r1)

# See what we have
country.list

As far as I can tell they're in the original order. But what do you mean 
by "back into their original lines"?
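
In case you mean one result per line of the file: lapply() returns one 
list element per input line, in the same order, so each line's countries 
can be pasted back into a single string. A sketch with made-up input, 
where 'lines.demo', 's.demo', 'r.demo' and 'per.line' are names I chose 
for illustration:

```r
# Two hypothetical input lines (not from the real data file)
lines.demo <- c(
  "[Doe, Jane] Univ A, Leipzig, Germany; [Roe, Max] Univ B, Lisbon, Portugal",
  "[Poe, Ann] Univ C, Paris, France"
)

# Same two steps as above: split into addresses, keep each last word
s.demo <- strsplit(lines.demo, ";[[:space:]]\\[")
r.demo <- lapply(s.demo, function(x) sapply(strsplit(x, "[[:blank:]]"), tail, 1))

# One string per input line, countries kept in their original order
per.line <- sapply(r.demo, paste, collapse = "; ")
per.line
```

Since 'per.line' has the same length as the input, it could be added as 
a new column next to the original address strings.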

> Thank you very much in advance!
Any time, glad to help.

Rui Barradas

> Faithfully yours,
>
> Sabina Arndt
>
>
> On 27.05.2012 19:04, Rui Barradas wrote:
>> Hello,
>>
>> Though I've not been following this thread, it seems like a regular 
>> expressions problem.
>> In the code below, I've created a 'testdata' variable based on your 
>> post.
>>
>> # create a vector with two elements.
>> x <- "[Engel, Kathrin M. Y.; Schroeck, ... etc ...
>> y <- gsub("Germany", "Portugal", x)
>> testdata <- c(x, y)
>>
>> # 's' is a list of character vectors; each element's final word is a 
>> # country
>> s <- strsplit(testdata, ";[[:space:]]+\\[")
>> lapply(s, function(x) sapply(strsplit(x, "[[:blank:]]"), tail, 1))
>>
>>
>> If this isn't it, sorry for the intrusion.
>>
>> Rui Barradas
>>
>> On 27-05-2012 17:29, Sabina Arndt wrote:
>>> Hello r-help members,
>>>
>>> I'm very grateful for the reply which Sarah Goslee sent to me in 
>>> such a prompt and helpful manner.
>>> It took me some time, but with a few amendments her suggestion now 
>>> works not only for an example but for my entire data file as well:
>>>
>>> > results
>>>   [1] "GERMANY"         "GERMANY"         "GERMANY"        "GERMANY"
>>>   [5] "GERMANY"         "GERMANY"         "GERMANY"        "GERMANY"
>>> ...
>>>
>>> Thank you very much for that, dear Sarah!
>>>
>>> All these names actually belong to the very first record, though, 
>>> which contains eight addresses instead of only one:
>>>
>>> > testdata[1]
>>>   [1] "[Engel, Kathrin M. Y.; Schroeck, Kristin; Schoeneberg, 
>>> Torsten; Schulz, Angela] Univ Leipzig, Fac Med, Inst Biochem, 
>>> Leipzig, Germany; [Teupser, Daniel; Holdt, Lesca Miriam; Thiery, 
>>> Joachim] Univ Leipzig, Fac Med, Inst Lab Med Clin Chem & Mol 
>>> Diagnost, Leipzig, Germany; [Toenjes, Anke; Kern, Matthias; Blueher, 
>>> Matthias; Stumvoll, Michael] Univ Leipzig, Fac Med, Dept Internal 
>>> Med, Leipzig, Germany; [Dietrich, Kerstin; Kovacs, Peter] Univ 
>>> Leipzig, Fac Med, Interdisciplinary Ctr Clin Res, Leipzig, Germany; 
>>> [Kruegel, Ute] Univ Leipzig, Fac Med, Rudolf Boehm Inst Pharmacol & 
>>> Toxicol, Leipzig, Germany; [Scheidt, Holger A.; Schiller, Juergen; 
>>> Huster, Daniel] Univ Leipzig, Fac Med, Inst Med Phys & Biophys, 
>>> Leipzig, Germany; [Brockmann, Gudrun A.] Humboldt Univ, Inst Anim 
>>> Sci, D-10099 Berlin, Germany; [Augustin, Martin] Ingenium Pharmaceut 
>>> AG, Martinsried, Germany"
>>> > results[1]
>>>   [1] "GERMANY"
>>>
>>> How can I put the country names back into their original lines / order?
>>> This is an example of the correct result I'd like to receive:
>>>
>>> > results[1]
>>>   [1] "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY" 
>>> "GERMANY" "GERMANY"
>>>
>>> How can I achieve this result?
>>>
>>> I think counting the semicolons outside the square brackets, i.e. the 
>>> ones that come after a "]" and before the next "[", would be helpful 
>>> in this regard, but unfortunately I'm not sure how to do that. These 
>>> semicolons directly follow the country names, e.g. "... Germany; [...".
>>> Adding 1 to their count gives the number of addresses for each 
>>> record / line.
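
(That count can be obtained with gregexpr(). A minimal sketch on a 
made-up record follows; 'rec', 'hits', 'n.sep' and 'n.addresses' are 
names chosen only for illustration.)

```r
# Made-up record with three addresses (illustration only)
rec <- "[Doe, Jane; Roe, Max] Univ A, Leipzig, Germany; [Poe, Ann] Univ B, Lisbon, Portugal; [Lee, Kim] Univ C, Paris, France"

# Start positions of every "; [" separator between two addresses;
# semicolons inside the brackets are not followed by "[" and are skipped
hits <- gregexpr(";[[:space:]]+\\[", rec)[[1]]

# gregexpr() returns -1 when nothing matches, so count only real matches
n.sep <- sum(hits > 0)
n.addresses <- n.sep + 1
n.addresses
```
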
>>>
>>> Thank you very much in advance!
>>>
>>> Faithfully yours,
>>>
>>> Sabina Arndt
>>>
>>>
>>> On 26.05.2012 00:19, Sarah Goslee wrote:
>>>> Part of your problem is that your regexes have spaces in them, so
>>>> that's what you're matching.
>>>>
>>>> A small reproducible example would be more useful. I'm not feeling
>>>> inclined to wade through all your linked files on Friday evening, but
>>>> see if this helps:
>>>>
>>>> > testdata <- "[Engel, Kathrin M. Y.; Schroeck, Kristin; Schoeneberg,
>>>> Torsten; Schulz, Angela] Univ Leipzig, Fac Med, Inst Biochem,
>>>> Leipzig, New Zealand; [Teupser, Daniel; Holdt, Lesca Miriam;
>>>> Thiery, Joachim] Univ Leipzig, Fac Med, Inst Lab Med Clin Chem &
>>>> Mol Diagnost, Leipzig, USA; [Toenjes, Anke; Kern, Matthias;
>>>> Blueher, Matthias; Stumvoll, Michael] Univ Leipzig, Fac Med, Dept
>>>> Internal Med, Leipzig, Germany; [Dietrich, Kerstin; Kovacs, Peter]
>>>> Univ Leipzig, Fac Med, Interdisciplinary Ctr Clin Res, Leipzig,
>>>> Germany; [Kruegel, Ute] Univ Leipzig, Fac Med, Rudolf Boehm Inst
>>>> Pharmacol & Toxicol, Leipzig, Germany; [Scheidt, Holger A.;
>>>> Schiller, Juergen; Huster, Daniel] Univ Leipzig, Fac Med, Inst Med
>>>> Phys & Biophys, Leipzig, Germany; [Brockmann, Gudrun A.] Humboldt
>>>> Univ, Inst Anim Sci, D-10099 Berlin, Germany; [Augustin, Martin]
>>>> Ingenium Pharmaceut AG, Martinsried, Germany"
>>>> > results <- gsub("\\[.*?\\]", "", testdata)
>>>> > results <- unlist(strsplit(results, ";"))
>>>> > results <- sapply(results, function(x) sub("^.*, ([A-Za-z ]*)$", "\\1", x))
>>>> > names(results) <- NULL
>>>> > results
>>>> [1] "New Zealand" "USA"         "Germany"     "Germany"     "Germany"
>>>>     "Germany"     "Germany"     "Germany"
>>>>
>>>>
>>>> Sarah
>>>>
>>>> On Fri, May 25, 2012 at 4:31 PM, Sabina Arndt <sabina.arndt at hotmail.de> wrote:
>>>>> Hello r-help members,
>>>>>
>>>>> the solutions which Sarah Goslee and arun sent to me in such a prompt
>>>>> and helpful manner work well with the examples I cut from the
>>>>> data.frame I'm analyzing. Thank you very much for that!
>>>>> I incorporated them into my R-script and discovered that it still
>>>>> doesn't work properly, unfortunately. I have no idea why that's the
>>>>> case.
>>>>> You see, I want to extract country names from the contents of
>>>>> tab-delimited text files. This is an example of the data I'm using:
>>>>> http://pastebin.com/mYZNDXg6
>>>>> This is the script I'm using to import the data:
>>>>> http://pastebin.com/Z10UUH3z (It requires the text files to be in a
>>>>> folder which doesn't contain any other .txt files.)
>>>>> This is the script I'm using to extract the country names:
>>>>> http://pastebin.com/G37fuPba
>>>>> This is the string that's in the relevant field of the first record
>>>>> I'm working on:
>>>>>
>>>>> [Engel, Kathrin M. Y.; Schroeck, Kristin; Schoeneberg, Torsten;
>>>>> Schulz, Angela] Univ Leipzig, Fac Med, Inst Biochem, Leipzig,
>>>>> Germany; [Teupser, Daniel; Holdt, Lesca Miriam; Thiery, Joachim]
>>>>> Univ Leipzig, Fac Med, Inst Lab Med Clin Chem & Mol Diagnost,
>>>>> Leipzig, Germany; [Toenjes, Anke; Kern, Matthias; Blueher, Matthias;
>>>>> Stumvoll, Michael] Univ Leipzig, Fac Med, Dept Internal Med,
>>>>> Leipzig, Germany; [Dietrich, Kerstin; Kovacs, Peter] Univ Leipzig,
>>>>> Fac Med, Interdisciplinary Ctr Clin Res, Leipzig, Germany;
>>>>> [Kruegel, Ute] Univ Leipzig, Fac Med, Rudolf Boehm Inst Pharmacol &
>>>>> Toxicol, Leipzig, Germany; [Scheidt, Holger A.; Schiller, Juergen;
>>>>> Huster, Daniel] Univ Leipzig, Fac Med, Inst Med Phys & Biophys,
>>>>> Leipzig, Germany; [Brockmann, Gudrun A.] Humboldt Univ, Inst Anim
>>>>> Sci, D-10099 Berlin, Germany; [Augustin, Martin] Ingenium
>>>>> Pharmaceut AG, Martinsried, Germany
>>>>>
>>>>> This is the incorrect result my extraction script gives me for the
>>>>> first record:
>>>>>
>>>>>> C1s[1]
>>>>>  [1] "[ENGEL,  KATHRIN M. Y." "KRISTIN"                "TORSTEN"
>>>>>  [4] "GERMANY"                "DANIEL"                 "LESCA MIRIAM"
>>>>>  [7] "GERMANY"                "ANKE"                   "MATTHIAS"
>>>>> [10] "MATTHIAS"               "GERMANY"                "KERSTIN"
>>>>> [13] "GERMANY"                "GERMANY"                "[SCHEIDT,  HOLGER A."
>>>>> [16] "JUERGEN"                "GERMANY"                "HUMBOLDT"
>>>>> [19] "GERMANY"
>>>>>
>>>>> For some reason the first and sixth pairs of the eight square
>>>>> brackets are not removed ... Do you understand why?
>>>>> Instead I'd like to get this result, though:
>>>>>
>>>>>> C1s[1]
>>>>>   [1] "GERMANY"        "GERMANY"        "GERMANY"
>>>>>   [4] "GERMANY"        "GERMANY"        "GERMANY"
>>>>>   [7] "HUMBOLDT"        "GERMANY"
>>>>>
>>>>> What am I doing wrong? What are the errors in my R-script?
>>>>> Would anybody be so kind as to take a look and help me out, please?
>>>>> Thank you very much in advance!
>>>>>
>>>>> Faithfully yours,
>>>>>
>>>>> Sabina Arndt