[R] How to remove square brackets, etc. from address strings?
Rui Barradas
ruipbarradas at sapo.pt
Mon May 28 01:04:11 CEST 2012
Hello,
On 27-05-2012 22:12, Sabina Arndt wrote:
> Hello r-help members,
>
> thank you very much for your reply, Rui Barradas.
>
> Unfortunately, I'm not sure if I understand it correctly: I don't know
> how to create the vector's second element y that way. The pattern you
> used has to be extracted from the address strings first. This is more
> complex, as I'd tried to explain in my previous posts. It finally seems
> to work now.
Your data file has more than one line. I've called it "sabrina.txt" and
then processed it with:
x <- readLines("sabrina.txt")
s <- strsplit(x, ";[[:space:]]\\[")
r <- lapply(s, function(x) sapply(strsplit(x, "[[:blank:]]"), tail, 1))
length(r)
[1] 21
So a vector 'y' and 19 others would have been needed.
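To make the two calls above easier to follow, here is a minimal sketch on a made-up two-record sample. The vector `recs` and the names `last_word` and `last_field` are illustrative only and do not come from the data file. The sketch also shows a possible variant that takes everything after the last comma, since the last-word approach would cut a multi-word country name down to its final word.

```r
# Made-up sample: two records; within a record, addresses are separated by "; ["
recs <- c(
  "[Engel, Kathrin] Univ Leipzig, Leipzig, Germany; [Augustin, Martin] Ingenium AG, Martinsried, Germany",
  "[Smith, John] Univ Otago, Dunedin, New Zealand"
)

# One list element per record, one string per address
s <- strsplit(recs, ";[[:space:]]+\\[")

# Last blank-separated word of each address (the approach used above)
last_word <- lapply(s, function(x) sapply(strsplit(x, "[[:blank:]]"), tail, 1))

# Hypothetical variant: everything after the last comma, whitespace trimmed
last_field <- lapply(s, function(x) trimws(sub("^.*,", "", x)))

last_word
last_field
```

Which variant fits better depends on whether every address really ends in ", <country>"; that is an assumption about the data, not something checked here.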
> Do you happen to have any idea on how I could put the country names
> back into their original lines / order, though?
r[[21]] <- NULL
r[[20]] <- r[[20]][ -length(r[[20]]) ]
r1 <- lapply(r, function(x) x[nchar(x) > 0])
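# drop NULL elements, if any; logical indexing also works when none are NULL
# (r1[ -which(...) ] would return an empty list in that case)
country.list <- r1[ !vapply(r1, is.null, logical(1)) ]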
# clean up
rm(s, r, r1)
# See what we have
country.list
As far as I can tell they're in the original order. But what do you mean
by "back into their original lines"?
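If "original lines" means knowing which record (line of the file) each country came from, one possible sketch follows. It assumes that `country.list` holds one character vector per record, in file order; the small `country.list` below is only a stand-in, and the names `n.addresses`, `long` and `back` are illustrative.

```r
# Stand-in for the list built above: one character vector per record
country.list <- list(c("Germany", "Germany", "Germany"),
                     "Portugal",
                     c("USA", "Germany"))

# Number of addresses found in each record
n.addresses <- lengths(country.list)

# Long format: one row per address, tagged with the record it came from
long <- data.frame(
  record  = rep(seq_along(country.list), times = n.addresses),
  country = unlist(country.list),
  stringsAsFactors = FALSE
)

# Split back into per-record vectors, in the original record order
back <- split(long$country, long$record)

long
back
```

The `record` column can then be used to merge the countries with the other fields of the same line. Note that a record without any address contributes no rows to `long`, so it would be missing from `back` in this sketch.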
> Thank you very much in advance!
Any time, glad to help.
Rui Barradas
> Faithfully yours,
>
> Sabina Arndt
>
>
> On 27.05.2012 19:04, Rui Barradas wrote:
>> Hello,
>>
>> Though I've not been following this thread, it seems like a regular
>> expressions problem.
>> In the code below, I've created a 'testdata' variable based on your
>> post.
>>
>> # create a vector with two elements.
>> x <- "[Engel, Kathrin M. Y.; Schroeck, ... etc ...
>> y <- gsub("Germany", "Portugal", x)
>> testdata <- c(x, y)
>>
>> # 's' is a list of character vectors; each element's final word is a
>> # country
>> s <- strsplit(testdata, ";[[:space:]]+\\[")
>> lapply(s, function(x) sapply(strsplit(x, "[[:blank:]]"), tail, 1))
>>
>>
>> If this isn't it, sorry for the intrusion.
>>
>> Rui Barradas
>>
>> On 27-05-2012 17:29, Sabina Arndt wrote:
>>> Hello r-help members,
>>>
>>> I'm very grateful for the reply which Sarah Goslee sent to me in
>>> such a prompt and helpful manner.
>>> It took me some time, but with a few amendments her suggestion now
>>> works not only for an example but for my entire data file as well:
>>>
>>> > results
>>> [1] "GERMANY" "GERMANY" "GERMANY" "GERMANY"
>>> [5] "GERMANY" "GERMANY" "GERMANY" "GERMANY"
>>> ...
>>>
>>> Thank you very much for that, dear Sarah!
>>>
>>> All these names actually belong to the very first record, though,
>>> which contains eight addresses instead of only one:
>>>
>>> > testdata[1]
>>> [1] "[Engel, Kathrin M. Y.; Schroeck, Kristin; Schoeneberg,
>>> Torsten; Schulz, Angela] Univ Leipzig, Fac Med, Inst Biochem,
>>> Leipzig, Germany; [Teupser, Daniel; Holdt, Lesca Miriam; Thiery,
>>> Joachim] Univ Leipzig, Fac Med, Inst Lab Med Clin Chem & Mol
>>> Diagnost, Leipzig, Germany; [Toenjes, Anke; Kern, Matthias; Blueher,
>>> Matthias; Stumvoll, Michael] Univ Leipzig, Fac Med, Dept Internal
>>> Med, Leipzig, Germany; [Dietrich, Kerstin; Kovacs, Peter] Univ
>>> Leipzig, Fac Med, Interdisciplinary Ctr Clin Res, Leipzig, Germany;
>>> [Kruegel, Ute] Univ Leipzig, Fac Med, Rudolf Boehm Inst Pharmacol &
>>> Toxicol, Leipzig, Germany; [Scheidt, Holger A.; Schiller, Juergen;
>>> Huster, Daniel] Univ Leipzig, Fac Med, Inst Med Phys & Biophys,
>>> Leipzig, Germany; [Brockmann, Gudrun A.] Humboldt Univ, Inst Anim
>>> Sci, D-10099 Berlin, Germany; [Augustin, Martin] Ingenium Pharmaceut
>>> AG, Martinsried, Germany"
>>> > results[1]
>>> [1] "GERMANY"
>>>
>>> How can I put the country names back into their original lines / order?
>>> This is an example of the correct result I'd like to receive:
>>>
>>> > results[1]
>>> [1] "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY"
>>> "GERMANY" "GERMANY"
>>>
>>> How can I achieve this result?
>>>
>>> I think counting the semicolons outside square brackets - i.e. the
>>> ones before a "[" but behind a "]" would be helpful in this regard,
>>> but I'm not sure how to do that, unfortunately. These semicolons
>>> directly follow the country names, like this, e.g.: "... Germany; [..."
>>> If I add "+ 1" to their number it results in the number of addresses
>>> for each record / line.
>>>
>>> Thank you very much in advance!
>>>
>>> Faithfully yours,
>>>
>>> Sabina Arndt
>>>
>>>
>>> On 26.05.2012 00:19, Sarah Goslee wrote:
>>>> Part of your problem is that your regexes have spaces in them, so
>>>> that's what you're matching.
>>>>
>>>> A small reproducible example would be more useful. I'm not feeling
>>>> inclined to wade through all your linked files on Friday evening, but
>>>> see if this helps:
>>>>
>>>>> testdata <- "[Engel, Kathrin M. Y.; Schroeck, Kristin; Schoeneberg,
>>>>> Torsten; Schulz, Angela] Univ Leipzig, Fac Med, Inst Biochem,
>>>>> Leipzig, New Zealand; [Teupser, Daniel; Holdt, Lesca Miriam;
>>>>> Thiery, Joachim] Univ Leipzig, Fac Med, Inst Lab Med Clin Chem &
>>>>> Mol Diagnost, Leipzig, USA; [Toenjes, Anke; Kern, Matthias;
>>>>> Blueher, Matthias; Stumvoll, Michael] Univ Leipzig, Fac Med, Dept
>>>>> Internal Med, Leipzig, Germany; [Dietrich, Kerstin; Kovacs, Peter]
>>>>> Univ Leipzig, Fac Med, Interdisciplinary Ctr Clin Res, Leipzig,
>>>>> Germany; [Kruegel, Ute] Univ Leipzig, Fac Med, Rudolf Boehm Inst
>>>>> Pharmacol & Toxicol, Leipzig, Germany; [Scheidt, Holger A.;
>>>>> Schiller, Juergen; Huster, Daniel] Univ Leipzig, Fac Med, Inst Med
>>>>> Phys & Biophys, Leipzig, Germany; [Brockmann, Gudrun A.] Humboldt
>>>>> Univ, Inst Anim Sci, D-10099 Berlin, Germany; [Augustin, Martin]
>>>>> Ingenium Pharmaceut AG, Martinsried, Germany"
>>>>> results <- gsub("\\[.*?\\]", "", testdata)
>>>>> results <- unlist(strsplit(results, ";"))
>>>>> results <- sapply(results, function(x) sub("^.*, ([A-Za-z ]*)$", "\\1", x))
>>>>> names(results) <- NULL
>>>>> results
>>>> [1] "New Zealand" "USA" "Germany" "Germany" "Germany"
>>>> "Germany" "Germany" "Germany"
>>>>
>>>>
>>>> Sarah
>>>>
>>>> On Fri, May 25, 2012 at 4:31 PM, Sabina Arndt <sabina.arndt at hotmail.de> wrote:
>>>>> Hello r-help members,
>>>>>
>>>>> the solutions which Sarah Goslee and arun sent to me in such a prompt
>>>>> and helpful manner work well with the examples I cut from the
>>>>> data.frame I'm analyzing. Thank you very much for that!
>>>>> I incorporated them into my R-script and discovered that it still
>>>>> doesn't work properly, unfortunately. I have no idea why that's the
>>>>> case.
>>>>> You see, I want to extract country names from the contents of
>>>>> tab-delimited text files. This is an example of the data I'm using:
>>>>> http://pastebin.com/mYZNDXg6
>>>>> This is the script I'm using to import the data:
>>>>> http://pastebin.com/Z10UUH3z (It requires the text files to be in a
>>>>> folder which doesn't contain any other .txt files.)
>>>>> This is the script I'm using to extract the country names:
>>>>> http://pastebin.com/G37fuPba
>>>>> This is the string that's in the relevant field of the first record
>>>>> I'm working on:
>>>>>
>>>>> [Engel, Kathrin M. Y.; Schroeck, Kristin; Schoeneberg, Torsten;
>>>>> Schulz, Angela] Univ Leipzig, Fac Med, Inst Biochem, Leipzig, Germany;
>>>>> [Teupser, Daniel; Holdt, Lesca Miriam; Thiery, Joachim] Univ Leipzig,
>>>>> Fac Med, Inst Lab Med Clin Chem & Mol Diagnost, Leipzig, Germany;
>>>>> [Toenjes, Anke; Kern, Matthias; Blueher, Matthias; Stumvoll, Michael]
>>>>> Univ Leipzig, Fac Med, Dept Internal Med, Leipzig, Germany;
>>>>> [Dietrich, Kerstin; Kovacs, Peter] Univ Leipzig, Fac Med,
>>>>> Interdisciplinary Ctr Clin Res, Leipzig, Germany; [Kruegel, Ute] Univ
>>>>> Leipzig, Fac Med, Rudolf Boehm Inst Pharmacol & Toxicol, Leipzig,
>>>>> Germany; [Scheidt, Holger A.; Schiller, Juergen; Huster, Daniel] Univ
>>>>> Leipzig, Fac Med, Inst Med Phys & Biophys, Leipzig, Germany;
>>>>> [Brockmann, Gudrun A.] Humboldt Univ, Inst Anim Sci, D-10099 Berlin,
>>>>> Germany; [Augustin, Martin] Ingenium Pharmaceut AG, Martinsried,
>>>>> Germany
>>>>>
>>>>> This is the incorrect result my extraction script gives me for the
>>>>> first record:
>>>>>
>>>>>> C1s[1]
>>>>> [1] "[ENGEL, KATHRIN M. Y." "KRISTIN" "TORSTEN"
>>>>> [4] "GERMANY" "DANIEL" "LESCA MIRIAM"
>>>>> [7] "GERMANY" "ANKE" "MATTHIAS"
>>>>> [10] "MATTHIAS" "GERMANY" "KERSTIN"
>>>>> [13] "GERMANY" "GERMANY" "[SCHEIDT, HOLGER A."
>>>>> [16] "JUERGEN" "GERMANY" "HUMBOLDT"
>>>>> [19] "GERMANY"
>>>>>
>>>>> For some reason the first and sixth pairs of the eight square
>>>>> brackets are not removed ... Do you understand why?
>>>>> Instead I'd like to get this result, though:
>>>>>
>>>>>> C1s[1]
>>>>> [1] "GERMANY" "GERMANY" "GERMANY"
>>>>> [4] "GERMANY" "GERMANY" "GERMANY"
>>>>> [7] "HUMBOLDT" "GERMANY"
>>>>>
>>>>> What am I doing wrong? What are the errors in my R-script?
>>>>> Would anybody be so kind as to take a look and help me out, please?
>>>>> Thank you very much in advance!
>>>>>
>>>>> Faithfully yours,
>>>>>
>>>>> Sabina Arndt