[R] translating HTML character entities to accented characters
David L Carlson
dcarlson at tamu.edu
Sun Aug 12 22:36:44 CEST 2012
This may work for your needs with a little fine tuning. Special and accented
characters can be represented in HTML with a character name or a numeric
value. For example, " can be represented as &quot; or as &#34; and it
appears from your example that both are used. I've attached a
dput(HTMLChars) to the end of this message with the concordances. The
following works on your data, but I haven't included any error checking.
Assuming your .csv file is called txt and the data.frame HTMLChars is
loaded:
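# Search for &Name; (alnum, so names such as &sup2; and &frac14; also match)
lsta <- unique(unlist(regmatches(txt, gregexpr("&[[:alnum:]]+;", txt))))
lsta <- data.frame(Name=lsta)
matches <- merge(HTMLChars, lsta)
# seq_len() skips the loop when nothing matched; fixed=TRUE treats each
# entity as a literal string; as.character() guards against factor columns
for (i in seq_len(nrow(matches))) {
   txt <- gsub(as.character(matches$Name[i]),
               as.character(matches$Character[i]), txt, fixed=TRUE)
}
# Search for &#Number;
lstn <- unique(unlist(regmatches(txt, gregexpr("&#[[:digit:]]+;", txt))))
lstn <- data.frame(Number=lstn)
matches <- merge(HTMLChars, lstn)
for (i in seq_len(nrow(matches))) {
   txt <- gsub(as.character(matches$Number[i]),
               as.character(matches$Character[i]), txt, fixed=TRUE)
}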
txt now contains the converted characters.
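
As an illustration only (not part of the original message), here is a small self-contained run of the same lookup-and-replace idea. The three-row table is a hypothetical stand-in for the full HTMLChars data frame, the input strings are made up, and decode_with_table is a helper name invented for this sketch.

```r
# Hypothetical three-row stand-in for the full HTMLChars table.
HTMLChars <- data.frame(
  Character = c("\u00e8", "\u00e9", "\u00fc"),
  Number    = c("&#232;", "&#233;", "&#252;"),
  Name      = c("&egrave;", "&eacute;", "&uuml;"),
  stringsAsFactors = FALSE
)

# Made-up input mixing named and numeric entities.
txt <- c("Lumi&egrave;re", "Ni&#233;pce", "S&uuml;ssmilch", "Smith")

# Replace every entity found in `x` that also appears in table[[column]].
decode_with_table <- function(x, table, column, pattern) {
  found <- unique(unlist(regmatches(x, gregexpr(pattern, x))))
  hits <- table[table[[column]] %in% found, , drop = FALSE]
  for (i in seq_len(nrow(hits))) {
    # fixed = TRUE: the entity is matched as a literal string, not a regex
    x <- gsub(hits[[column]][i], hits$Character[i], x, fixed = TRUE)
  }
  x
}

txt <- decode_with_table(txt, HTMLChars, "Name", "&[[:alnum:]]+;")
txt <- decode_with_table(txt, HTMLChars, "Number", "&#[[:digit:]]+;")
print(txt)
```

Entities that are not in the table are left untouched, so anything unexpected in the export stays visible for a later check.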
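dput(HTMLChars)
structure(list(Character = c("\"", "'", "&", "<", ">", "\u00a0", "¡",
"¢", "£", "¤", "¥", "¦", "§", "¨", "©", "ª", "«", "¬", "\u00ad",
"®", "¯", "°", "±", "²", "³", "´", "µ", "¶", "·", "¸", "¹", "º",
"»", "¼", "½", "¾", "¿", "×", "÷", "À", "Á", "Â", "Ã", "Ä", "Å",
"Æ", "Ç", "È", "É", "Ê", "Ë", "Ì", "Í", "Î", "Ï", "Ð", "Ñ", "Ò",
"Ó", "Ô", "Õ", "Ö", "Ø", "Ù", "Ú", "Û", "Ü", "Ý", "Þ", "ß", "à",
"á", "â", "ã", "ä", "å", "æ", "ç", "è", "é", "ê", "ë", "ì", "í",
"î", "ï", "ð", "ñ", "ò", "ó", "ô", "õ", "ö", "ø", "ù", "ú", "û",
"ü", "ý", "þ"), Number = c("&#34;", "&#39;", "&#38;", "&#60;",
"&#62;", "&#160;", "&#161;", "&#162;", "&#163;", "&#164;", "&#165;",
"&#166;", "&#167;", "&#168;", "&#169;", "&#170;", "&#171;", "&#172;",
"&#173;", "&#174;", "&#175;", "&#176;", "&#177;", "&#178;", "&#179;",
"&#180;", "&#181;", "&#182;", "&#183;", "&#184;", "&#185;", "&#186;",
"&#187;", "&#188;", "&#189;", "&#190;", "&#191;", "&#215;", "&#247;",
"&#192;", "&#193;", "&#194;", "&#195;", "&#196;", "&#197;", "&#198;",
"&#199;", "&#200;", "&#201;", "&#202;", "&#203;", "&#204;", "&#205;",
"&#206;", "&#207;", "&#208;", "&#209;", "&#210;", "&#211;", "&#212;",
"&#213;", "&#214;", "&#216;", "&#217;", "&#218;", "&#219;", "&#220;",
"&#221;", "&#222;", "&#223;", "&#224;", "&#225;", "&#226;", "&#227;",
"&#228;", "&#229;", "&#230;", "&#231;", "&#232;", "&#233;", "&#234;",
"&#235;", "&#236;", "&#237;", "&#238;", "&#239;", "&#240;", "&#241;",
"&#242;", "&#243;", "&#244;", "&#245;", "&#246;", "&#248;", "&#249;",
"&#250;", "&#251;", "&#252;", "&#253;", "&#254;"), Name = c("&quot;",
"&apos;", "&amp;", "&lt;", "&gt;", "&nbsp;", "&iexcl;", "&cent;",
"&pound;", "&curren;", "&yen;", "&brvbar;", "&sect;", "&uml;",
"&copy;", "&ordf;", "&laquo;", "&not;", "&shy;", "&reg;", "&macr;",
"&deg;", "&plusmn;", "&sup2;", "&sup3;", "&acute;", "&micro;",
"&para;", "&middot;", "&cedil;", "&sup1;", "&ordm;", "&raquo;",
"&frac14;", "&frac12;", "&frac34;", "&iquest;", "&times;", "&divide;",
"&Agrave;", "&Aacute;", "&Acirc;", "&Atilde;", "&Auml;", "&Aring;",
"&AElig;", "&Ccedil;", "&Egrave;", "&Eacute;", "&Ecirc;", "&Euml;",
"&Igrave;", "&Iacute;", "&Icirc;", "&Iuml;", "&ETH;", "&Ntilde;",
"&Ograve;", "&Oacute;", "&Ocirc;", "&Otilde;", "&Ouml;", "&Oslash;",
"&Ugrave;", "&Uacute;", "&Ucirc;", "&Uuml;", "&Yacute;", "&THORN;",
"&szlig;", "&agrave;", "&aacute;", "&acirc;", "&atilde;", "&auml;",
"&aring;", "&aelig;", "&ccedil;", "&egrave;", "&eacute;", "&ecirc;",
"&euml;", "&igrave;", "&iacute;", "&icirc;", "&iuml;", "&eth;",
"&ntilde;", "&ograve;", "&oacute;", "&ocirc;", "&otilde;", "&ouml;",
"&oslash;", "&ugrave;", "&uacute;", "&ucirc;", "&uuml;", "&yacute;",
"&thorn;")), .Names = c("Character", "Number", "Name"), row.names = c(NA,
100L), class = "data.frame")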
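
The numeric forms follow a rule rather than an arbitrary mapping: the number in &#NNN; is the character's code point. A rough sketch that decodes them without any lookup table is below. It is not from the original message; decode_numeric_entities is a hypothetical helper name, and it assumes decimal references only (hexadecimal forms such as &#xE9; are not handled).

```r
# Decode decimal numeric character references (&#NNN;) in a character
# vector by converting each code point with intToUtf8().
decode_numeric_entities <- function(x) {
  m <- gregexpr("&#[[:digit:]]+;", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(ents) {
    # Strip "&#" and ";" to leave the digits, then convert to integers.
    codes <- as.integer(gsub("[^[:digit:]]", "", ents))
    # One replacement string per matched entity (none if nothing matched).
    vapply(codes, intToUtf8, character(1))
  })
  x
}

# Made-up example input.
decoded <- decode_numeric_entities(
  c("Ni&#233;pce", "Gro&#223; S&#228;rchen", "plain")
)
print(decoded)
```

Named entities such as &eacute; still need a table like the one above, since the names cannot be derived from the code points.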
-------
David
> -----Original Message-----
> From: Michael Friendly [mailto:friendly at yorku.ca]
> Sent: Friday, August 10, 2012 12:14 PM
> To: dcarlson at tamu.edu
> Cc: 'R-help'
> Subject: Re: [R] translating HTML character entities to accented
> characters
>
> Thanks, David
>
> I need an all-R solution for this, because the author.csv file is
> exported from a database that enforces the HTML
> encoding and the import into R may have to be repeated several times as
> the database is updated.
>
> -Michael
>
> On 8/10/2012 12:40 PM, David L Carlson wrote:
> > It's not quite an R solution, but I just pasted your examples into a
> > script window in R and saved it as chars.html. Then I opened it in
> > Firefox and pasted the results here (with returns inserted to match
> > your original).
> >
> >> grep("&", author$lname, value=TRUE)
> > [1] "Frère de Montizon" "Lumière"
> > [3] "Lumière" "Niépce"
> > [5] "Süssmilch" "Schüpbach"
> >> grep("&", author$birthplace, value=TRUE)
> > [1] "Marbach, Württemberg"
> > [2] "Côte-d'Or"
> > [3] "Chalon-sur-Saône, Saône-et-Loire"
> > [4] "Groß Särchen, Germany"
> >> apropos("HTML")
> > For a CSV file you would want to preserve the lines by adding <br> to
> > the end of each line first.
> >
> > ----------------------------------------------
> > David L Carlson
> > Associate Professor of Anthropology
> > Texas A&M University
> > College Station, TX 77843-4352
> >
> >
> >
> >> -----Original Message-----
> >> From: r-help-bounces at r-project.org
> >> [mailto:r-help-bounces at r-project.org] On Behalf Of Michael Friendly
> >> Sent: Friday, August 10, 2012 11:15 AM
> >> To: R-help
> >> Subject: [R] translating HTML character entities to accented
> >> characters
> >>
> >> I've imported a .csv file where character strings that contained
> >> accented characters were written as HTML
> >> character entities. Is there a function that works on a vector to
> >> translate them back to accented (latin1) characters?
> >>
> >> Some examples:
> >>
> >> > grep("&", author$lname, value=TRUE)
> >> [1] "Frère de Montizon" "Lumière"
> >> [3] "Lumière" "Niépce"
> >> [5] "Süssmilch" "Schüpbach"
> >> > grep("&", author$birthplace, value=TRUE)
> >> [1] "Marbach, Württemberg"
> >> [2] "Côte-d'Or"
> >> [3] "Chalon-sur-Saône, Saône-et-Loire"
> >> [4] "Groß Särchen, Germany"
> >> > apropos("HTML")
> >>
> >> thx,
> >> -Michael
> >>
> >> --
> >> Michael Friendly Email: friendly AT yorku DOT ca
> >> Professor, Psychology Dept.
> >> York University Voice: 416 736-2100 x66249 Fax: 416 736-5814
> >> 4700 Keele Street Web: http://www.datavis.ca
> >> Toronto, ONT M3J 1P3 CANADA
> >>
> >> ______________________________________________
> >> R-help at r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide
> >> http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.