[R] translating HTML character entities to accented characters

Mon Aug 13 15:17:45 CEST 2012

Beautiful, David.  thanks so much!

I packaged this as a function, html2latin1(), with this simple test

 > grep("&", author$givennames, value=TRUE)
  [1] "Adolphe d'" "Émile"
  [3] "Louis Jacques Mandé" "René"
  [5] "André Michel" "Léon"
  [7] "Émile"                 "Maurice d'"
  [9] "Louis Ézéchiel" "Louis-Léger"
[11] "Pierre-François"
 > grep("&", author$givennames)
  [1]   5  33  36  37  59  79  84 108 117 140 153
 > html2latin1(author$givennames)[grep("&", author$givennames)]
  [1] "Adolphe d'"          "Émile"               "Louis Jacques Mandé" "René"
  [5] "André Michel"        "Léon"                "Émile"               "Maurice d'"
  [9] "Louis Ézéchiel"      "Louis-Léger"         "Pierre-François"
 >

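html2latin1 <- function(txt) {
     # Requires the data.frame HTMLChars (see the dput() output below)

     # Search for &Name;
     lsta <- unique(unlist(regmatches(txt, gregexpr("&[[:alpha:]]+;", txt))))
     lsta <- data.frame(Name=lsta)
     matches <- merge(HTMLChars, lsta)
     # seq_len() gives an empty loop when nothing matched;
     # 1:nrow(matches) would iterate over c(1, 0) and index missing rows
     for (i in seq_len(nrow(matches))) {
          # fixed=TRUE: treat the entity as a literal string, not a regex
          txt <- gsub(as.character(matches$Name[i]),
                      as.character(matches$Character[i]), txt, fixed=TRUE)
     }

     # Search for &#Number;
     lstn <- unique(unlist(regmatches(txt, gregexpr("&#[[:digit:]]+;", txt))))
     lstn <- data.frame(Number=lstn)
     matches <- merge(HTMLChars, lstn)
     for (i in seq_len(nrow(matches))) {
          txt <- gsub(as.character(matches$Number[i]),
                      as.character(matches$Character[i]), txt, fixed=TRUE)
     }
     txt
}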
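
For reference, the Character and Number columns of HTMLChars follow
directly from the character code points, so that part of the table can be
regenerated rather than pasted. A minimal sketch, assuming the row order
seen in the posted table; the names codes and HTMLNum are made up here for
illustration and are not part of David's code:

```r
# Code points in the order the posted table appears to use (an assumption
# read off its Character column): five ASCII specials, 160-191, the two
# math signs, then the accented letters.
codes <- c(34, 39, 38, 60, 62, 160:191, 215, 247, 192:214, 216:246, 248:254)

# HTMLNum is a hypothetical stand-in for two of the three HTMLChars columns
HTMLNum <- data.frame(
  Character = intToUtf8(codes, multiple = TRUE),  # one string per code point
  Number    = sprintf("&#%d;", codes),            # numeric entity, e.g. &#233;
  stringsAsFactors = FALSE
)
```

The Name column has no such rule and still has to come from the posted
table.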

And this seems to work for the whole file:

authorfile <- readLines("author.csv")
authorfilet <- html2latin1(authorfile)
writeLines(authorfilet, "authort.csv")
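
The &#Number; form can also be decoded without a lookup table, since the
number is the character's code point. This is a sketch only, not what was
run on author.csv; decode_numeric_entities() is a hypothetical helper, and
its result is UTF-8 rather than latin1:

```r
# Hypothetical helper (not from the thread): replace each &#NNN; with the
# character at code point NNN. Named entities such as &eacute; are left alone.
decode_numeric_entities <- function(txt) {
  m <- gregexpr("&#[[:digit:]]+;", txt)
  regmatches(txt, m) <- lapply(regmatches(txt, m), function(ents) {
    # Strip "&#" and ";" to leave the decimal code point
    codes <- as.integer(gsub("[^[:digit:]]", "", ents))
    if (length(codes)) intToUtf8(codes, multiple = TRUE) else character(0)
  })
  txt
}

decoded <- decode_numeric_entities(c("Ren&#233;", "plain", "d&#39;Or &#38; co"))
```

If latin1 is really needed, iconv(decoded, "UTF-8", "latin1") can convert
the result afterwards.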

best,
-Michael

On 8/12/2012 4:36 PM, David L Carlson wrote:
> This may work for your needs with a little fine tuning. Special and accented
> characters can be represented in HTML with a character name or a numeric
> value. For example, " can be represented as &quot; or as &#34; and it
> appears from your example that both are used. I've attached a
> dput(HTMLChars) to the end of this message with the concordances. The
> following works on your data, but I haven't included any error checking.
> Assuming your .csv file is called txt and the data.frame HTMLChars is
> loaded:
>
> # Search for &Name;
> lsta <- unique(unlist(regmatches(txt, gregexpr("&[[:alpha:]]+;", txt))))
> lsta <- data.frame(Name=lsta)
> matches <- merge(HTMLChars, lsta)
> for (i in 1:nrow(matches)) {
>       txt <- gsub(matches$Name[i], matches$Character[i], txt)
> }
>
> # Search for &#Number;
> lstn <- unique(unlist(regmatches(txt, gregexpr("&#[[:digit:]]+;", txt))))
> lstn <- data.frame(Number=lstn)
> matches <- merge(HTMLChars, lstn)
> for (i in 1:nrow(matches)) {
>       txt <- gsub(matches$Number[i], matches$Character[i], txt)
> }
>
> txt now contains the converted characters.
>
> dput(HTMLChars)
> structure(list(Character = c("\"", "'", "&", "<", ">", "", "¡",
> "¢", "£", "¤", "¥", "¦", "§", "¨", "©", "ª", "«", "¬", "",
> "®", "¯", "°", "±", "²", "³", "´", "µ", "¶", "·", "¸", "¹", "º",
> "»", "¼", "½", "¾", "¿", "×", "÷", "À", "Á", "Â", "Ã", "Ä", "Å",
> "Æ", "Ç", "È", "É", "Ê", "Ë", "Ì", "Í", "Î", "Ï", "Ð", "Ñ", "Ò",
> "Ó", "Ô", "Õ", "Ö", "Ø", "Ù", "Ú", "Û", "Ü", "Ý", "Þ", "ß", "à",
> "á", "â", "ã", "ä", "å", "æ", "ç", "è", "é", "ê", "ë", "ì", "í",
> "î", "ï", "ð", "ñ", "ò", "ó", "ô", "õ", "ö", "ø", "ù", "ú", "û",
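> "ü", "ý", "þ"), Number = c("&#34;", "&#39;", "&#38;", "&#60;",
> "&#62;", "&#160;", "&#161;", "&#162;", "&#163;", "&#164;", "&#165;",
> "&#166;", "&#167;", "&#168;", "&#169;", "&#170;", "&#171;", "&#172;",
> "&#173;", "&#174;", "&#175;", "&#176;", "&#177;", "&#178;", "&#179;",
> "&#180;", "&#181;", "&#182;", "&#183;", "&#184;", "&#185;", "&#186;",
> "&#187;", "&#188;", "&#189;", "&#190;", "&#191;", "&#215;", "&#247;",
> "&#192;", "&#193;", "&#194;", "&#195;", "&#196;", "&#197;", "&#198;",
> "&#199;", "&#200;", "&#201;", "&#202;", "&#203;", "&#204;", "&#205;",
> "&#206;", "&#207;", "&#208;", "&#209;", "&#210;", "&#211;", "&#212;",
> "&#213;", "&#214;", "&#216;", "&#217;", "&#218;", "&#219;", "&#220;",
> "&#221;", "&#222;", "&#223;", "&#224;", "&#225;", "&#226;", "&#227;",
> "&#228;", "&#229;", "&#230;", "&#231;", "&#232;", "&#233;", "&#234;",
> "&#235;", "&#236;", "&#237;", "&#238;", "&#239;", "&#240;", "&#241;",
> "&#242;", "&#243;", "&#244;", "&#245;", "&#246;", "&#248;", "&#249;",
> "&#250;", "&#251;", "&#252;", "&#253;", "&#254;"), Name = c("&quot;",
> "&apos;", "&amp;", "&lt;", "&gt;", "&nbsp;", "&iexcl;", "&cent;",
> "&pound;", "&curren;", "&yen;", "&brvbar;", "&sect;", "&uml;",
> "&copy;", "&ordf;", "&laquo;", "&not;", "&shy;", "&reg;", "&macr;",
> "&deg;", "&plusmn;", "&sup2;", "&sup3;", "&acute;", "&micro;",
> "&para;", "&middot;", "&cedil;", "&sup1;", "&ordm;", "&raquo;",
> "&frac14;", "&frac12;", "&frac34;", "&iquest;", "&times;", "&divide;",
> "&Agrave;", "&Aacute;", "&Acirc;", "&Atilde;", "&Auml;", "&Aring;",
> "&AElig;", "&Ccedil;", "&Egrave;", "&Eacute;", "&Ecirc;", "&Euml;",
> "&Igrave;", "&Iacute;", "&Icirc;", "&Iuml;", "&ETH;", "&Ntilde;",
> "&Ograve;", "&Oacute;", "&Ocirc;", "&Otilde;", "&Ouml;", "&Oslash;",
> "&Ugrave;", "&Uacute;", "&Ucirc;", "&Uuml;", "&Yacute;", "&THORN;",
> "&szlig;", "&agrave;", "&aacute;", "&acirc;", "&atilde;", "&auml;",
> "&aring;", "&aelig;", "&ccedil;", "&egrave;", "&eacute;", "&ecirc;",
> "&euml;", "&igrave;", "&iacute;", "&icirc;", "&iuml;", "&eth;",
> "&ntilde;", "&ograve;", "&oacute;", "&ocirc;", "&otilde;", "&ouml;",
> "&oslash;", "&ugrave;", "&uacute;", "&ucirc;", "&uuml;", "&yacute;",
> "&thorn;")), .Names = c("Character", "Number", "Name"), row.names = c(NA,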
> 100L), class = "data.frame")
>
> -------
> David
>
>> -----Original Message-----
>> From: Michael Friendly [mailto:friendly at yorku.ca]
>> Sent: Friday, August 10, 2012 12:14 PM
>> To: dcarlson at tamu.edu
>> Cc: 'R-help'
>> Subject: Re: [R] translating HTML character entities to accented
>> characters
>>
>> Thanks, David
>>
>> I need an all-R solution for this, because the author.csv file is
>> exported from a database that enforces the HTML
>> encoding and the import into R may have to be repeated several times as
>> the database is updated.
>>
>> -Michael
>>
>> On 8/10/2012 12:40 PM, David L Carlson wrote:
>>> It's not quite an R solution, but I just pasted your examples into a
>>> script window in R and saved it as chars.html. Then I opened it in
>>> Firefox and pasted the results here (with returns inserted to match
>>> your original).
>>>> grep("&", author$lname, value=TRUE)
>>> [1] "Frère de Montizon" "Lumière"
>>> [3] "Lumière" "Niépce"
>>> [5] "Süssmilch" "Schüpbach"
>>>> grep("&", author$birthplace, value=TRUE)
>>> [1] "Marbach, Württemberg"
>>> [2] "Côte-d'Or"
>>> [3] "Chalon-sur-Saône, Saône-et-Loire"
>>> [4] "Groß Särchen, Germany"
>>>> apropos("HTML")
>>> For a CSV file you would want to preserve the lines by adding <br> to
>>> the end of each line first.
>>>
>>> ----------------------------------------------
>>> David L Carlson
>>> Associate Professor of Anthropology
>>> Texas A&M University
>>> College Station, TX 77843-4352
>>>
>>>
>>>
>>>> -----Original Message-----
>>>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
>>>> project.org] On Behalf Of Michael Friendly
>>>> Sent: Friday, August 10, 2012 11:15 AM
>>>> To: R-help
>>>> Subject: [R] translating HTML character entities to accented
>>>> characters
>>>> I've imported a .csv file where character strings that contained
>>>> accented characters were written as HTML
>>>> character entities.  Is there a function that works on a vector to
>>>> translate them back to accented (latin1) characters?
>>>>
>>>> Some examples:
>>>>
>>>>    > grep("&", author$lname, value=TRUE)
>>>> [1] "Frère de Montizon" "Lumière"
>>>> [3] "Lumière"           "Niépce"
>>>> [5] "Süssmilch"           "Schüpbach"
>>>>    > grep("&", author$birthplace, value=TRUE)
>>>> [1] "Marbach, Württemberg"
>>>> [2] "Côte-d'Or"
>>>> [3] "Chalon-sur-Saône, Saône-et-Loire"
>>>> [4] "Groß Särchen, Germany"
>>>>    > apropos("HTML")
>>>>
>>>> thx,
>>>> -Michael
>>>>
>>>> --
>>>> Michael Friendly     Email: friendly AT yorku DOT ca
>>>> Professor, Psychology Dept.
>>>> York University      Voice: 416 736-2100 x66249 Fax: 416 736-5814
>>>> 4700 Keele Street    Web:   http://www.datavis.ca
>>>> Toronto, ONT  M3J 1P3 CANADA
>