[R] cannot read iso639 table

Prof Brian Ripley ripley at stats.ox.ac.uk
Sun Dec 9 00:04:33 CET 2012


For the record, in R-devel you can do

> f <-
read.table(url("http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt", 
encoding = "UTF-8-BOM"), quote="", sep="|", stringsAsFactors=FALSE)
> f[1,]
    V1 V2 V3   V4   V5
1 aar    aa Afar afar
> charToRaw(f[1,1])
[1] 61 61 72

Whether this also works with plain "UTF-8" (that is, whether the BOM is 
dropped) depends on the implementation of iconv: strangely, Microsoft 
remove BOMs in UTF-16 but not in UTF-8 (although almost the only people 
to put them there in UTF-8 are Microsoft's applications).
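
The same behaviour can be checked without network access. The following is 
only a sketch, assuming a version of R that accepts "UTF-8-BOM" as an 
encoding name: the temporary file and its two records are made up for 
illustration and stand in for the downloaded table.

```r
# Illustrative stand-in for the downloaded file: a UTF-8 BOM (EF BB BF)
# followed by two pipe-separated records (made-up sample, not the real table).
tmp <- tempfile(fileext = ".txt")
body <- "aar||aa|Afar|afar\nabk||ab|Abkhazian|abkhaze\n"
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)), charToRaw(body)), tmp)

# fileEncoding is passed on to file(), so the BOM is dropped on reading.
f <- read.table(tmp, fileEncoding = "UTF-8-BOM", quote = "", sep = "|",
                stringsAsFactors = FALSE)

# Inspect the raw bytes of the first cell: no BOM bytes should remain.
charToRaw(f[1, 1])

unlink(tmp)
```

If the first cell still begins with the bytes ef bb bf, the BOM was not 
removed by the encoding conversion and has to be stripped by hand.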



On 13/09/2012 21:43, peter dalgaard wrote:
> Pragmatically, one can zap the BOM from the output with
>
> language.ISO.table[1,1] <- substring(language.ISO.table[1,1],2)
>
> and be gone with it.
>
> It would be nicer to zap the BOM before read.table, though. It does work for me with the below (notice that the BOM is a single character if you don't use useBytes=).
>
>> get.language.ISO.table
> function () {
>   socket <- url("http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt",
>                 open="r",encoding="utf-8");
>   readChar(socket, nchars=1)
>   data <- read.table(socket, as.is = TRUE, sep = "|", header = FALSE,
>                      col.names = c("a3bibliographic","a3terminologic",
>                        "a2","english","french"), quote="");
>   close(socket);
>   data
> }
>
>
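
The substring() fix above relies on the BOM having been read as a single 
character. When it arrives as three separate bytes (as in the 
"\357\273\277aar" case reported further down), a byte-level check is one 
alternative. This is a sketch only: strip_utf8_bom is a hypothetical helper 
name, not a function in base R.

```r
# Hypothetical helper: remove a leading UTF-8 BOM from a single string.
# It works on raw bytes, so it does not depend on the current locale.
strip_utf8_bom <- function(x) {
  bytes <- charToRaw(x)
  bom <- as.raw(c(0xEF, 0xBB, 0xBF))
  if (length(bytes) >= 3 && identical(bytes[1:3], bom)) {
    bytes <- bytes[-(1:3)]
  }
  # rawToChar() returns a string without an encoding mark; for non-ASCII
  # content the caller may need Encoding(result) <- "UTF-8" afterwards.
  rawToChar(bytes)
}

# A first cell as it looks when the BOM was not removed on input.
first <- rawToChar(as.raw(c(0xEF, 0xBB, 0xBF, 0x61, 0x61, 0x72)))
strip_utf8_bom(first)   # BOM bytes removed
strip_utf8_bom("abk")   # strings without a BOM are returned unchanged
```

Applied to the data frame from the quoted message, this would be 
language.ISO.table[1,1] <- strip_utf8_bom(language.ISO.table[1,1]).
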
> On Sep 13, 2012, at 22:26 , William Dunlap wrote:
>
>> It would be helpful if you showed your commands and printed
>> outputs, copied directly from your R session, from the beginning
>> to the end.  I put the call to sessionInfo() in my message because
>> it is probably relevant.  It is nice to completely include the original
>> email when responding to it so others can see the whole story in
>> one place.
>>
>> Bill Dunlap
>> Spotfire, TIBCO Software
>> wdunlap tibco.com
>>
>>
>>> -----Original Message-----
>>> From: Sam Steingold [mailto:sam.steingold at gmail.com] On Behalf Of Sam Steingold
>>> Sent: Thursday, September 13, 2012 1:18 PM
>>> To: William Dunlap
>>> Cc: peter dalgaard; r-help at r-project.org
>>> Subject: Re: [R] cannot read iso639 table
>>>
>>>> * William Dunlap <jqhaync at gvopb.pbz> [2012-09-13 19:50:21 +0000]:
>>>>
>>>> On Windows with R-2.15.1 in a 1252 locale, I had to read (and toss) out
>>>> the initial 3 bytes (the byte-order mark?) to make things work:
>>>>
>>>>> socket <-
>>>>> url("http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt",open="r",encoding="utf-8")
>>>>> readChar(socket, nchars=3, useBytes=TRUE)
>>>>   [1] ""
>>>
>>> confirmed - first 3 bytes are "\357\273\277"
>>>
>>>>> d <- read.table(socket, quote="", sep="|", stringsAsFactors=FALSE)
>>>>> dim(d)
>>>>   [1] 485   5
>>>>> head(d)
>>>>      V1 V2 V3             V4      V5
>>>>   1 aar    aa           Afar    afar
>>>>   2 abk    ab      Abkhazian abkhaze
>>>>   3 ace             Achinese    aceh
>>>>   4 ach                Acoli   acoli
>>>>   5 ada              Adangme adangme
>>>>   6 ady       Adyghe; Adygei  adyghé
>>>
>>> alas, this is all I get:
>>>
>>> Warning message:
>>> In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
>>>   invalid input found on input connection 'http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt'
>>>
>>>   a3bibliographic a3terminologic a2        english  french
>>> 1             aar             NA aa           Afar    afar
>>> 2             abk             NA ab      Abkhazian abkhaze
>>> 3             ace             NA          Achinese    aceh
>>> 4             ach             NA             Acoli   acoli
>>> 5             ada             NA           Adangme adangme
>>> 6             ady             NA    Adyghe; Adygei   adygh
>>>
>>> note that the first non-ASCII character terminates the input.
>>>
>>> so, I still cannot read the data from the URL.
>>>
>>> I can read the file though - with quote="" (thanks Peter!) -
>>> except that the first record is "\357\273\277aar".
>>>
>>>
>>> --
>>> Sam Steingold (http://sds.podval.org/) on Ubuntu 12.04 (precise) X 11.0.11103000
>


-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


