[R] cannot read iso639 table

peter dalgaard pdalgd at gmail.com
Thu Sep 13 22:43:14 CEST 2012


Pragmatically, one can zap the BOM from the output with 

language.ISO.table[1,1] <- substring(language.ISO.table[1,1],2)

and be done with it.
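If the BOM reaches R as three raw bytes rather than one decoded character (as in the "\357\273\277aar" record quoted below), substring(..., 2) may drop only the first byte, depending on the locale. A minimal sketch of a byte-wise variant follows. The two-row data frame is a made-up stand-in for the real table, and strip_bom is a hypothetical helper name, not something from this thread.

```r
# UTF-8 byte-order mark, built from its three bytes (octal 357 273 277).
bom <- rawToChar(as.raw(c(0xEF, 0xBB, 0xBF)))

# Hypothetical helper: remove a leading BOM, matching bytes rather than characters.
strip_bom <- function(x) sub(paste0("^", bom), "", x, useBytes = TRUE)

# Made-up stand-in for the table read from the URL; only the first cell has a BOM.
language.ISO.table <- data.frame(
  a3bibliographic = c(paste0(bom, "aar"), "abk"),
  a2 = c("aa", "ab"),
  stringsAsFactors = FALSE
)

# Clean the first cell in place and show the result.
language.ISO.table[1, 1] <- strip_bom(language.ISO.table[1, 1])
print(language.ISO.table[1, 1])
```

Because the pattern is anchored with "^", cells without a BOM pass through unchanged, so the helper could also be applied to the whole first column.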

It would be nicer to zap the BOM before read.table, though. The function below works for me (notice that the BOM is a single character if you don't use useBytes=).

> get.language.ISO.table
function () {
 # Open the URL as a text-mode connection that decodes UTF-8.
 socket <- url("http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt",
               open="r",encoding="utf-8")
 # Read and discard the BOM: one character, since useBytes= is not used.
 readChar(socket, nchars=1)
 data <- read.table(socket, as.is = TRUE, sep = "|", header = FALSE,
                    col.names = c("a3bibliographic","a3terminologic",
                      "a2","english","french"), quote="")
 close(socket)
 data
}
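The "single character" remark can be checked without touching the network. The following is only an illustrative sketch: it builds the BOM from its three bytes and counts it both ways. The variable names are made up for the example.

```r
# The UTF-8 byte-order mark as raw bytes.
bom_raw <- as.raw(c(0xEF, 0xBB, 0xBF))

# The same bytes in octal, matching the "\357\273\277" seen in this thread.
octal <- format(as.octmode(as.integer(bom_raw)))

# Turn the bytes into a string and declare it as UTF-8.
bom <- rawToChar(bom_raw)
Encoding(bom) <- "UTF-8"

# Decoded as UTF-8 it counts as one character (U+FEFF); byte-wise it is three.
n_chars <- nchar(bom, type = "chars")
n_bytes <- nchar(bom, type = "bytes")
print(c(chars = n_chars, bytes = n_bytes))
```

This is why nchars=1 is enough on a connection opened with encoding="utf-8", while the useBytes=TRUE variant quoted below needs nchars=3.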


On Sep 13, 2012, at 22:26 , William Dunlap wrote:

> It would be helpful if you showed your commands and printed
> outputs, copied directly from your R session, from the beginning
> to the end.  I put the call to sessionInfo() in my message because
> it is probably relevant.  It is nice to completely include the original
> email when responding to it so others can see the whole story in
> one place.
> 
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
> 
> 
>> -----Original Message-----
>> From: Sam Steingold [mailto:sam.steingold at gmail.com] On Behalf Of Sam Steingold
>> Sent: Thursday, September 13, 2012 1:18 PM
>> To: William Dunlap
>> Cc: peter dalgaard; r-help at r-project.org
>> Subject: Re: [R] cannot read iso639 table
>> 
>>> * William Dunlap <jqhaync at gvopb.pbz> [2012-09-13 19:50:21 +0000]:
>>> 
>>> On Windows with R-2.15.1 in a 1252 locale, I had to read (and toss) out
>>> the initial 3 bytes (the byte-order mark?) to make things work:
>>> 
>>>> socket <- url("http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt",open="r",encoding="utf-8")
>>>> readChar(socket, nchars=3, useBytes=TRUE)
>>>  [1] ""
>> 
>> confirmed - first 3 bytes are "\357\273\277"
>> 
>>>> d <- read.table(socket, quote="", sep="|", stringsAsFactors=FALSE)
>>>> dim(d)
>>>  [1] 485   5
>>>> head(d)
>>>     V1 V2 V3             V4      V5
>>>  1 aar    aa           Afar    afar
>>>  2 abk    ab      Abkhazian abkhaze
>>>  3 ace             Achinese    aceh
>>>  4 ach                Acoli   acoli
>>>  5 ada              Adangme adangme
>>>  6 ady       Adyghe; Adygei  adyghé
>> 
>> alas, this is all I get:
>> 
>> Warning message:
>> In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
>>  invalid input found on input connection 'http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt'
>> 
>>  a3bibliographic a3terminologic a2        english  french
>> 1             aar             NA aa           Afar    afar
>> 2             abk             NA ab      Abkhazian abkhaze
>> 3             ace             NA          Achinese    aceh
>> 4             ach             NA             Acoli   acoli
>> 5             ada             NA           Adangme adangme
>> 6             ady             NA    Adyghe; Adygei   adygh
>> 
>> note that the first non-ASCII character terminates the input.
>> 
>> so, I still cannot read the data from the URL.
>> 
>> I can read the file though - with quote="" (thanks Peter!) -
>> except that the first record is "\357\273\277aar".
>> 
>> 
>> --
>> Sam Steingold (http://sds.podval.org/) on Ubuntu 12.04 (precise) X 11.0.11103000
>> http://www.childpsy.net/ http://thereligionofpeace.com
>> http://mideasttruth.com http://iris.org.il http://jihadwatch.org
>> The only thing worse than X Windows: (X Windows) - X

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com



