[R] Reading a txt file from internet

Sun Sep 8 13:47:57 CEST 2024

Say you have several files from different places or times and want to run your program on all of them without reprogramming. You could start with the readr package and use guess_encoding().
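# guess_encoding() can return several candidates per file, most likely first;
# keep only the top one (NA if nothing was detected)
file_encoding <- character(length(file_paths))
for (j in seq_along(file_paths)) {
  file_encoding[j] <- readr::guess_encoding(file_paths[j])$encoding[1]
}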

With the encoding of each file, you can combine file_paths and file_encoding and then break this into multiple data frames based on encoding. Read all the data, reformat for consistency, and then combine them.
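
A rough sketch of that grouping step, in base R. The helper name read_by_encoding is made up for illustration, and it assumes every file is tab-delimited with the same columns, so treat it as a starting point rather than a finished solution:

```r
# Hypothetical helper: group paths by detected encoding, read each group
# with the matching fileEncoding, then stack everything into one data frame.
# Assumes all files are tab-delimited and share the same column names.
# Paths whose encoding is NA are dropped by split().
read_by_encoding <- function(file_paths, file_encoding) {
  groups <- split(file_paths, file_encoding)  # one character vector per encoding
  per_encoding <- lapply(names(groups), function(enc) {
    frames <- lapply(groups[[enc]], read.delim,
                     fileEncoding = enc, stringsAsFactors = FALSE)
    do.call(rbind, frames)                    # one data frame per encoding
  })
  do.call(rbind, per_encoding)                # combine across encodings
}
```

You would call it as read_by_encoding(file_paths, file_encoding) with the two vectors built above. Rows come back grouped by encoding, not in the original file order, and any reformatting for consistency would go between the two rbind steps.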

More simply, you could run guess_encoding() on one file to see what the encoding might be. It gives you a name like UTF-16LE that you can then pass as the encoding argument (encoding = for file(), fileEncoding = for read.delim()), as others have already shown.
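
If you want a cross-check that does not depend on readr, you can also look at the first bytes of a local copy of the file yourself. This is only a sketch: bom_encoding is a made-up helper, it only recognizes the UTF-8 and UTF-16 byte-order marks (a UTF-32LE file would be misreported as UTF-16LE), and it tells you nothing when the file has no BOM.

```r
# Hypothetical helper (not part of base R or readr): map a leading
# byte-order mark to an encoding name.
bom_encoding <- function(path) {
  b <- as.integer(readBin(path, what = "raw", n = 3L))
  starts <- function(sig) {
    length(b) >= length(sig) && all(b[seq_along(sig)] == sig)
  }
  if (starts(c(0xEF, 0xBB, 0xBF))) return("UTF-8-BOM")
  if (starts(c(0xFF, 0xFE)))       return("UTF-16LE")  # the <ff><fe> case
  if (starts(c(0xFE, 0xFF)))       return("UTF-16BE")
  NA_character_  # no BOM; fall back to readr::guess_encoding()
}
```

For a file on the web you would first save it with download.file(..., mode = "wb") and then pass the local path. The returned name can be handed to read.delim() as fileEncoding, though how well each name is decoded may depend on the platform, as the thread below shows.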

Tim

-----Original Message-----
From: R-help <r-help-bounces using r-project.org> On Behalf Of Duncan Murdoch
Sent: Sunday, September 8, 2024 5:06 AM
To: Jeff Newmiller <jdnewmil using dcn.davis.ca.us>; r-help using r-project.org; Enrico Schumann <es using enricoschumann.net>; Christofer Bogaso <bogaso.christofer using gmail.com>
Subject: Re: [R] Reading a txt file from internet

On 2024-09-07 7:37 p.m., Jeff Newmiller wrote:
> I tried it on R 4.4.1 on Linux Mint 21.3 just before I posted it, and I just tried it on R 3.4.2 on Ubuntu 16.04 and R 4.3.2 on Windows 11 just now and it works on all of them.
>
> I don't have a big-endian machine to test on, but the Unicode spec says to honor the BOM and if there isn't one to assume that it is big-endian data. But in this case there is a BOM so your machine has a buggy decoder?

Sounds like it!  I did it on a Mac running R 4.4.1.

Duncan Murdoch

>
> On September 7, 2024 2:43:24 PM PDT, Duncan Murdoch <murdoch.duncan using gmail.com> wrote:
>> On 2024-09-07 4:52 p.m., Jeff Newmiller via R-help wrote:
>>> When you specify LE in the encoding type, you are logically telling the decoder that you know the two-byte pairs are in little-endian order... which could override whatever the byte-order-mark was indicating. If the BOM indicated big-endian then the file decoding would break. If there is a BOM, don't override it unless you have to (e.g. for a wrong BOM)... leave off the LE unless you really need it.
>>
>> That sounds like good advice, but it doesn't work:
>>
>>> read.delim(
>> +     'https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
>> +     fileEncoding = "UTF-16"
>> + )
>> [1] time
>> [2] vendor.洀攀琀愀氀........㐀㐀........㜀.㐀㐀........㤀.㐀㐀.㐀..㐀.....㐀..㐀..㔀...㜀.㐀..㠀..㘀...㠀.㐀㐀....㜀...㔀.㐀㐀.
>>
>> and so on.
>>>
>>> On September 7, 2024 1:22:23 PM PDT, Enrico Schumann <es using enricoschumann.net> wrote:
>>>> On Sun, 08 Sep 2024, Christofer Bogaso writes:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am trying to read the data from
>>>>> https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt
>>>>> without any success. Below is the error I am getting:
>>>>>
>>>>>> read.delim('https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt')
>>>>>
>>>>> Error in make.names(col.names, unique = TRUE) :
>>>>>     invalid multibyte string at '<ff><fe>t'
>>>>> In addition: Warning messages:
>>>>> 1: In read.table(file = file, header = header, sep = sep, quote = quote,  :
>>>>>     line 1 appears to contain embedded nulls
>>>>> 2: In read.table(file = file, header = header, sep = sep, quote = quote,  :
>>>>>     line 2 appears to contain embedded nulls
>>>>> 3: In read.table(file = file, header = header, sep = sep, quote = quote,  :
>>>>>     line 3 appears to contain embedded nulls
>>>>> 4: In read.table(file = file, header = header, sep = sep, quote = quote,  :
>>>>>     line 4 appears to contain embedded nulls
>>>>> 5: In read.table(file = file, header = header, sep = sep, quote = quote,  :
>>>>>     line 5 appears to contain embedded nulls
>>>>>
>>>>> Is there any way to read this data directly into R?
>>>>>
>>>>> Thanks for your time
>>>>>
>>>>
>>>> The <ff><fe> looks like a byte-order mark
>>>> (https://en.wikipedia.org/wiki/Byte_order_mark).
>>>> Try this:
>>>>
>>>>      fn <- file('https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
>>>>                 encoding = "UTF-16LE")
>>>>      read.delim(fn)
>>>>
>>>
>>
>

______________________________________________
R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide https://www.r-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.