[R] URLdecode problems

Jeff Newmiller jdnewmil at dcn.davis.CA.us
Mon Sep 1 23:52:40 CEST 2014


I would guess that the original strings were encoded somehow (non-ASCII), and that the person who received them didn't understand how to deal with them either, so they URL-encoded them thinking that no information would be lost that way. Unfortunately, they probably lost the meta information about how the strings were originally encoded, and without that this turns into a detective job that will likely need C's ability (perhaps via Rcpp) to ignore type information in order to put things back. If you are lucky, all of the strings were originally encoded the same way... if really lucky, they were all UTF-8 or UTF-16 (the latter would account for nuls and other odd bytes). Proceeding with the broken strings you have now will almost certainly not work. The fragments shown are not even vaguely recognizable as URLs, so I don't see how we can do anything meaningful with them.
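
A first step in that detective work might be to check whether every "%" in a string really starts a valid two-digit hex escape. In the two examples quoted below, the nul in the error message sits exactly where "%gI" and "%us" were, and neither of those is a valid escape. The following is only a minimal sketch in base R; the helper names bad_escapes() and escape_stray_percent() are made up for illustration, and it only diagnoses the damage, it does not recover the original encoding:

```r
# Hypothetical helper: list every "%" (plus up to two following characters)
# that is NOT followed by two hexadecimal digits.
bad_escapes <- function(x) {
  m <- gregexpr("%(?![0-9A-Fa-f]{2}).{0,2}", x, perl = TRUE)
  regmatches(x, m)[[1]]
}

bad_escapes("0;%20@%gIL")  # "%gI"
bad_escapes("%20%use")     # "%us"

# Hypothetical helper: treat a stray "%" as a literal percent sign by
# rewriting it as "%25", so that only valid escapes reach URLdecode().
# Whether a stray "%" really was meant literally is an assumption.
escape_stray_percent <- function(x) {
  gsub("%(?![0-9A-Fa-f]{2})", "%25", x, perl = TRUE)
}

URLdecode(escape_stray_percent("%20%use"))  # " %use"
```

If bad_escapes() reports something for most of the strings, the data were probably truncated or mangled before or after they were URL-encoded, and stripping nuls with gsub() would only hide that.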

Please read the Posting Guide. One point made there is that if C becomes part of the question, then R-devel becomes the more appropriate list. The other is that plain text email is expected on all of these lists (not HTML).
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
--------------------------------------------------------------------------- 
Sent from my phone. Please excuse my brevity.

On September 1, 2014 9:02:33 AM PDT, Oliver Keyes <okeyes at wikimedia.org> wrote:
>Hey all,
>
>So, I'm attempting to decode some (and I don't know why anyone did this)
>URL-encoded user agents. Running URLdecode over them generates the error:
>
>"Error in rawToChar(out) : embedded nul in string"
>
>Okay, so there's an embedded nul - fair enough. Presumably decoding the URL
>is exposing it in a format R doesn't like. Except when I try to dig down
>and work out what an encoded nul looks like, in order to simply remove them
>with something like gsub(), I end up with several different strings, all of
>which apparently resolve to an embedded nul:
>
>> URLdecode("0;%20@%gIL")
>Error in rawToChar(out) : embedded nul in string: '0; @\0L'
>In addition: Warning message:
>In URLdecode("0;%20@%gIL") :
>  out-of-range values treated as 0 in coercion to raw
>> URLdecode("%20%use")
>Error in rawToChar(out) : embedded nul in string: ' \0e'
>In addition: Warning message:
>In URLdecode("%20%use") :
>  out-of-range values treated as 0 in coercion to raw
>
>I'm a relative newb to encodings, so maybe the fault is simply in my
>understanding of how this should work, but - why are both strings being
>read as including nuls, despite having different values? And how would I go
>about removing said nuls?


