[Rd] Bug in URLencode and patch

Sun Jan 11 14:34:34 CET 2015

I believe the implementation of utils::URLencode is non-compliant with
RFC 3986, which it claims to implement
(http://tools.ietf.org/html/rfc3986). Specifically, its percent
encoding uses lowercase letters a-f, whereas it should use uppercase
letters A-F.

Here's what URLencode currently produces:

library("utils")
URLencode("*+,;=:/?", reserved = TRUE)
# "%2a%2b%2c%3b%3d%3a%2f%3f"

According to RFC 3986 (references below), these should be uppercase:

toupper(URLencode("*+,;=:/?", reserved = TRUE))
# "%2A%2B%2C%3B%3D%3A%2F%3F"
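
Note that calling toupper() on the whole result is only safe here
because the input contains no letters; it would also uppercase any
unreserved letters in a real URL. As a stopgap, here is a minimal
sketch that uppercases only the hex digits of percent-encoded triplets.
The name upper_pct is hypothetical (it is not part of utils), and it
assumes the input has already been percent-encoded:

```r
# Hypothetical helper, not part of utils: uppercase only the two hex
# digits that follow a "%", leaving every other character untouched.
upper_pct <- function(x) {
    # With perl = TRUE, "\\U" in the replacement uppercases what follows,
    # here the captured pair of hex digits.
    gsub("%([0-9a-fA-F]{2})", "%\\U\\1", x, perl = TRUE)
}

upper_pct("abc%3adef%2f")
# [1] "abc%3Adef%2F"
```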

This is a problem for me because I'm working with a web API that
authenticates using, in part, a hash of the URL-escaped query
arguments, and this bug yields different hashes even though the URLs
are equivalent under RFC 3986. Here's a trivial example using just a
colon:

library("digest")
URLencode(":", reserved = TRUE)
# [1] "%3a"
digest("%3a")
# [1] "77fff19a933ae715d006469545892caf"
digest("%3A")
# [1] "8f270f6ac6fe3260f52293ea1d911093"

As an aside, I know that RCurl::curlEscape implements this correctly,
but I don't see any reason why URLencode shouldn't comply with RFC
3986.

The fix should be relatively simple. Here's updated code for URLencode
that simply adds a call to `toupper`:

function (URL, reserved = FALSE)
{
    OK <- paste0("[^", if (!reserved)
        "][!$&'()*+,;=:/?@#", "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
        "abcdefghijklmnopqrstuvwxyz0123456789._~-",
        "]")
    x <- strsplit(URL, "")[[1L]]
    z <- grep(OK, x)
    if (length(z)) {
        y <- sapply(x[z], function(x) paste0("%",
            toupper(as.character(charToRaw(x))),
            collapse = ""))
        x[z] <- y
    }
    paste(x, collapse = "")
}
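
To make the changed step easier to check in isolation, here is a small
sketch of what the inner anonymous function does for a single
character. The name encode_char is hypothetical and used only for
illustration; the second call assumes the string is stored as UTF-8,
which is the case for a "\u" escape:

```r
# Hypothetical stand-alone version of the per-character step above.
encode_char <- function(ch) {
    # charToRaw() yields one raw byte per octet; as.character() renders
    # each byte as two lowercase hex digits, which toupper() then fixes.
    hex <- toupper(as.character(charToRaw(ch)))
    # A multi-byte character produces several triplets, hence collapse.
    paste0("%", hex, collapse = "")
}

encode_char(":")
# [1] "%3A"
encode_char("\u00e9")   # e-acute, two bytes in UTF-8
# [1] "%C3%A9"
```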

The relevant parts of RFC 3986 are (emphasis added):
2.1: "The uppercase hexadecimal digits 'A' through 'F' are equivalent
to the lowercase digits 'a' through 'f', respectively.  If two URIs
differ only in the case of hexadecimal digits used in percent-encoded
octets, they are equivalent.  For consistency, URI producers and
normalizers should use **uppercase** hexadecimal digits for all
percent-encodings."

6.2.2.1: "For all URIs, the hexadecimal digits within a
percent-encoding triplet (e.g., "%3a" versus "%3A") are
case-insensitive and therefore should be normalized to use
**uppercase** letters for the digits A-F."

Best,
-Thomas

Thomas J. Leeper
http://www.thomasleeper.com