[R-pkg-devel] Sanitize Input Code for a Shiny App

Sun Feb 26 22:42:35 CET 2023

On Sun, 26 Feb 2023 14:36:22 -0500
<bill using denney.ws> wrote:

> What I'd like to be able to do is to sanitize the inputs to ensure
> that it won't do things including installing packages, running system
> commands, reading and writing to the filesystem, and accessing the
> network.  I'd like to allow the user to do almost anything they want
> within R, so making a list of acceptable commands is not
> accomplishing the goal.  

How untrusted will the user be?

You will most likely have to forbid Rcpp and any similar packages (and
likely anything that depends on them), because running arbitrary C++
will definitely include system commands, filesystem I/O and sockets.

Come to think of it, you will also have to forbid dyn.load(): otherwise
it's relatively easy to compile an ELF shared object that would load on any
64-bit Linux and provide an entry point that performs a system call of
the user's choice.

While the base R packages set R_forceSymbols(dll, TRUE) on their
.C/.Call/.Fortran/.External routines, making them unavailable to a bare
.Call("symbol_name", whatever_arguments, ...), some of the packages
deemed acceptable might provide C symbols without disabling the dynamic
lookup, and those C symbols may be convinced to run user-supplied code
either by mistake of the programmer (user supplies data, programmer
misses a buffer overrun or a use-after-free, suddenly data gets
executed as code; normally not considered a security issue because "the
user is already on the inner side of the airtight hatchway and could
have run system()") or even by design (the C programmer relied on the
R code to provide safety features). CRAN checks with Valgrind and
address-/undefined behaviour sanitizers help a lot against these kinds
of C bugs, but they aren't 100% effective. Either way, top-level .C and
friends must be forbidden too. (Forbidding them anywhere in the call
chain will break almost all of R, even lm() and read.table().)

A CRAN package deemed acceptable may still have system() shell
injections in it and other ways to convince it to run user code.

The triple-colon protection can probably be bypassed in lots of clever
ways, starting with a perfectly normal R function `get()` combined with
a package function that has the private symbols visible to it. Similar
techniques may bypass the top-level .C/dyn.load()/system() restrictions.
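
To make the get() point concrete, here is a minimal sketch. It assumes
only that the stats namespace contains at least one unexported
function; rather than name one, the code picks whichever it finds
first. The variable names are mine, chosen for illustration.

```r
# The namespace of 'stats', reached through the enclosing environment of
# an ordinary exported function. No `:::` is involved.
ns <- environment(stats::lm)

# Names defined in the namespace but not exported, i.e. the ones that
# would normally need `:::` to reach.
private_names <- setdiff(ls(ns, all.names = TRUE),
                         getNamespaceExports("stats"))

# Keep only the functions and take the first one as a stand-in target.
is_fun <- vapply(private_names,
                 function(nm) is.function(get(nm, envir = ns)),
                 logical(1))
target <- private_names[is_fun][1]

# A perfectly normal get() call now returns the private function.
f <- get(target, envir = ns)
```

A filter that only looks for the `:::` operator in the user's code would
see nothing suspicious here.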

Will the user be allowed to upload data? I think that R's serialization
engine is not designed to handle untrusted data. It has known stack
overflows [*], so a dedicated attacker is likely to be able to find a
code execution pathway from unserialize().

R code is also data, so a "forbidden" function or call could be
smuggled as data, parsed/deserialized and then evaluated. There are
standard classes in at least one of the base packages that use a lot of
closures, so a print() call on a doctored object can already be running
user-supplied code. You may forbid parse(), but some well-meaning code
inside R or a CRAN package may be convinced to run as.formula() on a
string, which may be enough for an attacker.
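
As a harmless illustration of code travelling as data (a sketch; the
object layout and the name `callback` are made up for the example), a
closure survives a round trip through serialize()/unserialize(), and
whoever calls it afterwards runs code they never saw in the source
text:

```r
# An innocent-looking data object with a function tucked inside it.
payload <- list(values = 1:3,
                callback = function() "user-supplied code ran")

# Round trip through R's serialization format, as an uploaded .rds file
# or a saved workspace would do.
bytes    <- serialize(payload, connection = NULL)
restored <- unserialize(bytes)

# The consumer contains no parse() or eval() call, yet it executes code
# that arrived as bytes.
result <- restored$callback()
```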

Some simple restrictions may be implemented by recursively walking the
return value of parse(user_input), but that ignores the possibility of
attacks on the parser itself. More significant restrictions may require
you to patch R and somehow determine at the point of an unsafe
(socket() / dlopen() / system() / ...) call whether it's
user-controlled or required for R to function normally (or both?).
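
A minimal sketch of the "walk the parse tree" approach, to show both
what it catches and how easily it is evaded. The deny-list and the
function name find_forbidden() are hypothetical; all.names() from base
R does the recursive walk.

```r
# Hypothetical deny-list; a real one would need to be far longer.
forbidden <- c("system", "system2", "dyn.load", ".C", ".Call",
               ".External", ".Fortran", ":::", "install.packages")

# Return the forbidden names that appear anywhere in the parsed input.
# The input is parsed but never evaluated.
find_forbidden <- function(user_input, deny = forbidden) {
  exprs <- parse(text = user_input, keep.source = FALSE)
  intersect(all.names(exprs, unique = TRUE), deny)
}

find_forbidden("x <- 1; system('id')")   # flags "system"
find_forbidden("stats:::Pillai")         # flags ":::"

# Not flagged: the function name is assembled from strings at run time,
# so it never appears as a symbol in the parse tree.
find_forbidden("do.call(paste0('sys', 'tem'), list('id'))")
```

The last call is the problem in a nutshell: the check sees names, not
behaviour, so anything that computes the name at run time passes.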

I think it's a better idea to put the user-controlled process in a
virtual machine or a container or use a sandbox such as firejail or
bubblewrap. Make R run as an unprivileged user (this matters much less
once inside a sandbox, but let's not help the attacker skip privilege
escalation exploits before they try sandbox breakout exploits). Put
some firewall rules into place. Some attackers will doubtless just try
to run a cryptocurrency miner rather than destroy the system, send
spam or breach your data, so make sure to limit CPU time, the size of
the temporary storage you're giving to the process and the amount of
RAM it can allocate.
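
For the bubblewrap variant, a rough sketch of how the Shiny side might
assemble the command line. Everything here is an assumption to adapt:
the helper name build_bwrap_args(), the read-only paths (which depend
on where your distribution keeps R and its libraries) and the /job.R
mount point. The function only builds the argument vector and runs
nothing.

```r
# Build the arguments for a bubblewrap ("bwrap") call that runs Rscript
# on a single user-supplied script file.
build_bwrap_args <- function(script,
                             ro_paths = c("/usr", "/etc"),  # assumed
                             work_dir = "/tmp") {
  # Each read-only path is mounted at the same location in the sandbox.
  ro_binds <- unlist(lapply(ro_paths, function(p) c("--ro-bind", p, p)))
  c("--unshare-all",               # fresh namespaces, no network
    "--die-with-parent",           # kill the sandbox if the parent exits
    ro_binds,
    "--ro-bind", script, "/job.R", # expose only the script itself
    "--proc", "/proc",
    "--dev", "/dev",
    "--tmpfs", work_dir,           # scratch space discarded afterwards
    "--chdir", work_dir,
    "Rscript", "--vanilla", "/job.R")
}

args <- build_bwrap_args("/srv/jobs/user_code.R")
# To run it with a 60-second wall-clock limit, one could then call:
# system2("bwrap", args, timeout = 60)
```

This covers the filesystem, the network and elapsed time only. Limits
on CPU time, RAM and the size of the scratch space still have to come
from the operating system (ulimit, cgroups or similar) and are not
shown here.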

In a game of walls and ladders, nothing will give you perfect
protection from attackers breaking out of such a sandbox (have I
mentioned the Spectre-class attack on the branch predictor in most
modern CPUs that makes it (very very hard but) possible to read from
address space not belonging to your process? The rowhammer attack on
non-ECC DRAM that (probabilistically) overwrites memory that is not
yours?), but in this case, the operating system is in a much better
position to restrict the user than R itself. Some languages (like Lua)
have such a tiny standard library that putting them in a sandbox is a
simple exercise, though both parser and bytecode execution
vulnerabilities (which might lead to code execution) still surface from
time to time. (SQLite went through a lot of fuzz-testing before its
authors could say that both its SQL parser and its database file parser
were safe against attacker-supplied input.) R (and the packages you
might want to run) just weren't designed with the idea of being a
security boundary on the user input side.

The JavaScript engines running in our browsers (well, the 2.5 browser
engines that are left) try to provide both the sandbox and the
performance to go with it, at the cost of a never-ending stream of
client-side vulnerabilities that have to be frequently patched. The
Python project now has Pyodide <https://pyodide.org/en/stable/>, an
interpreter that runs in the client's browser instead of your server,
which shifts the costs and the liabilities of user-supplied code onto
the user. Maybe it's time to try to compile R for the browser?

-- 
Best regards,
Ivan

[*] https://bugs.r-project.org/show_bug.cgi?id=16034