Signs of Triviality

Opinions, mostly my own, on the importance of being and other things.

Contain yourself!

A look at ways to restrict Unix processes

November 28th, 2018

The following are the notes used for a lecture on restrictions available to Unix processes, first created for my Advanced Programming in the UNIX Environment class in November 2018.


A shopping bag with the words 'contain yourself' on it.

The Unix family of operating systems has been, by nature and from its first conception, a multitasking, multiuser system. This implies the need for a number of concepts, such as separate accounts, user privileges, file permissions, process ownership, etc. In addition, as we've discussed repeatedly, all resources available to the system are finite in nature, so there is an inherent competition over these resources: CPU time and memory are limited, disks fill up, processes run out of file descriptors, etc.

Some of these resources are managed by the system: for example, the scheduler places different processes on the available CPUs using an algorithm that ensures that no process is starved or overuses the resources. Disk space is finite, but the system may enforce e.g., user quotas to ensure that no single user can fill up all available disk space or use more than their share. The filesystem itself reserves a certain number of inodes for the superuser / system usage, so that even a completely filled up system can still run, at least until the administrator can clean things up.

Beyond this, we've also discussed a number of implications surrounding user privileges and how processes may influence one another. In this lecture, we'll take a look at the various ways in which processes can be restricted from (negatively) impacting one another. We will talk about the basic mechanisms we have available across the Unix systems as well as some mechanisms which build on top of these to create more restricted environments, eventually leading to the concept of sandboxing, Operating-System-level virtualization and containers.

Some of these approaches are operating system specific, while others utilize common system calls to reach their goal. In addition, standardized approaches and de-facto standards also come into play.


What we know so far...

What have we learned so far?

As so often, it's useful to start by reviewing what we already know. You should find that many of the things we've discussed throughout the semester are directly relevant to this topic. For example, in Lecture 02, we looked at how the system may limit the number of file descriptors a process may have open: the openmax.c program illustrated that there may be a per-process resource limitation (retrieved via getrlimit(2)), a system-wide defined value hard-coded into the kernel or derived from a fixed header (i.e., OPEN_MAX from sys/syslimits.h), as well as a system tunable configuration option, possibly changing at runtime from invocation to invocation (i.e., _SC_OPEN_MAX from sysconf(3)).
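
As a rough illustration of two of these interfaces, here is a minimal Python sketch (it is not the openmax.c program from the lecture) that reads the per-process limit via getrlimit(2) and the runtime value via sysconf(3). The function name open_file_limits is made up for this example.

```python
import os
import resource


def open_file_limits():
    """Return the open-file limits as reported by two different interfaces."""
    # getrlimit(2): a (soft, hard) pair; resource.RLIM_INFINITY means "unlimited".
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # sysconf(3): the runtime value of _SC_OPEN_MAX.
    sc_open_max = os.sysconf("SC_OPEN_MAX")
    return {"rlimit_soft": soft, "rlimit_hard": hard, "sysconf": sc_open_max}


if __name__ == "__main__":
    for name, value in open_file_limits().items():
        print(f"{name}: {value}")
```

On many systems the sysconf(3) value simply mirrors the soft resource limit, but that is an observation, not a guarantee.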

As we discussed the Unix file system, we identified the basic Unix access semantics for file access: user, group, other. Together with the access logic for directory access outlined in Lecture 03, this simple model allows us to restrict what resources in the file system a process may access. For example, we can allow a web server to serve content from my ~/public_html/ directory without allowing any user to list the contents of my home directory by setting the following permissions:

$ pwd
/home/jschauma
$ ls -ld . public_html
drwx-----x  52 jschauma  users  3584 Nov 27 18:17 .
drwx-----x  37 jschauma  users  2048 Nov 15 23:07 public_html
$ 
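
To make the effect of these mode bits explicit, the following sketch spells out which directory operations each class of user gets. It is a simplified model of the kernel's check (it ignores the superuser and ACLs), and the function name and the operation labels are invented for this example.

```python
import stat


def dir_access(mode, who):
    """Return the directory operations that `mode` grants to `who`.

    `who` is "user", "group", or "other"; exactly one class applies to a
    process, chosen by its effective UID and group memberships.
    """
    bits = {
        "user": (stat.S_IRUSR, stat.S_IWUSR, stat.S_IXUSR),
        "group": (stat.S_IRGRP, stat.S_IWGRP, stat.S_IXGRP),
        "other": (stat.S_IROTH, stat.S_IWOTH, stat.S_IXOTH),
    }
    read, write, execute = bits[who]
    allowed = set()
    if mode & read:
        allowed.add("list")        # read the names stored in the directory
    if mode & execute:
        allowed.add("traverse")    # look up a name you already know
    if (mode & write) and (mode & execute):
        allowed.add("modify")      # create, remove, or rename entries
    return allowed


if __name__ == "__main__":
    for who in ("user", "group", "other"):
        print(who, sorted(dir_access(0o701, who)))   # 0o701 is drwx-----x
```

With mode 0701, "other" may traverse but not list, which is what lets the web server reach public_html by name. Note that members of the owning group get nothing at all, since the group class is checked instead of "other".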

However, the granularity of this model is somewhat limited, as it only allows distinguishing amongst these three sets of users: owner, group, and everybody else. Even though a user may be a member of multiple groups, there are different limits on the number of groups you may be a member of. On a system that uses e.g., NFS, you may be restricted to only 16 groups!

What's more, access control via group membership is particularly cumbersome: for starters, unlike a user, who may be a member of several groups, a file can only belong to a single group. So if you want to share your file with members of group A and B, but not with everybody else, you're out of luck. In addition, any changes to group membership require action by the administrator of the system; users can't self-control their group membership or create new groups ad-hoc.


POSIX.1e Access Control Lists (ACLs)

Illustration of a normal and a torn ACL (Anterior Cruciate Ligament) in the knee.

Several Unix filesystems overcome this restriction through the use of so-called Access Control Lists, or ACLs, most notably via a POSIX extension (POSIX.1e). These ACLs allow the user to specify the access they wish to grant with finer granularity. In the default 'ls(1)' output, the presence of extended filesystem attributes (EAs, which ACLs are generally implemented as) is usually indicated by a '+' sign after the usual permissions string, and users interact with ACLs via the setfacl(1) and getfacl(1) tools:

$ whoami
jschauma
$ groups
professor abcxyz null nova one threedot sigsegv flag
$ ls -l hole.c
-rw------- 1 jschauma professor 984 Sep 10 19:50 hole.c
$ getfacl hole.c
# file: hole.c
# owner: jschauma
# group: professor
user::rw-
group::---
other::---
$ setfacl -m g:student:r hole.c
setfacl: hole.c: Operation not supported
$ # Whoops! The filesystem we're on doesn't support ACLs!
$ # Let's try again on a local filesystem:
$ cp hole.c /tmp
$ cd /tmp
$ ls -l hole.c
-rw------- 1 jschauma professor 984 Nov 27 21:51 hole.c
$ setfacl -m g:student:r hole.c
$ getfacl hole.c
# file: hole.c
# owner: jschauma
# group: professor
user::rw-
group::---
group:student:r--
mask::r--
other::---
$ 

That is: the filesystem we're on must have support for ACLs, or else we can't set them (duh). Where there is support, we are now able to grant access to individuals as well as to members of other groups, including those we are not a member of (!). Neat!
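
The 'mask' entry in the getfacl(1) output above deserves a note: under POSIX.1e, it caps the permissions of the owning group and of all named users and groups, while the owner and 'other' entries are not affected. The following sketch parses getfacl(1)-style text and applies that rule. It is a simplified illustration (it does not handle default ACLs), and the function name is made up.

```python
SAMPLE = """\
# file: hole.c
# owner: jschauma
# group: professor
user::rw-
group::---
group:student:r--
mask::r--
other::---
"""


def effective_acl(getfacl_output):
    """Return {"tag:qualifier": effective permissions} for getfacl(1)-style text."""
    entries, mask = [], None
    for line in getfacl_output.splitlines():
        line = line.split("#", 1)[0].strip()     # drop "# file: ..." style comments
        if not line:
            continue
        tag, qualifier, perms = line.split(":")[:3]
        if tag == "mask":
            mask = perms
        entries.append((tag, qualifier, perms))

    effective = {}
    for tag, qualifier, perms in entries:
        owner_or_other = tag == "other" or (tag == "user" and qualifier == "")
        if mask is not None and tag != "mask" and not owner_or_other:
            # keep a permission only if both the entry and the mask grant it
            perms = "".join(p if m != "-" else "-" for p, m in zip(perms, mask))
        effective[f"{tag}:{qualifier}"] = perms
    return effective


if __name__ == "__main__":
    for entry, perms in effective_acl(SAMPLE).items():
        print(entry, perms)
```

So even if we had granted 'rw-' to group 'student', a mask of 'r--' would leave its members with read access only.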



Changing effective User-IDs

A woman pulling a mask off her face.

With ACLs, we can better control which users can access which files. Or rather, we can control which processes, when running with a given user's effective UID, can access which files. Recall also from Lecture 03 our discussion around the use of effective and real UIDs and how we can change or elevate privileges (setuid.c). A common example of this mechanism is a daemon that needs super-user privileges initially (for example, to bind a privileged port) and then drops its privileges to those of an unprivileged user (e.g., httpd, nobody, ...), an approach known as 'privilege separation'.
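
A minimal sketch of such a privilege drop might look as follows. It is not the setuid.c program from the lecture; the UID/GID 65534 (commonly 'nobody') is an assumption and differs between systems, and the demonstration runs in a forked child so that the calling process keeps its own credentials.

```python
import os


def drop_privileges(uid, gid):
    """Permanently switch to an unprivileged uid/gid; the order matters."""
    os.setgroups([])   # shed supplementary groups while we are still allowed to
    os.setgid(gid)     # group first: after setuid() we could no longer change it
    os.setuid(uid)     # when called as root, sets real, effective, and saved UID


def demo_drop(uid=65534, gid=65534):   # 65534: assumed ID of an unprivileged user
    """Drop privileges in a child and report whether root could be regained."""
    if os.geteuid() != 0:
        return "not root"              # nothing to drop
    pid = os.fork()
    if pid == 0:                       # child
        status = 2                     # 2: could not drop at all
        try:
            drop_privileges(uid, gid)
            status = 1                 # 1: dropped, but setuid(0) succeeded
            os.setuid(0)
        except OSError:
            if status == 1:
                status = 0             # 0: dropped, and setuid(0) was refused
        finally:
            os._exit(status)
    _, wait_status = os.waitpid(pid, 0)
    outcome = {0: "dropped", 1: "regained", 2: "failed"}
    return outcome.get(os.WEXITSTATUS(wait_status), "failed")


if __name__ == "__main__":
    print(demo_drop())
```

A real daemon would bind its privileged port before calling drop_privileges() and keep using the already open socket afterwards.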

We've also seen examples of changing our (effective) UID when using e.g., su(1). But being able to run su(1) relies on the password of that user and grants you full access to everything that user controls. That is, you can only allow another user to su(1) and thus to do anything and everything as the target user, but not restrict it to a subset of functionality or commands.

To selectively allow a single program to be run with the privileges of another user, we have seen many examples of using the setuid bit on an executable, but that is again somewhat clumsy as it then allows anybody who can execute the file to run it with the owner's privileges.

To overcome some of these limitations as well as to provide a bit more control, logging, and additional protections around the changing of eUIDs or running programs with the privileges of another user, many Unix systems use the sudo(8) utility. Although not part of every Unix system, it is the most commonly used mechanism here, allowing for fine-grained control and careful definitions of which user may execute which commands with which other user's eUID.

However, sudo(8) turns out to be a complex beast, and configuring it correctly and securely isn't easy. All too often, trying to restrict privileges to a set of commands results in the user being able to break out of one of them, gaining more elevated privileges than intended. This is a common theme in trying to restrict processes, and something we will see again further down. For the time being, let's note this difficulty together with another requirement for using sudo(8): you need to actually have an interactive shell, able to run the various commands to begin with. (It is by and large impractical to configure a user account such that every command they wish to run requires sudo(8) authentication, although it's worth noting that, just like su(1), sudo(8) is not limited to granting elevated privileges, but can be used to grant access to any user account, not just root's.)



Securelevels and mount options

A Super-Mario screen.

Before we further explore the ways in which we can restrict processes from running certain commands, let us take a brief detour back to the filesystem level as well as what changes can be made to the system as a whole.

One of the challenges we face is the existence and requirement for a superuser. With eUID 0, a process can do anything on the system -- this is by design as well as necessity. However, we also have a need to run certain services with such elevated privileges, and yet we want to be able to limit the damage a rogue eUID 0 process can wreak.

One solution to this problem is the use of so-called 'securelevels', which have been around in the BSD family of Unix systems for a while, and for which e.g., Linux later gained support. In effect, you are configuring the system such that certain things are not possible without lowering the 'securelevel' of the OS. Lowering the 'securelevel' of the OS requires a system reboot, however, which (a) is noisy and more likely to be detected, and (b) terminates any current connections you may have (thereby requiring persistent access to the system).

At the same time, you can raise the securelevel at any time, thereby allowing the system to bootstrap and finally give up privileges it no longer needs. This concept of voluntarily giving up privileges that you are then unable to regain after the fact is an important principle we will see throughout this lecture: the goal is to restrict yourself as well as any child processes so that even if you're compromised, the damage is limited.

The restrictions enforced by the different securelevels are listed in the manual page secmodel_securelevel(9). Some of the more interesting restrictions are those relating to 'append' and 'immutable' file flags as well as the ability to mount, unmount, or remount filesystems.

Which gets us to a few aspects of filesystems we have not yet covered (in detail): in addition to ACLs, a filesystem may also support "file flags" or "file attributes", a way to, for example, indicate that a file may not be changed at all, not even by its owner or by root. Such a file is called immutable; once this flag is set, only root can unset it (provided that the securelevel allows this operation!).

These file flags or attributes are implemented on the BSD derived systems via the chflags(2) system call with the corresponding chflags(1) command (ls(1) supports an option to display these flags), while on Linux there exist the command-line utilities chattr(1) and lsattr(1):

$ echo foo > append-only
$ chflags uappend append-only
$ echo bar > append-only 
ksh: cannot create append-only: Operation not permitted
$ echo bar >>append-only 
$ cat append-only 
foo
bar
$ echo "you can't touch this" >hammertime
$ su root -c "chflags schg hammertime"
$ touch hammertime 
touch: hammertime: Operation not permitted
$ echo stop >>hammertime
ksh: cannot create hammertime: Operation not permitted
$ rm -f hammertime
rm: hammertime: Operation not permitted
$ su root -c "rm -f hammertime"
rm: hammertime: Operation not permitted
$ echo "I told you:"; cat hammertime
I told you:
you can't touch this
$ ls -lo append-only hammertime 
-rw-r--r--  1 jschauma  wheel  uappnd  8 Nov 28 01:04 append-only
-rw-r--r--  1 jschauma  wheel  schg   21 Nov 28 01:05 hammertime
$ 
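
For illustration, here is how the flag names shown by 'ls -lo' map onto the bit values used by chflags(2). The sketch relies on the constants in Python's stat module; os.chflags() and the st_flags field only exist on BSD-derived systems, so make_append_only() (a name invented here) simply reports False elsewhere.

```python
import os
import stat

# (bit, name as printed by 'ls -lo' on BSD systems)
FLAG_NAMES = [
    (stat.UF_APPEND, "uappnd"),     # user append-only
    (stat.UF_IMMUTABLE, "uchg"),    # user immutable
    (stat.SF_APPEND, "sappnd"),     # system append-only
    (stat.SF_IMMUTABLE, "schg"),    # system immutable
]


def flag_names(st_flags):
    """Translate an st_flags bit field into the familiar short names."""
    return [name for bit, name in FLAG_NAMES if st_flags & bit]


def make_append_only(path):
    """Set the user append-only flag; returns False where chflags(2) is missing."""
    if not hasattr(os, "chflags"):
        return False
    os.chflags(path, os.stat(path).st_flags | stat.UF_APPEND)
    return True


if __name__ == "__main__":
    print(flag_names(stat.UF_APPEND))       # ['uappnd']
    print(flag_names(stat.SF_IMMUTABLE))    # ['schg']
```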

So we have ways to mark e.g., individual files as being append only, but if we want to extend this to larger parts of the filesystem, a more efficient method would be to use mount options. Since it is unlikely that anything under e.g., /usr needs to change during regular system operations, we could prevent any possibly compromised process -- even one running with eUID 0! -- from installing a backdoor into /usr/bin by marking the entire filesystem hierarchy as read-only.

We briefly discussed the benefits of having different partitions for different components of the filesystem hierarchy, but here we see clear benefits from separating variable data from static data: we can, for example, mount the entire base filesystem as read-only; we can mount the data partitions as noexec; and we can mount partitions containing e.g., user executables nosuid. With these changes being protected from being reversed even by root by way of a raised securelevel, we have significantly hardened our systems and protected against even a compromised eUID 0 process. We will revisit how to use some of these options further below when we combine them with other techniques.
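
A process can check which of these mount options apply to a given path via statvfs(3). The sketch below is one way to do that; the function name is hypothetical, and since not every platform defines every flag, missing constants are skipped.

```python
import os


def mount_restrictions(path):
    """Return the subset of {'ro', 'nosuid', 'noexec'} in effect for path."""
    flags = os.statvfs(path).f_flag
    known = {"ro": "ST_RDONLY", "nosuid": "ST_NOSUID", "noexec": "ST_NOEXEC"}
    found = set()
    for option, constant in known.items():
        bit = getattr(os, constant, None)    # e.g. ST_NOEXEC is not defined everywhere
        if bit is not None and flags & bit:
            found.add(option)
    return found


if __name__ == "__main__":
    for directory in ("/", "/usr", "/tmp"):
        print(directory, sorted(mount_restrictions(directory)))
```

On a system hardened as described above, /usr would report 'ro' and a data partition would report 'noexec'.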



Restricted Shells

Several sea shells on strings.

Ok, back to our problem of restricting a user process and the commands it should be able to execute...

In some circumstances, it's necessary to provide an interactive shell to a user, but to limit what they can do. Providing a full, interactive, regular shell would allow the user to perform many tasks that we may not wish to let them do. Based on the principle of Least Privilege, if a user should only be able to run one of ten or twelve commands, then we should not offer them a full, interactive login with access to all commands under /bin, /usr/bin, etc.

Enter the concept of a restricted shell. A restricted shell is, generally speaking, any shell that limits the user's ability to execute commands. Having taken a look at how a shell is implemented, we should already have an idea of how this might be accomplished. Just as with a regular shell, there are different variations of the concept of a restricted shell. They usually are invoked as 'r${SHELL}' (e.g., rsh, although confusingly the most common 'rsh' is actually a 'remote shell') or with the '-r' flag. The Bourne, Korn, and Bourne-again shells all support a restricted invocation.

When running in restricted mode, these shells prohibit, amongst other things:

  • changing the current working directory (i.e., no 'cd')
  • changing the ENV, PATH, and SHELL environment variables
  • specifying commands containing a '/' (i.e., only commands found in the (fixed) PATH can be executed)
  • redirecting output into files

With these restrictions, you can reasonably control which commands the user can invoke by placing only the executables you want them to run in a specific location and setting the PATH accordingly prior to invoking the restricted shell.

Note: as with the case of configuring sudo(8), it is up to the administrator to know and understand the commands they let the user execute in a restricted environment. Many commands can be used to shell out, to run other programs. Any command thusly invoked would not be restricted in the same manner. For starters, you'd have to make sure to not let an unrestricted shell remain in the PATH available to the user:

$ ksh -r
restricted$ cd /
ksh: cd: restricted shell - can't cd
restricted$ /bin/csh
ksh: /bin/csh: restricted
restricted$ csh
% cd /
% pwd
/
% 

Likewise, many editors allow you to invoke external commands:

$ ksh -r
restricted$ cd / && /bin/ls
ksh: cd: restricted shell - can't cd
restricted$ vi

~
~
/tmp/vi.r5BVIa: new file: line 1
:!cd / && /bin/ls
altroot   boot.cfg  home      lib       misc      rescue    stand     var
bin       dev       htdocs    libdata   mnt       root      tmp
boot      etc       kern      libexec   oldroot   sbin      usr
Press any key to continue [: to enter more ex commands]: 

In other words, you'll have to create a separate directory containing just the binaries you want to allow, review that those binaries cannot be abused or broken out of, set the PATH to this new directory, and then invoke the restricted shell. Such a carefully constructed environment for a restricted shell will certainly confine the user such that they can only run the commands you grant them access to.
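
The setup steps just described could be sketched like this. The directory name and the list of allowed commands are placeholders, and the sketch only prepares the environment; reviewing the commands for shell escapes remains a manual task.

```python
import os
import shutil


def build_restricted_env(bin_dir, allowed):
    """Fill bin_dir with links to the allowed commands and return a minimal environment."""
    bin_dir = os.path.abspath(bin_dir)
    os.makedirs(bin_dir, exist_ok=True)
    for name in allowed:
        target = shutil.which(name)          # locate the real executable
        if target is None:
            raise FileNotFoundError(name)
        link = os.path.join(bin_dir, name)
        if not os.path.lexists(link):
            os.symlink(target, link)
    # PATH contains nothing but our directory; the restricted shell then
    # refuses to let the user change it or to run commands containing '/'.
    return {"PATH": bin_dir}


if __name__ == "__main__":
    env = build_restricted_env("/tmp/rbin", ["ls", "date"])   # placeholder values
    print(env)
    # To enter the restricted shell one might then call, for example:
    #     os.execve("/bin/ksh", ["ksh", "-r"], env)
    # where the location of ksh is an assumption about the local system.
```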



Chroot

Illustration of how to repot a plant.

At the same time, though, you may have a need to allow the user to change directories, to perform I/O on local files, or to more generally interact with parts of the filesystem without making available the entire filesystem.

ACLs and Unix permissions could work here, but would require a significant amount of careful and tedious effort to identify the right groups and permissions without breaking the normal operations of the system. And of course you'd also have to rely on there not being any flaws in the tools you allow the user to execute such that they might be granted access to other files.

To overcome this problem, it'd be useful to expose a restricted copy or view of the filesystem to the process: similar to how you might populate a custom PATH for a restricted shell, you could construct a filesystem containing the necessary files and restrict the user to only operate within the confines of this changed root. Enter the chroot(8) command and chroot(2) system call, added to Unix in 1979.

This is particularly useful because it lets you restrict even a process that needs to run with superuser privileges, a common need when running system daemons that you want to protect against exploitation by attackers. Note that a process can carry e.g., open file descriptors into the chroot, thereby allowing improved privilege separation even for a superuser process. This, however, also carries a risk: any file descriptor opened prior to the chroot(2) call may still provide access, from within the chroot, to resources outside of it.

In fact, this approach was used early on to break out of a chroot. Try running the break-chroot.c program on a Unix system of your choice to see whether it currently protects against the trick of calling fchdir(2) with a file descriptor pointing outside of the chroot. (Most modern Unix versions are able to detect this and will prevent you from escaping your chroot.)

Note that after calling chroot(2), the process will view the given directory as the root of the entire filesystem. That means that any and all operations subsequently invoked will try to look for any required files under this directory. This includes executables, configuration files, and shared libraries for any dynamically linked executables you're trying to invoke as well as any absolute paths specified.

Let's create a minimal chroot that allows the user only to run the commands id(1), ps(1), and sh(1). Since these executables are dynamically linked, we need to copy all the required shared libraries as well as the dynamic link loader (review Lecture 11). Then, we enter the chroot and run some commands:

$ cat >mkchroot <<"EOF"
CHROOT="/tmp/chroot"

FILES="/bin/sh /bin/ps /usr/bin/id"

rm -fr ${CHROOT}
mkdir -p ${CHROOT}/libexec ${CHROOT}/usr/libexec

cp /libexec/ld.elf_so ${CHROOT}/libexec
ln /libexec/ld.elf_so ${CHROOT}/usr/libexec/ld.elf_so

for f in ${FILES}; do
        mkdir -p ${CHROOT}${f%/*}
        for lib in $(ldd ${f} | sed -n -e 's/.*> //p'); do
                mkdir -p ${CHROOT}${lib%/*}
                test -f ${CHROOT}${lib} || cp ${lib} ${CHROOT}${lib}
        done
        cp "${f}" ${CHROOT}/bin/
done
EOF
$ sh mkchroot
$ su root -c "chroot /tmp/chroot /bin/sh"
# echo "We're in the chroot!"
We're in the chroot!
# pwd
/
# ls
ls: not found
# echo *
bin lib libexec usr
# echo bin/*
bin/id bin/ps bin/sh
# id
uid=0 gid=0 groups=0,2,3,4,5,20,31
# cd /usr/bin
# pwd
/usr/bin
# echo *
*
# ps
 PID TTY   STAT    TIME COMMAND
1296 pts/0 S    0:00.00 /bin/sh 
1340 pts/0 S    0:00.01 sh -c chroot /tmp/chroot /bin/sh 
1941 pts/0 O+   0:00.00 ps 
 760 ?     Is+  0:00.00 /usr/libexec/getty Pc console 
 558 ?     Is+  0:00.00 /usr/libexec/getty Pc ttyE1 
 772 ?     Is+  0:00.00 /usr/libexec/getty Pc ttyE2 
 739 ?     Is+  0:00.00 /usr/libexec/getty Pc ttyE3 
# exit
$ 

Note that after we are in the chroot, we can't invoke e.g., ls(1), since we didn't copy that executable into the chroot. But we can use the shell builtin commands, such as pwd, cd, or echo.

Note also that the output of the id(1) command gives us numeric user IDs only; the file needed to translate UIDs to usernames (i.e., /etc/passwd) does not exist inside the chroot.

Finally, while we are able to restrict the process to a very limited view of the filesystem and are able to tightly restrict what commands it can invoke, we see that from inside the chroot we are still able to see e.g., process information for processes outside of the chroot! We probably want to be able to restrict that, too...
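
The chroot(2) system call itself can be exercised with a few lines of code. The following sketch enters a chroot in a forked child and reports what "/" looks like from the inside. The function name is invented, and since chroot(2) requires superuser privileges, the function returns None when the call is refused.

```python
import os
import tempfile


def listing_inside_chroot(new_root):
    """chroot(2) into new_root in a child and return the names visible under '/'."""
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:                                    # child
        try:
            os.close(read_end)
            try:
                os.chroot(new_root)
                os.chdir("/")                       # don't keep a cwd outside the new root
                names = sorted(os.listdir("/"))
                os.write(write_end, ("ok\n" + "\n".join(names)).encode())
            except OSError:
                os.write(write_end, b"denied")      # e.g. not running as root
        finally:
            os._exit(0)
    os.close(write_end)
    with os.fdopen(read_end, "rb") as pipe:
        data = pipe.read().decode()
    os.waitpid(pid, 0)
    if not data.startswith("ok\n"):
        return None
    rest = data[3:]
    return rest.split("\n") if rest else []


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    open(os.path.join(root, "hello.txt"), "w").close()
    print(listing_inside_chroot(root))
```

The chdir("/") right after the chroot(2) call matters: without it, the working directory would still refer to a location outside the new root.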



Jails

A jail cell.

To restrict a process not only to a particular view of the filesystem, as in the above chroot example, but also with respect to other system resources, the FreeBSD project added the jail(2) system call (and jail(8) utility) around 2000 or so. A jail confines the process such that, from within the jail, it's almost impossible to notice that you are not running on a real system. You don't get to see other processes, system accounts or uids, and of course you get your own chroot as well. In addition, a jail may be bound to a particular IP address, and network functionality is then also restricted to this address only.

In this fashion, a jail effectively implements a process sandbox, or virtual environment. You can even create jails for different versions of the parent OS, so long as your parent kernel is capable of running or emulating the environment. That is, on a FreeBSD 11 system, you could have a FreeBSD 10 jail to run an application that requires this version of the OS.

In addition to the chroot restrictions noted above, jails enforce:

  • per-jail process view
  • changing sysctls or securelevels is prohibited
  • modifying the network configuration is prohibited, raw sockets are disabled
  • mounting and unmounting filesystems is prohibited
  • mknod(2) is prohibited

Following jails, and utilizing the capabilities to clone, snapshot, and partition filesystems using its ZFS, Sun implemented Solaris Containers and Solaris Zones in 2004 / 2005, which take this concept further: each zone is an isolated execution environment providing lightweight virtualization while being bound to the defined restrictions.



Back to processes

Port Lympia of Nice.
By Tobi 87 - Own work, CC BY-SA 3.0

Even if we are able to restrict with fine granularity which users may execute which commands and, by way of file system permissions (extended or otherwise) or even through the use of a chroot or jail, which files may be accessed or which processes may be viewed, we still face a number of problems: we are still using the same resources (CPUs, memory, ...), all software contains bugs, and humans tend to misconfigure things, thereby oftentimes allowing a process to do things that it shouldn't be able to do, or e.g., to interfere with the normal operations of the system.

All processes, even those running in jails, effectively compete for the same resources and may continue to run forever, or consume certain resources to a degree that we'd rather they not. We've discussed some ways to restrict a given process's resource utilization in Lecture 06 by way of getrlimit(2) / setrlimit(2) and the ulimit shell builtin:

$ ulimit -a
time(cpu-seconds)    unlimited
file(blocks)         unlimited
coredump(blocks)     unlimited
data(kbytes)         262144
stack(kbytes)        4096
lockedmem(kbytes)    2026214
memory(kbytes)       6078644
nofiles(descriptors) 128
processes            160
threads              160
vmemory(kbytes)      unlimited
sbsize(bytes)        unlimited
$ 

The use of resource limits brings us back to a significant and important concept: self-restriction. That is, a process can voluntarily restrict its own usage such that it itself cannot later regain the privileges it had. This applies equally to any children this process may create, thereby allowing you to create more confined processes or process groups. (Similarly, dropping elevated privileges by calling setuid(2) means that you cannot regain your higher privileges.)
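
Here is a small sketch of such self-restriction using setrlimit(2): a child process lowers both its soft and hard limit on open files and then tries to raise the hard limit again. The function name and the default of 64 descriptors are arbitrary choices for this example.

```python
import os
import resource


def restrict_open_files(limit=64):
    """Lower RLIMIT_NOFILE in a child, then try to raise the hard limit again.

    Returns (soft, hard, regained) as observed inside the child.
    """
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:                                        # child
        try:
            os.close(read_end)
            _, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
            if hard != resource.RLIM_INFINITY:
                limit = min(limit, hard)                # we may only go down
            resource.setrlimit(resource.RLIMIT_NOFILE, (limit, limit))
            soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
            try:
                resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard + 1))
                regained = True                         # only expected with privileges
            except (ValueError, OSError):
                regained = False
            os.write(write_end, f"{soft} {hard} {int(regained)}".encode())
        finally:
            os._exit(0)
    os.close(write_end)
    with os.fdopen(read_end, "rb") as pipe:
        soft, hard, regained = pipe.read().split()
    os.waitpid(pid, 0)
    return int(soft), int(hard), bool(int(regained))


if __name__ == "__main__":
    print(restrict_open_files())
```

For an unprivileged process, the last element is expected to be False: once the hard limit is lowered, neither the process nor its children can raise it again.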

Now going back to the idea that processes may compete for resources, let's briefly take a look at how the scheduler works when placing processes on the CPU:

To put it in really simple terms, each process has a priority, and the scheduler goes through the list of processes sitting in the run queue. For example, the output of the w(1) command will show you the load averages, i.e., the average number of jobs in the run queue over the last 1, 5, and 15 minutes:

$ w | head -1
10:47  up 23:40, 6 users, load averages: 1.65 2.00 2.11
$ 
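
The same three numbers are available to programs via getloadavg(3); a minimal sketch (the wrapper function is only there to handle systems where the values cannot be obtained):

```python
import os


def load_averages():
    """Return the 1, 5, and 15 minute load averages, or None if unavailable."""
    try:
        return os.getloadavg()
    except OSError:
        return None


if __name__ == "__main__":
    print(load_averages())
```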

After a process gets on the CPU, it gets moved back to the end of the queue. In effect, there's a FIFO of processes waiting to get CPU time. Now if you look at your favorite CS Operating Systems lecture, you'll also have seen e.g., priority based scheduling algorithms, whereby each process also is assigned a given priority; processes with higher priority get precedence over those with lower when a CPU slot becomes available, while processes that have been waiting for a long time get their priority raised with time to avoid starvation.

Now on Unix systems, you can tell the scheduler the priority of your jobs. You may have some commands that you want to run with a higher priority than others, or you may want to run a command that may take a long time -- say, e.g., a backup process -- but you don't want to let it interfere with other activity on the system. To do this, you can use the nice(1) command.

Note that the value passed to nice(1) indicates the "niceness" of the program: a program that's nice will have a lowered priority compared to others; a program that's not nice will claim a higher priority. So to lower your priority -- that is, to be nice -- you provide a high value to nice(1).

Every process begins with a default priority of 0. A niceness of 19 or 20 indicates that the process will never get the CPU as long as there are processes in the run queue with a niceness of <= 0 (i.e., default priority or explicitly set higher priority). (All processes with another positive nice value will eventually get the CPU, as the scheduler may increase their priority based on how long they've been in the queue.)

As indicated earlier, many of the restrictions we're discussing here are such that you can restrict yourself, but afterwards not regain the previous privileges, and so it is with nice(1): a process cannot raise its priority (lower its "niceness") unless it's running with eUID 0. That is, if you begin a shell with a default priority of 0 and then run a command with a niceness of 10, then that command could further lower its priority and become even nicer, but cannot go back to a default niceness of 0.

The nice(1) utility is used to start a process with the given priority, but we may also have a need to adjust the niceness of processes already running. To do that, we can use the renice(1) utility.

Note that renice(1) can be used to adjust the priority not just of a single process, but also of a process group (revisit Lecture 07) or of all processes by a given owner, thereby allowing for careful adjustment of processes in a possibly overloaded system without having to terminate any processes.

To inspect the priority, use e.g., ps(1):

$ cat /dev/zero >/dev/null &
[1] 43
$ cat /dev/zero >/dev/null &
[2] 41
$ ps -l
 UID PID PPID   CPU PRI NI   VSZ  RSS WCHAN STAT TTY   TIME    COMMAND
1000  41  635 19097  34  0  7656  876 -     R    pts/0 0:05.13 cat /dev/zero 
1000  42  635     0  43  0 11996 1024 -     O+   pts/0 0:00.00 ps -l 
1000  43  635 18005  35  0 10244  876 -     R    pts/0 0:36.57 cat /dev/zero 
1000 635  707     0  85  0  7884 1288 pause Ss   pts/0 0:00.01 -ksh 
$ renice +10 41
41: old priority 0, new priority 10
$ ps -l
 UID PID PPID   CPU PRI NI   VSZ  RSS WCHAN STAT TTY   TIME    COMMAND
1000  41  635 18371  25 10  7656  876 -     RN   pts/0 0:27.66 cat /dev/zero
1000  43  635 34367  27  0 10244  876 -     R    pts/0 1:04.86 cat /dev/zero 
1000 477  635     0  43  0 12000 1040 -     O+   pts/0 0:00.00 ps -l 
1000 635  707     0  85  0  7884 1288 pause Ss   pts/0 0:00.01 -ksh 
$ renice -10 41
renice: 41: setpriority: Permission denied
$ su root -c "renice -10 43"
43: old priority 0, new priority -10
$ ps -l
 UID PID PPID   CPU PRI  NI   VSZ  RSS WCHAN STAT TTY   TIME    COMMAND
1000  41  635  1056  33  10  7656  876 -     RN   pts/0 0:44.29 cat /dev/zero 
1000  43  635 35092  36 -10 10244  876 -     R<   pts/0 2:03.00 cat /dev/zero 
1000 635  707     0  85   0  7884 1288 pause Ss   pts/0 0:00.01 -ksh 
1000 709  635     0  43   0 12000 1040 -     O+   pts/0 0:00.00 ps -l 
$ 

Here's another quick example of using the relevant system calls used to implement the nice(1) and renice(1) commands, i.e., getpriority(2) and setpriority(2): priority.c.

$ cc -Wall -Wextra -Werror priority.c 
$ ./a.out 
My current priority is: 0
My new priority is: 20
Unable to setpriority(): Permission denied
My priority still is: 20
$ nice -n 19 ./a.out 
My current priority is: 19
My new priority is: 20
Unable to setpriority(): Permission denied
My priority still is: 20
$ nice -n 20 ./a.out
My current priority is: 20
My new priority is: 20
My priority still is: 20
$ su root -c "nice -n 5  ./a.out"
My current priority is: 5
My new priority is: 20
My priority still is: 5
$ 
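
A comparable experiment can be sketched in Python (this is not priority.c, just an approximation of the same idea): a child process makes itself nicer and then attempts to return to its original niceness. The function name and the increment of 5 are arbitrary.

```python
import os


def nice_experiment(increment=5):
    """Become nicer in a child, then try to undo it.

    Returns (before, after, final) niceness values as seen by the child.
    """
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:                                       # child
        try:
            os.close(read_end)
            before = os.getpriority(os.PRIO_PROCESS, 0)
            after = os.nice(increment)                 # returns the new niceness
            try:
                os.setpriority(os.PRIO_PROCESS, 0, before)
            except OSError:
                pass                                   # "Permission denied" unless privileged
            final = os.getpriority(os.PRIO_PROCESS, 0)
            os.write(write_end, f"{before} {after} {final}".encode())
        finally:
            os._exit(0)
    os.close(write_end)
    with os.fdopen(read_end, "rb") as pipe:
        before, after, final = (int(v) for v in pipe.read().split())
    os.waitpid(pid, 0)
    return before, after, final


if __name__ == "__main__":
    print(nice_experiment())
```

Without superuser privileges, 'final' should equal 'after'; the maximum niceness is system dependent (e.g., 20 in the NetBSD output above, 19 on Linux), so 'after' may be clamped.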

Finally, the system administrator may choose to set the default priority for a given user via e.g., a login-time setting or enforcement. The methods here differ across the Unix versions; on NetBSD, login.conf(5) allows you to set this value for a login class, usually defined in the master.passwd file (see passwd(5)). On some Linux versions, this can be set via limits.conf(5); on FreeBSD, more fine-grained control of resource limits is also possible via rctl(1) / rctl(4).



CPU Affinity / cpusets

Futurama's Hermes sorting tubes.

Another way of controlling the distribution of processes across the available CPUs is by way of pinning a process or process group to a subset of the available CPUs. This is done by way of so-called "CPU sets" or by creating and assigning "CPU affinity". In addition, cpusets may also restrict both kernel and user memory.

Pinning a process or process group to a CPU can improve performance (since keeping a process on the same CPU allows it to take advantage of some cached properties), or ensure that e.g., the system processes are not starved, by setting aside a dedicated CPU for all system processes and placing all user processes on the remaining CPUs. My oldest memory here is having IRIX systems configured in this fashion to prevent runaway CS student homework assignments from interfering with the regular system operations and core services provided by a central shared server.

Unfortunately, the semantics of CPU sets are not standardized, and different Unix versions have implemented this concept in different ways, utilizing different (incompatible) APIs and command-line tools. On Linux, for example, much of the functionality is exposed to the users via a pseudo-filesystem, /dev/cpuset.

For example, the following sequence of commands (taken from the cpuset(7) manual page) will set up a cpuset named "Charlie", containing just CPUs 2 and 3, and memory node 1, and then attach the current shell to that cpuset.

$ mkdir /dev/cpuset
$ mount -t cpuset cpuset /dev/cpuset
$ cd /dev/cpuset
$ mkdir Charlie
$ cd Charlie
$ /bin/echo 2-3 > cpuset.cpus
$ /bin/echo 1 > cpuset.mems
$ /bin/echo $$ > tasks
# The current shell is now running in
# cpuset Charlie
# The next line should display '/Charlie'
$ cat /proc/self/cpuset 
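
On Linux, a process can also restrict itself without any cpuset, using the sched_setaffinity(2) system call. The following is a minimal sketch, not taken from the lecture, using Python's os.sched_getaffinity() and os.sched_setaffinity() wrappers; the function name pin_to_one_cpu is made up for illustration, and the code assumes a Linux system.

```python
import os

def pin_to_one_cpu():
    """Pin the calling process to a single CPU; return (before, after)."""
    before = os.sched_getaffinity(0)    # pid 0 means "the calling process"
    target = min(before)                # lowest-numbered CPU we may run on
    os.sched_setaffinity(0, {target})   # restrict ourselves to that one CPU
    after = os.sched_getaffinity(0)
    return before, after

if __name__ == "__main__":
    before, after = pin_to_one_cpu()
    print("allowed CPUs before:", sorted(before))
    print("allowed CPUs after: ", sorted(after))
    os.sched_setaffinity(0, before)     # undo the restriction again
```

Note that the affinity mask can only ever name CPUs that the enclosing cpuset permits, so the two mechanisms compose rather than compete.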

On FreeBSD, the cpuset(1) utility can be used to create and manage CPU sets:

     Create a new cpuset that is restricted to CPUs 0
     and 2 and move pid into the new set:
	cpuset -C -c -l 0,2 -p <pid> 

On NetBSD, the psrset(8) utility is used to control "processor sets", utilizing the pset(3) API, which was intended to be compatible with the Solaris and HP-UX variants of the time. (The cpuset(3) and affinity(3) library functions are thread-specific.)


Control Groups, Namespaces, Capabilities, and, finally, Containers

With all of the above, you can probably predict where we're headed now: in addition to restricting CPU usage as well as filesystem views, memory and process table access, we'll also want to restrict other capabilities such that we can better contain and control process groups. There are a large number of approaches to this, and many of them are implementations of the above or combinations of separate approaches.

One approach to defining the more generic requirements here is "POSIX Capabilities". In this model, rather than trying to specifically solve a given problem (as in the case of a restricted shell or a chroot), we identify the generic "capability" that a process needs and grant fine-grained access controls over these. For example, the following capabilities may be defined:

  • CAP_CHOWN - the ability to chown files
  • CAP_SETUID - allow setuid
  • CAP_LINUX_IMMUTABLE - allow append-only or immutable flags
  • CAP_NET_BIND_SERVICE - allow binding sockets to ports <1024
  • CAP_NET_BROADCAST - allow broadcast traffic
  • CAP_NET_ADMIN - allow interface configuration, routing table manipulation, ...
  • CAP_NET_RAW - raw packets
  • CAP_SYS_PTRACE - allow tracing of processes
  • CAP_SYS_ADMIN - broad sysadmin privs (mounting file systems, setting hostname, handling swap, ...)
  • CAP_SYS_TIME - allow manipulation of time
  • ...
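
On Linux, the capabilities a process currently holds are visible as hexadecimal bit masks in /proc/<pid>/status. As a rough sketch (not part of the original lecture), the following decodes the effective set of the current process; the helper names decode_caps and effective_caps are made up for illustration, the table covers only the capabilities listed above, and the bit numbers are the ones from <linux/capability.h>.

```python
# Bit numbers from <linux/capability.h>; deliberately incomplete, this
# table only covers the capabilities named in the list above.
CAP_BITS = {
    0: "CAP_CHOWN",
    7: "CAP_SETUID",
    9: "CAP_LINUX_IMMUTABLE",
    10: "CAP_NET_BIND_SERVICE",
    11: "CAP_NET_BROADCAST",
    12: "CAP_NET_ADMIN",
    13: "CAP_NET_RAW",
    19: "CAP_SYS_PTRACE",
    21: "CAP_SYS_ADMIN",
    25: "CAP_SYS_TIME",
}

def decode_caps(mask):
    """Return the known capability names whose bits are set in mask."""
    return [name for bit, name in sorted(CAP_BITS.items())
            if mask & (1 << bit)]

def effective_caps(status_path="/proc/self/status"):
    """Read the effective capability mask (CapEff) of a process."""
    with open(status_path) as f:
        for line in f:
            if line.startswith("CapEff:"):
                return int(line.split()[1], 16)
    return 0

if __name__ == "__main__":
    mask = effective_caps()
    print(f"CapEff: {mask:#x}")
    for name in decode_caps(mask):
        print("  ", name)
```

Run as an unprivileged user, this will typically report an empty set; run as root outside of a container, all of the listed capabilities should show up.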

As so often, the standard is interpreted and implemented by different operating systems in different ways. For example, on FreeBSD, capsicum(4) implements a capability and sandbox framework; for a list of capabilities supported there, see rights(4).

Another solution for controlling process groups and their resource utilization is the combination of Linux control groups ('cgroups') and Linux namespaces. Per the Wikipedia page, cgroups provide:

Resource limiting
    groups can be set to not exceed a configured
    memory limit, which also includes the file system
    cache
Prioritization
    some groups may get a larger share of CPU
    utilization or disk I/O throughput
Accounting
    measures a group's resource usage, which may be
    used, for example, for billing purposes
Control
    freezing groups of processes, their checkpointing
    and restarting

cgroups are implemented as a virtual file system, often using the /sys/fs/cgroup mountpoint, allowing for enabling of different controllers via mount options. cgroups support the following controllers:

  • blkio - block device I/O
  • cpu - ability to schedule tasks
  • cpuacct - CPU usage accounting
  • cpuset - CPUs and memory nodes
  • devices - ability of tasks to create or use device nodes
  • freezer - activity of a control group; tasks in frozen groups are not scheduled
  • hugetlb - large page support (HugeTLB) usage
  • memory - memory, kernel memory, swap memory
  • net_cls - ability to tag packets based on control group. These tags can be used by a traffic controller to assign priorities
  • net_prio - ability to set network traffic priority
  • perf_event - ability to monitor threads

Interactions with the groups are then similar to what we've seen above when using cpusets:

# create a new memory cgroup:
mkdir /sys/fs/cgroup/memory/group0
# move the current shell into the memory controller group:
echo 0 > /sys/fs/cgroup/memory/group0/tasks
# limit the shell's memory usage:
echo 40M > /sys/fs/cgroup/memory/group0/memory.limit_in_bytes
# 
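
To verify which control groups a process ended up in, Linux exposes the memberships in /proc/<pid>/cgroup, one line per hierarchy in the form "hierarchy-ID:controller-list:path". The following sketch, which is an illustration rather than part of the original notes, parses that file for the current process; the function name parse_cgroup_line is made up.

```python
def parse_cgroup_line(line):
    """Split one /proc/<pid>/cgroup line into (id, controllers, path)."""
    hierarchy, controllers, path = line.rstrip("\n").split(":", 2)
    # The controller list is empty for the unified (cgroups v2) hierarchy.
    names = controllers.split(",") if controllers else []
    return int(hierarchy), names, path

if __name__ == "__main__":
    with open("/proc/self/cgroup") as f:
        for line in f:
            hierarchy, names, path = parse_cgroup_line(line)
            print(hierarchy, ",".join(names) or "(unified)", path)
```

For a shell that was moved into the group created above, one would expect a line naming the memory controller with the path /group0.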

In a similar manner, 'namespaces' offer a form of lightweight process virtualization and allow you to restrict what process groups can see with respect to the resources of the system, such as filesystem mounts, process IDs, network interfaces, System V IPC, user IDs, ...
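
On Linux, the namespaces a process belongs to are visible as symbolic links under /proc/<pid>/ns, each pointing to a name of the form "type:[inode]"; two processes share a namespace exactly when these values match. Here is a small sketch (not from the original lecture, with the made-up helper name parse_ns_link) that lists the namespaces of the current process:

```python
import os

def parse_ns_link(target):
    """Turn a link target such as 'pid:[4026531836]' into (type, inode)."""
    kind, _, rest = target.partition(":")
    return kind, int(rest.strip("[]"))

if __name__ == "__main__":
    ns_dir = "/proc/self/ns"
    for name in sorted(os.listdir(ns_dir)):
        kind, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        print(f"{name}: {kind} namespace, inode {inode}")
```

Comparing this output between a regular shell and a shell inside a container shows which of the views have been separated.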

cgroups and namespaces are frequently discussed together, as they complement each other well. In fact, the combination of cgroups and namespaces forms the basis for many operating-system-level virtualization and container technologies, such as CoreOS, LXC, or Docker.

Containers finally combine all of the above features:

  • null and union mounts to provide the right environment
  • restricting processes in their utilization
  • restricting filesystem views
  • restricting processes from what they can see
  • restricting processes from what they can do

Notably, however, containers are still processes or process groups. That is, they are still running on the same kernel as the "host" or "parent"; this is both an advantage (instantiation of the virtual environment is much faster than e.g., booting a virtual machine) and a limitation (you can only run a container of the given OS, not another OS).

This approach of confining and controlling process groups can also be used for other things: systemd, for example, uses cgroups to control daemons, and provides its own systemd-nspawn command to create lightweight namespace containers.


Summary

And that sums up our whirlwind tour of all the various ways of restricting processes, the techniques and technologies that lead up to the ever popular containers. There are many other related approaches, and we only just scratched the surface, but I hope that you've at least seen that there is no magic: everything we've covered in this semester so far should enable you to better understand e.g., Docker and friends.

Perhaps the most important lessons to draw here are that most process restrictions can be circumvented in some way, and that the goal is to voluntarily restrict yourself such that a compromise cannot regain the elevated privileges that you may have held previously; understanding Unix processes and their base semantics is critical in setting up and configuring such restricted environments.
