Today I'll be talking about a security facility called Seccomp.
As you may know, Linux has about 400 system calls.
You can take a glimpse at them in the Linux tree, in the syscall table: arch/x86/entry/syscalls/syscall_64.tbl. As the path suggests, the table's content varies between architectures.
The seccomp facility restricts which system calls a process may invoke. In other words, it is a sandbox-like security mechanism built into the kernel.
The filters are kept in a linked list, where each node points to the previously installed filter through prev:
    struct seccomp_filter {
        refcount_t usage;
        bool log;
        struct seccomp_filter *prev;
        struct bpf_prog *prog;
    };
The head of that list is held in struct seccomp, which resides in the task_struct (sched.h):
    struct seccomp {
        int mode;
        struct seccomp_filter *filter;
    };
If the program executes an unexpected system call, the kernel can terminate the process (it is killed with a SIGSYS signal), since it might be running malicious code.
Probably you are saying to yourself that's COOL,
so how do I set the seccomp filters?
1) Compile the kernel with the CONFIG_SECCOMP_FILTER option set.
2) Call prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER,..)
3) Apply filters on system calls. The filters make use of the well-known Berkeley Packet Filter (BPF), which was implemented years ago for packet filtering in tools such as tcpdump.
So who is using seccomp?
1) Chrome browser (in chrome's address bar enter: chrome://sandbox/)
2) OpenSSH
3) systemd
4) Firefox OS
5) Docker (Seccomp security profiles for Docker)
6) LXC
If you would like to know whether a program is running in seccomp mode,
you can simply read /proc/<pid>/status.
It contains a field called Seccomp:
0 means SECCOMP_MODE_DISABLED,
1 means SECCOMP_MODE_STRICT,
2 means SECCOMP_MODE_FILTER.
If the value is 2, a filter has been created and installed into the kernel, and every system call the process makes is tested against the list of filters.
For each system call, the filters return one of these 5 values:
    #define SECCOMP_RET_KILL  0x00000000U /* kill the task immediately */
    #define SECCOMP_RET_TRAP  0x00030000U /* disallow and force a SIGSYS */
    #define SECCOMP_RET_ERRNO 0x00050000U /* returns an errno */
    #define SECCOMP_RET_TRACE 0x7ff00000U /* pass to a tracer or disallow in case you are using a debugger */
    #define SECCOMP_RET_ALLOW 0x7fff0000U /* allow */
Taken from: http://elixir.free-electrons.com/linux/latest/source/include/uapi/linux/seccomp.h
So let's assume you have read a great article about some new functionality,
and to test this functionality you are given access to download a shared object.
On the other hand, the website might infect you with malware, since the shared object might execute malicious code.
So a suggested solution is to use seccomp, which would filter out the unwanted system calls before they are executed.
I have illustrated a flowchart:
The code I wrote looks like this:
    #include <stdlib.h>
    #include <stdio.h>
    #include <stddef.h>
    #include <string.h>
    #include <unistd.h>
    #include <errno.h>
    #include <sys/types.h>
    #include <sys/prctl.h>
    #include <sys/syscall.h>
    #include <sys/socket.h>
    #include <linux/filter.h>
    #include <linux/seccomp.h>
    #include <linux/audit.h>

    #define ArchField offsetof(struct seccomp_data, arch)

    /* if the loaded syscall number matches, allow it; otherwise fall through */
    #define Allow(syscall) \
        BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, SYS_##syscall, 0, 1), \
        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW)

    void complex_computation(int *);

    struct sock_filter filter[] = {
        /* validate arch */
        BPF_STMT(BPF_LD+BPF_W+BPF_ABS, ArchField),
        BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, AUDIT_ARCH_X86_64, 1, 0),
        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL),

        /* load syscall */
        BPF_STMT(BPF_LD+BPF_W+BPF_ABS, offsetof(struct seccomp_data, nr)),

        /* list of allowed syscalls */
        Allow(exit_group),  /* exits a process */
        Allow(brk),         /* for malloc(), inside libc */
        Allow(mmap),        /* also for malloc() */
        Allow(munmap),      /* for free(), inside libc */
        Allow(write),       /* called by printf */
        Allow(fstat),       /* called by printf */

        /* and if we don't match above, die */
        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL),
    };

    struct sock_fprog filterprog = {
        .len = sizeof(filter)/sizeof(filter[0]),
        .filter = filter
    };

    int main(int argc, char **argv)
    {
        /* 1024-byte buffer, typed to match complex_computation(int *) */
        int buf[1024 / sizeof(int)];

        /* set up the restricted environment */
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
            perror("Could not set no_new_privs");
            exit(1);
        }
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &filterprog) == -1) {
            perror("Could not start seccomp");
            exit(1);
        }

        complex_computation(buf); /* functionality taken from the shared object */
        printf("Task was completed (no malware was reported)!\n");
        return 0;
    }
For example, below is the actual strace dump of the binary.
The unlink system call was blocked, since the malicious code's intention was to damage my file system:
unlink("/home/gil/my_important_file.txt" <unfinished ...>
+++ killed by SIGSYS +++
Bad system call (core dumped)