All About Root on Android/Linux

evilpan, filed under Android

2020-12-06

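Anyone who has tinkered with Android will know the term root. There was a time when a rooted phone was standard kit for enthusiasts. For developers, a rooted phone is a breeding ground for black- and gray-market cheating tools, something to be avoided and fought wherever possible. For security researchers, root means much more: Towelroot, PingPongRoot, DirtyC0w, ReVent, all those fun vulnerabilities and elegant exploits that carry plenty of sweat and memories.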

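This article does not dig into the exploitation details of Android root vulnerabilities. Instead it takes an ecosystem view and discusses the access control principles and mechanisms that surround root.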

The Essence of Root

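In Android application security you often come across so-called root detection schemes. They are mainly used to assess the security of the client and to detect potentially suspicious or dangerous user behavior such as hooking, debugging, and device farming. These checks are mostly signature-based: whether an su file exists, whether certain properties are set, and so on. This is really a lazy shortcut. Root is simply a highly privileged user who can do what ordinary privileges cannot. Root detection is therefore fundamentally a false proposition: low privilege cannot defy high privilege, and that is determined by the design of the system.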

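Root here generally refers to the superuser in Linux, which holds the highest execution privileges in the system and is roughly comparable to Administrator on Windows. In the privilege model of modern operating systems, however, management is no longer this coarse and purely user-based; we merely keep using the name. Sometimes you are clearly root on a system and still get Permission Denied, and without understanding the mechanism behind it you can easily end up tearing your hair out.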

In one sentence, the essence of root is the ability of the current task to access system resources.

Why put it that way? Consider running the following command on a system:

cat /etc/passwd

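Intuitively, this prints the contents of a file. More precisely, the current user creates a child process that runs the executable /bin/cat, and that process reads the file /etc/passwd. In Linux, processes exist to isolate address spaces: the virtual address spaces of different processes are separate, so process A accessing address 0x08000000 does not affect the same address in process B, because the two are mapped to different physical memory. A thread, by contrast, is the smallest unit that the operating system schedules onto a CPU; in a typical real-time operating system this unit is also called a task.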

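From the CPU's point of view, its job is to fetch, decode, execute, access memory, and write back, looping until an error occurs or it halts; the CPU itself holds no state beyond its registers. For the operating system to run multiple tasks concurrently and switch back and forth among them, it has to provide task scheduling, that is, the ability to save and restore execution contexts. You can simply treat a thread and a task as the same thing: the basic unit by which the operating system schedules the CPU.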

So whether cat /etc/passwd succeeds really comes down to whether the following conditions are satisfied in the kernel:

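Whether the current thread may access and execute the cat executable, and whether it may create a new process;
How the new process inherits the attributes of the current process, which determines whether the new thread may access /etc/passwd.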
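Seen from user space, the two conditions above correspond to a fork followed by an exec. The following is a minimal sketch of what a shell does for this command, assuming cat can be found on PATH; run_cat is a hypothetical helper name chosen for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run `cat <path>` in a child process and return its exit status,
 * or -1 if the child could not be created or did not exit normally.
 * run_cat is a hypothetical name used only in this sketch.
 */
int run_cat(const char *path)
{
	assert(path != NULL);

	pid_t pid = fork();		/* condition 1: may we create a new task? */
	if (pid < 0) {
		perror("fork");
		return -1;
	}

	if (pid == 0) {
		/* child: starts with a copy of the parent's credentials */
		char *const argv[] = { "cat", (char *)path, NULL };
		execvp("cat", argv);	/* condition 1: may we execute cat? */
		perror("execvp");
		_exit(127);
	}

	/* condition 2 is decided inside the child when cat opens the file */
	int status;
	if (waitpid(pid, &status, 0) < 0) {
		perror("waitpid");
		return -1;
	}
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A nonzero return value from run_cat("/etc/passwd") means that one of the checks failed somewhere along the way: the fork, the exec, or the open inside cat.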

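Since the kernel has to decide whether a given thread may access a resource, there should be an access-control-related structure in the kernel that corresponds to the current task, and indeed there is. In the Linux kernel, the structure that describes a task is defined in include/linux/sched.h: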

struct task_struct {
#ifdef CONFIG_THREAD_INFO_IN_TASK
	/*
	 * For reasons of header soup (see current_thread_info()), this
	 * must be the first element of task_struct.
	 */
	struct thread_info		thread_info;
#endif
	/* -1 unrunnable, 0 runnable, >0 stopped: */
	volatile long			state;

	/*
	 * This begins the randomizable portion of task_struct. Only
	 * scheduling-critical items should be added above here.
	 */
	randomized_struct_fields_start

	void				*stack;
	refcount_t			usage;
	/* Per task flags (PF_*), defined further below: */
	unsigned int			flags;
	unsigned int			ptrace;

	// ...
	struct sched_info		sched_info;

	struct list_head		tasks;

	struct mm_struct		*mm;
	struct mm_struct		*active_mm;
	/* Real parent process: */
	struct task_struct __rcu	*real_parent;

	/* Recipient of SIGCHLD, wait4() reports: */
	struct task_struct __rcu	*parent;
	struct list_head		ptraced;
	struct list_head		ptrace_entry;

	/* PID/PID hash table linkage. */
	struct pid			*thread_pid;
	struct hlist_node		pid_links[PIDTYPE_MAX];
	struct list_head		thread_group;
	struct list_head		thread_node;
	/* Process credentials: */

	/* Tracer's credentials at attach: */
	const struct cred __rcu		*ptracer_cred;

	/* Objective and real subjective task credentials (COW): */
	const struct cred __rcu		*real_cred;

	/* Effective (overridable) subjective task credentials (COW): */
	const struct cred __rcu		*cred;
	char				comm[TASK_COMM_LEN];
	struct seccomp			seccomp;
	// ...
	/*
	 * New fields for task_struct should be added above here, so that
	 * they are included in the randomized portion of task_struct.
	 */
	randomized_struct_fields_end

	/* CPU-specific state of this task: */
	struct thread_struct		thread;

	/*
	 * WARNING: on x86, 'thread_struct' contains a variable-sized
	 * structure.  It *MUST* be at the end of 'task_struct'.
	 *
	 * Do not put anything below here!
	 */
};

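Only part of the structure is shown here. It has a great many fields, mainly the context needed at run time, permissions, linked lists, and statistics. The relevant parts are introduced again below when we get to them. DON'T PANIC!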

In short, just remember that the essence of root is the ability of the current task to access system resources.

Access Control

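Let us step out of the kernel for a moment and return to the system administrator's perspective. In computer security, access control refers to the constraints the operating system places on a subject accessing an object or performing some operation. The subject can be a thread or a process, and the operation can be access to objects such as files, directories, TCP/UDP ports, shared memory segments, or I/O devices. These constraints fall into two broad classes: those that the owner of an object manages for its own accessors, called discretionary access control (DAC), and those managed centrally by the operating system, called mandatory access control (MAC).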

DAC

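DAC stands for Discretionary Access Control, so named because control over permissions is at the owner's discretion. It is the most common access control scheme in Linux: a user decides which other users may share his or her files. There are two discretionary mechanisms, file permission bits and access control lists (ACLs).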

File permission bits are the familiar 9-bit permission code:

$ ls -l test
-rw-r--r-- 1 pan staff 0 Nov 21 14:35 test

The bits give the read, write, and execute (rwx) permissions for the owner (user), the group, and other users respectively; see chmod(1). Linux actually adds three more bits in front of these:

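S_ISUID (04000): the set-user-ID bit, which sets the effective user ID of the process during the execve system call;
S_ISGID (02000): the set-group-ID bit, the counterpart of set-user-ID for the effective group ID; on a directory it makes new files inherit the directory's group;
S_ISVTX (01000): the sticky bit, which on a directory stops users from deleting or renaming files they do not own; it is typically set on shared directories such as /tmp.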

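Permission bits provide a degree of discretionary access control, but on a multi-user system they can only be managed through groups; on their own they cannot grant a file to user A while withholding it from user B. ACLs were introduced to do exactly that. For example, to grant a single user read permission on a file: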

$ setfacl -m u:evilpan:r /etc/passwd

See setfacl(1) for the details of the command. Note that ACLs require support from both the kernel and the file system.

MAC

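MAC stands for Mandatory Access Control. It is an access constraint mechanism that manages information in the system by classification level and category, so that each user can reach only the information explicitly marked as accessible to that user. Put simply, under mandatory access control both users (or other subjects) and files (or other objects) are labeled with fixed security attributes such as security level and access permissions, and on every access the system checks those attributes to decide whether the user may access the file. SELinux and AppArmor are the typical mandatory access control implementations on Linux; SELinux is covered in detail below.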

UID

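As mentioned in the DAC section, its access control policy is managed by user and group. To make management easier, the operating system maps users and groups to numeric IDs, the UID and GID. Traditionally, getting root meant running the su program and magically obtaining a root shell that can do anything. su is normally a program with the set-user-ID bit set and owned by root. When an ordinary user runs it, what actually happens is an execve system call on that file, and the kernel adjusts the privileges of the current process according to the set-user-ID bit, mainly by way of the effective user ID.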

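Linux distinguishes the real user ID from the effective user ID. The reason is that a process may need to switch to another user while it runs, and with only one user ID it could never switch back. The former identifies the real user of the process; the latter identifies the user it is currently acting as.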
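One way to observe these IDs from user space is the Uid: line of /proc/PID/status, which lists the real, effective, saved, and filesystem UID in that order. The sketch below parses that line; parse_uid_line and read_task_uids are hypothetical helper names, and the approach assumes /proc is mounted.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Parse a "Uid:" line from /proc/<pid>/status. On success ids[] holds
 * the real, effective, saved and filesystem UID (in that order) and the
 * function returns 0; on any other line it returns -1.
 */
int parse_uid_line(const char *line, unsigned int ids[4])
{
	assert(line != NULL && ids != NULL);

	int n = sscanf(line, "Uid: %u %u %u %u",
		       &ids[0], &ids[1], &ids[2], &ids[3]);
	return n == 4 ? 0 : -1;
}

/* Read the four UIDs of the calling process. Returns 0 on success. */
int read_task_uids(unsigned int ids[4])
{
	char line[256];
	int rc = -1;
	FILE *fp = fopen("/proc/self/status", "r");

	if (fp == NULL)
		return -1;
	while (fgets(line, sizeof(line), fp) != NULL) {
		if (parse_uid_line(line, ids) == 0) {
			rc = 0;
			break;
		}
	}
	fclose(fp);
	return rc;
}
```

Inside a set-user-ID root program started by UID 1000, the line would be expected to show a real UID of 1000 next to an effective UID of 0, which is exactly the split described above.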

The task_struct shown earlier has a struct cred field. The structure it points to holds the security context of the current task, including the UIDs:

struct cred {
	atomic_t	usage;
#ifdef CONFIG_DEBUG_CREDENTIALS
	atomic_t	subscribers;	/* number of processes subscribed */
	void		*put_addr;
	unsigned	magic;
#define CRED_MAGIC	0x43736564
#define CRED_MAGIC_DEAD	0x44656144
#endif
	kuid_t		uid;		/* real UID of the task */
	kgid_t		gid;		/* real GID of the task */
	kuid_t		suid;		/* saved UID of the task */
	kgid_t		sgid;		/* saved GID of the task */
	kuid_t		euid;		/* effective UID of the task */
	kgid_t		egid;		/* effective GID of the task */
	kuid_t		fsuid;		/* UID for VFS ops */
	kgid_t		fsgid;		/* GID for VFS ops */
	// ...
} __randomize_layout;

The execve(2) documentation says:

If the set-user-ID bit is set on the program file pointed to by filename, and the underlying file system is not mounted nosuid (the MS_NOSUID flag for mount(2)), and the calling process is not being ptraced, then the effective user ID of the calling process is changed to that of the owner of the program file.

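Readers who like to get to the bottom of things can find the corresponding logic in the kernel's execve implementation. Taking linux-v5.10-rc4 as an example, the main call path in the kernel is: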

load_elf_binary (fs/binfmt_elf.c)
begin_new_exec (fs/exec.c)
bprm_creds_from_file
bprm_fill_uid

static void bprm_fill_uid(struct linux_binprm *bprm, struct file *file)
{
	/* Handle suid and sgid on files */
	struct inode *inode;
	unsigned int mode;
	kuid_t uid;
	kgid_t gid;

	if (!mnt_may_suid(file->f_path.mnt))
		return;

	if (task_no_new_privs(current))
		return;

	inode = file->f_path.dentry->d_inode;
	mode = READ_ONCE(inode->i_mode);
	if (!(mode & (S_ISUID|S_ISGID)))
		return;

	/* Be careful if suid/sgid is set */
	inode_lock(inode);

	/* reload atomically mode/uid/gid now that lock held */
	mode = inode->i_mode;
	uid = inode->i_uid;
	gid = inode->i_gid;
	inode_unlock(inode);

	/* We ignore suid/sgid if there are no mappings for them in the ns */
	if (!kuid_has_mapping(bprm->cred->user_ns, uid) ||
		 !kgid_has_mapping(bprm->cred->user_ns, gid))
		return;

	if (mode & S_ISUID) {
		bprm->per_clear |= PER_CLEAR_ON_SETID;
		bprm->cred->euid = uid;
	}

	if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP)) {
		bprm->per_clear |= PER_CLEAR_ON_SETID;
		bprm->cred->egid = gid;
	}
}

This is why set-user-ID programs can be used for privilege escalation.

Capabilities

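Traditional Linux permission checks are based mainly on the UID and know only two classes: the superuser with (effective) UID 0, and everyone else. That makes the granularity too coarse. For instance, just to let ordinary users run ping, the ping binary has to be given the set-user-ID bit, and if its implementation has a vulnerability, that can be exploited for privilege escalation.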

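Capabilities were therefore introduced starting with Linux 2.2. They split the superuser's privileges into pieces that can be granted to ordinary users as needed, removing the limitation of the traditional UID 0 model.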

Capabilities are tracked per task (thread), again in the kernel's struct cred shown above. The relevant fields are:

struct cred {
	// ...
	kernel_cap_t	cap_inheritable; /* caps our children can inherit */
	kernel_cap_t	cap_permitted;	/* caps we're permitted */
	kernel_cap_t	cap_effective;	/* caps we can actually use */
	kernel_cap_t	cap_bset;	/* capability bounding set */
	kernel_cap_t	cap_ambient;	/* Ambient capability set */
	// ...
};

From user space, the system calls for getting and setting a thread's capabilities are capget and capset:

#include <sys/capability.h>

int capget(cap_user_header_t hdrp, cap_user_data_t datap);
int capset(cap_user_header_t hdrp, const cap_user_data_t datap);

The parameter structures are defined as follows:

typedef struct __user_cap_header_struct {
  __u32 version;
  int pid;
} *cap_user_header_t;

typedef struct __user_cap_data_struct {
  __u32 effective;
  __u32 permitted;
  __u32 inheritable;
} *cap_user_data_t;

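Judging from the definition, there are three capability sets: effective, permitted, and inheritable. The design intent is similar to that of the UIDs, with the inheritable set controlling which capabilities can be carried across an execve. Each set is a __u32 in which individual capabilities are combined with bitwise OR, so one structure can describe at most 32 capabilities (the version 2 and 3 ABIs pass an array of two such structures to cover 64). Some common capabilities are: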

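CAP_NET_RAW: use RAW and PACKET sockets, and bind to arbitrary addresses for transparent proxying;
CAP_NET_ADMIN: various network-related operations, such as configuring network interfaces and modifying routing tables;
CAP_SETUID: set and change the UIDs of a process;
CAP_SYS_PTRACE: trace arbitrary other processes with ptrace;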
….

The full list is in capabilities(7).
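As a rough illustration of how the bit sets are queried, the sketch below calls capget through the raw system call interface, so that it does not depend on libcap's headers, and tests a single bit of the effective set. The helper names cap_is_effective and current_has_cap are hypothetical; the header constants come from linux/capability.h.

```c
#include <assert.h>
#include <linux/capability.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Test one capability bit in the effective set returned by capget().
 * With the version 3 ABI, data[] has two elements: capabilities 0..31
 * live in data[0] and capabilities 32..63 in data[1].
 */
int cap_is_effective(const struct __user_cap_data_struct data[2], int cap)
{
	assert(data != NULL && cap >= 0 && cap < 64);
	return (data[cap / 32].effective >> (cap % 32)) & 1u;
}

/*
 * Return 1 if the calling thread holds `cap` in its effective set,
 * 0 if it does not, and -1 if the capget system call fails.
 */
int current_has_cap(int cap)
{
	struct __user_cap_header_struct hdr = {
		.version = _LINUX_CAPABILITY_VERSION_3,
		.pid = 0,	/* 0 means the calling thread */
	};
	struct __user_cap_data_struct data[_LINUX_CAPABILITY_U32S_3];

	if (syscall(SYS_capget, &hdr, data) != 0)
		return -1;
	return cap_is_effective(data, cap);
}
```

A program that wants to open a raw socket could call current_has_cap(CAP_NET_RAW) first and print a helpful message instead of failing with EPERM.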

System administrators mostly use command-line tools such as capsh, getcap, and setcap, which are essentially wrappers around the system calls by way of libcap.

SELinux

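SELinux, or Security-Enhanced Linux, is one of the two major mandatory access control implementations on Linux (the other being AppArmor). It is an extension to Linux that was originally developed by the NSA and later merged into the mainline open source kernel.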

User Space

SELinux describes access permissions through contexts. For the file system, for example, ls -Z shows the label attached to each file:

generic:/ # ls -lZ /
total 2424
dr-xr-xr-x  3 root   root   u:object_r:cgroup:s0                 0 1970-01-01 08:00 acct
lrwxrwxrwx  1 root   root   u:object_r:rootfs:s0                50 1970-01-01 08:00 bugreports -> /data/user_de/0/com.android.shell/files/bugreports
drwxrwx---  6 system cache  u:object_r:cache_file:s0          4096 2019-12-23 16:52 cache
lrwxrwxrwx  1 root   root   u:object_r:rootfs:s0                13 1970-01-01 08:00 charger -> /sbin/healthd
dr-x------  2 root   root   u:object_r:rootfs:s0                40 1970-01-01 08:00 config
lrwxrwxrwx  1 root   root   u:object_r:rootfs:s0                17 1970-01-01 08:00 d -> /sys/kernel/debug
drwxrwx--x 38 system system u:object_r:system_data_file:s0    4096 1970-01-01 08:00 data

Labels on network ports can be viewed with netstat -Z, and process labels with ps -Z:

generic:/ # ps -Z
LABEL                          USER      PID   PPID  VSIZE  RSS   WCHAN            PC  NAME
u:r:init:s0                    root      1     0     7856   1556  SyS_epoll_ 0008e458 S /init
u:r:kernel:s0                  root      2     0     0      0       kthreadd 00000000 S kthreadd
u:r:platform_app:s0:c512,c768  u0_a28    1642  885   1669496 105556 binder_thr ab777494 D com.android.systemui
u:r:fingerprintd:s0            system    901   1     8260   3536  binder_thr b12da494 S /system/bin/fingerprintd
u:r:gatekeeperd:s0             system    902   1     7244   2888  binder_thr b21a1494 S /system/bin/gatekeeperd
u:r:perfprofd:s0               root      903   1     4104   1740  hrtimer_na a72f0378 S /system/xbin/perfprofd
u:r:logd:s0                    logd      906   1     4644   2184  __skb_recv ac3d3584 S /system/bin/logcat
u:r:shell:s0                   shell     909   884   3544   1896  sigsuspend af915698 S /system/bin/sh

A context consists of several colon-separated parts:

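user: the SELinux user account. Unlike a Linux user account, it is defined in the policy and can carry multiple levels of privilege;
role: defines the operations a subject may perform on objects within a particular domain;
type: the type of the file or other object; for a process, the type is also called its domain;
sensitivity: the final field, the security level. It consists of a sensitivity (s0, s1, ...) optionally followed by categories in the range c0 to c1023, as in s0:c512,c768 in the ps output above. Multiple sensitivity levels matter mainly in MLS mode, which targets highly sensitive defense and military organizations; for clients and ordinary data servers the default value is kept.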
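The layout of a context string can be illustrated with a small parser. This is a sketch with hypothetical names (struct se_context, parse_se_context) and fixed-size buffers chosen for brevity; real tools would use libselinux instead.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the four parts of a security context. */
struct se_context {
	char user[32];
	char role[32];
	char type[64];
	char level[64];	/* sensitivity plus optional categories */
};

/*
 * Split "user:role:type:level" into its fields. Only the first three
 * colons separate fields; the level itself may contain a colon, as in
 * "s0:c512,c768". Returns 0 on success, -1 on malformed input.
 */
int parse_se_context(const char *s, struct se_context *out)
{
	assert(s != NULL && out != NULL);

	memset(out, 0, sizeof(*out));
	int n = sscanf(s, "%31[^:]:%31[^:]:%63[^:]:%63s",
		       out->user, out->role, out->type, out->level);
	return n == 4 ? 0 : -1;
}
```

Applied to u:r:platform_app:s0:c512,c768 from the ps output above, the parser yields the type platform_app and the level s0:c512,c768.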

Once system resources carry labels, the system can decide from the labels whether an access should be allowed. A sample denial log looks like this:

type=1400 audit(18.250:15): avc: denied { getattr } for pid=939 comm="ls" path="/ueventd.rc" dev="rootfs" ino=2842 scontext=u:r:shell:s0 tcontext=u:object_r:rootfs:s0 tclass=file permissive=0

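The permission decision is made in the kernel, but the access rules can be generated and updated dynamically; the kernel only ships a set of built-in trigger points. SELinux rules (the policy) are usually written in a dedicated high-level language. CIL (Common Intermediate Language) is the one under active development, but the traditional kernel policy language (referred to here as MLS statements) is still used more widely. An access rule, for instance, has this form: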

rule_name source_type target_type:class perm_set;

A concrete example:

allow initrc_t acct_exec_t:file { getattr read execute };

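This allows a subject with the type initrc_t to access target files labeled acct_exec_t, with the permissions getattr, read, and execute. Types are declared with the type keyword, and the mapping from files to labels is usually kept in a separate file_contexts file. The full MLS syntax is described in Kernel Policy Language Definition Links.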
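The effect of such a rule can be pictured as a lookup in a table of allowed triples, with everything else denied. The following is a toy model, not how the kernel stores policy; the permission bit values and the names allow_rule and te_allowed are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Permission bits for the "file" class; the values are illustrative only. */
enum {
	P_GETATTR = 1 << 0,
	P_READ    = 1 << 1,
	P_WRITE   = 1 << 2,
	P_EXECUTE = 1 << 3,
};

struct allow_rule {
	const char *source;	/* source (subject) type */
	const char *target;	/* target (object) type */
	const char *tclass;	/* object class */
	unsigned int perms;	/* permitted operations */
};

/* allow initrc_t acct_exec_t:file { getattr read execute }; */
static const struct allow_rule rules[] = {
	{ "initrc_t", "acct_exec_t", "file", P_GETATTR | P_READ | P_EXECUTE },
};

/*
 * Return 1 only if the union of all matching allow rules grants every
 * requested permission. Anything not explicitly allowed is denied.
 */
int te_allowed(const char *source, const char *target,
	       const char *tclass, unsigned int requested)
{
	assert(source != NULL && target != NULL && tclass != NULL);

	unsigned int granted = 0;

	for (size_t i = 0; i < sizeof(rules) / sizeof(rules[0]); i++) {
		if (strcmp(rules[i].source, source) == 0 &&
		    strcmp(rules[i].target, target) == 0 &&
		    strcmp(rules[i].tclass, tclass) == 0)
			granted |= rules[i].perms;
	}
	return (granted & requested) == requested;
}
```

In this model a write request from initrc_t to an acct_exec_t file is refused simply because no rule mentions it, which mirrors the default-deny behavior behind the avc: denied log line shown earlier.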

For system administrators, the commonly used commands are:

chcon: change the SELinux label of a file;
restorecon: reload (restore) the default SELinux labels of system files;
semanage: modify the SELinux policy of the running system;
…

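With the SELinux policy syntax provided by MLS we can define very fine-grained access control, for example controlling IPC access by application attributes or even by signature. At the same time, debugging rules is a frequent headache for ROM developers. Scripts such as audit2allow and audit2why help locate problems and add rules, but take care not to add overly broad permissions that enlarge the attack surface.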

Kernel Space

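SELinux checks happen in the kernel, as noted above, so let us take opening a file as an example and briefly walk through the SELinux verification path. Files are opened with the openat system call, whose rough call path in the kernel is: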

sys_openat
do_sys_open
do_filp_open
path_openat
do_last
may_open
inode_permission
- do_inode_permission -> generic_permission
- devcgroup_inode_permission
- security_inode_permission

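inode_permission checks the permissions of the file system inode before the file is opened. It covers the regular DAC check, the cgroup permission check, and the SELinux check we care about: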

#define call_int_hook(FUNC, IRC, ...) ({			\
	int RC = IRC;						\
	do {							\
		struct security_hook_list *P;			\
								\
		list_for_each_entry(P, &security_hook_heads.FUNC, list) { \
			RC = P->hook.FUNC(__VA_ARGS__);		\
			if (RC != 0)				\
				break;				\
		}						\
	} while (0);						\
	RC;							\
})

int security_inode_permission(struct inode *inode, int mask)
{
	if (unlikely(IS_PRIVATE(inode)))
		return 0;
	return call_int_hook(inode_permission, 0, inode, mask);
}

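struct security_hook_heads is a structure containing a series of linked lists, each corresponding to one kind of LSM hook; SELinux registers its callbacks on these lists: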

struct security_hook_heads {
	struct list_head binder_set_context_mgr;
	struct list_head binder_transaction;
	struct list_head binder_transfer_binder;
	struct list_head binder_transfer_file;
	struct list_head ptrace_access_check;
	struct list_head ptrace_traceme;
	struct list_head capget;
	struct list_head capset;
	// ...
	struct list_head inode_permission;
	// ...
};
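The way call_int_hook walks one of these lists can be mimicked in user space with an array of function pointers. The sketch below is a simplified model under stated assumptions: the hook type, the two sample hooks, and run_hooks are all hypothetical, and the mask value 0x2 stands in for the kernel's MAY_WRITE.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical hook type: 0 means "no objection", nonzero is an error. */
typedef int (*inode_permission_hook)(int inode_id, int mask);

int always_allow(int inode_id, int mask)
{
	(void)inode_id;
	(void)mask;
	return 0;
}

/* Pretend policy: writing (mask bit 0x2) to inode 42 is forbidden. */
int deny_write_to_42(int inode_id, int mask)
{
	return (inode_id == 42 && (mask & 0x2)) ? -EACCES : 0;
}

/*
 * Mirror of the call_int_hook() loop: call every registered hook in
 * order and stop at the first one that returns a nonzero result.
 */
int run_hooks(const inode_permission_hook *hooks, size_t n,
	      int inode_id, int mask)
{
	assert(hooks != NULL || n == 0);

	int rc = 0;

	for (size_t i = 0; i < n; i++) {
		rc = hooks[i](inode_id, mask);
		if (rc != 0)
			break;
	}
	return rc;
}
```

The consequence of this design is that any single registered module can veto an access, while an empty list lets the access through with the initial value 0.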

Each list is initialized at kernel boot, and inode_permission is no exception. security/selinux/hooks.c defines the static array selinux_hooks:

static struct security_hook_list selinux_hooks[] = {
	LSM_HOOK_INIT(binder_set_context_mgr, selinux_binder_set_context_mgr),
	LSM_HOOK_INIT(binder_transaction, selinux_binder_transaction),
	LSM_HOOK_INIT(binder_transfer_binder, selinux_binder_transfer_binder),
	LSM_HOOK_INIT(binder_transfer_file, selinux_binder_transfer_file),

	LSM_HOOK_INIT(ptrace_access_check, selinux_ptrace_access_check),
	LSM_HOOK_INIT(ptrace_traceme, selinux_ptrace_traceme),
	LSM_HOOK_INIT(capget, selinux_capget),
	LSM_HOOK_INIT(capset, selinux_capset),
	// ...
	LSM_HOOK_INIT(inode_permission, selinux_inode_permission),
	// ...
};

selinux_inode_permission is therefore the function that actually performs the SELinux check:

static int selinux_inode_permission(struct inode *inode, int mask)
{
	const struct cred *cred = current_cred();
	u32 perms;
	bool from_access;
	unsigned flags = mask & MAY_NOT_BLOCK;
	struct inode_security_struct *isec;
	u32 sid;
	struct av_decision avd;
	int rc, rc2;
	u32 audited, denied;

	from_access = mask & MAY_ACCESS;
	mask &= (MAY_READ|MAY_WRITE|MAY_EXEC|MAY_APPEND);

	/* No permission to check.  Existence test. */
	if (!mask)
		return 0;

	validate_creds(cred);

	if (unlikely(IS_PRIVATE(inode)))
		return 0;

	perms = file_mask_to_av(inode->i_mode, mask);

	sid = cred_sid(cred);
	isec = inode->i_security;

	rc = avc_has_perm_noaudit(sid, isec->sid, isec->sclass, perms, 0, &avd);
	audited = avc_audit_required(perms, &avd, rc,
				     from_access ? FILE__AUDIT_ACCESS : 0,
				     &denied);
	if (likely(!audited))
		return rc;

	rc2 = audit_inode_permission(inode, perms, audited, denied, rc, flags);
	if (rc2)
		return rc2;
	return rc;
}

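A few things are worth noting here. First, selinux_hooks registers a large number of callbacks; these are the checkpoints built into the kernel. Second, selinux_inode_permission uses file_mask_to_av to convert the flags used to open the file into the corresponding SELinux access vector: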

/* Convert a Linux mode and permission mask to an access vector. */
static inline u32 file_mask_to_av(int mode, int mask)
{
	u32 av = 0;

	if (!S_ISDIR(mode)) {
		if (mask & MAY_EXEC)
			av |= FILE__EXECUTE;
		if (mask & MAY_READ)
			av |= FILE__READ;

		if (mask & MAY_APPEND)
			av |= FILE__APPEND;
		else if (mask & MAY_WRITE)
			av |= FILE__WRITE;

	} else {
		if (mask & MAY_EXEC)
			av |= DIR__SEARCH;
		if (mask & MAY_WRITE)
			av |= DIR__WRITE;
		if (mask & MAY_READ)
			av |= DIR__READ;
	}

	return av;
}

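These macros are defined in <build>/security/selinux/av_permissions.h, which is generated automatically when the kernel is built. The decision itself comes from avc_has_perm_noaudit; once the access is found to require auditing, audit_inode_permission -> slow_avc_audit is called to write the audit record. Because such access control checks are made very frequently, the policy is compiled ahead of time and loaded into the kernel, and the decisions computed from it are cached for performance. This cache is called the AVC (Access Vector Cache), which is where the avc in the log above comes from.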

Seccomp

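seccomp is short for secure computing. Strictly speaking it is not access control, but it also restricts what a process can do. It was designed to reduce the kernel's attack surface by limiting the set of system calls available to a target process. The first version of seccomp was introduced in Linux 2.6.12 in 2005: after writing 1 to /proc/PID/seccomp, the process could execute only four system calls, read, write, exit, and sigreturn.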

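In 2007 a new prctl operation, PR_SET_SECCOMP, was introduced, taking the SECCOMP_MODE_STRICT argument, and the /proc/PID/seccomp interface was removed. Later the kernel went through a series of refactorings: a filter mode was added in which a BPF program restricts system calls and their arguments, and a dedicated seccomp system call was introduced alongside the prctl interface, which remains available.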

The Seccomp field in /proc/PID/status shows the current seccomp state.
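Reading that field programmatically could look like the sketch below. The function names are hypothetical; the value mapping (0 disabled, 1 strict, 2 filter) follows proc(5), and the field is only present on kernels that export it.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Extract the seccomp mode from the text of /proc/<pid>/status.
 * Returns 0 (disabled), 1 (strict) or 2 (filter), or -1 if no
 * "Seccomp:" line is found.
 */
int parse_seccomp_mode(const char *status_text)
{
	assert(status_text != NULL);

	const char *p = status_text;
	int mode;

	while (*p != '\0') {
		/* "Seccomp_filters:" does not match because of the '_' */
		if (sscanf(p, "Seccomp: %d", &mode) == 1)
			return mode;
		/* advance to the start of the next line */
		while (*p != '\0' && *p != '\n')
			p++;
		if (*p == '\n')
			p++;
	}
	return -1;
}

/* Read the seccomp mode of the calling process; -1 on failure. */
int current_seccomp_mode(void)
{
	char buf[8192];
	FILE *fp = fopen("/proc/self/status", "r");

	if (fp == NULL)
		return -1;
	size_t n = fread(buf, 1, sizeof(buf) - 1, fp);
	fclose(fp);
	buf[n] = '\0';
	return parse_seccomp_mode(buf);
}
```

A process running under a container runtime's default profile would typically report mode 2 here, while an unconfined shell reports 0.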

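BPF stands for Berkeley Packet Filter and dates back to the early 1990s, about as far back as the Linux kernel itself. As the name suggests, BPF was originally used for packet filtering, using a custom register-based instruction set so that rules can be updated dynamically in the kernel.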

Taking tcpdump as an example, the -d option shows the compiled BPF instructions:

$ tcpdump -d -i lo0 tcp
(000) ld       [0]
(001) jeq      #0x1e000000      jt 2	jf 7
(002) ldb      [10]
(003) jeq      #0x6             jt 10	jf 4
(004) jeq      #0x2c            jt 5	jf 11
(005) ldb      [44]
(006) jeq      #0x6             jt 10	jf 11
(007) jeq      #0x2000000       jt 8	jf 11
(008) ldb      [13]
(009) jeq      #0x6             jt 10	jf 11
(010) ret      #262144
(011) ret      #0

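Newer Linux kernels introduced eBPF (extended BPF), which generalizes the original instruction set. A BPF program cannot run on its own; it is attached to predefined locations in the kernel to respond to specific events, including but not limited to: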

network
syscall (seccomp)
tracepoints
kprobes
uprobes
perf_events
…

This article is mainly concerned with seccomp. At the system call level, the seccomp state of a process is manipulated mainly through seccomp(2):

int seccomp(unsigned int operation, unsigned int flags, void *args);

The main operations are:

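SECCOMP_SET_MODE_STRICT: strict mode, which limits the calling process to 4 system calls;
SECCOMP_SET_MODE_FILTER: BPF mode, in which a user-supplied BPF program defines the system call filtering rules for the process;
SECCOMP_GET_ACTION_AVAIL: test whether the kernel supports a given action;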
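Strict mode is the easiest of these to try out. The sketch below enters it through the older prctl interface inside a forked child and then attempts a system call outside the permitted four; strict_mode_demo is a hypothetical name, and the behavior assumes a kernel built with seccomp support.

```c
#include <assert.h>
#include <signal.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child, put it into strict mode and let it attempt getpid(),
 * which is outside the four permitted system calls.
 * Returns the signal that terminated the child (SIGKILL is expected),
 * 0 if the child exited normally (strict mode was not available),
 * or -1 if fork() or waitpid() failed.
 */
int strict_mode_demo(void)
{
	pid_t pid = fork();
	if (pid < 0)
		return -1;

	if (pid == 0) {
		if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT, 0, 0, 0) != 0)
			_exit(2);	/* seccomp not available */
		syscall(SYS_getpid);	/* forbidden: the kernel kills the task */
		_exit(3);		/* only reached if strict mode had no effect */
	}

	assert(pid > 0);		/* only the parent gets here */

	int status;
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

The child is placed in its own process so that the restriction, which cannot be lifted once set, does not affect the caller.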

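Of these operations, SECCOMP_SET_MODE_FILTER is the most flexible, since the filtering rules are given as a BPF program. It is correspondingly more complex to use, even though the headers provide some helper macros for writing filters. The man page gives an example program: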

#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
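#include <sys/prctl.h>
#include <sys/syscall.h>

#define X32_SYSCALL_BIT 0x40000000
#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

/* glibc ships no seccomp() wrapper, so call the system call directly */
static int
seccomp(unsigned int operation, unsigned int flags, void *args)
{
   return syscall(SYS_seccomp, operation, flags, args);
}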

static int
install_filter(int syscall_nr, int t_arch, int f_errno)
{
   unsigned int upper_nr_limit = 0xffffffff;

   /* Assume that AUDIT_ARCH_X86_64 means the normal x86-64 ABI
      (in the x32 ABI, all system calls have bit 30 set in the
      'nr' field, meaning the numbers are >= X32_SYSCALL_BIT) */
   if (t_arch == AUDIT_ARCH_X86_64)
       upper_nr_limit = X32_SYSCALL_BIT - 1;

   struct sock_filter filter[] = {
       /* [0] Load architecture from 'seccomp_data' buffer into
              accumulator */
       BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                (offsetof(struct seccomp_data, arch))),

       /* [1] Jump forward 5 instructions if architecture does not
              match 't_arch' */
       BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, t_arch, 0, 5),

       /* [2] Load system call number from 'seccomp_data' buffer into
              accumulator */
       BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                (offsetof(struct seccomp_data, nr))),

       /* [3] Check ABI - only needed for x86-64 in deny-list use
              cases.  Use BPF_JGT instead of checking against the bit
              mask to avoid having to reload the syscall number. */
       BPF_JUMP(BPF_JMP | BPF_JGT | BPF_K, upper_nr_limit, 3, 0),

       /* [4] Jump forward 1 instruction if system call number
              does not match 'syscall_nr' */
       BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, syscall_nr, 0, 1),

       /* [5] Matching architecture and system call: don't execute
          the system call, and return 'f_errno' in 'errno' */
       BPF_STMT(BPF_RET | BPF_K,
                SECCOMP_RET_ERRNO | (f_errno & SECCOMP_RET_DATA)),

       /* [6] Destination of system call number mismatch: allow other
              system calls */
       BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

       /* [7] Destination of architecture mismatch: kill process */
       BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
   };

   struct sock_fprog prog = {
       .len = ARRAY_SIZE(filter),
       .filter = filter,
   };

   if (seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog)) {
       perror("seccomp");
       return 1;
   }

   return 0;
}

int
main(int argc, char **argv)
{
   if (argc < 5) {
       fprintf(stderr, "Usage: "
               "%s <syscall_nr> <arch> <errno> <prog> [<args>]\n"
               "Hint for <arch>: AUDIT_ARCH_I386: 0x%X\n"
               "                 AUDIT_ARCH_X86_64: 0x%X\n"
               "\n", argv[0], AUDIT_ARCH_I386, AUDIT_ARCH_X86_64);
       exit(EXIT_FAILURE);
   }

   if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
       perror("prctl");
       exit(EXIT_FAILURE);
   }

   if (install_filter(strtol(argv[1], NULL, 0),
                      strtol(argv[2], NULL, 0),
                      strtol(argv[3], NULL, 0)))
       exit(EXIT_FAILURE);

   execv(argv[4], &argv[4]);
   perror("execv");
   exit(EXIT_FAILURE);
}

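In practice, higher-level wrapper APIs are used more often. libseccomp, for example, provides interfaces such as seccomp_init and seccomp_load to make policy development easier for system administrators.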

Summary

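Linux access control has kept improving over time, and Android, as a major user, adopts Linux security features aggressively to protect user applications and the system. A plain UID=0 root can no longer do whatever it likes; it has been joined by a growing set of fine-grained permission controls. Understanding these security mechanisms helps developers get a complete picture of the system's defenses and a better sense of the threat model of their applications.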
