Filesystems are complex and performance-sensitive beasts. They can also present security concerns. Microkernel-based systems have long pushed filesystems into separate processes in order to contain any vulnerabilities that may be found there. Linux can do the same with the Filesystem in Userspace (FUSE) subsystem, but using FUSE brings a significant performance penalty. Darrick Wong is working on ways to eliminate that penalty, and he has a massive patch set showing how ext4 filesystems can be safely implemented in user space by unprivileged processes with good performance. This work has the potential to radically change how filesystems are managed on Linux systems.
One of the biggest challenges faced by a filesystem implementation is the need to parse and maintain a complex data structure that can be entirely under an attacker's control. It is possible to compromise almost any Linux filesystem implementation with a maliciously crafted filesystem image. While most filesystems are robust against corrupted images in general, malicious corruption is another story, and most filesystem developers only try so hard to protect against that case. For this reason, mounting a filesystem, which would otherwise be an inherently safe thing to do when certain protections (against overmounting system directories or adding setuid binaries, for example) are in place, remains reserved to privileged users.
If the management of filesystem metadata is moved to user space, though, the potential for mayhem from a malicious image is greatly reduced. FUSE allows exactly that, but the overhead of passing all filesystem I/O over the connection to the user-space FUSE server makes FUSE filesystems slow. Wong's attempt to address this problem is somewhat intimidating at first look; it is a collection of five independent patch topics, most of which have multiple sub-parts. It comprises 182 patches in total. There is a lot of complexity, but the core idea is relatively simple.
Iomap
Filesystems move a lot of data around. Much of the added cost of a FUSE filesystem comes from the need to pass data between the kernel and the FUSE server as filesystem operations are requested. If a process writes some data to a file on a FUSE filesystem, the kernel must pass that data to the user-space FUSE server to implement that write; the server will, almost certainly, then pass the data back to the kernel to actually land it in persistent storage. An obvious way to improve this situation would be to somehow keep the data movement within the kernel, and just have the FUSE server tell the kernel where blocks of data should be read from or written to. That would allow the FUSE server to handle the metadata management while removing the extra cost from the I/O path.
In the end, much of a filesystem's job consists of maintaining mappings between logical offsets within files and physical locations on persistent storage. Once that is done, file I/O boils down to using those mappings to move blocks of data back and forth — a task that is independent of any given filesystem. (Of course, every filesystem developer reading this text is now seething at this extreme oversimplification; there is little to be done for that.) The kernel has offered various mechanisms for managing this mapping, including buffer heads, which were part of the first public release of the Linux kernel.
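The offset-to-location mapping described above can be sketched as a lookup over a sorted extent table. This is a toy illustration only: `struct toy_extent` and `toy_map_offset()` are hypothetical names, not kernel or ext4 types, and real filesystems use far more elaborate on-disk structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One contiguous run of file data: purely illustrative, not a kernel type. */
struct toy_extent {
    uint64_t file_off;   /* logical byte offset within the file */
    uint64_t disk_off;   /* physical byte offset on the device */
    uint64_t len;        /* length of the run in bytes */
};

/*
 * Translate a logical file offset to a device offset by binary search over
 * extents sorted by file_off. Returns 0 and fills *disk_off on success, or
 * -1 if the offset falls in a hole (no extent covers it).
 */
int toy_map_offset(const struct toy_extent *ext, size_t n,
                   uint64_t file_off, uint64_t *disk_off)
{
    size_t lo = 0, hi = n;

    assert(disk_off != NULL);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const struct toy_extent *e = &ext[mid];

        if (file_off < e->file_off) {
            hi = mid;                       /* target lies before this extent */
        } else if (file_off - e->file_off >= e->len) {
            lo = mid + 1;                   /* target lies after this extent */
        } else {
            *disk_off = e->disk_off + (file_off - e->file_off);
            return 0;
        }
    }
    return -1;
}
```

Once a lookup like this has answered "where does this byte live?", moving the data itself needs no filesystem-specific knowledge, which is the point the paragraph above makes.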
In more recent times, though, this mapping task is supported in the kernel by the iomap layer. It was first introduced by Christoph Hellwig (based on older code from the XFS filesystem) for the 4.8 release in 2016, and other filesystems have been slowly making use of it since then. The iomap layer abstracts out a lot of details, simplifying matters on the filesystem side. At its core are two callbacks that filesystems must provide:
    int (*iomap_begin)(struct inode *inode, loff_t pos, loff_t length,
                       unsigned flags, struct iomap *iomap,
                       struct iomap *srcmap);
    int (*iomap_end)(struct inode *inode, loff_t pos, loff_t length,
                     ssize_t written, unsigned flags, struct iomap *iomap);
Without getting into the details, iomap_begin() asks the filesystem to specify the on-disk mapping for the given inode over the range of length bytes starting at pos. When the kernel is done with the mapping, it will inform the filesystem with a call to iomap_end(). In between, the kernel may well use that mapping to move data between memory and the filesystem's storage device.
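The begin/use/end cycle can be modeled in user space roughly as follows. This is a simplified sketch, not the kernel's iomap code: `struct toy_iomap`, `struct toy_ops`, `toy_iter()`, and the toy callbacks are all hypothetical stand-ins, and the assumption that a mapping may cover less than the requested range (so the caller loops) is part of the model.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space stand-ins; the real kernel types differ. */
struct toy_iomap {
    uint64_t offset;  /* file offset where this mapping starts */
    uint64_t length;  /* bytes covered by this mapping */
    uint64_t addr;    /* device offset backing `offset` */
};

struct toy_ops {
    int (*begin)(uint64_t pos, uint64_t length, struct toy_iomap *map);
    int (*end)(uint64_t pos, uint64_t length, uint64_t written);
};

/*
 * Walk [pos, pos + length): ask for a mapping, "do I/O" on as much of the
 * range as the mapping covers, report completion, then advance. Returns the
 * number of bytes completed, which is short if a callback fails.
 */
uint64_t toy_iter(const struct toy_ops *ops, uint64_t pos, uint64_t length)
{
    uint64_t done = 0;

    assert(ops && ops->begin && ops->end);
    while (done < length) {
        struct toy_iomap map;
        uint64_t cur = pos + done, want = length - done, chunk;

        if (ops->begin(cur, want, &map) != 0)
            break;
        /* The mapping must cover cur; it may be shorter than requested. */
        if (map.length == 0 || cur < map.offset ||
            cur - map.offset >= map.length)
            break;
        chunk = map.length - (cur - map.offset);
        if (chunk > want)
            chunk = want;
        /* A real caller would move `chunk` bytes to or from
         * map.addr + (cur - map.offset) at this point. */
        if (ops->end(cur, chunk, chunk) != 0)
            break;
        done += chunk;
    }
    return done;
}

#define TOY_BLOCK 4096u

unsigned toy_begin_calls, toy_end_calls;

/* Toy filesystem: every 4 KiB file block is its own mapping, placed at a
 * fixed 1 MiB shift on the device. */
int toy_begin(uint64_t pos, uint64_t length, struct toy_iomap *map)
{
    (void)length;
    map->offset = pos - pos % TOY_BLOCK;
    map->length = TOY_BLOCK;
    map->addr = map->offset + 1024 * 1024;
    toy_begin_calls++;
    return 0;
}

int toy_end(uint64_t pos, uint64_t length, uint64_t written)
{
    (void)pos; (void)length; (void)written;
    toy_end_calls++;
    return 0;
}
```

With these toy callbacks, a 10,000-byte request starting at offset 1000 touches three 4 KiB blocks, so the loop makes three begin/end pairs. A filesystem that returned larger mappings would need correspondingly fewer calls.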
Sub-part 4 of Wong's series adds two new operations, FUSE_IOMAP_BEGIN and FUSE_IOMAP_END, to the FUSE API. These operations correspond to the two callbacks above, allowing a user-space filesystem to build an I/O mapping in the kernel, which can then use that mapping to perform many I/O operations directly, without having to involve the user-space server further. While the longer-term goal is to enable unprivileged filesystem mounts, the ability to use iomap in FUSE is restricted to processes that have the CAP_SYS_RAWIO capability.
Providing basic iomap access can speed FUSE servers by avoiding the need to move file data between the kernel and user space, but there is more to be done to reach a high level of performance. One step comes in a further sub-part of the series, which allows the kernel to cache iomap mappings created by the FUSE server. That reduces the number of round trips to the server required, but it is also needed to correctly manage mappings in cases where I/O might cause them to change. Another sub-part improves performance by moving much of the management of timestamps and access-control lists into the kernel.
Finally, a short series allows a privileged mount helper to set a special bit enabling the FUSE server process to use the iomap capability, regardless of whether it has CAP_SYS_RAWIO. That makes it possible for a server process to run in an unprivileged mode, opening up the possibility of implementing filesystems in unprivileged processes that are unable to compromise the system.
User space
That, however, is only the kernel side of the equation. There are another five sub-parts of the series that add the equivalent support to the libfuse user-space library. Yet another six sub-parts add support for the new FUSE features to fuse2fs, the server program that implements the ext4 filesystem (and ext3 and ext2 as well) in user space. As Wong points out in sub-part 1, the results are encouraging:
The performance of this new data path is quite stunning: on a warm system, streaming reads and writes through the pagecache go from 60-90MB/s to 2-2.5GB/s. Direct IO reads and writes improve from the same baseline to 2.5-8GB/s. FIEMAP and SEEK_DATA/SEEK_HOLE now work too. The kernel ext4 driver can manage about 1.6GB/s for pagecache IO and about 2.6-8.5GB/s [for direct I/O], which means that fuse2fs is about as fast as the kernel for streaming file IO.
He does also acknowledge that the results for random buffered I/O are not as good at this point.
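To put the quoted streaming figures in perspective, the implied speedup ranges can be worked out with a small helper. The helper and its names are illustrative, and the conversion assumes decimal units (1GB = 1000MB), which the quoted text does not specify.

```c
#include <assert.h>

struct toy_range { double lo, hi; };

/* Worst- and best-case ratio between two throughput ranges given in the
 * same unit. */
struct toy_range toy_speedup(struct toy_range before, struct toy_range after)
{
    struct toy_range r;

    assert(before.lo > 0.0 && before.hi >= before.lo);
    r.lo = after.lo / before.hi;  /* slowest new rate vs. fastest old rate */
    r.hi = after.hi / before.lo;  /* fastest new rate vs. slowest old rate */
    return r;
}
```

Feeding in the 60-90MB/s baseline gives roughly 22x to 42x for page-cache I/O (2000-2500MB/s) and roughly 28x to 133x for direct I/O (2500-8000MB/s).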
The patch series includes a fair amount of support for running unprivileged FUSE filesystem servers, further containing any fallout from a compromised (or malicious) FUSE server. The whole series ends with a 33-patch sub-part adding testing support for ext4 under FUSE.
Prospects
This is a lot of work that offers some obvious benefits, but it is also a lot for the filesystem developers to absorb. Even so, Wong said in the cover letter that he would like to merge these patches for the 6.19 kernel release. That seems rather ambitious. Hellwig asked for the series to be split up and made easier to review; it is not clear whether Wong intends to do that. He has not gotten around to documenting the iomap changes, though that work must surely be at the top of his to-do list. And, of course, all of this work will need to be reviewed, and likely revised, before it can be merged.
So, in summary, it would be somewhat surprising to see these changes actually land for 6.19. But, given the obvious value that this work brings, Wong may well succeed in upstreaming it in the not-too-distant future. If his results bear out in wider usage, distributors and system integrators could start shipping systems with FUSE-implemented filesystems, which would be a significant change from how Linux systems have worked since the beginning. Linux may never be a microkernel, but it may soon look rather more microkernel-like than it does now.
