Copy avoidance in networked file systems
----------------------------------------

Jose' Carlos Brustoloni
Bell Labs, Lucent Technologies
jcb@research.bell-labs.com

The point-to-point bandwidth of gigabit networks can surpass the main
memory copy bandwidth of many current hosts. Therefore, researchers have
been devoting considerable attention to the problem of copy avoidance in
network I/O. In particular, a recent study shows that copying can be
avoided without modifying the semantics of existing networking APIs [1].
In contrast, far less attention has recently been devoted to copy
avoidance in file I/O. This neglect may be motivated by several subtle
misperceptions:

1) Disks are far slower than main memory (or gigabit networks). This is
indeed true, but copy avoidance can still be worthwhile in file I/O
because: (a) copy avoidance can reduce CPU utilization and significantly
improve the throughput of file servers, which often are CPU-bound, and
(b) caching can avoid physical disk I/O and greatly speed up file
systems.

2) Many copy avoidance techniques for network I/O are not useful in file
I/O. Indeed, copy avoidance techniques for network I/O often exploit the
fact that buffers are ephemeral, i.e., deallocated as soon as processing
of the corresponding input or output request completes. In contrast,
buffers used in file I/O are often cached. For example, emulated copy
[1] is a copy avoidance scheme for network I/O that uses input alignment
and page swapping on input, and TCOW, a form of copy-on-write, on
output. If used on file input, page swapping would corrupt the file
system cache with the previous contents of the client input buffers. If
used on file output, TCOW would allow cached output pages to be
corrupted because, after output completes, the output reference is lost
and therefore the pages can be overwritten or reused. This does not
mean, however, that copy avoidance is unattainable in file I/O. Systems
usually also offer mapped file I/O, which allows file data to be passed
between applications and the operating system by page mapping and
unmapping. Mapped files are a practical solution that is already widely
available for copy avoidance in file I/O.

3) Copying between mapped files and network I/O buffers can be
unavoidable because of page alignment constraints. For example, in a
networked file server, data may be received from the network for output
to the file system. The data will usually be preceded by an
application-layer header specifying the file and the offset from the
beginning of the file (for simplicity, let us assume that the offset is
a multiple of the page size). This header can make copy avoidance
difficult because (a) the application must read the header to determine
the file, and (b) the header may leave the following data arbitrarily
aligned, whereas data must be page-aligned for mapped file I/O. However,
I show that: (a) if the network adapter supports system-aligned
buffering (early demultiplexing or buffer snap-off) [2], then the
application can peek at the header and, after decoding it, input the
data directly to the correct mapped file region, using emulated copy;
data is then passed between network and file system with copy avoidance
and without any modification of existing APIs (a receive-side sketch
follows this list); (b) even without such adapter support, copy
avoidance is possible with header patching, a novel software
optimization.
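The following sketch illustrates claim (a) on the receive side: the
application peeks at the application-layer header, maps the named file
at the given offset, and then receives the data directly into the mapped
region. Only standard Unix calls are used; whether the kernel actually
avoids copying depends on its support for emulated copy and on
system-aligned buffering in the adapter [1,2]. The header layout
(struct app_hdr) and the function name are hypothetical, chosen only for
illustration.

    /* Receive-side sketch: peek at the header, map the file, receive
       the data into the mapped region. Hypothetical header layout. */
    #include <sys/types.h>
    #include <sys/mman.h>
    #include <sys/socket.h>
    #include <fcntl.h>
    #include <unistd.h>

    struct app_hdr {         /* hypothetical application-layer header */
        char   name[64];     /* file name                             */
        off_t  offset;       /* file offset o, multiple of page size  */
        size_t length;       /* data length l                         */
    };

    int receive_block(int sock)
    {
        struct app_hdr hdr;

        /* Peek at the header without consuming it from the socket. */
        if (recv(sock, &hdr, sizeof hdr, MSG_PEEK) != (ssize_t)sizeof hdr)
            return -1;

        int fd = open(hdr.name, O_RDWR);
        if (fd < 0)
            return -1;

        /* Map the file region the data belongs to; the file is assumed
           to already cover [offset, offset + length). */
        char *a = mmap(NULL, hdr.length, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, hdr.offset);
        if (a == MAP_FAILED) {
            close(fd);
            return -1;
        }

        /* Consume the header, then receive the data into the mapped
           region; with suitable kernel and adapter support the data
           pages are swapped in rather than copied. */
        recv(sock, &hdr, sizeof hdr, 0);
        ssize_t n = recv(sock, a, hdr.length, MSG_WAITALL);

        munmap(a, hdr.length);
        close(fd);
        return n == (ssize_t)hdr.length ? 0 : -1;
    }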
Let h' be the preferred alignment for input from the network (usually
equal to the length of any unstripped protocol headers below the
application layer), h be the length of the application-layer header, and
l be the data length (less than or equal to the network's maximum
transmission unit minus the lengths of headers at the network or higher
layers). h' must be fixed and known by both sender and receiver; in
contrast, h and l may vary from packet to packet. Using header patching,
the sender transmits the application-layer header, followed by the data
starting at file offset o + h' + h and of length l - h' - h, followed by
the data starting at file offset o and of length h' + h (to achieve this
out-of-order transmission, the sender may use, e.g., Unix's writev call
with a gather list; a sender-side sketch appears at the end of this
note). The receiver peeks at the first h bytes of the input (using,
e.g., Unix's recv with the MSG_PEEK flag), decodes the application-layer
header, and determines the address a corresponding to file offset o (a
multiple of the page size) in the correct mapped file region. The
receiver then inputs l - h' bytes to address a + h', followed by h' + h
bytes to address a. This causes most of the data to be passed by page
swapping, after which the data corresponding to offset o and of length
h' + h is patched on top of the application- and lower-layer headers at
address a. After patching, the input buffer starts at the correct offset
in the mapped file region and runs uninterrupted for length l with the
data in correct order, as illustrated by the following figure.

              +----+---+----------+----+
  Packet:     | h' | h |    d1    | d0 |
              +----+---+----------+----+

  Pooled NW   +----+---+----------+     +----+--------------+
  buffers:    | h' | h |    d1    |     | d0 |              |
              +----+---+----------+     +----+--------------+
                ^           ^ |            |
        reverse |      swap | |            |
        copyout |           | v            |
  Mapped      +----+---+----------+        |
  file:       |    | h |    d1    |        |
              +----+---+----------+        |
               \--------/                  |
                    ^                      |
                    |        patch         |
                    +----------------------+

My experiments on the Credit Net ATM network at 512 Mbps show that copy
avoidance can substantially improve the performance of networked file
systems. Because of cache effects, copy avoidance benefits are
synergistic: the greatest benefits are obtained when copying is avoided
on the entire end-to-end data path, including network and file I/O.
Additionally, the experiments confirm each of the above claims.

References
----------

[1] J. Brustoloni and P. Steenkiste. ``Effects of Buffering Semantics on
    I/O Performance'', in Proc. OSDI'96, USENIX, Oct. 1996, pp. 277-291.
    Also available from http://www.cs.cmu.edu/~jcb/.

[2] J. Brustoloni and P. Steenkiste. ``Copy Emulation in Checksummed,
    Multiple-Packet Communication'', in Proc. INFOCOM'97, IEEE, April
    1997. Also available from http://www.cs.cmu.edu/~jcb/.

---------------------------------------------------------------------------
Work performed while at the School of Computer Science, Carnegie Mellon
University. To be presented at the Gigabit Networking Workshop - GBN'98.
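As referenced above, the following is a minimal sketch of the sender
side of header patching: a single writev call transmits the
application-layer header, then the data for file offsets
[o + h' + h, o + l), then the data for offsets [o, o + h' + h). The
function and argument names are hypothetical; "data" is assumed to point
at the sender's block of file data starting at offset o, and error
handling is omitted.

    #include <sys/types.h>
    #include <sys/uio.h>
    #include <unistd.h>

    ssize_t send_patched(int sock, const void *hdr, size_t h,
                         const char *data, size_t hprime /* h' */,
                         size_t l)
    {
        struct iovec iov[3];

        iov[0].iov_base = (void *)hdr;        /* app-layer header, h bytes  */
        iov[0].iov_len  = h;
        iov[1].iov_base = (void *)(data + hprime + h);
        iov[1].iov_len  = l - hprime - h;     /* data for o+h'+h .. o+l     */
        iov[2].iov_base = (void *)data;
        iov[2].iov_len  = hprime + h;         /* first h'+h bytes, sent last */

        return writev(sock, iov, 3);
    }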