Copy avoidance in networked file systems
----------------------------------------
Jose' Carlos Brustoloni
Bell Labs, Lucent Technologies
jcb@research.bell-labs.com
The point-to-point bandwidth of gigabit networks can surpass
the main memory copy bandwidth of many current hosts. Therefore,
researchers have been devoting considerable attention to the problem
of copy avoidance in network I/O. In particular, a recent study shows
that copying can be avoided without modifying the semantics of existing
networking APIs [1].
In contrast, far less attention has recently been devoted to copy
avoidance in file I/O. This neglect may stem from several
subtle misperceptions:
1) Disks are far slower than main memory (or gigabit networks).
This is indeed true, but copy avoidance can still be worthwhile
in file I/O because:
(a) copy avoidance can reduce CPU utilization and significantly
improve the throughput of file servers, which often are CPU-bound, and
(b) caching can avoid physical disk I/O and greatly speed up file systems.
2) Many copy avoidance techniques for network I/O are not useful in file I/O.
Indeed, copy avoidance techniques for network I/O
often exploit the fact that buffers are ephemeral, i.e., they are
deallocated as soon as processing of the corresponding input or
output request completes. In contrast, buffers used in file I/O
are often cached. For example, emulated copy [1] is a copy avoidance
scheme for network I/O that uses input alignment and page swapping
on input and TCOW, a form of copy-on-write, on output. If used on
file input, page swapping would corrupt the file system cache
with the previous contents of the client input buffers.
If used on file output, TCOW would allow cached output pages to
be corrupted because, after output completes, the output reference
is lost and therefore the pages can be overwritten or reused.
This does not mean, however, that copy avoidance is unattainable
in file I/O. Systems usually also offer mapped file I/O,
which allows file data to be passed between applications and the
operating system by page mapping and unmapping. Mapped files are a practical
solution that is already widely available for copy avoidance in file I/O.
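As an illustration of mapped file I/O, the following minimal C sketch maps
a region of a file with the POSIX mmap call. The helper name
map_file_region and its parameters are hypothetical and are not taken from
this work; the sketch assumes a POSIX system and a page-aligned offset.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: map `len` bytes of `path`, starting at the
 * page-aligned file offset `offset`, for shared read/write access.
 * Returns the mapped address, or NULL on failure. */
void *map_file_region(const char *path, off_t offset, size_t len)
{
    /* mmap requires the file offset to be a multiple of the page size. */
    assert(offset % sysconf(_SC_PAGESIZE) == 0);

    int fd = open(path, O_RDWR);
    if (fd < 0)
        return NULL;

    /* MAP_SHARED makes stores to the region visible in the file. */
    void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                      fd, offset);
    close(fd); /* the mapping remains valid after the descriptor is closed */

    return addr == MAP_FAILED ? NULL : addr;
}
```

Once the region is mapped, the application reads and writes file data
through ordinary loads and stores, and releases it with munmap.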
3) Copying between mapped files and network I/O buffers can be unavoidable
because of page alignment constraints.
For example, in a networked file server, data may be received from
the network for output to the file system. The data will usually be
preceded by an application-layer header specifying the file and
offset from the beginning of the file (for simplicity, let us assume
that the offset is a multiple of the page size). This header can make
copy avoidance difficult because (a) the application must read the
header to determine the file and (b) the header may make the following data
arbitrarily aligned, whereas data must be page-aligned for mapped file I/O.
However, I show that:
(a) If the network adapter supports system-aligned buffering
(early demultiplexing or buffer snap-off) [2], then the application
can peek at the header and, after decoding it, input the data
directly to the correct mapped file region, using emulated copy.
Data is passed between network and file system with copy avoidance
and without any modifications to existing APIs.
(b) Even without such adapter support, copy avoidance is possible with
header patching, a novel software optimization.
Let h' be the preferred alignment for input from the network (usually
equal to the length of any unstripped protocol headers below the
application layer), h be the length of the application-layer header,
and l be the data length (less than or equal to the network's
maximum transmission unit minus the lengths of headers at network
or higher layers). h' must be fixed and known by both sender and
receiver. In contrast, h and l may vary from packet to packet.
Using header patching, the sender transmits the application-layer
header followed by the data starting at file offset o + h' + h
and of length l - h' - h, followed by data starting at file offset
o and of length h' + h (to achieve this out-of-order transmission,
the sender may use, e.g., Unix's writev call with a gather list).
The receiver peeks at the first h bytes of the input
(using, e.g., Unix's recv with the MSG_PEEK flag),
decodes the application-layer header, and determines the address a
corresponding to file offset o (a multiple of the page size)
in the correct mapped file region. The receiver then inputs l - h'
bytes to address a + h', followed by h' + h bytes to address a.
This causes most of the data to be passed by page swapping, after which
the data corresponding to offset o and of length h' + h is
patched on top of the application- and lower-layer
headers at address a. After patching, the input buffer starts at
the correct offset in the mapped file region and runs uninterrupted
for length l with the data in correct order, as illustrated by the
following figure.

            +----+---+----------+----+
Packet:     | h' | h |    d1    | d0 |
            +----+---+----------+----+

Pooled NW   +----+---+----------+     +----+--------------+
buffers:    | h' | h |    d1    |     | d0 |              |
            +----+---+----------+     +----+--------------+
              |                         |
reverse       | ^        | ^            |
copyout       | |        | | swap       |
              |          | v            |
Mapped      +----+---+----------+       |
file:       |    | h |    d1    |       |
            +----+---+----------+       |
            \--------/                  |
                ^                       |
                |                       |
patch           +-----------------------+

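The call sequence used by header patching can be sketched in C as follows.
This is only an illustration under stated assumptions: the struct app_hdr
layout, the function names send_patched and recv_patched, and the parameter
hp (standing for h') are hypothetical. On an ordinary socket these calls
copy data; the sketch shows only the order and addresses of the transfers,
while the page swapping itself would be performed by an operating system
that implements emulated copy.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical application-layer header, so h = sizeof(struct app_hdr).
 * Byte order conversion is omitted for brevity. */
struct app_hdr {
    uint32_t offset; /* file offset o, assumed to be page-aligned */
    uint32_t length; /* data length l */
};

/* Sender: transmit the header, then d1 = data[hp+h .. l),
 * then d0 = data[0 .. hp+h), using a gather list. */
ssize_t send_patched(int fd, const char *data, uint32_t o, uint32_t l,
                     size_t hp)
{
    struct app_hdr hdr = { o, l };
    size_t h = sizeof hdr;
    assert(l >= hp + h);

    struct iovec iov[3] = {
        { &hdr, h },                             /* application header */
        { (void *)(data + hp + h), l - hp - h }, /* d1 */
        { (void *)data, hp + h },                /* d0, sent last */
    };
    return writev(fd, iov, 3);
}

/* Receiver: `region` is assumed to be the start of the mapped file.
 * Returns l on success, -1 on error. */
ssize_t recv_patched(int fd, char *region, size_t hp)
{
    struct app_hdr hdr;
    size_t h = sizeof hdr;

    /* Peek at the header without consuming it. */
    if (recv(fd, &hdr, h, MSG_PEEK | MSG_WAITALL) != (ssize_t)h)
        return -1;

    char *a = region + hdr.offset; /* address of file offset o */
    size_t l = hdr.length;
    if (l < hp + h)
        return -1;

    /* Header and d1 land at a + hp, so d1 ends up at a + hp + h. */
    if (recv(fd, a + hp, l - hp, MSG_WAITALL) != (ssize_t)(l - hp))
        return -1;

    /* d0 is patched over the header bytes at a. */
    if (recv(fd, a, hp + h, MSG_WAITALL) != (ssize_t)(hp + h))
        return -1;

    return (ssize_t)l;
}
```

After recv_patched returns, the l bytes starting at address a hold the
data in file order, with no header bytes left in the region.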
My experiments on the Credit Net ATM network at 512 Mbps show that
copy avoidance can substantially improve the performance of networked
file systems. Because of cache effects, copy avoidance benefits are
synergistic: the greatest benefits are obtained when copying is avoided
on the entire end-to-end data path, including network and file I/O.
Additionally, the experiments confirm each of the above claims.
References
----------
[1] J. Brustoloni and P. Steenkiste. ``Effects of Buffering
Semantics on I/O Performance'', in Proc. OSDI'96,
USENIX, Oct. 1996, pp. 277-291. Also available from
http://www.cs.cmu.edu/~jcb/.
[2] J. Brustoloni and P. Steenkiste. ``Copy Emulation in
Checksummed, Multiple-Packet Communication'', in
Proc. INFOCOM'97, IEEE, April 1997. Also available from
http://www.cs.cmu.edu/~jcb/.
---------------------------------------------------------------------------
Work performed while at the School of Computer Science,
Carnegie Mellon University.
To be presented at Gigabit Networking Workshop - GBN'98.