Copy avoidance in networked file systems
----------------------------------------

Jose' Carlos Brustoloni
Bell Labs, Lucent Technologies
jcb@research.bell-labs.com

The point-to-point bandwidth of gigabit networks can surpass the main
memory copy bandwidth of many current hosts. Therefore, researchers have
been devoting considerable attention to the problem of copy avoidance in
network I/O. In particular, a recent study shows that copying can be
avoided without modifying the semantics of existing networking APIs [1].
In contrast, far less attention has recently been devoted to copy
avoidance in file I/O. This neglect may be motivated by several subtle
misperceptions:

1) Disks are far slower than main memory (or gigabit networks). This is
indeed true, but copy avoidance can still be worthwhile in file I/O
because: (a) copy avoidance can reduce CPU utilization and significantly
improve the throughput of file servers, which often are CPU-bound, and
(b) caching can avoid physical disk I/O and greatly speed up file
systems.

2) Many copy avoidance techniques for network I/O are not useful in file
I/O. Indeed, copy avoidance techniques for network I/O often exploit the
fact that buffers are ephemeral, i.e., deallocated as soon as processing
of the corresponding input or output request completes. In contrast,
buffers used in file I/O are often cached. For example, emulated copy
[1] is a copy avoidance scheme for network I/O that uses input alignment
and page swapping on input, and TCOW, a form of copy-on-write, on
output. If used on file input, page swapping would corrupt the file
system cache with the previous contents of the client input buffers. If
used on file output, TCOW would allow cached output pages to be
corrupted because, after output completes, the output reference is lost
and therefore the pages can be overwritten or reused. This does not
mean, however, that copy avoidance is unattainable in file I/O. Systems
usually also offer mapped file I/O, which allows file data to be passed
between applications and the operating system by page mapping and
unmapping. Mapped files are a practical solution that is already widely
available for copy avoidance in file I/O.

3) Copying between mapped files and network I/O buffers can be
unavoidable because of page alignment constraints. For example, in a
networked file server, data may be received from the network for output
to the file system. The data will usually be preceded by an
application-layer header specifying the file and the offset from the
beginning of the file (for simplicity, let us assume that the offset is
a multiple of the page size). This header can make copy avoidance
difficult because (a) the application must read the header to determine
the file, and (b) the header may leave the following data arbitrarily
aligned, whereas data must be page-aligned for mapped file I/O. However,
I show that: (a) if the network adapter supports system-aligned
buffering (early demultiplexing or buffer snap-off) [2], then the
application can peek at the header and, after decoding it, input the
data directly to the correct mapped file region, using emulated copy;
data is then passed between network and file system with copy avoidance
and without any modification of existing APIs (a receive-side sketch
follows this list); (b) even without such adapter support, copy
avoidance is possible with header patching, a novel software
optimization.
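The following sketch illustrates claim (a) on the receive side: the
application peeks at the application-layer header, maps the named file
at the given offset, and then receives the data directly into the mapped
region. Only standard Unix calls are used; whether the kernel actually
avoids copying depends on its support for emulated copy and on
system-aligned buffering in the adapter [1,2]. The header layout
(struct app_hdr) and the function name are hypothetical, chosen only for
illustration.

    /* Receive-side sketch: peek at the header, map the file, receive
       the data into the mapped region. Hypothetical header layout. */
    #include <sys/types.h>
    #include <sys/mman.h>
    #include <sys/socket.h>
    #include <fcntl.h>
    #include <unistd.h>

    struct app_hdr {         /* hypothetical application-layer header */
        char   name[64];     /* file name                             */
        off_t  offset;       /* file offset o, multiple of page size  */
        size_t length;       /* data length l                         */
    };

    int receive_block(int sock)
    {
        struct app_hdr hdr;

        /* Peek at the header without consuming it from the socket. */
        if (recv(sock, &hdr, sizeof hdr, MSG_PEEK) != (ssize_t)sizeof hdr)
            return -1;

        int fd = open(hdr.name, O_RDWR);
        if (fd < 0)
            return -1;

        /* Map the file region the data belongs to; the file is assumed
           to already cover [offset, offset + length). */
        char *a = mmap(NULL, hdr.length, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, hdr.offset);
        if (a == MAP_FAILED) {
            close(fd);
            return -1;
        }

        /* Consume the header, then receive the data into the mapped
           region; with suitable kernel and adapter support the data
           pages are swapped in rather than copied. */
        recv(sock, &hdr, sizeof hdr, 0);
        ssize_t n = recv(sock, a, hdr.length, MSG_WAITALL);

        munmap(a, hdr.length);
        close(fd);
        return n == (ssize_t)hdr.length ? 0 : -1;
    }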
Let h' be the preferred alignment for input from the network (usually
equal to the length of any unstripped protocol headers below the
application layer), h be the length of the application-layer header, and
l be the data length (less than or equal to the network's maximum
transmission unit minus the lengths of headers at the network or higher
layers). h' must be fixed and known by both sender and receiver; in
contrast, h and l may vary from packet to packet. Using header patching,
the sender transmits the application-layer header, followed by the data
starting at file offset o + h' + h and of length l - h' - h, followed by
the data starting at file offset o and of length h' + h (to achieve this
out-of-order transmission, the sender may use, e.g., Unix's writev call
with a gather list; a sender-side sketch appears at the end of this
note). The receiver peeks at the first h bytes of the input (using,
e.g., Unix's recv with the MSG_PEEK flag), decodes the application-layer
header, and determines the address a corresponding to file offset o (a
multiple of the page size) in the correct mapped file region. The
receiver then inputs l - h' bytes to address a + h', followed by h' + h
bytes to address a. This causes most of the data to be passed by page
swapping, after which the data corresponding to offset o and of length
h' + h is patched on top of the application- and lower-layer headers at
address a. After patching, the input buffer starts at the correct offset
in the mapped file region and runs uninterrupted for length l with the
data in correct order, as illustrated by the following figure.

              +----+---+----------+----+
  Packet:     | h' | h |    d1    | d0 |
              +----+---+----------+----+

  Pooled NW   +----+---+----------+     +----+--------------+
  buffers:    | h' | h |    d1    |     | d0 |              |
              +----+---+----------+     +----+--------------+
                ^           ^ |            |
        reverse |      swap | |            |
        copyout |           | v            |
  Mapped      +----+---+----------+        |
  file:       |    | h |    d1    |        |
              +----+---+----------+        |
               \--------/                  |
                    ^                      |
                    |        patch         |
                    +----------------------+

My experiments on the Credit Net ATM network at 512 Mbps show that copy
avoidance can substantially improve the performance of networked file
systems. Because of cache effects, copy avoidance benefits are
synergistic: the greatest benefits are obtained when copying is avoided
on the entire end-to-end data path, including network and file I/O.
Additionally, the experiments confirm each of the above claims.

References
----------

[1] J. Brustoloni and P. Steenkiste. ``Effects of Buffering Semantics on
    I/O Performance'', in Proc. OSDI'96, USENIX, Oct. 1996, pp. 277-291.
    Also available from http://www.cs.cmu.edu/~jcb/.

[2] J. Brustoloni and P. Steenkiste. ``Copy Emulation in Checksummed,
    Multiple-Packet Communication'', in Proc. INFOCOM'97, IEEE, April
    1997. Also available from http://www.cs.cmu.edu/~jcb/.

---------------------------------------------------------------------------
Work performed while at the School of Computer Science, Carnegie Mellon
University. To be presented at the Gigabit Networking Workshop - GBN'98.
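As referenced above, the following is a minimal sketch of the sender
side of header patching: a single writev call transmits the
application-layer header, then the data for file offsets
[o + h' + h, o + l), then the data for offsets [o, o + h' + h). The
function and argument names are hypothetical; "data" is assumed to point
at the sender's block of file data starting at offset o, and error
handling is omitted.

    #include <sys/types.h>
    #include <sys/uio.h>
    #include <unistd.h>

    ssize_t send_patched(int sock, const void *hdr, size_t h,
                         const char *data, size_t hprime /* h' */,
                         size_t l)
    {
        struct iovec iov[3];

        iov[0].iov_base = (void *)hdr;        /* app-layer header, h bytes  */
        iov[0].iov_len  = h;
        iov[1].iov_base = (void *)(data + hprime + h);
        iov[1].iov_len  = l - hprime - h;     /* data for o+h'+h .. o+l     */
        iov[2].iov_base = (void *)data;
        iov[2].iov_len  = hprime + h;         /* first h'+h bytes, sent last */

        return writev(sock, iov, 3);
    }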