ossp-pkg/sio/BRAINSTORM/apache-stackedio.txt 1.1
[djg: comments like this are from dean]

This past summer, Alexei and I wrote a spec for an I/O Filters API... 
this proposal addresses one part of that -- 'stacked' I/O with buff.c. 

We have a couple of options for stacked I/O: we can either use existing
code, such as sfio, or we can rewrite buff.c to do it.  We've gone over
the first possibility at length, though, and there were problems with
each implementation that was mentioned (specifically, licensing and
compatibility); so far as I know, those remain issues.

Btw -- sfio will be supported w/in this model... it just wouldn't be the
basis for the model's implementation. 

     -- Ed Korthof        |  Web Server Engineer --
     -- ed@organic.com    |  Organic Online, Inc --
     -- (415) 278-5676    |  Fax: (415) 284-6891 --

---------------------------------------------------------------------------
Stacked I/O With BUFFs
	Sections:

	1.) Overview
	2.) The API
		User-supplied structures
		API functions
	3.) Detailed Description
		The bfilter structure
		The bbottomfilter structure
		The BUFF structure
		Public functions in buff.c
	4.) Efficiency Considerations
		Buffering
		Memory copies
		Function chaining
		writev
	5.) Code in buff.c
		Default Functions
		Heuristics for writev
		Writing
		Reading
		Flushing data
		Closing stacks and filters
		Flags and Options

*************************************************************************
		Overview

The intention of this API is to make Apache's BUFF structure modular
while retaining high efficiency.  Basically, it involves rewriting
buff.c to provide 'stacked' I/O -- where the data is passed through a
series of 'filters', which may modify it.

There are two parts to this: the core code for BUFF structures, and the
"filters" used to implement new behavior.  "filter" is used to refer both
to the set of six functions shown in the bfilter structure in the
next section, and to BUFFs which are created using a specific bfilter.
These will also occasionally be referred to as "user-supplied", though
the Apache core will need to use these as well for basic functions.

The user-supplied functions should use only the public BUFF API, rather
than any internal details or functions.  One thing which may not be
clear is that in the core BUFF functions, the BUFF pointer passed in
refers to the BUFF on which the operation will happen.  OTOH, in the
user-supplied code, the BUFF passed in is the next buffer down the
chain, not the current one.
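
For example, a minimal pass-through filter's write function (the name
passthru_write is hypothetical) would look like this:

    /* 'nextb' is the next BUFF down the stack, not this filter's own
     * BUFF; 'info' is this filter's user-supplied void * pointer.
     */
    static int passthru_write(BUFF *nextb, void *info, const char *buf,
                              int nbyte)
    {
        /* no transformation, so no copy: pass the block straight on */
        return bwrite(nextb, buf, nbyte);
    }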

*************************************************************************
		The API

	User-supplied structures

First, the bfilter structure is used in all filters:
    typedef struct bfilter {
      int (*writev)(BUFF *, void *, struct iovec *, int);
      int (*write)(BUFF *, void *, const char *, int);
      int (*read)(BUFF *, void *, char *, int);
      int (*flush)(BUFF *, void *, const char *, int, struct bfilter *);
      int (*transmitfile)(BUFF *, void *, file_info_ptr *);
      void (*close)(BUFF *, void *);
    } bfilter;

bfilters are placed into a BUFF structure along with a
user-supplied void * pointer.

Second, the following structure is for use with a filter which can
sit at the bottom of the stack:

    typedef struct {
      void *(*bgetfileinfo)(BUFF *, void *);
      void (*bpushfileinfo)(BUFF *, void *, void *);
    } bbottomfilter;


	BUFF API functions

The following functions are new BUFF API functions:

For filters:

BUFF * bcreatestack(pool *p, int flags, struct bfilter *,
                    struct bbottomfilter *, void *);
BUFF * bpushfilter (BUFF *, struct bfilter *, void *);
BUFF * bpushbuffer (BUFF *, BUFF *);
BUFF * bpopfilter(BUFF *);
BUFF * bpopbuffer(BUFF *);
void bclosestack(BUFF *);

For BUFFs in general:

int btransmitfile(BUFF *, file_info_ptr *);
int bsetstackopts(BUFF *, int, const void *);
int bsetstackflags(BUFF *, int, int);

Note that a new flag is needed for bsetstackflags:
B_MAXBUFFERING

The current bcreate should become

BUFF * bcreatebuffer (pool *p, int flags, struct bfilter *, void *);
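
As a sketch of how these calls compose (the filter and variable names --
socket_filter, socket_bottomfilter, chunk_* and so on -- are assumptions
for illustration only):

    bfilter chunk_filter = {
        chunk_writev,      /* writev */
        chunk_write,       /* write */
        NULL,              /* read: this sketch chunks output only */
        chunk_flush,       /* flush */
        NULL,              /* transmitfile */
        chunk_close        /* close */
    };

    BUFF *stack = bcreatestack(p, B_RDWR, &socket_filter,
                               &socket_bottomfilter, sock_info);
    if (stack != NULL)
        stack = bpushfilter(stack, &chunk_filter, NULL);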

*************************************************************************
		Detailed Description

	bfilter structure

The void * pointer used in all these functions, as well as those in the
bbottomfilter structure and the filter API functions, is always the same
pointer w/in an individual BUFF.

The first function in a bfilter structure is 'writev'; this is only
needed for high-efficiency writing, generally at the level of the system
interface.  In its absence, multiple writes will be done w/ 'write'.
Note that defining 'writev' means you must also define 'write'.

The second is 'write'; this is the generic writing function, taking a BUFF
* to which to write, a block of text, and the length of that block of
text.  The expected return is the number of characters (out of that block
of text) which were successfully processed (rather than the number of
characters actually written). 

The third is 'read'; this is the generic reading function, taking a BUFF *
from which to read data, a buffer in which to put the text, and the
number of characters to put in that buffer.  The expected return is the
number of characters placed in the buffer.

The fourth is 'flush'; this is intended to force the filter to spit out
any data it may have been saving, as well as to clear any data the
BUFF code was storing.  If the third argument is non-null, then it
contains more text to be processed; that text need not be null terminated,
since the fourth argument contains the length of the text to be processed.
The expected return value is the number of characters handled from the
third argument (0 if there are none), or -1 on error.  Finally, the
fifth argument is a pointer to the bfilter struct containing this
function, so that it may use the write or writev functions in it.  Note
that general buffering is handled by BUFF's internal code, so module
writers should not buffer data themselves for performance reasons.

The fifth is 'transmitfile', which takes as its arguments a buffer to
which to write (if non-null), the void * pointer containing configuration
(or other) information for this filter, and a system-dependent pointer
(the file_info_ptr structure will be defined on a per-system basis)
containing information required to print the 'file' in question.
This is intended to allow zero-copy TCP in Win32.

The sixth is 'close'; this is called when the connection is being
closed.  The 'close' should not be passed on to the next filter in the
stack.  Most filters will not need to use this, but if database handles
or other such objects are created, this is the point at which to remove
them.  Note that flush is called automatically before close.

	bbottomfilter Structure

The first function, bgetfileinfo, is designed to allow Apache to get
information from a BUFF struct regarding the input and output sources.
This is currently used to get the input file number to select on a
socket to see if there's data waiting to be read.  The information
returned is platform specific; the void * pointer passed in holds
the void * pointer passed to all user-supplied functions.

The second function, bpushfileinfo, is used to push file information
onto a buffer, so that the buffer can be fully constructed and ready
to handle data as soon as possible after a client has connected.
The first void * pointer holds platform specific information (in
Unix, it would be a pair of file descriptors); the second holds the
void * pointer passed to all user-supplied functions.
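
On Unix, a bottom filter's pair of functions might look like this sketch
(the fd-pair structure and the function names are assumptions):

    typedef struct {
        int fd_in;
        int fd_out;
    } unix_fd_pair;

    /* 'fileinfo' holds the platform-specific data being pushed;
     * 'info' is the filter's user-supplied void * pointer.
     */
    static void unix_pushfileinfo(BUFF *fb, void *fileinfo, void *info)
    {
        *(unix_fd_pair *) info = *(unix_fd_pair *) fileinfo;
    }

    /* hand the descriptors back, e.g. so the core can select() on input */
    static void *unix_getfileinfo(BUFF *fb, void *info)
    {
        return info;
    }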

[djg: I don't think I really agree with the distinction here between
the bottom and the other filters.  Take the select() example, it's
valid for any layer to define a fd that can be used for select...
in fact it's the topmost layer that should really get to make this
definition.  Or maybe I just have your top and bottom flipped.  In
any event I think this should be part of the filter structure and
not separate.]

	The BUFF structure

A couple of changes are needed for this structure: remove fd and
fd_in; add a bfilter structure; add a pointer to a bbottomfilter;
add three pointers to the next BUFFs: one for the next BUFF in the
stack, one for the next BUFF which implements write, and one
for the next BUFF which implements read.
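
One plausible shape for the revised structure, as a sketch (member names
beyond those already used in this document are assumptions):

    typedef struct buff_struct BUFF;
    struct buff_struct {
        int flags;
        unsigned char *outbase;     /* output buffer */
        int outcnt;                 /* bytes currently buffered */
        int bufsiz;
        long bytes_sent;
        /* ... remaining existing buffering members ... */
        bfilter bfilter;            /* this layer's filter functions */
        bbottomfilter *bbottom;     /* set only on the bottom BUFF */
        void *bfilter_info;         /* the user-supplied void * */
        BUFF *next;                 /* next BUFF in the stack */
        BUFF *next_writer;          /* next BUFF implementing write/writev */
        BUFF *next_reader;          /* next BUFF implementing read */
    };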


	Public functions in buff.c

BUFF * bpushfilter (BUFF *, struct bfilter *, void *);

This function adds the filter functions from bfilter, stacking them on
top of the BUFF.  It returns the new top BUFF, or NULL on error.

BUFF * bpushbuffer (BUFF *, BUFF *);

This function places the second buffer on the top of the stack that
the first one is on.  It returns the new top BUFF, or NULL on error.

BUFF * bpopfilter(BUFF *);
BUFF * bpopbuffer(BUFF *);

Detaches the top-most filter from the stack, and returns the new
top-level BUFF, or NULL on error or when there are no BUFFs
remaining.  The two are synonymous.

void bclosestack(BUFF *);

Closes the I/O stack, removing all the filters in it.

BUFF * bcreatestack(pool *p, int flags, struct bfilter *,
                    struct bbottomfilter *, void *);

This creates an I/O stack.  It returns NULL on error.

BUFF * bcreatebuffer(pool *p, int flags, struct bfilter *, void *);

This creates a BUFF for later use with bpushbuffer.  The BUFF is
not set up to be used as an I/O stack, however.  It returns NULL
on error.

int bsetstackopts(BUFF *, int, const void *);
int bsetstackflags(BUFF *, int, int);

These functions set options and flags, respectively, on all the BUFFs
in a stack.  The new flag, B_MAXBUFFERING, is used to disable a feature
described in the next section, whereby only the first and last
BUFFs will buffer data.

*************************************************************************
		Efficiency Considerations

	Buffering

All input and output is buffered by the standard buffering code.
People writing code to use buff.c should not concern themselves with
buffering for efficiency, and should not buffer except when necessary.

The write function will typically be called with large blocks of text;
the read function will attempt to place the specified number of bytes
into the buffer.

Dean noted that there are possible problems w/ multiple buffers;
further, some applications must not be buffered.  This can be
partially dealt with by turning off buffering, or by flushing the
data when appropriate.

However, some potential problems arise anyway.  The simplest example
involves shrinking transformations: suppose that you have a set
of filters, A, B, and C, such that A outputs less text than it
receives, as does B (say A strips comments, and B gzips the result).
Then after a write to A which fills A's buffer, A writes to B.
However, A won't write enough to fill B's buffer, so a memory copy
will be needed.  This continues till B's buffer fills up; then
B will write to C's buffer -- with the same effect.

[djg: I don't think this is the issue I was really worried about --
in the case of shrinking transformations you are already doing 
non-trivial amounts of CPU activity with the data, and there's
no copying of data that you can eliminate anyway.  I do recognize
that there are non-CPU intensive filters -- such as DMA-capable
hardware crypto cards.  I don't think they're hard to support in
a zero-copy manner though.]

The maximum additional number of bytes which will be copied in this
scenario is on the order of nk, where n is the total number of bytes,
and k is the number of filters doing shrinking transformations.

There are several possible solutions to this issue.  The first
is to turn off buffering in all but the first and last filters.
This reduces the number of unnecessary byte copies to at most one
per byte, though it means that the functions in the stack will get
called more frequently.  This is the default behavior, overridable
by setting B_MAXBUFFERING with bsetstackflags.  Most filters won't
involve a net shrinking transformation, so this will rarely be an
issue; and when the filters do involve a net shrinking transformation,
sending reasonably sized blocks is better for network efficiency,
so the default may be more efficient anyway.

A second solution is more general use of writev for communication
between different buffers.  This complicates the programming work,
however.


	Memory copies

Each write function is passed a pointer to constant text; if any changes
are being made to the text, it must be copied.  However, if no changes
are made to the text (or to some smaller part of it), then it may be
sent to the next filter without any additional copying.  This should
provide the minimal necessary memory copies.

[djg: Unfortunately this makes it hard to support page-flipping and
async i/o because you don't have any reference counts on the data.
But I go into a little detail on that already in docs/page_io.]

	Function chaining

In order to avoid unnecessary function chaining for reads and writes,
when a filter is pushed onto the stack, the buff.c code will determine
which is the next BUFF which contains a read or write function, and
reads and writes, respectively, will go directly to that BUFF.
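
A sketch of that computation as bpushfilter might perform it (the helper
name is an assumption):

    static BUFF *find_next_writer(BUFF *fb)
    {
        BUFF *b;

        for (b = fb->next; b != NULL; b = b->next) {
            if (b->bfilter.write != NULL || b->bfilter.writev != NULL)
                return b;
        }
        return NULL;    /* no writer below; the stack is incomplete */
    }

and similarly for next_reader; bwrite and bread then jump straight to
the BUFF these pointers reference.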

	writev

writev is a function for efficient writing to the system; in terms of
this API, however, it also works for dealing with multiple blocks of
text without doing unnecessary byte copies.  It is not required.

Currently, the system-level writev is used in two contexts: for
chunking, and when a block of text is written which, combined with
the text already in the buffer, would make the buffer overflow.

writev would be implemented both by the default bottom-level filter
and by the chunking filter for these operations.  In addition, writev
may be used, as noted above, to pass multiple blocks of text w/o
copying them into a single buffer.  Note that if the next filter does
not implement writev, however, this will be equivalent to repeated
calls to write, which may or may not be more efficient.  Up to
IOV_MAX-2 blocks of text may be passed along in this manner.  Unlike
the system writev call, the writev in this API should be called only
once, with an array of iovecs and a count of the number of
iovecs in it.

If a bfilter defines writev, writev will be called whether or not
NO_WRITEV is set; hence, it should deal with that case in a reasonable
manner.

[djg: We can't guarantee atomicity of writev() when we emulate it.
Probably not a problem, just an observation.]

*************************************************************************
		Code in buff.c

	Default Functions

The default actions are generally those currently performed by Apache,
save that they'll only attempt to write to a buffer, and they'll
return an error if there are no more buffers.  That is, you must implement
read, write, and flush in the bottom-most filter.

Except for close(), the default code will simply pass the function call
on to the next filter in the stack.  Some samples follow.

	Heuristics for writev

Currently, we call writev for chunking, and when we get enough data
that the total would overflow the buffer.  Since chunking is going to
become a filter, the chunking filter will use writev; in addition, bwrite
will trigger bwritev as shown (note that system-specific information
should be kept at the filter level):

in bwrite:

    if (fb->outcnt > 0 && nbyte + fb->outcnt >= fb->bufsiz) {
        /* build iovec structs: the buffered data, then the new block */
        struct iovec vec[2];
        vec[0].iov_base = (void *) fb->outbase;
        vec[0].iov_len = fb->outcnt;
        fb->outcnt = 0;
        vec[1].iov_base = (void *) buff;
        vec[1].iov_len = nbyte;
        return bwritev(fb, vec, 2);
    } else if (nbyte >= fb->bufsiz) {
        return write_with_errors(fb, buff, nbyte);
    }

Note that the code above takes the place of large_write (as well
as taking code from it).

So, bwritev would look something like this (copying and pasting freely
from the current source for writev_it_all, which could be replaced):

-----
int bwritev (BUFF *fb, struct iovec *vec, int nvecs) {
    if (!fb)
        return -1; /* the bottom-level filter implemented neither write
                    * nor writev. */
    if (fb->bfilter.writev) {
        return fb->bfilter.writev(fb->next, fb->bfilter_info, vec, nvecs);
    } else if (fb->bfilter.write) {
        /* while it's nice and easy to build the vector and crud, it's painful
         * to deal with partial writes (esp. w/ the vector)
         */
        int i = 0, rv;
        while (i < nvecs) {
            do {
                rv = fb->bfilter.write(fb->next, fb->bfilter_info,
                                       vec[i].iov_base, vec[i].iov_len);
            } while (rv == -1 && (errno == EINTR || errno == EAGAIN)
                     && !(fb->flags & B_EOUT));
            if (rv == -1) {
                if (errno != EINTR && errno != EAGAIN) {
                    doerror(fb, B_WR);
                }
                return -1;
            }
            fb->bytes_sent += rv;
            /* recalculate vec to deal with partial writes */
            while (rv > 0) {
                if (rv < vec[i].iov_len) {
                    vec[i].iov_base = (char *) vec[i].iov_base + rv;
                    vec[i].iov_len -= rv;
                    rv = 0;
                } else {
                    rv -= vec[i].iov_len;
                    ++i;
                }
            }
            if (fb->flags & B_EOUT)
                return -1;
        }
        /* if we got here, we wrote it all */
        return 0;
    } else {
        return bwritev(fb->next, vec, nvecs);
    }
}
-----
The default filter's writev function will look pretty much like
writev_it_all.


	Writing

The general case for writing data is significantly simpler with this
model.  Because special cases are not dealt with in the BUFF core,
a single internal interface to writing data is possible; I'm going
to assume it's reasonable to standardize on write_with_errors, but
some other function may be more appropriate.

In the revised bwrite (which I'll omit for brevity), the following
must be done:
	check for error conditions
	check to see if any buffering is done; if not, send the data
		directly to the write_with_errors function
	check to see if we should use writev or write_with_errors
		as above
	copy the data to the buffer (we know it fits since we didn't
		need writev or write_with_errors)

The other work the current bwrite is doing is
	ifdef'ing around NO_WRITEV
	numerous decisions regarding whether or not to send chunks

Generally, buff.c has a number of functions whose entire purpose is
to handle particular special cases wrt chunking, all of which could
be simplified with a chunking filter.
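
For instance, the chunking filter's write function could use writev to
add the chunk header and trailer without copying the caller's data (a
sketch; the names are assumptions, and the zero-length last chunk would
be emitted from flush/close):

-----
static int chunk_write(BUFF *nextb, void *info, const char *buf, int nbyte)
{
    char hdr[16];
    struct iovec vec[3];

    if (nbyte == 0)
        return 0;                 /* a zero-length chunk would mean EOF */
    vec[0].iov_base = hdr;
    vec[0].iov_len = ap_snprintf(hdr, sizeof(hdr), "%x\r\n", nbyte);
    vec[1].iov_base = (void *) buf;
    vec[1].iov_len = nbyte;
    vec[2].iov_base = "\r\n";
    vec[2].iov_len = 2;
    if (bwritev(nextb, vec, 3) < 0)
        return -1;
    return nbyte;                 /* characters consumed from buf */
}
-----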

write_with_errors would not need to change; buff_write would.  Here
is a new version of it:

-----
/* the lowest level writing primitive */
static ap_inline int buff_write(BUFF *fb, const void *buf, int nbyte)
{
    if (fb->bfilter.write)
        return fb->bfilter.write(fb->next_writer, fb->bfilter_info, buf, nbyte);
    else
        return bwrite(fb->next_writer, buf, nbyte);
}
-----

If the btransmitfile function is called on a buffer which doesn't implement
it, the system will attempt to read data from the file identified
by the file_info_ptr structure and use other methods to write to it.
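
A sketch of that fallback (read_from_file_info is an assumed helper over
the system-dependent file_info_ptr):

-----
int btransmitfile(BUFF *fb, file_info_ptr *fi)
{
    char block[IOBUFSIZE];
    int n;

    if (fb->bfilter.transmitfile)
        return fb->bfilter.transmitfile(fb->next, fb->bfilter_info, fi);
    /* no transmitfile in this filter: fall back to the write path */
    while ((n = read_from_file_info(fi, block, sizeof(block))) > 0) {
        if (bwrite(fb, block, n) != n)
            return -1;
    }
    return n;    /* 0 on EOF, -1 on read error */
}
-----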

	Reading

One of the basic reading functions in Apache 1.3b3 is buff_read;
here is how it would look within this spec:

-----
/* the lowest level reading primitive */
static ap_inline int buff_read(BUFF *fb, void *buf, int nbyte)
{
    if (!fb)
        return -1; /* the bottom level filter is not set up properly */

    if (fb->bfilter.read)
        return fb->bfilter.read(fb->next_reader, fb->bfilter_info, buf, nbyte);
    else
        return bread(fb->next_reader, buf, nbyte);
}
-----
The code currently in buff_read would become part of the default
filter.


	Flushing data

flush will get passed on down the stack automatically, via recursive
calls to bflush.  The user-supplied flush function will be called at
that point, and also before close is called.  The user-supplied flush
should not call flush on the next buffer.
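
For a filter with no private state, the user-supplied flush reduces to
this sketch (hypothetical name; note it writes through its own bfilter
rather than flushing the next buffer):

-----
static int passthru_flush(BUFF *nextb, void *info, const char *buf,
                          int nbyte, struct bfilter *self)
{
    if (buf == NULL || nbyte == 0)
        return 0;                    /* no trailing text to process */
    return self->write(nextb, info, buf, nbyte);
}
-----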

[djg: Poorly written "expanding" filters can cause some nastiness
here.  In order to flush a layer you have to write out your current
buffer, and that may cause the layer below to overflow a buffer and
flush it.  If the filter is expanding then it may have to add more to
the buffer before flushing it to the layer below.  It's possible that
the layer below will end up having to flush twice.  It's a case where
writev-like capabilities are useful.]

	Closing Stacks and Filters

When a filter is removed from the stack, flush will be called then close
will be called.  When the entire stack is being closed, this operation
will be done automatically on each filter within the stack; generally,
filters should not operate on other filters further down the stack,
except to pass data along when flush is called.

	Flags and Options

Changes to flags and options using the current functions only affect
one buffer.  To affect all the buffers on down the chain, use
bsetstackopts or bsetstackflags.

bgetopt is currently only used to grab a count of the bytes sent;
it will continue to provide that functionality.  bgetflags is
used to provide information on whether or not the connection is
still open; it'll continue to provide that functionality as well.

The core BUFF operations will remain, though some operations which
are done via flags and options will be done by attaching appropriate
filters instead (eg. chunking).

[djg: I'd like to consider filesystem metadata as well -- we only need
a few bits of metadata to do HTTP: file size and last modified.  We
need an etag generation function, it is specific to the filters in
use.  You see, I'm envisioning a bottom layer which pulls data out of
a database rather than reading from a file.]

-------

This file is there so that I do not have to remind myself
about the reasons for Layered IO, apart from the obvious one.

0. To get away from a 1 to 1 mapping

   i.e. a single URI can cause multiple backend requests, 
   in arbitrary configurations, such as in parallel, tunnel/piped, 
   or in some sort of funnel mode.  Such multiple backend
   requests, with fully layered IO, can be treated exactly
   like any URI request; and recursion is born :-)

1. To do on the fly charset conversion

   Be able, theoretically, to send out your content using
   latin1, latin2 or any other charset; generated from static
   _and_ dynamic content in other charsets (typically unicode
   encoded as UTF7 or UTF8).  Such conversion is prompted by
   things like the user-agent string, a cookie, or other hints
   about the capabilities of the OS, language preferences and
   other (in)capabilities of the final recipient. 

2. To be able to do fancy templates

   Have your application/cgi send out an XML structure of
   field/value-paired contents, which is substituted into a 
   template by the web server, possibly based on information 
   accessible/known to the web server which you do not want to 
   be known to the backend script.  Ideally that template would
   be just as easy to generate by a backend as well (see 0).

3. On the fly translation

   And other general text and output munging, such as translating
   an English page into Spanish as it goes through your proxy,
   or JPEG-ing a GIF generated by mod_perl+gd.

Dw.
---------

From dgaudet@arctic.org Fri Feb 20 00:36:52 1998
Date: Fri, 20 Feb 1998 00:35:37 -0800 (PST)
From: Dean Gaudet <dgaudet@arctic.org>
To: new-httpd@apache.org
Subject: page-based i/o
X-Comment: Visit http://www.arctic.org/~dgaudet/legal for information regarding copyright and disclaimer.
Reply-To: new-httpd@apache.org

Ed asked me for more details on what I mean when I talk about "paged based
zero copy i/o". 

While writing mod_mmap_static I was thinking about the primitives that the
core requires of the filesystem.  What exactly is it that ties us into the
filesystem?  and how would we abstract it?  The metadata (last modified
time, file length) is actually pretty easy to abstract.  It's also easy to
define an "index" function so that MultiViews and such can be implemented. 
And with layered I/O we can hide the actual details of how you access
these "virtual" files. 

But therein lies an inefficiency.  If we had only bread() for reading
virtual files, then we would enforce at least one copy of the data. 
bread() supplies the place that the caller wants to see the data, and so
the bread() code has to copy it.  But there's very little reason that
bread() callers have to supply the buffer... bread() itself could supply
the buffer.  Call this new interface page_read().  It looks something like
this:

    typedef struct {
	const void *data;
	size_t data_len; /* amt of data on page which is valid */
	... other stuff necessary for managing the page pool ...
    } a_page_head;

    /* returns NULL if an error or EOF occurs, on EOF errno will be
     * set to 0
     */
    a_page_head *page_read(BUFF *fb);

    /* queues entire page for writing, returns 0 on success, -1 on
     * error
     */
    int page_write(BUFF *fb, a_page_head *);

It's very important that a_page_head structures point to the data page
rather than be part of the data page.  This way we can build a_page_head
structures which refer to parts of mmap()d memory.
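
As a usage sketch (fb_in and fb_out are hypothetical BUFFs), a
zero-copy pump over this interface would be:

    a_page_head *pg;

    while ((pg = page_read(fb_in)) != NULL) {
        if (page_write(fb_out, pg) == -1)
            break;                  /* write error */
    }
    /* page_read returned NULL: errno == 0 means EOF, else an error */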

This stuff is a little more tricky to do, but is a big win for performance.
With this integrated into our layered I/O it means that we can have
zero-copy performance while still getting the advantages of layering.

But note I'm glossing over a bunch of details... like the fact that we
have to decide if a_page_heads are shared data, and hence need reference
counting (i.e. I said "queues for writing" up there, which means some
bit of the a_page_head data has to be kept until its actually written).
Similarly for the page data.

There are other tricks in this area that we can take advantage of --
like interprocess communication on architectures that do page flipping.
On these boxes if you write() something that's page-aligned and page-sized
to a pipe or unix socket, and the other end read()s into a page-aligned
page-sized buffer then the kernel can get away without copying any data.
It just marks the two pages as shared copy-on-write, and only when
they're written to will the copy be made.  So to make this work, your
writer uses a ring of 2+ page-aligned/sized buffers so that it's not
writing on something the reader is still reading.

Dean

----

For details on HPUX and avoiding extra data copies, see
<ftp://ftp.cup.hp.com/dist/networking/briefs/copyavoid.pdf>.

(note that if you get the postscript version instead, you have to 
manually edit it to remove the front page before any version of 
ghostscript that I have used will read it)

----

I've been told by an engineer in Sun's TCP/IP group that zero-copy TCP
in Solaris 2.6 occurs when:

    - you've got the right interface card (OC-12 ATM card I think)
    - you use write()
    - your write buffer is 16k aligned and a multiple of 16k in size

We currently get the 16k stuff for free by using mmap().  But sun's
current code isn't smart enough to deal with our initial writev()
of the headers and first part of the response.

----

Some systems have a system call to efficiently send the contents of a 
descriptor across the network.  This is probably the single best way
to do static content on systems that support it.

HPUX: (10.30 and on)

      ssize_t sendfile(int s, int fd, off_t offset, size_t nbytes,
              const struct iovec *hdtrl, int flags);

      (allows you to add headers and trailers in the form of iovec
      structs)  Marc has a man page; ask if you want a copy.  Not included
      due to copyright issues.  man page also available from 
      http://docs.hp.com/ (in particular, 
      http://docs.hp.com:80/dynaweb/hpux11/hpuxen1a/rvl3en1a/@Generic__BookTextView/59894;td=3 )

Windows NT:

	BOOL TransmitFile(     SOCKET hSocket, 
	    HANDLE hFile, 
	    DWORD nNumberOfBytesToWrite, 
	    DWORD nNumberOfBytesPerSend, 
	    LPOVERLAPPED lpOverlapped, 
	    LPTRANSMIT_FILE_BUFFERS lpTransmitBuffers, 
	    DWORD dwFlags 
	   ); 

	(does it start from the current position in the handle?  I would
	hope so, or else it is pretty dumb.)

	lpTransmitBuffers allows for headers and trailers.

	Documentation at:

	http://premium.microsoft.com/msdn/library/sdkdoc/wsapiref_3pwy.htm
	http://premium.microsoft.com/msdn/library/conf/html/sa8ff.htm

	Even less related to page based IO: just context switching:
	AcceptEx does an accept(), and returns the start of the
	input data.  see:

	http://premium.microsoft.com/msdn/library/sdkdoc/pdnds/sock2/wsapiref_17jm.htm

	What this means is you require one less syscall to do a
	typical request, especially if you have a cache of handles
	so you don't have to do an open or close.  Hmm.  Interesting
	question: then, if TransmitFile starts from the current
	position, you need a mutex around the seek and the
	TransmitFile.  If not, you are just limited (eg. byte
	ranges) in what you can use it for.

	Also note that TransmitFile can specify TF_REUSE_SOCKET, so that
	after use the same socket handle can be passed to AcceptEx.  
	Obviously only good where we don't have a persistent connection
	to worry about.

----

Note that all this is shot to bloody hell by HTTP-NG's multiplexing.
If fragment sizes are big enough, it could still be worthwhile to
do copy avoidance.  It also causes performance issues because of
its credit system, which limits how much you can write in a single
chunk.

Don't tell me that if HTTP-NG becomes popular we will see vendors 
embedding SMUX (or whatever multiplexing is used) in the kernel to
get around this stuff.  There we go, Apache with a loadable kernel
module.  

----

Larry McVoy's document for SGI regarding sendfile/TransmitFile:
ftp://ftp.bitmover.com/pub/splice.ps.gz
From dgaudet@arctic.org Sun Jun 20 11:07:58 1999
From: dgaudet@arctic.org (Dean Gaudet)
Newsgroups: en.lists.apache-new-httpd
Subject: mpm update
Date: 19 Jun 1999 07:17:00 +0200
Reply-To: new-httpd@apache.org

I imported mpm-3 into the apache-2.0 repository (tag mpm-3 if you want
it).

Then I threw in a bunch of my recent email ramblings, because I'm getting
tired of repeating them, mostly off-list to folks who ask "why doesn't
apache do XYZ?"  I intend to be more proactive in this area, because it
can only help. 

Then I ripped up BUFF and broke lots of stuff and put in a first crack at
layering.  Info on that below.

If you check out the tree, and build it (using Configuration.mpm) you
should be able to serve up the top page of the manual, that's all I've
tested so far ;) 

Dean

goals? we need an i/o abstraction which has these properties:

- buffered and non-buffered modes

    The buffered mode should look like FILE *.

    The non-buffered mode should look more like read(2)/write(2).

- blocking and non-blocking modes

    The blocking mode is the "easy" mode -- it's what most module writers
    will see.  The non-blocking mode is the "hard" mode, this is where
    module writers wanting to squeeze out some speed will have to play.
    In order to build async/sync hybrid models we need the
    non-blocking i/o abstraction.

- timed reads and writes (for blocking cases)

    This is part of my jihad against asynchronous notification.

- i/o filtering or layering

    Yet another Holy Grail of computing.  But I digress.  These are
    hard when you take into consideration non-blocking i/o -- you have
    to keep lots of state.  I expect our core filters will all support
    non-blocking i/o, well at least the ones I need to make sure we kick
    ass on benchmarks.  A filter can deny a switch to non-blocking mode,
    the server will have to recover gracefully (ha).

- copy-avoidance

    Hey what about zero copy a la IO-Lite?  After having experienced it
    in a production setting I'm no longer convinced of its benefits.
    There is an enormous amount of overhead keeping lists of buffers,
    and reference counts, and cleanup functions, and such which requires
    a lot of tuning to get right.  I think there may be something here,
    but it's not a cakewalk.

    What I do know is that the heuristics I put into apache-1.3 to choose
    writev() at times are almost as good as what you can get from doing
    full zero-copy in the cases we *currently* care about.  To put it
    another way, let's wait another generation to deal with zero copy.

    But sendfile/transmitfile/etc. those are still interesting.

    So instead of listing "zero copy" as a property, I'll list
    "copy-avoidance".

So far?

- ap_bungetc added
- ap_blookc changed to return the character, rather than take a char *buff
- in theory, errno is always useful on return from a BUFF routine
- ap_bhalfduplex, B_SAFEREAD will be re-implemented using a layer I think
- chunking gone for now, will return as a layer
- ebcdic gone for now... it should be a layer

- ap_iol.h defined, first crack at the layers...

    Step back a second to think on it.  Much like we have fread(3)
    and read(2), I've got a BUFF and an ap_iol abstraction.  An ap_iol
    could use a BUFF if it requires some form of buffering, but many
    won't require buffering... or can do a better job themselves.

    Consider filters such as:
	- ebcdic -> ascii
	- encryption
	- compression
    These all share the property that no matter what, they're going to make
    an extra copy of the data.  In some cases they can do it in place (read)
    or into a fixed buffer... in most cases their buffering requirements
    are different than what BUFF offers.

    Consider a filter such as chunking.  This could actually use the writev
    method to get its job done... depends on the chunks being used.  This
    is where zero-copy would be really nice, but we can get by with a few
    heuristics.

    At any rate -- the NSPR folks didn't see any reason to include a
    buffered i/o abstraction on top of their layered i/o abstraction... so
    I feel like I'm not the only one who's thinking this way.

- iol_unix.c implemented... should hold us for a bit




From dgaudet@arctic.org Mon Jun 28 19:06:50 1999
From: dgaudet@arctic.org (Dean Gaudet)
Newsgroups: en.lists.apache-new-httpd
Subject: Re: async routines
Date: 28 Jun 1999 17:33:24 +0200
Reply-To: new-httpd@apache.org

[hope you don't mind me cc'ing new-httpd zach, I think others will be
interested.]

On Mon, 28 Jun 1999, Zach Brown wrote:

> so dean, I was wading through the mpm code to see if I could munge the
> sigwait stuff into it.
> 
> as far as I could tell, the http protocol routines are still blocking.
> what does the future hold in the way of async routines? :)  I basically
> need a way to do something like..

You're still waiting for me to get the async stuff in there... I've done
part of the work -- the BUFF layer now supports non-blocking sockets. 

However, the HTTP code will always remain blocking.  There's no way I'm
going to try to educate the world in how to write async code... and since
our HTTP code has arbitrary call outs to third party modules... It'd
have a drastic effect on everyone to make this change.

But I honestly don't think this is a problem.  Here's my observations:

All the popular HTTP clients send their requests in one packet (or two
in the case of a POST and netscape).  So the HTTP code would almost
never have to block while processing the request.  It may block while
processing a POST -- something which someone else can worry about later,
my code won't be any worse than what we already have in apache.  So
any effort we put into making the HTTP parsing code async-safe would
be wasted on the 99.9% case.

Most responses fit in the socket's send buffer, and again don't require
async support.  But we currently do the lingering_close() routine which
could easily use async support.  Large responses also could use async
support.

The goal of HTTP parsing is to figure out which response object to
send.  In most cases we can reduce that to a bunch of common response
types:

- copying a file to the socket
- copying a pipe/socket to the socket  (IPC, CGIs)
- copying a mem region to the socket (mmap, some dynamic responses)

So what we do is we modify the response handlers only.  We teach them
about how to send async responses.

There will be a few new primitives which will tell the core "the response
fits one of these categories, please handle it".  The core will do the
rest -- and for MPMs which support async handling, the core will return
to the MPM and let the MPM do the work async...  the MPM will call a
completion function supplied by the core.  (Note that this will simplify
things for lots of folks... for example, it'll let us move range request
handling to a common spot so that more than just default_handler
can support it.)

I expect this to be a simple message passing protocol (pass by reference).
Well rather, that's how I expect to implement it in ASH -- where I'll
have a single thread per-process doing the select/poll stuff; and the
other threads are in a pool that handles the protocol stuff.  For your
stuff you may want to do it another way -- but we'll be using a common
structure that the core knows about... and that structure will look like
a message:

    struct msg {
	enum {
	    MSG_SEND_FILE,
	    MSG_SEND_PIPE,
	    MSG_SEND_MEM,
	    MSG_LINGERING_CLOSE,
	    MSG_WAIT_FOR_READ,	/* for handling keep-alives */
	    ...
	} type;
	BUFF *client;
	void (*completion)(struct msg *, int status);
	union {
	    ... extra data here for whichever types need it ...;
	} x;
    };

The nice thing about this is that these operations are protocol
independent... at this level there's no knowledge of HTTP, so the same
MPM core could be used to implement other protocols.
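
As an illustration, queueing a file response might look like this sketch
(new_msg, mpm_queue_msg, and http_response_done are assumed names; the
per-type data would live in the union):

    struct msg *m = new_msg();
    m->type = MSG_SEND_FILE;
    m->client = r->connection->client;     /* the client's BUFF */
    m->completion = http_response_done;    /* core's completion callback */
    /* file details go in m->x, as defined for MSG_SEND_FILE */
    mpm_queue_msg(m);                      /* the MPM completes it async */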

> so as I was thinking about this stuff, I realized it might be neat to have
> 'classes' of non-blocking pending work and have different threads with
> different priorities hacking on it.  Say we have a very high priority
> thread that accepts connections, does initial header parsing, and
> sendfile()ing data out.  We could have lower priority threads that are
> spinning doing 'harder' BUFF work like an encryption layer or gzipping
> content, whatever.

You should be able to implement this in your MPM easily I think... because
you'll see the different message types and can distribute them as needed.

Dean

