ossp-pkg/sio/BRAINSTORM/doc_stacked_io.txt 1.1
[djg: comments like this are from dean]

This past summer, Alexei and I wrote a spec for an I/O Filters API... 
this proposal addresses one part of that -- 'stacked' I/O with buff.c. 

We have a couple of options for stacked I/O: we can either use existing
code, such as sfio, or we can rewrite buff.c to do it.  We've gone over
the first possibility at length, though, and there were problems with each
implementation that was mentioned (licensing and compatibility,
specifically); so far as I know, those remain issues. 

Btw -- sfio will be supported w/in this model... it just wouldn't be the
basis for the model's implementation. 

     -- Ed Korthof        |  Web Server Engineer --
     -- ed@organic.com    |  Organic Online, Inc --
     -- (415) 278-5676    |  Fax: (415) 284-6891 --

---------------------------------------------------------------------------
Stacked I/O With BUFFs
	Sections:

	1.) Overview
	2.) The API
		User-supplied structures
		API functions
	3.) Detailed Description
		The bfilter structure
		The bbottomfilter structure
		The BUFF structure
		Public functions in buff.c
	4.) Efficiency Considerations
		Buffering
		Memory copies
		Function chaining
		writev
	5.) Code in buff.c
		Default Functions
		Heuristics for writev
		Writing
		Reading
		Flushing data
		Closing stacks and filters
		Flags and Options

*************************************************************************
		Overview

The intention of this API is to make Apache's BUFF structure modular
while retaining high efficiency.  Basically, it involves rewriting
buff.c to provide 'stacked' I/O -- where the data passed through a
series of 'filters', which may modify it.

There are two parts to this: the core code for BUFF structures, and the
"filters" used to implement new behavior.  "Filter" refers both to the
set of six functions shown in the bfilter structure in the next section
and to a BUFF created using a specific bfilter.  These will also
occasionally be referred to as "user-supplied", though the Apache core
will need to use them as well for basic functions.

The user-supplied functions should use only the public BUFF API, rather
than any internal details or functions.  One thing which may not be
clear is that in the core BUFF functions, the BUFF pointer passed in
refers to the BUFF on which the operation will happen.  OTOH, in the
user-supplied code, the BUFF passed in is the next buffer down the
chain, not the current one.

*************************************************************************
		The API

	User-supplied structures

First, the bfilter structure is used in all filters:
    typedef struct bfilter {
      int (*writev)(BUFF *, void *, struct iovec *, int);
      int (*write)(BUFF *, void *, const char *, int);
      int (*read)(BUFF *, void *, char *, int);
      int (*flush)(BUFF *, void *, const char *, int, struct bfilter *);
      int (*transmitfile)(BUFF *, void *, file_info_ptr *);
      void (*close)(BUFF *, void *);
    } bfilter;

bfilters are placed into a BUFF structure along with a
user-supplied void * pointer.
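
As an illustration of the calling convention, here is a minimal sketch
of the 'write' member of a filter that upper-cases text.  The BUFF and
bwrite shown are cut-down stand-ins invented for this example (an
in-memory sink), and upcase_write is a hypothetical name; none of this
is the real buff.c code.  Because the incoming text is const, the
filter copies it into a scratch buffer before changing it, then hands
the result to the BUFF below it.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for BUFF (assumption): an in-memory sink, not the real struct. */
typedef struct BUFF {
    char data[256];
    int  len;
} BUFF;

/* Stand-in for the public bwrite(): append to the sink and return the
 * number of bytes accepted, or -1 on error. */
static int bwrite(BUFF *fb, const void *buf, int nbyte)
{
    if (fb == NULL || nbyte < 0)
        return -1;
    if (nbyte > (int)sizeof(fb->data) - fb->len)
        nbyte = (int)sizeof(fb->data) - fb->len;
    memcpy(fb->data + fb->len, buf, (size_t)nbyte);
    fb->len += nbyte;
    return nbyte;
}

/* 'write' member of a hypothetical upper-casing filter.  'next' is the
 * BUFF below this filter; 'ctx' is the filter's void * pointer, which
 * this filter does not need. */
static int upcase_write(BUFF *next, void *ctx, const char *text, int nbyte)
{
    char tmp[64];
    int done = 0;

    (void)ctx;
    while (done < nbyte) {
        int n = nbyte - done, i, rv;

        if (n > (int)sizeof(tmp))
            n = (int)sizeof(tmp);
        /* the input is const, so modify a copy */
        for (i = 0; i < n; i++)
            tmp[i] = (char)toupper((unsigned char)text[done + i]);
        rv = bwrite(next, tmp, n);
        if (rv < 0)
            return done > 0 ? done : -1;
        done += rv;
        if (rv < n)
            break;              /* the layer below took only part of it */
    }
    return done;                /* input characters processed */
}
```

The return value counts input characters processed, which is what the
description of 'write' in the next section asks for.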

Second, the following structure is for use with a filter which can
sit at the bottom of the stack:

    typedef struct bbottomfilter {
      void *(*bgetfileinfo)(BUFF *, void *);
      void (*bpushfileinfo)(BUFF *, void *, void *);
    } bbottomfilter;


	BUFF API functions

The following functions are new BUFF API functions:

For filters:

BUFF * bcreatestack(pool *p, int flags, struct bfilter *,
                    struct bbottomfilter *, void *);
BUFF * bpushfilter (BUFF *, struct bfilter *, void *);
BUFF * bpushbuffer (BUFF *, BUFF *);
BUFF * bpopfilter(BUFF *);
BUFF * bpopbuffer(BUFF *);
void bclosestack(BUFF *);

For BUFFs in general:

int btransmitfile(BUFF *, file_info_ptr *);
int bsetstackopts(BUFF *, int, const void *);
int bsetstackflags(BUFF *, int, int);

Note that a new flag is needed for bsetstackflags:
B_MAXBUFFERING

The current bcreate should become

BUFF * bcreatebuffer (pool *p, int flags, struct bfilter *, void *);

*************************************************************************
		Detailed Explanation

	bfilter structure

The void * pointer used in all these functions, as well as those in the
bbottomfilter structure and the filter API functions, is always the same
pointer w/in an individual BUFF.

The first function in a bfilter structure is 'writev'; this is only
needed for high efficiency writing, generally at the level of the system
interface.  In its absence, multiple writes will be done w/ 'write'.
Note that defining 'writev' means you must define 'write'.

The second is 'write'; this is the generic writing function, taking a BUFF
* to which to write, a block of text, and the length of that block of
text.  The expected return is the number of characters (out of that block
of text) which were successfully processed (rather than the number of
characters actually written). 

The third is 'read'; this is the generic reading function, taking a BUFF *
from which to read data, and a void * buffer in which to put text, and the
number of characters to put in that buffer.  The expected return is the
number of characters placed in the buffer.

The fourth is 'flush'; this is intended to force the buffer to spit out
any data it may have been saving, as well as to clear any data the
BUFF code was storing.  If the third argument is non-null, then it
contains more text to be printed; that text need not be null terminated,
but the fourth argument contains the length of text to be processed.  The
expected return value should be the number of characters handled out
from the third argument (0 if there are none), or -1 on error.  Finally,
the fifth argument is a pointer to the bfilter struct containing this
function, so that it may use the write or writev functions in it.   Note
that general buffering is handled by BUFF's internal code, and module
writers should not store data for performance reasons.

The fifth is 'transmitfile', which takes as its arguments a buffer to
which to write (if non-null), the void * pointer containing configuration
(or other) information for this filter, and a system-dependent pointer
(the file_info_ptr structure will be defined on a per-system basis)
containing information required to print the 'file' in question.
This is intended to allow zero-copy TCP in Win32.

The sixth is 'close'; this is what is called when the connection is being
closed.  The 'close' should not be passed on to the next filter in the
stack.  Most filters will not need to use this, but if database handles
or some other object is created, this is the point at which to remove it.
Note that flush is called automatically before this.

	bbottomfilter Structure

The first function, bgetfileinfo, is designed to allow Apache to get
information from a BUFF struct regarding the input and output sources.
This is currently used to get the input file number to select on a
socket to see if there's data waiting to be read.  The information
returned is platform specific; the void * pointer passed in holds
the void * pointer passed to all user-supplied functions.

The second function, bpushfileinfo, is used to push file information
onto a buffer, so that the buffer can be fully constructed and ready
to handle data as soon as possible after a client has connected.
The first void * pointer holds platform specific information (in
Unix, it would be a pair of file descriptors); the second holds the
void * pointer passed to all user-supplied functions.

[djg: I don't think I really agree with the distinction here between
the bottom and the other filters.  Take the select() example, it's
valid for any layer to define a fd that can be used for select...
in fact it's the topmost layer that should really get to make this
definition.  Or maybe I just have your top and bottom flipped.  In
any event I think this should be part of the filter structure and
not separate.]

	The BUFF structure

A couple of changes are needed for this structure: remove fd and
fd_in; add a bfilter structure; add a pointer to a bbottomfilter;
add three pointers to the next BUFFs: one for the next BUFF in the
stack, one for the next BUFF which implements write, and one
for the next BUFF which implements read.


	Public functions in buff.c

BUFF * bpushfilter (BUFF *, struct bfilter *, void *);

This function adds the filter functions from bfilter, stacking them on
top of the BUFF.  It returns the new top BUFF, or NULL on error.

BUFF * bpushbuffer (BUFF *, BUFF *);

This function places the second buffer on the top of the stack that
the first one is on.  It returns the new top BUFF, or NULL on error.

BUFF * bpopfilter(BUFF *);
BUFF * bpopbuffer(BUFF *);

Detaches the top-most filter from the stack, and returns the new
top-level BUFF, or NULL on error or when there are no BUFFs
remaining.  The two are synonymous.

void bclosestack(BUFF *);

Closes the I/O stack, removing all the filters in it.

BUFF * bcreatestack(pool *p, int flags, struct bfilter *,
                    struct bbottomfilter *, void *);

This creates an I/O stack.  It returns NULL on error.

BUFF * bcreatebuffer(pool *p, int flags, struct bfilter *, void *);

This creates a BUFF for later use with bpushbuffer.  The BUFF is
not set up to be used as an I/O stack, however.  It returns NULL
on error.

int bsetstackopts(BUFF *, int, const void *);
int bsetstackflags(BUFF *, int, int);

These functions set options and flags, respectively, on all the BUFFs
in a stack.  The new flag, B_MAXBUFFERING, is used to disable a feature
described in the next section, whereby only the first and last
BUFFs will buffer data.

*************************************************************************
		Efficiency Considerations

	Buffering

All input and output is buffered by the standard buffering code.
People writing code to use buff.c should not concern themselves with
buffering for efficiency, and should not buffer except when necessary.

The write function will typically be called with large blocks of text;
the read function will attempt to place the specified number of bytes
into the buffer.

Dean noted that there are possible problems w/ multiple buffers;
further, some applications must not be buffered.  This can be
partially dealt with by turning off buffering, or by flushing the
data when appropriate.

However, some potential problems arise anyway.  The simplest example
involves shrinking transformations; suppose that you have a set
of filters, A, B, and C, such that A outputs less text than it
receives, as does B (say A strips comments, and B gzips the result).
Then after a write to A which fills the buffer, A writes to B.
However, A won't write enough to fill B's buffer, so a memory copy
will be needed.  This continues till B's buffer fills up, then
B will write to C's buffer -- with the same effect.

[djg: I don't think this is the issue I was really worried about --
in the case of shrinking transformations you are already doing 
non-trivial amounts of CPU activity with the data, and there's
no copying of data that you can eliminate anyway.  I do recognize
that there are non-CPU intensive filters -- such as DMA-capable
hardware crypto cards.  I don't think they're hard to support in
a zero-copy manner though.]

The maximum additional number of bytes which will be copied in this
scenario is on the order of nk, where n is the total number of bytes,
and k is the number of filters doing shrinking transformations.

There are several possible solutions to this issue.  The first
is to turn off buffering in all but the first filter and the
last filter.  This reduces the number of unnecessary byte copies
to at most one per byte; however, it means that the functions in
the stack will get called more frequently.  This is the default
behavior, overridable by setting B_MAXBUFFERING with
bsetstackflags.  Most filters won't involve a net shrinking
transformation, so even this will rarely be an issue; however,
if the filters do involve a net shrinking transformation, for
the sake of network efficiency (sending reasonably sized blocks),
it may be more efficient anyway.

A second solution is more general use of writev for communication
between different buffers.  This complicates the programming work,
however.


	Memory copies

Each write function is passed a pointer to constant text; if any changes
are being made to the text, it must be copied.  However, if no changes
are made to the text (or to some smaller part of it), then it may be
sent to the next filter without any additional copying.  This should
provide the minimal necessary memory copies.
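
A rough sketch of the pass-through idea follows, assuming a
hypothetical filter that replaces tabs with spaces.  Runs of text
without tabs are forwarded straight from the caller's block; only the
replacement character comes from elsewhere.  The BUFF here is an
invented stand-in that records the last pointer it was handed, so the
absence of a copy can be observed; detab_write is not part of any real
API.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in sink (assumption) that remembers the last pointer it saw. */
typedef struct BUFF {
    const void *last_ptr;
    char data[128];
    int  len;
} BUFF;

/* Stand-in for bwrite(): append what fits, return the bytes accepted. */
static int bwrite(BUFF *fb, const void *buf, int nbyte)
{
    int room = (int)sizeof(fb->data) - fb->len;

    if (nbyte < 0)
        return -1;
    if (nbyte > room)
        nbyte = room;
    fb->last_ptr = buf;
    memcpy(fb->data + fb->len, buf, (size_t)nbyte);
    fb->len += nbyte;
    return nbyte;
}

/* Hypothetical 'write' member: tabs become single spaces. */
static int detab_write(BUFF *next, void *ctx, const char *text, int nbyte)
{
    int done = 0;

    (void)ctx;
    while (done < nbyte) {
        const char *start = text + done;
        const char *tab = memchr(start, '\t', (size_t)(nbyte - done));
        int run = tab ? (int)(tab - start) : nbyte - done;
        int rv;

        if (run > 0) {
            /* unchanged text: pass the caller's pointer on, no copy */
            rv = bwrite(next, start, run);
            if (rv < 0)
                return done > 0 ? done : -1;
            done += rv;
            if (rv < run)
                break;
        }
        if (tab != NULL) {
            /* the one changed byte is written from a constant */
            rv = bwrite(next, " ", 1);
            if (rv < 0)
                return done > 0 ? done : -1;
            if (rv == 0)
                break;
            done += 1;
        }
    }
    return done;
}
```

This trades copies for extra calls into the next layer, which is
acceptable only because that layer is expected to do the buffering.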

[djg: Unfortunately this makes it hard to support page-flipping and
async i/o because you don't have any reference counts on the data.
But I go into a little detail that already in docs/page_io.]

	Function chaining

In order to avoid unnecessary function chaining for reads and writes,
when a filter is pushed onto the stack, the buff.c code will determine
which is the next BUFF which contains a read or write function, and
reads and writes, respectively, will go directly to that BUFF.
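
One way this could be done at push time is sketched below.  The struct
layout, the field name filter_info, and the function push_and_link are
assumptions made for the example; only the three "next" pointers come
from the description of the BUFF structure.  The shortcut pointers of
the BUFF being pushed are derived from those already stored in the
BUFF beneath it, so no walk down the stack is needed.

```c
#include <stddef.h>

typedef struct BUFF BUFF;

/* Cut-down bfilter: only the members that matter for chaining. */
typedef struct bfilter {
    int (*read)(BUFF *, void *, char *, int);
    int (*write)(BUFF *, void *, const char *, int);
} bfilter;

struct BUFF {
    bfilter filter;
    void *filter_info;
    BUFF *next;          /* next BUFF in the stack */
    BUFF *next_writer;   /* nearest BUFF below that implements write */
    BUFF *next_reader;   /* nearest BUFF below that implements read */
};

/* Placeholder filter functions, present only so the sketch can be tried. */
static int pass_read(BUFF *next, void *ctx, char *buf, int n)
{
    (void)next; (void)ctx; (void)buf; (void)n;
    return 0;
}

static int pass_write(BUFF *next, void *ctx, const char *buf, int n)
{
    (void)next; (void)ctx; (void)buf;
    return n;
}

/* Push 'top' onto 'stack' and precompute where its reads and writes go. */
static BUFF *push_and_link(BUFF *stack, BUFF *top)
{
    if (top == NULL)
        return NULL;
    top->next = stack;
    if (stack == NULL) {
        top->next_writer = NULL;
        top->next_reader = NULL;
    } else {
        /* reuse the answer the BUFF below already computed */
        top->next_writer = stack->filter.write ? stack : stack->next_writer;
        top->next_reader = stack->filter.read ? stack : stack->next_reader;
    }
    return top;
}
```

Popping a filter needs no fix-up in this scheme, since each BUFF's
shortcuts refer only to BUFFs below it.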

	writev

writev is a function for efficient writing to the system; in terms of
this API, however, it also works for dealing with multiple blocks of
text without doing unnecessary byte copies.  It is not required.

Currently, the system level writev is used in two contexts: for
chunking and when a block of text is written which, combined with
the text already in the buffer, would make the buffer overflow.

writev would be implemented both by the default bottom level filter
and by the chunking filter for these operations.  In addition, writev
may be used, as noted above, to pass multiple blocks of text w/o
copying them into a single buffer.  Note that if the next filter does
not implement writev, however, this will be equivalent to repeated
calls to write, which may or may not be more efficient.  Up to
IOV_MAX-2 blocks of text may be passed along in this manner.  Unlike
the system writev call, the writev in this API should be called only
once, with an array of iovecs and a count of the number of
iovecs in it.
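
To make the chunking case concrete, here is a small sketch of how a
chunking filter's 'write' might frame one block as an HTTP/1.1 chunk
using three iovecs (size line, data, trailing CRLF), without copying
the data.  The BUFF and bwritev below are simplified stand-ins written
for this example, and chunk_write is a hypothetical name; the real
chunking code in buff.c is more involved.

```c
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>   /* struct iovec */

/* Stand-in sink (assumption), not the real BUFF. */
typedef struct BUFF {
    char data[256];
    int  len;
} BUFF;

/* Stand-in for bwritev(): 0 once everything is written, -1 on error. */
static int bwritev(BUFF *fb, const struct iovec *vec, int nvecs)
{
    int i;

    for (i = 0; i < nvecs; i++) {
        if (vec[i].iov_len > sizeof(fb->data) - (size_t)fb->len)
            return -1;
        memcpy(fb->data + fb->len, vec[i].iov_base, vec[i].iov_len);
        fb->len += (int)vec[i].iov_len;
    }
    return 0;
}

/* Hypothetical 'write' member of a chunking filter. */
static int chunk_write(BUFF *next, void *ctx, const char *text, int nbyte)
{
    char head[16];
    struct iovec vec[3];
    int hlen;

    (void)ctx;
    if (nbyte < 0)
        return -1;
    if (nbyte == 0)
        return 0;   /* a zero-length chunk would end the body */
    hlen = snprintf(head, sizeof head, "%x\r\n", (unsigned)nbyte);
    vec[0].iov_base = head;
    vec[0].iov_len  = (size_t)hlen;
    vec[1].iov_base = (void *)text;      /* caller's data, not copied */
    vec[1].iov_len  = (size_t)nbyte;
    vec[2].iov_base = (void *)"\r\n";
    vec[2].iov_len  = 2;
    return bwritev(next, vec, 3) == 0 ? nbyte : -1;
}
```

Only the short size line is built by the filter; the payload reaches
the layer below through the same pointer the caller supplied.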

If a bfilter defines writev, writev will be called whether or not
NO_WRITEV is set; hence, it should deal with that case in a reasonable
manner.

[djg: We can't guarantee atomicity of writev() when we emulate it.
Probably not a problem, just an observation.]

*************************************************************************
		Code in buff.c

	Default Functions

The default actions are generally those currently performed by Apache,
save that they'll only attempt to write to a buffer, and they'll
return an error if there are no more buffers.  That is, you must implement
read, write, and flush in the bottom-most filter.

Except for close(), the default code will simply pass the function call
on to the next filter in the stack.  Some samples follow.

	Heuristics for writev

Currently, we call writev for chunking, and when we get enough data that
the total overflows the buffer.  Since chunking is going to become a
filter, the chunking filter will use writev; in addition, bwrite will
trigger bwritev as shown (note that system specific information should
be kept at the filter level):

in bwrite:

    if (fb->outcnt > 0 && nbyte + fb->outcnt >= fb->bufsiz) {
        /* build iovec structs */
        struct iovec vec[2];
        vec[0].iov_base = (void *) fb->outbase;
        vec[0].iov_len = fb->outcnt;
        fb->outcnt = 0;
        vec[1].iov_base = (void *)buff;
        vec[1].iov_len = nbyte;
        return bwritev (fb, vec, 2);
    } else if (nbyte >= fb->bufsiz) {
        return write_with_errors(fb,buff,nbyte);
    }

Note that the code above takes the place of large_write (as well
as taking code from it).

So, bwritev would look something like this (copying and pasting freely
from the current source for writev_it_all, which could be replaced):

-----
int bwritev (BUFF * fb, struct iovec * vec, int nvecs) {
    if (!fb)
        return -1; /* the bottom level filter implemented neither write nor
                    * writev. */
    if (fb->bfilter.writev) {
        return fb->bfilter.writev(fb->next, fb->bfilter_info, vec, nvecs);
    } else if (fb->bfilter.write) {
        /* while it's nice and easy to build the vector and crud, it's painful
         * to deal with partial writes (esp. w/ the vector)
         */
        int i = 0,rv;
        while (i < nvecs) {
            do {
                rv = fb->bfilter.write(fb->next, fb->bfilter_info,
                                       vec[i].iov_base, vec[i].iov_len);
            } while (rv == -1 && (errno == EINTR || errno == EAGAIN)
                     && !(fb->flags & B_EOUT));
            if (rv == -1) {
                if (errno != EINTR && errno != EAGAIN) {
                    doerror (fb, B_WR);
                }
                return -1;
            }
            fb->bytes_sent += rv;
            /* recalculate vec to deal with partial writes */
            while (rv > 0) {
                if (rv < vec[i].iov_len) {
                    vec[i].iov_base = (char *)vec[i].iov_base + rv;
                    vec[i].iov_len -= rv;
                    rv = 0;
                    if (vec[i].iov_len == 0) {
                        ++i;
                    }
                } else {
                    rv -= vec[i].iov_len;
                    ++i;
                }
            }
            if (fb->flags & B_EOUT)
                return -1;
        }
        /* if we got here, we wrote it all */
        return 0;
    } else {
        return bwritev(fb->next,vec,nvecs);
    }
}
-----
The default filter's writev function will be pretty much like
writev_it_all.


	Writing

The general case for writing data is significantly simpler with this
model.  Because special cases are not dealt with in the BUFF core,
a single internal interface to writing data is possible; I'm going
to assume it's reasonable to standardize on write_with_errors, but
some other function may be more appropriate.

In the revised bwrite (which I'll omit for brevity), the following
must be done:
	check for error conditions
	check to see if any buffering is done; if not, send the data
		directly to the write_with_errors function
	check to see if we should use writev or write_with_errors
		as above
	copy the data to the buffer (we know it fits since we didn't
		need writev or write_with_errors)
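
For illustration only, the four steps might fit together roughly as
follows.  This is a sketch under stated assumptions, not the omitted
function: the BUFF is a cut-down stand-in whose "sink" array plays the
part of everything below it, the B_WRERR value is a placeholder,
write_with_errors and bwritev are trivial substitutes for the real
ones, and bwrite_sketch is a made-up name.

```c
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>   /* struct iovec */

#define B_WRERR 0x20   /* placeholder value (assumption) */

/* Stand-in BUFF: an output buffer plus a sink for whatever leaves it. */
typedef struct BUFF {
    int   flags;
    char *outbase;     /* output buffer */
    int   outcnt;      /* bytes currently buffered */
    int   bufsiz;      /* buffer size; 0 means unbuffered */
    char  sink[256];
    int   sunk;
} BUFF;

static int write_with_errors(BUFF *fb, const void *buf, int nbyte)
{
    if (nbyte > (int)sizeof(fb->sink) - fb->sunk) {
        fb->flags |= B_WRERR;
        return -1;
    }
    memcpy(fb->sink + fb->sunk, buf, (size_t)nbyte);
    fb->sunk += nbyte;
    return nbyte;
}

static int bwritev(BUFF *fb, const struct iovec *vec, int nvecs)
{
    int i;

    for (i = 0; i < nvecs; i++)
        if (write_with_errors(fb, vec[i].iov_base, (int)vec[i].iov_len) < 0)
            return -1;
    return 0;
}

static int bwrite_sketch(BUFF *fb, const void *buf, int nbyte)
{
    /* 1. error conditions */
    if (fb == NULL || nbyte < 0 || (fb->flags & B_WRERR))
        return -1;
    if (nbyte == 0)
        return 0;
    /* 2. no buffering: send the data straight down */
    if (fb->bufsiz <= 0)
        return write_with_errors(fb, buf, nbyte);
    /* 3a. buffered data plus new data overflows: one writev for both */
    if (fb->outcnt > 0 && nbyte + fb->outcnt >= fb->bufsiz) {
        struct iovec vec[2];

        vec[0].iov_base = fb->outbase;
        vec[0].iov_len  = (size_t)fb->outcnt;
        vec[1].iov_base = (void *)buf;
        vec[1].iov_len  = (size_t)nbyte;
        fb->outcnt = 0;
        return bwritev(fb, vec, 2) == 0 ? nbyte : -1;
    }
    /* 3b. nothing buffered and the block alone fills the buffer */
    if (nbyte >= fb->bufsiz)
        return write_with_errors(fb, buf, nbyte);
    /* 4. it fits: copy into the buffer */
    memcpy(fb->outbase + fb->outcnt, buf, (size_t)nbyte);
    fb->outcnt += nbyte;
    return nbyte;
}
```
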

The other work the current bwrite is doing is
	ifdef'ing around NO_WRITEV
	numerous decisions regarding whether or not to send chunks

Generally, buff.c has a number of functions whose entire purpose is
to handle particular special cases wrt chunking, all of which could
be simplified with a chunking filter.

write_with_errors would not need to change; buff_write would.  Here
is a new version of it:

-----
/* the lowest level writing primitive */
static ap_inline int buff_write(BUFF *fb, const void *buf, int nbyte)
{
    if (fb->bfilter.write)
        return fb->bfilter.write(fb->next_writer,fb->bfilter_info,buf,nbyte);
    else
        return bwrite(fb->next_writer,buf,nbyte);
}
-----

If the btransmitfile function is called on a buffer which doesn't implement
it, the system will attempt to read data from the file identified
by the file_info_ptr structure and use other methods to write it out.

	Reading

One of the basic reading functions in Apache 1.3b3 is buff_read;
here is how it would look within this spec:

-----
/* the lowest level reading primitive */
static ap_inline int buff_read(BUFF *fb, void *buf, int nbyte)
{
    if (!fb)
        return -1; /* the bottom level filter is not set up properly */

    if (fb->bfilter.read)
        return fb->bfilter.read(fb->next_reader,fb->bfilter_info,buf,nbyte);
    else
        return bread(fb->next_reader,buf,nbyte);
}
-----
The code currently in buff_read would become part of the default
filter.


	Flushing data

flush will get passed on down the stack automatically, with recursive
calls to bflush.  The user-supplied flush function will be called then,
and also before close is called.  The user-supplied flush should not
call flush on the next buffer.

[djg: Poorly written "expanding" filters can cause some nastiness
here.  In order to flush a layer you have to write out your current
buffer, and that may cause the layer below to overflow a buffer and
flush it.  If the filter is expanding then it may have to add more to
the buffer before flushing it to the layer below.  It's possible that
the layer below will end up having to flush twice.  It's a case where
writev-like capabilities are useful.]

	Closing Stacks and Filters

When a filter is removed from the stack, flush will be called then close
will be called.  When the entire stack is being closed, this operation
will be done automatically on each filter within the stack; generally,
filters should not operate on other filters further down the stack,
except to pass data along when flush is called.
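
The order of operations when a whole stack is closed could be sketched
like this.  Everything here is an assumption for the sake of the
example: the cut-down BUFF and bfilter, the name close_stack_sketch,
and the tracing filter, which does nothing but record the calls it
receives.

```c
#include <stddef.h>
#include <string.h>

typedef struct BUFF BUFF;

/* Cut-down bfilter: only flush and close matter here. */
typedef struct bfilter {
    int  (*flush)(BUFF *, void *, const char *, int, struct bfilter *);
    void (*close)(BUFF *, void *);
} bfilter;

struct BUFF {
    bfilter filter;
    void *filter_info;
    BUFF *next;
};

/* Tracing filter: appends "f<tag>" or "c<tag>" to a shared string. */
typedef struct trace {
    char  tag;
    char *out;
} trace;

static void note(trace *t, char what)
{
    size_t n = strlen(t->out);

    t->out[n] = what;
    t->out[n + 1] = t->tag;
    t->out[n + 2] = '\0';
}

static int trace_flush(BUFF *next, void *ctx, const char *text, int n,
                       bfilter *self)
{
    (void)next; (void)text; (void)n; (void)self;
    note(ctx, 'f');
    return 0;
}

static void trace_close(BUFF *next, void *ctx)
{
    (void)next;
    note(ctx, 'c');
}

/* Walk from the top down: flush each filter, then close it. */
static void close_stack_sketch(BUFF *top)
{
    while (top != NULL) {
        BUFF *below = top->next;

        if (top->filter.flush)
            top->filter.flush(below, top->filter_info, NULL, 0, &top->filter);
        if (top->filter.close)
            top->filter.close(below, top->filter_info);
        top->next = NULL;   /* detach; the filter never closes 'below' */
        top = below;
    }
}
```

The core does the walking, so each filter's close only has to release
its own resources.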

	Flags and Options

Changes to flags and options using the current functions only affect
one buffer.  To affect all the buffers on down the chain, use
bsetstackopts or bsetstackflags.
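
A stack-wide flag setter might amount to little more than a loop over
the chain, as in the sketch below.  The struct, the flag value, the
name set_stack_flag and the choice of return value (the number of
BUFFs touched) are all assumptions; the proposal only fixes the
argument types of bsetstackflags.

```c
#include <stddef.h>

#define B_MAXBUFFERING 0x100   /* placeholder value (assumption) */

/* Cut-down BUFF: just the flags word and the link to the BUFF below. */
typedef struct BUFF {
    int flags;
    struct BUFF *next;
} BUFF;

/* Set (value != 0) or clear (value == 0) 'flag' on 'top' and on every
 * BUFF beneath it; returns how many BUFFs were updated. */
static int set_stack_flag(BUFF *top, int flag, int value)
{
    int touched = 0;
    BUFF *b;

    for (b = top; b != NULL; b = b->next) {
        if (value)
            b->flags |= flag;
        else
            b->flags &= ~flag;
        touched++;
    }
    return touched;
}
```
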

bgetopt is currently only used to grab a count of the bytes sent;
it will continue to provide that functionality.  bgetflags is
used to provide information on whether or not the connection is
still open; it'll continue to provide that functionality as well.

The core BUFF operations will remain, though some operations which
are done via flags and options will be done by attaching appropriate
filters instead (eg. chunking).

[djg: I'd like to consider filesystem metadata as well -- we only need
a few bits of metadata to do HTTP: file size and last modified.  We
need an etag generation function, it is specific to the filters in
use.  You see, I'm envisioning a bottom layer which pulls data out of
a database rather than reading from a file.]


**************************************************************
**************************************************************
Date: Wed, 9 Sep 1998 18:55:40 -0700 (PDT)
From: Alexei Kosut <akosut@leland.stanford.edu>
To: new-httpd@apache.org
Subject: A Magic Cache example
Message-ID: <Pine.GSO.3.96.980909182642.29690A-100000@myth1.Stanford.EDU>

During the drive home, I came up with a good example of how I envision the
new module/cache/layer model thingy working. Comments please:

The middle end of the server is responsible for taking the request the
front end gives it and somehow telling the back end how to fulfill it. I
look at it like this: The request is a URI (Uniform Resource Identifier)
and a set of request dimensions (the request headers, the remote IP
address, the time of day, etc...). The middle end, via its configuration,
translates this into a request for content from a backing store module,
plus possibly some filter modules. Since the term "filename" is too
flat-file specific, let's call the parameter we pass to the backing store
a SRI (Specific Resource Identifier), in a format specific to that module.

Our example is similar to the one I was using earlier, with some
additions: The request is for a URI, say "/skzb/teckla.html". The response
is a lookup from a (slow) database. The URI maps to the mod_database SRI
of "BOOK:0-441-79977-9" (I made that format up). We want to take that
output and convert it from whatever charset it's in into Unicode. We then
have a PHP script that works on a Unicode document and does things based
on whether the browser is Netscape or not. Then we translate the document
to the best charset that matches the characters used and the client's
capabilities and send it.

So upon request for /skzb/teckla.html, the middle end translates the
request into the following "equation":

        SRI: mod_database("BOOK:0-441-79977-9")
    +   filter: mod_charset("Unicode")
    +   filter: mod_php()
    +   filter: mod_charset("best_fit")
 -------------------------------------------------
        URI: /skzb/teckla.html

It then constructs a stack of IO (NSPR) filters like this:

mod_database -> cache-write -> mod_charset -> cache-write -> mod_php ->
cache-write -> mod_charset -> cache-write -> client

And sets it to running. Each of the cache filters is a write-through
filter that copies its data into the cache with a tag based on what
equation the middle end uses to get to it, plus the request dimensions it
uses (info it gets from the modules).

The database access is stored under "SRI: mod_database(BOOK:0-441-79977-9)"
with no dimensions (because it's the same for all requests). The first
charset manipulation is stored under "SRI: mod_database(BOOK...) + filter:
mod_charset(Unicode)", again with no dimensions. The PHP output is stored
under "SRI: mod_database(BOOK...) + filter: mod_charset(Unicode) + filter:
mod_php()" with dimensions of (User-Agent). The final output is stored both
as "SRI: mod_database(BOOK...) + filter: mod_charset(Unicode) + filter:
mod_php() + filter: mod_charset(best_fit)" and "URI: /skzb/teckla.html"
(they're the same thing), both with dimensions of (User-Agent,
Accept-Charset).
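
The way such cache tags grow stage by stage can be sketched in a few
lines of C.  This is purely illustrative: key_append and its
parameters are invented here, and nothing in the proposal says tags
would be plain strings built this way.

```c
#include <stdio.h>
#include <string.h>

/* Append one stage, "kind: module(arg)", to the tag in 'key', joining
 * stages with " + ".  'key' must already hold a NUL-terminated string
 * and 'cap' is its capacity.  Returns 0 on success; returns -1 and
 * leaves 'key' unchanged if the stage does not fit. */
static int key_append(char *key, size_t cap, const char *kind,
                      const char *module, const char *arg)
{
    size_t used = strlen(key);
    int n;

    if (used >= cap)
        return -1;
    n = snprintf(key + used, cap - used, "%s%s: %s(%s)",
                 used ? " + " : "", kind, module, arg);
    if (n < 0 || (size_t)n >= cap - used) {
        key[used] = '\0';   /* undo the partial append */
        return -1;
    }
    return 0;
}
```

Each cache-write filter in the stack would then store its data under
the tag as it stands at that point, together with the dimensions
gathered so far.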

So far so good. Now, when another request for /skzb/teckla.html comes in,
the cache is consulted to see how much we can use. First, the URI is
looked up. This can be done by a kernel or other streamlined part of the
server. So "URI: /skzb/teckla.html" is looked up, and one entry pops out
with dimensions of (User-Agent, Accept-Charset). The user-agent and
accept-charset of the request are compared against the ones of the stored
entry(ies). If one matches, it can be sent directly.

If not, the server proceeds to look up "SRI: mod_database(BOOK...) +
filter: mod_charset(Unicode) + filter: mod_php()". If the request has a
different accept-charset, but the same user-agent, then this can be
reprocessed by mod_charset and used. Otherwise, the server proceeds back
to "SRI: mod_database(BOOK...) + filter: mod_charset(Unicode)", which will
match any request. There's probably some sort of cache invalidation
(expires, etc...) that happens eventually to result in a new database
lookup, but mostly, that very costly operation is avoided.

I think I've made it out to be a bit more complicated than it is, with the
long equation strings mixed in there. But the above reflects my
understanding of how the new Apache 2.0 system should work.

Note 1: The cache is smarter than I make it out here when it comes to
adding new entries. It should realize that, since the translation to
Unicode doesn't change or restrict the dimensions of the request, it
really is pointless to cache the original database lookup, since it will
always be translated in exactly the same manner. Knowing this, it will
only cache the Unicode version. 

Note 2: PHP probably doesn't work with Unicode. And there may not be a way
to identify a script as only acting on the User-Agent dimension. That's
not the point.

Note 3: Ten bonus points to anyone who's read this far, and is the first
person to answer today's trivia question: What does the skzb referred to
in the example URI stand for? There's enough information in this mail to
figure it out (with some help from the Net), even if you don't know
offhand (though if you do, I'd be happier). 

-- Alexei Kosut <akosut@stanford.edu> <http://www.stanford.edu/~akosut/>
   Stanford University, Class of 2001 * Apache <http://www.apache.org> *


**************************************************************
Message-ID: <19980922224326.A16219@aisa.fi.muni.cz>
Date: Tue, 22 Sep 1998 22:43:26 +0200
From: Honza Pazdziora <adelton@informatics.muni.cz>
To: new-httpd@apache.org
Subject: Re: I/O Layering in next version of Apache.
References: <19980922111627.19784.qmail@hyperreal.org> <3607D53A.1FF6D93@algroup.co.uk> <13831.55021.929560.977122@zap.ml.org>
In-Reply-To: <13831.55021.929560.977122@zap.ml.org>; from Ben Hyde on Tue, Sep 22, 1998 at 01:04:12PM -0400

> >Does anyone have a starting point for layered I/O? I know we kicked it

Hello,

there has been a thread on modperl mailing list recently about
problems we have with the current architecture. Some of the points
were: what requirements will be put on modules to be new I/O
compliant. I believe it's the Apache::SSI vs. Apache::SSIChain
difference between 1.3.* and 2.*. The first fetches the file _and_
does the SSI, the second takes input from a different module that
either gets the HTML or runs the CGI or so, and processes its output.
Should all modules be capable of working on some other module's
output? Probably except those that actually go to disk or database for
the primary data.

Randal's point was that output of any module could be processed, so
that no module should make any assumption whether it's sending data
directly to the browser or to some other module. This can be used both
for caching, but it is also one of the things needed to make
filtering transparent.

Also, as Apache::GzipChain module shows, once you process the output,
you may need to modify the headers as well. I was hit by this when I
tried to convert between charsets, to send out those that the browsers
would understand. The Apache::Mason module shows that you can build
a page from pieces. Each of the pieces might have different
characteristics (charset, for example), so with each piece of code we
might need to have its own headers that describe it, or at least the
difference between the final (global) header-outs and its local.

Sorry for bringing so many Perl module names in, but modperl is
currently a way to get some layered I/O done in 1.3.*, so I only have
practical experience with it.

Yours,

------------------------------------------------------------------------
 Honza Pazdziora | adelton@fi.muni.cz | http://www.fi.muni.cz/~adelton/
                   I can take or leave it if I please
------------------------------------------------------------------------

**************************************************************
Date: Wed, 23 Sep 1998 10:46:47 -0700 (PDT)
From: Dean Gaudet <dgaudet@arctic.org>
To: new-httpd@apache.org
Subject: Re: I/O Layering in next version of Apache.
In-Reply-To: <36092F2D.BCC4E5C1@algroup.co.uk>
Message-ID: <Pine.LNX.3.96dg4.980923103916.24223K-100000@twinlark.arctic.org>

On Wed, 23 Sep 1998, Ben Laurie wrote:

> Dean Gaudet wrote:
> > 
> > On Wed, 23 Sep 1998, Ben Laurie wrote:
> > 
> > > Is the simplest model that accomodates this actually just a stack
> > > (tree?) of webservers? Naturally, we wouldn't talk HTTP between the
> > > layers, but pass (header,content) pairs around (effectively).
> > > Interesting.
> > 
> > We could just talk "compiled" HTTP -- using a parsed representation of
> > everything essentially.
> 
> That's pretty much what I had in mind - but does it make sense? I have
> to admit, it makes a certain amount of sense to me, but I still have
> this nagging suspicion that there's a catch.

We talked about this during the developers meeting earlier this summer... 
while we were hiking, so I don't think there were any notes.

I think it'd be a useful exercise to specify a few example applications we
want to be able to support, and then consider methods of implementing
those applications.  Make the set as diverse and small as possible.  I'll
take the easiest one :)

- serve static content from arbitrary backing store (e.g. file, database) 

Once we flesh such a list out it may be easier to consider implementation
variations... 

I think it was Cliff who said it this way:  in a multiple layer setup he
wants to be able to partition the layers across servers in an arbitrary
manner.  For example, a proxy cache on one box which the world talks to,
and which backends to various other boxes for dynamic and static content.
Or maybe the static content is on the same server as the proxy. If this is
something we want to support then talking (a restricted form of) HTTP
between layers is interesting. 

Now we can all start worrying about performance ;) 

Dean


**************************************************************
Date: Wed, 23 Sep 1998 11:23:30 -0700 (PDT)
From: Alexei Kosut <akosut@leland.stanford.edu>
To: new-httpd@apache.org
Subject: Re: I/O Layering in next version of Apache.
In-Reply-To: <36092F2D.BCC4E5C1@algroup.co.uk>
Message-ID: <Pine.GSO.3.96.980923111613.17322C-100000@myth6.Stanford.EDU>

On Wed, 23 Sep 1998, Ben Laurie wrote:

> > We could just talk "compiled" HTTP -- using a parsed representation of
> > everything essentially.
> 
> That's pretty much what I had in mind - but does it make sense? I have
> to admit, it makes a certain amount of sense to me, but I still have
> this nagging suspicion that there's a catch.

One important thing to note is that we want this server to be able to
handle non-HTTP requests. So using HTTP as the internal language (as we do
now) is not the way to go. What we talked about in SF was using a basic
set of key/value pairs to represent the metadata of the response. Which
would of course bear an uncanny resemblance to HTTP-style MIME headers...

Certainly, and this is the point I think the originator of this thread
raised, each module layer (see the emails I sent a few weeks ago for more
details on how I see *that*) needs to provide both a content filter and a
metadata filter. Certainly a module that does encoding has to be able to
alter the headers to add a Content-Encoding, Transfer-Encoding, TE, or
what have you. Many modules that do anything to the content will
want to add headers, and many others will need to alter the dimensions on
which the request is served, or what the parameters to those dimensions
are for the current request. The latter is absolutely vital for caching.

The problem, as I see it, is this: Often, I suspect it will be the case
that the module does not know what metadata it will be altering (and how)
until after it has processed the request. i.e., a PHP script may not
discover what dimensions it uses (as we discussed earlier) until after it
has parsed the entire script. But if the module is functioning as an
in-place filter, that can cause massive headaches if we need the metadata
in a complete form *before* we send the entity, as we do for HTTP.

I'm not quite sure how to solve that problem. Anyone have any brilliant
ideas?

(Note that for internal caching, we don't actually need the dimension data
until after the request, because we can alter the state of the cache at
any time, but if we want to play nice with HTTP and send Vary: headers
and such, we do need that information. I guess we could send Vary:
footers...)

-- Alexei Kosut <akosut@stanford.edu> <http://www.stanford.edu/~akosut/>
   Stanford University, Class of 2001 * Apache <http://www.apache.org> *


**************************************************************
Date: 23 Sep 1998 20:26:58 -0000
Message-ID: <19980923202658.25736.qmail@zap.ml.org>
From: Ben Hyde <bhyde@pobox.com>
To: new-httpd@apache.org
Subject: Stacking up Response Handling
In-Reply-To: <Pine.GSO.3.96.980923111613.17322C-100000@myth6.Stanford.EDU>
References: <36092F2D.BCC4E5C1@algroup.co.uk>
	<Pine.GSO.3.96.980923111613.17322C-100000@myth6.Stanford.EDU>

Alexei Kosut writes:
>The problem, as I see it, is this: Often, I suspect it will be the case
>that the module does not know what metadata it will be altering (and how)
>until after it has processed the request. i.e., a PHP script may not
>discover what dimensions it uses (as we discussed earlier) until after it
>has parsed the entire script. But if the module is functioning as an
>in-place filter, that can cause massive headaches if we need the metadata
>in a complete form *before* we send the entity, as we do for HTTP.
>
>I'm not quite sure how to solve that problem. Anyone have any brilliant
>ideas?

This is the same as building a layout engine that does incremental layout,
but simpler since I doubt we'd want to allow for reflow.

Sometimes you can send output right along, sometimes you have to wait.
I visualize the output as a tree/outline and as it is swept out a
stack holds the path to the leaf.  Handlers for the individual nodes
wait or proceed depending on if they can.

It's a pretty design, with the pipeline consisting of this stack of
output transformers/generators.  Each pipeline stage accepts a stream
of output_chunks.  I think of these output_chunks as coming in plenty
of flavors, for example transmit_file, transmit_memory, etc.  Some
pipeline stages might handle very symbolic chunks.  For example
transmit_xml_tree might be handed to transform_xml_to_html stage in
the pipeline.
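
[Editor's sketch: one way the "flavors" of output_chunk described above
could be represented in C is a tagged union. The type and function names
below are hypothetical, chosen to echo transmit_file and transmit_memory;
they are not existing Apache API. The example stage rewrites memory chunks
in place and lets every other flavor pass through untouched.]

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical chunk flavors, named after the ones in the message. */
enum chunk_kind { TRANSMIT_MEMORY, TRANSMIT_FILE };

struct output_chunk {
    enum chunk_kind kind;
    union {
        struct { char *data; size_t len; } memory; /* bytes already in RAM */
        struct { const char *path; } file;         /* file to be sent as-is */
    } u;
};

/*
 * A toy pipeline stage: upper-cases memory chunks in place.
 * Returns 1 if the chunk was modified, 0 if it was passed through.
 * A file chunk is never opened here, so it stays eligible for a
 * zero-copy transmit further down the pipeline.
 */
int upcase_stage(struct output_chunk *c)
{
    size_t i;

    if (c->kind != TRANSMIT_MEMORY)
        return 0;
    for (i = 0; i < c->u.memory.len; i++)
        c->u.memory.data[i] =
            (char)toupper((unsigned char)c->u.memory.data[i]);
    return 1;
}
```

A stage that only understands some flavors can ignore the rest, which is
what lets symbolic chunks travel down to a stage that knows how to expand
them.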

I'm assuming the core server would have only a few kinds of pipeline
nodes, generate_response, generate_content_from_url_via_file_system,
generate_via_classic_module_api.  Things like convert_char_set or
do_cool_transfer_encoding, could easily be loaded at runtime and
authored outside the core.  That would be nice.

For typical fast responses we wouldn't push much on this stack at
all.  It might go something like this: Push generate_response node, 
it selects an appropriate content generator by consulting the
module community and pushes that.  Often this is 
generate_content_from_url_via_file_system which in turn does
all that ugly mapping to a file name and then passes 
transmit_file down the pipeline and pops itself off the stack.
generate_response once back on top again does the transmit and
pops off.

For rich complex output generation we might push all kinds of things
(charset converters, transfer encoders, XML -> HTML rewriters, cache
builders, old style apache module API simulators, what ever).

The intra-stack element protocol gets interesting around issues
like error handling, blocking, etc.  

I particularly like how this allows simulation of the old module API,
as well as the API of other servers, and experimenting with other
module API which cross process or machine boundaries.

In many ways this isn't that much different from what was proposed
a year ago.  

 - ben

**************************************************************
From: Ben Hyde <bhyde@pobox.com>
Date: Wed, 23 Sep 1998 21:58:54 -0400 (EDT)
To: new-httpd@apache.org
Subject: Re: Core server caching
In-Reply-To: <Pine.GSO.3.96.980923142800.14009A-100000@elaine40.Stanford.EDU>
References: <19980923210119.25763.qmail@zap.ml.org>
	<Pine.GSO.3.96.980923142800.14009A-100000@elaine40.Stanford.EDU>
Message-ID: <13833.39467.942203.885143@zap.ml.org>

Alexei Kosut writes:
>On 23 Sep 1998, Ben Hyde wrote:
>
>> The core problem of caching seems to me to get confused by the
>> complexity of designing a caching proxy.  If one ignores that then the
>> core problem of caching seems quite simple.
>
>Actually, for an HTTP server, they're the same problem, if you want to be
>able to cache any sort of dynamic request. And caching static requests is
>kind of silly (Dean's flow stuff notwithstanding, making copies of static
>files in either memory or on disk is silly, since the OS can do it better
>than we can).

I don't disagree with any of the things you said, so I guess I'm
failing to get across where in this structure the functions you're
pointing out as necessary would reside, versus where the "chunk
cache" mechanism I'm yearning for would fit.

Well, that's not entirely true; I do feel it's helpful to make this
point.

The HTTP spec's definition of proper caching is terribly constrained
by the poverty of information available to the proxy server.  He is
trapped in the middle between an opinionated content provider and an
opinionated content consumer.  It was written in an attempt to keep
people like AOL from making their opinions dominate either of those
other two.  Proper caching by a server that is right next to the
content generation can and ought to include both more or less
heuristics that are tunable by the opinions of the content provider
who presumably we are right next to.

Imagine the server that has a loop that goes like so:

   loop
     r<-swallow_incomming_request
     h<-select_response_handler(r)
     initialize_response_pipeline()
     push_pipeline_element(h)
     tend_pipeline_until_done()
   end loop

In most of the web based applications I've seen the
select_response_handler step evolves into something that looks like an
AI expert system.  That said, what I'd like to see in Apache2 is a
simple dispatch along with a way to plug-in more complex dispatching
mechanisms.  I'd very much like to avoid having that get confused with
the suite of response_handlers.
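
[Editor's sketch: the loop above, reduced to a stack of stages in C. Every
name here is hypothetical and nothing below is existing Apache API. The
rule that a stage returns 0 after pushing a child and 1 when it is
finished is an assumption made only to keep the example small; error
handling and blocking are left out.]

```c
#include <assert.h>
#include <stddef.h>

#define MAX_STAGES 8

struct pipeline;

/* A stage does one step of work: returns 1 when finished, 0 to run again. */
typedef int (*stage_fn)(struct pipeline *p, void *ctx);

struct stage {
    stage_fn run;
    void *ctx;
};

struct pipeline {
    struct stage stack[MAX_STAGES];
    int depth;
    const char *pending; /* chunk handed down by the last generator */
    int chunks_sent;
};

void initialize_response_pipeline(struct pipeline *p)
{
    p->depth = 0;
    p->pending = NULL;
    p->chunks_sent = 0;
}

int push_pipeline_element(struct pipeline *p, stage_fn run, void *ctx)
{
    if (p->depth >= MAX_STAGES)
        return -1;
    p->stack[p->depth].run = run;
    p->stack[p->depth].ctx = ctx;
    p->depth++;
    return 0;
}

/* Run whatever is on top of the stack until the stack is empty. */
void tend_pipeline_until_done(struct pipeline *p)
{
    while (p->depth > 0) {
        int top = p->depth - 1;
        struct stage s = p->stack[top];

        if (s.run(p, s.ctx)) {
            /* Assumed contract: a finished stage left nothing above it. */
            assert(p->depth == top + 1);
            p->depth = top;
        }
    }
}

/* Content generator: hands a file name down the pipeline, then pops. */
static int generate_content_from_file(struct pipeline *p, void *ctx)
{
    p->pending = (const char *)ctx;
    return 1;
}

struct response_state {
    int started;
    const char *path;
};

/* First call pushes a generator; second call "transmits" its chunk. */
int generate_response(struct pipeline *p, void *ctx)
{
    struct response_state *st = ctx;

    if (!st->started) {
        st->started = 1;
        if (push_pipeline_element(p, generate_content_from_file,
                                  (void *)st->path) != 0)
            return 1; /* stack full: give up on this response */
        return 0;     /* let the generator run, then come back */
    }
    if (p->pending != NULL) {
        p->chunks_sent++; /* stand-in for the real transmit */
        p->pending = NULL;
    }
    return 1;
}
```

Here select_response_handler is collapsed into the caller choosing
generate_response; a fancier dispatcher would only change which function
gets pushed first, not the loop.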

I ignored the complexity of when you can safely select
a cached value because I think it's in the select_response_handler
step.  And possibly, I'll admit, not part of what I called the
"core server".

Clearly I'm a fool for using this term 'core server' since it
doesn't mean anything.  I wanted it to mean that loop above
and the most minimal implementations for the pipeline and
the select_response_handler one could imagine before starting
to pile on.  The server as shipped would have a lot more
stuff in it!

What I'm focused on is what has to be in that core versus
what has to be, but can be outside of it.

So, as I thought about the state of the pipeline just after
the call on initialize_response_pipeline I at first thought
it would have something much like the current buffer abstraction
in the pipeline.  Then I got to wondering if transfer encoding,
charset conversion, or caching ought to be in there.

I think there is an argument for putting some caching functionality
in there.  Possibly because that entire knot is what you'd move
into the OS if you could.  Possibly because this is the bit
that must fly.

Recall that I think the pipeline takes a stream of response
chunks with things like memory_chunk, transfer_file_chunk, etc.
in that stream.  The question is what flavors of chunks does
that bottom element in the pipeline take.  It's the chunks
that fly (and nothing more?).  So I got to thinking about
what does it mean to have a cached_chunk.

A cached_chunk needs only the small operation set along
the lines of what I mentioned.  A full caching scheme
can build on it.  As an added benefit the caching scheme
can be dumb, standard, or extremely witty without affecting
this portion of the design.

A quick point about why I wanted the cache to handle things
smaller than entire responses.  This isn't central I guess.

I want a protocol with content generators that encourages
them to use dynamic programming tricks to quickly generate
portions of pages that are static over long periods.  Such
a scheme has worked well in systems we've built.

 - ben hyde

**************************************************************
From: Ben Hyde <bhyde@pobox.com>
Date: Thu, 29 Oct 1998 23:16:37 -0500 (EST)
To: new-httpd@apache.org
Subject: Re: Core server caching
In-Reply-To: <Pine.LNX.3.96dg4.981029175439.3639X-100000@twinlark.arctic.org>
References: <Pine.WNT.4.05.9810292049480.-445955@helium.jetpen.com>
	<Pine.LNX.3.96dg4.981029175439.3639X-100000@twinlark.arctic.org>
Message-ID: <13881.12903.661334.819447@zap.ml.org>

Dean Gaudet writes:
>On Thu, 29 Oct 1998, Rasmus Lerdorf wrote:
>
>> There are also weird and wacky things you would be able to do if you could
>> stack mod_php on top of mod_perl.
>
>You people scare me.
>
>Isn't that redundant though?
>
>Dean

Yes it's scary, but oddly erotic, when these behemoths with their
gigantic interpreters try to mate.

It's an interesting syndrome: as soon as systems get an interpreter
they tend to lose their bearings and grow into vast behemoths that
lumber about slowly crushing little problems with their vast mass.
Turing syndrome?

I've heard people say modules can help avoid this, but I've rarely
seen it.  Olde Unix kinda manages it; I remember being frightened by
awk.

Can we nudge alloc.c/buff.c toward a bit of connective glue that
continues to let individual modules evolve their own gigantism while
avoiding vile effects on the core performance of the server?  Stuff
like this:

  memory chunk alignment for optimal I/O
  memory hand off along the pipeline
  memory hand off crossing pool boundaries
  memory hand off in zero copy cases
  transmit file
  transmit cache elements
  insert/remove cache elements
  leverage unique hardware and instructions

That memcpy in ap_bread really bugs me.

I'd rather have routines that let me hand off chunks.  Presumably
these would need to be able to move chunks across pool and buffer
boundaries.  But zero copy if I don't touch the content and never a
memcpy just to let me lex the input.

I've built systems like this with the buffers exposing an emacs
buffer style of abstraction, but with special kinds of marks
to denote what's released for sending, and what's been accepted
and lex'd on the input side.  It does mean all your
lexical and printf stuff has to be able to smoothly slide
over chunk boundaries.

 - ben

*************************************************************************
Date: Sun, 27 Dec 1998 13:08:22 -0800 (PST)
From: Ed Korthof <ed@bitmechanic.com>
To: new-httpd@apache.org
Subject: I/O filters & reference counts
Message-ID: <Pine.LNX.3.96.981224163237.10687E-100000@crankshaft>

Hi --

A while back, I indicated I'd propose a way to do reference counts w/ the
layered I/O I want to implement for 2.0 (assuming we don't use nspr)...
for single-threaded Apache, this seems unnecessary (assuming you don't use
shared memory in your filters to share data among the processes), but in
other situations it does have advantages.

Anyway, what I'd propose involves using a special syntax when you want to
use reference counts.  This allows Apache to continue using the
'pool'-based memory system (it may not be perfect, but imo it's reasonably
good), without creating difficulty when you wish to free memory.

If you're creating memory which you'll want to share among multiple
threads, you'll create it using a function more or less like: 

    ap_palloc_share(pool *p, size_t size);

you get back a void * pointer for use as normal. When you want to give
someone else a reference to it, you do the following: 

    ap_pshare_data(pool *p1, pool *p2, void * data);

where data is the return from above (and it must be the same).  Then both
pools have a reference to the data & to a counter; when each pool is
cleaned up, it will automatically decrement the counter, and free the data
if the counter is down to zero.

In addition, a pool can decrement the counter with the following:

    ap_pshare_free(pool * p1, void * data);

after which the data may be freed.  There would also be a function,

    ap_pshare_countrefs(pool * p1, void * data);

which would return the number of pools holding a ref to 'data', or 1 if
it's not a shared block.

Internally, the pool might either keep a list of the shared blocks, or a
balanced b-tree; if those are too slow, I'd look into passing back and
forth a (pointer to an) int, and simply use an array.  The filter
declaring the shared memory would need to keep track of such an int, but
no one else would. 
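
[Editor's sketch: the four proposed calls, implemented over a toy pool so
the counting rules can be seen end to end. The function names and
arguments follow the message; the return types, the pool struct, the
fixed-size list of shared blocks, and pool_init/pool_cleanup are
assumptions and not Apache's real alloc.c. The counter is a plain int, so
sharing among threads would additionally need a lock or atomic updates.]

```c
#include <stddef.h>
#include <stdlib.h>

#define POOL_MAX_SHARED 16

/* Toy stand-in for a pool: it only tracks the shared blocks it references. */
typedef struct pool {
    void *shared[POOL_MAX_SHARED];
    int nshared;
} pool;

/* Header stored in front of each shared block (C11 max_align_t pads it). */
typedef union shared_hdr {
    int refs;
    max_align_t align;
} shared_hdr;

static shared_hdr *hdr_of(void *data)
{
    return (shared_hdr *)data - 1;
}

/* Index of data in the pool's list, or -1 if the pool holds no reference. */
static int pool_find(const pool *p, const void *data)
{
    int i;

    for (i = 0; i < p->nshared; i++)
        if (p->shared[i] == data)
            return i;
    return -1;
}

void pool_init(pool *p)
{
    p->nshared = 0;
}

/* Allocate a block that starts with one reference, held by p. */
void *ap_palloc_share(pool *p, size_t size)
{
    shared_hdr *h;

    if (p->nshared >= POOL_MAX_SHARED)
        return NULL;
    h = malloc(sizeof *h + size);
    if (h == NULL)
        return NULL;
    h->refs = 1;
    p->shared[p->nshared++] = h + 1;
    return h + 1;
}

/* Give p2 a reference to a block p1 already holds. Returns 0 on success. */
int ap_pshare_data(pool *p1, pool *p2, void *data)
{
    if (pool_find(p1, data) < 0)
        return -1;
    if (pool_find(p2, data) >= 0)
        return 0; /* p2 already holds a reference */
    if (p2->nshared >= POOL_MAX_SHARED)
        return -1;
    p2->shared[p2->nshared++] = data;
    hdr_of(data)->refs++;
    return 0;
}

/* Drop p1's reference; the block is freed when the count reaches zero. */
void ap_pshare_free(pool *p1, void *data)
{
    int i = pool_find(p1, data);

    if (i < 0)
        return;
    p1->shared[i] = p1->shared[--p1->nshared];
    if (--hdr_of(data)->refs == 0)
        free(hdr_of(data));
}

/* Number of pools holding data, or 1 if p1 does not know it as shared. */
int ap_pshare_countrefs(pool *p1, void *data)
{
    if (pool_find(p1, data) < 0)
        return 1;
    return hdr_of(data)->refs;
}

/* What pool cleanup would do: release every reference the pool holds. */
void pool_cleanup(pool *p)
{
    while (p->nshared > 0)
        ap_pshare_free(p, p->shared[p->nshared - 1]);
}
```

The linear search in pool_find corresponds to the "list of the shared
blocks" option; the int-handle variant mentioned above would replace it
with an array index.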

In the context of I/O filters, this would mean that each read function
returns a const char *, which should not be cast to a non-const char * (at
least, not without calling ap_pshare_countrefs()).  If a filter screwed
this up, you'd have a problem -- but that's more or less unavoidable with
sharing data among threads using reference counts.

It might make sense to build a more general reference counting system; if
that's what people want, I'm also up for working on that.  But one of the
advantages the pool system has is its simplicity, some of which would be
lost.

Anyway, how does this sound?  Reasonable or absurd?

Thanks --

Ed
               ----------------------------------------
History repeats itself, first as tragedy, second as farce. - Karl Marx

*************************************************************************
From: Ben Hyde <bhyde@pobox.com>
Date: Tue, 29 Dec 1998 11:50:01 -0500 (EST)
To: new-httpd@apache.org
Subject: Re: I/O filters & reference counts
In-Reply-To: <Pine.LNX.3.96.981227192210.10687H-100000@crankshaft>
References: <Pine.GSO.3.96.981227185303.8793B-100000@elaine21.Stanford.EDU>
	<Pine.LNX.3.96.981227192210.10687H-100000@crankshaft>
Message-ID: <13960.60942.186393.799490@zap.ml.org>


There are two problems that reference counts address that we have,
but I still don't like them.

These two are: pipeline memory management, and response paste up.  A
good pipeline ought not _require_ memory proportional to the size of
the response but only proportional to the diameter of the pipe.
Response paste up is interesting because the library of clip art is
longer lived than the response or connection pool.  There is a lot to
be said for leveraging the configuration pool life cycle for this kind
of thing.

The pipeline design, and the handling of the memory it uses become
very entangled after a while - I can't think about one without the
other.  This is the right place to look at this problem.  I.e. this
is a problem to be led by buff.c rework, not alloc.c rework.

Many pipeline operations require tight coupling to primitive
operations that happen to be efficient.  Neat instructions, memory
mapping, etc.  Extreme efficiency in this pipeline makes it desirable
that the chunks in the pipeline be large.  I like the phrase "chunks
and pumps" to summarize that there are two elements to design to get
modularity right here.

The pasteup problem - one yearns for a library of fragments (call it a
cache, clip art, or templates if you like) which readers in that
library can then assemble into responses.  Some librarians like to
discard stale bits and they need a scheme to know that the readers
have all finished.  The library resides in a pool that lives longer
than a single response connection.  If the librarian can be convinced
that the server restart cycles are useful we get to fall back on
those.

I can't smell yet where the paste up problem belongs in the 2.0 design
problem.  (a) in the core, (b) in a module, (c) as a subpart of the
pipeline design, or (d) ostracized outside 2.0 to await a gift (XML?)
we then fold into Apache.  I could probably argue any one of these.  A
good coupling between this mechanism and the pipeline is good, limits
on the pipeline design space are very good.

   - ben


*************************************************************************
Date: Mon, 4 Jan 1999 18:26:36 -0800 (PST)
From: Ed Korthof <ed@bitmechanic.com>
To: new-httpd@apache.org
Subject: Re: I/O filters & reference counts
In-Reply-To: <13960.60942.186393.799490@zap.ml.org>
Message-ID: <Pine.LNX.3.96.981231094653.486R-100000@crankshaft>

On Tue, 29 Dec 1998, Ben Hyde wrote:

> There are two problems that reference counts address that we have,
> but I still don't like them.

They certainly add some clutter.  But they offer a solution to the
problems listed below... and specifically to an issue which you brought up
a while back: avoiding a memcpy in each read layer which has a read
function other than the default one.  Sometimes a memcpy is required,
sometimes not; with "reference counts", you can go either way.

> These two are: pipeline memory management, and response paste up.  A
> good pipeline ought not _require_ memory proportional to the size of
> the response but only proportional to the diameter of the pipe.
> Response paste up is interesting because the library of clip art is
> longer lived than the response or connection pool.  There is a lot to
> be said for leveraging the configuration pool life cycle for this kind
> of thing.

I was indeed assuming that we would use pools which would last from one
restart (and a run through of the configuration functions) to the next.

So far as limiting the memory requirements of the pipeline -- this is
primarily a function of the module programming.  Because the pipeline will
generally live in a single thread (with the possible exception of the data
source, which could be another process), the thread will only be
operating on a single filter at a time (unless you added custom code to
create a new thread to handle one part of the pipeline -- ugg).

For writing, the idea would be to print one or more blocks of text with
each call; wait for the write function to return; and then recycle the
buffers used.

Reading has no writev equivalent, so you'd only be able to do it one block
at a time, but this seems alright to me (reading data is actually a much
less complicated procedure in practice -- at least, with the applications
which I've seen).

Recycling read buffers (so as to limit the size of the memory pipeline)
is the hardest part, when we add in this 'reference count' scheme -- but
it can be done, if the modules receiving the data are polite and indicate
when they're done with the buffer.  Ie.:

    module 1			module 2
1.) reads from module 2:
	char * ap_bread(BUFF *, pool *, int);

2.)				returns a block of text w/ ref counts:
					str= char* ap_pshare_alloc(size_t);
					...
					return str;
				keeps a ref to str.
		
3.) handles the block of data
    returned, and indicates it's
    finished with:
	void ap_pshare_free(char * block);
    reads more data via
	char * ap_bread(BUFF *, pool *, int);

4.)				tries to recycle the buffer used:
					if (ap_pshare_count_refs(str)==1)
						reuse str
					else
						str = ap_pshare_alloc(...)
					...
	 				return str;

5.) handles the block of data
    returned...
...

One disadvantage is that if module 1 doesn't release its hold on a memory
block it got from step 2 until step 5, then the memory block wouldn't be
reused -- you'd pay w/ a free & a malloc (or with a significant increase
in complexity -- I'd probably choose the free & malloc). And if the module
failed to release the memory (via ap_pshare_free), then the memory
requirements would be as large as the response (or request).
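
[Editor's sketch: the recycling decision from steps 2 and 4, in C. To stay
self-contained it uses a small hypothetical refcounted buffer
(shared_buf, buf_release, source_read) in place of the proposed
ap_pshare_* calls, and a string in place of a real data source. The
allocations counter exists only to make the cost of an impolite reader
visible.]

```c
#include <stdlib.h>
#include <string.h>

#define BLOCK_SIZE 16

/* Hypothetical refcounted block, standing in for ap_pshare_alloc memory. */
struct shared_buf {
    int refs;
    size_t len;
    char data[BLOCK_SIZE];
};

/* "module 2": the layer being read from. */
struct source {
    const char *input;
    size_t pos;
    size_t len;
    struct shared_buf *last; /* module 2 keeps a ref to what it returned */
    int allocations;         /* how many blocks had to be malloc'ed */
};

/* Drop one reference; free the block when nobody holds it any more. */
void buf_release(struct shared_buf *b)
{
    if (b != NULL && --b->refs == 0)
        free(b);
}

/* Return the next block, or NULL at end of input or on malloc failure. */
struct shared_buf *source_read(struct source *s)
{
    struct shared_buf *b = s->last;
    size_t n = s->len - s->pos;

    if (n == 0)
        return NULL;
    if (n > BLOCK_SIZE)
        n = BLOCK_SIZE;

    if (b == NULL || b->refs != 1) {
        /* The reader still holds the old block: give it up, get a new one. */
        buf_release(b);
        s->last = NULL;
        b = malloc(sizeof *b);
        if (b == NULL)
            return NULL;
        b->refs = 1; /* module 2's own reference */
        s->last = b;
        s->allocations++;
    }
    /* Otherwise only module 2 holds the block, so it is safe to reuse. */
    memcpy(b->data, s->input + s->pos, n);
    b->len = n;
    s->pos += n;
    b->refs++; /* the reference handed to the caller */
    return b;
}

/* Release module 2's own reference when the stream is finished. */
void source_close(struct source *s)
{
    buf_release(s->last);
    s->last = NULL;
}
```

A reader that calls buf_release before each new read keeps the whole
transfer inside one block; a reader that holds on forces a fresh malloc
per read, which is the cost described above.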

I believe this is only relevant for clients PUTting large files onto their
servers; but w/ files which are potentially many gigabytes, it is
important that filters handling reading do this correctly.  Of course,
that's currently the situation anyhow.

> The pipeline design, and the handling of the memory it uses become
> very entangled after a while - I can't think about one without the
> other.  This is the right place to look at this problem.  I.e. this
> is a problem to be led by buff.c rework, not alloc.c rework.

Yeah, after thinking about it a little bit I realized that no (or very
little) alloc.c work would be needed to implement the system which I
described.  Basically, you'd have an Apache API function which does malloc
on its own, and other functions (also in the API) which register a cleanup
function (for the malloc'ed memory) in appropriate pools. 

IMO, the 'pipeline' is likely to be the easiest place to work with this,
at least in terms of getting the most efficient & clean design which we
can.

[snip good comments]
> I can't smell yet where the paste up problem belongs in the 2.0 design
> problem.  (a) in the core, (b) in a module, (c) as a subpart of the
> pipeline design, or (d) ostracized outside 2.0 to await a gift (XML?)
> we then fold into Apache.  I could probably argue any one of these.  A
> good coupling between this mechanism and the pipeline is good, limits
> on the pipeline design space are very good.

An overdesigned pipeline system (or an overly large one) would definitely
not be helpful.  If it would be useful, I'm happy to work on this (even if
y'all aren't sure if you'd want to use it); if not, I'm sure I can find
things to do with my time. <g>

Anyway, I went to CPAN and got a copy of sfio... the latest version I
found is from Oct, 1997.  I'd guess that using it (assuming this is
possible) might give us slightly less efficiency (simply because sfio
wasn't built specifically for Apache, and customizing it is a much more
involved process), but possibly fewer bugs to work out & lots of
interesting features.

thanks --

Ed, slowly reading through the sfio source code