ossp-pkg/sio/BRAINSTORM/doc_bucket_brigades.txt

To: new-httpd@apache.org
Subject: bucket brigades and IOL
Date: Fri, 12 Nov 1999 23:57:43 -0800
From: "Roy T. Fielding" <fielding@kiwi.ICS.UCI.EDU>
Message-ID:  <199911122357.aa18914@gremlin-relay.ics.uci.edu>

About two years ago I wasted a lot of time writing an Ada95 library
called Onions that provides a stackable stream abstraction for files,
sockets, etc.  It is at <http://www.ics.uci.edu/pub/websoft/libwww-ada95/>
if you want to take a look at it, but I don't recommend looking at the
code since it is almost all just working around Ada95's lack of a
system interface.  I'll describe the worthwhile bits here.

The heart of Onions is the input and output stream object
classes and classwide types for building a data stream via a
stack of stream objects (Input_Pipe and Output_Pipe).  Reading
from the head of an input pipe causes the head stream object
to read from the next outbound stream object, and on down the line.
Likewise for writing to the head of an output pipe.  One of the
main features of streams is that they can filter the data as it
passes, converting, adding to, and/or removing from the data
before giving it to the next object.  Since multiple streams can be
cascaded, the complete data conversion is the sum of the individual
data conversions performed by the stream objects.

So far, no big deal -- this can be manually created by stacking ap_iol
types in a meaningful way.  But the one unique thing I did in Onions was
to abstract the memory handling into something called Buckets and move them
around in Bucket_Brigades.  A bucket is an allocated segment of memory
with pointers to its allocation address and current size.  If I were doing
this in C, I'd also add a pointer to the current start address and the allocated
size, so that a single bucket could be shrunk from both ends without
copying, and a function pointer for freeing it at the stream end.
Note that this is the same type of memory structure that IO-Lite uses,
though developed independently and for different reasons.

A bucket brigade is a list-queue of buckets.  Each of the stream read/write
calls would pass a bucket brigade instead of a single bucket, since this
made insertion by filters more efficient, with the general idea being that
the outbound end of the stream would be writing them out using writev
or reading them in using readv, which is about as efficient as I could
get with Ada95.  [I call it a list-queue instead of just a queue because you
have the choice of removing buckets from (or adding to) the queue one
bucket at a time or an entire linked list of buckets.]

But we could go one step further.  A bucket is an ADT, and as such can
be used as a general handle for read-only memory, read-write memory,
cache object, file handle, mmap handle, file name, URL, whatever.
What if, instead of just a stream of memory, it could pass around a
stream of memory interspersed with file handles or references to
remote objects?  A filter could then add stuff around the stream without
causing too much parsing overhead, and if it needed to look at all the
bytes in the stream it would just replace the bucket handle with a stream
of memory sucked from that handle.  Something like this was talked about
last year (see threads on "Stacking up Response Handling" on 23 Sep 1998
and "I/O filters & reference counts" in late December 1998 and January 1999).
And Dean started something with ap_buf.h, but I don't know how he meant
to finish it.

What I was thinking of was

   typedef enum {
       AP_BUCKET_rwmem,
       AP_BUCKET_rmem,
       AP_BUCKET_file_t,
       AP_BUCKET_mmap_t,
       AP_BUCKET_filename,
       AP_BUCKET_cached_entity,
       AP_BUCKET_URI,
   } ap_bucket_color_t;

   typedef struct ap_bucket_t ap_bucket_t;
   struct ap_bucket_t {
       ap_bucket_color_t color;
       void              *content;
       ap_status_t       (*free)(ap_bucket_t *bucket);
       unsigned int      refcount;
   };

   typedef struct ap_bucket_rwmem_t ap_bucket_rwmem_t;
   struct ap_bucket_rwmem_t {
       void   *alloc_addr;
       size_t  alloc_len;
       void   *addr;
       size_t  len;
   };

   typedef struct ap_bucket_rmem_t ap_bucket_rmem_t;
   struct ap_bucket_rmem_t {
       void   *addr;
       size_t  len;
   };

   typedef struct ap_bucket_filename ap_bucket_filename;
   struct ap_bucket_filename {
       ap_context_t *ctx;
       char         *name;
       ap_stat_t    *stat;  /* useful if already stat'ed */
       ap_aa_t      *conf;  /* access control structure for this file */
   };

   ...

and then

   typedef struct ap_bucket_list_t ap_bucket_list_t;
   struct ap_bucket_list_t {
       ap_bucket_t *bucket;
       ap_bucket_list_t *prev;
       ap_bucket_list_t *next;
   };

   typedef struct ap_brigade_t ap_brigade_t;
   struct ap_brigade_t {
       ap_context_t     *ctx;
       ap_bucket_list_t *first;
       ap_bucket_list_t *last;
       unsigned int     count;
   };

and then construct the input and output streams as pushing these
bucket brigades to or from the client.  The streams would have to
be a little more complicated than Onions, since I learned later that
you also need a parallel stream of header fields (in tokenized form)
in order for it to work with anything HTTP-like.
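
For illustration only, appending to such a brigade -- one bucket at a time or
a whole linked list at once, the list-queue behaviour described earlier --
might look something like this (a sketch; the function names are made up and
error handling is omitted):

   #include <stddef.h>

   /* Assumes the ap_bucket_t, ap_bucket_list_t and ap_brigade_t
      declarations sketched above. */

   /* Add a single bucket entry to the tail of the brigade. */
   void brigade_append(ap_brigade_t *b, ap_bucket_list_t *e)
   {
       e->prev = b->last;
       e->next = NULL;
       if (b->last != NULL)
           b->last->next = e;
       else
           b->first = e;
       b->last = e;
       b->count++;
   }

   /* Splice an entire linked list of buckets onto the tail in one
      operation -- the "whole linked list" case of the list-queue. */
   void brigade_splice(ap_brigade_t *b, ap_bucket_list_t *head,
                       ap_bucket_list_t *tail, unsigned int n)
   {
       head->prev = b->last;
       tail->next = NULL;
       if (b->last != NULL)
           b->last->next = head;
       else
           b->first = head;
       b->last = tail;
       b->count += n;
   }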

Why use an enum instead of a bunch of function pointers for each type
of bucket, kind of like ap_iol?  Because it allows adjacent memory
buckets (the most frequent kind after a filter operation) to be
gathered into a single writev.  Also, we need a way to be able to
set up an operation and figure out what it will produce without
actually performing the operation -- this is for OPTIONS and HEAD.
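
As a rough sketch of that point (illustrative names only, assuming the
declarations above), a stream end could gather a run of adjacent memory
buckets into one writev call just by looking at the color field:

   #include <sys/types.h>
   #include <sys/uio.h>      /* writev(), struct iovec */

   #define RUN_MAX 16        /* arbitrary limit for this sketch */

   /* Assumes the bucket declarations sketched above.  Error handling,
      partial writes and IOV_MAX are ignored here. */
   ssize_t write_memory_run(int fd, ap_bucket_list_t *e,
                            ap_bucket_list_t **rest)
   {
       struct iovec vec[RUN_MAX];
       int n = 0;

       while (e != NULL && n < RUN_MAX &&
              (e->bucket->color == AP_BUCKET_rwmem ||
               e->bucket->color == AP_BUCKET_rmem)) {
           if (e->bucket->color == AP_BUCKET_rwmem) {
               ap_bucket_rwmem_t *m = e->bucket->content;
               vec[n].iov_base = m->addr;
               vec[n].iov_len  = m->len;
           }
           else {
               ap_bucket_rmem_t *m = e->bucket->content;
               vec[n].iov_base = m->addr;
               vec[n].iov_len  = m->len;
           }
           n++;
           e = e->next;
       }
       *rest = e;                     /* first non-memory bucket, if any */
       return (n > 0) ? writev(fd, vec, n) : 0;
   }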

Note that this would completely change the way we handle internal
redirects, subrequests, server-side include, mod_proxy, access control, etc.
And then most of the API hooks would need to change.  I think that is why
Dean was putting it off until 2.1.  The annoying thing is that this is the
most useful rearchitecting of the server -- the MPM, APR, and hook changes
make 2.0 easier/cleaner/faster to port to other platforms, but layering
enables in one fell swoop almost every significant non-config feature
that our users have requested.  A cache would just be a hash table or
btree of file buckets, complete with AA info.

Anyway, that was stuck in the back of my head and had to get out.
I won't be able to work on it until after the dissertation is done,
which every day seems to be further away.  Maybe 3.0, with rHTTP/2.0.

....Roy

=================================================
To: new-httpd@apache.org
Subject: Re: bucket brigades and IOL
In-reply-to: Your message of "Sat, 13 Nov 1999 20:43:58 GMT."
             <382DCD8E.881B8468@algroup.co.uk>
Date: Sun, 14 Nov 1999 22:24:03 -0800
From: "Roy T. Fielding" <fielding@kiwi.ICS.UCI.EDU>
Message-ID:  <199911142224.aa22545@gremlin-relay.ics.uci.edu>

BenL wrote:
>I've got to say that this is the most coherent suggestion along these
>lines that I've seen yet. I rather like it. One thing I'd add is that if
>you are going to have a movable "start of block" pointer, and changeable
>length, it can be nice to allocate extra around the edges under some
>circumstances, so that lower layers can expand the block without having
>to add extra chunks.

Or, alternatively, allocate equal-size blocks and just pass around
a reference pair within the buckets that, when the bucket is freed,
accesses a more complicated reference-counting pool.  I think that is
closer to what IO-Lite does.
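
A rough sketch of that alternative (all names invented for illustration):
fixed-size blocks owned by a reference-counting pool, with each bucket
carrying only an (offset, length) reference into a block:

   #include <stdlib.h>

   typedef struct refblock_t {
       char         *base;       /* the fixed-size allocation        */
       size_t        size;       /* same size for every block        */
       unsigned int  refcount;   /* number of buckets referencing it */
   } refblock_t;

   typedef struct blockref_t {   /* what a bucket would carry        */
       refblock_t   *block;
       size_t        offset;     /* start of this bucket's data      */
       size_t        length;     /* length of this bucket's data     */
   } blockref_t;

   /* Called when a bucket is freed; the block itself goes away only
      when the last reference to it is dropped. */
   static void blockref_release(blockref_t *r)
   {
       if (--r->block->refcount == 0) {
           free(r->block->base);
           free(r->block);
       }
   }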

>Also, the usual objections still apply - i.e. it is awkward to do things
>like searching for particular strings, since they may cross boundaries.
>I'm beginning to think that the right answer to this is to provide nice
>matching functions that know about the chunked structures, and last
>resort functions that'll glue it all back into one chunk...

Yep, that's what I ended up doing for Ada95, though in that case there
were no easier alternatives.

....Roy

=================================================
To: new-httpd@apache.org
Subject: Re: layered I/O (was: cvs commit: ...) 
In-reply-to: Your message of "Wed, 29 Mar 2000 01:21:09 PST."
             <Pine.LNX.4.21.0003290004100.10357-100000@piglet> 
Date: Wed, 29 Mar 2000 02:05:08 -0800
From: "Roy T. Fielding" <fielding@kiwi.ICS.UCI.EDU>
Message-ID:  <200003290205.aa19557@gremlin-relay.ics.uci.edu>

>Selection of IO Layers
>
>The core selects a source module and IO layers based on the urlspace
>configuration.  Content might be generated by mod_perl, and the result is
>piped through mod_chunk, mod_ssl, and mod_net, in turn.  When the content
>generator runs, the core enforces that the module set the content type
>before the first call to ap_bput.  The content type is set by a function
>call.  The function (ap_set_content_type(request_rec *, char *)) examines
>the content type and adds IO layers as necessary.  For server parsed
>html, the core might insert mod_include immediately after mod_perl.

The problem with thinking of it that way is that, as Dean mentioned,
the output of one module may be filtered, and the filter may indicate that
content should be embedded from another URL, which turns out to be a
CGI script that outputs further parseable content.  In this instance,
the goal of layered-IO is to abstract away such behavior so that the
instance is processed recursively and thus doesn't result in some tangled
mess of processing code for subrequests.  Doing it requires that each
layer be able to pass both data and metadata, and have both data and
metadata be processed at each layer (if desired), rather than call a
single function that would set the metadata for the entire response.

My "solution" to that is to pass three interlaced streams -- data,
metadata, and meta-metadata -- through each layer.  The metadata
streams would point to a table of tokenized name-value pairs.
There are lots of ways to do that, going back to my description of
bucket brigades long ago.  Basically, each block of memory would indicate
what type of data it contains, with metadata occurring in a block before
the data block(s) that it describes (just like chunk-size describes
the subsequent chunk-data) and the layers could be dynamically
rearranged based on the metadata that passed through them, in
accordance with the purpose of the filter.
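
One way to picture that (a sketch only; the metadata color, the context type
and the two helper functions are invented for illustration, on top of the
bucket types from the earlier mail):

   /* A hypothetical additional bucket color, AP_BUCKET_metadata, is
      assumed to have been added to the ap_bucket_color_t enum. */
   typedef struct ap_bucket_metadata_t {
       const char *name;         /* tokenized header field name */
       const char *value;        /* field value                 */
   } ap_bucket_metadata_t;

   typedef struct filter_ctx_t filter_ctx_t;   /* per-layer state */
   void note_metadata(filter_ctx_t *ctx, const ap_bucket_metadata_t *m);
   void filter_data(filter_ctx_t *ctx, ap_bucket_t *bucket);

   /* Each layer sees metadata buckets before the data buckets they
      describe, and may reconfigure itself (or the chain) accordingly;
      only the data buckets are actually transformed. */
   void filter_pass(filter_ctx_t *ctx, ap_bucket_list_t *e)
   {
       for (; e != NULL; e = e->next) {
           if (e->bucket->color == AP_BUCKET_metadata)
               note_metadata(ctx, e->bucket->content);
           else
               filter_data(ctx, e->bucket);
       }
   }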

>(Can anyone produce a use case where the IO chain could change after
>output begins?)

Output is a little easier, but that is the normal case for input.
We don't know what filters to apply to the request body until after
we have passed through the HTTP headers, and the HTTP message processor
is itself a filter in this model.

....Roy


=================================================
To: new-httpd@apache.org
Subject: Re: filtering patches
In-reply-to: Your message of "Mon, 10 Jul 2000 15:33:25 PDT."
             <Pine.LNX.4.21.0007101528180.10878-100000@koj.rkbloom.net>
Date: Mon, 10 Jul 2000 16:58:00 -0700
From: "Roy T. Fielding" <fielding@kiwi.ICS.UCI.EDU>
Message-ID:  <200007101657.aa21782@gremlin-relay.ics.uci.edu>

[...]
I meant that the filters, when written to as part of the output stream,
are treated as a stack (write to the top-most filter without any knowledge
of what may lie underneath it).  So the process of arranging filters
for a particular response is like dropping them onto a stack.  When a
filter is done or the stream is closed, each instantiated filter cleans
up according to its local state and then destroys itself (as it is popped
off the stack).

This is completely separate from the registration of filters by
name and purpose, which could be done by hooks.  The difference is that
filters are registered at config time but only instantiated (given local
storage) and arranged on a per stream basis.

A bucket brigade is simply a way to encapsulate and pass data down the stream
such that it can be as efficient as the sender desires, while retaining
a simple interface.  The purpose of the bucket is to make handling of the
data uniform regardless of its type, or make type-specific conversions
via a single ADT call if and only if they are needed by some filter.
The purpose of the brigade is to reduce the number of calling arguments
and linearize the calling sequence for insertion filters.  Each filter
definition is separate from its instantiation on the stream because
there may be many streams operating at once within a single program.
Each bucket is independent of the brigade so that the filters can rearrange
and insert buckets at will.  Each data item is isolated by the bucket
structure, which allows them to be split across child buckets or shared
with multiple streams (e.g., cached objects).  We don't need to implement
all of this on the first pass -- we just need to implement the ADT external
interfaces such that they don't have to change as we make the overall
stream more efficient.
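
In rough code (names invented for illustration; ap_status_t and ap_brigade_t
are the hypothetical types from the earlier mail), that separation might look
something like this:

   typedef struct ap_filter_instance_t ap_filter_instance_t;

   /* Registered once, by name, at config time. */
   typedef struct {
       const char   *name;
       ap_status_t (*filter)(ap_filter_instance_t *self, ap_brigade_t *b);
   } ap_filter_def_t;

   /* Instantiated per stream: local storage plus a place in the stack. */
   struct ap_filter_instance_t {
       const ap_filter_def_t *def;
       void                  *state;   /* this instance's local storage */
       ap_filter_instance_t  *next;    /* the filter underneath it      */
   };

   /* Writing to the stream means handing a brigade to the top-most
      instance; each filter only ever talks to the one below it. */
   ap_status_t pass_brigade(ap_filter_instance_t *next, ap_brigade_t *b)
   {
       return next->def->filter(next, b);
   }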

BTW, in case I didn't make this clear in past messages, this design is
an amalgam of the best aspects of the designs from Henrik's Streams
(see w3c-libwww), sfio (AT&T Research), IO-Lite (Rice Univ.), and
libwww-ada95 (UCI).  The MIME stuff in MSIE is based on Henrik's streams.
Henrik's stuff is very fast, but is spaghetti code because it relies on
callbacks and legacy stuff in libwww.  sfio is great but is intended to
be a complete replacement for stdio and hence does way too much and is
subject to a few patents that I don't appreciate.  IO-Lite is cool, but it
is probably only worth it when the entire OS is based on IO-Lite memory
management, and in any case the code isn't available for commercial use.
As Dean has mentioned many times, we already get most of the performance
benefit of IO-Lite simply by avoiding memory copies on large writes.
libwww-ada95 was an attempt to make Ada95 suck less for systems programming,
which was only a partial success (it is very fast compared to other Ada95
libraries, but memory management became a problem with complex filters).

Writing our own streams library isn't NIH syndrome -- both Dean and I
have independently investigated the other available alternatives and they
just aren't suitable for our purpose.  Even with all that, my own design
only truly pays off (versus plain old BUFF) when you make good use of
sendfile and shared object caching.

[...]


=================================================
Other stuff Roy wrote on new-httpd:

My buckets are passed around in list-queues (really just lists with front
and back pointers).  My buckets carry data and metadata and meta-metadata.
My buckets are used to indicate stream-end, and the filter configuration
itself is determined by the stream content.  It probably sounds weird, but
the effects of this interface are completely different from those of mere
content filters.  They simplify everything.  I'm not saying that we have to
simplify everything right away, but I am saying that it is just as easy
to implement a fully-general filter using bucket brigades as it is
to implement string interface filters -- all of the complex parts
are handled by the ADTs.

...

The real psychedelic stuff happens when you can pass metadata (tokenized
header fields) as buckets and the filters know how to pass that down the
chain without treating them as data.

...

The purpose of passing a list of buckets around is to linearize
the call stack for the frequent case of filtered content
splitting one large bucket into separate buckets with filtered results
interspersed in between.  The effect is that a filter chain can frequently
process an entire message in one pass down the chain, which enables the
stream end to send the entire response in one go and also allows it
to do interesting things like provide a content length by summing the
data length of all the buckets' data, and set a last-modified time
by picking the most recent time from a set of static file buckets.
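
For instance (a sketch only, using the hypothetical bucket types from the
first mail), summing the data lengths works exactly when every bucket's
length is already known without reading the data:

   #include <stddef.h>

   /* Returns 1 and sets *total if the length of every bucket is known
      up front (memory buckets here; stat'ed file buckets could be added
      the same way), otherwise returns 0. */
   int brigade_length(const ap_brigade_t *b, size_t *total)
   {
       const ap_bucket_list_t *e;
       size_t sum = 0;

       for (e = b->first; e != NULL; e = e->next) {
           switch (e->bucket->color) {
           case AP_BUCKET_rwmem:
               sum += ((ap_bucket_rwmem_t *)e->bucket->content)->len;
               break;
           case AP_BUCKET_rmem:
               sum += ((ap_bucket_rmem_t *)e->bucket->content)->len;
               break;
           default:
               return 0;    /* length not known without doing the work */
           }
       }
       *total = sum;
       return 1;
   }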

I think it would help if we stopped using artificial examples.  Let's
try something simple:

       socket <-- http <-- add_footer <-- add_header <-- send_file

send_file calls its filter with an ap_file_t bucket and End-of-Stream (EOS)
in the bucket list.  add_header sets a flag, prepends another ap_file_t
bucket to the list and sends the list to its filter.  add_footer looks
at the list, finds the EOS, inserts another ap_file_t bucket in
front of the EOS, and sends the list on to its filter.  http walks through
the list picking up the (cached) stat values, notes the EOS and, seeing
that its own flag for headers_sent is false, sets the cumulative metadata
and sends the header fields, followed by three calls to the kernel to
send out the three files using whatever mechanism is most efficient.
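
A sketch of add_footer in those terms (the EOS color, make_file_bucket() and
footer.html are invented for illustration; the filter-instance types are the
ones from the earlier sketch):

   /* Hypothetical helper that wraps a named file in a list entry
      holding a file bucket. */
   ap_bucket_list_t *make_file_bucket(const char *path);

   /* Walk the brigade, insert a file bucket for the footer just in
      front of the EOS bucket, and pass the whole list on downstream.
      Assumes the color enum grows a hypothetical AP_BUCKET_eos value. */
   ap_status_t add_footer(ap_filter_instance_t *self, ap_brigade_t *b)
   {
       ap_bucket_list_t *e;

       for (e = b->first; e != NULL; e = e->next) {
           if (e->bucket->color == AP_BUCKET_eos) {
               ap_bucket_list_t *footer = make_file_bucket("footer.html");

               footer->prev = e->prev;
               footer->next = e;
               if (e->prev != NULL)
                   e->prev->next = footer;
               else
                   b->first = footer;
               e->prev = footer;
               b->count++;
               break;
           }
       }
       return pass_brigade(self->next, b);
   }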

The point here isn't that this is the only way to implement filters.
The point is that no other interface can implement them as efficiently.
Not even close.  Yes, there are cases where string filters are just as
efficient as any other design, but there is no case in which they are
more efficient than bucket brigades.  The reason is that being able
to process a list of strings in one call more than offsets the extra
cost of list processing, regardless of the filter type, and allows
for additional features that have benefits for http processing.
Like, for example, being able to determine the entire set of resources
that make up the source of this dynamic resource without teaching every
filter about WebDAV.

...

Making many small calls down the filter chain is something best
avoided, which is why the bucket brigades interface consists of
a linked list of buckets, such that all of the currently available
data can be passed on in a single call.

Being able to handle sendfile, cached objects and subrequests is very
effective at improving efficiency, which is why the buckets are typed.
A filter that needs to do byte-level processing will have to call a
routine to convert the typed bucket into a data stream, but that decision
is delayed until no other choice is available and adds no overhead to
the common cases of non-filtered or pre-/post-filtered objects.
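
That conversion routine might look something like this (a sketch only;
read_whole_file() is an invented helper, and the bucket types are the
hypothetical ones from the first mail):

   #include <stddef.h>

   /* Hypothetical helper: load a named file into a malloc'd buffer. */
   char *read_whole_file(const char *name, size_t *len);

   /* Give a filter byte-level access to a bucket's data.  Memory
      buckets cost nothing; typed buckets are pulled into memory only
      here, at the last possible moment. */
   int bucket_to_memory(ap_bucket_t *b, const void **addr, size_t *len)
   {
       switch (b->color) {
       case AP_BUCKET_rwmem: {
           ap_bucket_rwmem_t *m = b->content;
           *addr = m->addr;
           *len  = m->len;
           return 1;
       }
       case AP_BUCKET_rmem: {
           ap_bucket_rmem_t *m = b->content;
           *addr = m->addr;
           *len  = m->len;
           return 1;
       }
       case AP_BUCKET_filename: {
           ap_bucket_filename *f = b->content;
           char *buf = read_whole_file(f->name, len);
           if (buf == NULL)
               return 0;
           *addr = buf;              /* caller now owns a memory copy */
           return 1;
       }
       default:
           return 0;                 /* other types omitted in this sketch */
       }
   }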

Being able to process header fields (metadata) through the same filter
set as the data is necessary for correctness and simplicity in the
proper ordering of independently developed filter modules, which is
why the buckets can carry metadata on the same stream.  Every filter
has to be knowledgeable about metadata because only the filter knows
whether or not its actions will change the nature of the data.


