To: new-httpd@apache.org
Subject: bucket brigades and IOL
Date: Fri, 12 Nov 1999 23:57:43 -0800
From: "Roy T. Fielding"
Message-ID: <199911122357.aa18914@gremlin-relay.ics.uci.edu>

About two years ago I wasted a lot of time writing an Ada95 library
called Onions that provides a stackable stream abstraction for files,
sockets, etc. It is at if you want to take a look at it, but I don't
recommend looking at the code since it is almost all just working around
Ada95's lack of a system interface. I'll describe the worthwhile bits
here.

The heart of Onions is the input and output stream object classes and
class-wide types for building a data stream via a stack of stream
objects (Input_Pipe and Output_Pipe). Reading from the head of an input
pipe causes the head stream object to read from the next outbound stream
object, and on down the line. Likewise for writing to the head of an
output pipe. One of the main features of streams is that they can filter
the data as it passes, converting, adding to, and/or removing from the
data before giving it to the next object. Since multiple streams can be
cascaded, the complete data conversion is the sum of the individual data
conversions performed by the stream objects.

So far, no big deal -- this can be manually created by stacking ap_iol
types in a meaningful way. But the one unique thing I did in Onions was
abstract the memory handling into something called Buckets and move them
around in Bucket_Brigades.

A bucket is an allocated segment of memory with pointers to its
allocation address and current size. If I were doing this in C, I'd also
add a pointer to the current start address and the allocated size, so
that a single bucket could be shrunk from both ends without copying, and
a function pointer for freeing it at the stream end. Note that this is
the same type of memory structure that IO-Lite uses, though developed
independently and for different reasons.

A bucket brigade is a list-queue of buckets. Each of the stream
read/write calls would pass a bucket brigade instead of a single bucket,
since this made insertion by filters more efficient, with the general
idea being that the outbound end of the stream would be writing them out
using writev or reading them in using readv, which is about as efficient
as I could get with Ada95. [I call it a list-queue instead of just a
queue because you have the choice of removing buckets from (or adding
to) the queue one bucket at a time or an entire linked list of buckets.]

But we could go one step further. A bucket is an ADT, and as such can be
used as a general handle for read-only memory, read-write memory, cache
object, file handle, mmap handle, file name, URL, whatever. What if,
instead of just a stream of memory, it could pass around a stream of
memory interspersed with file handles or references to remote objects?
A filter could then add stuff around the stream without causing too much
parsing overhead, and if it needed to look at all the bytes in the
stream it would just replace the bucket handle with a stream of memory
sucked from that handle. Something like this was talked about last year
(see threads on "Stacking up Response Handling" on 23 Sep 1998 and "I/O
filters & reference counts" in late December 1998 and January 1999).
And Dean started something with ap_buf.h, but I don't know how he meant
to finish it.
What I was thinking of was

    typedef enum {
        AP_BUCKET_rwmem,
        AP_BUCKET_rmem,
        AP_BUCKET_file_t,
        AP_BUCKET_mmap_t,
        AP_BUCKET_filename,
        AP_BUCKET_cached_entity,
        AP_BUCKET_URI,
    } ap_bucket_color_t;

    typedef struct ap_bucket_t ap_bucket_t;
    struct ap_bucket_t {
        ap_bucket_color_t color;
        void *content;
        ap_status_t (*free)(ap_bucket_t *bucket);
        unsigned int refcount;
    };

    typedef struct ap_bucket_rwmem_t ap_bucket_rwmem_t;
    struct ap_bucket_rwmem_t {
        void *alloc_addr;
        size_t alloc_len;
        void *addr;
        size_t len;
    };

    typedef struct ap_bucket_rmem_t ap_bucket_rmem_t;
    struct ap_bucket_rmem_t {
        void *addr;
        size_t len;
    };

    typedef struct ap_bucket_filename ap_bucket_filename;
    struct ap_bucket_filename {
        ap_context_t *ctx;
        char *name;
        ap_stat_t *stat;   /* useful if already stat'ed */
        ap_aa_t *conf;     /* access control structure for this file */
    };

... and then

    typedef struct ap_bucket_list_t ap_bucket_list_t;
    struct ap_bucket_list_t {
        ap_bucket_t *bucket;
        ap_bucket_list_t *prev;
        ap_bucket_list_t *next;
    };

    typedef struct ap_brigade_t ap_brigade_t;
    struct ap_brigade_t {
        ap_context_t *ctx;
        ap_bucket_list_t *first;
        ap_bucket_list_t *last;
        unsigned int count;
    };

and then construct the input and output streams as pushing these bucket
brigades to or from the client. The streams would have to be a little
more complicated than Onions, since I learned later that you also need a
parallel stream of header fields (in tokenized form) in order for it to
work with anything HTTP-like.

Why use an enum instead of a bunch of function pointers for each type of
bucket, kind of like ap_iol? Because it allows adjacent memory buckets
(the most frequent kind after a filter operation) to be gathered into a
single writev. Also, we need a way to be able to set up an operation and
figure out what it will produce without actually performing the
operation -- this is for OPTIONS and HEAD.

Note that this would completely change the way we handle internal
redirects, subrequests, server-side includes, mod_proxy, access control,
etc. And then most of the API hooks would need to change. I think that
is why Dean was putting it off until 2.1. The annoying thing is that
this is the most useful rearchitecting of the server -- the MPM, APR,
and hook changes make 2.0 easier/cleaner/faster to port to other
platforms, but layering enables in one fell swoop almost every
significant non-config feature that our users have requested. A cache
would just be a hash table or btree of file buckets, complete with AA
info.

Anyway, that was stuck in the back of my head and had to get out. I
won't be able to work on it until after the dissertation is done, which
every day seems to be further away. Maybe 3.0, with rHTTP/2.0.

....Roy

=================================================
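[A minimal, self-contained sketch of the writev-gathering point in the
message above: runs of adjacent memory buckets in a brigade are
coalesced into a single writev call, while non-memory buckets would be
handed to sendfile or read in only on demand. The simplified bucket
struct, the singly linked brigade, and flush_brigade() are illustrative
stand-ins, not the proposed ap_bucket_t/ap_brigade_t API.]

    #include <stddef.h>
    #include <sys/uio.h>
    #include <unistd.h>

    #define MAX_IOV 16

    enum bucket_color { BUCKET_RWMEM, BUCKET_FILE };

    struct bucket {
        enum bucket_color color;
        const void *addr;       /* current start of data (memory bucket) */
        size_t len;             /* current length of data                */
        int fd;                 /* open file (file bucket)               */
        struct bucket *next;    /* singly linked brigade, for brevity    */
    };

    /* Gather each run of adjacent memory buckets into one writev call. */
    static void flush_brigade(int out, const struct bucket *b)
    {
        struct iovec iov[MAX_IOV];
        int n = 0;

        for (; b != NULL; b = b->next) {
            if (b->color != BUCKET_RWMEM || n == MAX_IOV) {
                if (n > 0) {             /* flush the run gathered so far */
                    (void) writev(out, iov, n);
                    n = 0;
                }
                if (b->color != BUCKET_RWMEM) {
                    /* a file bucket would go out via sendfile(b->fd, ...) */
                    continue;
                }
            }
            iov[n].iov_base = (void *) b->addr;
            iov[n].iov_len = b->len;
            n++;
        }
        if (n > 0)
            (void) writev(out, iov, n);
    }

    int main(void)
    {
        struct bucket body = { BUCKET_RWMEM, "world\n", 6, -1, NULL };
        struct bucket head = { BUCKET_RWMEM, "hello ", 6, -1, &body };

        flush_brigade(STDOUT_FILENO, &head);  /* one writev, two buckets */
        return 0;
    }

=================================================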
To: new-httpd@apache.org
Subject: Re: bucket brigades and IOL
In-reply-to: Your message of "Sat, 13 Nov 1999 20:43:58 GMT." <382DCD8E.881B8468@algroup.co.uk>
Date: Sun, 14 Nov 1999 22:24:03 -0800
From: "Roy T. Fielding"
Message-ID: <199911142224.aa22545@gremlin-relay.ics.uci.edu>

BenL wrote:
>I've got to say that this is the most coherent suggestion along these
>lines that I've seen yet. I rather like it. One thing I'd add is that if
>you are going to have a movable "start of block" pointer, and changeable
>length, it can be nice to allocate extra around the edges under some
>circumstances, so that lower layers can expand the block without having
>to add extra chunks.

Or, alternatively, allocate equal-size blocks and just pass around a
reference pair within the buckets that, when the bucket is freed,
accesses a more complicated reference-counting pool. I think that is
closer to what IO-Lite does.

>Also, the usual objections still apply - i.e. it is awkward to do things
>like searching for particular strings, since they may cross boundaries.
>I'm beginning to think that the right answer to this is to provide nice
>matching functions that know about the chunked structures, and last
>resort functions that'll glue it all back into one chunk...

Yep, that's what I ended up doing for Ada95, though in that case there
were no easier alternatives.

....Roy

=================================================
To: new-httpd@apache.org
Subject: Re: layered I/O (was: cvs commit: ...)
In-reply-to: Your message of "Wed, 29 Mar 2000 01:21:09 PST."
Date: Wed, 29 Mar 2000 02:05:08 -0800
From: "Roy T. Fielding"
Message-ID: <200003290205.aa19557@gremlin-relay.ics.uci.edu>

>Selection of IO Layers
>
>The core selects a source module and IO layers based on the urlspace
>configuration. Content might be generated by mod_perl, and the result is
>piped through mod_chunk, mod_ssl, and mod_net, in turn. When the content
>generator runs, the core enforces that the module set the content type
>before the first call to ap_bput. The content type is set by a function
>call. The function (ap_set_content_type(request_rec *, char *)) examines
>the content type and adds IO layers as necessary. For server-parsed
>html, the core might insert mod_include immediately after mod_perl.

The problem with thinking of it that way is that, as Dean mentioned, the
output of one module may be filtered, and the filter may indicate that
content should be embedded from another URL, which turns out to be a CGI
script that outputs further parseable content. The goal of layered I/O
is to abstract away such behavior so that each instance is processed
recursively and thus doesn't result in some tangled mess of processing
code for subrequests. Doing that requires that each layer be able to
pass both data and metadata, and that both data and metadata be
processed at each layer (if desired), rather than calling a single
function that sets the metadata for the entire response.

My "solution" to that is to pass three interlaced streams -- data,
metadata, and meta-metadata -- through each layer. The metadata streams
would point to a table of tokenized name-value pairs. There are lots of
ways to do that, going back to my description of bucket brigades long
ago. Basically, each block of memory would indicate what type of data it
holds, with metadata occurring in a block before the data block(s) that
it describes (just as chunk-size describes the subsequent chunk-data),
and the layers could be dynamically rearranged based on the metadata
that passed through them, in accordance with the purpose of the filter.

>(Can anyone produce a use case where the IO chain could change after
>output begins?)

Output is a little easier, but that is the normal case for input. We
don't know what filters to apply to the request body until after we have
passed through the HTTP headers, and the HTTP message processor is
itself a filter in this model.

....Roy

=================================================
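[A toy sketch of that last point: on input, the chain itself changes
after the stream has started, because the HTTP message processor is a
filter that decides, from the parsed header metadata, what to stack
next. The filter struct, insert_after(), and the http_in/dechunk names
are invented for illustration; they are not Apache's filter API.]

    #include <stdio.h>
    #include <string.h>

    struct filter {
        const char *name;
        struct filter *next;   /* next filter toward the request handler */
    };

    /* splice a new filter into the chain directly after 'self' */
    static void insert_after(struct filter *self, struct filter *f)
    {
        f->next = self->next;
        self->next = f;
    }

    int main(void)
    {
        struct filter handler = { "request_handler", NULL };
        struct filter http_in = { "http_in", &handler };
        struct filter dechunk = { "dechunk", NULL };

        /* pretend http_in has just tokenized the request header fields */
        const char *transfer_encoding = "chunked";
        if (strcmp(transfer_encoding, "chunked") == 0)
            insert_after(&http_in, &dechunk);

        /* prints: http_in, dechunk, request_handler */
        for (const struct filter *f = &http_in; f != NULL; f = f->next)
            printf("%s\n", f->name);
        return 0;
    }

=================================================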
To: new-httpd@apache.org
Subject: Re: filtering patches
In-reply-to: Your message of "Mon, 10 Jul 2000 15:33:25 PDT."
Date: Mon, 10 Jul 2000 16:58:00 -0700
From: "Roy T. Fielding"
Message-ID: <200007101657.aa21782@gremlin-relay.ics.uci.edu>

[...]

I meant that the filters, when written to as part of the output stream,
are treated as a stack (write to the top-most filter without any
knowledge of what may lie underneath it). So the process of arranging
filters for a particular response is like dropping them onto a stack.
When a filter is done or the stream is closed, each instantiated filter
cleans up according to its local state and then destroys itself (as it
is popped off the stack). This is completely separate from the
registration of filters by name and purpose, which could be done by
hooks. The difference is that filters are registered at config time but
only instantiated (given local storage) and arranged on a per-stream
basis.

Bucket brigades are simply a way to encapsulate and pass data down the
stream such that it can be as efficient as the sender desires, while
retaining a simple interface. The purpose of the bucket is to make
handling of the data uniform regardless of its type, or to make
type-specific conversions via a single ADT call if and only if they are
needed by some filter. The purpose of the brigade is to reduce the
number of calling arguments and linearize the calling sequence for
insertion filters.

Each filter definition is separate from its instantiation on the stream
because there may be many streams operating at once within a single
program. Each bucket is independent of the brigade so that the filters
can rearrange and insert buckets at will. Each data item is isolated by
the bucket structure, which allows it to be split across child buckets
or shared with multiple streams (e.g., cached objects). We don't need to
implement all of this on the first pass -- we just need to implement the
ADT external interfaces such that they don't have to change as we make
the overall stream more efficient.

BTW, in case I didn't make this clear in past messages, this design is
an amalgam of the best aspects of the designs from Henrik's Streams (see
w3c-libwww), sfio (AT&T Research), IO-Lite (Rice Univ.), and
libwww-ada95 (UCI). The MIME stuff in MSIE is based on Henrik's streams.
Henrik's stuff is very fast, but it is spaghetti code because it relies
on callbacks and legacy stuff in libwww. sfio is great but is intended
to be a complete replacement for stdio and hence does way too much, and
it is subject to a few patents that I don't appreciate. IO-Lite is cool
but is probably only worth it when the entire OS is based on IO-Lite
memory management, and regardless the code isn't available for
commercial use. As Dean has mentioned many times, we already get most of
the performance benefit of IO-Lite simply by avoiding memory copies on
large writes. libwww-ada95 was an attempt to make Ada95 suck less for
systems programming, which was only a partial success (it is very fast
compared to other Ada95 libraries, but memory management became a
problem with complex filters).

Writing our own streams library isn't NIH syndrome -- both Dean and I
have independently investigated the other available alternatives and
they just aren't suitable for our purpose. Even with all that, my own
design only truly pays off (versus plain old BUFF) when you make good
use of sendfile and shared object caching.

[...]

=================================================

Other stuff Roy wrote on new-httpd:

My buckets are passed around in list-queues (really just lists with
front and back pointers). My buckets carry data and metadata and
meta-metadata.
My buckets are used to indicate stream-end, and the filter configuration
itself is determined by the stream content. It probably sounds weird,
but the effects of this interface are completely different from those of
mere content filters. They simplify everything. I'm not saying that we
have to simplify everything right away, but I am saying that it is just
as easy to implement a fully general filter using bucket brigades as it
is to implement string-interface filters -- all of the complex parts are
handled by the ADTs.

...

The real psychedelic stuff happens when you can pass metadata (tokenized
header fields) as buckets and the filters know how to pass that down the
chain without treating it as data.

...

The purpose of passing a list of buckets around is to linearize the call
stack for the frequent case of filtered content splitting one large
bucket into separate buckets with filtered results interspersed in
between. The effect is that a filter chain can frequently process an
entire message in one pass down the chain, which enables the stream end
to send the entire response in one go, and which also allows it to do
interesting things like providing a content length by summing the
lengths of all the buckets' data, and setting a last-modified time by
picking the most recent time from a set of static file buckets.

I think it would help if we stopped using artificial examples. Let's try
something simple:

    socket <-- http <-- add_footer <-- add_header <-- send_file

send_file calls its filter with an ap_file_t bucket and End-of-Stream
(EOS) in the bucket list. add_header sets a flag, prepends another
ap_file_t bucket to the list, and sends the list to its filter.
add_footer looks at the list, finds the EOS, inserts another ap_file_t
bucket in front of the EOS, and sends the list on to its filter. http
walks through the list picking up the (cached) stat values, notes the
EOS and, seeing that its own headers_sent flag is false, sets the
cumulative metadata and sends the header fields, followed by three calls
to the kernel to send out the three files using whatever mechanism is
most efficient.

The point here isn't that this is the only way to implement filters. The
point is that no other interface can implement them as efficiently. Not
even close. Yes, there are cases where string filters are just as
efficient as any other design, but there is no case in which they are
more efficient than bucket brigades. The reason is that being able to
process a list of strings in one call more than offsets the extra cost
of list processing, regardless of the filter type, and allows for
additional features that have benefits for HTTP processing. Like, for
example, being able to determine the entire set of resources that make
up the source of this dynamic resource without teaching every filter
about WebDAV.

...

Making many small calls down the filter chain is something best avoided,
which is why the bucket brigades interface consists of a linked list of
buckets, such that all of the currently available data can be passed on
in a single call. Being able to handle sendfile, cached objects, and
subrequests is very effective at improving efficiency, which is why the
buckets are typed. A filter that needs to do byte-level processing will
have to call a routine to convert the typed bucket into a data stream,
but that decision is delayed until no other choice is available, and it
adds no overhead to the common cases of non-filtered or
pre-/post-filtered objects.
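[A sketch of the list manipulation in the socket <-- http <-- add_footer
<-- add_header <-- send_file example above: add_footer walks the brigade
and inserts a file bucket just in front of the EOS bucket. The types and
names are simplified stand-ins for the proposed ADTs, not actual httpd
interfaces; a file bucket is represented here by just a filename
string.]

    #include <stddef.h>
    #include <stdio.h>

    enum color { B_FILE, B_EOS };

    struct bucket {
        enum color color;
        const char *filename;      /* B_FILE buckets only */
        struct bucket *next;
    };

    /* insert 'footer' immediately before the EOS bucket of the brigade */
    static void add_footer(struct bucket **head, struct bucket *footer)
    {
        struct bucket **p = head;
        while (*p != NULL && (*p)->color != B_EOS)
            p = &(*p)->next;
        footer->next = *p;         /* the EOS bucket (or NULL) */
        *p = footer;
    }

    int main(void)
    {
        struct bucket eos    = { B_EOS,  NULL,          NULL };
        struct bucket body   = { B_FILE, "body.html",   &eos };
        struct bucket header = { B_FILE, "header.html", &body };
        struct bucket *brigade = &header;  /* add_header prepended this */

        struct bucket footer = { B_FILE, "footer.html", NULL };
        add_footer(&brigade, &footer);

        /* http can now send the header fields once, then three files */
        for (const struct bucket *b = brigade; b != NULL; b = b->next)
            printf("%s\n", b->color == B_EOS ? "[EOS]" : b->filename);
        return 0;
    }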
Being able to process header fields (metadata) through the same filter
set as the data is necessary for correctness and simplicity in the
proper ordering of independently developed filter modules, which is why
the buckets can carry metadata on the same stream. Every filter has to
be knowledgeable about metadata because only the filter knows whether or
not its actions will change the nature of the data.
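[A sketch of that last point: a filter's dispatch loop sees metadata
buckets on the same stream as the data and passes them down untouched,
except where its own transformation invalidates them; here, a
footer-adding filter drops a stale Content-Length token. The bucket
layout, colors, and function names are invented for illustration.]

    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>

    enum color { B_META, B_DATA, B_EOS };

    struct bucket {
        enum color color;
        const char *name, *value;  /* metadata: tokenized field name/value */
        const char *data;          /* data: bytes (simplified)             */
        struct bucket *next;
    };

    /* stand-in for handing a bucket to the next filter in the chain */
    static void pass_down(const struct bucket *b)
    {
        if (b->color == B_META)
            printf("meta  %s: %s\n", b->name, b->value);
        else if (b->color == B_DATA)
            printf("data  %s", b->data);
        else
            printf("[EOS]\n");
    }

    /* Adding a footer changes the entity length, so this filter cannot
     * let a stale Content-Length token travel down the chain; all other
     * metadata passes through untouched, on the same stream as the data. */
    static void add_footer_filter(const struct bucket *brigade)
    {
        static const struct bucket footer =
            { B_DATA, NULL, NULL, "-- footer --\n", NULL };

        for (const struct bucket *b = brigade; b != NULL; b = b->next) {
            if (b->color == B_META && strcmp(b->name, "Content-Length") == 0)
                continue;                /* no longer correct: drop it   */
            if (b->color == B_EOS)
                pass_down(&footer);      /* new data just before the EOS */
            pass_down(b);
        }
    }

    int main(void)
    {
        struct bucket eos  = { B_EOS,  NULL, NULL, NULL, NULL };
        struct bucket body = { B_DATA, NULL, NULL, "Hello\n", &eos };
        struct bucket clen = { B_META, "Content-Length", "6", NULL, &body };
        struct bucket ctyp = { B_META, "Content-Type", "text/plain", NULL, &clen };

        add_footer_filter(&ctyp);
        return 0;
    }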