goals?

we need an i/o abstraction which has these properties:

- buffered and non-buffered modes

  The buffered mode should look like FILE *. The non-buffered mode
  should look more like read(2)/write(2).

- blocking and non-blocking modes

  The blocking mode is the "easy" mode -- it's what most module writers
  will see. The non-blocking mode is the "hard" mode; this is where
  module writers wanting to squeeze out some speed will have to play.
  In order to build async/sync hybrid models we need the non-blocking
  i/o abstraction.

- timed reads and writes (for blocking cases)

  This is part of my jihad against asynchronous notification.

- i/o filtering or layering

  Yet another Holy Grail of computing. But I digress. These are hard
  when you take into consideration non-blocking i/o -- you have to keep
  lots of state. I expect our core filters will all support
  non-blocking i/o, well at least the ones I need to make sure we kick
  ass on benchmarks. A filter can deny a switch to non-blocking mode;
  the server will have to recover gracefully (ha).

- copy-avoidance

  Hey, what about zero copy a la IO-Lite? After having experienced it
  in a production setting I'm no longer convinced of its benefits.
  There is an enormous amount of overhead keeping lists of buffers,
  reference counts, cleanup functions, and such, which requires a lot
  of tuning to get right. I think there may be something here, but
  it's not a cakewalk. What I do know is that the heuristics I put
  into apache-1.3 to choose writev() at times are almost as good as
  what you can get from doing full zero-copy in the cases we
  *currently* care about. To put it another way, let's wait another
  generation to deal with zero copy. But sendfile/transmitfile/etc. --
  those are still interesting.

  So instead of listing "zero copy" as a property, I'll list
  "copy-avoidance".

So far?

- ap_bungetc added
- ap_blookc changed to return the character, rather than take a
  char *buff
- in theory, errno is always useful on return from a BUFF routine
- ap_bhalfduplex, B_SAFEREAD will be re-implemented using a layer,
  I think
- chunking gone for now, will return as a layer
- ebcdic gone for now... it should be a layer
- ap_iol.h defined, first crack at the layers...

Step back a second to think on it. Much like we have fread(3) and
read(2), I've got a BUFF and an ap_iol abstraction. An ap_iol could
use a BUFF if it requires some form of buffering, but many won't
require buffering... or can do a better job themselves. Consider
filters such as:

- ebcdic -> ascii
- encryption
- compression

These all share the property that no matter what, they're going to
make an extra copy of the data. In some cases they can do it in place
(read) or into a fixed buffer... in most cases their buffering
requirements are different from what BUFF offers.

Consider a filter such as chunking. This could actually use the
writev method to get its job done... it depends on the chunks being
used. This is where zero-copy would be really nice, but we can get by
with a few heuristics.

At any rate -- the NSPR folks didn't see any reason to include a
buffered i/o abstraction on top of their layered i/o abstraction...
so I feel like I'm not the only one who's thinking this way.

- iol_unix.c implemented... should hold us for a bit
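For flavor, here is a rough sketch of what an iol layer's vtable might
look like. The member names below are illustrative guesses for this
document, not the actual contents of ap_iol.h:

    #include <sys/uio.h>    /* struct iovec */

    /* hypothetical layer vtable -- consult ap_iol.h for the real one */
    typedef struct ap_iol ap_iol;

    typedef struct {
        int (*read)(ap_iol *iol, char *buf, int len);
        int (*write)(ap_iol *iol, const char *buf, int len);
        int (*writev)(ap_iol *iol, const struct iovec *vec, int nvec);
        int (*setopt)(ap_iol *iol, int opt, const void *val);
                               /* e.g. blocking vs. non-blocking mode */
        int (*close)(ap_iol *iol);
    } ap_iol_methods;

    struct ap_iol {
        const ap_iol_methods *methods;
        void *state;  /* an fd, or the next iol down for a filter layer */
    };

The point of the indirection is that a filter (ebcdic, encryption,
compression) is just another ap_iol whose state points at the iol
below it, while a plain socket iol's state is the descriptor itself.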
==============================

Date: Mon, 10 Apr 2000 14:39:48 -0700 (PDT)
From: dean gaudet
To: new-httpd@apache.org
Subject: Re: Buff should be an I/O layer
In-Reply-To: <20000410123109.C3931@manojk.users.mindspring.com>
Message-ID:

[hope you don't mind me taking this back to new-httpd so that it's
archived this time :)]

On Mon, 10 Apr 2000, Manoj Kasichainula wrote:

> On Mon, Mar 27, 2000 at 04:48:23PM -0800, Dean Gaudet wrote:
> > On Sat, 25 Mar 2000, Manoj Kasichainula wrote:
> > > (aside: Though my unschooled brain still sees no
> > > problem if our chunking layer maintains a pile of 6-byte blocks
> > > that get used in an iol_writev. I'll read the archived
> > > discussions.)
> >
> > there's little in the way of archived discussions, there's just me
> > admitting that i couldn't find a solution which was not complex.
>
> OK, there's got to be something wrong with this:
>
>     chunk_iol->iol_write(char *buffer) {
>         pull a 10-byte (or whatever) piece out of our local stash
>         construct a chunk header in it
>         set the iovec = chunk header + buffer
>         writev(iovec)
>     }
>
> But what is it?

when i was doing the new apache-2.0 buffering i was focusing a lot on
supporting non-blocking sockets so we could do the async i/o stuff --
and to support a partial write you need to keep more state than what
your suggestion has.

also, the real complexity comes when you consider handling a pipelined
HTTP/1.1 connection -- consider what happens when you get 5 requests
for /cgi-bin/printenv smack after the other. if you do that against
apache-1.3 and the current apache-2.0 you get back maximally packed
packets. but if you make chunking a layer then every time you
add/remove the layer you'll cause a packet boundary -- unless you add
another buffering layer... or otherwise shift around the buffering.
as a reminder, visit for a description of how much we win on the wire
from such an effort.

also, at some point i worry that passing the kernel dozens of tiny
iovecs is more expensive than an extra byte copy into a staging
buffer, and passing it one large buffer. but i haven't done any
benchmarks to prove this. (my suspicions have to do with the way that
at least the linux kernel's copying routine is written regarding
aligned copies)

oh, it's totally worth pointing out that at least Solaris allows at
most 16 iovecs in a single writev()... which probably means every sysv
derived system is similarly limited. linux sets the limit at 1024.
freebsd has an optimisation for up to 8, but otherwise handles 1024.

i'm still doing work in this area though -- after all my ranting about
zero-copy a few weeks back i set out to prove myself wrong by writing
a zero-copy buffering library using every trick in my book. i've no
results to share yet though.

-dean
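To make the partial-write point concrete: on a non-blocking socket a
writev() can stop anywhere -- inside the chunk-size line, inside the
body, or inside the trailing CRLF -- so the layer has to remember
where it stopped. A sketch (not the 2.0 code; names invented here):

    #include <stddef.h>     /* size_t */
    #include <sys/uio.h>    /* struct iovec */

    /* illustrative only: resumable state for one chunk in flight */
    struct chunk_state {
        char hdr[12];         /* "%x\r\n" chunk-size line */
        struct iovec vec[3];  /* header, body, trailing "\r\n" */
        int nvec;             /* pieces in this chunk */
        int curvec;           /* piece we are part-way through */
        size_t offset;        /* bytes of vec[curvec] already written */
    };

    /* after a short writev(), advance the state instead of
     * starting the chunk over */
    static void chunk_advance(struct chunk_state *cs, size_t written)
    {
        while (written > 0 && cs->curvec < cs->nvec) {
            size_t left = cs->vec[cs->curvec].iov_len - cs->offset;
            if (written < left) {
                cs->offset += written;  /* stopped mid-piece; resume here */
                return;
            }
            written -= left;            /* this piece is fully out */
            cs->offset = 0;
            cs->curvec++;
        }
    }

Manoj's version has no equivalent of curvec/offset, which is exactly
the state dean says you need once the socket can return short writes.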
==============================

Date: Tue, 2 May 2000 15:51:30 +0200
From: Martin Kraemer
To: new-httpd@apache.org
Subject: BUFF, IOL, Chunking, and Unicode in 2.0 (long)
Message-ID: <20000502155129.A10548@pgtm0035.mch.sni.de>

Sorry for a long silence in the past weeks, I've been busy with other
stuff.

Putting the catch-words "Chunking, Unicode and 2.0" into the subject
was on purpose: I didn't want to scare off anyone because of the word
EBCDIC: the problems I describe here, and the proposed new buff.c
layering, are mostly independent from the EBCDIC port.

In the past weeks, I've been thinking about today's buff.c (and have
studied its applicability for automatic conversion stuff like in the
russian apache, see apache.lexa.ru). I think it would be neat to be
able to do automatic character set conversion in the server, for
example by negotiation (when the client sends an Accept-Charset and
the server doesn't have a document with exactly the right charset,
but knows how to generate it from an existing representation).

IMO it is a recurring problem:

* not only in today's russian internet environment (de facto, browsers
  support 5 different cyrillic character sets, but the server doesn't
  want to hold every document in 5 copies, so an automatic translation
  is performed by the russian apache, depending on information
  supplied by the client, or by explicit configuration). One of the
  supported character sets is Unicode (UTF-7 or UTF-8).

* in japanese/chinese environments, support for 16 bit character sets
  is an absolute requirement. (Other oriental scripts like Thai get
  along with 8 bit: they only have 44 consonants and 16 vowels.)
  Having success on the eastern markets depends to a great deal on
  having support for these character sets. The japanese Apache
  community hasn't had much contact with new-httpd in the past, but
  I'm absolutely sure that there is a "standard japanese patch" for
  Apache which would well be worth integrating into the standard
  distribution. (Anyone on the list to provide a pointer?)

* In the future, more and more browsers will support unicode, and so
  will the demand grow for servers supporting unicode. Why not
  integrate ONE solution for the MANY problems worldwide?

* The EBCDIC port of 1997 was a simple solution for a rather simple
  problem. If we "do it right" for 2.0 and provide a generic
  translation layer, we solve many problems in a single blow. The
  EBCDIC translation would be only one of them.

Jeff has been digging through the EBCDIC stuff and apparently
succeeded in porting a lot of the 1.3 stuff to 2.0 already. Jeff, I'd
sure be interested in having a look at it.

However, when I looked at buff.c and the new iol_* functionality, I
found out that iol's are not the way to go: they give us no solution
for any of the conversion problems:

* iol's sit below BUFF. Therefore, they don't have enough information
  to know which part of the written byte stream is net client data,
  and which part is protocol information (chunks, MIME headers for
  multipart/*).

* iol's don't allow simplification of today's chunking code. It is
  spread throughout buff.c and there's a very hairy balance between
  efficiency and code correctness. Re-adding (EBCDIC/UTF) conversion,
  possibly with support for multi byte character sets (MBCS), would
  make a code nightmare out of it. (buff.c in 1.3 was "almost" a
  nightmare because we had only single byte translations.)

* Putting conversion at a hierarchy level any higher than buff.c is no
  solution either: for chunks, as well as for multipart headers and
  buffering boundaries, we need character set translation. Pulling it
  to a higher level means that a lot of redundant information has to
  be passed down and up.

In my understanding, we need a layered buff.c (which I number from 0
upwards):

0) at the lowest layer, there's a "block mode" which basically
   supports bread/bwrite/bwritev by calling the equivalent iol_*
   routines. It doesn't know about chunking, conversion, buffering and
   the like. All it does is read/write with error handling.

1) the next layer handles chunking. It knows about the current
   chunking state and adds chunking information into the written byte
   stream at appropriate places. It does not need to know about
   buffering, or what the current (ebcdic?) conversion setting is.
2) this layer handles conversion. I was thinking about a concept where
   a generic character set conversion would be possible based on
   Unicode-to-any translation tables. This would also deal with
   multibyte character sets, because at this layer, it would be easy
   to convert SBCS to MBCS.

   Note that conversion *MUST* be positioned above the chunking layer
   and below the buffering layer. The former guarantees that chunking
   information is not converted twice (or not at all), and the latter
   guarantees that ap_bgets() is looking at the converted data --
   otherwise it would fail to find the '\n' which indicates
   end-of-line.

   Using (loadable?) translation tables based on unicode definitions
   is a very similar approach to what libiconv offers you (see
   http://clisp.cons.org/~haible/packages-libiconv.html -- though my
   inspiration came from the russian apache, and I only heard about
   libiconv recently). Every character set can be defined as a list of
   pairs, and translations between several SBCS's can be collapsed
   into a single 256 char table (see the sketch below). Efficiently
   building them once only, and finding them fast, is an optimization
   task.

3) This last layer adds buffering to the byte stream of the lower
   layers. Because chunking and translation have already been dealt
   with, it only needs to implement efficient buffering. Code
   complexity is reduced to simple stdio-like buffering.

Creating a BUFF stream involves creation of the basic (layer 0) BUFF,
and then pushing zero or more filters (in the right order) on top of
it. Usually, this will always add the chunking layer, optionally add
the conversion layer, and usually add the buffering layer (look for
ap_bcreate() in the code: it almost always uses B_RD/B_WR). Here's
code from a conceptual prototype I wrote:

    BUFF *buf = ap_bcreate(NULL, B_RDWR), *chunked, *buffered;

    chunked  = ap_bpush_filter(buf, chunked_filter, 0);
    buffered = ap_bpush_filter(chunked, buffered_filter, B_RDWR);

    ap_bputs("Data for buffered ap_bputs\n", buffered);

Using a BUFF stream doesn't change: simply invoke the well known API
and call ap_bputs() or ap_bwrite() as you would today. Only, these
would be wrapper macros

    #define ap_bputs(data, buf)             buf->bf_puts(data, buf)
    #define ap_bwrite(buf, data, max, lenp) buf->bf_write(buf, data, max, lenp)

where a BUFF struct would hold function pointers and flags for the
various levels' input/output functions, in addition to today's BUFF
layout.

For performance improvement, the following can be added to taste:

* less buffering (zero copy where possible), by putting the buffers
  for buffered reading/writing down as far as possible (for SBCS: from
  layer 3 to layer 0). By doing this, the buffer can also hold a
  chunking prefix (used by layer 1) in front of the buffering buffer,
  to reduce the number of vectors in a writev, or the number of copies
  between buffers. Each layer could indicate whether it needs a
  private buffer or not.

* intra-module calls can be hardcoded to call the appropriate lower
  layer directly, instead of using the ap_bwrite() etc. macros. That
  means we don't use the function pointers all the time, but instead
  call the lower levels directly. OTOH we have the iol_* stuff which
  uses function pointers anyway. We decided in 1.3 that we wanted to
  avoid the C++-type stuff (esp. function pointers) for performance
  reasons. But it would sure reduce the code complexity a lot.
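A sketch of the table-collapsing idea from layer 2 above. The helper
names here are invented for illustration, not taken from Martin's
prototype: two SBCS mappings are composed via Unicode into one
256-entry table, after which translation is a single in-place lookup
per byte.

    #include <stddef.h>

    /* illustrative only: collapse "charset A -> Unicode -> charset B"
     * into a single 256-entry table, built once */
    void compose_sbcs_table(unsigned char out[256],
                            const unsigned short a_to_unicode[256],
                            unsigned char (*unicode_to_b)(unsigned short))
    {
        int c;
        for (c = 0; c < 256; c++)
            out[c] = unicode_to_b(a_to_unicode[c]);
    }

    /* translating a buffer is then one lookup per byte, in place */
    void translate_inplace(unsigned char *buf, size_t len,
                           const unsigned char table[256])
    {
        size_t i;
        for (i = 0; i < len; i++)
            buf[i] = table[buf[i]];
    }

This is why the SBCS case needs no extra buffering: the conversion can
run over whatever buffer the layer below already holds.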
The resulting layering would look like this:

    | Caller: using ap_bputs() |  or ap_bgets/ap_bwrite etc.
    +--------------------------+
    | Layer 3: Buffered I/O    |  gets/puts/getchar functionality
    +--------------------------+
    | Layer 2: Code Conversion |  (optional conversions)
    +--------------------------+
    | Layer 1: Chunking Layer  |  Adding chunks on writes
    +--------------------------+
    | Layer 0: Binary Output   |  bwrite/bwritev, error handling
    +--------------------------+
    | iol_* functionality      |  basic i/o
    +--------------------------+
    | apr_* functionality      |  ....

--
| Fujitsu Siemens
Fon: +49-89-636-46021, FAX: +49-89-636-41143 | 81730 Munich, Germany

==============================

Date: Tue, 2 May 2000 09:09:28 -0700 (PDT)
From: dean gaudet
To: new-httpd@apache.org
Subject: Re: BUFF, IOL, Chunking, and Unicode in 2.0 (long)
In-Reply-To: <20000502155129.A10548@pgtm0035.mch.sni.de>
Message-ID:

On Tue, 2 May 2000, Martin Kraemer wrote:

> * iol's sit below BUFF. Therefore, they don't have enough information
>   to know which part of the written byte stream is net client data,
>   and which part is protocol information (chunks, MIME headers for
>   multipart/*).

there's not much stopping you from writing an iol which takes a BUFF *
in its initialiser, and then bcreating a second BUFF, and bpushing
your iol. like:

    /* this is in r->pool rather than r->connection->pool because
     * we expect to create & destroy this inside request boundaries
     * and if we stuck it in r->connection->pool the storage wouldn't
     * be reclaimed early enough on pipelined connections.
     *
     * also, no need for buffering in new_buff because the translation
     * layer can easily assume the lower level BUFF is doing the
     * buffering.
     */
    new_buff = ap_bcreate(r->pool, B_WR);
    ap_bpush_iol(new_buff,
        ap_utf8_to_ebcdic(r->pool, r->connection->client));
    r->connection->client = new_buff;

main problem is that the new_buff only works for writing, and you
potentially need a separate conversion layer for reading from the
client. shouldn't be too hard to split up r->connection->client into
a read and write half.

think of iol as the equivalent of the low level read/write, and BUFF
as the equivalent of FILE *. there's a reason for both layers in the
interface.

> * iol's don't allow simplification of today's chunking code. It is
>   spread throughout buff.c and there's a very hairy balance between
>   efficiency and code correctness. Re-adding (EBCDIC/UTF) conversion,
>   possibly with support for multi byte character sets (MBCS), would
>   make a code nightmare out of it. (buff.c in 1.3 was "almost" a
>   nightmare because we had only single byte translations.)

as i've said before, i welcome anyone to do it otherwise without
adding network packets, without adding unnecessary byte copies, and
without making it even more complex. until you've tried it, it's
pretty easy to just say "this is a mess". once you've tried it i
suspect you'll discover why it is a mess.

that said, i'm still trying to prove to myself that the zero-copy crud
necessary to clean this up can be done in a less complex manner.

> * Putting conversion at a hierarchy level any higher than buff.c is
>   no solution either: for chunks, as well as for multipart headers
>   and buffering boundaries, we need character set translation.
>   Pulling it to a higher level means that a lot of redundant
>   information has to be passed down and up.

huh? HTTP is in ASCII -- you don't need any conversion -- if a
chunking BUFF below a converting BUFF/iol is writing those things in
ascii it works. no? at least that's my understanding of the code in
1.3.

you wouldn't do the extra BUFF layer above until after you've written
the headers into the plain-text BUFF. i would expect you'd:

    write headers through plain text BUFF
    push conversion BUFF
    run method
    pop conversion BUFF
    pump multipart header
    push conversion BUFF
    ...
    pop conversion BUFF
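A rough rendition of that push/pop sequence in code -- a sketch only:
send_headers() and run_method() are stand-ins invented here, and
ap_utf8_to_ebcdic is the hypothetical converting iol from the example
above.

    /* sketch: wrap the client BUFF for the response body, then
     * restore the plain-text BUFF for the next protocol element */
    static void respond_with_conversion(request_rec *r)
    {
        BUFF *plain = r->connection->client;
        BUFF *conv;

        send_headers(r, plain);            /* HTTP headers stay ASCII */

        conv = ap_bcreate(r->pool, B_WR);  /* unbuffered: layer below buffers */
        ap_bpush_iol(conv, ap_utf8_to_ebcdic(r->pool, plain));
        r->connection->client = conv;      /* "push" the conversion BUFF */

        run_method(r);                     /* body bytes get converted */

        ap_bflush(conv);
        r->connection->client = plain;     /* "pop" back to plain text */
    }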
> In my understanding, we need a layered buff.c (which I number from 0
> upwards):

you've already got it :)

> | Caller: using ap_bputs() |  or ap_bgets/ap_bwrite etc.
> +--------------------------+
> | Layer 3: Buffered I/O    |  gets/puts/getchar functionality
> +--------------------------+
> | Layer 2: Code Conversion |  (optional conversions)
> +--------------------------+
> | Layer 1: Chunking Layer  |  Adding chunks on writes
> +--------------------------+
> | Layer 0: Binary Output   |  bwrite/bwritev, error handling
> +--------------------------+
> | iol_* functionality      |  basic i/o
> +--------------------------+
> | apr_* functionality      |

there are two cases you need to consider: chunking and a partial write
occurs -- you need to keep track of how much of the chunk
header/trailer was written so that on the next loop around (which
happens in the application at the top) you continue where you left
off.

and more importantly at the moment, and easier to grasp -- consider
what happens when you've got a pipelined connection. a dozen requests
come in from the client, and apache-1.3 will send back the minimal
number of packets. 2.0-current still needs fixing in this area
(specifically saferead needs to be implemented). for example, suppose
the client sends one packet:

    GET /images/a.gif HTTP/1.1
    Host: foo

    GET /images/b.gif HTTP/1.1
    Host: foo

suppose that a.gif and b.gif are small 200 byte files. apache-1.3
sends back one response packet:

    HTTP/1.1 OK
    headers
    a.gif body
    HTTP/1.1 OK
    headers
    b.gif body

consider what happens with your proposal. in between each of those
requests you remove the buffering -- which means you have to flush a
packet boundary. so your proposal generates two network packets.

like i've said before on this topic -- if all unixes had TCP_CORK,
it'd be a breeze. but only linux has TCP_CORK.

you pretty much require a layer of buffering right above the iol which
talks to the network. and once you put that layer of buffering there,
you might as well merge chunking into it, because chunking needs
buffering as well (specifically for the async i/o case). and then you
either have to double-buffer, or you can only stack non-buffered
layers above it. fortunately, character-set conversion should be
doable without any buffering.

*or* you implement a zero-copy library, and hope it all works out in
the end.

-dean
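For reference, the TCP_CORK trick dean mentions, sketched for Linux
(the only system of the time that had it, which is his point); fd is
assumed to be the connection socket:

    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* sketch: coalesce several pipelined responses into maximally
     * packed packets, regardless of how many layers flush */
    static void write_pipelined_responses(int fd)
    {
        int on = 1, off = 0;

        setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof on);
        /* ... write headers + bodies for each queued response ... */
        setsockopt(fd, IPPROTO_TCP, TCP_CORK, &off, sizeof off);
    }

With the cork set, partial frames are held in the kernel until the
cork is removed, so layer add/remove boundaries stop costing packets.
Without it, the user-space buffering layer above the iol has to do the
coalescing itself.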