ossp-pkg/sio/BRAINSTORM/doc_dean_iol.txt

goals? we need an i/o abstraction which has these properties:

- buffered and non-buffered modes

    The buffered mode should look like FILE *.

    The non-buffered mode should look more like read(2)/write(2).

- blocking and non-blocking modes

    The blocking mode is the "easy" mode -- it's what most module writers
    will see.  The non-blocking mode is the "hard" mode, this is where
    module writers wanting to squeeze out some speed will have to play.
    In order to build async/sync hybrid models we need the
    non-blocking i/o abstraction.

- timed reads and writes (for blocking cases)

    This is part of my jihad against asynchronous notification.  (One
    way to build timed reads on top of poll(2) is sketched after this
    list.)

- i/o filtering or layering

    Yet another Holy Grail of computing.  But I digress.  These are
    hard when you take into consideration non-blocking i/o -- you have
    to keep lots of state.  I expect our core filters will all support
    non-blocking i/o -- well, at least the ones I need to make sure we
    kick ass on benchmarks.  A filter can deny a switch to non-blocking
    mode; the server will have to recover gracefully (ha).

- copy-avoidance

    Hey what about zero copy a la IO-Lite?  After having experienced it
    in a production setting I'm no longer convinced of its benefits.
    There is an enormous amount of overhead keeping lists of buffers,
    and reference counts, and cleanup functions, and such which requires
    a lot of tuning to get right.  I think there may be something here,
    but it's not a cakewalk.

    What I do know is that the heuristics I put into apache-1.3 to choose
    writev() at times are almost as good as what you can get from doing
    full zero-copy in the cases we *currently* care about.  To put it
    another way, let's wait another generation to deal with zero copy.

    But sendfile/transmitfile/etc. those are still interesting.

    So instead of listing "zero copy" as a property, I'll list
    "copy-avoidance".

So far?

- ap_bungetc added
- ap_blookc changed to return the character, rather than take a char *buff
- in theory, errno is always useful on return from a BUFF routine
- ap_bhalfduplex, B_SAFEREAD will be re-implemented using a layer I think
- chunking gone for now, will return as a layer
- ebcdic gone for now... it should be a layer

- ap_iol.h defined, first crack at the layers...

    Step back a second to think on it.  Much like we have fread(3)
    and read(2), I've got a BUFF and an ap_iol abstraction.  An ap_iol
    could use a BUFF if it requires some form of buffering, but many
    won't require buffering... or can do a better job themselves.

    Consider filters such as:
	- ebcdic -> ascii
	- encryption
	- compression
    These all share the property that no matter what, they're going to make
    an extra copy of the data.  In some cases they can do it in place (read)
    or into a fixed buffer... in most cases their buffering requirements
    are different than what BUFF offers.

    Consider a filter such as chunking.  This could actually use the writev
    method to get its job done... depends on the chunks being used.  This
    is where zero-copy would be really nice, but we can get by with a few
    heuristics.

    At any rate -- the NSPR folks didn't see any reason to include a
    buffered i/o abstraction on top of their layered i/o abstraction...
    so I feel like I'm not the only one who's thinking this way.  (A
    sketch of what such a layer interface might look like follows this
    list.)

- iol_unix.c implemented... should hold us for a bit
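
(For illustration, an iol layer interface is roughly a small table of
function pointers plus per-instance state.  This is a sketch only --
the field and type names are guesses, not the actual contents of
ap_iol.h.)

    #include <sys/types.h>
    #include <sys/uio.h>

    typedef struct ap_iol ap_iol;

    /* unbuffered operations a BUFF (or another iol) can sit on top of */
    typedef struct {
        ssize_t (*read)(ap_iol *iol, char *buf, size_t len);
        ssize_t (*write)(ap_iol *iol, const char *buf, size_t len);
        ssize_t (*writev)(ap_iol *iol, const struct iovec *vec, int nvec);
        int     (*close)(ap_iol *iol);
    } ap_iol_methods;

    struct ap_iol {
        const ap_iol_methods *methods;
        void *state;    /* per-layer data: an fd, a BUFF *, ... */
    };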


==============================
Date: Mon, 10 Apr 2000 14:39:48 -0700 (PDT)
From: dean gaudet <dgaudet-list-new-httpd@arctic.org>
To: new-httpd@apache.org
Subject: Re: Buff should be an I/O layer
In-Reply-To: <20000410123109.C3931@manojk.users.mindspring.com>
Message-ID: <Pine.LNX.4.21.0004101418410.2626-100000@twinlark.arctic.org>

[hope you don't mind me taking this back to new-httpd so that it's
archived this time :)]

On Mon, 10 Apr 2000, Manoj Kasichainula wrote:

> On Mon, Mar 27, 2000 at 04:48:23PM -0800, Dean Gaudet wrote:
> > On Sat, 25 Mar 2000, Manoj Kasichainula wrote:
> > > (aside: Though my unschooled brain still sees no
> > > problem if our chunking layer maintains a pile of 6-byte blocks that
> > > get used in an iol_writev. I'll read the archived discussions.)
> > 
> > there's little in the way of archived discussions, there's just me admitting
> > that i couldn't find a solution which was not complex.
> 
> OK, there's got to be something wrong with this:
> 
> chunk_iol->iol_write(char *buffer) {
>     pull a 10-byte (or whatever) piece out of our local stash
>     construct a chunk header in it
>     set the iovec = chunk header + buffer
>     writev(iovec)
> }
> 
> But what is it?

when i was doing the new apache-2.0 buffering i was focusing a lot on
supporting non-blocking sockets so we could do the async i/o stuff -- and
to support a partial write you need to keep more state than what your
suggestion has.
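
(To make the state question concrete: a chunking layer over a
non-blocking socket can see a short write land inside the chunk-size
header, inside the caller's body, or inside the trailing CRLF, and it
has to resume from any of those points on the next call.  A sketch of
the bookkeeping, with invented names:)

    #include <stddef.h>

    /* hypothetical per-BUFF chunking state for non-blocking writes */
    struct chunk_state {
        char   header[12];     /* "%x\r\n" for the current chunk */
        size_t header_len;     /* total header bytes for this chunk */
        size_t header_sent;    /* how far into header the writes got */
        size_t body_sent;      /* how far into the caller's buffer */
        int    crlf_sent;      /* 0..2 bytes of the trailing "\r\n" */
    };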

also, the real complexity comes when you consider handling a pipelined
HTTP/1.1 connection -- consider what happens when you get 5 requests
for /cgi-bin/printenv smack after the other.

if you do that against apache-1.3 and the current apache-2.0 you get
back maximally packed packets.  but if you make chunking a layer then
every time you add/remove the layer you'll cause a packet boundary --
unless you add another buffering layer... or otherwise shift around
the buffering.

as a reminder, visit
<http://www.w3.org/Protocols/HTTP/Performance/Pipeline.html> for a
description of how much we win on the wire from such an effort.

also, at some point i worry that passing the kernel dozens of tiny
iovecs is more expensive than an extra byte copy into a staging buffer,
and passing it one large buffer.  but i haven't done any benchmarks to
prove this.  (my suspicions have to do with the way that at least the
linux kernel's copying routine is written regarding aligned copies)

oh it's totally worth pointing out that at least Solaris allows at
most 16 iovecs in a single writev()... which probably means every sysv
derived system is similarly limited.  linux sets the limit at 1024.
freebsd has an optimisation for up to 8, but otherwise handles 1024.
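
(Given those limits, any writev()-based layer has to clamp its vector
count.  A sketch, assuming IOV_MAX from <limits.h> with a conservative
fallback; writev_clamped is an invented name:)

    #include <limits.h>
    #include <sys/types.h>
    #include <sys/uio.h>

    #ifndef IOV_MAX
    #define IOV_MAX 16   /* the SysV-ish limit mentioned above */
    #endif

    /* write vec[0..nvec) in slices of at most IOV_MAX entries;
     * returns bytes written, so a caller handles a short write the
     * same way it would for write(2).
     */
    static ssize_t writev_clamped(int fd, const struct iovec *vec, int nvec)
    {
        ssize_t total = 0;
        while (nvec > 0) {
            int i, n = nvec > IOV_MAX ? IOV_MAX : nvec;
            size_t slice = 0;
            ssize_t rv;

            for (i = 0; i < n; i++)
                slice += vec[i].iov_len;
            rv = writev(fd, vec, n);
            if (rv == -1)
                return total > 0 ? total : -1;
            total += rv;
            if ((size_t)rv < slice)
                break;          /* short write: caller must resume */
            vec += n;
            nvec -= n;
        }
        return total;
    }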

i'm still doing work in this area though -- after all my ranting about
zero-copy a few weeks back i set out to prove myself wrong by writing
a zero-copy buffering library using every trick in my book.  i've no
results to share yet though.

-dean


==============================
Date: Tue, 2 May 2000 15:51:30 +0200
From: Martin Kraemer <Martin.Kraemer@mch.sni.de>
To: new-httpd@apache.org
Subject: BUFF, IOL, Chunking, and Unicode in 2.0 (long)
Message-ID: <20000502155129.A10548@pgtm0035.mch.sni.de>

Sorry for a long silence in the past weeks, I've been busy with other
stuff.

Putting the catch-words "Chunking, Unicode and 2.0" into the subject
was on purpose: I didn't want to scare off anyone because of the word
EBCDIC: the problems I describe here, and the proposed new buff.c
layering, are mostly independent of the EBCDIC port.


In the past weeks, I've been thinking about today's buff.c (and
studied its applicability for automatic conversion stuff like in the
russian apache, see apache.lexa.ru). I think it would be neat to be
able to do automatic character set conversion in the server, for
example by negotiation (when the client sends an Accept-Charset and
the server doesn't have a document with exactly the right Charset, but
knows how to generate it from an existing representation).

IMO it is a recurring problem,

* not only in today's russian internet environment (de facto browsers
  support 5 different cyrillic character sets, but the server doesn't
  want to hold every document in 5 copies, so an automatic translation
  is performed by the russian apache, depending on information supplied
  by the client, or by explicit configuration). One of the supported
  character sets is Unicode (UTF-7 or UTF-8)

* in japanese/chinese environments, support for 16 bit character sets
  is an absolute requirement. (Other oriental scripts like Thai get
  along with 8 bit: they only have 44 consonants and 16 vowels).
  Having success on the eastern markets depends to a great deal on
  having support for these character sets. The japanese Apache
  community hasn't had much contact with new-httpd in the past, but
  I'm absolutely sure that there is a "standard japanese patch" for
  Apache which would well be worth integrating into the standard
  distribution. (Anyone on the list to provide a pointer?)

* In the future, more and more browsers will support unicode, and so
  will the demand grow for servers supporting unicode. Why not
  integrate ONE solution for the MANY problems worldwide?

* The EBCDIC port of 1997 has been a simple solution for a rather
  simple problem. If we were to "do it right" for 2.0 and provide a
  generic translation layer, we would solve many problems in a single
  blow. The EBCDIC translation would be only one of them.

Jeff has been digging through the EBCDIC stuff and apparently
succeeded in porting a lot of the 1.3 stuff to 2.0 already. Jeff, I'd
sure be interested in having a look at it. However, when I looked at
buff.c and the new iol_* functionality, I found out that iol's are not
the way to go: they give us no solution for any of the conversion
problems:

* iol's sit below BUFF. Therefore, they don't have enough information
  to know which part of the written byte stream is net client data,
  and which part is protocol information (chunks, MIME headers for
  multipart/*).

* iol's don't allow simplification of today's chunking code. It is
  spread throughout buff.c and there's a very hairy balance between
  efficiency and code correctness. Re-adding (EBCDIC/UTF) conversion,
  possibly with support for multi byte character sets (MBCS), would
  make a code nightmare out of it. (buff.c in 1.3 was "almost" a
  nightmare because we had only single byte translations.)

* Putting conversion to a hierarchy level any higher than buff.c is no
  solution either: for chunks, as well as for multipart headers and
  buffering boundaries, we need character set translation. Pulling it
  to a higher level means that a lot of redundant information has to
  be passed down and up.

In my understanding, we need a layered buff.c (which I number from 0
upwards):

0) at the lowest layer, there's a "block mode" which basically
   supports bread/bwrite/bwritev by calling the equivalent iol_*
   routines. It doesn't know about chunking, conversion, buffering and
   the like. All it does is read/write with error handling.

1) the next layer handles chunking. It knows about the current
   chunking state and adds chunking information into the written
   byte stream at appropriate places. It does not need to know about
   buffering, or what the current (ebcdic?) conversion setting is.

2) this layer handles conversion. I was thinking about a concept
   where a generic character set conversion would be possible based on
   Unicode-to-any translation tables. This would also deal with
   multibyte character sets, because at this layer, it would
   be easy to convert SBCS to MBCS.
   Note that conversion *MUST* be positioned above the chunking layer
   and below the buffering layer. The former guarantees that chunking
   information is not converted twice (or not at all), and the latter
   guarantees that ap_bgets() is looking at the converted data
   (-- otherwise it would fail to find the '\n' which indicates end-
   of-line).
   Using (loadable?) translation tables based on unicode definitions
   is a very similar approach to what libiconv offers you (see
   http://clisp.cons.org/~haible/packages-libiconv.html -- though my
   inspiration came from the russian apache, and I only heard about
   libiconv recently). Every character set can be defined as a list
   of <hex code> <unicode equiv> pairs, and translations between
   several SBCS's can be collapsed into a single 256 char table.
   Efficiently building them once only, and finding them fast, is an
   optimization task.  (A sketch of such a collapsed table follows
   this list.)

3) This last layer adds buffering to the byte stream of the lower
   layers. Because chunking and translation have already been dealt
   with, it only needs to implement efficient buffering. Code
   complexity is reduced to simple stdio-like buffering.
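
(To make the collapsing idea in layer 2 concrete: once each single
byte character set is defined by its <hex code> <unicode equiv> pairs,
an A-to-B translation reduces to one 256-byte table built at setup
time.  A sketch with invented names:)

    #include <stddef.h>

    /* a_to_uni[c] maps a byte of charset A to its Unicode code point;
     * uni_to_b() maps a code point to a byte of charset B (or to a
     * substitution character).  The composite table then makes the
     * per-request conversion a one-lookup-per-byte loop.
     */
    static void collapse_sbcs(const unsigned short a_to_uni[256],
                              unsigned char (*uni_to_b)(unsigned short),
                              unsigned char table[256])
    {
        int c;
        for (c = 0; c < 256; c++)
            table[c] = uni_to_b(a_to_uni[c]);
    }

    static void convert(const unsigned char table[256],
                        unsigned char *buf, size_t len)
    {
        size_t i;
        for (i = 0; i < len; i++)
            buf[i] = table[buf[i]];
    }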


Creating a BUFF stream involves creation of the basic (layer 0) BUFF,
and then pushing zero or more filters (in the right order) on top of
it. Usually this will add the chunking layer, optionally add
the conversion layer, and usually add the buffering layer (look for
ap_bcreate() in the code: it almost always uses B_RD/B_WR).

Here's code from a conceptual prototype I wrote:
    BUFF *buf = ap_bcreate(NULL, B_RDWR), *chunked, *buffered;
    chunked   = ap_bpush_filter(buf,     chunked_filter, 0);
    buffered  = ap_bpush_filter(chunked, buffered_filter, B_RDWR);
    ap_bputs("Data for buffered ap_bputs\n", buffered);


Using a BUFF stream doesn't change: simply invoke the well known API
and call ap_bputs() or ap_bwrite() as you would today. Only, these
would be wrapper macros

    #define ap_bputs(data, buf)             buf->bf_puts(data, buf)
    #define ap_bwrite(buf, data, max, lenp) buf->bf_write(buf, data, max, lenp)

where a BUFF struct would hold function pointers and flags for the
various levels' input/output functions, in addition to today's BUFF
layout.
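
(Concretely, the struct behind those macros might look as follows; a
speculative sketch, with invented member names, of today's BUFF plus
per-layer methods:)

    typedef struct buff_struct BUFF;

    struct buff_struct {
        int flags;              /* B_RD, B_WR, B_RDWR, ... */
        BUFF *lower;            /* the next layer down, if any */
        /* the per-layer methods the wrapper macros dispatch to */
        int (*bf_puts)(const char *data, BUFF *b);
        int (*bf_write)(BUFF *b, const void *data, int max, int *lenp);
        /* ... plus today's BUFF members: buffers, counters, iol ... */
    };

Pushing a filter would fill in the function pointers and chain to the
BUFF below it.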

For performance improvement, the following can be added to taste:

* less buffering (zero copy where possible) by putting the buffers
  for buffered reading/writing down as far as possible (for SBCS: from
  layer 3 to layer 0). By doing this, the buffer can also hold a
  chunking prefix (used by layer 1) in front of the buffering buffer
  to reduce the number of vectors in a writev, or the number of copies
  between buffers. Each layer could indicate whether it needs a
  private buffer or not.

* intra-module calls can be hardcoded to call the appropriate lower
  layer directly, instead of using the ap_bwrite() etc macros. That
  means we don't use the function pointers all the time, but instead
  call the lower levels directly. OTOH we have iol_* stuff which uses
  function pointers anyway. We decided in 1.3 that we wanted to avoid
  the C++ type stuff (esp. function pointers) for performance reasons.
  But it would sure reduce the code complexity a lot.

The resulting layering would look like this:

    | Caller: using ap_bputs() | or ap_bgets/ap_bwrite etc.
    +--------------------------+
    | Layer 3: Buffered I/O    | gets/puts/getchar functionality
    +--------------------------+
    | Layer 2: Code Conversion | (optional conversions)
    +--------------------------+
    | Layer 1: Chunking Layer  | Adding chunks on writes
    +--------------------------+
    | Layer 0: Binary Output   | bwrite/bwritev, error handling
    +--------------------------+
    | iol_* functionality      | basic i/o
    +--------------------------+
    | apr_* functionality      |
    ....

-- 
<Martin.Kraemer@MchP.Siemens.De>             |    Fujitsu Siemens
Fon: +49-89-636-46021, FAX: +49-89-636-41143 | 81730  Munich,  Germany


==============================
Date: Tue, 2 May 2000 09:09:28 -0700 (PDT)
From: dean gaudet <dgaudet-list-new-httpd@arctic.org>
To: new-httpd@apache.org
Subject: Re: BUFF, IOL, Chunking, and Unicode in 2.0 (long)
In-Reply-To: <20000502155129.A10548@pgtm0035.mch.sni.de>
Message-ID: <Pine.LNX.4.21.0005020847180.22518-100000@twinlark.arctic.org>

On Tue, 2 May 2000, Martin Kraemer wrote:

> * iol's sit below BUFF. Therefore, they don't have enough information
>   to know which part of the written byte stream is net client data,
>   and which part is protocol information (chunks, MIME headers for
>   multipart/*).

there's not much stopping you from writing an iol which takes a BUFF * in
its initialiser, and then bcreating a second BUFF, and bpushing your iol.
like:

	/* this is in r->pool rather than r->connection->pool because
	 * we expect to create & destroy this inside request boundaries
	 * and if we stuck it in r->connection->pool the storage wouldn't
	 * be reclaimed early enough on pipelined connections.
	 *
	 * also, no need for buffering in new_buff because the translation
	 * layer can easily assume lower level BUFF is doing the buffering.
	 */
	new_buff = ap_bcreate(r->pool, B_WR);
	ap_bpush_iol(new_buff,
		ap_utf8_to_ebcdic(r->pool, r->connection->client));
	r->connection->client = new_buff;

main problem is that the new_buff only works for writing, and you
potentially need a separate conversion layer for reading from the
client.

shouldn't be too hard to split up r->connection->client into a read and
write half.

think of iol as the equivalent of the low level read/write, and BUFF
as the equivalent of FILE *.  there's a reason for both layers in
the interface.

> * iol's don't allow simplification of today's chunking code. It is
>   spread throughout buff.c and there's a very hairy balance between
>   efficiency and code correctness. Re-adding (EBCDIC/UTF) conversion,
>   possibly with support for multi byte character sets (MBCS), would
>   make a code nightmare out of it. (buff.c in 1.3 was "almost" a
>   nightmare because we had only single byte translations.)

as i've said before, i welcome anyone to do it otherwise without adding
network packets, without adding unnecessary byte copies, and without
making it even more complex.  until you've tried it, it's pretty easy
to just say "this is a mess".  once you've tried it i suspect you'll
discover why it is a mess.

that said, i'm still trying to prove to myself that the zero-copy
crud necessary to clean this up can be done in a less complex manner.

> * Putting conversion to a hierarchy level any higher than buff.c is no
>   solution either: for chunks, as well as for multipart headers and
>   buffering boundaries, we need character set translation. Pulling it
>   to a higher level means that a lot of redundant information has to
>   be passed down and up.

huh?  HTTP is in ASCII -- you don't need any conversion -- if a chunking
BUFF below a converting BUFF/iol is writing those things in ascii
it works.  no?  at least that's my understanding of the code in 1.3.

you wouldn't do the extra BUFF layer above until after you've written
the headers into the plain-text BUFF.

i would expect you'd:

	write headers through plain text BUFF
	push conversion BUFF
	run method
	pop conversion BUFF
	pump multipart header
	push conversion BUFF
	...
	pop conversion BUFF

> In my understanding, we need a layered buff.c (which I number from 0
> upwards):

you've already got it :)

>     | Caller: using ap_bputs() | or ap_bgets/ap_bwrite etc.
>     +--------------------------+
>     | Layer 3: Buffered I/O    | gets/puts/getchar functionality
>     +--------------------------+
>     | Layer 2: Code Conversion | (optional conversions)
>     +--------------------------+
>     | Layer 1: Chunking Layer  | Adding chunks on writes
>     +--------------------------+
>     | Layer 0: Binary Output   | bwrite/bwritev, error handling
>     +--------------------------+
>     | iol_* functionality      | basic i/o
>     +--------------------------+
>     | apr_* functionality      |

there are two cases you need to consider:

chunking and a partial write occurs -- you need to keep track of how much
of the chunk header/trailer was written so that on the next loop around
(which happens in the application at the top) you continue where you
left off.

and more importantly at the moment, and easier to grasp -- consider what
happens when you've got a pipelined connection.  a dozen requests come
in from the client, and apache-1.3 will send back the minimal number
of packets.  2.0-current still needs fixing in this area (specifically
saferead needs to be implemented).

for example, suppose the client sends one packet:

	GET /images/a.gif HTTP/1.1
	Host: foo

	GET /images/b.gif HTTP/1.1
	Host: foo

suppose that a.gif and b.gif are small 200 byte files.

apache-1.3 sends back one response packet:

	HTTP/1.1 200 OK
	headers

	a.gif body
	HTTP/1.1 200 OK
	headers

	b.gif body

consider what happens with your proposal.  in between each of those
requests you remove the buffering -- which means you have to flush a
packet boundary.  so your proposal generates two network packets.

like i've said before on this topic -- if all unixes had TCP_CORK,
it'd be a breeze.  but only linux has TCP_CORK.
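
(for reference, TCP_CORK on linux: hold partial frames in the kernel
until uncorked, so per-layer flushes don't force packet boundaries.
a minimal sketch:)

	#include <netinet/in.h>
	#include <netinet/tcp.h>
	#include <sys/socket.h>

	/* linux-only: while corked, the kernel accumulates writes and
	 * sends maximally packed packets; uncorking flushes the rest.
	 */
	static int cork(int fd, int on)
	{
	    return setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
	}

	/* usage: cork(fd, 1); ...many small writes...; cork(fd, 0); */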

you pretty much require a layer of buffering right above the iol which
talks to the network.

and once you put that layer of buffering there, you might as well merge
chunking into it, because chunking needs buffering as well (specifically
for the async i/o case).

and then you either have to double-buffer, or you can only stack
non-buffered layers above it.  fortunately, character-set conversion
should be doable without any buffering.

*or* you implement a zero-copy library, and hope it all works out in
the end.

-dean

