[djg: comments like this are from dean]

This past summer, Alexei and I wrote a spec for an I/O Filters API...
this proposal addresses one part of that -- 'stacked' I/O with buff.c.

We have a couple of options for stacked I/O: we can either use existing
code, such as sfio, or we can rewrite buff.c to do it. We've gone over
the first possibility at length, though, and there were problems with each
implementation that was mentioned (licensing and compatibility,
specifically); so far as I know, those remain issues.

Btw -- sfio will be supported w/in this model... it just wouldn't be the
basis for the model's implementation.

     -- Ed Korthof        |  Web Server Engineer --
     -- ed@organic.com    |  Organic Online, Inc --
     -- (415) 278-5676    |  Fax: (415) 284-6891 --

---------------------------------------------------------------------------
Stacked I/O With BUFFs

Sections:

    1.) Overview
    2.) The API
        User-supplied structures
        API functions
    3.) Detailed Description
        The bfilter structure
        The bbottomfilter structure
        The BUFF structure
        Public functions in buff.c
    4.) Efficiency Considerations
        Buffering
        Memory copies
        Function chaining
        writev
    5.) Code in buff.c
        Default Functions
        Heuristics for writev
        Writing
        Reading
        Flushing data
        Closing stacks and filters
        Flags and Options

*************************************************************************
Overview

The intention of this API is to make Apache's BUFF structure modular
while retaining high efficiency. Basically, it involves rewriting
buff.c to provide 'stacked' I/O -- where the data passes through a
series of 'filters', which may modify it.

There are two parts to this: the core code for BUFF structures, and the
"filters" used to implement new behavior. "filter" is used to refer to
both the sets of 6 functions, as shown in the bfilter structure in the
next section, and to BUFFs which are created using a specific bfilter.
These will also be occasionally referred to as "user-supplied", though
the Apache core will need to use these as well for basic functions.

The user-supplied functions should use only the public BUFF API, rather
than any internal details or functions. One thing which may not be
clear is that in the core BUFF functions, the BUFF pointer passed in
refers to the BUFF on which the operation will happen. OTOH, in the
user-supplied code, the BUFF passed in is the next buffer down the
chain, not the current one.

*************************************************************************
The API

User-supplied structures

First, the bfilter structure is used in all filters:

    typedef struct bfilter bfilter;
    struct bfilter {
        int (*writev)(BUFF *, void *, struct iovec *, int);
        int (*write)(BUFF *, void *, const char *, int);
        int (*read)(BUFF *, void *, char *, int);
        int (*flush)(BUFF *, void *, const char *, int, bfilter *);
        int (*transmitfile)(BUFF *, void *, file_info_ptr *);
        void (*close)(BUFF *, void *);
    };

bfilters are placed into a BUFF structure along with a
user-supplied void * pointer.
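To make the calling convention concrete, here is a minimal sketch of a 'write' member for a bfilter that upper-cases text before handing it down the chain. The BUFF type and the bwrite function below are hypothetical stand-ins (a fixed-size sink) so that the sketch compiles on its own; upcase_write is likewise an invented name, and none of this is code from buff.c.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for BUFF: a fixed-size sink, so the sketch is
 * self-contained. The real structure lives in buff.c. */
typedef struct BUFF {
    char data[256];
    int len;
} BUFF;

/* Hypothetical stand-in for the public bwrite(): append to the sink. */
static int bwrite(BUFF *fb, const void *buf, int nbyte)
{
    if (fb == NULL || nbyte < 0 || fb->len + nbyte > (int)sizeof(fb->data))
        return -1;
    memcpy(fb->data + fb->len, buf, (size_t)nbyte);
    fb->len += nbyte;
    return nbyte;
}

/* A 'write' member for a bfilter. 'next' is the next BUFF down the chain,
 * 'info' is the user-supplied void * (unused here). The input is const,
 * so the modified text goes through a small scratch block. Returns the
 * number of input characters processed, or -1 if nothing could be passed on. */
static int upcase_write(BUFF *next, void *info, const char *text, int nbyte)
{
    char tmp[128];
    int done = 0;

    (void)info;
    while (done < nbyte) {
        int n = nbyte - done;
        int i;

        if (n > (int)sizeof(tmp))
            n = (int)sizeof(tmp);
        for (i = 0; i < n; i++)
            tmp[i] = (char)toupper((unsigned char)text[done + i]);
        if (bwrite(next, tmp, n) != n)
            return done > 0 ? done : -1;
        done += n;
    }
    return done;
}
```

Note that the filter only touches the BUFF it was handed (the next one down) and only through the public write call, which is the discipline described above for user-supplied code.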
Second, the following structure is for use with a filter which can
sit at the bottom of the stack:

    typedef struct {
        void *(*bgetfileinfo)(BUFF *, void *);
        void (*bpushfileinfo)(BUFF *, void *, void *);
    } bbottomfilter;

BUFF API functions

The following functions are new BUFF API functions:

For filters:

    BUFF * bcreatestack(pool *p, int flags, struct bfilter *,
                        struct bbottomfilter *, void *);
    BUFF * bpushfilter (BUFF *, struct bfilter *, void *);
    BUFF * bpushbuffer (BUFF *, BUFF *);
    BUFF * bpopfilter(BUFF *);
    BUFF * bpopbuffer(BUFF *);
    void bclosestack(BUFF *);

For BUFFs in general:

    int btransmitfile(BUFF *, file_info_ptr *);
    int bsetstackopts(BUFF *, int, const void *);
    int bsetstackflags(BUFF *, int, int);

Note that a new flag is needed for bsetstackflags:
B_MAXBUFFERING

The current bcreate should become

    BUFF * bcreatebuffer (pool *p, int flags, struct bfilter *, void *);

*************************************************************************
Detailed Explanation

bfilter structure

The void * pointer used in all these functions, as well as those in the
bbottomfilter structure and the filter API functions, is always the same
pointer w/in an individual BUFF.

The first function in a bfilter structure is 'writev'; this is only
needed for high efficiency writing, generally at the level of the system
interface. In its absence, multiple writes will be done w/ 'write'.
Note that defining 'writev' means you must define 'write'.

The second is 'write'; this is the generic writing function, taking a BUFF
* to which to write, a block of text, and the length of that block of
text. The expected return is the number of characters (out of that block
of text) which were successfully processed (rather than the number of
characters actually written).

The third is 'read'; this is the generic reading function, taking a BUFF *
from which to read data, a char * buffer in which to put text, and the
number of characters to put in that buffer.
The expected return is the number of characters placed in the buffer.

The fourth is 'flush'; this is intended to force the buffer to spit out
any data it may have been saving, as well as to clear any data the
BUFF code was storing. If the third argument is non-null, then it
contains more text to be printed; that text need not be null terminated,
but the fourth argument contains the length of text to be processed. The
expected return value should be the number of characters handled from
the third argument (0 if there are none), or -1 on error. Finally, the
fifth argument is a pointer to the bfilter struct containing this
function, so that it may use the write or writev functions in it. Note
that general buffering is handled by BUFF's internal code, and module
writers should not store data for performance reasons.

The fifth is 'transmitfile', which takes as its arguments a buffer to
which to write (if non-null), the void * pointer containing configuration
(or other) information for this filter, and a system-dependent pointer
(the file_info_ptr structure will be defined on a per-system basis)
containing information required to print the 'file' in question.
This is intended to allow zero-copy TCP in Win32.

The sixth is 'close'; this is what is called when the connection is being
closed. The 'close' should not be passed on to the next filter in the
stack. Most filters will not need to use this, but if database handles
or some other object is created, this is the point at which to remove it.
Note that flush is called automatically before this.

bbottomfilter Structure

The first function, bgetfileinfo, is designed to allow Apache to get
information from a BUFF struct regarding the input and output sources.
This is currently used to get the input file number to select on a socket
to see if there's data waiting to be read. The information returned is
platform specific; the void * pointer passed in holds the void * pointer
passed to all user-supplied functions.
The second function, bpushfileinfo, is used to push file information onto
a buffer, so that the buffer can be fully constructed and ready to handle
data as soon as possible after a client has connected. The first void *
pointer holds platform specific information (in Unix, it would be a pair
of file descriptors); the second holds the void * pointer passed to all
user-supplied functions.

[djg: I don't think I really agree with the distinction here between
the bottom and the other filters. Take the select() example, it's
valid for any layer to define a fd that can be used for select...
in fact it's the topmost layer that should really get to make this
definition. Or maybe I just have your top and bottom flipped. In
any event I think this should be part of the filter structure and
not separate.]

The BUFF structure

A couple of changes are needed for this structure: remove fd and
fd_in; add a bfilter structure; add a pointer to a bbottomfilter;
add three pointers to the next BUFFs: one for the next BUFF in the
stack, one for the next BUFF which implements write, and one
for the next BUFF which implements read.

Public functions in buff.c

BUFF * bpushfilter (BUFF *, struct bfilter *, void *);

    This function adds the filter functions from bfilter, stacking them on
    top of the BUFF. It returns the new top BUFF, or NULL on error.

BUFF * bpushbuffer (BUFF *, BUFF *);

    This function places the second buffer on the top of the stack that
    the first one is on. It returns the new top BUFF, or NULL on error.

BUFF * bpopfilter(BUFF *);
BUFF * bpopbuffer(BUFF *);

    Detaches the top-most filter from the stack, and returns the new
    top-level BUFF, or NULL on error or when there are no BUFFs
    remaining. The two are synonymous.

void bclosestack(BUFF *);

    Closes the I/O stack, removing all the filters in it.

BUFF * bcreatestack(pool *p, int flags, struct bfilter *,
                    struct bbottomfilter *, void *);

    This creates an I/O stack. It returns NULL on error.
BUFF * bcreatebuffer(pool *p, int flags, struct bfilter *, void *);

    This creates a BUFF for later use with bpushbuffer. The BUFF is
    not set up to be used as an I/O stack, however. It returns NULL
    on error.

int bsetstackopts(BUFF *, int, const void *);
int bsetstackflags(BUFF *, int, int);

    These functions set options and flags, respectively, on all the
    BUFFs in a stack. The new flag, B_MAXBUFFERING, is used to disable
    a feature described in the next section, whereby only the first and
    last BUFFs will buffer data.

*************************************************************************
Efficiency Considerations

Buffering

All input and output is buffered by the standard buffering code.
People writing code to use buff.c should not concern themselves with
buffering for efficiency, and should not buffer except when necessary.

The write function will typically be called with large blocks of text;
the read function will attempt to place the specified number of bytes
into the buffer.

Dean noted that there are possible problems w/ multiple buffers;
further, some applications must not be buffered. This can be
partially dealt with by turning off buffering, or by flushing the
data when appropriate.

However, some potential problems arise anyway. The simplest example
involves shrinking transformations; suppose that you have a set
of filters, A, B, and C, such that A outputs less text than it
receives, as does B (say A strips comments, and B gzips the result).
Then after a write to A which fills the buffer, A writes to B.
However, A won't write enough to fill B's buffer, so a memory copy
will be needed. This continues till B's buffer fills up, then
B will write to C's buffer -- with the same effect.

[djg: I don't think this is the issue I was really worried about --
in the case of shrinking transformations you are already doing
non-trivial amounts of CPU activity with the data, and there's
no copying of data that you can eliminate anyway.
I do recognize that there are non-CPU intensive filters -- such as
DMA-capable hardware crypto cards. I don't think they're hard to
support in a zero-copy manner though.]

The maximum additional number of bytes which will be copied in this
scenario is on the order of nk, where n is the total number of bytes,
and k is the number of filters doing shrinking transformations.

There are several possible solutions to this issue. The first
is to turn off buffering in all but the first filter and the
last filter. This reduces the number of unnecessary copies to at
most one per byte; however, it means that the functions in the stack
will get called more frequently. It is the default behavior,
overridable by setting B_MAXBUFFERING with bsetstackflags.
Most filters won't involve a net shrinking transformation, so even
this will rarely be an issue; however, if the filters do involve a
net shrinking transformation, for the sake of network-efficiency
(sending reasonably sized blocks), it may be more efficient anyway.

A second solution is more general use of writev for communication
between different buffers. This complicates the programming work,
however.

Memory copies

Each write function is passed a pointer to constant text; if any changes
are being made to the text, it must be copied. However, if no changes
are made to the text (or to some smaller part of it), then it may be
sent to the next filter without any additional copying. This should
provide the minimal necessary memory copies.

[djg: Unfortunately this makes it hard to support page-flipping and
async i/o because you don't have any reference counts on the data.
But I go into a little detail on that already in docs/page_io.]

Function chaining

In order to avoid unnecessary function chaining for reads and writes,
when a filter is pushed onto the stack, the buff.c code will determine
which is the next BUFF which contains a read or write function, and
reads and writes, respectively, will go directly to that BUFF.
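The shortcut lookup described under "Function chaining" could be sketched roughly as follows. This is only an illustration under assumptions: the BUFF and bfilter types are trimmed-down hypothetical versions with just the fields the sketch needs, and find_reader, find_writer, push_buff and the noop_* placeholders are invented names, not functions from buff.c.

```c
#include <stddef.h>

typedef struct BUFF BUFF;

/* Hypothetical, trimmed-down filter: only read and write matter here. */
typedef struct {
    int (*read)(BUFF *, void *, char *, int);
    int (*write)(BUFF *, void *, const char *, int);
} bfilter;

/* Hypothetical, trimmed-down BUFF with the three "next" pointers. */
struct BUFF {
    bfilter filter;
    BUFF *next;         /* next BUFF down the stack */
    BUFF *next_reader;  /* nearest BUFF below that implements read */
    BUFF *next_writer;  /* nearest BUFF below that implements write */
};

/* Walk down from fb (inclusive) to the first BUFF with a write function. */
static BUFF *find_writer(BUFF *fb)
{
    while (fb != NULL && fb->filter.write == NULL)
        fb = fb->next;
    return fb;
}

/* Walk down from fb (inclusive) to the first BUFF with a read function. */
static BUFF *find_reader(BUFF *fb)
{
    while (fb != NULL && fb->filter.read == NULL)
        fb = fb->next;
    return fb;
}

/* Link 'fresh' on top of 'top' and cache the shortcuts once, at push
 * time, so later reads and writes skip filters that do not take part. */
static BUFF *push_buff(BUFF *top, BUFF *fresh)
{
    fresh->next = top;
    fresh->next_writer = find_writer(top);
    fresh->next_reader = find_reader(top);
    return fresh;
}

/* Placeholder filter functions, only so a stack can be built for testing. */
static int noop_read(BUFF *fb, void *info, char *buf, int n)
{
    (void)fb; (void)info; (void)buf; (void)n;
    return 0;
}

static int noop_write(BUFF *fb, void *info, const char *buf, int n)
{
    (void)fb; (void)info; (void)buf;
    return n;
}
```

With a write-only filter in the middle of a three-deep stack, the top BUFF's writes would go to the middle one while its reads would skip straight to the bottom. The cost is paid once per push rather than once per call, which seems to be the point of the design.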
writev

writev is a function for efficient writing to the system; in terms of
this API, however, it also works for dealing with multiple blocks of
text without doing unnecessary byte copies. It is not required.

Currently, the system level writev is used in two contexts: for
chunking and when a block of text is written which, combined with
the text already in the buffer, would make the buffer overflow.

writev would be implemented both by the default bottom level filter
and by the chunking filter for these operations. In addition, writev
may be used, as noted above, to pass multiple blocks of text w/o
copying them into a single buffer. Note that if the next filter does
not implement writev, however, this will be equivalent to repeated
calls to write, which may or may not be more efficient. Up to
IOV_MAX-2 blocks of text may be passed along in this manner. Unlike
the system writev call, the writev in this API should be called only
once, with an array of iovecs and a count of the iovecs in it.

If a bfilter defines writev, writev will be called whether or not
NO_WRITEV is set; hence, it should deal with that case in a reasonable
manner.

[djg: We can't guarantee atomicity of writev() when we emulate it.
Probably not a problem, just an observation.]

*************************************************************************
Code in buff.c

Default Functions

The default actions are generally those currently performed by Apache,
save that they'll only attempt to write to a buffer, and they'll
return an error if there are no more buffers. That is, you must implement
read, write, and flush in the bottom-most filter.

Except for close(), the default code will simply pass the function call
on to the next filter in the stack. Some samples follow.

Heuristics for writev

Currently, we call writev for chunking, and when we get enough data
that the total overflows the buffer.
Since chunking is going to become a filter, the chunking filter will
use writev; in addition, bwrite will trigger bwritev as shown (note
that system specific information should be kept at the filter level):

in bwrite:

    if (fb->outcnt > 0 && nbyte + fb->outcnt >= fb->bufsiz) {
        /* build iovec structs */
        struct iovec vec[2];
        vec[0].iov_base = (void *) fb->outbase;
        vec[0].iov_len = fb->outcnt;
        fb->outcnt = 0;
        vec[1].iov_base = (void *) buf;
        vec[1].iov_len = nbyte;
        return bwritev(fb, vec, 2);
    }
    else if (nbyte >= fb->bufsiz) {
        return write_with_errors(fb, buf, nbyte);
    }

Note that the code above takes the place of large_write (as well
as taking code from it).

So, bwritev would look something like this (copying and pasting freely
from the current source for writev_it_all, which could be replaced):

-----
int bwritev(BUFF *fb, struct iovec *vec, int nvecs)
{
    if (!fb)
        return -1;  /* the bottom level filter implemented neither write nor
                     * writev. */
    if (fb->bfilter.writev) {
        return fb->bfilter.writev(fb->next, fb->bfilter_info, vec, nvecs);
    }
    else if (fb->bfilter.write) {
        /* while it's nice and easy to build the vector and crud, it's
         * painful to deal with partial writes (esp. w/ the vector) */
        int i = 0, rv;
        while (i < nvecs) {
            do {
                rv = fb->bfilter.write(fb->next, fb->bfilter_info,
                                       vec[i].iov_base, vec[i].iov_len);
            } while (rv == -1 && (errno == EINTR || errno == EAGAIN)
                     && !(fb->flags & B_EOUT));
            if (rv == -1) {
                if (errno != EINTR && errno != EAGAIN) {
                    doerror(fb, B_WR);
                }
                return -1;
            }
            fb->bytes_sent += rv;
            /* recalculate vec to deal with partial writes */
            while (rv > 0) {
                if (rv < vec[i].iov_len) {
                    vec[i].iov_base = (char *) vec[i].iov_base + rv;
                    vec[i].iov_len -= rv;
                    rv = 0;
                }
                else {
                    rv -= vec[i].iov_len;
                    ++i;
                }
            }
            if (fb->flags & B_EOUT)
                return -1;
        }
        /* if we got here, we wrote it all */
        return 0;
    }
    else {
        return bwritev(fb->next, vec, nvecs);
    }
}
-----

The default filter's writev function will look pretty much like
writev_it_all.
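The fiddly part of any writev_it_all-style loop is the bookkeeping after a partial write. The sketch below isolates just that step so it can be checked on its own; iov_consume is a hypothetical helper name, not something in buff.c, and it assumes the caller reports how many bytes the last write accepted.

```c
#include <stddef.h>
#include <sys/uio.h>

/* Skip past 'written' bytes of the vector, starting at entry 'i'.
 * Fully written entries are stepped over; a partially written entry has
 * its base and length adjusted in place. Returns the index of the first
 * entry that still has data, or nvecs when everything was consumed. */
static int iov_consume(struct iovec *vec, int nvecs, int i, size_t written)
{
    while (i < nvecs && written > 0) {
        if (written < vec[i].iov_len) {
            /* partial entry: keep the unwritten tail */
            vec[i].iov_base = (char *)vec[i].iov_base + written;
            vec[i].iov_len -= written;
            written = 0;
        }
        else {
            /* whole entry went out: move to the next one */
            written -= vec[i].iov_len;
            ++i;
        }
    }
    return i;
}
```

A caller would loop on the underlying write or writev, passing &vec[i] and nvecs - i each time, until iov_consume returns nvecs.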
Writing

The general case for writing data is significantly simpler with this
model. Because special cases are not dealt with in the BUFF core,
a single internal interface to writing data is possible; I'm going
to assume it's reasonable to standardize on write_with_errors, but
some other function may be more appropriate.

In the revised bwrite (which I'll omit for brevity), the following
must be done:

    check for error conditions
    check to see if any buffering is done; if not, send the data
        directly to the write_with_errors function
    check to see if we should use writev or write_with_errors
        as above
    copy the data to the buffer (we know it fits since we didn't
        need writev or write_with_errors)

The other work the current bwrite is doing is

    ifdef'ing around NO_WRITEV
    numerous decisions regarding whether or not to send chunks

Generally, buff.c has a number of functions whose entire purpose is
to handle particular special cases wrt chunking, all of which could
be simplified with a chunking filter.

write_with_errors would not need to change; buff_write would. Here
is a new version of it:

-----
/* the lowest level writing primitive */
static ap_inline int buff_write(BUFF *fb, const void *buf, int nbyte)
{
    if (fb->bfilter.write)
        return fb->bfilter.write(fb->next_writer, fb->bfilter_info,
                                 buf, nbyte);
    else
        return bwrite(fb->next_writer, buf, nbyte);
}
-----

If the btransmitfile function is called on a buffer which doesn't
implement it, the system will attempt to read data from the file
identified by the file_info_ptr structure and use other methods to
write it.
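A rough sketch of that fallback on a Unix-like system, assuming the file_info_ptr boils down to a readable descriptor: read the file in blocks and push each block through the ordinary write path. The write_fn callback type, transmit_fallback and count_bytes are all hypothetical names used for illustration; the real fallback would call into the BUFF stack instead of a callback.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical callback standing in for the stack's ordinary write path.
 * Returns the number of bytes accepted, or <= 0 on failure. */
typedef int (*write_fn)(void *ctx, const char *buf, int nbyte);

/* Copy everything readable from fd through 'out', one block at a time.
 * Returns the total number of bytes passed on, or -1 on error. */
static long transmit_fallback(int fd, write_fn out, void *ctx)
{
    char block[4096];
    long total = 0;

    for (;;) {
        ssize_t n = read(fd, block, sizeof(block));
        ssize_t off = 0;

        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted: retry the read */
            return -1;
        }
        if (n == 0)
            return total;       /* end of file */
        while (off < n) {       /* the write path may accept less than asked */
            int w = out(ctx, block + off, (int)(n - off));
            if (w <= 0)
                return -1;
            off += w;
        }
        total += (long)n;
    }
}

/* Example callback: counts bytes instead of sending them anywhere. */
static int count_bytes(void *ctx, const char *buf, int nbyte)
{
    (void)buf;
    *(long *)ctx += nbyte;
    return nbyte;
}
```

This path copies every byte at least once, so it is only the safety net; filters that can hand the descriptor to the system directly avoid it.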
Reading

One of the basic reading functions in Apache 1.3b3 is buff_read;
here is how it would look within this spec:

-----
/* the lowest level reading primitive */
static ap_inline int buff_read(BUFF *fb, void *buf, int nbyte)
{
    if (!fb)
        return -1;  /* the bottom level filter is not set up properly */

    if (fb->bfilter.read)
        return fb->bfilter.read(fb->next_reader, fb->bfilter_info,
                                buf, nbyte);
    else
        return bread(fb->next_reader, buf, nbyte);
}
-----

The code currently in buff_read would become part of the default
filter.

Flushing data

flush will get passed on down the stack automatically, with recursive
calls to bflush. The user-supplied flush function will be called then,
and also before close is called. The user-supplied flush should not
call flush on the next buffer.

[djg: Poorly written "expanding" filters can cause some nastiness
here. In order to flush a layer you have to write out your current
buffer, and that may cause the layer below to overflow a buffer and
flush it. If the filter is expanding then it may have to add more to
the buffer before flushing it to the layer below. It's possible that
the layer below will end up having to flush twice. It's a case where
writev-like capabilities are useful.]

Closing Stacks and Filters

When a filter is removed from the stack, flush will be called then
close will be called. When the entire stack is being closed, this
operation will be done automatically on each filter within the stack;
generally, filters should not operate on other filters further down
the stack, except to pass data along when flush is called.

Flags and Options

Changes to flags and options using the current functions only affect
one buffer. To affect all the buffers on down the chain, use
bsetstackopts or bsetstackflags.

bgetopt is currently only used to grab a count of the bytes sent;
it will continue to provide that functionality.
bgetflags is used to provide information on whether or not the
connection is still open; it'll continue to provide that functionality
as well.

The core BUFF operations will remain, though some operations which
are done via flags and options will be done by attaching appropriate
filters instead (eg. chunking).

[djg: I'd like to consider filesystem metadata as well -- we only need
a few bits of metadata to do HTTP: file size and last modified. We
need an etag generation function, it is specific to the filters in
use. You see, I'm envisioning a bottom layer which pulls data out of
a database rather than reading from a file.]

-------

This file is there so that I do not have to remind myself
about the reasons for Layered IO, apart from the obvious one.

0. To get away from a 1 to 1 mapping

   i.e. a single URI can cause multiple backend requests,
   in arbitrary configurations, such as in parallel, tunnel/piped,
   or in some sort of funnel mode. Such multiple backend
   requests, with fully layered IO, can be treated exactly
   like any URI request; and recursion is born :-)

1. To do on the fly charset conversion

   Theoretically, be able to send out your content using
   latin1, latin2 or any other charset; generated from static
   _and_ dynamic content in other charsets (typically unicode
   encoded as UTF7 or UTF8). Such conversion is prompted by
   things like the user-agent string, a cookie, or other hints
   about the capabilities of the OS, language preferences and
   other (in)capabilities of the final recipient.

2. To be able to do fancy templates

   Have your application/cgi sending out an XML structure of
   field/value pair-ed contents; which is substituted into a
   template by the web server; possibly based on information
   accessible/known to the webserver which you do not want to
   be known to the backend script. Ideally that template would
   be just as easy to generate by a backend as well (see 0).

3.
On the fly translation

   And other general text and output munging, such as translating
   an English page into Spanish whilst it goes through your Proxy,
   or JPEG-ing a GIF generated by mod_perl+gd.

Dw.

---------

From dgaudet@arctic.org Fri Feb 20 00:36:52 1998
Date: Fri, 20 Feb 1998 00:35:37 -0800 (PST)
From: Dean Gaudet
To: new-httpd@apache.org
Subject: page-based i/o
X-Comment: Visit http://www.arctic.org/~dgaudet/legal for information regarding copyright and disclaimer.
Reply-To: new-httpd@apache.org

Ed asked me for more details on what I mean when I talk about "page-based
zero copy i/o".

While writing mod_mmap_static I was thinking about the primitives that the
core requires of the filesystem. What exactly is it that ties us into the
filesystem? and how would we abstract it? The metadata (last modified
time, file length) is actually pretty easy to abstract. It's also easy to
define an "index" function so that MultiViews and such can be implemented.
And with layered I/O we can hide the actual details of how you access
these "virtual" files.

But therein lies an inefficiency. If we had only bread() for reading
virtual files, then we would enforce at least one copy of the data.
bread() supplies the place that the caller wants to see the data, and so
the bread() code has to copy it. But there's very little reason that
bread() callers have to supply the buffer... bread() itself could supply
the buffer. Call this new interface page_read(). It looks something like
this:

    typedef struct {
        const void *data;
        size_t data_len; /* amt of data on page which is valid */
        ... other stuff necessary for managing the page pool ...
    } a_page_head;

    /* returns NULL if an error or EOF occurs, on EOF errno will be
     * set to 0
     */
    a_page_head *page_read(BUFF *fb);

    /* queues entire page for writing, returns 0 on success, -1 on
     * error
     */
    int page_write(BUFF *fb, a_page_head *);

It's very important that a_page_head structures point to the data page
rather than be part of the data page.
This way we can build a_page_head
structures which refer to parts of mmap()d memory.

This stuff is a little more tricky to do, but is a big win for performance.
With this integrated into our layered I/O it means that we can have
zero-copy performance while still getting the advantages of layering.

But note I'm glossing over a bunch of details... like the fact that we
have to decide if a_page_heads are shared data, and hence need reference
counting (i.e. I said "queues for writing" up there, which means some
bit of the a_page_head data has to be kept until it's actually written).
Similarly for the page data.

There are other tricks in this area that we can take advantage of --
like interprocess communication on architectures that do page flipping.
On these boxes if you write() something that's page-aligned and page-sized
to a pipe or unix socket, and the other end read()s into a page-aligned
page-sized buffer then the kernel can get away without copying any data.
It just marks the two pages as shared copy-on-write, and only when
they're written to will the copy be made. So to make this work, your
writer uses a ring of 2+ page-aligned/sized buffers so that it's not
writing on something the reader is still reading.

Dean

----

For details on HPUX and avoiding extra data copies, see .

(note that if you get the postscript version instead, you have to
manually edit it to remove the front page before any version of
ghostscript that I have used will read it)

----

I've been told by an engineer in Sun's TCP/IP group that zero-copy TCP
in Solaris 2.6 occurs when:

    - you've got the right interface card (OC-12 ATM card I think)
    - you use write()
    - your write buffer is 16k aligned and a multiple of 16k in size

We currently get the 16k stuff for free by using mmap(). But Sun's
current code isn't smart enough to deal with our initial writev()
of the headers and first part of the response.
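For buffers that do not come from mmap(), the size and alignment conditions in the Solaris note above could be met explicitly. The following is a minimal sketch using POSIX posix_memalign; ZC_ALIGN, zc_round_up and zc_alloc are hypothetical names, and whether a given kernel then actually takes a zero-copy path is not something the sketch can show.

```c
#include <stddef.h>
#include <stdlib.h>

/* 16k, the alignment and size granule from the Solaris 2.6 note above. */
#define ZC_ALIGN ((size_t)16 * 1024)

/* Round n up to the next multiple of ZC_ALIGN (0 stays 0). */
static size_t zc_round_up(size_t n)
{
    return (n + ZC_ALIGN - 1) / ZC_ALIGN * ZC_ALIGN;
}

/* Allocate a buffer that is 16k aligned and a multiple of 16k in size.
 * The actual size is stored in *got. Returns NULL on failure; release
 * the buffer with free(). */
static void *zc_alloc(size_t want, size_t *got)
{
    void *p = NULL;
    size_t n = zc_round_up(want ? want : 1);

    if (posix_memalign(&p, ZC_ALIGN, n) != 0)
        return NULL;
    *got = n;
    return p;
}
```

The same rounding would apply to a ring of page-sized buffers for the page-flipping case, with the system page size in place of 16k.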
----

Systems that have a system call to efficiently send the contents of a
descriptor across the network. This is probably the single best way
to do static content on systems that support it.

HPUX: (10.30 and on)

    ssize_t sendfile(int s, int fd, off_t offset, size_t nbytes,
                     const struct iovec *hdtrl, int flags);

    (allows you to add headers and trailers in the form of iovec
    structs) Marc has a man page; ask if you want a copy. Not included
    due to copyright issues. man page also available from
    http://docs.hp.com/ (in particular,
    http://docs.hp.com:80/dynaweb/hpux11/hpuxen1a/rvl3en1a/@Generic__BookTextView/59894;td=3 )

Windows NT:

    BOOL TransmitFile( SOCKET hSocket,
                       HANDLE hFile,
                       DWORD nNumberOfBytesToWrite,
                       DWORD nNumberOfBytesPerSend,
                       LPOVERLAPPED lpOverlapped,
                       LPTRANSMIT_FILE_BUFFERS lpTransmitBuffers,
                       DWORD dwFlags
                     );

    (does it start from the current position in the handle? I would
    hope so, or else it is pretty dumb.)

    lpTransmitBuffers allows for headers and trailers.

    Documentation at:

    http://premium.microsoft.com/msdn/library/sdkdoc/wsapiref_3pwy.htm
    http://premium.microsoft.com/msdn/library/conf/html/sa8ff.htm

    Even less related to page based IO: just context switching:
    AcceptEx does an accept(), and returns the start of the
    input data. see:

    http://premium.microsoft.com/msdn/library/sdkdoc/pdnds/sock2/wsapiref_17jm.htm

    What this means is you require one less syscall to do a
    typical request, especially if you have a cache of handles
    so you don't have to do an open or close. Hmm. Interesting
    question: then, if TransmitFile starts from the current
    position, you need a mutex around the seek and the
    TransmitFile. If not, you are just limited (eg. byte
    ranges) in what you can use it for.

    Also note that TransmitFile can specify TF_REUSE_SOCKET, so that
    after use the same socket handle can be passed to AcceptEx.
    Obviously only good where we don't have a persistent connection
    to worry about.

----

Note that all this is shot to bloody hell by HTTP-NG's multiplexing.
If fragment sizes are big enough, it could still be worthwhile to
do copy avoidance. It also causes performance issues because of
its credit system that limits how much you can write in a single
chunk.

Don't tell me that if HTTP-NG becomes popular we will see vendors
embedding SMUX (or whatever multiplexing is used) in the kernel to
get around this stuff. There we go, Apache with a loadable kernel
module.

----

Larry McVoy's document for SGI regarding sendfile/TransmitFile:
ftp://ftp.bitmover.com/pub/splice.ps.gz

From dgaudet@arctic.org Sun Jun 20 11:07:58 1999
From: dgaudet@arctic.org (Dean Gaudet)
Newsgroups: en.lists.apache-new-httpd
Subject: mpm update
Date: 19 Jun 1999 07:17:00 +0200
Reply-To: new-httpd@apache.org

I imported mpm-3 into the apache-2.0 repository (tag mpm-3 if you want it).

Then I threw in a bunch of my recent email ramblings, because I'm getting
tired of repeating them, mostly off-list to folks who ask "why doesn't
apache do XYZ?" I intend to be more proactive in this area, because it
can only help.

Then I ripped up BUFF and broke lots of stuff and put in a first crack
at layering. Info on that below.

If you check out the tree, and build it (using Configuration.mpm) you
should be able to serve up the top page of the manual, that's all I've
tested so far ;)

Dean

goals? we need an i/o abstraction which has these properties:

- buffered and non-buffered modes

    The buffered mode should look like FILE *.

    The non-buffered mode should look more like read(2)/write(2).
- blocking and non-blocking modes

    The blocking mode is the "easy" mode -- it's what most module writers
    will see. The non-blocking mode is the "hard" mode, this is where
    module writers wanting to squeeze out some speed will have to play.
    In order to build async/sync hybrid models we need the
    non-blocking i/o abstraction.

- timed reads and writes (for blocking cases)

    This is part of my jihad against asynchronous notification.

- i/o filtering or layering

    Yet another Holy Grail of computing. But I digress. These are
    hard when you take into consideration non-blocking i/o -- you have
    to keep lots of state. I expect our core filters will all support
    non-blocking i/o, well at least the ones I need to make sure we kick
    ass on benchmarks. A filter can deny a switch to non-blocking mode,
    the server will have to recover gracefully (ha).

- copy-avoidance

    Hey what about zero copy a la IO-Lite? After having experienced it
    in a production setting I'm no longer convinced of its benefits.
    There is an enormous amount of overhead keeping lists of buffers,
    and reference counts, and cleanup functions, and such which requires
    a lot of tuning to get right. I think there may be something here,
    but it's not a cakewalk.

    What I do know is that the heuristics I put into apache-1.3 to choose
    writev() at times are almost as good as what you can get from doing
    full zero-copy in the cases we *currently* care about. To put it
    another way, let's wait another generation to deal with zero copy.

    But sendfile/transmitfile/etc. those are still interesting.

    So instead of listing "zero copy" as a property, I'll list
    "copy-avoidance".

So far?

- ap_bungetc added
- ap_blookc changed to return the character, rather than take a char *buff
- in theory, errno is always useful on return from a BUFF routine
- ap_bhalfduplex, B_SAFEREAD will be re-implemented using a layer I think
- chunking gone for now, will return as a layer
- ebcdic gone for now...
it should be a layer

- ap_iol.h defined, first crack at the layers...

    Step back a second to think on it. Much like we have fread(3)
    and read(2), I've got a BUFF and an ap_iol abstraction. An ap_iol
    could use a BUFF if it requires some form of buffering, but many
    won't require buffering... or can do a better job themselves.

    Consider filters such as:
        - ebcdic -> ascii
        - encryption
        - compression
    These all share the property that no matter what, they're going to make
    an extra copy of the data. In some cases they can do it in place (read)
    or into a fixed buffer... in most cases their buffering requirements
    are different than what BUFF offers.

    Consider a filter such as chunking. This could actually use the writev
    method to get its job done... depends on the chunks being used. This
    is where zero-copy would be really nice, but we can get by with a few
    heuristics.

    At any rate -- the NSPR folks didn't see any reason to include a
    buffered i/o abstraction on top of their layered i/o abstraction... so
    I feel like I'm not the only one who's thinking this way.

- iol_unix.c implemented... should hold us for a bit

From dgaudet@arctic.org Mon Jun 28 19:06:50 1999
From: dgaudet@arctic.org (Dean Gaudet)
Newsgroups: en.lists.apache-new-httpd
Subject: Re: async routines
Date: 28 Jun 1999 17:33:24 +0200
Reply-To: new-httpd@apache.org

[hope you don't mind me cc'ing new-httpd zach, I think others will be
interested.]

On Mon, 28 Jun 1999, Zach Brown wrote:

> so dean, I was wading through the mpm code to see if I could munge the
> sigwait stuff into it.
>
> as far as I could tell, the http protocol routines are still blocking.
> what does the future hold in the way for async routines? :)  I basically
> need a way to do something like..

You're still waiting for me to get the async stuff in there... I've done
part of the work -- the BUFF layer now supports non-blocking sockets.

However, the HTTP code will always remain blocking.  There's no way I'm
going to try to educate the world in how to write async code... and since
our HTTP code has arbitrary call outs to third party modules... It'd
have a drastic effect on everyone to make this change.

But I honestly don't think this is a problem.  Here are my observations:

All the popular HTTP clients send their requests in one packet (or two
in the case of a POST and netscape).  So the HTTP code would almost
never have to block while processing the request.  It may block while
processing a POST -- something which someone else can worry about later;
my code won't be any worse than what we already have in apache.  So
any effort we put into making the HTTP parsing code async-safe would
be wasted on the 99.9% case.

Most responses fit in the socket's send buffer, and again don't require
async support.  But we currently do the lingering_close() routine which
could easily use async support.  Large responses also could use async
support.

The goal of HTTP parsing is to figure out which response object to
send.  In most cases we can reduce that to a bunch of common response
types:

- copying a file to the socket
- copying a pipe/socket to the socket (IPC, CGIs)
- copying a mem region to the socket (mmap, some dynamic responses)

So what we do is we modify the response handlers only.  We teach them
about how to send async responses.

There will be a few new primitives which will tell the core "the response
fits one of these categories, please handle it".
The core will do the rest -- and for MPMs which support async handling,
the core will return to the MPM and let the MPM do the work async... the
MPM will call a completion function supplied by the core.  (Note that
this will simplify things for lots of folks... for example, it'll let us
move range request handling to a common spot so that more than just
default_handler can support it.)

I expect this to be a simple message passing protocol (pass by
reference).  Well rather, that's how I expect to implement it in ASH --
where I'll have a single thread per-process doing the select/poll stuff;
and the other threads are in a pool that handles the protocol stuff.
For your stuff you may want to do it another way -- but we'll be using a
common structure that the core knows about... and that structure will
look like a message:

    struct msg {
        enum {
            MSG_SEND_FILE,
            MSG_SEND_PIPE,
            MSG_SEND_MEM,
            MSG_LINGERING_CLOSE,
            MSG_WAIT_FOR_READ,  /* for handling keep-alives */
            ...
        } type;

        BUFF *client;

        void (*completion)(struct msg *, int status);

        union {
            ... extra data here for whichever types need it ...;
        } x;
    };

The nice thing about this is that these operations are protocol
independent... at this level there's no knowledge of HTTP, so the same
MPM core could be used to implement other protocols.

> so as I was thinking about this stuff, I realized it might be neat to have
> 'classes' of non blocking pending work and have different threads with
> different priorities hacking on it.  Say we have a very high priority
> thread that accepts connections, does initial header parsing, and
> sendfile()ing data out.  We could have lower priority threads that are
> spinning doing 'harder' BUFF work like an encryption layer or gzipping
> content, whatever.

You should be able to implement this in your MPM easily I think... because
you'll see the different message types and can distribute them as needed.

Dean
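
[A minimal sketch, in C, of how an MPM might consume the message structure
described in the mail above.  Only the struct msg fields and the MSG_*
names come from that mail.  Everything else is an assumption made for
illustration: the members of the union x, the work_class enum,
mpm_classify(), mpm_run_sync() and record_completion() are hypothetical
names, not Apache APIs, and the split of message types into "fast" and
"slow" classes is an arbitrary policy.]

```c
#include <assert.h>
#include <stddef.h>

/* BUFF is Apache's buffered I/O type; it is left opaque in this sketch. */
typedef struct buff_struct BUFF;

/* Message types named in the mail above. */
enum msg_type {
    MSG_SEND_FILE,
    MSG_SEND_PIPE,
    MSG_SEND_MEM,
    MSG_LINGERING_CLOSE,
    MSG_WAIT_FOR_READ          /* for handling keep-alives */
};

struct msg {
    enum msg_type type;
    BUFF *client;
    void (*completion)(struct msg *, int status);
    union {                    /* hypothetical per-type payloads */
        struct { int fd; long offset; long length; } file;
        struct { int fd; } pipe;
        struct { const void *addr; size_t length; } mem;
    } x;
};

/* Hypothetical work classes an MPM could map to thread priorities. */
enum work_class { WORK_FAST, WORK_SLOW };

/* Arbitrary policy for illustration: file and memory sends are "fast",
 * anything that waits on a peer is "slow". */
enum work_class mpm_classify(const struct msg *m)
{
    switch (m->type) {
    case MSG_SEND_FILE:
    case MSG_SEND_MEM:
        return WORK_FAST;
    case MSG_SEND_PIPE:
    case MSG_LINGERING_CLOSE:
    case MSG_WAIT_FOR_READ:
        return WORK_SLOW;
    }
    return WORK_SLOW;          /* unknown types take the slow path */
}

/* Stand-in for a synchronous MPM: it performs no real I/O, pretends the
 * operation succeeded, and hands the message back through the completion
 * function.  An async MPM would queue the message and call completion
 * later, from whichever thread finished the work. */
void mpm_run_sync(struct msg *m)
{
    assert(m != NULL && m->completion != NULL);
    m->completion(m, 0);       /* 0 means success in this sketch */
}

/* Example completion callback that only records what it was told. */
int completed_count = 0;
int last_status = -1;

void record_completion(struct msg *m, int status)
{
    (void)m;
    completed_count++;
    last_status = status;
}
```

[A response handler would fill in a struct msg, point completion at its own
callback, and hand the message to the MPM.  Because the message carries
only a type, a client BUFF and a callback, nothing in this path knows
about HTTP, which is the protocol independence the mail describes.]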