From dgaudet-list-new-httpd@arctic.org Mon Mar 13 11:11:15 2000
From: dgaudet-list-new-httpd@arctic.org (Dean Gaudet)
Subject: Re: Buff should be an I/O layer
Date: 12 Mar 2000 19:24:28 +0100

On Fri, 10 Mar 2000, Manoj Kasichainula wrote:

> Random thought that I thought should go to the list before I forget it:
>
> BUFF's API should be redone to look exactly like an IOL.  As far as the
> rest of the Apache code is concerned, an IOL with a BUFF around it
> should look just like any other IOL.

haha!  if you figure this out, then rad.  i tried for a long time to
figure out a clean way to do this which doesn't suck and i never found
one.  remember it is totally unacceptable for bputc() and bgetc() to be
anything other than macros operating directly on the buffer.

> First of all, we get more uniformity in the API.  That's always a good
> thing.  This also allows us to yank out the buff IOL sometimes.  I can
> see this being useful if a really sophisticated module wants to truly
> eliminate the buffering between it and the client.

the BUFF layer allows you to run without buffering, or with non-blocking
i/o, and it implements chunking.  you gain a mere few cycles by "yanking
it out".  you might say "chunking should be a layer too"; see my above
laughter.

the only way you're going to make a change like this "clean" is to add a
sophisticated zero-copy implementation.  i stopped short of doing this
because, in my experience using one of these, the benefits are really
minimal.  look in libstash.

the only case where i've seen the zero-copy stuff really shine is when
doing TCP-to-TCP proxying.  for everything else, the (comparatively
simple) heuristics present in BUFF are all that's required.

Dean
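[Editor's sketch: what "BUFF as an IOL" would look like as a method
table.  Only ap_iol_methods and a proposed shutdown() entry are
mentioned in these mails; the member names below are assumptions, and
the closing comment restates why Dean considers the macro requirement a
problem.]

/* ap_status_t and ap_ssize_t are the Apache types used elsewhere in
 * this thread. */
typedef struct ap_iol_methods_sketch {
    ap_status_t (*read)(void *ctx, char *buf, ap_ssize_t len,
                        ap_ssize_t *nbytes);
    ap_status_t (*write)(void *ctx, const char *buf, ap_ssize_t len,
                         ap_ssize_t *nbytes);
    ap_status_t (*flush)(void *ctx);
    ap_status_t (*shutdown)(void *ctx, int how);  /* proposed later in this thread */
} ap_iol_methods_sketch;

/* The catch Dean points at: bputc()/bgetc() must stay macros that poke
 * the BUFF's internal buffer directly.  Routing every byte through
 * methods->write() would add an indirect function call per character,
 * which is exactly the cost a clean BUFF-as-IOL design has to avoid. */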
From dgaudet-list-new-httpd@arctic.org Mon Mar 13 11:11:44 2000
From: dgaudet-list-new-httpd@arctic.org (Dean Gaudet)
Subject: Re: Buff should be an I/O layer
Date: 12 Mar 2000 19:24:30 +0100

On Sun, 12 Mar 2000, Dean Gaudet wrote:

> the only way you're going to make a change like this "clean" is to add a

i should clarify this... the only *portable* way is blah blah blah.

if all unixes supported TCP_CORK, and had very inexpensive syscall
overhead like linux does, then we wouldn't have to do much work at all --
we could take advantage of the fact that the kernel generally has to do a
single copy of all the bytes anyhow.

TCP_CORK, for those not aware of it, is a very much needed correction to
the TCP API.  specifically, the traditional API gives us two options:

- nagle on: the kernel makes somewhat arbitrary decisions as to where to
  form your TCP packet boundaries.  you might get lucky and two small
  writes will be combined into one packet... or you might get unlucky and
  your packets will be delayed, causing performance degradation.

- nagle off: each write() can, and usually does, cause a packet boundary.

it's pretty much the case that no matter which option you choose it
results in performance degradation.

with TCP_CORK the kernel is permitted to send any complete frames, but
can't send any final partial frames until the cork is removed.  this lets
user applications use write(), which is *far* more natural to use than
writev()... writev() is essentially an optimisation for kernels with
expensive syscall overhead :)

i think TCP_CORK is still unique to linux.  they added it when they were
implementing sendfile() and i pointed out the packet boundary problems
and asked for this new api... most other sendfile() implementations are
kludges bundling up sendfile and writev for the headers and trailers.

Dean

From dgaudet-list-new-httpd@arctic.org Tue Apr 11 08:16:13 2000
From: dgaudet-list-new-httpd@arctic.org (dean gaudet)
Subject: SAFEREAD (was Re: Buff should be an I/O layer)
Date: 11 Apr 2000 07:12:07 +0200

On Mon, 10 Apr 2000, dean gaudet wrote:

> if you do that against apache-1.3 and the current apache-2.0 you get
> back maximally packed packets.

heh, no, 2.0 is broken.  i broke SAFEREAD during the initial mpm work --
and it hasn't been re-implemented yet.  does someone else want to fix
this?  it's probably not ideal that i'm the only person intimately
familiar with this code :)

without SAFEREAD, we end up with a packet boundary between every response
in a pipelined connection.

essentially saferead ensures that if we are going to have to block in
read() to get the next request on a connection then we had better flush
our output buffer (otherwise we cause a deadlock with non-pipelining
clients).  but if there are more bytes available, then we don't need to
flush our output buffer.

if you search the code for SAFEREAD you'll see i suggest that it might be
implemented as a layer.  i'm not sure what i meant; i don't think this
works.

if you look at 1.3's saferead and bhalfduplex in buff.c you'll see that
we use select() to implement it.  naturally this won't work in 2.0... but
what will work is to set the iol's timeout to 0 and attempt a read.
underneath the covers this achieves the same result.

-dean
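[Editor's sketch of the zero-timeout probe described above.  The BUFF
calls and the BO_TIMEOUT option mirror the lingering_close() example in
the next message; DEFAULT_TIMEOUT and the error handling are
assumptions, not the actual 2.0 API.]

static ap_status_t saferead(BUFF *fb, char *buf, ap_ssize_t nbyte,
                            ap_ssize_t *bytes_read)
{
    int timeout = 0;
    ap_status_t rv;

    /* probe: a timeout of 0 means "return at once if nothing is already
     * waiting in the kernel" -- this stands in for 1.3's select() */
    ap_bsetopt(fb, BO_TIMEOUT, &timeout);
    rv = ap_bread(fb, buf, nbyte, bytes_read);
    if (rv == APR_SUCCESS && *bytes_read > 0) {
        return rv;                  /* pipelined data was already there */
    }

    /* nothing is waiting: we are about to block for the next request,
     * so flush buffered output first to avoid deadlocking a
     * non-pipelining client (a real implementation would also
     * distinguish EOF and hard errors here) */
    ap_bflush(fb);

    timeout = DEFAULT_TIMEOUT;      /* hypothetical blocking timeout */
    ap_bsetopt(fb, BO_TIMEOUT, &timeout);
    return ap_bread(fb, buf, nbyte, bytes_read);
}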
From dgaudet-list-new-httpd@arctic.org Thu Apr 13 14:10:41 2000
From: dgaudet-list-new-httpd@arctic.org (dean gaudet)
Subject: Re: question about the STATUS entry for lingering close
Date: 11 Apr 2000 18:35:53 +0200

On Tue, 11 Apr 2000, Jeff Trawick wrote:

> >    * Fix lingering close
> >      Status:
>
> Does 2.0 regress one or more of the solutions in 1.3, or was some
> improvement (other than async I/O) envisioned?

this actually isn't about implementing SO_LINGER stuff... that should be
avoided at all costs -- SO_LINGER works on very few kernels.  actually
i'm tempted to say rip out all the SO_LINGER stuff.

we need lingering_close() re-implemented, it's in main/http_connection.c.
to do that we need to add a shutdown() method to ap_iol_methods, and i
suggest an ap_bshutdown() added to BUFF.  and then lingering_close()
needs to be re-implemented something like the code below.

-dean

/* we now proceed to read from the client until we get EOF, or until
 * MAX_SECS_TO_LINGER has passed.  the reasons for doing this are
 * documented in a draft:
 *
 *   http://www.ics.uci.edu/pub/ietf/http/draft-ietf-http-connection-00.txt
 *
 * in a nutshell -- if we don't make this effort we risk causing
 * TCP RST packets to be sent which can tear down a connection before
 * all the response data has been sent to the client.
 */
static void lingering_close(request_rec *r)
{
    char dummybuf[IOBUFFERSIZE];
    ap_time_t start;
    ap_ssize_t nbytes;
    ap_status_t rc;
    int timeout;

    /* Send any leftover data to the client, but never try to again */
    if (ap_bflush(r->connection->client) != APR_SUCCESS) {
        ap_bclose(r->connection->client);
        return;
    }

    /* XXX: hrm, setting B_EOUT should probably be part of ap_bshutdown() */
    ap_bsetflag(r->connection->client, B_EOUT, 1);
    if (ap_bshutdown(r->connection->client, 1) != APR_SUCCESS
        || ap_is_aborted(r->connection)) {
        ap_bclose(r->connection->client);
        return;
    }

    start = ap_now();
    timeout = MAX_SECS_TO_LINGER;
    for (;;) {
        ap_bsetopt(r->connection->client, BO_TIMEOUT, &timeout);
        rc = ap_bread(r->connection->client, dummybuf, sizeof(dummybuf),
                      &nbytes);
        if (rc != APR_SUCCESS) break;

        /* how much time has elapsed?
         */
        timeout = (ap_now() - start) / AP_USEC_PER_SEC;
        if (timeout >= MAX_SECS_TO_LINGER) break;

        /* figure out the new timeout */
        timeout = MAX_SECS_TO_LINGER - timeout;
    }

    ap_bclose(r->connection->client);
}

From dgaudet-list-new-httpd@arctic.org Tue Mar 28 11:34:53 2000
From: dgaudet-list-new-httpd@arctic.org (Dean Gaudet)
Subject: canonical list of i/o layering use cases
Date: 28 Mar 2000 07:04:07 +0200

i really hope this helps this discussion move forward.

the following is the list of all applications i know of which have been
proposed to benefit from i/o layering:

- data sink abstractions:
  - memory destination (for ipc; for caching; or even for abstracting
    things such as strings, which can be treated as an i/o object)
  - pipe/socket destination
  - portability variations on the above

- data source abstractions, such as:
  - file source (includes proxy caching)
  - memory source (includes most dynamic content generation)
  - network source (TCP-to-TCP proxying)
  - database source (which is probably, under the covers, something like
    a memory source mapped from the db process on the same box, or from a
    network source on another box)
  - portability variations on the above sources

- filters:
  - encryption
  - translation (ebcdic, unicode)
  - compression
  - chunking
  - MUX
  - mod_include et al

and here are some of my thoughts on trying to further quantify filters:

a filter separates two layers and is both a sink and a source.  a filter
takes an input stream of bytes OOOO... and generates an output stream of
bytes which can be broken into blocks such as:

    OOO NNN O NNNNN ...

where O = an old or original byte copied from the input, and N = a new
byte generated by the filter.

for each filter we can calculate a quantity i'll call the copied-content
ratio, or CCR:

    CCR = nbytes_old / nbytes_new

where:

    nbytes_old = number of bytes in the output of the filter which are
                 copied from the input (in zero-copy this would mean
                 "copy by reference counting an input buffer")

    nbytes_new = number of bytes which are generated by the filter which
                 weren't present in the input

examples:

    CCR = infinity: who cares -- straight through with no transformation.
        the filter shouldn't even be there.

    CCR = 0: encryption, translation (ebcdic, unicode), compression.
        these get zero benefit from zero-copy.

    CCR > 0: chunking, MUX, mod_include

from the point of view of evaluating the benefit of zero-copy we only
care about filters with CCR > 0 -- because the CCR = 0 cases degenerate
into a single-copy scheme anyhow.

it is worth noting that the large_write heuristic in BUFF fairly clearly
handles zero-copy at very little overhead for CCRs larger than
DEFAULT_BUFSIZE.  what needs further quantification is what the CCR of
mod_include would be.

for a particular zero-copy implementation we can find some threshold k
where filters with CCRs >= k are faster with the zero-copy implementation
and CCRs < k are slower... faster/slower as compared to a baseline
implementation such as the existing BUFF.

it's my opinion that when you consider the data sources listed above, and
the filters listed above, that *in general* the existing BUFF heuristics
are faster than a complete zero-copy implementation.

you might ask how this jibes with published research such as the IO-Lite
stuff?  well, when it comes right down to it, the research in the IO-Lite
papers deals with very large CCRs and contrasts them against a naive
buffering implementation such as stdio -- they don't consider what a few
heuristics such as apache's BUFF can do.

Dean
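[Editor's sketch: instrumenting a filter to measure the copied-content
ratio defined above.  The struct and counters are hypothetical -- a real
filter would bump them wherever it forwards input bytes by reference
(nbytes_old) or emits bytes of its own (nbytes_new).]

#include <math.h>
#include <stdint.h>

struct ccr_counters {
    uint64_t nbytes_old;   /* output bytes copied/referenced from the input */
    uint64_t nbytes_new;   /* output bytes generated by the filter itself */
};

static double ccr(const struct ccr_counters *c)
{
    if (c->nbytes_new == 0)
        return HUGE_VAL;   /* "infinite" CCR: a pure pass-through filter */
    return (double) c->nbytes_old / (double) c->nbytes_new;
}

/* worked example: chunking an 8192-byte block adds only the chunk-size
 * line and CRLFs (on the order of ten new bytes), so its CCR is in the
 * hundreds; an encrypting filter rewrites every byte, so nbytes_old is
 * 0 and its CCR is 0. */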
From dgaudet-list-new-httpd@arctic.org Tue Mar 28 11:35:26 2000
From: dgaudet-list-new-httpd@arctic.org (Dean Gaudet)
Subject: Re: canonical list of i/o layering use cases
Date: 28 Mar 2000 07:04:14 +0200

On Mon, 27 Mar 2000, Dean Gaudet wrote:

>     CCR = infinity: who cares -- straight through with no
>         transformation.  the filter shouldn't even be there.

thanks to ronald for pointing out CCR = infinity filters -- hash
calculations.  thankfully they're trivial to handle without a full
zero-copy implementation, so i still stand by my assertions.

please do submit more filters/sources/sinks i haven't considered yet.

Dean

From dgaudet-list-new-httpd@arctic.org Tue Mar 28 11:38:30 2000
From: dgaudet-list-new-httpd@arctic.org (Dean Gaudet)
Subject: Re: layered I/O (was: cvs commit: ...)
Date: 28 Mar 2000 07:02:15 +0200

On Mon, 27 Mar 2000, Roy T. Fielding wrote:

> Whatever we do, it needs to be clean enough to enable later performance
> enhancements along the lines of zero-copy streams.

good luck, i think it's impossible.  any code written to a read/write
interface is passing around buffers with no reference counts, and the
recipients of those buffers (the i/o layers) must immediately copy the
contents before returning to the caller.  zero-copy requires reference
counted buffers.  therefore any future zero-copy enhancement would
require substantial code changes to the callers -- the modules.

btw -- the code, as is, already supports zero-copy for those cases where
it's actually a win... the cases where bwrite() is called with a large
enough buffer, and we're able to pass it to the kernel immediately.

i honestly believe there are very few applications which benefit from
zero-copy.  encryption and compression obviously don't; they require a
copy.  what other layers would there be in the stack?

a mod_include-type filter which was doing zero-copy would probably be
slower than it is now...
that'd be using zero-copy to pass little bits and pieces of strings, the
bits and pieces which were unchanged by the filter.  zero-copy has all
this overhead in maintaining the lists of buffers and the reference
counts... more overhead in that than in the rather simple heuristics
present in BUFF.

a MUX layer might benefit from zero-copy... but after doing lots of
thinking on this a year ago i remain completely unconvinced that
zero-copy from end to end is any better than the heuristics we already
have... and the answer is different depending on how parallel mux
requests are serviced (whether by threads or by an async core).

there is one application i know of that benefits from zero-copy -- and
that is TCP-to-TCP tunnelling.  but even here, the biggest win i've seen
is not from the zero-copy per se, as much as it is from wins you can get
in reducing the need to have a bunch of 4k userland buffers for each
socket.

zero-copy is a very nice theoretical toy.  i'm still waiting for a good
demonstration or use case.  i hope you don't hold up improvements in
apache while this research is going on.

Dean

From dgaudet-list-new-httpd@arctic.org Tue Mar 28 11:40:37 2000
From: dgaudet-list-new-httpd@arctic.org (Dean Gaudet)
Subject: Re: layered I/O (was: cvs commit: ...)
Date: 28 Mar 2000 07:04:08 +0200

On Sun, 26 Mar 2000, Greg Stein wrote:

> Below, you talk about doing this without performance implications.
> Well, that loop is one that you've added. :-)

it's very easy to optimize the loop further -- by hashing the strings
which run the direct matches.

it's really helpful to consider a simple example: accessing foo.cgi,
which generates "Content-Type: text/x-parsed-html", which requires
mod_include to run.

to run foo.cgi, r->handler is set to "cgi-handler", which, assuming we do
the hash right, is picked off immediately and run without looping.

then r->content_type is updated and set to "text/x-parsed-html", and
again, if we've done the hash right, it's picked off immediately without
looping.

Dean
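[Editor's sketch of the hashed dispatch Dean describes: handler and
content-type strings map directly to the code that services them, so
setting r->handler = "cgi-handler" or r->content_type =
"text/x-parsed-html" is picked off in one lookup rather than by scanning
every handler.  The table, the run_* functions and the linear lookup
standing in for a real hash are all illustrative assumptions.]

#include <string.h>

typedef int (*handler_fn)(request_rec *r);

int run_cgi(request_rec *r);          /* hypothetical handlers, */
int run_mod_include(request_rec *r);  /* defined elsewhere      */

static const struct {
    const char *key;     /* handler name or content type */
    handler_fn  func;
} handler_table[] = {
    { "cgi-handler",        run_cgi },
    { "text/x-parsed-html", run_mod_include },
};

static handler_fn find_handler(const char *key)
{
    size_t i;

    /* a real implementation would hash the key; a tiny table keeps the
     * sketch short */
    for (i = 0; i < sizeof(handler_table) / sizeof(handler_table[0]); i++) {
        if (strcmp(handler_table[i].key, key) == 0)
            return handler_table[i].func;
    }
    return NULL;
}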
From bhyde@pobox.com Wed Mar 29 08:28:32 2000
From: bhyde@pobox.com (Ben Hyde)
Subject: Re: canonical list of i/o layering use cases
Date: 28 Mar 2000 19:26:56 +0200

Dean Gaudet writes:
> please do submit more filters/sources/sinks i haven't considered yet.

:-)  This discussion flares up from time to time, doesn't it!

The last time this flared up I came to the opinion that there is a knot
here worth illuminating, but I can't recall if I bothered to say it out
loud here.

The inside of the knot is: planning, authorizing, executing.

The request arrives and the beast has to assemble a plan for how to
respond.  These plans can, and ought to be, reasonably ornate; at least a
tree.  The primitive nodes are things like stream this file, do this
character set conversion, etc.  The slightly more complex nodes do things
like store results in caches, and assemble bits into bundles, etc.  The
core ought to provide a way to manipulate these plans.  The types of
nodes and the operation sets on them should be provided by modules.

Given the plan, the problem is then to decide whether it is appropriate
to execute it, i.e. do we have all the rights we need?  This is a mess
because it needs to draw information from three domains: the client's
credentials, the process/thread rights, and the protection configuration
on the named objects that are inputs to the plan.

Finally, execution.  The execution is where we start wanting terrific
efficiencies -- zero copy, clever caching, kernel hackery, leveraging
O/S-specific delights like sendfile.

The outside of the knot of plan/auth/exec is the necessity of letting all
three phases spread across machines, processes, threads, code cult, and
projects.  (This is why that www.spread.org work is so right on.)

I suspect it was at about this point in my thinking that I became happy
that this was slipping outside the scope of 2.0, where it could stew
longer.

 - ben

From fielding@kiwi.ICS.UCI.EDU Wed Mar 29 14:05:46 2000
From: fielding@kiwi.ICS.UCI.EDU ("Roy T. Fielding")
Subject: Re: layered I/O (was: cvs commit: ...)
Date: 29 Mar 2000 07:58:05 +0200

>> Bite me.  You're being totally confrontational here, and it isn't
>> called for.  You didn't read my statements.  I said no change to
>> generators, and definite changes to processors.
>
> no it is called for.
>
> this item has been on the bullet list for as long as apache-2.0 has
> been a wet dream.
>
> there's been a fuckload of hand waving over the years, and *no code*.
> code speaks reams more in my book than anything else.
>
> so far all i'm seeing is more hand waving, and cries that this isn't
> the wet dream folks thought it would be.
>
> welcome to reality.  if you folks would stop waving your hands and
> actually try to code it up you'd probably understand our proposed
> solution.

Dean, this is crossing my tolerance threshold for bullshit.

You haven't even looked at the code that was committed.  If you had, you
would notice that it doesn't implement IO-layering.  What it implements
is IO-relaying and an infinite handler loop.  This isn't handwaving.
The code simply doesn't do IO-layering.  Period.

Layered-IO involves a cascaded sequence of filters that independently
operate on a continuous stream in an incremental fashion.
Relayed-IO is a sequence of processing entities that opportunistically
operate on a unit of data to transform it to some other unit of data,
which can then be made available again to the other processing entities.

The former is called a pipe-and-filter architecture, and the latter is
called a blackboard architecture, and the major distinctions between the
two are:

   1) in layered-IO, the handlers are identified by construction
      of a data flow network, whereas in relayed-IO the handlers
      simply exist in a "bag of handlers" and each one is triggered
      based on the current data state;

   2) in layered-IO, the expectation is that the data is processed
      as a continuous stream moving through handlers, whereas in
      relayed-IO the data is operated upon in complete units and
      control is implicitly passed from one processor to the next;

   3) in layered-IO, data processing ends at the outer layer,
      whereas in relayed-IO it ends when the data reaches a special
      state of "no processing left to be done".

Yes, these two architectures are similar and can accomplish the same
tasks, but they don't have the same performance characteristics and they
don't have the same configuration interface.  And, perhaps most
significantly, relayed-IO systems are not as reliable because it is very
hard to anticipate how processing will occur and very easy for the system
to become stuck in an infinite loop.

I don't want a blackboard architecture in Apache, regardless of the
version of the release or how many users might be satisfied by the
features it can implement.  It is unreliable and hard to maintain and
adds too much latency to the response processing.  But if somebody else
really wants such an architecture, and they understand its implications,
then I won't prevent them from going with that solution -- I just don't
want them thinking it is what we meant by layered-IO.

....Roy

From fielding@kiwi.ICS.UCI.EDU Wed Mar 29 14:06:34 2000
From: fielding@kiwi.ICS.UCI.EDU ("Roy T. Fielding")
Subject: Re: layered I/O (was: cvs commit: ...)
Date: 29 Mar 2000 07:58:15 +0200

> On Mon, 27 Mar 2000, Roy T. Fielding wrote:
>
>> Whatever we do, it needs to be clean enough to enable later performance
>> enhancements along the lines of zero-copy streams.
>
> good luck, i think it's impossible.  any code written to a read/write
> interface is passing around buffers with no reference counts, and the
> recipients of those buffers (the i/o layers) must immediately copy the
> contents before returning to the caller.  zero-copy requires reference
> counted buffers.  therefore any future zero-copy enhancement would
> require substantial code changes to the callers -- the modules.

Yes, assuming we did zero copies.  We could still do the interim thing
with one-copy.  I agree that taking advantage of the higher performance
would require that the callers be attuned to the high-performance
interface.
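[Editor's sketch of the reference-counted buffer that "zero-copy
requires", per Dean's point quoted above.  The names and layout are
illustrative, not any Apache or APR structure: a layer that wants to
keep bytes past its return holds a reference instead of copying, and the
storage is freed when the last holder releases it.]

#include <stdlib.h>
#include <string.h>

typedef struct refbuf {
    char  *data;
    size_t len;
    int    refcount;
} refbuf;

static refbuf *refbuf_make(const char *bytes, size_t len)
{
    refbuf *b = malloc(sizeof(*b));
    b->data = malloc(len);
    memcpy(b->data, bytes, len);    /* the one copy, at the data source */
    b->len = len;
    b->refcount = 1;
    return b;
}

static refbuf *refbuf_hold(refbuf *b)
{
    b->refcount++;                  /* pass downstream without copying */
    return b;
}

static void refbuf_release(refbuf *b)
{
    if (--b->refcount == 0) {
        free(b->data);
        free(b);
    }
}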
> btw -- the code, as is, already supports zero-copy for those cases where
> it's actually a win... the cases where bwrite() is called with a large
> enough buffer, and we're able to pass it to the kernel immediately.
>
> i honestly believe there are very few applications which benefit from
> zero-copy.  encryption and compression obviously don't; they require a
> copy.  what other layers would there be in the stack?
>
> a mod_include-type filter which was doing zero-copy would probably be
> slower than it is now... that'd be using zero-copy to pass little bits
> and pieces of strings, the bits and pieces which were unchanged by the
> filter.  zero-copy has all this overhead in maintaining the lists of
> buffers and the reference counts... more overhead in that than in the
> rather simple heuristics present in BUFF.

On the contrary, the vast majority of include-based templates consist of
large chunks of HTML with embedded separators.  With zero-copy you can
just split the data into three buckets by reference and replace the
middle bucket with the included content, which may itself be a data
stream.  Not only does this reduce memory consumption, it also removes
almost all of the special-case handling of data sources via
subrequests/caching/proxy/whatever and vastly simplifies the
SSI/PHP/whatever processing architecture.

> a MUX layer might benefit from zero-copy ... but after doing lots of
> thinking on this a year ago i remain completely unconvinced that
> zero-copy from end to end is any better than the heuristics we already
> have... and the answer is different depending on how parallel mux
> requests are serviced (whether by threads or by an async core).

The place where I need zero copy is in the request processing, where the
first read off the network may result in multiple requests being placed
within the same buffer.  I don't want the initial request to be copied
into separate buffers, since I still consider that initial copy to be
more overhead than all of the reference counting combined.  Maybe I'm
just being too pessimistic about the cost of a data copy, and I should
optimize around one-copy instead.

....Roy

From jwbaker@acm.org Wed Mar 29 14:07:15 2000
From: jwbaker@acm.org ("Jeffrey W. Baker")
Subject: Re: layered I/O (was: cvs commit: ...)
Date: 29 Mar 2000 13:45:26 +0200

On Tue, 28 Mar 2000, Roy T. Fielding wrote:

[ed]

> Layered-IO involves a cascaded sequence of filters that independently
> operate on a continuous stream in an incremental fashion.  Relayed-IO
> is a sequence of processing entities that opportunistically operate
> on a unit of data to transform it to some other unit of data, which
> can then be made available again to the other processing entities.
> The former is called a pipe-and-filter architecture, and the latter
> is called a blackboard architecture, and the major distinctions between
> the two are:
>
>    1) in layered-IO, the handlers are identified by construction
>       of a data flow network, whereas in relayed-IO the handlers
>       simply exist in a "bag of handlers" and each one is triggered
>       based on the current data state;
>
>    2) in layered-IO, the expectation is that the data is processed
>       as a continuous stream moving through handlers, whereas in
>       relayed-IO the data is operated upon in complete units and
>       control is implicitly passed from one processor to the next;
>
>    3) in layered-IO, data processing ends at the outer layer,
>       whereas in relayed-IO it ends when the data reaches a special
>       state of "no processing left to be done".

Forgive me for jumping in here.  Sometimes those of us who are merely
observers of the core group do not get a perspective on the design
discussions that take place in private emails and in person.  Thus, what
I have to say will largely rehash what has been said already.

It seems to me that a well-rounded IO-layering system has already been
proposed here, in bits and pieces, by different people, over the course
of many threads.  The components of the system are:

1. a routine to place the IO layers in the proper order,
2. a routine to send data between IO layers, and
3. the layers themselves.

Selection of IO Layers

The core selects a source module and IO layers based on the urlspace
configuration.  Content might be generated by mod_perl, and the result is
piped through mod_chunk, mod_ssl, and mod_net, in turn.  When the content
generator runs, the core enforces that the module set the content type
before the first call to ap_bput.  The content type is set by a function
call.  The function (ap_set_content_type(request_rec *, char *)) examines
the content type and adds IO layers as necessary.  For server-parsed
html, the core might insert mod_include immediately after mod_perl.

(Can anyone produce a use case where the IO chain could change after
output begins?)

Interface Between IO Layers

The core is responsible for marshalling data between the IO layers.  Each
layer registers a callback function (ap_status_t (*)(request_rec *,
buff_vec *)) on which it receives input.  Data is sent to the next layer
using ap_bput(request_rec *, buff_vec *).  The buff_vec is simply an
ordered array of address and length pairs.  Whenever ap_bput is called,
the input callback of the next layer is called.  No message queueing,
async handlers, or any of that business is needed.  ap_bput keeps track
of where in the output chain things are.  Control flow in this system
tends to yo-yo up and down the IO chain.  Examples later.

The only other part of the IO interface is a flush routine.  The IO
layers are free to implement whatever they feel flushing involves.

There are two notable things about this system.  First, control flow need
not ever reach the end of the output chain.  Any layer is free to return
without calling ap_bput.  The layers can do whatever they please with the
data.  The network module would be such an example: it would always write
the buffers over the network, and never pass them down the IO chain.  If
mod_ssl wanted to handle networking itself, it could do that, too.

The second notable thing is that once a buffer has been sent down the
chain, it is gone forever.  Later layers are responsible for freeing the
memory and whatnot.  Diddling in a buffer that has already been sent
would be bad form.
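[Editor's sketch of the interface described above.  Nothing here is real
Apache code: buff_vec, ap_bput() and the callback shape are Jeff's
proposal, fleshed out only enough to show the yo-yo control flow; the
current_layer() lookup is a hypothetical helper.]

typedef struct {
    void   *addr;
    size_t  len;
} buff_chunk;

typedef struct {
    buff_chunk *chunks;   /* ordered array of address/length pairs */
    int         nchunks;
} buff_vec;

typedef ap_status_t (*io_layer_input_fn)(request_rec *r, buff_vec *vec);

struct io_layer {
    io_layer_input_fn  input;   /* receives data from the layer above */
    struct io_layer   *next;    /* layer below; NULL at the sink */
};

/* hypothetical: find the calling layer's position in r's output chain */
struct io_layer *current_layer(request_rec *r);

/* sending to the next layer is just a function call; a layer such as the
 * network sink simply never calls ap_bput() again */
ap_status_t ap_bput(request_rec *r, buff_vec *vec)
{
    struct io_layer *next = current_layer(r)->next;
    return next ? next->input(r, vec) : APR_SUCCESS;
}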
Layer Implementation

This system has implications for the design and implementation of the
layers.  Clearly, it would not be efficient to call ap_bput overly much.
Also, the IO layers must be re-entrant in the threaded MPMs, so they will
need some mechanism for storing module-specific state information in the
request context (think mod_include when an include directive spans
ap_bput calls).

There will be basically three types of layers: those that insert content
into the stream (chunking, SSI), those that replace the stream completely
(encryption, compression), and those that sink the stream (network).  The
layers all demonstrate minimal copying: the inserting layers merely move
the boundaries on the incoming buffers and insert a new buffer.  The
replacement layers have to create a new buffer and dealloc the old one,
but you can't avoid that in any case.  The sinks merely dealloc the
buffers, so no problems there.

Analysis by Example

I considered two examples when coming up with this design.  One is
content which is dynamically generated by mod_perl, filtered through SSI,
chunked, encrypted, and sent over the wire.  The other is fast static
content serving, where a module is blasting out pre-computed HTTP
responses a la SGI's 10x patches.

In the first situation, imagine that a 10 KB document is generated which
contains two include directives.  The include directives insert a
standard banner and the contents of a 40 KB file.  The generating module
outputs the data via one ap_set_content_type call and five separate
ap_bput calls.  To see the worst case, assume that both include
directives span ap_bput calls.  Assume that the included content does not
contain any include directives.

The IO chain initially looks like this:

    mod_perl->mod_chunk->mod_ssl->mod_net

After the content type is set, the chain changes:

    mod_perl->mod_include->mod_chunk->mod_ssl->mod_net

During the inclusion of the 40 KB file, mod_include allocates a series of
4 KB buffers, fills them from the file, and sends them down the chain (or
maybe it uses mmap).  The analysis is left to the reader, but the end
result is that ap_bput is called 50 times during the request phase.

Is that a lot?  Consider the amount of work being done, and the fact that
we have avoided all the overhead of using, for example, actual pipes, or
thread-safe queueing.  Calling functions in a single userland context is
known to be fast.  The number of calls could be reduced if mod_include
used a larger internal buffer, but at the expense of memory consumption
(or it could use mmap).  Note also that the number of ap_bput calls does
not translate into packets on the wire.  mod_net is free to do whatever
is optimal with respect to packet boundaries.

The second example represents high performance static content delivery.
The content-generating module has all headers and content cached or
mapped in memory.  The entire output phase is accomplished in a single
ap_bput call, and the networking module does The Right Thing to ensure
best network usage.

Am I rambling yet?  I'd like to get some opinions on this system, if
anybody feels it is significantly different from those already proposed.
I realize that I have waved my hands regarding actually deciding when to
use what IO layers and where, but I am confident that a logically
appealing system could be devised.

Regards,
Jeffrey

> Yes, these two architectures are similar and can accomplish the
> same tasks, but they don't have the same performance characteristics
> and they don't have the same configuration interface.
> And, perhaps most significantly, relayed-IO systems are not as reliable
> because it is very hard to anticipate how processing will occur and very
> easy for the system to become stuck in an infinite loop.
>
> I don't want a blackboard architecture in Apache, regardless of the
> version of the release or how many users might be satisfied by the
> features it can implement.  It is unreliable and hard to maintain and
> adds too much latency to the response processing.  But if somebody else
> really wants such an architecture, and they understand its implications,
> then I won't prevent them from going with that solution -- I just don't
> want them thinking it is what we meant by layered-IO.

From fielding@kiwi.ICS.UCI.EDU Wed Mar 29 14:07:48 2000
From: fielding@kiwi.ICS.UCI.EDU ("Roy T. Fielding")
Subject: Re: layered I/O (was: cvs commit: ...)
Date: 29 Mar 2000 13:45:30 +0200

> Selection of IO Layers
>
> The core selects a source module and IO layers based on the urlspace
> configuration.  Content might be generated by mod_perl, and the result
> is piped through mod_chunk, mod_ssl, and mod_net, in turn.  When the
> content generator runs, the core enforces that the module set the
> content type before the first call to ap_bput.  The content type is set
> by a function call.  The function (ap_set_content_type(request_rec *,
> char *)) examines the content type and adds IO layers as necessary.
> For server-parsed html, the core might insert mod_include immediately
> after mod_perl.

The problem with thinking of it that way is that, as Dean mentioned, the
output of one module may be filtered, and the filter may indicate that
content should be embedded from another URL, which turns out to be a CGI
script that outputs further parseable content.  In this instance, the
goal of layered-IO is to abstract away such behavior so that the instance
is processed recursively and thus doesn't result in some tangled mess of
processing code for subrequests.

Doing it requires that each layer be able to pass both data and metadata,
and have both data and metadata be processed at each layer (if desired),
rather than call a single function that would set the metadata for the
entire response.

My "solution" to that is to pass three interlaced streams -- data,
metadata, and meta-metadata -- through each layer.  The metadata streams
would point to a table of tokenized name-value pairs.  There are lots of
ways to do that, going back to my description of bucket brigades long
ago.  Basically, each block of memory would indicate what type of data it
carries, with metadata occurring in a block before the data block(s) that
it describes (just like chunk-size describes the subsequent chunk-data),
and the layers could be dynamically rearranged based on the metadata that
passed through them, in accordance with the purpose of the filter.
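[Editor's sketch of the "interlaced streams" idea described above: each
block in the stream carries a type tag, and metadata blocks (tokenized
name-value pairs) precede the data blocks they describe.  The names and
layout are illustrative only, not the bucket-brigade design itself.]

#include <stddef.h>

typedef enum {
    BLOCK_DATA,
    BLOCK_METADATA,
    BLOCK_META_METADATA
} block_type;

typedef struct stream_block {
    block_type           type;
    struct stream_block *next;
    union {
        struct { const char *bytes; size_t len; } data;
        struct { const char *name; const char *value; } meta;  /* tokenized pair */
    } u;
} stream_block;

/* A layer walks the list and reacts to metadata (say, a content-type
 * token) before it sees the data that the metadata describes -- just as
 * a chunk-size line describes the chunk-data that follows it -- and the
 * chain of layers can be rearranged when such a token passes through. */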
> (Can anyone produce a use case where the IO chain could change after
> output begins?)

Output is a little easier, but that is the normal case for input.  We
don't know what filters to apply to the request body until after we have
passed through the HTTP headers, and the HTTP message processor is itself
a filter in this model.

....Roy

From dgaudet@arctic.org Thu Nov 18 17:25:06 1999
From: dgaudet@arctic.org (Dean Gaudet)
Subject: Re: bucket brigades and IOL
Date: 18 Nov 1999 06:47:20 +0100

On Sat, 13 Nov 1999, Ben Laurie wrote:

> Also, the usual objections still apply - i.e. it is awkward to do things
> like searching for particular strings, since they may cross boundaries.
> I'm beginning to think that the right answer to this is to provide nice
> matching functions that know about the chunked structures, and last
> resort functions that'll glue it all back into one chunk...

yeah, we use a zero-copy library at criticalpath and we frequently run
into the case where we want to do some string-like operation on data in
the zero-copy data structure.  you end up having to either copy it to a
regular C string, or re-write all of the string functions.

consider flex... or a regex library... neither works terribly well in the
face of a zero-copy abstraction, because they don't have the equivalent
of writev()/readv().  but if apache had it, maybe we would see more
libraries start to adopt iovec-like interfaces... dunno.

Dean

From dgaudet@arctic.org Sat Nov 20 21:09:50 1999
From: dgaudet@arctic.org (Dean Gaudet)
Subject: Re: NO_WRITEV
Date: 20 Nov 1999 06:42:11 +0100

writev() allows us to reduce the number of packets on the network.  on
linux we could use TCP_CORK and get the same effect with less code...
too bad linux is the only unix so far with this nice functionality.

TCP_CORK is the "other useful" setting besides having nagle turned on...
with TCP_CORK, the kernel flushes any packets which fill an entire frame,
but holds partial packets until the socket is close()d or the cork is
removed.  in this way you can do multiple smaller write()s (or a write()
and a sendfile()) without causing small packets to go on the wire.

writev() may consume a small amount more cpu on the server, but it's my
opinion that this is a fine price to pay for fewer packets on the wire.

if you do choose to benchmark it, be sure to use slow modem clients, and
not fast lan clients... and give strong consideration to using client
latency as your metric rather than server cpu.

    http://www.w3.org/Protocols/HTTP/Performance/Pipeline.html

Dean
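[Editor's sketch of the TCP_CORK pattern described above (Linux-only).
The descriptor names and sizes are placeholders and error handling is
omitted; the point is that the small header write and the sendfile()
body end up packed into full frames.]

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <unistd.h>

static void send_response(int sock, int filefd, size_t filelen,
                          const char *hdrs, size_t hdrlen)
{
    int on = 1, off = 0;

    /* cork: complete frames may go out, a trailing partial frame is held */
    setsockopt(sock, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));

    write(sock, hdrs, hdrlen);              /* small write, no runt packet */
    sendfile(sock, filefd, NULL, filelen);  /* body follows in the same stream */

    /* uncork: any held partial frame is flushed, packed with the headers */
    setsockopt(sock, IPPROTO_TCP, TCP_CORK, &off, sizeof(off));
}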
On Fri, 19 Nov 1999, Eli Marmor wrote:

> Hello,
>
> Is there any benchmark or statistics about how much faster Apache is
> with writev in comparison to without it?  (i.e. a normal compilation
> under a platform supporting writev, vs. a compilation with
> "-DNO_WRITEV").
>
> If there is no official benchmark, can anybody estimate the difference
> (in percent) under a typical use of Apache?  Does it depend on the type
> of use (static pages vs. dynamic, SSI and other parsings vs. one block,
> etc.)?  Or the other bottlenecks?  (it makes sense that when your
> connection to the client is slow, you will not notice the difference
> between writing 2 buffers in two system calls or in one).
>
> If there is a big difference, does it mean that Apache for non-writev
> platforms (such as SCO, BeOS and Tandem) is slower and that these
> platforms are not recommended for Apache users?
>
> On the other hand, if the difference is very small (let's say lower
> than 1%), maybe the benefits don't deserve the price (a much more
> complex code, especially when the non-writev code must be supported
> also in the future because of the non-writev platforms).
>
> --
> Eli Marmor

From dgaudet@arctic.org Mon Jun 28 19:06:50 1999
From: dgaudet@arctic.org (Dean Gaudet)
Subject: Re: async routines
Date: 28 Jun 1999 17:33:24 +0200

[hope you don't mind me cc'ing new-httpd, zach, I think others will be
interested.]

On Mon, 28 Jun 1999, Zach Brown wrote:

> so dean, I was wading through the mpm code to see if I could munge the
> sigwait stuff into it.
>
> as far as I could tell, the http protocol routines are still blocking.
> what does the future hold in the way of async routines? :)  I basically
> need a way to do something like..

You're still waiting for me to get the async stuff in there... I've done
part of the work -- the BUFF layer now supports non-blocking sockets.

However, the HTTP code will always remain blocking.  There's no way I'm
going to try to educate the world in how to write async code... and since
our HTTP code has arbitrary call-outs to third party modules, it'd have a
drastic effect on everyone to make this change.  But I honestly don't
think this is a problem.  Here are my observations:

All the popular HTTP clients send their requests in one packet (or two in
the case of a POST and netscape).  So the HTTP code would almost never
have to block while processing the request.  It may block while
processing a POST -- something which someone else can worry about later;
my code won't be any worse than what we already have in apache.  So any
effort we put into making the HTTP parsing code async-safe would be
wasted on the 99.9% case.

Most responses fit in the socket's send buffer, and again don't require
async support.  But we currently do the lingering_close() routine, which
could easily use async support.  Large responses also could use async
support.
The goal of HTTP parsing is to figure out which response object to send.
In most cases we can reduce that to a bunch of common response types:

- copying a file to the socket
- copying a pipe/socket to the socket (IPC, CGIs)
- copying a mem region to the socket (mmap, some dynamic responses)

So what we do is we modify the response handlers only.  We teach them
about how to send async responses.  There will be a few new primitives
which will tell the core "the response fits one of these categories,
please handle it".  The core will do the rest -- and for MPMs which
support async handling, the core will return to the MPM and let the MPM
do the work async... the MPM will call a completion function supplied by
the core.  (Note that this will simplify things for lots of folks... for
example, it'll let us move range request handling to a common spot so
that more than just default_handler can support it.)

I expect this to be a simple message passing protocol (pass by
reference).  Well, rather, that's how I expect to implement it in ASH --
where I'll have a single thread per process doing the select/poll stuff,
and the other threads are in a pool that handles the protocol stuff.  For
your stuff you may want to do it another way -- but we'll be using a
common structure that the core knows about... and that structure will
look like a message:

struct msg {
    enum {
        MSG_SEND_FILE,
        MSG_SEND_PIPE,
        MSG_SEND_MEM,
        MSG_LINGERING_CLOSE,
        MSG_WAIT_FOR_READ,   /* for handling keep-alives */
        ...
    } type;
    BUFF *client;
    void (*completion)(struct msg *, int status);
    union {
        ... extra data here for whichever types need it ...;
    } x;
};

The nice thing about this is that these operations are protocol
independent... at this level there's no knowledge of HTTP, so the same
MPM core could be used to implement other protocols.

> so as I was thinking about this stuff, I realized it might be neat to
> have 'classes' of non-blocking pending work and have different threads
> with different priorities hacking on it.  Say we have a very high
> priority thread that accepts connections, does initial header parsing,
> and sendfile()ing data out.  We could have lower priority threads that
> are spinning doing 'harder' BUFF work like an encryption layer or
> gzipping content, whatever.

You should be able to implement this in your MPM easily, I think...
because you'll see the different message types and can distribute them as
needed.

Dean
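[Editor's sketch of how an async MPM might consume the messages Dean
describes: the core hands it a struct msg, the MPM drives the i/o from
its select/poll loop, and then invokes the completion callback.  The
function and its body are illustrative only, not real MPM code.]

static void mpm_handle_msg(struct msg *m)
{
    int status = 0;

    switch (m->type) {
    case MSG_SEND_FILE:
        /* register m->client's socket with the poll loop and push the
         * file out as the socket becomes writable */
        break;
    case MSG_LINGERING_CLOSE:
        /* read and discard until EOF or MAX_SECS_TO_LINGER, then close */
        break;
    case MSG_WAIT_FOR_READ:
        /* keep-alive: wake the core again when the next request arrives */
        break;
    default:
        break;
    }

    /* protocol-independent hand-back into the core */
    m->completion(m, status);
}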