From akosut@leland.Stanford.EDU Thu Jul 23 09:38:40 1998
Date: Sun, 19 Jul 1998 00:12:37 -0700 (PDT)
From: Alexei Kosut <akosut@leland.Stanford.EDU>
To: new-httpd@apache.org
Subject: Apache 2.0 - an overview
For those not at the Apache meeting in SF, and even for those who were,
here's a quick overview of (my understanding of) the Apache 2.0
architecture that we came up with. I present this to make sure that I have
it right, and to get opinions from the rest of the group. Enjoy.
1. "Well, if we haven't released 2.0 by Christmas of 1999, it won't
matter anyway."
A couple of notes about this plan: I'm looking at this right now from a
design standpoint, not an implementation one. If the plan herein were
actually coded as-is, you'd get a very inefficient web server. But as
Donald Knuth (Professor emeritus at Stanford, btw... :) points out,
"premature optimization is the root of all evil." Rest assured there are
plenty of ways to make sure Apache 2.0 is much faster than Apache 1.3.
Taking out all the "slowness" code, for example... :)
Also, the main ideas in this document come mostly from Dean Gaudet, Simon
Spero, Cliff Skolnick and a bunch of other people, from the Apache Group's
meeting in San Francisco, July 2 and 3, 1998. The other ideas come from
other people. I'm being vague because I can't quite remember. We should
have videotaped it. I've titled the sections of this document with quotes
from our meeting, but they are paraphrased from memory, so don't take them
too seriously.
2. "But Simon, how can you have a *middle* end?"
One of the main goals of Apache 2.0 is protocol independence (i.e.,
serving HTTP/1.1, HTTP-NG, and maybe FTP or gopher or something). Another
is to rid the server of the belief that everything is a file. Towards this
end, we divide the server up into three parts, the front end, the middle
end, and the back end.
The front end is essentially a combination of http_main and http_protocol
today. It takes care of all network and protocol matters, interpreting the
request, putting it into a protocol-neutral form, and (possibly) passing
it off to the rest of the server. This is approximately equivalent to the
part of Apache contained in Dean's flow stuff, and it also works very well
in certain non-Unix-like architectures such as clustered mainframes. In
addition, part of this front-end might be optionally run in kernel space,
giving a very fast server indeed...
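To make "protocol-neutral form" a bit more concrete, here is a rough
sketch in Python, purely for illustration. Nothing in it is actual Apache
code: NeutralRequest, parse_http and parse_gopher are made-up names, and
the two parsers ignore nearly everything a real front end would have to
handle. The only point is that two different protocols end up as the same
kind of object.

```python
from dataclasses import dataclass, field


@dataclass
class NeutralRequest:
    """Hypothetical protocol-neutral request passed on to the middle end."""
    action: str                                 # e.g. "retrieve"
    resource: str                               # name of the object wanted
    meta: dict = field(default_factory=dict)    # key/value metainformation


def parse_http(raw: bytes) -> NeutralRequest:
    """Toy HTTP front end: request line plus headers, no body handling."""
    head = raw.split(b"\r\n\r\n", 1)[0].decode("latin-1")
    lines = head.split("\r\n")
    method, target, _version = lines[0].split(" ", 2)
    meta = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        meta[name.strip().lower()] = value.strip()
    # Map the protocol-specific method onto a neutral action name.
    action = {"GET": "retrieve", "HEAD": "describe"}.get(method, method.lower())
    return NeutralRequest(action, target, meta)


def parse_gopher(raw: bytes) -> NeutralRequest:
    """Toy gopher front end: the selector string is the whole request."""
    selector = raw.split(b"\r\n", 1)[0].decode("latin-1")
    return NeutralRequest("retrieve", selector or "/", {})
```

Everything behind the front end would then only ever see a
NeutralRequest, whichever protocol the client happened to speak.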
The back end is what generates the content. At the back of the back end we
have backing stores (Cliff's term), which contain actual data. These might
represent files on a disk, entries in a database, CGI scripts, etc... The
back end also consists of other modules, which can alter the request in
various fashions. The objects the server acts on can be thought of (Cliff
again) as a filehandle and a set of key/value pairs (metainformation).
The modules are set up as filters that can alter either one of those,
stacking I/O routines onto the stream of data, or altering the
metainformation.
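The "filehandle plus key/value pairs" idea can be sketched roughly like
this (Python, for illustration only; Entity, UpperStream and the filter
names are invented here and are not a proposed API). One filter stacks an
I/O routine onto the stream, the other touches only the metainformation.

```python
import io


class Entity:
    """Hypothetical object the server acts on: a stream plus metainformation."""
    def __init__(self, stream, meta):
        self.stream = stream          # any object with a read() method
        self.meta = dict(meta)        # key/value metainformation


class UpperStream:
    """I/O routine stacked on another stream: upper-cases data as it is read."""
    def __init__(self, inner):
        self.inner = inner

    def read(self, size=-1):
        return self.inner.read(size).upper()


def file_store(data, content_type):
    """Stand-in backing store; a real one might wrap a disk file or database."""
    return Entity(io.BytesIO(data), {"content-type": content_type})


def upper_filter(entity):
    """Filter that alters the data stream and leaves the metainformation alone."""
    return Entity(UpperStream(entity.stream), entity.meta)


def language_filter(entity):
    """Filter that alters only the metainformation."""
    meta = dict(entity.meta)
    meta["content-language"] = "en"
    return Entity(entity.stream, meta)


def apply_filters(entity, filters):
    """Run the entity through each filter in order."""
    for f in filters:
        entity = f(entity)
    return entity
```

Because UpperStream wraps the underlying stream instead of reading it up
front, no data moves until something at the end of the chain calls
read().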
The middle end is what comes between the front and back ends. Think of
http_request. This section takes care of arranging the modules, backing
stores, etc... in such a way that the path of the request will result
in the correct entity being delivered to the front end and sent to the
client.
3. "I won't embarrass you guys with the numbers for how well Apache
performs compared to IIS." (on NT)
For a server that was designed to handle flat files, Apache does it
surprisingly poorly, compared with other servers that have been optimized
for it. And the performance for non-static files is, of course, worse.
While Apache is still more than fast enough for 95% of Web servers, we'd
be remiss to dismiss those other 5% (they're the fun ones anyway). Another
problem Apache has is its lack of a good, caching, proxy module.
Put these together, along with the work Dean has done with the flow and
mod_mmap_static stuff, and we realize the most important part of Apache
2.0: a built-in, all-pervasive, cache. Every part of the request process
will involve caching. In the path outlined above, between each layer of
the request, between each module, sits the cache, which can (when it is
useful) cache the response and its metainformation - including its
variance, so it knows when it is safe to give out the cached copy. This
gives every opportunity to increase the speed of the server by making sure
it never has to dynamically create content more than it needs to, and
renders accelerators such as Squid unnecessary.
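As a rough illustration of a cache that knows about variance, here is a
small Python sketch. It is an assumption-laden toy, not the design from
the meeting: VaryingCache and cached are made-up names, there is no
expiry or invalidation, and "variance" is reduced to a list of request
metainformation keys the response depends on.

```python
class VaryingCache:
    """Toy cache: per resource, remembers which request metainformation
    keys the response varies on and keeps one response per variant."""

    def __init__(self):
        self._entries = {}   # resource -> (vary_keys, {variant: response})

    @staticmethod
    def _variant(meta, vary_keys):
        # Only the keys the response varies on take part in the lookup.
        return tuple((k, meta.get(k)) for k in vary_keys)

    def lookup(self, resource, meta):
        entry = self._entries.get(resource)
        if entry is None:
            return None
        vary_keys, variants = entry
        return variants.get(self._variant(meta, vary_keys))

    def store(self, resource, meta, vary_keys, response):
        vary_keys = tuple(sorted(vary_keys))
        old = self._entries.get(resource)
        # Keep earlier variants only if the variance itself is unchanged.
        variants = old[1] if old and old[0] == vary_keys else {}
        variants[self._variant(meta, vary_keys)] = response
        self._entries[resource] = (vary_keys, variants)


def cached(cache, resource, meta, generate):
    """Return a cached response when it is safe to, else generate and store.

    generate(resource, meta) must return (response, vary_keys).
    """
    hit = cache.lookup(resource, meta)
    if hit is not None:
        return hit
    response, vary_keys = generate(resource, meta)
    cache.store(resource, meta, vary_keys, response)
    return response
```

A request that differs only in metainformation the response does not
vary on (a different user-agent, say) gets the cached copy; one that
differs in a key it does vary on causes the content to be generated
again.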
This also allows what I alluded to earlier: a kernel (or near-to-kernel)
based web server component, which could read the request, consult the
cache to find the requested object, and spit it back out, without so much
as an interrupt in the way. Of course, the rest of Apache (with all its
modules - it's generally a bad idea to let unknown, untrusted code insert
itself into the kernel) sits up in user-space, ready to handle any request
the micro-Apache can't.
A built-in cache also makes a real working HTTP/1.1 proxy server trivially
easy to write.
4. "Stop asking about backwards compatibility with the API. We'll write a
compatibility module... later."
If modules are as described above, then obviously they are very much
distinct from how Apache's current modules function. The only module
function that is similar to the current model is the handler, or backing
store, that actually provides the basic stream of data that the server
alters to produce a response entity.
The basic module's approach to its job is to stack a filter onto the
output. But it's better to think of the modules not as a stack that the
request flows through (a layer cake with cache icing between the layers),
but more of a mosaic (pretend I didn't use that word. I wrote collage. You
can't prove anything), with modules stuck onto various sides of the
request at different points, altering the request/response.
Today's Apache modules take an all-or-nothing approach to request
handlers. They tell Apache what they can do, overestimating, and then are
supposed to DECLINE if they don't pass a number of checks they are
supposed to make. Most modules don't do this correctly. The better
approach is to allow the modules to inform Apache exactly of what they can
do, and have Apache (the middle-end) take care of invoking them when
appropriate.
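One way to picture modules declaring exactly what they can do is the
Python sketch below. It is only an illustration under my own
assumptions: Registry, register and select are hypothetical names, and
matching on a path pattern plus an optional content type is just one
guess at what "exactly" might mean.

```python
from fnmatch import fnmatchcase


class Registry:
    """Hypothetical middle end: modules declare up front what they handle,
    so there is no need to invoke them just to have them decline."""

    def __init__(self):
        self._handlers = []   # (path_pattern, content_type or None, handler)

    def register(self, path_pattern, handler, content_type=None):
        """Declare that handler applies to matching paths and, if a
        content type is given, only to entities of that type."""
        self._handlers.append((path_pattern, content_type, handler))

    def select(self, path, content_type=None):
        """Return, in registration order, the handlers whose declared
        conditions match this request."""
        return [
            handler
            for pattern, ctype, handler in self._handlers
            if fnmatchcase(path, pattern) and ctype in (None, content_type)
        ]
```

With declarations like these, the middle end can work out the set of
modules for a request from the declarations alone and never has to
call one only to be told DECLINED.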
The final goal of all of this, of course, is simply to allow CGI output to
be parsed for server-side includes. But don't tell Dean that.
5. "Will Apache run without any of the normal Unix binaries installed,
only the BSD/POSIX libraries?"
Another major issue is, of course, configuration of the server. There are
a number of distinct opinions on this, both as to what should be
configured and how it should be done. We talked mainly about the latter,
but did touch on the former. Obviously, with a radically distinct
module API, the configuration is radically different. We need a good way
to specify how the modules are supposed to interact, and to control
what they can do, when and how, balancing what the user asks the server to
do, and what the module (author) wants the server to do. We didn't really
come up with a good answer to this.
However, we did make some progress on the other side of the issue: We
agreed that the current configuration system is definitely taking the
right approach. Having a well-defined repository of the configuration
scheme, containing the possible directives, when they are applicable, what
their parameters are, etc... is the right way to go. We agreed that more
information and stronger typing (no RAW_ARGS!) would be good, and may
enable on-the-fly generated configuration managers.
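A strongly-typed directive repository might look something like the
Python sketch below. This is illustration only: the Directive class,
the DIRECTIVES table and parse_directive are invented for the example,
and the argument types and contexts listed for Port and DocumentRoot
are assumptions, not the real definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Directive:
    """Hypothetical repository entry describing one directive."""
    name: str
    arg_types: tuple       # one Python type per argument, in order
    contexts: frozenset    # where the directive may appear


# Illustrative entries only; types and contexts are assumptions.
DIRECTIVES = {d.name: d for d in [
    Directive("Port", (int,), frozenset({"server"})),
    Directive("DocumentRoot", (str,), frozenset({"server", "vhost"})),
]}


def parse_directive(line, context):
    """Check one config line against the repository and convert its
    arguments to their declared types. Raises ValueError on any mismatch."""
    name, *raw_args = line.split()
    spec = DIRECTIVES.get(name)
    if spec is None:
        raise ValueError("unknown directive: " + name)
    if context not in spec.contexts:
        raise ValueError(name + " is not allowed in context " + context)
    if len(raw_args) != len(spec.arg_types):
        raise ValueError(name + " takes %d argument(s)" % len(spec.arg_types))
    # int("eighty") and the like raise ValueError on their own.
    return name, tuple(t(v) for t, v in zip(spec.arg_types, raw_args))
```

Since the table is plain data, an external configuration manager could
read the same table to build its forms and validate input before
anything is handed to the server.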
We agreed that such a program, probably external to Apache, would generate
a configuration and pass it to Apache, either via a standard config file,
or by calling Apache API functions. It is desirable to be able to go the
other way, pulling current configuration from Apache to look at, and
perhaps change it on the fly, but unfortunately is unlikely this
information would always be available; modules may perform optimizations
on their configuration that makes the original configuration unavailable.
For the language and specification of the configuration, we thought
perhaps XML might be a good approach, and agreed it should be looked
into. Other issues, such as SNMP, were brought up and laughed at.
6. "So you're saying that the OS that controls half the banks, and 90% of
the airlines, doesn't even have memory protection for separate
processes?"
Obviously, there are a lot more items that have to be part of Apache 2.0,
and we talked about a number of them. However, the four points above, I
think, represent the core of the architecture we agreed on as a starting
point.
-- Alexei Kosut <akosut@stanford.edu> <http://www.stanford.edu/~akosut/>
Stanford University, Class of 2001 * Apache <http://www.apache.org> *