<html>
<head>
<title>The C10K problem</title>
</head>
<body>
<h1>The C10K problem</h1>
It's time for web servers to handle ten thousand clients simultaneously,
don't you think?  After all, the web is a big place now. 
<p>
And computers are big, too.  You can buy a 500MHz machine
with 1 gigabyte of RAM and six 100Mbit/sec Ethernet cards for $3000 or so.  
Let's see - at 10000 clients, that's
50KHz, 100Kbytes, and 60Kbits/sec per client.  
It shouldn't take any more horsepower than that to take four kilobytes 
from the disk and send them to the network once a second for each
of ten thousand clients.
(That works out to $0.30 per client, by the way.  Those
$100/client licensing fees some operating systems charge are starting to 
look a little heavy!)  So hardware is no longer the bottleneck.
<p>
One of the busiest ftp sites, ftp.cdrom.com, serves (as of May 1999)
around 5000 clients simultaneously through a 70 megabit/second pipe.
Pipes this fast aren't common yet, but technology is improving rapidly.
<p>
With that in mind, here are a few notes on how to configure operating 
systems and write code to support thousands of clients.  The discussion
centers around Unix-like operating systems, for obvious reasons.

<h2>I/O Strategies</h2>
There seem to be four ways of writing a fast web server to handle many
clients:
<ol>
<li>serve many clients with each server process or thread, and use
select() or poll() to avoid blocking.  This is the traditional favorite,
and is sometimes referred to as "using nonblocking I/O".
A minimal sketch of such an event loop appears after this list.
<li>serve many clients with each server process or thread, and use
asynchronous I/O to avoid blocking.  This has not yet become popular,
possibly because of poorly designed asynchronous I/O interfaces.
Zach Brown (author of HoserFTPD) thinks this might now be the way to go
for highest performance; see 
<a href="sigio.txt">his 14 April 1999 post to hftpd-users</a>.
<br>
There are several flavors of asynchronous I/O: 
<ul>
<li><a href="http://www.opengroup.org/onlinepubs/007908799/xsh/realtime.html">the aio_ interface</a>
(scroll down from that link to "Asynchronous input and output"), 
which associates a signal and value with each I/O operation.
Signals and their values are queued and delivered efficiently to the user process.
This is from the POSIX 1003.1b realtime extensions, and is also in the Single Unix Specification,
version 2, and in glibc 2.1.
<li>
<a href="http://www.dejanews.com/getdoc.xp?AN=410413704">F_SETSIG</a>
(see also <a href="http://www.dejanews.com/getdoc.xp?AN=366163395">here</a>),
which associates a signal with each file descriptor.
When a normal I/O function like read() or write() completes, the signal
is raised, with the file descriptor as an argument.  Similar to aio_ but
without the new calls, and slightly less flexible (you know the handle,
but you don't know whether it is ready for read() or for write() without
doing a poll() on it).
(Currently only in Linux, I think.)  A short registration sketch appears after this list.
<li>SIGIO (see <a href="http://www.delorie.com/gnu/docs/glibc/libc_147.html">glibc doc</a> or
<a href="http://www-users.cs.umn.edu/~bentlema/unix/BSD-UNIX:SysCalls_and_IPC.html">BSD Sockets doc</a>)
-- doesn't tell you which handle needs servicing, so it seems kind of coarse.
Used by the Linux F_SETSIG/aio_ implementation as a fallback when the realtime signal 
queue overflows.  
<a href="http://www.linuxhq.com/lnxlists/linux-kernel/lk_9905_01/msg00973.html">Here's
an example of its use</a>.  (Was partly broken in Linux kernels 2.2.0 - 2.2.7, fixed in 2.2.8.)
</ul>
<li>serve one client with each server thread, and let read() and write()
block.  (This is the only model supported by Java.)
<li>Build the server code into the kernel.  
Novell and Microsoft are both said to have done this at various times, and at least one NFS implementation does this.
IBM and Sun are said to have released specweb benchmark results using this technique.  
</ol>
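<p>
To make the first strategy concrete, here is a minimal sketch of a
single-process echo server that sets its sockets nonblocking and uses
select() to find ready descriptors.  This is my own illustration, not
code from any of the servers mentioned here; error handling and
short-write buffering are trimmed to keep it readable.
<pre>
/* Sketch of strategy 1: one process, nonblocking sockets, select().
 * Note that both the client count and the fd values themselves must
 * stay below FD_SETSIZE; see the poll() note under "Other limits/tips". */
#include <sys/types.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>

static void set_nonblocking(int fd)
{
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
}

int main(void)
{
    int clients[FD_SETSIZE], nclients = 0, i;
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);
    bind(listener, (struct sockaddr *) &addr, sizeof addr);
    listen(listener, 128);
    set_nonblocking(listener);

    for (;;) {
        fd_set readable;
        int maxfd = listener;

        FD_ZERO(&readable);
        FD_SET(listener, &readable);
        for (i = 0; i < nclients; i++) {
            FD_SET(clients[i], &readable);
            if (clients[i] > maxfd)
                maxfd = clients[i];
        }
        if (select(maxfd + 1, &readable, NULL, NULL, NULL) <= 0)
            continue;

        if (FD_ISSET(listener, &readable) && nclients < FD_SETSIZE) {
            int fd = accept(listener, NULL, NULL);
            if (fd >= 0) {
                set_nonblocking(fd);
                clients[nclients++] = fd;
            }
        }
        for (i = 0; i < nclients; i++) {
            char buf[4096];
            ssize_t n;

            if (!FD_ISSET(clients[i], &readable))
                continue;
            n = read(clients[i], buf, sizeof buf);
            if (n > 0) {
                write(clients[i], buf, n);   /* echo; may short-write */
            } else if (n == 0 || errno != EAGAIN) {
                close(clients[i]);           /* client went away */
                clients[i--] = clients[--nclients];
            }
        }
    }
}
</pre>
A real web server replaces the echo with request parsing and buffered
responses, but the shape of the loop is the same.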
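<p>
For the F_SETSIG flavor, the registration is just three fcntl() calls.
The following is my reconstruction of the pattern described in the posts
linked above, not code taken from them; it is Linux-specific, and
F_SETSIG and O_ASYNC need _GNU_SOURCE with glibc:
<pre>
/* Sketch: signal-driven I/O with Linux F_SETSIG.  Each ready
 * descriptor is reported via a queued realtime signal whose
 * siginfo carries the fd; SIGIO is the overflow fallback. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

#define IO_SIGNAL (SIGRTMIN + 1)

void block_io_signals(void)          /* call once, before registering fds */
{
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, IO_SIGNAL);
    sigaddset(&set, SIGIO);
    sigprocmask(SIG_BLOCK, &set, NULL);
}

void register_fd(int fd)
{
    fcntl(fd, F_SETOWN, getpid());            /* deliver signals to us */
    fcntl(fd, F_SETSIG, IO_SIGNAL);           /* use a queued realtime signal */
    fcntl(fd, F_SETFL, O_NONBLOCK | O_ASYNC); /* enable signal-driven I/O */
}

int wait_for_ready_fd(void)          /* returns a ready fd, or -1 */
{
    sigset_t set;
    siginfo_t info;

    sigemptyset(&set);
    sigaddset(&set, IO_SIGNAL);
    sigaddset(&set, SIGIO);
    if (sigwaitinfo(&set, &info) < 0)
        return -1;
    if (info.si_signo == SIGIO)
        return -1;   /* realtime queue overflowed; fall back to poll() */
    return info.si_fd;
}
</pre>
As noted above, knowing the fd is ready doesn't tell you read vs. write;
a server typically just attempts its pending nonblocking operation and
treats EAGAIN as "not yet".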
<p>
Richard Gooch has written <a href="http://www.atnf.csiro.au/~rgooch/linux/docs/io-events.html">a paper discussing these options</a>.  Interesting reading.  
<p>
The Apache mailing lists have some interesting posts 
(<a href="http://www.humanfactor.com/cgi-bin/cgi-delegate/apache-ML/nh/1999/March/1st-half/0143.html">one</a>,
<a href="http://www.humanfactor.com/cgi-bin/cgi-delegate/apache-ML/nh/1999/March/2nd-half/0007.html">two</a>,
<a href="http://www.humanfactor.com/cgi-bin/cgi-delegate/apache-ML/nh/1999/March/2nd-half/0009.html">three</a>)
about why they prefer not to use select() (basically, they think that makes plugins harder).   
<br>
I have not yet seen any data comparing the performance of the four approaches.
<p>
Mark Russinovich wrote 
<a href="http://linuxtoday.com/stories/5499.html">an editorial</a> and
<a href="http://winntmag.com/Magazine/Article.cfm?ArticleID=5048">an article</a>
discussing I/O strategy issues in the 2.2 Linux kernel.  Worth reading, even
if he seems misinformed on some points.  In particular, he
seems to think that Linux 2.2's asynchronous I/O 
(see F_SETSIG above) doesn't notify the user process when data is ready, only
when new connections arrive.  This seems like a bizarre misunderstanding.
See also 
<a href="http://www.dejanews.com/getdoc.xp?AN=431444525">comments on an earlier draft</a>,
<a href="http://www.dejanews.com/getdoc.xp?AN=472893693">a rebuttal</a> from Mingo, 
<a href="http://www.linuxhq.com/lnxlists/linux-kernel/lk_9905_01/msg00089.html">Russinovich's comments of 2 May 1999</a>,
<a href="http://www.linuxhq.com/lnxlists/linux-kernel/lk_9905_01/msg00263.html">a rebuttal</a> from Alan Cox,
and various 
<a href="http://www.dejanews.com/dnquery.xp?ST=PS&QRY=threads&DBS=1&format=threaded&showsort=score&maxhits=100&LNG=ALL&groups=fa.linux.kernel+&fromdate=jun+1+1998">posts to linux-kernel</a>.  

<h2>Limits on open filehandles</h2>
<ul>
<li>Solaris: see <a
href="http://www.wins.uva.nl/pub/solaris/solaris2/Q3.45.html">the Solaris FAQ,
question 3.45</a>.
<li>FreeBSD: use <tt>sysctl -w kern.maxfiles=nnnn</tt> to raise limit
<li>Linux: Even the 2.2.5 kernel limits the number of open files to 1024.
I believe the AC series of patches removes this limit; see <a href="http://kernel.org">ftp.*.kernel.org/pub/linux/kernel/alan/</a>
for mirrors of e.g. ftp://ftp.linux.org.uk/pub/linux/alan/2.2/patch-2.2.5-ac7.bz2
<br>
(See also <a href="http://www.citi.umich.edu/projects/citi-netscape/reports/poll.html">this patch
to make poll scale beyond 1024 fd's</a>.  It's dated Dec 1998; I think it's already in the 2.2.5 kernel.)
<li>Any Unix: the limits set by ulimit or setrlimit.  (A setrlimit() sketch appears after this list.)
</ul>
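<p>
For the setrlimit() route, a process can raise its own soft limit up to
the hard limit at run time; raising the hard limit itself takes root
(or a friendly sysadmin).  A minimal sketch:
<pre>
/* Sketch: raise this process's open-file limit as far as allowed. */
#include <sys/time.h>
#include <sys/resource.h>

int raise_fd_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) < 0)
        return -1;
    rl.rlim_cur = rl.rlim_max;   /* soft limit up to the hard limit */
    return setrlimit(RLIMIT_NOFILE, &rl);
}
</pre>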

<h2>Limits on threads</h2>
<ul>
<li>Solaris: it supports as many threads as will fit in memory, I hear.
<li>FreeBSD: ?
<li>Linux: Even the 2.2.2 kernel limits the number of threads,
at least on Intel.  I don't know what the limits are on other architectures.
<a href="http://www.dejanews.com/getdoc.xp?AN=420466972">Mingo posted a patch
for 2.1.131 on Intel</a> that removed this limit; I hear he intends to
provide updated patches as time goes on, until it's time to integrate it
into the main version of the kernel.
<li>Java: See <a
href="http://www.javaworld.com/javaworld/jw-03-1999/jw-03-volanomark.html">the
Volanomark benchmark writeup in Javaworld</a>.  It recommends reducing
the amount of memory reserved by default for each thread.  (A pthreads
sketch of doing so appears after this list.)
</ul>
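<p>
One way to act on that advice: with POSIX threads, the per-thread memory
reservation can be reduced by giving each thread an explicit, smaller
stack.  A sketch, where serve_client is a hypothetical worker function
and 64KB is an arbitrary example size, not a recommendation:
<pre>
/* Sketch: small explicit stacks so that per-thread address-space
 * reservation doesn't become the thread-count limit.  At a default
 * of 1MB or more per thread, 2GB of VM caps you near 2000 threads. */
#include <pthread.h>

extern void *serve_client(void *arg);   /* hypothetical */

int spawn_worker(void *arg)
{
    pthread_t tid;
    pthread_attr_t attr;

    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 64 * 1024);  /* size to your call depth */
    return pthread_create(&tid, &attr, serve_client, arg);
}
</pre>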

<h2>Other limits/tips</h2>
<ul>
<li>select() is limited to FD_SETSIZE handles.  This limit is compiled into
the standard library and user programs.  The similar call poll() has no
comparable limit, and can have less overhead than select().  (A poll()-based
sketch appears after this list.)
<li>Even the most recent glibc might use 16-bit variables to hold thread 
or file handles, which could cause trouble above 32767 handles/threads.
<li>Too much thread-local memory is preallocated by some operating systems;
if each thread gets 1MB, and total VM space is 2GB, that creates an upper limit
of 2000 threads.
<li>Normally, data gets copied many times on its way from here to there.
mmap() and sendfile() can be used to reduce this overhead in some cases.
<a href="http://www.dejanews.com/getdoc.xp?AN=450465542">IO-Lite</a> 
is a proposal (already implemented on FreeBSD) for a set of
I/O primitives that gets rid of the need for many copies.
It's sexy; go read it.  But see also <a href="http://www.linuxhq.com/lnxlists/linux-kernel/lk_9905_01/msg00263.html">Alan Cox's opinion of zero-copy</a>.
<li>The sendfile() function in Linux and FreeBSD lets you tell the kernel to send part 
or all of a file.  This lets the OS do it as efficiently as possible.
It can be used equally well in servers using threads or servers using
nonblocking I/O.  (In Linux, it's poorly documented at the moment; <a
href="http://www.dejanews.com/getdoc.xp?AN=422899634">use _syscall4 to 
call it</a>.  Andi Kleen is writing new man pages that cover this.)  
<a href="http://www.dejanews.com/getdoc.xp?AN=423005088">Rumor has it</a> that
ftp.cdrom.com benefited noticeably from sendfile().  (A sketch appears after this list.)
<li>A new socket option under Linux, TCP_CORK, tells the kernel to
avoid sending partial frames, which helps a bit, e.g., when there are
lots of little write() calls you can't bundle together for some reason.
Unsetting the option flushes the buffer.  (A sketch appears after this list.)
<li>Not all threads are created equal.  The clone() function in Linux
(and its friends in other operating systems)
lets you create a thread that has its own current working directory,
for instance, which can be very helpful when implementing an ftp server.
See Hoser FTPd for an example of the use of native threads rather than pthreads.
(A clone() sketch appears after this list.)
<li>To keep the number of filehandles per process down, servers
can fork() once they reach the desired maximum; the child 
finishes serving the existing clients, and the parent accepts and
services new clients.  (If the desired maximum is 1, this degenerates
to the classical one-process-per-client model.)
<li>One developer using sendfile() with FreeBSD reports that using
POLLWRBAND instead of POLLOUT makes a big difference.  
<li>Look at the performance comparison graph at the bottom of
<a href="http://www.acme.com/software/thttpd/benchmarks.html">http://www.acme.com/software/thttpd/benchmarks.html</a>.
Notice how various servers have trouble above 128 connections, even on Solaris 2.6?  
Anyone who figures out why, let me know.  
<br>
Note: if the TCP stack has a bug that causes a short (200ms)
delay at SYN or FIN time, as Linux 2.2.0-2.2.6 had, and the OS or
http daemon has a hard limit on the number of connections open,
you would expect exactly this behavior.  There may be other causes.
<li>"Re: fix for hybrid server problems" by Vivek Sadananda Pai 
(vivek@cs.rice.edu) on 
<a href="http://www.humanfactor.com/cgi-bin/cgi-delegate/apache-ML/nh/1999/">new-httpd</a>, May 9th, notes:
<blockquote> 
"I've compared the raw performance of a select-based server with a
multiple-process server on both FreeBSD and Solaris/x86. On
microbenchmarks, there's only a marginal difference in performance
stemming from the software architecture. The big performance win for
select-based servers stems from doing application-level caching. While
multiple-process servers can do it at a higher cost, it's harder to
get the same benefits on real workloads (vs microbenchmarks).
I'll be presenting those measurements as part of a paper that'll
appear at the next Usenix conference. If you've got postscript,
the paper is available at 
<a href="http://www.cs.rice.edu/~vivek/flash99/">http://www.cs.rice.edu/~vivek/flash99/</a>"
</blockquote>
</ul>
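<p>
A few sketches for the tips above.  First, the poll() analogue of the
select() loop shown earlier; because the caller supplies the array,
there is no FD_SETSIZE ceiling.  Here handle_fd is a hypothetical
dispatch routine (accept on the listener, read/write on a client):
<pre>
/* Sketch: readiness-wait with poll() instead of select(). */
#include <poll.h>

struct pollfd fds[10000];        /* fds[0] is the listener */
int nfds;
extern void handle_fd(int fd);   /* hypothetical */

void one_iteration(void)
{
    int i;

    for (i = 0; i < nfds; i++)
        fds[i].events = POLLIN;      /* watch for readability */

    if (poll(fds, nfds, -1) <= 0)    /* block until something is ready */
        return;

    for (i = 0; i < nfds; i++)
        if (fds[i].revents & (POLLIN | POLLERR | POLLHUP))
            handle_fd(fds[i].fd);
}
</pre>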
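<p>
Second, serving a whole file with sendfile().  This sketch uses the
Linux calling convention and assumes a libc header that declares the
call (as noted above, on Linux today you may have to declare the
syscall yourself with _syscall4); FreeBSD's sendfile() takes different
arguments:
<pre>
/* Sketch: let the kernel move file data straight to the socket,
 * skipping the usual read()/write() copy through user space. */
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int send_whole_file(int sock, const char *path)
{
    struct stat st;
    off_t offset = 0;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    /* sendfile() advances offset by however much it sent; with a
     * nonblocking socket you'd return to the event loop on EAGAIN
     * instead of looping here. */
    while (offset < st.st_size)
        if (sendfile(sock, fd, &offset, st.st_size - offset) < 0)
            break;
    close(fd);
    return offset == st.st_size ? 0 : -1;
}
</pre>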
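<p>
Third, the TCP_CORK pattern: cork, do the little writes, uncork to
flush.  A sketch (Linux-specific):
<pre>
/* Sketch: TCP_CORK keeps the kernel from sending partial frames
 * while a response is assembled from several small write() calls. */
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <unistd.h>

void send_response(int sock, const char *hdr, int hdrlen,
                   const char *body, int bodylen)
{
    int on = 1, off = 0;

    setsockopt(sock, IPPROTO_TCP, TCP_CORK, &on, sizeof on);
    write(sock, hdr, hdrlen);     /* small writes are held back... */
    write(sock, body, bodylen);
    setsockopt(sock, IPPROTO_TCP, TCP_CORK, &off, sizeof off);  /* ...and flushed */
}
</pre>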
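<p>
Finally, the clone() trick mentioned above: share memory and file
descriptors, but omit CLONE_FS so each thread keeps its own current
directory.  Here serve_session is a hypothetical per-connection handler:
<pre>
/* Sketch: a Linux clone() thread with an independent cwd --
 * useful for an ftp server where each session chdir()s freely. */
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>

#define STACK_SIZE (64 * 1024)

extern int serve_session(void *arg);   /* hypothetical */

int spawn_session(void *arg)
{
    char *stack = malloc(STACK_SIZE);

    if (stack == NULL)
        return -1;
    /* The stack grows down on most architectures, so pass its top.
     * CLONE_VM | CLONE_FILES: share address space and descriptors.
     * No CLONE_FS: the child gets its own cwd and umask. */
    return clone(serve_session, stack + STACK_SIZE,
                 CLONE_VM | CLONE_FILES | SIGCHLD, arg);
}
</pre>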

<h2>Kernel Issues</h2> 
For Linux, it looks like kernel bottlenecks are being fixed constantly.
See <a href="http://www.linuxhq.com/">Linux HQ</a>, 
<a href="http://www.kt.opensrc.org/">Kernel Traffic</a>,
and <a href="http://www.linuxhq.com/lnxlists/linux-kernel/">the Linux-Kernel mailing list</a>
(Example interesting posts by 
<a href="http://www.linuxhq.com/lnxlists/linux-kernel/lk_9904_03/msg00146.html">a user asking how to tune</a>, and
<a href="http://www.linuxhq.com/lnxlists/linux-kernel/lk_9904_03/msg00309.html">Dean Gaudet</a>)
<p>
In March 1999, Microsoft sponsored a benchmark comparing NT to Linux
at serving large numbers of http and smb clients, in which they
failed to see good results from Linux.  
See also <a href="mindcraft_redux.html">my article on Mindcraft's April 1999 Benchmarks</a>
for more info.
<p>
See also <a href="http://www.citi.umich.edu/projects/citi-netscape/">The Linux Scalability Project</a>.

<h2>Measuring Server Performance</h2>
Two tests in particular are simple, interesting, and hard: 
<ol>
<li>raw connections per second (how many 512 byte files per second can you
serve?)
<li>total transfer rate on large files with many slow clients
(how many 28.8k modem clients can simultaneously download
from your server before performance goes to pot?)
</ol>
Jef Poskanzer has published benchmarks comparing many web servers.
See <a href="http://www.acme.com/software/thttpd/benchmarks.html">http://www.acme.com/software/thttpd/benchmarks.html</a>  
for his results.
<p>
I also have 
<a href="http://www.alumni.caltech.edu/~dank/fixing-overloaded-web-server.html">
a few old notes about comparing thttpd to Apache</a> that may be of interest
to beginners.

<h2>Interesting select()-based servers</h2>
<ul>
<li><a href="http://www.acme.com/software/thttpd/">thttpd</a>
Very simple.  Uses a single process.  It has good performance,
but doesn't scale with the number of CPU's.
<li><a href="http://mathop.diva.nl/">mathopd</a>.  Similar to thttpd.
<li><a href="http://www.zeustech.net/">Zeus</a>, a commercial server that tries to be the absolute fastest.
See their <a href="http://support.zeustech.net/tuning.htm">tuning guide</a>.   
<li>The other non-Java servers listed at <a href="http://www.acme.com/software/thttpd/benchmarks.html">http://www.acme.com/software/thttpd/benchmarks.html</a>
<li><a
href="http://ca.us.mirrors.freshmeat.net/appindex/1999/02/17/919251275.html">BetaFTPd</a>
<li><a href="http://www.cs.rice.edu/~vivek/iol98/">Flash-Lite</a> -
web server using IO-Lite.
<li><a
href="http://www.imatix.com/html/xitami/">xitami</a> - uses select() to
implement its own thread abstraction for portability to systems without 
threads.
<li><a href="http://www.nightmare.com/medusa/medusa.html">Medusa</a> - a server-writing toolkit in Python that tries to deliver very high performance. 
</ul>

<h2>Interesting thread-based servers</h2>
<ul>
<li><a href="http://www.zabbo.net/hftpd/">Hoser FTPD</a>.
See their <a href="http://www.zabbo.net/hftpd/bench.html">benchmark page</a>.
<li><a
href="http://ca.us.mirrors.freshmeat.net/appindex/1999/02/06/918317238.html">Peter Eriksson's phttpd</a> and <li><a href="http://ca.us.mirrors.freshmeat.net/appindex/1999/02/06/918313631.html">pftpd</a> 
<li>The Java-based servers listed at <a href="http://www.acme.com/software/thttpd/benchmarks.html">http://www.acme.com/software/thttpd/benchmarks.html</a>
<li>Sun's <a href="http://jserv.javasoft.com/">Java Web Server</a>
(which has been 
<a
href="http://archives.java.sun.com/cgi-bin/wa?A2=ind9901&L=jserv-interest&F=&S=&P=47739">reported to handle 500 simultaneous clients</a>)
</ul>

<h2>Other interesting servers</h2>
<ul>
<li><a href="http://www.novell.com/bordermanager/ispcon4.html">Novell's
FastCache</a> -- claims 10000 hits per second.  Quite the pretty performance graph.
</ul>

<hr>
<i>Copyright 1999 Dan Kegel</i><br>
dank@alumni.caltech.edu<br>
Last updated: 9 May 1999<br>
<a href="http://www.kegel.com">[Return to www.kegel.com]</a>
</body>
</html>

