<html>
<head>
<title>I/O Event Handling Under Linux</title>
</head>
<body>
<center>
<h1>I/O Event Handling Under Linux</h1>
<h2>Richard Gooch</h2>
<h4>28-JUN-1998</h4>
</center>
<hr>
<h4>Introduction</h4>
I/O event handling is about how your operating system allows you to
manage a large number of open files (file descriptors in UNIX/POSIX,
or FDs) in your application. You want the OS to notify you when FDs
become active (have data ready to be read or are ready for
writing). Ideally you want a mechanism that is scalable: a large
number of inactive FDs should cost very little in memory and CPU time
to manage.
<p>
First I should start off by stating my goal: <b>to develop a thin
interface that makes best use of the facilities the OS provides,
scaling as much as possible. Where possible, use of standard POSIX
facilities is preferred</b>. A secondary goal is to provide a
convenient interface for applications. The reason for preferring
standard POSIX interfaces is that they introduce fewer conditional
blocks in the interface code.
<p>
<h4>The Traditional UNIX Way</h4>
Traditional Unix systems provide the <b>select(2)</b> and/or
<b>poll(2)</b> system calls. With both of these you pass an array of
FDs to the kernel, with an optional timeout. When there is activity,
or when the call times out, the system call will return. The
application must then scan the array to see which FDs are active. This
scheme works well with small numbers of FDs, and is simple to
use. Unfortunately, for thousands of FDs, this does not work so well.
<p>
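To make the mechanics concrete, here is a minimal sketch of such a
polling loop using <b>poll(2)</b> (error handling is trimmed, and the
1000 entry array matches the example discussed below):
<pre>
#include &lt;poll.h&gt;

#define NUM_FDS 1000

void event_loop (struct pollfd fds[NUM_FDS])
{
    int i, num_active;

    while (1)
    {
        /*  The kernel scans all NUM_FDS entries on every call  */
        num_active = poll (fds, NUM_FDS, -1);
        if (num_active &lt; 1) continue;
        /*  The application then scans the whole array again to find
            the entries with revents set  */
        for (i = 0; i &lt; NUM_FDS; ++i)
        {
            if (fds[i].revents &amp; POLLIN)
            {
                /*  read data from fds[i].fd  */
            }
        }
    }
}
</pre>
<p>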
The kernel has to scan your array of FDs and check which ones are
active. This takes approximately 3 microseconds (3 us) per FD on a
Pentium 100 running Linux 2.1.x. Now you might think that 3 us is
quite fast, but consider if you have an array of 1000 FDs. This is now
3 milliseconds (3 ms), which is 30% of your timeslice (each timeslice
is 10 ms). If it happens that there is initially no activity and you
specified a timeout, the kernel will have to perform a <em>second</em>
scan after some activity occurs or the syscall times out. Ouch! If you
have an even bigger application (like a large http server), you can
easily have 10000 FDs. Scanning times will then take 30 ms, which is
three timeslices! This is just way too much.
<p>
You might say that 3 ms for 1000 FDs is not such a big deal: a user
will hardly notice that. The problem is that the entire array of FDs
is scanned <em>each time</em> you want to go back to your polling
loop. The way these applications work is that after checking for
activity on FDs, the application processes the activity (for example,
reading data from active FDs). When all the activity has been
processed, the application goes back to polling the OS for more FD
activity. In many cases, only a small number of FDs are active at any
one time (say during each timeslice), so it may only take a few
milliseconds to process all the activity. High performance http
servers can process hundreds or thousands of transactions per
second. A server that takes 2 ms to process each active FD can process
500 transactions per second. If you add 3 ms for FD scanning in the
kernel, you now have 5 ms per transaction. That only gives 200
transactions per second, a massive drop in performance.
<p>
There is another problem, and that is that the application needs to
scan the "returned" FD array that the kernel has updated to see which
FDs are active. This is yet another scan of a large array. This isn't
as costly as the kernel scan, for reasons I'll get to later, but it is
still a finite cost.
<p>
<h4>New POSIX Interfaces</h4>
A fairly simple proposal is to use the POSIX.4 Asynchronous I/O (AIO)
interface (<b>aio_read()</b> and friends). Here we would call
<b>aio_read()</b> for each FD. This would then queue thousands of
asynchronous I/O requests. This model looks appealing, until we look
under the hood of some <b>aio_*()</b> implementations. The Linux glibc
implementation is a case in point: there is no kernel
support. Instead, the C library (glibc 2.1) launches a thread per FD
for which there are outstanding AIO requests (up to the maximum number
of configured threads).
In general, implementing this facility in the C library is reasonable,
as it avoids kernel bloat. However, if you use this facility to start
thousands of AIO requests, you may end up creating thousands of
threads. This is no good, since threads are costly. The "obvious"
solution, then, is to implement AIO in the Linux kernel. Another
solution is to use userspace tricks to avoid the scalability problems
(see the description of migrating FDs below). These solutions may be
fine if you only want to run under Linux, but they are not much help if
you want to run under another OS which also implements AIO using
threads (and for which you don't have the source code, so you can't
change the implementation). The point here is that there appears to be
no guarantee that <b>aio_*()</b> implementations are scalable across
the platforms which support them.
<p>
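To make the model concrete, queueing a read on every FD with
<b>aio_read()</b> looks roughly like the sketch below (error handling
is omitted, and the buffer size is an arbitrary choice for the
example):
<pre>
#include &lt;aio.h&gt;
#include &lt;stdlib.h&gt;

#define BUF_SIZE 4096

/*  Queue an asynchronous read on every FD in the list. With the
    glibc 2.1 implementation each outstanding request may end up
    being serviced by a separate thread.  */
struct aiocb *queue_reads (const int *fds, int num_fds)
{
    int i;
    struct aiocb *cbs = calloc (num_fds, sizeof *cbs);

    for (i = 0; i &lt; num_fds; ++i)
    {
        cbs[i].aio_fildes = fds[i];
        cbs[i].aio_buf = malloc (BUF_SIZE);
        cbs[i].aio_nbytes = BUF_SIZE;
        cbs[i].aio_offset = 0;
        aio_read (&amp;cbs[i]);   /*  returns immediately: the I/O is queued  */
    }
    return cbs;   /*  caller later checks aio_error()/aio_return()  */
}
</pre>
<p>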
It is also worth noting that POSIX.4 Asynchronous I/O is not
necessarily available on all POSIX.4 compliant systems (facilities
defined by POSIX.4 are optional). So even if you were prepared to
limit your application to POSIX.4 systems, there is still no guarantee
that AIO is available. Many or most implementations <em>will</em> be
scalable, but we can't be sure all are scalable, so we need an
alternative.
<p>
I should also point out that I don't think that the POSIX.4 AIO
interface is particularly appropriate for a network server
application. AIO doesn't provide a callback interface, so it is harder
to build higher-layer network connection objects in your
application. So just writing a better (scalable) implementation of
AIO is not ideal, either.
<p>
<h4>Readiness Event Queues and other OSes</h4>
Some other operating systems provide a mechanism called I/O completion
ports and some have event queues. These are mechanisms for telling the
OS that you want to be notified when there is activity on a FD. Usually,
when a FD becomes active (i.e. becomes ready for reading or writing),
the OS will send a message (a "readiness event") to a "port" (perhaps
another FD such as a pipe). This message contains the FD that has
become active. A single port can have readiness events from many FDs
sent to it. The key difference here is that you do not need to
pass the OS a massive FD array each time you want to listen for
events: you only need to tell the OS once for each FD that you want to
receive readiness events for <em>that</em> FD. The kernel no longer
needs to scan a massive FD array each time through your polling loop,
and nor does the application. This is an appealing approach, and
scales very well.
<p>
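There is no portable API for this, but to show the shape of the
mechanism, here is a purely hypothetical interface: every function and
type below is invented for illustration and does not exist on any
system.
<pre>
/*  Hypothetical readiness-queue API: none of these calls exist.
    They only illustrate the mechanism described above.  */

struct readiness_event
{
    int fd;        /*  the FD that became active  */
    int events;    /*  readable, writable, etc.   */
};

extern int event_port_create (void);
extern int event_port_watch (int port, int fd, int events);
extern int event_port_get (int port, struct readiness_event *ev);

void example_loop (const int *fds, int num_fds)
{
    int i, port = event_port_create ();
    struct readiness_event ev;

    /*  Tell the OS once per FD: no big array is passed on every poll  */
    for (i = 0; i &lt; num_fds; ++i) event_port_watch (port, fds[i], 1);

    while (event_port_get (port, &amp;ev) == 0)
    {
        /*  only the active FD is reported: nothing is scanned  */
        /*  ... read data from ev.fd ...  */
    }
}
</pre>
<p>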
Unfortunately, I/O readiness queues are not POSIX. They're not even
Unix. We could add them to Linux, but it would mean that applications
that relied on this mechanism would be unportable, Linux-only. Also,
it could involve significant additions to the kernel, which sets off
my bloat alarms. I would hope that it is possible to develop an
effective alternative that uses standard POSIX/UNIX functionality.
<p>
<h4>Optimising Existing UNIX Interfaces</h4>
There are improvements we can make for the massive FD scanning
problem. Firstly we can optimise the way the scanning is done inside
the kernel. Right now (2.1.106) the kernel has to call the
<b>poll()</b> method for each file structure. This is expensive. Back
in the 2.1.5x kernels, I coded a
<a
href="ftp://ftp.atnf.csiro.au/pub/people/rgooch/linux/kernel-patches/v2.1/fastpoll-readme">
better implementation</a>
for the kernel which sped things up almost 3 times. While drivers
require modifications to take advantage of it, it has the advantage
of not changing the semantics we expect from UNIX. Note
one other interesting feature of this optimisation: it centralises
event notification, which in turn would make implementing I/O
readiness queues simpler. I'm not sure how closure of FDs before
readiness events are read should be handled. This could complicate
their implementation.
<p>
Doing this optimisation does not solve our problem, though. It only
pushes the problem away for a while.
<p>
<h4>Making Better Use of Existing UNIX Interfaces</h4>
Note that for my purposes, it is better to optimise the application so
that it works well on many OSes rather than optimising a single
OS. Creating new interfaces for Linux is a last resort. Also note that
this section assumes that an OS of interest does not have an existing
(preferably POSIX) mechanism that supports FD management in a scalable
way.
<p>
Another solution (which would also benefit from the kernel
optimisation discussed above) is for the application to divide the FD
array into a number of smaller FD arrays, say 10. You then create 10
threads, each of which has a polling loop using its smaller FD
array. So each FD array is now 100 entries long. While this doesn't
change the total number of FDs that must be scanned, it <em>does</em>
change <em>when</em> they have to be scanned. Since most FDs are
inactive, not all the threads will be woken up. To see how this
works, consider the example where, at any time (say during a single
timeslice of 10 ms), only 5 FDs are active. Assuming these FDs are
randomly, uniformly distributed, at most 5 threads will need to be
woken up. These threads then process the activity and go back to the
start of their polling loops. Where we win is that only 5 threads had
to go back and call <b>select(2)</b> or <b>poll(2)</b>. Since they
each have 100 entry FD arrays, the kernel only has to scan 500
FDs. This has halved the amount of scanning required. The scanning
load has gone from 30% to 15% by this simple change. If you were to
instead use 100 threads, you would still only have at most 5 threads
woken up for activity, and hence the total number of FDs scanned this
timeslice would be 50. This takes the scanning load down to 1.5%,
which is negligible.
<p>
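Here is a sketch of that arrangement using POSIX threads (the numbers
are those from the example above, and error handling is omitted):
<pre>
#include &lt;pthread.h&gt;
#include &lt;poll.h&gt;

#define NUM_THREADS     10
#define FDS_PER_THREAD  100   /*  1000 FDs split into 10 arrays  */

struct slice
{
    struct pollfd fds[FDS_PER_THREAD];
};

/*  Each thread polls only its own 100 entry array, so only the
    threads with active FDs are woken up and rescan their FDs  */
static void *poll_slice (void *arg)
{
    struct slice *s = arg;

    while (1)
    {
        if (poll (s->fds, FDS_PER_THREAD, -1) &lt; 1) continue;
        /*  ... process the active entries in s->fds ...  */
    }
    return NULL;
}

void start_threads (struct slice slices[NUM_THREADS])
{
    int i;
    pthread_t tid;

    for (i = 0; i &lt; NUM_THREADS; ++i)
        pthread_create (&amp;tid, NULL, poll_slice, &amp;slices[i]);
}
</pre>
<p>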
There is one thing to watch out for here: if you use <b>select(2)</b>
in your polling loop, be aware that the size of your FD array is
determined by the value of your largest FD. This is because
<b>select(2)</b> uses a bitmask for its FD array. This means that if
one of your threads wants to poll FDs 991 to 1000, its FD array is
still 1000 entries long. What's worse, the kernel still has to do a
minimal scan for all
those 1000 FDs. The solution to this is to use <b>poll(2)</b> instead,
where you only have to pass as many FDs as you want to poll, and the
kernel scans only those.
<p>
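The difference is easy to see in code. Below is a sketch for a thread
responsible for FDs 991 to 1000: with <b>select(2)</b> the bitmask
(and the kernel's scan) covers every FD up to the largest one, while
with <b>poll(2)</b> only the ten entries of interest are passed in.
<pre>
#include &lt;sys/types.h&gt;
#include &lt;sys/time.h&gt;
#include &lt;unistd.h&gt;
#include &lt;poll.h&gt;

void compare_select_and_poll (void)
{
    int fd;
    fd_set rfds;
    struct pollfd pfds[10];

    /*  select(2): the bitmask spans FDs 0 to 1000, and the kernel must
        at least skim all of them even though only 10 are of interest  */
    FD_ZERO (&amp;rfds);
    for (fd = 991; fd &lt;= 1000; ++fd) FD_SET (fd, &amp;rfds);
    select (1001, &amp;rfds, NULL, NULL, NULL);

    /*  poll(2): only the 10 FDs we care about are passed to the kernel  */
    for (fd = 991; fd &lt;= 1000; ++fd)
    {
        pfds[fd - 991].fd = fd;
        pfds[fd - 991].events = POLLIN;
    }
    poll (pfds, 10, -1);
}
</pre>
<p>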
This solution sounds ideal: just create lots and lots of threads. At
the extreme, you create one thread per FD. There is a problem here,
however, as each thread consumes system resources. So you need to
compromise between the number of threads and the FD scanning load.
Also, the more threads you have the more cache misses you induce, so
this is something to avoid as well. Fortunately in this case most
threads will be running nearly the same code at the same time, so
cache pollution should not be a significant problem.
<p>
A more advanced solution is to have dynamic migration of FDs depending
on whether they are mostly active or inactive. In the simplest case,
you have only two threads: one polls the mostly active FDs and the
other polls the mostly inactive FDs. The thread for active FDs will be
woken up very frequently, but on the other hand will have only a small
number of FDs to scan. The other thread will have to scan a large
number of FDs, but it will only be woken up occasionally. For each FD
an activity counter is kept. When a FD on the mostly inactive list is
deemed to be fairly active, it is migrated to the mostly active
list. A reverse operation occurs for fairly inactive FDs on the mostly
active list.
<p>
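A minimal sketch of the migration step, assuming a simple array
representation for the two lists (the threshold is a made-up value
which a real implementation would tune):
<pre>
#define MAX_FDS           1024
#define ACTIVE_THRESHOLD  8   /*  made-up value: events per interval  */

struct fd_list
{
    int fds[MAX_FDS];               /*  each list sized for the worst case  */
    unsigned int counts[MAX_FDS];   /*  activity counter per FD  */
    int num;
};

/*  Remove entry i from src and append it to dst  */
static void move_fd (struct fd_list *src, int i, struct fd_list *dst)
{
    dst->fds[dst->num] = src->fds[i];
    dst->counts[dst->num] = src->counts[i];
    ++dst->num;
    src->fds[i] = src->fds[src->num - 1];
    src->counts[i] = src->counts[src->num - 1];
    --src->num;
}

/*  Called at the end of each measurement interval: busy FDs on the
    mostly inactive list are promoted, quiet FDs on the mostly active
    list are demoted, then the counters are reset  */
void migrate_fds (struct fd_list *active, struct fd_list *inactive)
{
    int i;

    for (i = inactive->num - 1; i >= 0; --i)
        if (inactive->counts[i] >= ACTIVE_THRESHOLD)
            move_fd (inactive, i, active);
    for (i = active->num - 1; i >= 0; --i)
        if (active->counts[i] &lt; ACTIVE_THRESHOLD)
            move_fd (active, i, inactive);
    for (i = 0; i &lt; active->num; ++i) active->counts[i] = 0;
    for (i = 0; i &lt; inactive->num; ++i) inactive->counts[i] = 0;
}
</pre>
<p>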
I favour this solution, since it can be implemented solely in
userspace and is portable to other POSIX systems. I have an existing
<a href="http://www.atnf.csiro.au/karma/">software library</a> which
has (amongst other things) support for
<a href="http://www.atnf.csiro.au/karma/lib/dm.html">
managing events on FDs</a>. I plan on extending this library to use
the above technique. The library is distributed under the LGPL. Watch
this space for results.
<p>
My approach is to squeeze as much performance as we can out of the
existing POSIX/UNIX interface by optimising the kernel and doing
clever things in userspace. We can then evaluate how much of a
bottleneck polling is under Linux. If polling overheads (for very
large numbers of FDs) are kept to within a few percent, I firmly
believe there is no need for I/O readiness queues. If polling overheads
remain above 10%, then we may consider I/O readiness queues or other
extensions to Linux. However, it would be better to add kernel support
for scalable AIO rather than implement readiness queues, since AIO is
a POSIX standard whereas readiness queues are not.
<p>
<h4>Minor Additions to UNIX Interfaces</h4>
OK, now I'll get back to another point I raised earlier, and that is
the time taken for the application to scan a list of FDs for
activity. We would like to reduce this time as well, and we can
without much effort at all. Instead of using <b>poll(2)</b> we can
implement <b>poll2(2)</b>, a new syscall which is very much like
<b>poll(2)</b> but returns an array of <em>active</em> FDs. I
implemented this last year when I first started thinking about
optimising FD management under Linux.
<p>
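For illustration only, usage of such a call might look like the sketch
below. The prototype shown here is invented for the example and is not
the actual interface of the patch.
<pre>
#include &lt;poll.h&gt;

/*  Hypothetical prototype, invented for this sketch: like poll(2),
    but the kernel fills 'active' with only those entries that have
    events pending and returns how many there are  */
extern int poll2 (struct pollfd *fds, unsigned int nfds,
                  struct pollfd *active, int timeout);

void example (struct pollfd *fds, unsigned int nfds)
{
    struct pollfd active[64];   /*  small: only active FDs come back  */
    int i, num_active;

    num_active = poll2 (fds, nfds, active, -1);
    /*  the application scans num_active entries, not all nfds  */
    for (i = 0; i &lt; num_active; ++i)
    {
        /*  ... read data from active[i].fd ...  */
    }
}
</pre>
<p>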
So now the application only needs to search a very small array. This
new syscall is based on the principle of not duplicating work. The
kernel has just gone and scanned all these FDs for us, so why not keep
a record of that work rather than doing it again in the application?
Note that the current Linux polling implementation is so slow that
implementing <b>poll2(2)</b> will make little difference. However,
once the optimisations I've proposed are added, it <em>will</em> make
a difference (yes, I've done the benchmarks). Oh, and before you take
my comments on the Linux polling implementation too far, note that
I've benchmarked it against several other OSes, and Linux came out
far better in every case. My point is that we can still do a lot better.
<p>
You might ask, why implement <b>poll2(2)</b> if we have this clever
userspace solution that avoids many of the problems? Well, the point
is that it would speed up the application processing the inactive list
of FDs, and that is always a good thing. However, it may turn out that
the gain from <b>poll2(2)</b> is marginal. You could also argue that
since <b>poll2(2)</b> is non-POSIX, non-UNIX, then why bother with it
at all, and why not simply implement I/O readiness queues? Well, I
could say that <b>poll2(2)</b> is only a small departure from
UNIX whereas I/O readiness queues are more radical. However, this
isn't a very strong argument. I'm waiting to see the results of the
userspace solution before considering <b>poll2(2)</b>
further. Furthermore, it would be better to make the Linux
implementation of AIO scalable, since this is a standard POSIX
interface. So <b>poll2(2)</b> is probably dead. It does have one nice
feature, though: it would be easier for an application to change to
using <b>poll2(2)</b> than to change to using AIO.
<p>
Another possibility is to make a slight extension to the existing UNIX
asynchronous signal delivery mechanism. Currently, you can request the
OS to deliver a signal when a FD becomes ready for reading or
writing. Unfortunately, if you do this for multiple FDs your signal
handler doesn't know <em>which</em> FD is ready for activity. This is
because UNIX signals do not carry extra information with them. You
can't use a separate signal number for each FD, since UNIX has only a
few signal numbers. However, POSIX real-time signals can carry a word
of data to the signal handler. So we could extend Linux such that if
you request a POSIX RT signal to be delivered when an FD is ready for
I/O, the word of data is the FD number itself. Unfortunately,
depending on this behaviour would once again be Linux-specific.
However, such a system does look like an attractive way of providing
the foundations for a scalable AIO implementation.
<p>
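A sketch of how an application might collect such notifications,
assuming the proposed behaviour (the choice of SIGRTMIN and the use of
the si_value data word are assumptions about how the extension would
be wired up):
<pre>
#include &lt;signal.h&gt;
#include &lt;stdio.h&gt;

/*  Collect readiness notifications via a POSIX real-time signal.
    This assumes the proposed extension: the kernel queues SIGRTMIN
    with the ready FD number carried in the signal's data word.  */
void wait_for_ready_fds (void)
{
    sigset_t set;
    siginfo_t info;

    sigemptyset (&amp;set);
    sigaddset (&amp;set, SIGRTMIN);
    sigprocmask (SIG_BLOCK, &amp;set, NULL);   /*  queue rather than interrupt  */

    while (1)
    {
        if (sigwaitinfo (&amp;set, &amp;info) &lt; 0) continue;
        /*  under the proposed scheme this is the FD that is ready  */
        printf ("FD %d is ready for I/O\n", info.si_value.sival_int);
    }
}
</pre>
<p>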
<h4>Mixing and Matching</h4>
A good implementation of POSIX.4 AIO should be superior to my
migrating FD scheme, since AIO should require no polling
whatsoever. Therefore my interface code should be able to make use of
AIO if available. However, since an AIO implementation may in fact not
scale well, its performance will have to be compared to the migrating
FD scheme to determine whether or not it should be utilised.
<p>
Similarly, for a system without POSIX.4 AIO but with readiness queues,
it would make sense for the interface code to utilise this facility.
<p>
<h4>The Thundering Herd Problem</h4>
A note on the "thundering herd" problem: it's not really of much
interest in this discussion. The "problem" arises because people
attempt to increase concurrency of accepting new connections by having
multiple threads all blocking waiting on <b>select(2)</b>. Because the
kernel wakes up all threads, rather than one thread per new
connection, we have a "thundering herd" of freshly woken up
threads. These problems are best solved by treating the accepting of
incoming connections as just another case of I/O management, rather
than special-casing them.
<p>
<h4>Other Resources</h4>
Readers may find a recent <a
href="http://www.cs.rice.edu/~gaurav/my_papers/usenix98.ps">paper</a>
presented at USENIX98 of interest. The author also has a list of
<a href="http://www.cs.rice.edu/~gaurav/gaurav_research.html">
other research</a> on this topic. The paper essentially describes a
mechanism for improving the speed of <b>select(2)</b> by adding
internal state information to the kernel about FD activity.
<p>
<h4>Comments Please</h4>
I invite comments or additions to this document. My purpose is to
explain the various issues involved and serve as a primer for more
debate. I hope to avoid recurring debates on the linux-kernel list
which go over the same ground and either never contribute anything new
to the arguments, or take many messages before something new is
added. If you have a strong technical argument that I've missed, I'll
be happy to add it to this document.
<hr>
Original: 22-JUN-1998
<hr>
<a href="../../index.html">Back to my Home Page</a>
<hr>
<center>
<address>Richard Gooch (rgooch@atnf.csiro.au)</address>
</center>
</body>
</html>