First I should start off by stating my goal: to develop a thin interface that makes best use of the facilities the OS provides, scaling as much as possible. Where possible, the use of standard POSIX facilities is preferred. A secondary goal is to provide a convenient interface for applications. The reason for preferring standard POSIX interfaces is that they introduce fewer conditional blocks into the interface code.
The kernel has to scan your array of FDs and check which ones are active. This takes approximately 3 microseconds (3 us) per FD on a Pentium 100 running Linux 2.1.x. Now you might think that 3 us is quite fast, but consider what happens if you have an array of 1000 FDs. This is now 3 milliseconds (3 ms), which is 30% of your timeslice (each timeslice is 10 ms). If there is initially no activity and you specified a timeout, the kernel has to perform a second scan once some activity occurs or the syscall times out. Ouch! If you have an even bigger application (like a large http server), you can easily have 10000 FDs. Scanning will then take 30 ms, which is three timeslices! This is just way too much.
You might say that 3 ms for 1000 FDs is not such a big deal: a user will hardly notice that. The problem is that the entire array of FDs is scanned each time you want to go back to your polling loop. The way these applications work is that after checking for activity on FDs, the application processes the activity (for example, reading data from active FDs). When all the activity has been processed, the application goes back to polling the OS for more FD activity. In many cases, only a small number of FDs are active at any one time (say during each timeslice), so it may only take a few milliseconds to process all the activity. High performance http servers can process hundreds or thousands of transactions per second. A server that takes 2 ms to process each active FD can process 500 transactions per second. If you add 3 ms for FD scanning in the kernel, you now have 5 ms per transaction. That only gives 200 transactions per second, a massive drop in performance.
There is another problem: the application itself needs to scan the "returned" FD array that the kernel has updated to see which FDs are active. This is yet another scan of a large array. It isn't as costly as the kernel scan, for reasons I'll get to later, but it is still a finite cost.
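To make the shape of the problem concrete, here is a minimal sketch of the kind of polling loop described above (handle_activity() is a placeholder for whatever the application does with a ready FD). Both scans are visible in it: the kernel walks the whole array inside poll(2), and the application then walks it again looking at revents.

    /* Minimal sketch of the polling loop described above.
     * handle_activity() is a placeholder for application code. */
    #include <poll.h>
    #include <stdio.h>

    #define NUM_FDS 1000

    extern void handle_activity (int fd);  /* application-specific */

    void event_loop (struct pollfd fds[NUM_FDS])
    {
        while (1)
        {
            /* The kernel scans all NUM_FDS entries, active or not */
            int nready = poll (fds, NUM_FDS, -1);

            if (nready < 0)
            {
                perror ("poll");
                return;
            }
            /* The application then scans the whole array again to find
               the few entries with revents set */
            for (int i = 0; i < NUM_FDS; ++i)
                if (fds[i].revents & (POLLIN | POLLOUT))
                    handle_activity (fds[i].fd);
        }
    }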
It is also worth noting that POSIX.4 Asynchronous I/O is not necessarily available on all POSIX.4 compliant systems (facilities defined by POSIX.4 are optional). So even if you were prepared to limit your application to POSIX.4 systems, there is still no guarantee that AIO is available. Many or most implementations will be scalable, but we can't be sure all are scalable, so we need an alternative.
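Since the option may or may not be present, an application can at least test for it at run time. Here is a minimal sketch using the standard feature-test macro and sysconf(3):

    /* Test at run time whether the POSIX.4 (POSIX.1b) AIO option is
     * provided.  _SC_ASYNCHRONOUS_IO is the standard sysconf(3) name
     * for this option. */
    #include <unistd.h>
    #include <stdio.h>

    int have_posix_aio (void)
    {
    #ifdef _POSIX_ASYNCHRONOUS_IO
        return sysconf (_SC_ASYNCHRONOUS_IO) > 0;
    #else
        return 0;  /* the headers do not even declare the option */
    #endif
    }

    int main (void)
    {
        printf ("POSIX AIO %s available\n",
                have_posix_aio () ? "is" : "is not");
        return 0;
    }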
I should also point out that I don't think the POSIX.4 AIO interface is particularly appropriate for a network server application. AIO doesn't provide a callback interface, so it is harder to build higher-layer network connection objects in your application. So just writing a better (scalable) implementation of AIO is not ideal, either.
Unfortunately, I/O readiness queues are not POSIX. They're not even Unix. We could add them to Linux, but it would mean that applications that relied on this mechanism would be unportable, Linux-only. Also, it could involve significant additions to the kernel, which sets off my bloat alarms. I would hope that it is possible to develop an effective alternative that uses standard POSIX/UNIX functionality.
Doing this optimisation does not solve our problem, though. It only pushes the problem away for a while.
Another solution (which would also benefit from the kernel optimisation discussed above) is for the application to divide the FD array into a number of smaller FD arrays, say 10. You then create 10 threads, each of which has a polling loop using its smaller FD array. So each FD array is now 100 entries long. While this doesn't change the total number of FDs that must be scanned, it does change when they have to be scanned. Since most FDs are inactive, not all the threads will be woken up. To see how this works, consider the example where, at any time (say during a single timeslice of 10 ms), only 5 FDs are active. Assuming these FDs are randomly and uniformly distributed, at most 5 threads will need to be woken up. These threads then process the activity and go back to the start of their polling loops. Where we win is that only 5 threads had to go back and call select(2) or poll(2). Since they each have 100 entry FD arrays, the kernel only has to scan 500 FDs. This has halved the amount of scanning required: the scanning load has gone from 30% to 15% by this simple change. If you were instead to use 100 threads, you would still have at most 5 threads woken up for activity, and hence the total number of FDs scanned this timeslice would be 50. This takes the scanning load down to 1.5%, which is negligible.
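A sketch of this scheme, reusing the hypothetical handle_activity() placeholder from the earlier loop; each thread runs its own polling loop over a slice of the array (a real application would also need to protect any shared state that handle_activity() touches):

    /* Divide one large pollfd array among several threads, each
     * running its own polling loop over a slice of the array. */
    #include <poll.h>
    #include <pthread.h>

    #define NUM_FDS         1000
    #define NUM_THREADS     10
    #define FDS_PER_THREAD  (NUM_FDS / NUM_THREADS)

    extern void handle_activity (int fd);  /* application-specific */

    static void *poll_slice (void *arg)
    {
        struct pollfd *slice = arg;

        while (1)
        {
            /* Only FDS_PER_THREAD entries are scanned per wakeup */
            if (poll (slice, FDS_PER_THREAD, -1) < 0)
                break;
            for (int i = 0; i < FDS_PER_THREAD; ++i)
                if (slice[i].revents & (POLLIN | POLLOUT))
                    handle_activity (slice[i].fd);
        }
        return NULL;
    }

    void start_pollers (struct pollfd fds[NUM_FDS])
    {
        pthread_t tid[NUM_THREADS];

        for (int t = 0; t < NUM_THREADS; ++t)
            pthread_create (&tid[t], NULL, poll_slice,
                            fds + t * FDS_PER_THREAD);
    }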
There is one thing to watch out for here: if you use select(2) in your polling loop, be aware that the size of the FD set you pass is determined by the value of your largest FD, not by the number of FDs you are interested in. This is because select(2) uses a bitmask for its FD set. Consider the thread that wants to poll FDs 991 to 1000: its bitmask still has to cover all FDs up to 1000, and, what's worse, the kernel still has to do a minimal scan over all 1000 of them. The solution is to use poll(2) instead, where you pass only as many FDs as you want to poll, and the kernel scans only those.
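To illustrate the difference for the thread responsible for FDs 991 to 1000 (assuming those FDs are already open): poll(2) is handed just ten entries, while the select(2) bitmask has to cover everything up to the highest FD number.

    #include <poll.h>
    #include <sys/select.h>

    void wait_with_poll (void)
    {
        struct pollfd slice[10];

        for (int i = 0; i < 10; ++i)
        {
            slice[i].fd = 991 + i;
            slice[i].events = POLLIN;
        }
        poll (slice, 10, -1);          /* kernel looks at 10 entries */
    }

    void wait_with_select (void)
    {
        fd_set rfds;

        FD_ZERO (&rfds);
        for (int fd = 991; fd <= 1000; ++fd)
            FD_SET (fd, &rfds);
        select (1001, &rfds, NULL, NULL, NULL);  /* kernel walks 1001 bits */
    }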
This solution sounds ideal: just create lots and lots of threads. At the extreme, you create one thread per FD. There is a problem here, however: each thread consumes system resources, so you need to compromise between the number of threads and the FD scanning load. Also, the more threads you have, the more cache misses you induce, so this is something to avoid as well. Fortunately, in this case most threads will be running nearly the same code at the same time, so cache pollution should not be a significant problem.
A more advanced solution is to migrate FDs dynamically between threads, depending on whether they are mostly active or mostly inactive. In the simplest case you have only two threads: one polls the mostly active FDs and the other polls the mostly inactive FDs. The thread for active FDs will be woken up very frequently, but it has only a small number of FDs to scan. The other thread has to scan a large number of FDs, but it is only woken up occasionally. An activity counter is kept for each FD. When an FD on the mostly inactive list is deemed to be fairly active, it is migrated to the mostly active list, and the reverse happens for fairly inactive FDs on the mostly active list.
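A sketch of the bookkeeping involved; the thresholds and the decay policy here are purely illustrative, and a real implementation would also relink the FD between the two threads' poll arrays rather than just flip a flag:

    #include <poll.h>

    #define ACTIVE_THRESHOLD    8   /* hits before promotion (illustrative) */
    #define INACTIVE_THRESHOLD  0   /* decayed count before demotion        */

    struct managed_fd
    {
        struct pollfd pfd;
        int activity_count;    /* bumped on activity, decayed otherwise  */
        int on_active_list;    /* which of the two threads polls this FD */
    };

    /* Called for each FD after a polling pass over its list */
    void update_and_migrate (struct managed_fd *mfd)
    {
        if (mfd->pfd.revents & (POLLIN | POLLOUT))
            ++mfd->activity_count;
        else if (mfd->activity_count > 0)
            --mfd->activity_count;    /* simple decay */

        if (!mfd->on_active_list && mfd->activity_count >= ACTIVE_THRESHOLD)
            mfd->on_active_list = 1;  /* migrate to the mostly active list */
        else if (mfd->on_active_list &&
                 mfd->activity_count <= INACTIVE_THRESHOLD)
            mfd->on_active_list = 0;  /* migrate back to the inactive list */
    }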
I favour this solution, since it can be implemented solely in userspace and is portable to other POSIX systems. I have an existing software library which has (amongst other things) support for managing events on FDs. I plan on extending this library to use the above technique. The library is distributed under the LGPL. Watch this space for results.
My approach is to squeeze as much performance as we can out of the existing POSIX/UNIX interface by optimising the kernel and doing clever things in userspace. We can then evaluate how much of a bottleneck polling is under Linux. If polling overheads (for very large numbers of FDs) are kept to within a few percent, I firmly believe there is no need for I/O readiness queues. If polling overheads remain above 10%, then we may consider I/O readiness queues or other extensions to Linux. However, it would be better to add kernel support for scalable AIO rather than implement readiness queues, since AIO is a POSIX standard whereas readiness queues are not.
So now the application only needs to search a very small array. This new syscall is based on the principle of not duplicating work. The kernel has just gone and scanned all these FDs for us; why not keep a record of that work, rather than doing it again in the application? Note that the current Linux polling implementation is so slow that implementing poll2(2) would make little difference. However, once the optimisations I've proposed are added, it will make a difference (yes, I've done the benchmarks). Oh, and before you take my comments on the Linux polling implementation too far, note that I've benchmarked it against several other OSes, and Linux came out well ahead in every case. My point is that we can still do a lot better.
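Purely as an illustration of the principle (the names and signature below are hypothetical, not the actual proposal), a poll2(2)-style call would hand back only the ready FDs, so the application searches a small array instead of the full one:

    /* Hypothetical sketch only: a poll2-style interface that returns
     * just the ready FDs.  Neither this signature nor these type
     * names are the actual proposal. */
    #include <poll.h>

    struct fd_event
    {
        int fd;         /* an FD that is ready  */
        short revents;  /* what it is ready for */
    };

    /* Scan 'nfds' descriptors, write at most 'max_events' ready
       entries into 'events' and return how many were written. */
    extern int poll2 (const struct pollfd *fds, unsigned int nfds,
                      struct fd_event *events, unsigned int max_events,
                      int timeout);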
You might ask, why implement poll2(2) if we have this clever userspace solution that avoids many of the problems? Well, the point is that it would speed up the application processing the inactive list of FDs, and that is always a good thing. However, it may turn out that the gain from poll2(2) is marginal. You could also argue that since poll2(2) is non-POSIX, non-UNIX, then why bother with it at all, and why not simply implement I/O readiness queues? Well, I could say that poll2(2) is only a small departure from UNIX whereas I/O readiness queues are more radical. However, this isn't a very strong argument. I'm waiting to see the results of the userspace solution before considering poll2(2) further. Furthermore, it would be better to make the Linux implementation of AIO scalable, since this is a standard POSIX interface. So poll2(2) is probably dead. It does have one nice feature, though: it would be easier for an application to change to using poll2(2) than to change to using AIO.
Another possibility is to make a slight extension to the existing UNIX asynchronous signal delivery mechanism. Currently, you can request the OS to deliver a signal when an FD becomes ready for reading or writing. Unfortunately, if you do this for multiple FDs your signal handler doesn't know which FD is ready for activity, because UNIX signals do not carry extra information with them. You can't use a separate signal number for each FD, since UNIX has only a few signal numbers. However, POSIX real-time signals can carry a word of data to the signal handler. So we could extend Linux such that if you request a POSIX RT signal to be delivered when an FD is ready for I/O, the word of data is the FD number itself. Unfortunately, depending on this behaviour would once again be Linux-specific. However, such a system does look like an attractive way of providing the foundations for a scalable AIO implementation.
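To illustrate how an application might use such an extension, here is a sketch in terms of a Linux-style fcntl(2) interface; the F_SETSIG request used below is the Linux-specific form of exactly this idea, and handle_activity() is the same placeholder as before:

    /* Request a real-time signal on FD readiness and recover the FD
     * number from the siginfo_t.  F_SETSIG is Linux-specific, which
     * is precisely the portability concern raised above. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <signal.h>
    #include <unistd.h>

    #define READY_SIGNAL  (SIGRTMIN + 1)

    extern void handle_activity (int fd);  /* application-specific */

    void block_ready_signal (void)         /* call once, before arming FDs */
    {
        sigset_t set;

        sigemptyset (&set);
        sigaddset (&set, READY_SIGNAL);
        sigprocmask (SIG_BLOCK, &set, NULL);
    }

    void arm_fd (int fd)
    {
        fcntl (fd, F_SETOWN, getpid ());   /* deliver signals to this process */
        fcntl (fd, F_SETSIG, READY_SIGNAL);
        fcntl (fd, F_SETFL, fcntl (fd, F_GETFL) | O_ASYNC | O_NONBLOCK);
    }

    void wait_for_ready_fd (void)
    {
        sigset_t set;
        siginfo_t info;

        sigemptyset (&set);
        sigaddset (&set, READY_SIGNAL);
        if (sigwaitinfo (&set, &info) < 0)
            return;
        handle_activity (info.si_fd);      /* si_fd holds the ready FD */
    }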
Similarly, for a system without POSIX.4 AIO but with readiness queues it would make sense for the interface code to utilise this facility.