The C10K problem

It's time for web servers to handle ten thousand clients simultaneously, don't you think? After all, the web is a big place now.

And computers are big, too. You can buy a 500MHz machine with 1 gigabyte of RAM and six 100Mbit/sec Ethernet cards for $3000 or so. Let's see - at 10000 clients, that's 50KHz, 100Kbytes, and 60Kbits/sec per client. It shouldn't take any more horsepower than that to take four kilobytes from the disk and send them to the network once a second for each of ten thousand clients. (That works out to $0.30 per client, by the way. Those $100/client licensing fees some operating systems charge are starting to look a little heavy!) So hardware is no longer the bottleneck.
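
The per-client arithmetic above can be checked with a few lines of Python. This is only a sketch of the division; the constant and function names are made up for illustration, and the figures are the ones quoted in the paragraph (decimal units, as in the text).

```python
# Machine-wide resources quoted in the text (1999 hardware).
CPU_HZ = 500e6                   # 500 MHz processor
RAM_BYTES = 1e9                  # 1 gigabyte of RAM
NET_BITS_PER_SEC = 6 * 100e6     # six 100 Mbit/sec Ethernet cards
PRICE_DOLLARS = 3000
CLIENTS = 10_000


def per_client_budget(clients=CLIENTS):
    """Divide each machine-wide resource evenly among the clients."""
    return {
        "cpu_hz": CPU_HZ / clients,
        "ram_bytes": RAM_BYTES / clients,
        "net_bits_per_sec": NET_BITS_PER_SEC / clients,
        "dollars": PRICE_DOLLARS / clients,
    }


budget = per_client_budget()

# The workload in the text: four kilobytes per client per second.
needed_bits_per_sec = 4 * 1024 * 8

# True when the per-client share of the network covers that workload.
network_is_enough = needed_bits_per_sec <= budget["net_bits_per_sec"]
```

Four kilobytes a second is roughly 33 Kbits/sec, so the 60 Kbits/sec share leaves some headroom even before any protocol overhead is counted.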

One of the busiest ftp sites, ftp.cdrom.com, currently serves around 3600 clients simultaneously through a 70 megabit/second pipe. Pipes this fast aren't common yet, but technology is improving rapidly.

With that in mind, here are a few notes on how to configure operating systems and write code to support thousands of clients. The discussion centers around Unix-like operating systems, for obvious reasons.

I/O Strategies

There seem to be four ways of writing a fast web server to handle many clients:

Serve many clients with each server process or thread, and use select() or poll() to avoid blocking. This is the traditional favorite, and is sometimes referred to as "using nonblocking I/O".
Serve many clients with each server process or thread, and use asynchronous I/O to avoid blocking. This has not yet become popular, possibly because of poorly designed asynchronous I/O interfaces. Zach Brown (author of HoserFTPD) thinks this might now be the way to go for highest performance; see his 14 April 1999 post to hftpd-users.
There are several flavors of asynchronous I/O:
- the aio_ interface (scroll down from that link to "Asynchronous input and output"), which associates a signal and value with each I/O operation. Signals and their values are queued and delivered efficiently to the user process. This is from the POSIX 1003.1b realtime extensions, and is also in the Single Unix Specification, version 2, and in glibc 2.1.
- F_SETSIG (see also here), which associates a signal with each file descriptor. When a normal I/O function like read() or write() completes, the signal is raised, with the file descriptor as an argument. Similar to aio_ but without the new calls, and slightly less flexible (you know the handle, but you don't know whether it is ready for read() or for write() without doing a poll() on it). (Currently only in Linux, I think.)
- SIGIO (see glibc doc or BSD Sockets doc) -- doesn't tell you which handle needs servicing, so it seems kind of coarse. Used by the Linux F_SETSIG/aio_ implementation as a fallback when the realtime signal queue overflows. Here's an example of its use. (Was partly broken in Linux kernels 2.2.0 - 2.2.7, fixed in 2.2.8.)
Serve one client with each server thread, and let read() and write() block. (This is the only model supported by Java.)
Build the server code into the kernel. Novell and Microsoft are both said to have done this at various times, and at least one NFS implementation does this. IBM and Sun are said to have released specweb benchmark results using this technique.
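
As a rough illustration of the first strategy in the list above (one thread, many clients, poll() plus nonblocking sockets), here is a minimal sketch in Python, whose select.poll() wraps the same poll() system call. The function name serve_with_poll and its parameters are invented for this example. It sends a fixed response to each client as soon as the socket is writable and never reads a request, so it shows only the shape of the event loop, not a real web server. select.poll() is not available on every platform (Windows lacks it).

```python
import select
import socket


def serve_with_poll(listener, response, max_clients):
    """Send `response` to clients from a single thread, then return.

    Hypothetical helper: stops once at least `max_clients` connections
    have been finished, so that the example terminates.
    """
    listener.setblocking(False)
    listen_fd = listener.fileno()
    poller = select.poll()
    poller.register(listen_fd, select.POLLIN)
    conns = {}     # fd -> client socket
    pending = {}   # fd -> bytes not yet written to that client
    served = 0
    while served < max_clients:
        for fd, events in poller.poll():
            if fd == listen_fd:
                try:
                    conn, _addr = listener.accept()
                except BlockingIOError:
                    continue   # the connection went away before accept()
                conn.setblocking(False)
                conns[conn.fileno()] = conn
                pending[conn.fileno()] = response
                poller.register(conn.fileno(), select.POLLOUT)
                continue
            conn = conns[fd]
            done = bool(events & (select.POLLERR | select.POLLHUP))
            if not done and events & select.POLLOUT:
                try:
                    # send() may accept only part of the data; keep the rest.
                    sent = conn.send(pending[fd])
                    pending[fd] = pending[fd][sent:]
                except BlockingIOError:
                    pass           # not writable after all; poll again
                except OSError:
                    done = True    # peer reset or closed the connection
                done = done or not pending[fd]
            if done:
                poller.unregister(fd)
                conn.close()
                del conns[fd], pending[fd]
                served += 1
    for conn in conns.values():    # close anything still in flight
        conn.close()
    return served
```

The important property is that no call in the loop can block on a single slow client: accept() and send() return immediately, and the only place the thread sleeps is poller.poll(), which wakes up when any of the registered handles is ready.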

Richard Gooch has written a paper discussing these options. Interesting reading.

The Apache mailing lists have some interesting posts (one, two, three) about why they prefer not to use select() (basically, they think that makes plugins harder).
I have not yet seen any data comparing the performance of the four approaches.

Mark Russinovich wrote an editorial and an article discussing I/O strategy issues in the 2.2 Linux kernel. Worth reading, even though he seems misinformed on some points. In particular, he seems to think that Linux 2.2's asynchronous I/O (see F_SETSIG above) doesn't notify the user process when data is ready, only when new connections arrive. This seems like a bizarre misunderstanding. See also comments on an earlier draft, a rebuttal from Mingo, Russinovich's comments of 2 May 1999, a rebuttal from Alan Cox, and various posts to linux-kernel.

Limits on open filehandles

Solaris: see the Solaris FAQ, question 3.45.
FreeBSD: use sysctl -w kern.maxfiles=nnnn to raise the limit.
Linux: Even the 2.2.5 kernel limits the number of open files to 1024. I believe the AC series of patches removes this limit; see ftp.*.kernel.org/pub/linux/kernel/alan/ for mirrors of e.g. ftp://ftp.linux.org.uk/pub/linux/alan/2.2/patch-2.2.5-ac7.bz2
(See also this patch to make poll scale beyond 1024 fds. It's dated Dec 1998; I think it's already in the 2.2.5 kernel.)
Any Unix: the limits set by ulimit or setrlimit.
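
For the per-process limit mentioned in the last item, a server can inspect and raise its own soft limit at startup with getrlimit()/setrlimit(). The sketch below uses Python's resource module, which wraps those calls; the helper name raise_fd_limit is made up for this example. It can only raise the soft limit up to the hard limit. Going beyond the hard limit, or beyond the kernel-wide limits discussed above, needs administrator action.

```python
import resource


def raise_fd_limit(wanted):
    """Try to raise the soft open-file limit to `wanted`.

    Returns the soft limit in effect afterwards. Never lowers the limit.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft                      # already unlimited
    # The soft limit may not exceed the hard limit.
    target = wanted if hard == resource.RLIM_INFINITY else min(wanted, hard)
    if target <= soft:
        return soft                      # nothing to gain
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    except (ValueError, OSError):
        pass                             # the OS refused; keep the old limit
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```

A server that hopes to hold ten thousand connections might call raise_fd_limit(10240) early in main and log the value it gets back, since the result depends on how the machine is configured.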

Limits on threads

Solaris: it supports as many threads as will fit in memory, I hear.
FreeBSD: ?
Linux: Even the 2.2.2 kernel limits the number of threads, at least on Intel. I don't know what the limits are on other architectures. Mingo posted a patch for 2.1.131 on Intel that removed this limit; I hear he intends to provide updated patches as time goes on, until it's time to integrate it into the main version of the kernel.
Java: See the Volanomark benchmark writeup in Javaworld. It recommends reducing the amount of memory reserved by default for each thread.
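
The advice to reduce the memory reserved per thread is not specific to Java. As a hedged illustration in Python: threading.stack_size() sets the stack size used for threads created afterwards (the documented minimum is 32 KiB, and some platforms do not allow changing it at all). The helper names and the 256 KiB figure below are arbitrary choices for the example, not recommendations.

```python
import threading


def max_threads(address_space_bytes, stack_bytes):
    """Upper bound on thread count if stacks alone filled the address space."""
    return address_space_bytes // stack_bytes


def run_with_small_stacks(n_threads, stack_bytes=256 * 1024):
    """Start n_threads trivial threads using a reduced stack size.

    Returns how many of them ran. The previous stack size setting is
    restored before returning.
    """
    previous = threading.stack_size(stack_bytes)   # returns the old setting
    results = []
    try:
        threads = [
            threading.Thread(target=results.append, args=(i,))
            for i in range(n_threads)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    finally:
        threading.stack_size(previous)
    return len(results)
```

With illustrative numbers, max_threads(2 * 1024**3, 1024**2) gives 2048 threads for 1 MiB stacks in a 2 GiB address space, while 256 KiB stacks would allow four times as many, which is why the default reservation matters for thread-per-client servers.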

Other limits/tips

select() is limited to FD_SETSIZE handles. This limit is compiled into the standard library and user programs. The similar call poll() does not have a comparable limit, and can have less overhead than select().
Even the most recent glibc might use 16-bit variables to hold thread or file handles, which could cause trouble above 32767 handles/threads.
Too much thread-local memory is preallocated by some operating systems; if each thread gets 1MB, and total VM space is 2GB, that creates an upper limit of about 2000 threads.
Normally, data gets copied many times on its way from here to there. mmap() and sendfile() can be used to reduce this overhead in some cases. IO-Lite is a proposal (already implemented on FreeBSD) for a set of I/O primitives that gets rid of the need for many copies. It's sexy; go read it. But see also Alan Cox's opinion of zero-copy.
The sendfile() function in Linux and FreeBSD lets you tell the kernel to send part or all of a file. This lets the OS do it as efficiently as possible. It can be used equally well in servers using threads or servers using nonblocking I/O. (In Linux, it's poorly documented at the moment; use _syscall4 to call it. Andi Kleen is writing new man pages that cover this.) Rumor has it, ftp.cdrom.com benefited noticeably from sendfile().
A new socket option under Linux, TCP_CORK, tells the kernel to avoid sending partial frames, which helps a bit e.g. when there are lots of little write() calls you can't bundle together for some reason. Unsetting the option flushes the buffer.
Not all threads are created equal. The clone() function in Linux (and its friends in other operating systems) lets you create a thread that has its own current working directory, for instance, which can be very helpful when implementing an ftp server. See Hoser FTPd for an example of the use of native threads rather than pthreads.
To keep the number of filehandles per process down, servers can fork() once they reach the desired maximum; the child finishes serving the existing clients, and the parent accepts and services new clients. (If the desired maximum is 1, this degenerates to the classical one-process-per-client model.)
One developer using sendfile() with FreeBSD reports that using POLLWRBAND instead of POLLOUT makes a big difference.
Look at the performance comparison graph at the bottom of http://www.acme.com/software/thttpd/benchmarks.html. Notice how various servers have trouble above 128 connections, even on Solaris 2.6? Anyone who figures out why, let me know.
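
Two of the tips above, sendfile() and TCP_CORK, combine naturally: cork the socket, write a small header, let the kernel send the file, then uncork to flush. The following is a sketch under assumptions rather than a tuned implementation. It uses Python's os.sendfile() and socket.TCP_CORK, which expose the Linux calls; the function name send_file_corked and the header argument are hypothetical, a blocking socket is assumed, and the cork step is skipped on platforms where the constant is missing.

```python
import os
import socket


def send_file_corked(conn, header, path):
    """Send `header` followed by the file at `path` over a TCP socket.

    Returns the number of file bytes sent (the header is not counted).
    """
    cork = getattr(socket, "TCP_CORK", None)   # Linux-only socket option
    offset = 0
    if cork is not None:
        # While corked, the kernel avoids sending partial frames.
        conn.setsockopt(socket.IPPROTO_TCP, cork, 1)
    try:
        conn.sendall(header)
        with open(path, "rb") as f:
            size = os.fstat(f.fileno()).st_size
            while offset < size:
                # The kernel copies file data to the socket directly,
                # without passing it through a user-space buffer.
                sent = os.sendfile(conn.fileno(), f.fileno(),
                                   offset, size - offset)
                if sent == 0:
                    break          # file was truncated underneath us
                offset += sent
    finally:
        if cork is not None:
            # Unsetting the option flushes whatever is still buffered.
            conn.setsockopt(socket.IPPROTO_TCP, cork, 0)
    return offset
```

In a server built around nonblocking I/O, the same calls apply, but os.sendfile() can then raise BlockingIOError, and the current offset has to be remembered until the socket is writable again.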

Kernel Issues

For Linux, it looks like kernel bottlenecks are being fixed constantly. See Linux HQ, Kernel Traffic, and the Linux-Kernel mailing list (see, for example, interesting posts by a user asking how to tune, and by Dean Gaudet).

Measuring Server Performance

Two tests in particular are simple, interesting, and hard:

raw connections per second (how many 512 byte files per second can you serve?)
total transfer rate on large files with many slow clients (how many 28.8k modem clients can simultaneously download from your server before performance goes to pot?)
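
To make the first test concrete, here is a small, self-contained harness that measures raw connections per second for 512-byte responses. Everything about it is an assumption for illustration: the names are invented, the "server" is a trivial one-thread loop standing in for the server under test, and everything runs over the loopback interface, so the number it reports says nothing about real network behavior. It shows only the shape of the measurement: open a connection, read to end of file, close, repeat, and divide by elapsed time.

```python
import socket
import threading
import time


def _serve(listener, payload, n_connections):
    """Stand-in server: answer each connection with `payload`, then close."""
    for _ in range(n_connections):
        conn, _addr = listener.accept()
        with conn:
            conn.sendall(payload)


def measure_connections_per_second(n_connections=200, payload_size=512):
    """Return (connections per second, total bytes received)."""
    payload = b"x" * payload_size
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))        # any free loopback port
    listener.listen(128)
    addr = listener.getsockname()
    server = threading.Thread(
        target=_serve, args=(listener, payload, n_connections), daemon=True
    )
    server.start()

    received = 0
    start = time.perf_counter()
    for _ in range(n_connections):
        with socket.create_connection(addr, timeout=5) as client:
            while True:
                chunk = client.recv(4096)
                if not chunk:              # server closed: response complete
                    break
                received += len(chunk)
    elapsed = time.perf_counter() - start

    server.join(timeout=5)
    listener.close()
    return n_connections / elapsed, received
```

A real run would point the client loop at the server being evaluated, use many concurrent clients rather than one sequential loop, and check that the received byte count matches what was expected before trusting the rate.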

Jef Poskanzer has published benchmarks comparing many web servers. See http://www.acme.com/software/thttpd/benchmarks.html for his results.

In March 1999, Microsoft sponsored a benchmark comparing NT to Linux at serving large numbers of http and smb clients, in which they failed to see good results from Linux. See also my article on Mindcraft's April 1999 Benchmarks for more info.

I also have a few old notes about comparing thttpd to Apache that may be of interest.

Interesting select()-based servers

thttpd. Very simple. Uses a single process. It has good performance, but doesn't scale with the number of CPUs.
mathopd. Similar to thttpd.
Zeus, a commercial server that tries to be the absolute fastest. See their tuning guide.
The other non-Java servers listed at http://www.acme.com/software/thttpd/benchmarks.html
BetaFTPd
Flash-Lite - web server using IO-Lite.
xitami - uses select() to implement its own thread abstraction for portability to systems without threads.
Medusa - a server-writing toolkit in Python that tries to deliver very high performance.

Interesting thread-based servers

Hoser FTPD. See their benchmark page.
Peter Eriksson's phttpd and pftpd
The Java-based servers listed at http://www.acme.com/software/thttpd/benchmarks.html
Sun's Java Web Server (which has been reported to handle 500 simultaneous clients)

Other interesting servers

Novell's FastCache -- claims 10000 hits per second. Quite the pretty performance graph.

Copyright 1999 Dan Kegel
dank@alumni.caltech.edu
Last updated: 8 May 1999