IEPM

TCP Stack Measurements on Lightly Loaded Testbeds

Les Cottrell. Created 16 Dec '02, last update 15 February '03

 


Introduction | Comparisons with Multiple Streams | Multiple Hosts | TCP Stack Comparisons with Single Streams | Jumbo Frames | Measurements from Sunnyvale to Amsterdam | CPU Utilization | Startup | Fair Share | bbcp | Link Losses | File transfers | Comparison plot

Introduction

Steven Low and his group at Caltech have developed a new FAST TCP stack that improves performance on high speed, long RTT links. Tom Kelly of CERN has developed a Scalable TCP stack. Sally Floyd has proposed a High Speed TCP (HS TCP), which has been implemented by the web100/net100 team. FAST TCP is based on Vegas and uses the RTT to detect congestion. The latter two are based on Reno for congestion recognition and modify Reno's additive increase and multiplicative decrease congestion strategies (Scalable TCP uses an exponential increase, while HS TCP uses a table to decide how much to increase the congestion window when an ACK is received). All of these stacks need only be implemented at the sender. We report here on measurements made with all three of these stacks as well as with the stock TCP stack. We also report related measurements with jumbo frames and with varying the transmit queue length of the network device (txqueuelen).
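As a rough illustration of the difference in window growth rules, the toy sketch below compares Reno's additive increase with Scalable TCP's per-ACK increase (this is not the kernel code; the Scalable constant of 0.01 per ACK is the commonly quoted value and is assumed here, delayed ACKs are ignored, and HS TCP's table-driven increase is omitted):

awk 'BEGIN {
  reno = 1000; scal = 1000                 # starting congestion windows, in segments
  for (rtt = 1; rtt <= 200; rtt++) {
    reno += 1                              # Reno: +1/cwnd per ACK, i.e. ~1 segment per RTT
    scal += 0.01 * scal                    # Scalable: +0.01 per ACK, i.e. ~1% growth per RTT
    if (rtt % 50 == 0)
      printf "after %3d RTTs: Reno cwnd=%d  Scalable cwnd=%d\n", rtt, reno, scal
  }
}'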

These measurements were made using the DataGrid/Caltech/SLAC testbed. This consists of fast 2.4GHz Linux hosts with GE Network Interface Cards (NICs) located at CERN Geneva (GVA), StarLight Chicago (CHI), and at the Level(3) gateway in Sunnyvale, California (SNV). At a later stage we also had 5 hosts (2 at SNV, 2 at CHI and 1 at GVA) with Intel 10GE NICs (see 10GE End-to-end TCP tests). The link between Sunnyvale and Chicago is a Level(3) provided OC192/POS (10Gbits/s) link. The link between CHI and GVA is an OC48 (2.5Gbits/s) link. The testbed was set up for SC2002, and more details can be found in "Extreme Bandwidth": SC2002 Bandwidth Challenge Proposal.

The FAST 10GE Experiment is the planning page for the Jan 2003 experiment, and will be continually updated. In some cases we reserved hosts at CERN for exclusive use, using the DataTag reservation form available to DataTag users following the rules. Where possible we tried to avoid reserving hosts, so some of our results may have been affected by cross-traffic.

Comparisons with multiple streams

Methodology

The methodology is broadly outlined in Bulk throughput measurements. We set up the disk server routing to balance traffic between Sunnyvale and Chicago. We used iperf to send TCP data for 80 seconds with various windows and numbers of parallel streams. We chose 80 seconds (see footnote 1) since that should allow slow start to complete in < 10% of the total 80 second measurement time. We estimated the slow start time for the normal (Reno/Tahoe) TCP stack for a single stream to be ~ 2*ceiling(log2(optimum_window_size))*RTT, where RTT is the Round Trip Time (usually measured by ping). For an RTT of ~67ms (the RTT from Sunnyvale to Chicago) this yields ~ 2 seconds. We limited the product of windows*streams to 40,000 KBytes to limit the impact on resources such as memory.
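A minimal sketch of such a window/stream scan could look as follows (the host name and the window and stream lists are illustrative placeholders, not the exact values used):

HOST=chi2.example.net
for WIN_KB in 64 256 1024 4096 16384 32768; do
  for STREAMS in 1 2 4 8 16 32 64 120; do
    if [ $((WIN_KB * STREAMS)) -le 40000 ]; then   # 40,000 KByte cap on window*streams
      iperf -c $HOST -w ${WIN_KB}K -P $STREAMS -t 80
    fi
  done
done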

FAST TCP and Multiple Streams

We made measurements from SNV11 (198.51.111.50), running FAST TCP, to CHI2. By default we configured the FAST TCP stack and host as follows:

[cottrell@cit-slac11 ~]$ bin/setup
/proc/sys/net/ipv4/tcp_mem = 4096 67108864 67108864
/proc/sys/net/ipv4/tcp_rmem = 4096 67108864 67108864
/proc/sys/net/ipv4/tcp_wmem = 4096 67108864 67108864
/proc/sys/net/core/wmem_max = 67108864
/proc/sys/net/core/rmem_max = 67108864
/proc/sys/net/ipv4/tcp_vegas_cong_avoid = 1
/proc/sys/net/ipv4/tcp_vegas_fast_converge = 1
/proc/sys/net/ipv4/tcp_vegas_alpha = 400
/proc/sys/net/ipv4/tcp_vegas_beta = 250
/proc/sys/net/ipv4/tcp_vegas_gamma = 150
MTU = 1500
txqueuelen = 100
uname -a = Linux cit-slac12.caltech.edu 2.4.18-3combined #13 SMP Mon Nov 18 11:58:38 PST 2002 i686 unknown
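The same settings could be applied with sysctl and ifconfig along the following lines (a sketch only; the tcp_vegas_* keys are provided by the FAST/Vegas-patched kernel, not stock Linux, and the interface name eth0 is an assumption):

sysctl -w net.ipv4.tcp_mem="4096 67108864 67108864"
sysctl -w net.ipv4.tcp_rmem="4096 67108864 67108864"
sysctl -w net.ipv4.tcp_wmem="4096 67108864 67108864"
sysctl -w net.core.wmem_max=67108864
sysctl -w net.core.rmem_max=67108864
sysctl -w net.ipv4.tcp_vegas_cong_avoid=1
sysctl -w net.ipv4.tcp_vegas_fast_converge=1
sysctl -w net.ipv4.tcp_vegas_alpha=400
sysctl -w net.ipv4.tcp_vegas_beta=250
sysctl -w net.ipv4.tcp_vegas_gamma=150
ifconfig eth0 mtu 1500 txqueuelen 100      # matches the MTU and txqueuelen listed above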

The throughputs as a function of streams and windows are seen below. We use log-log plots to make the small numbers of streams stand out more clearly. The maximum number of streams was 120. The maximum throughput measured, aggregated over 60 seconds, was 870Mbits/s (for one stream with a 16384KByte window; the next two highest were also for a single stream: 838Mbits/s with a 65536KByte window and 829Mbits/s with a 32768KByte window). The average MHz/Mbits/s was 1.58+-0.26.

Stock TCP and Multiple Streams

We also measured the throughput from CHI1 (an identical host to CHI2, except that it was running the standard TCP stack) to SNV11. Each measurement was for 50 seconds. The results are shown below. The maximum throughput was 486Mbits/s for 25 streams with a 4096KByte window; second was 428Mbits/s for 120 streams with a 1024KByte window, and third was 417Mbits/s with 12 streams and an 8192KByte window. The average MHz/Mbits/s was 1.58+-0.36, which agrees well with the cpu utilization for the standard TCP stack. It is also seen that the multi-stream performance is similar for the FAST and normal TCP stacks for large numbers of streams (>20), but the behavior for small numbers of streams, in particular 1, is much improved (almost doubled for a single stream) by FAST.

FAST TCP and Multiple Streams with a longer RTT

We made measurements from SNV11 to GVA2. Here the round trip time was about 182ms. The iperf TCP throughput using the FAST stack is shown below for various windows and streams. The maximum throughput measured was 855Mbits/s for one stream with a 65536KByte window. It is seen that the throughput begins to saturate above 400Mbits/s, or about 50% of the maximum achievable.

Multiple hosts

We made 80 second iperf TCP measurements between various hosts with the following FAST parameters: 
cat /proc/sys/net/ipv4/tcp_mem = 4096 67108864 67108864
cat /proc/sys/net/ipv4/tcp_rmem = 4096 67108864 67108864
cat /proc/sys/net/ipv4/tcp_wmem = 4096 67108864 67108864
cat /proc/sys/net/core/wmem_max = 67108864
cat /proc/sys/net/core/rmem_max = 67108864
cat /proc/sys/net/ipv4/tcp_vegas_cong_avoid = 1
cat /proc/sys/net/ipv4/tcp_vegas_fast_converge = 1
cat /proc/sys/net/ipv4/tcp_vegas_alpha = 400
cat /proc/sys/net/ipv4/tcp_vegas_beta = 250
cat /proc/sys/net/ipv4/tcp_vegas_gamma = 150
uname -a = Linux cit-slac11.caltech.edu 2.4.18-3combined #13 SMP Mon Nov 18 11:58:38 PST 2002 i686 unknown

We also made bbftp measurements for a 2GByte file transfer.

For all of these measurements the window size was set to 32768KBytes unless otherwise stated, a single stream was used, and the MTU was 1500 Bytes. The numbers in the table are Mbits/s; an annotation of (j) indicates a measurement using jumbo frames. The numbers in parentheses are the maximum window size configured. The servers marked with Disk have 2 TBytes of RAID disk space. The bbcp application at this time could only accept a maximum window request of 2 MBytes, so the long-distance bbcp measurements between Sunnyvale and Chicago or CERN performed poorly since the window size was inadequate.

Receivers (host type, TCP stack, maximum configured window):
SNV11 (CPU, FAST, 67MB); SNV2 (CPU, Std, 67MB); SNV13 (Disk, Std, 32MB); SNV17 (Disk, FAST, 67MB); CHI1 (CPU, Std, 32MB); CHI2 (CPU, Std, 65MB); CHI3 (CPU, Std, 32MB); CHI10 (Disk, Std jumbo, 32MB); GVA1 (CPU, Std, 32MB); GVA2 (Disk, Std, 32MB); GVA3 (Disk, Std, 32MB); GVA4 (Disk, Std, 32MB); NIK16 (CPU, Std, 24MB)

iperf throughputs (Mbits/s), sender (host type, TCP stack, max window) to receiver:
SNV11 (CPU, FAST, 67MB):       SNV2=900; SNV13=809; CHI1=890; GVA1=411; GVA2=855; GVA3=872
SNV17 (Disk, FAST, 67MB):      SNV13=860; CHI2=760; CHI10=725; GVA1=450+-25, 209+-60(j); GVA2=790, 939(j); GVA3=528; GVA4=840
SNV10 (CPU, Scalable, 64MB):   GVA4=922
SNV2 (CPU, Std, 67MB):         SNV11=928; SNV13=914; GVA3=800(j)
SNV13 (Disk, Std jumbo, 32MB): CHI10=338; GVA1=34(j); GVA2=200
GVA2 (CPU, Std, 32MB):         SNV13=73; SNV17=74; CHI2=26
CHI3 (CPU, Std, 32MB):         GVA1=200, 900(j); NIK16=950(j)
NIK16 (CPU, Std, 24MB):        CHI3=200

bbftp throughputs for a 2GByte file transfer (Mbits/s):
SNV11 (CPU, FAST, 67MB):       CHI1=156; GVA2=219
SNV17 (Disk, FAST, 67MB):      SNV13=608; CHI2=157; CHI10=376; GVA2=204
SNV13 (Disk, Std jumbo, 32MB): CHI10=383; GVA2=107

bbcp memory-to-memory throughputs (Mbits/s):
SNV11 (CPU, FAST, 67MB):       GVA2=120
SNV17 (Disk, FAST, 67MB):      SNV13=826; GVA2=128
SNV13 (Disk, Std jumbo, 32MB): CHI10=362

Further study of the initially poor iperf performance (449Mbits/s) between SNV11 and GVA2, measured on 1/12/2003, indicates it was due to a very slow start, and there was heavy (>> 1%) loss measured by pings. The losses occurred only during the iperf measurement. It may also be relevant that the Ethernet interface on GVA2 reported 5 receive interface errors. These errors appear to occur at a rate of about 1/minute regardless of whether iperf is running. Similar results are seen from SNV17 to GVA3. SNV17 to GVA2, on the other hand, reaches stability at about 840Mbits/s by 15 seconds, hence its throughput is much higher. The poor performance between SNV11 and GVA2 appeared to be transient, and on 1/14/2003 we achieved 855Mbits/s iperf throughput. Further studies of the behavior of the congestion window (cwnd), instantaneous throughput and RTT can be found at http://www.cs.caltech.edu/~chengjin/les/.

TCP Stack Comparisons with Single Streams

We used iperf clients on Sunnyvale hosts to send TCP data with a window size of 32 MBytes to iperf servers at GVA. By default we ran iperf for 1000s and reported incremental throughputs at 5 second intervals. SNV2 had the stock TCP stack installed, SNV10 had Scalable TCP, SNV11 had FAST TCP, and SNV9 had Sally Floyd's HS TCP with web100. The GVA iperf server hosts all had the stock TCP stack installed. Besides using different TCP stacks on the client, we also varied txqueuelen and used jumbo frames (MTU=9000Bytes) as well as the standard MTU=1500Bytes. All tests except one were made for 1000 seconds. Before each test we used the Linux (root) command sysctl -w net.ipv4.route.flush=1 to ensure that the slow start threshold was not cached. We were also careful to ensure there was plenty of free memory (we queried the amount of free memory using the Linux top or free command, and if there was less than 400000K free we rebooted the host). We also checked that we were the only user logged onto the client, and observed top on the server to verify that our iperf was the only consumer of cpu cycles. The stock and Scalable stack measurements were made with various txqueuelen settings, but for FAST we only used the recommended setting of 100, since FAST relies on the RTT for its congestion avoidance and large settings of txqueuelen can dramatically increase the RTT. The results are shown below.
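A sketch of the per-test procedure just described (the iperf server name is a placeholder; the 400000K free memory threshold is the one quoted above):

sysctl -w net.ipv4.route.flush=1                     # avoid a cached slow start threshold
free | awk '/^Mem:/ { if ($4 < 400000) print "WARNING: less than 400000K free, reboot first" }'
iperf -c gva-server.example.net -w 32M -t 1000 -i 5  # 32MByte window, 1000s, 5s reporting intervals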

Also see Floyd's TCP slow-start and AIMD mods by Tom Dunigan for a comparison of how HS TCP behaves relative to stock TCP.

Jumbo Frames

Jumbo frames and multiple streams with stock TCP

Hosts were set up to enable jumbo frames (MTU = 9000 Bytes). These were SNV13 (198.51.111.58) and CHI10 (192.91.236.10). Both used the stock TCP stack. We used tcpdump and the iperf -m option to verify that large MTUs were being transmitted. We also used cat /proc/net/snmp to look for evidence of fragmentation. In addition, the Linux traceroute has a -F (don't fragment) option that, when used with large frames (the frame size follows the host address/name in the command), can be used to see whether jumbo frames reach the destination. For information on Path MTU discovery see RFC 1191. We measured the throughput for various windows and streams with the results shown below.
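The checks above amount to commands along these lines (a sketch; the interface name eth0 is an assumption, the destination is CHI10 from the text):

ifconfig eth0 mtu 9000                 # enable jumbo frames on the NIC
iperf -c 192.91.236.10 -w 8M -m        # -m reports the MSS iperf actually used
grep '^Ip:' /proc/net/snmp             # inspect the Frag* counters for fragmentation
traceroute -F 192.91.236.10 9000       # don't fragment; the frame size follows the host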

Comparing with SNV11 to CHI2 (FAST stack, standard MTU), it is seen that in the linear domain (i.e. where throughput increases linearly on the plot for SNV11 to CHI2) the throughputs are almost identical, apart from the 64KB window size (which is currently not understood). For large numbers of streams, jumbo frames outperform the FAST stack.

In theory, FAST works better than standard TCP (Reno) when the bandwidth (in packets per second) is high. Reno cannot perform well as the packet rate increases; FAST is scalable for both low and high packet rates.

Jumbo frame Reno is better than 1500 Byte MTU Reno since the packet per second rate is reduced to 1/9 by jumbo frames, so Reno's problem at high packet rates is alleviated.

For the multiple connection case, the packet rate is much smaller for each connection, so FAST has no advantage (similarly, on a 10Mbps link FAST cannot show an advantage over Reno), but jumbo frames retain their own advantages of fewer interrupts and a higher payload fraction. So jumbo frame Reno is better than FAST. (I expect FAST with jumbo frames would have similar performance to Reno with jumbo frames.) Xiaoliang Wei, Caltech FAST team.

Comparing against CHI1 to SNV11 (standard stack and MTU) again the behavior in the linear region is almost identical. Saturation sets in around 400Mbits/s for the standard stack with standard MTU whereas for jumbo frames saturation sets in close to 1Gbits/s.

Jumbos and Single Stream with stock TCP

We also made measurements between Amsterdam/NIKHEF (145.146.97.16) and Chicago (CHI3). Both hosts were enabled for jumbo frames and were running stock TCP. The NIKHEF host (2*2.4GHz Linux PC) had txqueuelen set to 1000 (packets), and CHI3 (2*2.2GHz Linux PC) was set to 2500. The max TCP window was set to 32MBytes, and iperf used 1 stream. No errors were observed on the NICs. The routes were asymmetric, and jumbo frames were only enabled in one direction (CHI3 to NIK16). The RTT was about 129ms:

This illustrates that jumbo frames for a single stream with stock TCP between 2 identical hosts with an RTT of 128ms can improve performance by a factor of 5 compared to a 1500Byte MTU.

From Sunnyvale to CERN we set up SNV1 (198.51.111.10) and GVA3 (192.91.239.3) with jumbo frames, txqueuelen=1000, running stock TCP. We sent TCP data from SNV1 to GVA3 using iperf with a 32MByte window. The results indicate that in the first 5 seconds we achieved about 400Mbits/s, by 80 seconds it reached about 800Mbits/s, and after 240 seconds it reached about 990Mbits/s. The aggregate throughput reached after 1000 seconds was 972Mbits/s. It is apparent that for this RTT (181ms), with jumbo frames, one needs to run for a considerable time (240 seconds) to reach the optimum performance. The figure below illustrates the additive increase of stock TCP with jumbo frames: the throughput grows roughly as time*0.5*MTU/RTT^2, since the congestion window grows by about one MTU per RTT, and the factor 0.5 accounts for delayed ACKs.

For stock TCP the maximum single stream throughput for MTU = 9000 Bytes exceeded that for MTU = 1500 Bytes by almost a factor of 5 (200 Mbits/s for MTU=1500 Bytes vs 967Mbits/s with MTU=9000 Bytes).

Jumbo frames and Various TCP Stacks

We made measurements with iperf/TCP from SNV to GVA1 with both server and client configured for MTU = 9000 Bytes and txqueuelen = 100. The results are shown below. Comparing these figures with those in the section on Comparing TCP Stacks, it can be seen that jumbo frames help significantly in improving throughput for all stacks evaluated.

For this unloaded path, with both FAST and HS TCP and with txqueuelen = 100, we were able to achieve > 900Mbits/s within 10 seconds. We also tried other values of txqueuelen for the Scalable TCP to see how they affected the overall throughput and stability, but for shorter durations. The average throughputs observed over the first 5, 20, 40, 80 and 400 seconds are seen in the table below. It can be seen that a larger txqueuelen results in larger throughputs for Scalable TCP with MTU = 9000 Bytes.

txqueuelen  Time to reach 800Mbits/s  Time to reach 900Mbits/s  Avg throughput after 5s (Mbits/s)  Avg throughput after 400s (Mbits/s)  Avg throughput after 80s (Mbits/s)
2000   5s    5s    657   982+-38   966+-81
1000   15s   15s   500   901+-163  840+-247
500    20s   20s   380   844+-145  814+-156
200    25s   40s   291   798+-125  741+-186
100    20s   105s  147   774+-128  715+-195

The behavior of the throughputs with txqueuelen is plotted below. It can be seen that there is little growth in the average throughput after 80 seconds. The points also fit well (R^2 > 0.9) to logarithmic curves. The curves shown are fits of the form f(q) = a*ln(q) + b, where q is the txqueuelen, with the parameters shown in the table below. The throughput at 5s is mainly dominated by slow start.

Parameters and R^2 of the fit f(q) = a*ln(q) + b to throughput vs txqueuelen (q)

Seconds so far (T)    a     b    R^2
5                   161  -589   0.98
10                  174  -545   0.95
20                  142  -164   0.98
40                  108   119   0.97
80                   79   334   0.93
400                  68   445   0.95
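As a quick check, the T=400s fit above (a=68, b=445) can be evaluated at the measured txqueuelen values and compared with the 400s averages in the first table:

for q in 100 200 500 1000 2000; do
  awk -v q=$q 'BEGIN { printf "txqueuelen %4d: fitted %3.0f Mbits/s\n", q, 68*log(q) + 445 }'
done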
 

 

Measurements from Sunnyvale to Amsterdam

The path partially used a production network (from StarLight to NIKHEF). The maximum MTU was 8192 Bytes. We followed the methodology described earlier. We also measured the ping RTT from the server to the client simultaneously with the iperf measurements. The results of measurements made on February 18th and February 21st, 2003 using Scalable, FAST, HS and stock TCP with various txqueuelen settings and MTU = 8192 Bytes are shown below in tabular and graphical forms. The approximate time stamp for when each measurement ended is also given on each plot. The behaviors of the TCP stacks are markedly different. For HS TCP one can see the slope of the recovery increasing with the congestion window (cwnd).
Col, Row Date, time PST Stack txqueuelen MTU Bytes Avg throughput in 1000s (Mbits/s) Avg throughput in 80s (Mbits/s) Comments
1,1 Feb 8 '03, 09:30 FAST 100 8192 461+-241 447+-194  
2,1 Feb 18 '03, 10:54 Scale 100 8192 387+-68 397+-82  
2,2 Feb 18 '03, 16:04 Scale 500 8192 507+-140 517+-155  
2,3 Feb 16 '03, 16:27 Scale 1000 8192 530+-161 568+-164  
2,4 Feb 18 '03, 10:11 Scale 2000 8192 622+-146 684+-179  
2,5 Feb 18 '03, 16:56 Scale 2000 8192 682+-133 644+-145  
1,2 Feb 21 '03, 11:25 HS 100 8192 303+-123 180+-137  
1,3 Feb 21 '03, 17:12 HS 500 8192 290+-142 289+-138  
1,4 Feb 21 '03, 10:50 HS 2000 8192 334+-207 323+-146  
1,5 Feb 21 '03, 15:50 HS 10000 8192 371+-239 292+-175  
3,1a Feb 18 '03, 13:21 Stock 100 8192 438+-53 237+-26  
3,2 Feb 21 '03 Stock 100 8192 318+-51 248+-28  
3,1b Feb 18 '03, 10:15 Stock 1000 8192 502+-101 740+-116  

CPU Utilization

The cpu utilization measured at the client (SNV11, a 2*2.4GHz Pentium 4 running the FAST TCP stack) using the Unix time command is shown below. The average was 1.58 +- 0.26 MHz/Mbits/s. This is in reasonable agreement with measurements for more standard (Reno/Tahoe) TCP stacks.

We manually observed the cpu utilization of the iperf server using the Unix top command and noted its values together with the throughputs recorded using the iperf -i incremental reporting option. The MHz/Mbps was 1.6 +- 0.2. A plot of the server cpu utilization for SNV11 (running FAST) to GVA4 (stock TCP) with a 1500 Byte MTU is shown below. We repeated the measurement with the iperf client sending data from SNV10 running the Scalable TCP stack and from SNV11 running the FAST TCP stack to an iperf server at GVA4, with similar server utilization results. There was, however, a big difference when using jumbo frames (MTU=9000 Bytes). The server cpu utilization for MTU=9000 Bytes is about a factor of 3 less than for MTU=1500 Bytes, or more quantitatively 0.59+-0.1 compared to 1.6+-0.2.
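For illustration, assuming the MHz/Mbits/s figure is computed as the observed cpu utilization fraction times the cpu clock speed divided by the throughput (our reading of the metric), a 2.4GHz server at 60% utilization delivering 900Mbits/s would give the quoted 1.6; the utilization and throughput figures below are hypothetical:

awk 'BEGIN { printf "%.1f MHz/Mbits/s\n", 0.60 * 2400 / 900 }'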

To compare iperf client cpu utilization between standard MTUs (1500 Bytes) and jumbo frames, we used iperf/TCP to send data for 80 seconds from a FAST TCP host (SNV17) at Sunnyvale to standard TCP stack hosts at Chicago (CHI2) and Geneva (GVA2). We used txqueuelen=100, a single stream and varying window sizes to achieve different throughputs. The GVA MTU was set (using ifconfig eth0 mtu 9000) to 9000 Bytes. The MTU at Sunnyvale was alternated between 1500 and 9000 Bytes. The results are shown below. It is seen that for the FAST TCP stack the iperf client cpu utilization is about a factor of 2 less for jumbo frames. For the stock TCP stack the difference in CPU utilization/Mbits/s between MTU=1500 Bytes and 9000 Bytes was fairly small (see the 2nd figure below) and was close to that for the FAST stack with a 9000 Byte MTU.

Startup

The single stream slow start for a TCP Reno/Tahoe stack, assuming no losses, should take about 2*ceiling(log2(ideal_window_size))*RTT, which yields about 2 seconds for an RTT of 67msec and a window of 65000KBytes. The FAST TCP stack appears to take longer, closer to 8 seconds, as can be seen below from the iperf -i (interval) option output. After slow start the throughput remains fairly steady at between 840Mbits/s and 1Gbits/s. The aggregate throughput for the 60 seconds was 883Mbits/s. The aggregate throughput measured from 8 seconds after the start until the end (i.e. when the throughput is stable) was 938Mbits/s.
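The slow start estimate above can be reproduced as follows (taking the ideal window in MSS-sized segments, which is our reading of the formula):

awk 'BEGIN {
  rtt = 0.067                     # s, Sunnyvale to Chicago
  w   = 65000 * 1024 / 1500       # 65000 KByte window in 1500 Byte segments
  n   = int(log(w)/log(2)) + 1    # ceiling(log2(w))
  printf "estimated slow start time: %.1f s\n", 2 * n * rtt
}'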

Fair Share

To demonstrate how FAST TCP shares throughput among multiple streams, we plot the throughput/stream where the aggregate throughput is saturated. The figures below are for the measurements from SNV11 to GVA2. The stacked graph shows the iperf flow throughputs for the 512KByte window for 32, 40, 64, 90 and 120 streams, where, as seen from the figure above, the throughput is fairly saturated. Quick inspection shows that the flows share the throughput about equally. The error bar plot shows the average throughput/stream for all the measurements. The magenta crosses are for the measurements where the aggregate throughputs are > 400 Mbits/s. The error bars indicate the standard deviations (stdev). The 3rd graph shows the relative standard deviations (i.e. stdev/avg) for the per stream throughputs. The magenta squares are for aggregate throughputs of over 400 Mbits/s. The points with 0 stdev/avg are single stream measurements, where the stdev is zero. It can be seen that FAST TCP does a good job of sharing the throughput fairly among competing FAST TCP streams between the same source and destination.

bbcp

We ran bbcp in memory (/dev/zero) to memory (/dev/null) mode from SNV11 to GVA2 (192.91.239.2). We set the window size to 32768KB to match that nominally needed for an RTT of 182ms and a bandwidth of 1Gbits/s.

/home/cottrell/package/bbcp/bin/i386_linux24/bbcp -f -v -b  4 -t 80 -P 1 -s 1 -D

-w 32768k -T "ssh -l cottrell 192.91.239.2 /home/cottrell/package/bbcp/bin/i386_linux24/bbcp" /dev/zero cottrell@192.91.239.2:/dev/null

With such a window size, we were only able to achieve a throughput of about 15.2MBytes/s or 127Mbits/s. The client cpu load while running this (measured by the Unix top command) varied from 6-9%, so it does not appear to be a cpu starvation problem. We observed that, though we specified a window of 32768KB, bbcp set the window back to 2MBytes:

bbcp_CTL: Sending to 192.91.239.2: -b 4 -D -f -m 644 -P 1 -s 1 -t 80 -v -W 2096128 -Y 2e565f3e -H none:0

which Linux then increased to 4MB:

bbcp_SNK 6144: Window size set to 2096128 (actual snd=4192256 rcv=4192256)

The problem may be caused by the fact that "disk buffers are tied to window size buffers, you can quickly spiral out of control and kill the whole system" (Andy Hanushevsky). We tried setting a window of 2048KB with a single stream and achieved 36,205KBytes/s (~304Mbits/s) over 80 seconds.

bbcp -f -v -b 4 -t 80 -P 1 -s 16 -D -w 2048k -T "ssh -l cottrell 192.91.236.2 /home/cottrell/package/bbcp/bin/i386_linux24/bbcp" /dev/zero cottrell@192.91.236.2:/dev/null

bbcp_SNK 9038: Window size set to 2096128 (actual snd=4192256 rcv=4192256)
bbcp: Source I/O buffers (18423K) > 25% of available free memory (16216K); copy may be slow

However, when we tried setting the window to 2048k and using 16 streams (the command shown above), bbcp failed to complete.

After further discussions with Andy Hanushevsky, he identified a problem with using -w to try to set window sizes > 2MBytes and suggested using the -W option instead. This allowed larger windows to be set. With 32768KByte windows, a single stream, a txqueuelen of 100 and stock TCP, we achieved the following bbcp throughputs from SNV13 aka cit-slac13 (198.51.111.58) to SNV17 aka cit-slac17 (198.51.111.78):

Mode              From                          To                   bbcp throughput MBytes/s (Mbits/s)
Memory-to-memory  SNV13:/dev/zero               SNV17:/dev/null      104.2 (833.6)
Disk-to-memory    SNV13:/raid/dummy.2000000000  SNV17:/dev/null      72.3 (578.4)
Disk-to-disk      SNV13:/raid/dummy.2000000000  SNV17:/raid/bbcpdat  62.6 (500.8)
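The memory-to-memory transfer above could be made with a command along the following lines (a sketch based on the earlier bbcp examples, using -W rather than -w to request the large window; the exact options used may have differed):

bbcp -f -v -b 4 -t 80 -P 1 -s 1 -W 32768k \
     -T "ssh -l cottrell 198.51.111.78 /home/cottrell/package/bbcp/bin/i386_linux24/bbcp" \
     /dev/zero cottrell@198.51.111.78:/dev/null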

Link Losses

To investigate the loss performance of the link we ran 64 Byte pings at 1 second intervals from SNV11 to GVA2. To first order, pings should indicate the non-congestion loss on the link. We ran ~60K pings starting at 18:12:49 on December 19, 2002. The overall loss rate was 0.55% (326 packets lost of 59465 sent). They were lost in a burst of 154 sequential packets (i.e. an outage of 154 seconds) ending at sequence number 26361, a burst of 157 packets ending at sequence number 33454, a burst of 2 packets ending at sequence number 12212, plus 13 single packet losses. Two of the single packet losses came within 10 sequence numbers of one another or of another burst (i.e. within roughly the time it takes FAST to climb back up to full throughput after a loss). This suggests that the losses are very bursty: there appear to be 14 bursts of losses separated by 10 seconds or more in a time period of 60,000 seconds, a burst loss rate of 0.02% (2.2 in 10,000), or a Bit Error Rate (BER), assuming 1500 Byte MTUs, of 2 in 10^8. If we assume the single packet losses are caused by congestion, then the burst loss rate is 2 in 60,000, or a BER of 3 in 10^9. Possibly the shorter bursts are caused by congestion at the routers, for example caused by iperf tests. The sequence number at the end of each burst, the length of each burst and the separation between bursts can be seen in the table below.
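The BER arithmetic above is simply:

awk 'BEGIN {
  sent = 59465; bits = 1500 * 8               # pings sent; bits per 1500 Byte packet
  printf "all 14 bursts:     BER ~ %.1e\n", 14 / sent / bits
  printf "2 non-congestion:  BER ~ %.1e\n",  2 / sent / bits
}'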

The losses from SNV11 to CHI1 for 60000 64 Byte pings at 1 second intervals, starting on Friday Dec. 20 at 12:54:34 2002 PST, indicate a loss rate of 0.035% (21 pings lost out of 60000) and that the losses are non-bursty; all the losses were of single pings.

SNV11 to GVA2:
Seq    Burst Loss    Burst Sep (sec)
1652 1 109
1757 1 105
5090 1 3333
12212 2 7122
26361 154 14149
27936 1 1575
28295 1 359
28733 1 438
29616 1 883
32046 1 2430
33454 157 1408
44276 1 10822
44278 1 2
48468 1 4190
48492 1 24
54929 1 6437
SNV11 to CHI1:
Burst Loss    Seq    Burst Sep (sec)
1 722 722
1 5399 4677
1 5754 355
1 6094 340
1 6129 35
1 8989 2860
1 13182 4193
1 19548 6366
1 23359 3811
1 30525 7166
1 30837 312
1 33702 2865
1 34135 433
1 34338 203
1 37242 2904
1 37517 275
1 44324 6807
1 54906 10582
1 54923 17
1 55620 697
1 58809 3189
 

File transfers

Four hosts at Sunnyvale (198.51.111.58, 62, 66, 70, 74, 78, 82; note there are 2 NICs per host, so 2 addresses/host) were set up with dual Pentium 4 2.4GHz cpus and dual Gbit/s NICs, plus 8 * 120GB disks in each ATA RAID array and 2 such arrays per server. Similar setups were available at Geneva and Chicago. Host 198.51.111.74 was set up with the FAST TCP stack. Jumbo frames were configured on 192.91.236.10 (at Chicago) and on 198.51.111.58 (at Sunnyvale) using # ifconfig ethx mtu 9000. See 3ware RAID arrays tests with Linux 2.4.19 on P4DPE with twin 2.4 GHz CPUs for information on the performance of the RAID arrays.

Sylvain Ravot reported:

"Without any tuning, I could get 350 Mbit/s with 8 streams using iperf between Chicago and Sunnyvale. I could get 700 Mbit/s with standard MTU and 8 streams by increasing txqueuelen (transmit queue of the NIC).
#ifconfig ethx txqueuelen 4000
With Jumbo Frame I could saturate the link using 8 streams."

Local file transfers

We set up IEPM/BW monitoring to make hourly measurements between SNV74 and SNV58. Typical bbftp disk to disk 2GB transfers with 1 stream and a 32768KByte requested window between two disk server hosts at Sunnyvale (from 198.51.111.74 (running FAST) to 198.51.111.58) consistently attained about 70MBytes/s (~550Mbits/s) and take about 56% of the cpu (i.e. 56% ~ (21.84+0.11)/39.23 = (sys+user)/real_time from the Unix time command).

Below is a log from a transfer to substantiate the above:

#BBFTP(12/28/2002 00:58:23 1041065903) - ssh -f cottrell@198.51.111.58 rm -f /raid/bbcpdat/bbftpdat 2>&1
#BBFTP(12/28/2002 00:58:23 1041065903) - CMD: /usr/bin/time -p /usr/local/bin/bbftp -r 1 -V -t -p 1 -L "s h " -E "/home/cottrell/bin/bbftpd -s -m 40" -e " setrecvwinsize 32768; setsendwinsize 32768;put /raid/temp/dummy.2000000000 /raid/bbcpdat/bbftpdat" -u cottrell 198.51.111.58 2>&1
#BBFTP(12/28/2002 00:58:23 1041065903) - Sat Dec 28 00:58:43 (PST) : >> COMMAND : setremotecos 0
#BBFTP(12/28/2002 00:58:23 1041065903) - Sat Dec 28 00:58:43 (PST) : << OK : COS set
#BBFTP(12/28/2002 00:58:23 1041065903) - Sat Dec 28 00:58:43 (PST) : >> COMMAND : setrecvwinsize 32768
#BBFTP(12/28/2002 00:58:23 1041065903) - Sat Dec 28 00:58:43 (PST) : << OK
#BBFTP(12/28/2002 00:58:23 1041065903) - Sat Dec 28 00:58:43 (PST) : >> COMMAND : setsendwinsize 32768
#BBFTP(12/28/2002 00:58:23 1041065903) - Sat Dec 28 00:58:43 (PST) : << OK
#BBFTP(12/28/2002 00:58:23 1041065903) - Sat Dec 28 00:58:43 (PST) : >> COMMAND : put /raid/temp/dummy.20\ 00000000 /raid/bbcpdat/bbftpdat
#BBFTP(12/28/2002 00:58:23 1041065903) - Sat Dec 28 00:59:11 (PST) : << OK
#BBFTP(12/28/2002 00:58:23 1041065903) - Sat Dec 28 00:59:11 (PST) : 2048000000 bytes send in 27.9 secs (\ 7.17e+04 Kbytes/sec or 560 Mbits/s)
#BBFTP(12/28/2002 00:58:23 1041065903) - real 39.23
#BBFTP(12/28/2002 00:58:23 1041065903) - user 0.11
#BBFTP(12/28/2002 00:58:23 1041065903) - sys 21.84
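The cpu fraction quoted above follows directly from the user and sys times in this log:

awk 'BEGIN { printf "cpu fraction = (%.2f + %.2f)/%.2f = %.0f%%\n", 21.84, 0.11, 39.23, 100*(21.84 + 0.11)/39.23 }'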

Below are extracted results to show the consistency of the results and also the throughputs measured for iperf:

#date       time     pingloss     iperf   bbcpmem  bbcpdisk     bbftp    pingAverage
12/28/2002 00:49:58         0    864828  471773.6    332440 560000.00         0
12/28/2002 01:49:13         0    864536  502606.4  333952.8 566000.00         0
12/28/2002 02:52:07         0    863935    441564  331125.6 547000.00         0
12/28/2002 03:55:28         0    865657  520597.6  319654.4 530000.00         0
12/28/2002 04:55:16         0    864924  505457.6  344523.2 520000.00         0
12/28/2002 05:54:21         0    865480  511899.2  327285.6 534000.00         0
12/28/2002 06:52:00         0    864459  367658.4    332876 534000.00         0
12/28/2002 07:50:43         0    864164  516190.4  342773.6 550000.00         0
12/28/2002 08:53:22         0    865493  399200.8  336049.6 560000.00         0
12/28/2002 09:52:35         0    863321    474264  335219.2 547000.00         0
12/28/2002 10:54:09         0    862472  450069.6  340237.6 562000.00         0
12/28/2002 11:47:44         0    865408  433470.4  343923.2 554000.00         0
12/28/2002 12:51:20         0    865140  478308.8  357270.4 530000.00         0
12/28/2002 13:52:12         0    865377  457005.6  348348.8 557000.00         0
Some initial measurements of throughput vs. CPU utilization are shown below.

SNV-CHI file transfers

Using bbftp to transfer a 2GByte file from 198.51.111.66 (cit-slac14, eth0 1GE interface) to 192.91.236.10 (v10chi, eth2 1GE interface) using the FAST TCP stack, we tried varying txqueuelen and MTU. For MTU 1500 we found little change in throughput with txqueuelens of 100, 1000 and 10000 (typical throughputs 240-280Mbits/s). We then set txqueuelen to 100 and varied the MTU with values of 1500, 3000, 5000 and 9000. We achieved bbftp file transfer throughputs of:
MTU=1500: 272, 280 Mbits/s
MTU=3000: 277, 302, 333, 228 Mbits/s
MTU=5000: 342, 399, 321, 337 Mbits/s
MTU=9000: 350, 342, 356, 337 Mbits/s

To ensure that the throughput was not limited by TCP or lower levels, we also measured TCP throughput with iperf with a 32MByte requested window, a txqueuelen of 100, and an MTU of 9000 Bytes. We were able to achieve 570 Mbits/s after 5.2 secs and 990 Mbits/s after 10.2 secs, and were also able to confirm that we were using jumbo frames.

We also remeasured the bbftp local performance from 198.51.111.66 to 198.51.111.58 (cit-slac13) with a 32MByte window, MTU=9000 and txqueuelen=100, and got 364 and 339 Mbits/s.

We thus believe the bbftp performance of 270-360 Mbits/s was not limited by the underlying TCP network performance.

Footnotes

1. The initial slow-start was designed to be very slow so that it is stable and does not overshoot too much when flows start in a dynamic scenario. Large overshoot can cause massive losses (thousands of packets) at such large windows, and we try hard to prevent such losses at the expense of a very slow slow-start, which is alright for huge files but bad for small files. Our newer version (which Cheng and David are working on now) should have a better balance. Steven Low 1/11/02. The estimate of 80 seconds for a measurement duration was based on the classic slow start algorithm, and should be increased to between 100-150 seconds for the FAST TCP stack for the links from Sunnyvale to CERN and Chicago.


Comments to iepm-l@slac.stanford.edu