TCP Stack Measurements on Lightly Loaded Testbeds
Les Cottrell. Created 16 Dec '02, last update 15 February '03
Steven Low and his group at Caltech have developed a new FAST TCP stack that improves performance on high speed long RTT links. Tom Kelly of CERN has developed a Scalable TCP stack. Sally Floyd has proposed a High Speed TCP (HS TCP), which has been implemented by the web100/net100 team. FAST TCP is based on Vegas and uses the RTT to indicate congestion. The latter two are based on Reno for congestion recognition and modify Reno's additive increase and multiplicative decrease congestion strategies (Scalable TCP uses an exponential increase, while HS TCP uses a table to determine how much to increase the congestion window when an ACK is received). All of these stacks need only be implemented at the sender. We report here on measurements made with all three of these stacks as well as the stock TCP stack. We also report related measurements with jumbo frames, and with varying the transmit queue length of the network device (txqueuelen).
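As a rough illustration of the difference in congestion control rules, the sketch below (Python) shows simplified per-ACK increase and per-loss decrease updates for Reno, Scalable TCP and HS TCP. The parameter values and function names are illustrative only and are not taken from the actual kernel patches; FAST is omitted since it adjusts its window from RTT measurements rather than from losses.

# Simplified congestion window update rules (cwnd in segments).
# Parameter values are illustrative, not the exact values in the kernel patches.
def reno_ack(cwnd):      return cwnd + 1.0 / cwnd        # additive increase, ~1 segment per RTT
def reno_loss(cwnd):     return cwnd * 0.5               # multiplicative decrease
def scalable_ack(cwnd):  return cwnd + 0.01              # fixed gain per ACK => exponential growth
def scalable_loss(cwnd): return cwnd * (1.0 - 0.125)     # gentler decrease than Reno
def hstcp_ack(cwnd, a):  return cwnd + a(cwnd) / cwnd    # a(cwnd) looked up from the HS TCP table
def hstcp_loss(cwnd, b): return cwnd * (1.0 - b(cwnd))   # b(cwnd) also from the table, < 0.5 at large cwnd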
These measurements were made using the DataGrid/Caltech/SLAC testbed. This consists of fast 2.4GHz Linux hosts with GE Network Interface Cards (NICs) located at CERN Geneva (GVA), StarLight Chicago (CHI), and at the Level(3) gateway in Sunnyvale California (SNV). At a later stage we also had 5 hosts (2 at SNV, 2 at CHI and 1 at GVA) with Intel 10GE NICs (see 10GE End-to-end TCP tests). The link between Sunnyvale and Chicago is a Level(3) provided OC192/POS (10Gbits/s) link. The link between CHI and GVA is an OC48 2.5Gbits/s link. The testbed was set up for SC2002 and more details can be found in "Extreme Bandwidth": SC2002 Bandwidth Challenge Proposal.
The FAST 10GE Experiment is the planning page for the Jan 2003 experiment, and will be continually updated. In some cases we reserved hosts at CERN for exclusive use, using the DataTag reservation form available to DataTag users following the rules. Where possible we tried to avoid reserving hosts, so some of our results were affected by cross-traffic.
[cottrell@cit-slac11 ~]$ bin/setup
/proc/sys/net/ipv4/tcp_mem = 4096 67108864 67108864
/proc/sys/net/ipv4/tcp_rmem = 4096 67108864 67108864
/proc/sys/net/ipv4/tcp_wmem = 4096 67108864 67108864
/proc/sys/net/core/wmem_max = 67108864
/proc/sys/net/core/rmem_max = 67108864
/proc/sys/net/ipv4/tcp_vegas_cong_avoid = 1
/proc/sys/net/ipv4/tcp_vegas_fast_converge = 1
/proc/sys/net/ipv4/tcp_vegas_alpha = 400
/proc/sys/net/ipv4/tcp_vegas_beta = 250
/proc/sys/net/ipv4/tcp_vegas_gamma = 150
MTU = 1500
txqueuelen = 100
uname -a = Linux cit-slac12.caltech.edu 2.4.18-3combined #13 SMP Mon Nov 18 11:58:38 PST 2002 i686 unknown
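For reference, a minimal sketch (Python, run as root) of how such settings can be applied programmatically by writing to the /proc entries listed above; the tcp_vegas_* entries exist only on hosts running the FAST-patched kernel.

# Apply the TCP buffer and FAST/Vegas settings listed above by writing to /proc.
settings = {
    "/proc/sys/net/ipv4/tcp_rmem": "4096 67108864 67108864",
    "/proc/sys/net/ipv4/tcp_wmem": "4096 67108864 67108864",
    "/proc/sys/net/core/rmem_max": "67108864",
    "/proc/sys/net/core/wmem_max": "67108864",
    "/proc/sys/net/ipv4/tcp_vegas_cong_avoid": "1",   # FAST/Vegas kernel only
}
for path, value in settings.items():
    with open(path, "w") as f:      # requires root
        f.write(value + "\n")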
The throughputs as a function of streams and windows are seen below. We use
log-log plots to make the small numbers of streams stand out more clearly. The
maximum number of streams was 120. The maximum throughput measured aggregated
over 60 seconds was 870Mbits/s (for one stream with a 16384KByte window; the next
two highest were also for a single stream: 838Mbits/s with a 65536KByte window
and 829Mbits/s with a 32768KByte window). The average MHz/Mbits/s was 1.58+-0.26.
We also measured the throughput from CHI1 (an identical host to CHI2, except it was running the standard TCP stack) to SNV11. Each measurement was for 50 seconds. The results are shown below. The maximum throughput was 486Mbits/s for 25 streams with a 4096KB window, second was 428Mbits/s for 120 streams and a 1024KB window, and third was 417Mbits/s with 12 streams and an 8192KByte window. The average MHz/Mbits/s was 1.58+-0.36, which agrees well with the cpu utilization for the standard TCP stack. It is also seen that the multi-stream performance is similar for the FAST and standard TCP stacks for large numbers of streams (>20), but the behavior for small numbers of streams, in particular 1, is much improved (almost doubled for a single stream) by FAST.
We made measurements from SNV11 to GVA2. Here the round trip was about 182ms. The iperf TCP throughput using the
FAST stack is shown below for various windows and streams. The maximum
throughput measured was 855Mbits/s for one stream with a 65536KByte window. It
is seen that the throughput begins to saturate at above 400Mbits/s or about 50%
of the maximum achievable.
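The saturation is roughly what the bandwidth-delay product would suggest. A minimal sketch of the arithmetic (Python), assuming a 1Gbits/s bottleneck and the 182ms RTT quoted above:

# Window needed to fill the path: bandwidth-delay product (BDP) = bandwidth * RTT.
bandwidth_bps = 1e9                              # GE line rate
rtt_s = 0.182                                    # SNV to GVA round trip time
bdp_bytes = bandwidth_bps * rtt_s / 8
print("BDP ~ %.0f MBytes" % (bdp_bytes / 2**20))                    # ~22 MBytes
# Conversely, the ceiling for a single window-limited stream: window / RTT.
window_bytes = 8192 * 1024                       # e.g. an 8192KByte window
print("ceiling ~ %.0f Mbits/s" % (window_bytes * 8 / rtt_s / 1e6))  # ~369 Mbits/s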
We made 80 second iperf TCP measurements between various hosts with the following
FAST parameters:
cat /proc/sys/net/ipv4/tcp_mem = 4096 67108864 67108864
cat /proc/sys/net/ipv4/tcp_rmem = 4096 67108864 67108864
cat /proc/sys/net/ipv4/tcp_wmem = 4096 67108864 67108864
cat /proc/sys/net/core/wmem_max = 67108864
cat /proc/sys/net/core/rmem_max = 67108864
cat /proc/sys/net/ipv4/tcp_vegas_cong_avoid = 1
cat /proc/sys/net/ipv4/tcp_vegas_fast_converge = 1
cat /proc/sys/net/ipv4/tcp_vegas_alpha = 400
cat /proc/sys/net/ipv4/tcp_vegas_beta = 250
cat /proc/sys/net/ipv4/tcp_vegas_gamma = 150
uname -a = Linux cit-slac11.caltech.edu 2.4.18-3combined #13 SMP Mon Nov 18 11:58:38 PST 2002 i686 unknown
We also made bbftp measurements for a 2GByte file transfer.
For all of these measurements the window size was set to 32768KBytes unless otherwise stated, a single stream was used, and the MTU was 1500Bytes. The numbers in the columns are Mbits/s; an annotation of (j) indicates a measurement using jumbo frames. The numbers in parentheses are the maximum window size configured. The servers marked Disk have 2 TBytes of RAID disk space. The bbcp application at this time could only accept a maximum window request of 2 MBytes, so the long-distance measurements between Sunnyvale and Chicago or CERN performed poorly since the window size was inadequate.
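Why the 2 MByte cap hurts the long-distance bbcp transfers can be seen from the window/RTT ceiling on a single stream; a small sketch (Python) using the testbed RTTs:

# Throughput ceiling of one window-limited TCP stream: window / RTT.
window_bytes = 2 * 2**20                              # bbcp's 2 MByte maximum at the time
for path, rtt_s in (("SNV-CHI", 0.067), ("SNV-GVA", 0.182)):
    print("%s: <= %.0f Mbits/s" % (path, window_bytes * 8 / rtt_s / 1e6))
# SNV-CHI ~250 Mbits/s, SNV-GVA ~92 Mbits/s, far below the GE line rate.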
Server | CPU | CPU | Disk | Disk | CPU | CPU | CPU | Disk | CPU | Disk | Disk | Disk | CPU |
TCP Stack | FAST | Std | Std | FAST | Std | Std | Std | Std jumbo | Std | Std | Std | Std | Std |
Receiver > | SNV11 (67MB) | SNV2 (67MB) | SNV13 (32MB) | SNV17 (67MB) | CHI1 (32MB) | CHI2 (65MB) | CHI3 (32MB) | CHI10 (32MB) | GVA1 (32MB) | GVA2 (32MB) | GVA3 (32MB) | GVA4 (32MB) | NIK16 (24MB) |
Server | TCP stack | Sender V | iperf throughputs (Mbits/s) |
CPU | FAST | SNV11 (67MB) | 900 | 809 | 890 | 411 | 855 | 872 |
Disk | FAST | SNV17 (67MB) | 860 | 760 | 725 | 450+-25, 209+-60(j) | 790, 939(j) | 528 | 840 |
CPU | Scalable | SNV10 (64MB) | 922 |
CPU | Std | SNV2 (67MB) | 928 | 914 | 800(j) |
Disk | Std jumbo | SNV13 (32MB) | 338 | 34(j) | 200 |
CPU | Std | GVA2 (32MB) | 73 | 74 | 26 |
CPU | Std | CHI3 (32MB) | 200, 900(j) | 950(j) |
CPU | Std | NIK16 (24MB) | 200 |
Server | TCP stack | Sender V | bbftp throughputs (Mbits/s) |
CPU | FAST | SNV11 (67MB) | 156 | 219 |
Disk | FAST | SNV17 (67MB) | 608 | 157 | 376 | 204 |
Disk | Std jumbo | SNV13 (32MB) | 383 | 107 |
Server | TCP stack | Sender V | bbcp memory-to-memory throughputs (Mbits/s) |
CPU | FAST | SNV11 (67MB) | 120 |
Disk | FAST | SNV17 (67MB) | 826 | 128 |
Disk | Std jumbo | SNV13 (32MB) | 362 |
Further study of the initial poor iperf performance (449Mbits/s) between SNV11 and GVA2 measured on 1/12/2003 indicates that it was due to a very slow start together with heavy (>> 1%) loss measured by pings. The losses occurred only during the iperf measurement. It may also be relevant that the Ethernet interface on GVA2 reported 5 receive interface errors. These errors appear to occur at a rate of about 1/minute regardless of whether iperf is running. Similar results are seen from SNV17 to GVA3. SNV17 to GVA2, on the other hand, stabilizes at about 840Mbits/s within 15 seconds, hence its throughput is much higher. The poor performance between SNV11 and GVA2 appeared to be transient, and on 1/14/2003 we achieved 855Mbits/s iperf throughput. Further studies of the behavior of the congestion window (cwnd), instantaneous throughput and RTT can be found at http://www.cs.caltech.edu/~chengjin/les/.
Col, Row | Date, time PST | Stack | txqueuelen Packets | MTU Bytes | Avg throughput in 1000s (Mbits/s) | Avg throughput in first 80s (Mbits/s) | Notes |
Jumbo | Feb 20 '03 17:00 | HS | 100 | 9000 | 934+-49 | 901+-163 | Reached 900Mbits/s after 5s |
2,4 | Feb 20 '03 16:00 | HS | 100 | 1500 | 913+-69 | 881+-175 | Reached 900Mbits/s after 10s |
1,3 | Feb 12 '03 21:00 | Scalable | 5000 | 1500 | 838+-101 | 795+-178 | |
3,3 | Feb 9 '03, 11:27 | Stock | 10000 | 1500 | 551+-48 | 480+-10 | |
3,1a | Feb 9 '03, 11:08 | Stock | 100 | 1500 | 128+-46 | 56+-6 | |
3,4b | Feb 9 '03, 13:06 | Stock | 1000 | 9000 | 625+-259 | 194+-23 | |
3,2 | Feb 9 '03, 11:53 | Stock | 1000 | 1500 | 94+-4 | 167+-46 | |
3,4 | Feb 9 '03, 12:38 | Stock | 100 | 9000 | 629+-259 | 210+-23 | |
2,2a | Feb 9 '03, 11:02 | FAST | 100 | 1500 | 764+-247 | 333+-146 | Reached 900Mbits/s after 220s |
1,4a | Feb 9 '03, 9:38 | Scalable | 10000 | 1500 | 881+-109 | 914+-58 | Reached 900Mbits/s after 20s |
2,1 | Feb 9 '03, 7:42 | FAST | 100 | 1500 | 128+-46 | 56+-7 | Reached 900Mbits/s after 505s |
2,2b | Feb 8 '03, 21:54 | FAST | 100 | 1500 | 763+-247 | 524+-208 | 200s measurement, reached 900Mbits/s after 85s (magenta) |
1,1a | Feb 8 '03, 22:37 | Scalable | 100 | 1500 | 551+-234 | 387+-274 | |
1,2 | Feb 8 '03, 20:01 | Scalable | 2000 | 1500 | 669+-187 | 538+-217 | |
1,1c | Feb 12 '03 | Scalable | 100 | 1500 | 571+-237 | 432+-274 | Reached 700Mbits/s after 50 s (yellow) |
1,4b | Feb 12 '03 | Scalable | 10000 | 1500 | 919+-48 | 892+-452 | Reached 900 Mbits/s after 10s (magenta) |
3,1b | Feb 13 '03 9:10 | FAST | 100 | 1500 | 919+-50 | 892+-133 | Reached 900 Mbits/s after 100s (green) |
1,1b | Feb 13 '03 9:15 | Scalable | 100 | 1500 | 464+-253 | 248+-243 | Reached 700 Mbits/s after 90s (magenta) |
Also see Floyd's TCP slow-start and AIMD mods by Tom Dunigan for a comparison of how HS TCP behaves relative to stock TCP.
Comparing with SNV11 to CHI2 (FAST stack, standard MTU), it is seen that in the linear domain (i.e. where the throughput increases linearly on the plot for SNV11 to CHI2) the throughputs are almost identical, apart from the 64KB window size (which is currently not understood). For large numbers of streams, jumbo frames outperform the FAST stack.
In theory, FAST works better than standard TCP (Reno) when the bandwidth (in packets per second) is high. Reno cannot perform well as the packet rate increases. FAST is scalable for both low and high packet rates.
Jumbo frame Reno is better than 1500Byte MTU Reno since the packet per second rate is reduced to about a sixth by jumbo frames, alleviating Reno's problem at high packet rates.
For the multiple connection case, the packet rate is much smaller for each connection, so FAST has no advantage (similarly, on a 10Mbps link FAST cannot show an advantage over Reno), but jumbo frames retain their own advantages of fewer interrupts and higher payload per packet. So jumbo frame Reno is better than FAST. (I expect FAST with jumbo frames to perform similarly to Reno with jumbo frames.) Xiaoliang Wei, Caltech FAST team.
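To put the packet rate argument in numbers, a small sketch (Python) of the packet rates needed to fill a GE link at the two MTUs:

# Packets per second needed to fill 1 Gbit/s at a given MTU.
line_rate_bps = 1e9
for mtu_bytes in (1500, 9000):
    print("MTU %4d: ~%6.0f packets/s" % (mtu_bytes, line_rate_bps / (mtu_bytes * 8)))
# ~83000 packets/s at 1500 Bytes vs ~14000 packets/s at 9000 Bytes, i.e. roughly a
# sixth as many packets (and interrupts and ACKs) for the sender and receiver to handle.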
Comparing against CHI1 to SNV11 (standard stack and MTU) again the behavior in the linear region is almost identical. Saturation sets in around 400Mbits/s for the standard stack with standard MTU whereas for jumbo frames saturation sets in close to 1Gbits/s.
We also made measurements between Amsterdam/NIKHEF (145.146.97.16) and Chicago (CHI3). Both hosts were enabled for jumbo frames and were running stock TCP. The NIKHEF host (2*2.4GHz Linux PC) had txqueuelen set to 1000 (packets), and CHI3 (2*2.2GHz Linux PC) was set to 2500. The max TCP window was set to 32MBytes, and iperf used 1 stream. No errors were observed in the NICs. The routes were asymmetric, and jumbo frames were only enabled in one direction (CHI3 to NIK16). The RTT was about 129ms:
This illustrates that jumbo frames for a single stream with stock TCP between 2 identical hosts with an RTT of 128ms can improve performance by a factor of 5 compared to a 1500Byte MTU.
From Sunnyvale to CERN we set up SNV1 (198.51.111.10) and GVA3 (192.91.239.3) with jumbo frames, txqueuelen=1000, running stock TCP. We sent TCP data from SNV1 to GVA3 using iperf with a 32MByte window. The results indicate that in the first 5 seconds we achieved about 400Mbits/s, at 80 seconds it reached about 800Mbits/s and after 240 seconds it reached about 990Mbits/s. The aggregate throughput reached after 1000 seconds was 972Mbits/s. It is apparent that for this RTT (181ms) with jumbo frames one needs to run for a considerable time (240 seconds) to reach the optimum performance. The figure below illustrates the additive increase of stock TCP with jumbo frames (the throughput grows roughly linearly as 0.5*time*MTU/RTT^2, where the factor 0.5 accounts for the delayed ACKs).
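As a rough cross-check of the ramp rate, a sketch (Python) of the additive increase formula above; the real transfer also includes slow start and any losses, so this is only indicative:

# Stock TCP congestion avoidance: throughput grows roughly linearly at 0.5*MTU/RTT^2
# per second (the factor 0.5 accounting for delayed ACKs).
mtu_bits = 9000 * 8
rtt_s = 0.181
ramp = 0.5 * mtu_bits / rtt_s**2                       # bits/s gained per second
print("ramp ~ %.1f Mbits/s per second" % (ramp / 1e6))                  # ~1.1
print("800 -> 990 Mbits/s takes ~%.0f s" % ((990e6 - 800e6) / ramp))    # ~170 s, close to the ~160 s observed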
For stock TCP the maximum single stream throughput for MTU = 9000 Bytes exceeded that for MTU = 1500 Bytes by almost a factor of 5 (200 Mbits/s for MTU=1500Bytes vs 967Mbits/s with MTU=9000Bytes).
We made measurements with iperf/TCP from SNV to GVA1 with both server and client configured for MTU = 9000 Bytes and txqueuelen = 100. The results are shown below. Comparing these figures with those in the section on Comparing TCP Stacks, it can be seen that jumbo frames help significantly in improving throughput for all stacks evaluated.
For this unloaded path, with both FAST and HS TCP and with txqueuelen = 100, we were able to achieve > 900Mbits/s within 10 seconds. We also tried other values of txqueuelen for Scalable TCP to see how it affected the overall throughput and stability, but for shorter durations. The average throughputs observed after 5, 20, 40, 80 and 400 seconds are seen in the table below. It can be seen that a larger txqueuelen results in larger throughputs for Scalable TCP with MTU = 9000 Bytes.
txqueuelen | Time to reach 800Mbits/s | Time to reach 900Mbits/s | Average throughput after 5 seconds | Average throughput after 400s (Mbits/s) | Average throughput after 80s (Mbits/s) |
2000 | 5s | 5s | 657 Mbits/s | 982+-38 | 966+-81 |
1000 | 15s | 15s | 500 Mbits/s | 901+-163 | 840+-247 |
500 | 20s | 20s | 380 Mbits/s | 844+-145 | 814+-156 |
200 | 25s | 40s | 291 Mbits/s | 798+-125 | 741+-186 |
100 | 20s | 105s | 147 Mbits/s | 774+-128 | 715+-195 |
The behavior of the throughput with txqueuelen is plotted below. It can be seen that there is little growth in the average throughput after 80 seconds. Also the points fit well (R^2 > 0.9) to logarithmic curves. The curves shown are fits of the form f(t) = a*ln(t) + b with the parameters shown in the table below. The throughput at 5s is mainly dominated by slow start.
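A minimal sketch (Python with numpy) of fitting the f(t) = a*ln(t) + b form to the txqueuelen = 100 row of the table above; the fitted parameters here are purely illustrative and will differ slightly from those quoted.

import numpy as np

t = np.array([5.0, 80.0, 400.0])         # seconds (txqueuelen = 100 row above)
y = np.array([147.0, 715.0, 774.0])      # average throughput, Mbits/s

a, b = np.polyfit(np.log(t), y, 1)       # least squares fit of y = a*ln(t) + b
resid = y - (a * np.log(t) + b)
r2 = 1.0 - resid.var() / y.var()
print("a=%.0f b=%.0f R^2=%.2f" % (a, b, r2))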
Col, Row | Date, time PST | Stack | txqueuelen | MTU Bytes | Avg throughput in 1000s (Mbits/s) | Avg throughput in 80s (Mbits/s) | Comments |
1,1 | Feb 8 '03, 09:30 | FAST | 100 | 8192 | 461+-241 | 447+-194 | |
2,1 | Feb 18 '03, 10:54 | Scale | 100 | 8192 | 387+-68 | 397+-82 | |
2,2 | Feb 18 '03, 16:04 | Scale | 500 | 8192 | 507+-140 | 517+-155 | |
2,3 | Feb 16 '03, 16:27 | Scale | 1000 | 8192 | 530+-161 | 568+-164 | |
2,4 | Feb 18 '03, 10:11 | Scale | 2000 | 8192 | 622+-146 | 684+-179 | |
2,5 | Feb 18 '03, 16:56 | Scale | 2000 | 8192 | 682+-133 | 644+-145 | |
1,2 | Feb 21 '03, 11:25 | HS | 100 | 8192 | 303+-123 | 180+-137 | |
1,3 | Feb 21 '03, 17:12 | HS | 500 | 8192 | 290+-142 | 289+-138 | |
1,4 | Feb 21 '03, 10:50 | HS | 2000 | 8192 | 334+-207 | 323+-146 | |
1,5 | Feb 21 '03, 15:50 | HS | 10000 | 8192 | 371+-239 | 292+-175 | |
3,1a | Feb 18 '03, 13:21 | Stock | 100 | 8192 | 438+-53 | 237+-26 | |
3,2 | Feb 21 '03 | Stock | 100 | 8192 | 318+-51 | 248+-28 | |
3,1b | Feb 18 '03, 10:15 | Stock | 1000 | 8192 | 502+-101 | 740+-116 |
We manually observed the cpu utilization of the iperf server using the Unix
top command and noted down its values together
with the throughputs recorded using the iperf -i
incremental recording option. The MHz/Mbps was 1.6 +- 0.2. A plot of the server
cpu utilization for SNV11 (running FAST) to GVA4 (stock TCP) with 1500Byte MTU
is shown below. We repeated the measurement with the iperf client sending data
from SNV10 running the Scalable TCP stack and from SNV11 running the FAST TCP
stack to an iperf server at GVA4 with
similar server utilization results. There was however, a big difference when
using jumbo (MTU=9000Bytes). The server cpu utilization for MTU=9000 Bytes is
about a factor of 3 less than for MTU=1500Bytes, or more quantitatively
0.59+-0.1 compared to 1.6+-0.2.
To compare iperf client cpu utilization between standard MTUs (1500Bytes) and jumbo frames we used iperf/TCP to send data for 80 seconds from a FAST TCP host (SNV17) at Sunnyvale to a standard TCP stack host at Chicago (CHI2) and at Geneva (GVA2). We used txqueuelen=100, a single stream and varying window sizes to achieve different throughputs. The GVA MTU was set (using ifconfig eth0 mtu 9000) to 9000Bytes. The MTU at Sunnyvale was alternated between 1500 and 9000 Bytes. The results are shown below. It is seen that for the FAST TCP stack the iperf client cpu utilization is about a factor of 2 less for jumbo frames. For the stock TCP stack the difference in CPU utilization/Mbits/s between MTU=1500Bytes and 9000Bytes was fairly small (see the 2nd figure below) and was close to that for the FAST stack with a 9000Byte MTU.
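For reference, the MHz/Mbits/s figures quoted in this section come from combining the top readings with the concurrent iperf throughput; a minimal sketch (Python) with hypothetical utilization and throughput values:

# MHz per Mbits/s: CPU cycles consumed per unit of achieved throughput.
cpu_clock_mhz = 2400.0       # 2.4GHz test hosts
cpu_utilization = 0.60       # fraction of one CPU busy, as read from top (hypothetical value)
throughput_mbps = 900.0      # concurrent iperf throughput (hypothetical value)
print("%.1f MHz/Mbits/s" % (cpu_utilization * cpu_clock_mhz / throughput_mbps))   # ~1.6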
The
single stream slow start for a TCP Reno/Tahoe stack assuming no losses should
take about 2*ceiling(log2(ideal_window_size))*RTT which yields
about 2 seconds for an RTT of 67msec and window of 65000KBytes. The FAST TCP
stack appears to take longer, closer to 8 seconds, as can be seen from
the iperf -i (interval) option output below. After slow start the throughput appears
to remain fairly steady at between 840Mbits/s and 1Gbits/s. The aggregate
throughput for the 60 seconds was 883Mbits/s. The aggregate throughput measured
from 8 seconds after the start until the end (i.e. when the throughput is
stable) was 938Mbits/s.
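A small sketch (Python) of the slow start estimate used above, expressing the ideal window in segments:

import math

# Reno/Tahoe slow start with no loss: about 2 * ceil(log2(W)) * RTT, where W is the
# ideal window in segments and the factor 2 allows for delayed ACKs.
window_bytes = 65536 * 1024          # 65536KByte window
mss_bytes = 1500
rtt_s = 0.067                        # SNV to CHI round trip time
segments = window_bytes / mss_bytes
print("slow start ~ %.1f s" % (2 * math.ceil(math.log2(segments)) * rtt_s))   # ~2.1 s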
To demonstrate the way that FAST TCP shares throughput among multiple streams, we plot the throughput per stream where the aggregate throughput is saturated. The figures below are for the measurements from SNV11 to GVA2. The stacked graph shows the iperf flow throughputs for the 512KByte window for 32, 40, 64, 90 and 120 streams where, as seen from the figure above, the throughput is fairly saturated. Quick inspection shows that the flows share the throughput about equally. The error bar plot shows the average throughput per stream for all the measurements. The magenta crosses are for the measurements where the aggregate throughputs are > 400 Mbits/s. The error bars indicate the standard deviations (stdev). The 3rd graph shows the relative standard deviations (i.e. stdev/avg) for the per stream throughputs. The magenta squares are for aggregate throughputs of over 400 Mbits/s. The points with 0 stdev/avg are single stream measurements where the stdev is zero. It can be seen that FAST TCP does a good job of fairly sharing the throughput among competing FAST TCP streams between the same source and destination.
We ran bbcp in memory (/dev/zero) to memory (/dev/null) mode from SNV11 to
CHI2. We set the window size to 32768KB to match that nominally needed by an RTT
of 182ms and a bandwidth of 1Gbits/s.
/home/cottrell/package/bbcp/bin/i386_linux24/bbcp -f -v -b 4 -t 80 -P 1 -s 1 -D
With such a window size, we were only able to achieve a throughput of about 15.2MBytes/s or 127Mbits/s. The client cpu load while running this (measured by the Unix top command) varied from 6-9%, so it does not appear to be a cpu starvation problem. We observed that though we specified a window of 32768KB, bbcp set the window back to 2MBytes:
bbcp_CTL: Sending to 192.91.239.2: -b 4 -D -f -m 644 -P 1 -s 1 -t 80 -v -W 2096128 -Y 2e565f3e -H none:0
which then Linux increased to 4MB:
bbcp_SNK 6144: Window size set to 2096128 (actual snd=4192256 rcv=4192256)
The problem may be caused by "disk buffers are tied to window size buffers, you can quickly spiral out of control and kill the whole system." Andy Hanushevsky. Setting a window of 2048KB with a single stream achieved 36,205KBytes/s (~304Mbits/s) over 80 seconds. We also tried a 2048k window with 16 streams, however bbcp failed to complete:
bbcp -f -v -b 4 -t 80 -P 1 -s 16 -D -w 2048k -T "ssh -l cottrell 192.91.236.2 /home/cottrell/package/bbcp/bin/i386_linux24/bbcp" /dev/zero cottrell@192.91.236.2:/dev/null
bbcp_SNK 9038: Window size set to 2096128 (actual snd=4192256 rcv=4192256)
bbcp: Source I/O buffers (18423K) > 25% of available free memory (16216K); copy may be slow
After further discussions with Andy Hanushevsky, he identified a problem with using -w to try to set window sizes > 2MBytes and suggested using the -W option. This allowed larger windows to be set. With a 32768KByte window, a single stream, a txqueuelen of 100 and stock TCP, we achieved the following bbcp throughputs between SNV13 aka cit-slac13 (198.51.111.58) and SNV17 aka cit-slac17 (198.51.111.78):
Mode | From | To | BBCP Throughput MBytes/s (Mbits/s) |
---|---|---|---|
Memory-to-memory | SNV13:/dev/zero | SNV17:/dev/null | 104.2 (833.6) |
Disk-to-memory | SNV13:/raid/dummy.2000000000 | SNV17:/dev/null | 72.3 (578.4) |
Disk-to-disk | SNV13:/raid/dummy.2000000000 | SNV17:/raid/bbcpdat | 62.6 (500.8) |
To investigate the loss performance of the link we ran 64 Byte pings at 1 second intervals from SNV11 to GVA2. To first order, pings should indicate the non-congestion loss on the link. We ran ~60K pings starting at 18:12:49 on December 19, 2002. The overall loss rate was 0.55% (326 packets lost of 59465 sent). Most were lost in a burst of 154 sequential packets (i.e. an outage of 154 seconds) ending at sequence number 33454, and a burst of 157 packets ending at sequence number 26361, plus 2 packets ending at sequence 12212, plus 15 single packet losses. Of the single packet losses, 2 occurred within 10 sequence numbers of another lost packet or burst (i.e. within roughly the time it takes FAST to climb back up to full throughput after a loss). This suggests that the losses are very bursty: there appear to be 14 bursts of losses separated by 10 seconds or more in a period of 60,000 seconds, i.e. a burst loss rate of 0.02% or 2.2 in 10,000, or a Bit Error Rate (BER), assuming 1500Byte MTUs, of 2 in 10^8. If we assume the single packet losses are caused by congestion, then the burst loss rate is 2 in 60,000 or a BER of 3 in 10^9. Possibly the shorter bursts are caused by congestion at the routers, for example caused by iperf tests. The sequence number at the end of each burst, the length of each burst and the separation between bursts can be seen in the table below.
The losses from SNV11 to CHI1 for 60000 64 Byte pings at 1 second intervals starting on Friday Dec. 20 at 12:54:34 2002 PST indicate a loss rate of 0.035% (21 pings lost in 60000) and that the losses are non-bursty. All the losses were of a single ping.
[Table: burst end sequence number, burst length and separation between bursts, for SNV11 to GVA2 and SNV11 to CHI1]
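A small sketch (Python) of the burst loss and BER arithmetic above:

# Burst loss rate and implied bit error rate from the SNV11 to GVA2 ping data above.
packets_sent = 60000                  # ~60K one-per-second pings
loss_bursts = 14                      # losses separated by >= 10 s counted as one burst
burst_loss_rate = loss_bursts / float(packets_sent)
print("burst loss rate ~ %.1e per packet" % burst_loss_rate)        # ~2.3e-04, i.e. ~2 in 10^4
ber = burst_loss_rate / (1500 * 8)    # pessimistically one lost 1500 Byte packet per burst
print("BER ~ %.0e" % ber)                                           # ~2e-08, i.e. ~2 in 10^8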
Sylvain Ravot reported:
"Without any tuning, I could get 350 Mbit/s with 8 streams using iperf
between Chicago and Sunnyvale. I could get 700 Mbit/s with standard MTU and 8
streams by increasing txqueuelen (transmit queue of the NIC).
#ifconfig ethx txqueuelen 4000
With Jumbo Frame I could saturate the link using 8 streams."
We set up IEPM/BW monitoring to make hourly measurements between SNV74 and SNV58. Typical bbftp disk to disk 2GB transfers with 1 stream and a 32768KByte window requested between two disk server hosts at Sunnyvale (from 198.51.111.74 (running FAST) to 198.51.111.58) consistently attained about 70MBytes/s (~550Mbits/s) and take about 56% of the cpu (i.e. 56% ~ (21.84+0.11)/39.23 = (sys+user)/real from the Unix time command).
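The cpu fraction above comes directly from the time output in the log below; a one-line check (Python):

# Fraction of the cpu used by the bbftp client, from the Unix "time -p" output below.
sys_s, user_s, real_s = 21.84, 0.11, 39.23
print("cpu ~ %.0f%%" % (100 * (sys_s + user_s) / real_s))   # ~56%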
Below is a log from a transfer to substantiate the above:
#BBFTP(12/28/2002 00:58:23 1041065903) - ssh -f cottrell@198.51.111.58 rm -f
/raid/bbcpdat/bbftpdat 2>&1
#BBFTP(12/28/2002 00:58:23 1041065903) - CMD: /usr/bin/time -p /usr/local/bin/bbftp
-r 1 -V -t -p 1 -L "s h " -E "/home/cottrell/bin/bbftpd -s -m 40" -e "
setrecvwinsize 32768; setsendwinsize 32768;put /raid/temp/dummy.2000000000
/raid/bbcpdat/bbftpdat" -u cottrell 198.51.111.58 2>&1
#BBFTP(12/28/2002 00:58:23 1041065903) - Sat Dec 28 00:58:43 (PST) : >> COMMAND
: setremotecos 0
#BBFTP(12/28/2002 00:58:23 1041065903) - Sat Dec 28 00:58:43 (PST) : << OK : COS
set
#BBFTP(12/28/2002 00:58:23 1041065903) - Sat Dec 28 00:58:43 (PST) : >> COMMAND
: setrecvwinsize 32768
#BBFTP(12/28/2002 00:58:23 1041065903) - Sat Dec 28 00:58:43 (PST) : << OK
#BBFTP(12/28/2002 00:58:23 1041065903) - Sat Dec 28 00:58:43 (PST) : >> COMMAND
: setsendwinsize 32768
#BBFTP(12/28/2002 00:58:23 1041065903) - Sat Dec 28 00:58:43 (PST) : << OK
#BBFTP(12/28/2002 00:58:23 1041065903) - Sat Dec 28 00:58:43 (PST) : >> COMMAND
: put /raid/temp/dummy.2000000000 /raid/bbcpdat/bbftpdat
#BBFTP(12/28/2002 00:58:23 1041065903) - Sat Dec 28 00:59:11 (PST) : << OK
#BBFTP(12/28/2002 00:58:23 1041065903) - Sat Dec 28 00:59:11 (PST) : 2048000000
bytes send in 27.9 secs (7.17e+04 Kbytes/sec or 560 Mbits/s)
#BBFTP(12/28/2002 00:58:23 1041065903) - real 39.23
#BBFTP(12/28/2002 00:58:23 1041065903) - user 0.11
#BBFTP(12/28/2002 00:58:23 1041065903) - sys 21.84
Below are extracted results to show the consistency of the results and also the throughputs measured for iperf:
#date time pingloss iperf bbcpmem bbcpdisk bbftp pingAverage
12/28/2002 00:49:58 0 864828 471773.6 332440 560000.00 0
12/28/2002 01:49:13 0 864536 502606.4 333952.8 566000.00 0
12/28/2002 02:52:07 0 863935 441564 331125.6 547000.00 0
12/28/2002 03:55:28 0 865657 520597.6 319654.4 530000.00 0
12/28/2002 04:55:16 0 864924 505457.6 344523.2 520000.00 0
12/28/2002 05:54:21 0 865480 511899.2 327285.6 534000.00 0
12/28/2002 06:52:00 0 864459 367658.4 332876 534000.00 0
12/28/2002 07:50:43 0 864164 516190.4 342773.6 550000.00 0
12/28/2002 08:53:22 0 865493 399200.8 336049.6 560000.00 0
12/28/2002 09:52:35 0 863321 474264 335219.2 547000.00 0
12/28/2002 10:54:09 0 862472 450069.6 340237.6 562000.00 0
12/28/2002 11:47:44 0 865408 433470.4 343923.2 554000.00 0
12/28/2002 12:51:20 0 865140 478308.8 357270.4 530000.00 0
12/28/2002 13:52:12 0 865377 457005.6 348348.8 557000.00 0
Some initial measurements of throughput vs. CPU utilization are shown below.
To ensure that the throughput was not limited by TCP or lower levels, we also measured TCP throughput with iperf with a 32MByte requested window, a txqueuelen of 100, and an MTU of 9000Bytes. We were able to achieve 570 Mbits/s after 5.2 secs and 990 Mbits/s after 10.2 secs, and were also able to confirm we were using jumbo frames (console).
We also remeasured the bbftp local performance from 198.51.111.66 to 198.51.111.58 (cit-slac13) with a 32MByte window, MTU=9000 and txqueuelen 100, and got 364 and 339 Mbits/s.
We thus believe the bbftp performance of 270-360 Mbits/s was not limited by the underlying TCP network performance.
1. The initial slow-start was designed to be very slow so that it is stable and does not overshoot too much when flows start in a dynamic scenario. Large overshoot can cause massive losses (thousands of packets) at such a large window and we try hard to prevent such losses at the expense of a very slow slow start, which is alright for huge files but bad for small files. Our newer version (which Cheng and David are working on now) should have a better balance. Steven Low 1/11/03. The estimate of 80 seconds for a measurement duration was based on the classic slow start algorithm, and should be increased to between 100 and 150 seconds for the FAST TCP stack on the links from Sunnyvale to CERN and Chicago.
Comments to iepm-l@slac.stanford.edu