IEPM

TCP Stack Measurements on Lightly Loaded Testbeds

Les Cottrell. Created 16 Dec '02, last update 15 February '03

 


Introduction | Comparisons with Multiple Streams | Multiple Hosts | TCP Stack Comparisons with Single Streams | Jumbo Frames | Measurements from Sunnyvale to Amsterdam | CPU Utilization | Startup | Fair Share | bbcp | Link Losses | File transfers | Comparison plot

Introduction

Steven Low and his group at Caltech have developed a new FAST TCP stack that improves performance on high speed, long RTT links. Tom Kelly of CERN has developed a Scalable TCP stack. Sally Floyd has proposed a High Speed TCP (HS TCP), which has been implemented by the web100/net100 team. FAST TCP is based on Vegas and uses the RTT to detect congestion. The latter two are based on Reno for congestion recognition and modify Reno's additive increase and multiplicative decrease congestion strategies (Scalable TCP uses an exponential increase, while HS TCP uses a table to decide how much to increase the congestion window when an ACK is received). All of these stacks need only be implemented at the sender. We report here on measurements made with all three of these stacks as well as with the stock TCP stack. We also report related measurements with jumbo frames and with varying the transmit queue length of the network device (txqueuelen).
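As a rough illustration of the difference in window growth rules, the toy sketch below compares Reno's additive increase with Scalable TCP's per-ACK increase (this is not the kernel code; the Scalable constant of 0.01 per ACK is the commonly quoted value and is assumed here, delayed ACKs are ignored, and HS TCP's table-driven increase is omitted):

awk 'BEGIN {
  reno = 1000; scal = 1000                 # starting congestion windows, in segments
  for (rtt = 1; rtt <= 200; rtt++) {
    reno += 1                              # Reno: +1/cwnd per ACK, i.e. ~1 segment per RTT
    scal += 0.01 * scal                    # Scalable: +0.01 per ACK, i.e. ~1% growth per RTT
    if (rtt % 50 == 0)
      printf "after %3d RTTs: Reno cwnd=%d  Scalable cwnd=%d\n", rtt, reno, scal
  }
}'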

These measurements were made using the DataGrid/Caltech/SLAC testbed. This consists of fast 2.4GHz Linux hosts with GE Network Interface Cards (NICs) located at CERN Geneva (GVA), StarLight Chicago (CHI), and at the Level(3) gateway in Sunnyvale, California (SNV). At a later stage we also had 5 hosts (2 at SNV, 2 at CHI and 1 at GVA) with Intel 10GE NICs (see 10GE End-to-end TCP tests). The link between Sunnyvale and Chicago is a Level(3) provided OC192/POS (10Gbits/s) link. The link between CHI and GVA is an OC48 (2.5Gbits/s) link. The testbed was set up for SC2002, and more details can be found in "Extreme Bandwidth": SC2002 Bandwidth Challenge Proposal.

The FAST 10GE Experiment is the planning page for the Jan 2003 experiment, and will be continually updated. In some cases we reserved hosts at CERN for exclusive use, using the DataTag reservation form available to DataTag users following the rules. Where possible we tried to avoid reserving hosts, so some of our results may have been affected by cross-traffic.

Comparisons with multiple streams

Methodology

The methodology is broadly outlined in Bulk throughput measurements. We set up the disk server routing to balance traffic between Sunnyvale and Chicago. We used iperf to send TCP data for 80 seconds with various windows and numbers of parallel streams. We chose 80 seconds (see footnote 1) since that should allow slow start to complete in < 10% of the total 80 second measurement time. We estimated the slow start time for the normal (Reno/Tahoe) TCP stack for a single stream to be ~ 2*ceiling(log2(optimum_window_size))*RTT, where RTT is the Round Trip Time (usually measured by ping). For an RTT of ~67ms (the RTT from Sunnyvale to Chicago) this yields ~ 2 seconds. We limited the product of windows*streams to 40,000 KBytes to limit the impact on resources such as memory.
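A minimal sketch of such a window/stream scan could look as follows (the host name and the window and stream lists are illustrative placeholders, not the exact values used):

HOST=chi2.example.net
for WIN_KB in 64 256 1024 4096 16384 32768; do
  for STREAMS in 1 2 4 8 16 32 64 120; do
    if [ $((WIN_KB * STREAMS)) -le 40000 ]; then   # 40,000 KByte cap on window*streams
      iperf -c $HOST -w ${WIN_KB}K -P $STREAMS -t 80
    fi
  done
done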

FAST TCP and Multiple Streams

We made measurements from SNV11 (198.51.111.50), running FAST TCP, to CHI2. By default we configured the FAST TCP stack and host as follows:

[cottrell@cit-slac11 ~]$ bin/setup
/proc/sys/net/ipv4/tcp_mem = 4096 67108864 67108864
/proc/sys/net/ipv4/tcp_rmem = 4096 67108864 67108864
/proc/sys/net/ipv4/tcp_wmem = 4096 67108864 67108864
/proc/sys/net/core/wmem_max = 67108864
/proc/sys/net/core/rmem_max = 67108864
/proc/sys/net/ipv4/tcp_vegas_cong_avoid = 1
/proc/sys/net/ipv4/tcp_vegas_fast_converge = 1
/proc/sys/net/ipv4/tcp_vegas_alpha = 400
/proc/sys/net/ipv4/tcp_vegas_beta = 250
/proc/sys/net/ipv4/tcp_vegas_gamma = 150
MTU = 1500
txqueuelen = 100
uname -a = Linux cit-slac12.caltech.edu 2.4.18-3combined #13 SMP Mon Nov 18 11:58:38 PST 2002 i686 unknown
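The same settings could be applied with sysctl and ifconfig along the following lines (a sketch only; the tcp_vegas_* keys are provided by the FAST/Vegas-patched kernel, not stock Linux, and the interface name eth0 is an assumption):

sysctl -w net.ipv4.tcp_mem="4096 67108864 67108864"
sysctl -w net.ipv4.tcp_rmem="4096 67108864 67108864"
sysctl -w net.ipv4.tcp_wmem="4096 67108864 67108864"
sysctl -w net.core.wmem_max=67108864
sysctl -w net.core.rmem_max=67108864
sysctl -w net.ipv4.tcp_vegas_cong_avoid=1
sysctl -w net.ipv4.tcp_vegas_fast_converge=1
sysctl -w net.ipv4.tcp_vegas_alpha=400
sysctl -w net.ipv4.tcp_vegas_beta=250
sysctl -w net.ipv4.tcp_vegas_gamma=150
ifconfig eth0 mtu 1500 txqueuelen 100      # matches the MTU and txqueuelen listed above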

The throughputs as a function of streams and windows are seen below. We use log-log plots to make the small numbers of streams stand out more clearly. The maximum number of streams was 120. The maximum throughput measured, aggregated over 60 seconds, was 870Mbits/s (for one stream with a 16384KByte window; the next two highest were also for a single stream: 838Mbits/s with a 65536KByte window and 829Mbits/s with a 32768KByte window). The average MHz/Mbits/s was 1.58+-0.26.

Stock TCP and Multiple Streams

We also measured the throughput from CHI1 (an identical host to CHI2, except that it was running the standard TCP stack) to SNV11. Each measurement was for 50 seconds. The results are shown below. The maximum throughput was 486Mbits/s for 25 streams with a 4096KByte window; second was 428Mbits/s for 120 streams with a 1024KByte window, and third was 417Mbits/s with 12 streams and an 8192KByte window. The average MHz/Mbits/s was 1.58+-0.36, which agrees well with the cpu utilization for the standard TCP stack. It is also seen that the multi-stream performance is similar for the FAST and normal TCP stacks for large numbers of streams (>20), but the behavior for small numbers of streams, in particular 1, is much improved (almost doubled for a single stream) by FAST.

FAST TCP and Multiple Streams with a longer RTT

We made measurements from SNV11 to GVA2. Here the round trip time was about 182ms. The iperf TCP throughput using the FAST stack is shown below for various windows and streams. The maximum throughput measured was 855Mbits/s for one stream with a 65536KByte window. It is seen that the throughput begins to saturate above 400Mbits/s, or about 50% of the maximum achievable.

Multiple hosts

We made 80 second iperf TCP measurements between various hosts with the following FAST parameters: 
cat /proc/sys/net/ipv4/tcp_mem = 4096 67108864 67108864
cat /proc/sys/net/ipv4/tcp_rmem = 4096 67108864 67108864
cat /proc/sys/net/ipv4/tcp_wmem = 4096 67108864 67108864
cat /proc/sys/net/core/wmem_max = 67108864
cat /proc/sys/net/core/rmem_max = 67108864
cat /proc/sys/net/ipv4/tcp_vegas_cong_avoid = 1
cat /proc/sys/net/ipv4/tcp_vegas_fast_converge = 1
cat /proc/sys/net/ipv4/tcp_vegas_alpha = 400
cat /proc/sys/net/ipv4/tcp_vegas_beta = 250
cat /proc/sys/net/ipv4/tcp_vegas_gamma = 150
uname -a = Linux cit-slac11.caltech.edu 2.4.18-3combined #13 SMP Mon Nov 18 11:58:38 PST 2002 i686 unknown

We also made bbftp measurements for a 2GByte file transfer.

For all of these measurements the window size was set to 32768KBytes unless otherwise stated, a single stream was used, and the MTU was 1500 Bytes. The numbers in the table are Mbits/s; an annotation of (j) indicates a measurement using jumbo frames. The numbers in parentheses are the maximum window size configured. The servers marked with Disk have 2 TBytes of RAID disk space. The bbcp application at this time could only accept a maximum window request of 2 MBytes, so the long-distance bbcp measurements between Sunnyvale and Chicago or CERN performed poorly since the window size was inadequate.

Receivers (host type, TCP stack, maximum configured window):
SNV11 (CPU, FAST, 67MB); SNV2 (CPU, Std, 67MB); SNV13 (Disk, Std, 32MB); SNV17 (Disk, FAST, 67MB); CHI1 (CPU, Std, 32MB); CHI2 (CPU, Std, 65MB); CHI3 (CPU, Std, 32MB); CHI10 (Disk, Std jumbo, 32MB); GVA1 (CPU, Std, 32MB); GVA2 (Disk, Std, 32MB); GVA3 (Disk, Std, 32MB); GVA4 (Disk, Std, 32MB); NIK16 (CPU, Std, 24MB)

iperf throughputs (Mbits/s), sender (host type, TCP stack, max window) to receiver:
SNV11 (CPU, FAST, 67MB):       SNV2=900; SNV13=809; CHI1=890; GVA1=411; GVA2=855; GVA3=872
SNV17 (Disk, FAST, 67MB):      SNV13=860; CHI2=760; CHI10=725; GVA1=450+-25, 209+-60(j); GVA2=790, 939(j); GVA3=528; GVA4=840
SNV10 (CPU, Scalable, 64MB):   GVA4=922
SNV2 (CPU, Std, 67MB):         SNV11=928; SNV13=914; GVA3=800(j)
SNV13 (Disk, Std jumbo, 32MB): CHI10=338; GVA1=34(j); GVA2=200
GVA2 (CPU, Std, 32MB):         SNV13=73; SNV17=74; CHI2=26
CHI3 (CPU, Std, 32MB):         GVA1=200, 900(j); NIK16=950(j)
NIK16 (CPU, Std, 24MB):        CHI3=200

bbftp throughputs for a 2GByte file transfer (Mbits/s):
SNV11 (CPU, FAST, 67MB):       CHI1=156; GVA2=219
SNV17 (Disk, FAST, 67MB):      SNV13=608; CHI2=157; CHI10=376; GVA2=204
SNV13 (Disk, Std jumbo, 32MB): CHI10=383; GVA2=107

bbcp memory-to-memory throughputs (Mbits/s):
SNV11 (CPU, FAST, 67MB):       GVA2=120
SNV17 (Disk, FAST, 67MB):      SNV13=826; GVA2=128
SNV13 (Disk, Std jumbo, 32MB): CHI10=362

Further study of the initially poor iperf performance (449Mbits/s) between SNV11 and GVA2, measured on 1/12/2003, indicates it was due to a very slow start, and there was heavy (>> 1%) loss measured by pings. The losses occurred only during the iperf measurement. It may also be relevant that the Ethernet interface on GVA2 reported 5 receive interface errors. These errors appear to occur at a rate of about 1/minute regardless of whether iperf is running. Similar results are seen from SNV17 to GVA3. SNV17 to GVA2, on the other hand, reaches stability at about 840Mbits/s by 15 seconds, hence its throughput is much higher. The poor performance between SNV11 and GVA2 appeared to be transient, and on 1/14/2003 we achieved 855Mbits/s iperf throughput. Further studies of the behavior of the congestion window (cwnd), instantaneous throughput and RTT can be found at http://www.cs.caltech.edu/~chengjin/les/.

TCP Stack Comparisons with Single Streams

We used iperf clients on Sunnyvale hosts to send TCP data with a window size of 32 MBytes to iperf servers at GVA. By default we ran iperf for 1000s and reported incremental throughputs at 5 second intervals. SNV2 had the stock TCP stack installed, SNV10 had Scalable TCP, SNV11 had FAST TCP, and SNV9 had Sally Floyd's HS TCP with web100. The GVA iperf server hosts all had the stock TCP stack installed. Besides using different TCP stacks on the client, we also varied txqueuelen and used jumbo frames (MTU=9000Bytes) as well as the standard MTU=1500Bytes. All tests except one were made for 1000 seconds. Before each test we used the Linux (root) command sysctl -w net.ipv4.route.flush=1 to ensure that the slow start threshold was not cached. We were also careful to ensure there was plenty of free memory (we queried the amount of free memory using the Linux top or free command, and if there was less than 400000K free we rebooted the host). We also checked that we were the only user logged onto the client, and observed top on the server to verify that our iperf was the only consumer of cpu cycles. The stock and Scalable stack measurements were made with various txqueuelen settings, but for FAST we only used the recommended setting of 100, since FAST relies on the RTT for its congestion avoidance and large settings of txqueuelen can dramatically increase the RTT. The results are shown below.
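A sketch of the per-test procedure just described (the iperf server name is a placeholder; the 400000K free memory threshold is the one quoted above):

sysctl -w net.ipv4.route.flush=1                     # avoid a cached slow start threshold
free | awk '/^Mem:/ { if ($4 < 400000) print "WARNING: less than 400000K free, reboot first" }'
iperf -c gva-server.example.net -w 32M -t 1000 -i 5  # 32MByte window, 1000s, 5s reporting intervals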

Also see Floyd's TCP slow-start and AIMD mods by Tom Dunigan for a comparison of how HS TCP behaves relative to stock TCP.

Jumbo Frames

Jumbo frames and multiple streams with stock TCP

Hosts were set up to enable jumbo frames (MTU = 9000 Bytes). These were SNV13 (198.51.111.58) and CHI10 (192.91.236.10). Both used the stock TCP stack. We used tcpdump and the iperf -m option to verify that large MTUs were being transmitted. We also used cat /proc/net/snmp to look for evidence of fragmentation. In addition, the Linux traceroute has a -F (don't fragment) option that, when used with large frames (the frame size follows the host address/name in the command), can be used to see whether jumbo frames reach the destination. For information on Path MTU discovery see RFC 1191. We measured the throughput for various windows and streams with the results shown below.
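The checks above amount to commands along these lines (a sketch; the interface name eth0 is an assumption, the destination is CHI10 from the text):

ifconfig eth0 mtu 9000                 # enable jumbo frames on the NIC
iperf -c 192.91.236.10 -w 8M -m        # -m reports the MSS iperf actually used
grep '^Ip:' /proc/net/snmp             # inspect the Frag* counters for fragmentation
traceroute -F 192.91.236.10 9000       # don't fragment; the frame size follows the host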

Comparing with SNV11 to CHI2 (FAST stack, standard MTU), it is seen that in the linear domain (i.e. where throughput increases linearly on the plot for SNV11 to CHI2) the throughputs are almost identical, apart from the 64KB window size (which is currently not understood). For large numbers of streams, jumbo frames outperform the FAST stack.

In theory, FAST works better than standard TCP (Reno) when the bandwidth (in packets per second) is high. Reno cannot perform well as the packet rate increases; FAST is scalable for both low and high packet rates.

Jumbo frame Reno is better than 1500 Byte MTU Reno since the packet per second rate is reduced to 1/9 by jumbo frames, so Reno's problem at high packet rates is alleviated.

For the multiple connection case, the packet rate is much smaller for each connection, so FAST has no advantage (similarly, on a 10Mbps link FAST cannot show an advantage over Reno), but jumbo frames retain their own advantages of fewer interrupts and a higher payload fraction. So jumbo frame Reno is better than FAST. (I expect FAST with jumbo frames would have similar performance to Reno with jumbo frames.) Xiaoliang Wei, Caltech FAST team.

Comparing against CHI1 to SNV11 (standard stack and MTU) again the behavior in the linear region is almost identical. Saturation sets in around 400Mbits/s for the standard stack with standard MTU whereas for jumbo frames saturation sets in close to 1Gbits/s.

Jumbos and Single Stream with stock TCP

We also made measurements between Amsterdam/NIKHEF (145.146.97.16) and Chicago (CHI3). Both hosts were enabled for jumbo frames and were running stock TCP. The NIKHEF host (2*2.4GHz Linux PC) had txqueuelen set to 1000 (packets), and CHI3 (2*2.2GHz Linux PC) was set to 2500. The max TCP window was set to 32MBytes, and iperf used 1 stream. No errors were observed on the NICs. The routes were asymmetric, and jumbo frames were only enabled in one direction (CHI3 to NIK16). The RTT was about 129ms:

This illustrates that jumbo frames for a single stream with stock TCP between 2 identical hosts with an RTT of 128ms can improve performance by a factor of 5 compared to a 1500Byte MTU.

From Sunnyvale to CERN we set up SNV1 (198.51.111.10) and GVA3 (192.91.239.3) with jumbo frames, txqueuelen=1000, running stock TCP. We sent TCP data from SNV1 to GVA3 using iperf with a 32MByte window. The results indicate that in the first 5 seconds we achieved about 400Mbits/s, by 80 seconds it reached about 800Mbits/s, and after 240 seconds it reached about 990Mbits/s. The aggregate throughput reached after 1000 seconds was 972Mbits/s. It is apparent that for this RTT (181ms), with jumbo frames, one needs to run for a considerable time (240 seconds) to reach the optimum performance. The figure below illustrates the additive increase of stock TCP with jumbo frames: the throughput grows roughly as time*0.5*MTU/RTT^2, since the congestion window grows by about one MTU per RTT, and the factor 0.5 accounts for delayed ACKs.

For stock TCP the maximum single stream throughput for MTU = 9000 Bytes exceeded that for MTU = 1500 Bytes by almost a factor of 5 (200 Mbits/s for MTU=1500 Bytes vs 967Mbits/s with MTU=9000 Bytes).

Jumbo frames and Various TCP Stacks

We made measurements with iperf/TCP from SNV to GVA1 with both server and client configured for MTU = 9000 Bytes and txqueuelen = 100. The results are shown below. Comparing these figures with those in the section on Comparing TCP Stacks, it can be seen that jumbo frames help significantly in improving throughput for all stacks evaluated.

For this unloaded path, with both FAST and HS TCP and with txqueuelen = 100, we were able to achieve > 900Mbits/s within 10 seconds. We also tried other values of txqueuelen for the Scalable TCP to see how they affected the overall throughput and stability, but for shorter durations. The average throughputs observed over the first 5, 20, 40, 80 and 400 seconds are seen in the table below. It can be seen that a larger txqueuelen results in larger throughputs for Scalable TCP with MTU = 9000 Bytes.

txqueuelen  Time to reach 800Mbits/s  Time to reach 900Mbits/s  Avg throughput after 5s (Mbits/s)  Avg throughput after 400s (Mbits/s)  Avg throughput after 80s (Mbits/s)
2000   5s    5s    657   982+-38   966+-81
1000   15s   15s   500   901+-163  840+-247
500    20s   20s   380   844+-145  814+-156
200    25s   40s   291   798+-125  741+-186
100    20s   105s  147   774+-128  715+-195

The behavior of the throughputs with txqueuelen is plotted below. It can be seen that there is little growth in the average throughput after 80 seconds. The points also fit well (R^2 > 0.9) to logarithmic curves. The curves shown are fits of the form f(q) = a*ln(q) + b, where q is the txqueuelen, with the parameters shown in the table below. The throughput at 5s is mainly dominated by slow start.

Parameters and R^2 of the fit f(q) = a*ln(q) + b to throughput vs txqueuelen (q)

Seconds so far (T)    a     b    R^2
5                   161  -589   0.98
10                  174  -545   0.95
20                  142  -164   0.98
40                  108   119   0.97
80                   79   334   0.93
400                  68   445   0.95
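As a quick check, the T=400s fit above (a=68, b=445) can be evaluated at the measured txqueuelen values and compared with the 400s averages in the first table:

for q in 100 200 500 1000 2000; do
  awk -v q=$q 'BEGIN { printf "txqueuelen %4d: fitted %3.0f Mbits/s\n", q, 68*log(q) + 445 }'
done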
 

 

Measurements from Sunnyvale to Amsterdam

The path partially used a production network (from StarLight to NIKHEF). The maximum MTU was 8192 Bytes. We followed the methodology described earlier. We also measured the ping RTT from the server to the client simultaneously with the iperf measurements. The results of measurements made on February 18th and February 21st, 2003 using Scalable, FAST, HS and stock TCP with various txqueuelen settings and MTU = 8192 Bytes are shown below in tabular and graphical forms. The approximate time stamp for when each measurement ended is also given on each plot. The behaviors of the TCP stacks are markedly different. For HS TCP one can see the slope of the recovery increasing with the congestion window (cwnd).
Col, Row Date, time PST Stack txqueuelen MTU Bytes Avg throughput in 1000s (Mbits/s) Avg throughput in 80s (Mbits/s) Comments
1,1 Feb 8 '03, 09:30 FAST 100 8192 461+-241 447+-194  
2,1 Feb 18 '03, 10:54 Scale 100 8192 387+-68 397+-82  
2,2 Feb 18 '03, 16:04 Scale 500 8192 507+-140 517+-155  
2,3 Feb 16 '03, 16:27 Scale 1000 8192 530+-161 568+-164  
2,4 Feb 18 '03, 10:11 Scale 2000 8192 622+-146 684+-179  
2,5 Feb 18 '03, 16:56 Scale 2000 8192 682+-133 644+-145  
1,2 Feb 21 '03, 11:25 HS 100 8192 303+-123 180+-137  
1,3 Feb 21 '03, 17:12 HS 500 8192 290+-142 289+-138  
1,4 Feb 21 '03, 10:50 HS 2000 8192 334+-207 323+-146  
1,5 Feb 21 '03, 15:50 HS 10000 8192 371+-239 292+-175  
3,1a Feb 18 '03, 13:21 Stock 100 8192 438+-53 237+-26  
3,2 Feb 21 '03 Stock 100 8192 318+-51 248+-28  
3,1b Feb 18 '03, 10:15 Stock 1000 8192 502+-101 740+-116  

CPU Utilization

The cpu utilization measured at the client (SNV11, a 2*2.4GHz Pentium 4 running the FAST TCP stack) using the Unix time command is shown below. The average was 1.58 +- 0.26 MHz/Mbits/s. This is in reasonable agreement with measurements for more standard (Reno/Tahoe) TCP stacks.

We manually observed the cpu utilization of the iperf server using the Unix top command and noted its values together with the throughputs recorded using the iperf -i incremental reporting option. The MHz/Mbps was 1.6 +- 0.2. A plot of the server cpu utilization for SNV11 (running FAST) to GVA4 (stock TCP) with a 1500 Byte MTU is shown below. We repeated the measurement with the iperf client sending data from SNV10 running the Scalable TCP stack and from SNV11 running the FAST TCP stack to an iperf server at GVA4, with similar server utilization results. There was, however, a big difference when using jumbo frames (MTU=9000 Bytes). The server cpu utilization for MTU=9000 Bytes is about a factor of 3 less than for MTU=1500 Bytes, or more quantitatively 0.59+-0.1 compared to 1.6+-0.2.
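For illustration, assuming the MHz/Mbits/s figure is computed as the observed cpu utilization fraction times the cpu clock speed divided by the throughput (our reading of the metric), a 2.4GHz server at 60% utilization delivering 900Mbits/s would give the quoted 1.6; the utilization and throughput figures below are hypothetical:

awk 'BEGIN { printf "%.1f MHz/Mbits/s\n", 0.60 * 2400 / 900 }'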

To compare iperf client cpu utilization between standard MTUs (1500 Bytes) and jumbo frames, we used iperf/TCP to send data for 80 seconds from a FAST TCP host (SNV17) at Sunnyvale to standard TCP stack hosts at Chicago (CHI2) and Geneva (GVA2). We used txqueuelen=100, a single stream and varying window sizes to achieve different throughputs. The GVA MTU was set (using ifconfig eth0 mtu 9000) to 9000 Bytes. The MTU at Sunnyvale was alternated between 1500 and 9000 Bytes. The results are shown below. It is seen that for the FAST TCP stack the iperf client cpu utilization is about a factor of 2 less for jumbo frames. For the stock TCP stack the difference in CPU utilization/Mbits/s between MTU=1500 Bytes and 9000 Bytes was fairly small (see the 2nd figure below) and was close to that for the FAST stack with a 9000 Byte MTU.

Startup

The single stream slow start for a TCP Reno/Tahoe stack, assuming no losses, should take about 2*ceiling(log2(ideal_window_size))*RTT, which yields about 2 seconds for an RTT of 67msec and a window of 65000KBytes. The FAST TCP stack appears to take longer, closer to 8 seconds, as can be seen below from the iperf -i (interval) option output. After slow start the throughput remains fairly steady at between 840Mbits/s and 1Gbits/s. The aggregate throughput for the 60 seconds was 883Mbits/s. The aggregate throughput measured from 8 seconds after the start until the end (i.e. when the throughput is stable) was 938Mbits/s.
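The slow start estimate above can be reproduced as follows (taking the ideal window in MSS-sized segments, which is our reading of the formula):

awk 'BEGIN {
  rtt = 0.067                     # s, Sunnyvale to Chicago
  w   = 65000 * 1024 / 1500       # 65000 KByte window in 1500 Byte segments
  n   = int(log(w)/log(2)) + 1    # ceiling(log2(w))
  printf "estimated slow start time: %.1f s\n", 2 * n * rtt
}'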

Fair Share

To demonstrate how FAST TCP shares throughput among multiple streams, we plot the throughput/stream where the aggregate throughput is saturated. The figures below are for the measurements from SNV11 to GVA2. The stacked graph shows the iperf flow throughputs for the 512KByte window for 32, 40, 64, 90 and 120 streams, where, as seen from the figure above, the throughput is fairly saturated. Quick inspection shows that the flows share the throughput about equally. The error bar plot shows the average throughput/stream for all the measurements. The magenta crosses are for the measurements where the aggregate throughputs are > 400 Mbits/s. The error bars indicate the standard deviations (stdev). The 3rd graph shows the relative standard deviations (i.e. stdev/avg) for the per stream throughputs. The magenta squares are for aggregate throughputs of over 400 Mbits/s. The points with 0 stdev/avg are single stream measurements, where the stdev is zero. It can be seen that FAST TCP does a good job of sharing the throughput fairly among competing FAST TCP streams between the same source and destination.

bbcp

We ran bbcp in memory (/dev/zero) to memory (/dev/null) mode from SNV11 to GVA2 (192.91.239.2). We set the window size to 32768KB to match that nominally needed for an RTT of 182ms and a bandwidth of 1Gbits/s.

/home/cottrell/package/bbcp/bin/i386_linux24/bbcp -f -v -b  4 -t 80 -P 1 -s 1 -D

-w 32768k -T "ssh -l cottrell 192.91.239.2 /home/cottrell/package/bbcp/bin/i386_linux24/bbcp" /dev/zero cottrell@192.91.239.2:/dev/null

With such a window size, we were only able to achieve a throughput of about 15.2MBytes/s or 127Mbits/s. The client cpu load while running this (measured by the Unix top command) varied from 6-9%, so it does not appear to be a cpu starvation problem. We observed that, though we specified a window of 32768KB, bbcp set the window back to 2MBytes:

bbcp_CTL: Sending to 192.91.239.2: -b 4 -D -f -m 644 -P 1 -s 1 -t 80 -v -W 2096128 -Y 2e565f3e -H none:0

which Linux then increased to 4MB:

bbcp_SNK 6144: Window size set to 2096128 (actual snd=4192256 rcv=4192256)

The problem may be caused by the fact that "disk buffers are tied to window size buffers, you can quickly spiral out of control and kill the whole system" (Andy Hanushevsky). We tried setting a window of 2048KB with a single stream and achieved 36,205KBytes/s (~304Mbits/s) over 80 seconds.

bbcp -f -v -b 4 -t 80 -P 1 -s 16 -D -w 2048k -T "ssh -l cottrell 192.91.236.2 /home/cottrell/package/bbcp/bin/i386_linux24/bbcp" /dev/zero cottrell@192.91.236.2:/dev/null

bbcp_SNK 9038: Window size set to 2096128 (actual snd=4192256 rcv=4192256)
bbcp: Source I/O buffers (18423K) > 25% of available free memory (16216K); copy may be slow

However, when we tried setting the window to 2048k and using 16 streams (the command shown above), bbcp failed to complete.

After further discussions with Andy Hanushevsky, he identified a problem with using -w to try to set window sizes > 2MBytes and suggested using the -W option instead. This allowed larger windows to be set. With 32768KByte windows, a single stream, a txqueuelen of 100 and stock TCP, we achieved the following bbcp throughputs from SNV13 aka cit-slac13 (198.51.111.58) to SNV17 aka cit-slac17 (198.51.111.78):

Mode              From                          To                   bbcp throughput MBytes/s (Mbits/s)
Memory-to-memory  SNV13:/dev/zero               SNV17:/dev/null      104.2 (833.6)
Disk-to-memory    SNV13:/raid/dummy.2000000000  SNV17:/dev/null      72.3 (578.4)
Disk-to-disk      SNV13:/raid/dummy.2000000000  SNV17:/raid/bbcpdat  62.6 (500.8)
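The memory-to-memory transfer above could be made with a command along the following lines (a sketch based on the earlier bbcp examples, using -W rather than -w to request the large window; the exact options used may have differed):

bbcp -f -v -b 4 -t 80 -P 1 -s 1 -W 32768k \
     -T "ssh -l cottrell 198.51.111.78 /home/cottrell/package/bbcp/bin/i386_linux24/bbcp" \
     /dev/zero cottrell@198.51.111.78:/dev/null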

Link Losses

To investigate the loss performance of the link we ran 64 Byte pings at 1 second intervals from SNV11 to GVA2. To first order, pings should indicate the non-congestion loss on the link. We ran ~60K pings starting at 18:12:49 on December 19, 2002. The overall loss rate was 0.55% (326 packets lost of 59465 sent). They were lost in a burst of 154 sequential packets (i.e. an outage of 154 seconds) ending at sequence number 26361, a burst of 157 packets ending at sequence number 33454, a burst of 2 packets ending at sequence number 12212, plus 13 single packet losses. Two of the single packet losses came within 10 sequence numbers of one another or of another burst (i.e. within roughly the time it takes FAST to climb back up to full throughput after a loss). This suggests that the losses are very bursty: there appear to be 14 bursts of losses separated by 10 seconds or more in a time period of 60,000 seconds, a burst loss rate of 0.02% (2.2 in 10,000), or a Bit Error Rate (BER), assuming 1500 Byte MTUs, of 2 in 10^8. If we assume the single packet losses are caused by congestion, then the burst loss rate is 2 in 60,000, or a BER of 3 in 10^9. Possibly the shorter bursts are caused by congestion at the routers, for example caused by iperf tests. The sequence number at the end of each burst, the length of each burst and the separation between bursts can be seen in the table below.
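The BER arithmetic above is simply:

awk 'BEGIN {
  sent = 59465; bits = 1500 * 8               # pings sent; bits per 1500 Byte packet
  printf "all 14 bursts:     BER ~ %.1e\n", 14 / sent / bits
  printf "2 non-congestion:  BER ~ %.1e\n",  2 / sent / bits
}'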

The losses from SNV11 to CHI1 for 60000 64 Byte pings at 1 second intervals, starting on Friday Dec. 20 at 12:54:34 2002 PST, indicate a loss rate of 0.035% (21 pings lost out of 60000) and that the losses are non-bursty; all the losses were of single pings.

SNV11 to GVA2:
Seq    Burst Loss    Burst Sep (sec)
1652 1 109
1757 1 105
5090 1 3333
12212 2 7122
26361 154 14149
27936 1 1575
28295 1 359
28733 1 438
29616 1 883
32046 1 2430
33454 157 1408
44276 1 10822
44278 1 2
48468 1 4190
48492 1 24
54929 1 6437
SNV11 to CHI1:
Burst Loss    Seq    Burst Sep (sec)
1 722 722
1 5399 4677
1 5754 355
1 6094 340
1 6129 35
1 8989 2860
1 13182 4193
1 19548 6366
1 23359 3811
1 30525 7166
1 30837 312
1 33702 2865
1 34135 433
1 34338 203
1 37242 2904
1 37517 275
1 44324 6807
1 54906 10582
1 54923 17
1 55620 697
1 58809 3189
 

File transfers

Four hosts at Sunnyvale (198.51.111.58, 62, 66, 70, 74, 78, 82; note there are 2 NICs per host, so 2 addresses/host) were set up with dual Pentium 4 2.4GHz cpus and dual Gbit/s NICs, plus 8 * 120GB disks in each ATA RAID array and 2 such arrays per server. Similar setups were available at Geneva and Chicago. Host 198.51.111.74 was set up with the FAST TCP stack. Jumbo frames were configured on 192.91.236.10 (at Chicago) and on 198.51.111.58 (at Sunnyvale) using # ifconfig ethx mtu 9000. See 3ware RAID arrays tests with Linux 2.4.19 on P4DPE with twin 2.4 GHz CPUs for information on the performance of the RAID arrays.

Sylvain Ravot reported:

"Without any tuning, I could get 350 Mbit/s with 8 streams using iperf between Chicago and Sunnyvale. I could get 700 Mbit/s with standard MTU and 8 streams by increasing txqueuelen (transmit queue of the NIC).
#ifconfig ethx txqueuelen 4000
With Jumbo Frame I could saturate the link using 8 streams."

Local file transfers

We set up IEPM/BW monitoring to make hourly measurements between SNV74 and SNV58. Typical bbftp disk to disk 2GB transfers with 1 stream and a 32768KByte requested window between two disk server hosts at Sunnyvale (from 198.51.111.74 (running FAST) to 198.51.111.58) consistently attained about 70MBytes/s (~550Mbits/s) and take about 56% of the cpu (i.e. 56% ~ (21.84+0.11)/39.23 = (sys+user)/real_time from the Unix time command).

Below is a log from a transfer to substantiate the above:

#BBFTP(12/28/2002 00:58:23 1041065903) - ssh -f cottrell@198.51.111.58 rm -f /raid/bbcpdat/bbftpdat 2>&1
#BBFTP(12/28/2002 00:58:23 1041065903) - CMD: /usr/bin/time -p /usr/local/bin/bbftp -r 1 -V -t -p 1 -L "s h " -E "/home/cottrell/bin/bbftpd -s -m 40" -e " setrecvwinsize 32768; setsendwinsize 32768;put /raid/temp/dummy.2000000000 /raid/bbcpdat/bbftpdat" -u cottrell 198.51.111.58 2>&1
#BBFTP(12/28/2002 00:58:23 1041065903) - Sat Dec 28 00:58:43 (PST) : >> COMMAND : setremotecos 0
#BBFTP(12/28/2002 00:58:23 1041065903) - Sat Dec 28 00:58:43 (PST) : << OK : COS set
#BBFTP(12/28/2002 00:58:23 1041065903) - Sat Dec 28 00:58:43 (PST) : >> COMMAND : setrecvwinsize 32768
#BBFTP(12/28/2002 00:58:23 1041065903) - Sat Dec 28 00:58:43 (PST) : << OK
#BBFTP(12/28/2002 00:58:23 1041065903) - Sat Dec 28 00:58:43 (PST) : >> COMMAND : setsendwinsize 32768
#BBFTP(12/28/2002 00:58:23 1041065903) - Sat Dec 28 00:58:43 (PST) : << OK
#BBFTP(12/28/2002 00:58:23 1041065903) - Sat Dec 28 00:58:43 (PST) : >> COMMAND : put /raid/temp/dummy.20\ 00000000 /raid/bbcpdat/bbftpdat
#BBFTP(12/28/2002 00:58:23 1041065903) - Sat Dec 28 00:59:11 (PST) : << OK
#BBFTP(12/28/2002 00:58:23 1041065903) - Sat Dec 28 00:59:11 (PST) : 2048000000 bytes send in 27.9 secs (\ 7.17e+04 Kbytes/sec or 560 Mbits/s)
#BBFTP(12/28/2002 00:58:23 1041065903) - real 39.23
#BBFTP(12/28/2002 00:58:23 1041065903) - user 0.11
#BBFTP(12/28/2002 00:58:23 1041065903) - sys 21.84
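The cpu fraction quoted above follows directly from the user and sys times in this log:

awk 'BEGIN { printf "cpu fraction = (%.2f + %.2f)/%.2f = %.0f%%\n", 21.84, 0.11, 39.23, 100*(21.84 + 0.11)/39.23 }'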

Below are extracted results to show the consistency of the results and also the throughputs measured for iperf:

#date       time     pingloss     iperf   bbcpmem  bbcpdisk     bbftp    pingAverage
12/28/2002 00:49:58         0    864828  471773.6    332440 560000.00         0
12/28/2002 01:49:13         0    864536  502606.4  333952.8 566000.00         0
12/28/2002 02:52:07         0    863935    441564  331125.6 547000.00         0
12/28/2002 03:55:28         0    865657  520597.6  319654.4 530000.00         0
12/28/2002 04:55:16         0    864924  505457.6  344523.2 520000.00         0
12/28/2002 05:54:21         0    865480  511899.2  327285.6 534000.00         0
12/28/2002 06:52:00         0    864459  367658.4    332876 534000.00         0
12/28/2002 07:50:43         0    864164  516190.4  342773.6 550000.00         0
12/28/2002 08:53:22         0    865493  399200.8  336049.6 560000.00         0
12/28/2002 09:52:35         0    863321    474264  335219.2 547000.00         0
12/28/2002 10:54:09         0    862472  450069.6  340237.6 562000.00         0
12/28/2002 11:47:44         0    865408  433470.4  343923.2 554000.00         0
12/28/2002 12:51:20         0    865140  478308.8  357270.4 530000.00         0
12/28/2002 13:52:12         0    865377  457005.6  348348.8 557000.00         0
Some initial measurements of throughput vs. CPU utilization are shown below.

SNV-CHI file transfers

Using bbftp to transfer a 2GByte file from 198.51.111.66 (cit-slac14, eth0 1GE interface) to 192.91.236.10 (v10chi, eth2 1GE interface) using the FAST TCP stack, we tried varying txqueuelen and MTU. For MTU 1500 we found little change in throughput with txqueuelens of 100, 1000 and 10000 (typical throughputs 240-280Mbits/s). We then set txqueuelen to 100 and varied the MTU with values of 1500, 3000, 5000 and 9000. We achieved bbftp file transfer throughputs of:
MTU=1500: 272, 280 Mbits/s
MTU=3000: 277, 302, 333, 228 Mbits/s
MTU=5000: 342, 399, 321, 337 Mbits/s
MTU=9000: 350, 342, 356, 337 Mbits/s

To ensure that the throughput was not limited by TCP or lower levels, we also measured TCP throughput with iperf with a 32MByte requested window, a txqueuelen of 100, and an MTU of 9000 Bytes. We were able to achieve 570 Mbits/s after 5.2 secs and 990 Mbits/s after 10.2 secs, and were also able to confirm that we were using jumbo frames.

We also remeasured the bbftp local performance from 198.51.111.66 to 198.51.111.58 (cit-slac13) with a 32MByte window, MTU=9000 and txqueuelen=100, and got 364 and 339 Mbits/s.

We thus believe the bbftp performance of 270-360 Mbits/s was not limited by the underlying TCP network performance.

Footnotes

1. The initial slow-start was designed to be very slow so that it is stable and does not overshoot too much when flows start in a dynamic scenario. Large overshoot can cause massive losses (thousands of packets) at such large windows, and we try hard to prevent such losses at the expense of a very slow slow-start, which is alright for huge files but bad for small files. Our newer version (which Cheng and David are working on now) should have a better balance. Steven Low 1/11/02. The estimate of 80 seconds for a measurement duration was based on the classic slow start algorithm, and should be increased to between 100-150 seconds for the FAST TCP stack for the links from Sunnyvale to CERN and Chicago.


Comments to iepm-l@slac.stanford.edu