Ancient History of Buffer Sizing

The 56 kb/s NSF-funded national network existed from 1986 to 1988. It was replaced by a T-1 (1.5 Mb/s) backbone. The DS-3 (45 Mb/s) replacement for the T-1 network became production-ready in October 1991 and ran until 1995. The network was provided by ANS, a consortium of MCI, IBM, and Merit (Michigan Educational Research Information Triad), and was called ANSNET.

LBL researchers observed serious congestion-based collapse of Internet traffic in 1986. They dissected the problem, worked fixes into BSD 4.3, and described the work in a paper published in 1988. Exponential backoff of the sending rate was triggered on detection of packet loss. That, along with slow start, fixed congestion collapse of the Internet.

In 1994, Villamizar and Song published the paper credited with the rule of thumb that recommends router buffers sized to the bandwidth-delay product (D*BW), where D is the round-trip transit delay. Their experimental data was taken on ANSNET with a small number of TCP flows; the small flow count may have favored flow synchronization, and synchronization increases the utility of large buffers. The paper, High Performance TCP in ANSNET, is available on-line to those with permission to read the ACM Digital Library or a ready credit card. Equipment manufacturers have used D*BW to size buffers in backbone routers on little more than this work from 20 years ago.

This paper has been cited many times and has been the subject of much commentary. The rule can be explained this way: the payload stored in the queue is sufficient to feed the bottleneck circuit while TCP recovers from the slow-down caused by tail-dropped packets, so the circuit never goes idle. The rule seems to assume that many flows back off at the same time in response to a buffer-full event; the flows are in this sense synchronized.
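To make the scale of the prescription concrete, here is a small worked computation; the 10 Gb/s link speed and 100 ms RTT are hypothetical numbers chosen only for illustration:

    # Worked example of the Villamizar/Song rule of thumb: buffer = D * BW.
    # The link speed and RTT are hypothetical values for illustration.
    link_speed_bps = 10e9   # 10 Gb/s bottleneck link
    rtt_s = 0.100           # 100 ms round-trip delay (D)

    buffer_bytes = rtt_s * link_speed_bps / 8
    print(f"D*BW buffer: {buffer_bytes / 1e6:.0f} MB")   # prints 125 MB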

The history of the NSFNET included frantic efforts to grow it fast enough to provide service before the network collapsed upon itself; it was so popular that no one could go there any more. The emphasis was on keeping links full and, to a lesser extent, minimizing retransmissions so that goodput was as high as possible. What there was not was high concern for the well-being of individual persistent flows.

The underpinnings of the D*BW buffer sizing prescription are weak. Much to the delight of researchers who write papers on protocol performance, this nut has yielded much meat.

Revised Buffer Size Theory

In 2004, G. Appenzeller of McKeown's group at Stanford published Sizing Router Buffers, arguing that Villamizar over-estimated required router buffers by 100X. Their formula for the required buffer is D*BW/sqrt(N), where N is the number of flows; 10,000 simultaneous flows at the core of the network gives sqrt(10,000) = 100, hence the 100X difference. The assumption is that on the backbone, the flows that make up the payload are not synchronized and behave independently.
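Continuing the hypothetical numbers from the sketch above:

    import math

    # Appenzeller et al.: buffer = D * BW / sqrt(N) for N desynchronized flows.
    # Same hypothetical link speed and RTT as above.
    link_speed_bps = 10e9
    rtt_s = 0.100
    n_flows = 10_000

    dbw_bytes = rtt_s * link_speed_bps / 8          # classic rule: 125 MB
    small_bytes = dbw_bytes / math.sqrt(n_flows)    # revised rule: 1.25 MB
    print(f"D*BW: {dbw_bytes / 1e6:.2f} MB, "
          f"D*BW/sqrt(N): {small_bytes / 1e6:.2f} MB")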

Lest anyone think that with enough buffer no packets will drop: that is not what the researchers had in mind. Their objective is to keep link utilization high in the face of packet loss. This requires operating in a regime where flows don't synchronize and back off together. Dropping packets is the feedback mechanism that adjusts senders to keep the network from over-filling; it is supposed to be a share-the-pain system.

On the point of what is important to researchers: the interest in small buffers comes from the difficulty of building buffers into an all-optical router. There is a belief that we are headed in that direction, and it is best to start preparing now. That is important, but it isn't directly related to big data set transfers today.

McKeown's lab published an experimental report in 2008 that verifies Appenzeller's D*BW/sqrt(N) buffer requirement on the Level(3) commercial backbone. The default buffer size was 190 ms, consistent with the 1994 Villamizar recommendation. In this study, buffers were reduced 40X before the first packet was dropped. This provides empirical confirmation of Appenzeller on at least one significant network. Details such as link utilization are important, but I defer to the cited paper.

That the commercial Internet might enjoy smaller buffers is not a result that can be blindly applied to research networks that specialize in big data file transfers. Two important differences:

  • In the commercial Internet, backbone links run at higher speed than any of the access links. This imposes a pacing effect on each flow, with gaps between the packets.
  • A small count of large flows may by itself lead to synchronization, which in turn requires increased buffers.

These differences notwithstanding, Ben Zhao et al. considered network use for high-performance scientific computing in Performance of high-speed TCP applications in networks with very small buffers. They also investigated pacing packets as a method of further reducing buffer requirements.

TCP disciplines like Reno and CUBIC are AIMD: Additive Increase, Multiplicative Decrease. After each round trip, an increment (the additive increase) is added to the window size, and this continues until the buffer at the bottleneck link overflows and packets are tail-dropped. AIMD tries to fill buffers. This increases the round-trip time, and full buffers have no reserve capacity to absorb transient bursts.
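A toy model makes the resulting sawtooth concrete. This is a sketch, not any real stack's code; the queue limit is a hypothetical stand-in for the bottleneck buffer:

    # Toy AIMD (Reno-style) congestion window model, in units of segments.
    # queue_limit is a hypothetical bottleneck buffer size for illustration.
    def aimd_sawtooth(rtts, queue_limit=8):
        cwnd, history = 1, []
        for _ in range(rtts):
            if cwnd > queue_limit:          # buffer overflow: tail drop
                cwnd = max(cwnd // 2, 1)    # multiplicative decrease
            else:
                cwnd += 1                   # additive increase, once per RTT
            history.append(cwnd)
        return history

    print(aimd_sawtooth(20))   # rises to the limit, halves, repeats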

A Review Paper

A 2009 paper, written like a review but published as an opinion editorial, summarizes many of the nuances:

http://www2.ee.unsw.edu.au/~vijay/pubs/jrnl/09ccr.pdf

Delay-based congestion detection

In 2015, Google proposed TIMELY, RTT-based congestion detection for use in the data center. A big advantage is that no change in the network is required. Hardware timestamps in new NICs have the microsecond accuracy necessary to make this work.
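The core idea can be sketched as follows. This is a simplified paraphrase of delay-gradient rate control, not the published TIMELY algorithm; the ALPHA, BETA, and DELTA_BPS parameters are illustrative, not the paper's values:

    # Simplified sketch of TIMELY-style delay-gradient rate control.
    # ALPHA, BETA, and DELTA_BPS are illustrative, not the paper's values.
    ALPHA = 0.8        # EWMA weight for smoothing the RTT gradient
    BETA = 0.5         # multiplicative decrease factor
    DELTA_BPS = 10e6   # additive increase step (10 Mb/s)

    class DelayGradientPacer:
        def __init__(self, rate_bps):
            self.rate_bps = rate_bps
            self.prev_rtt_us = None
            self.gradient = 0.0   # smoothed RTT change per sample (us)

        def on_rtt_sample(self, rtt_us):
            if self.prev_rtt_us is not None:
                diff = rtt_us - self.prev_rtt_us
                self.gradient = ALPHA * diff + (1 - ALPHA) * self.gradient
                if self.gradient > 0:          # RTT rising: queue building
                    self.rate_bps *= (1 - BETA)
                else:                          # RTT flat or falling: probe up
                    self.rate_bps += DELTA_BPS
            self.prev_rtt_us = rtt_us
            return self.rate_bps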

In 2016, Google proposed BBR (Bottleneck Bandwidth and RTT) congestion control. No cooperation is needed from the network. RTT measurements are made at rates that bracket the current data rate; the high probes look for increases in available bandwidth at the bottleneck, and when the RTT creeps upward, that is taken as a signal that buffers at the bottleneck are filling. BBR is suitable for use in wide-area networks, and Google uses it for bulk TCP traffic (e.g. YouTube).
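In rough outline, BBR keeps two running estimates, a windowed maximum of the observed delivery rate (bottleneck bandwidth) and a windowed minimum of the RTT (propagation delay), and paces at roughly their product. The sketch below shows only that bookkeeping; the probing state machine is omitted and the window lengths are made up:

    from collections import deque

    # Rough sketch of BBR's two estimators; window lengths are illustrative.
    class BbrEstimators:
        def __init__(self, bw_window=10, rtt_window=100):
            self.bw = deque(maxlen=bw_window)    # delivery-rate samples (b/s)
            self.rtt = deque(maxlen=rtt_window)  # RTT samples (seconds)

        def on_ack(self, delivery_rate_bps, rtt_s):
            self.bw.append(delivery_rate_bps)
            self.rtt.append(rtt_s)
            btl_bw = max(self.bw)     # windowed-max bottleneck bandwidth
            min_rtt = min(self.rtt)   # windowed-min propagation delay
            return btl_bw * min_rtt / 8   # estimated BDP in bytes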

A False God?

A focus on full link utilization and maximizing goodput may not be what leads to happy users. Buffer size should be selected to control stalling of individual flows. A 2005 paper, Buffer Sizing for Congested Internet Links, developed Buffer Sizing for Congested Links (BSCL), which incorporates per-flow performance along with link utilization. The authors used a network simulator to explore the effect of buffer size on performance under varying numbers of flows bottlenecked at the target link. Their conclusion, in their Table I, is that D*BW over-estimates buffer requirements by between 5X and 10X. For massively oversubscribed links, buffer requirements grow beyond D*BW, but at that level of load happiness is no longer a possibility. Sadly, their model of a router uses output queuing. They acknowledge that virtual output queuing would yield a different result, but then move on without further comment.

Pacing

TCP Reno and friends emit packets in back-to-back bursts. If we could just get them to stop doing that, the effect on buffer requirements would be beneficial. This problem shines most brightly when access connections to the network run at the same (or greater) speed as the backbone links. The question of how to do pacing, and how much it might help, has attracted some interest. Mark Allman and Ethan Blanton wrote Notes on Burst Mitigation for Transport Protocols, which suggests ways to smooth out TCP. A more recent paper (2015) is Edge versus host pacing of TCP traffic in small buffer networks. At the risk of a strained analogy: given freeway congestion, we could either add more lanes [expensive] or add metering to the on-ramps [not as expensive]. Networks that permit access links at the same speed as their backbone circuits are set up for a much more difficult time providing queue memory than those that do not. TCP Pacing and Buffer Sizing attempts to shine light on whether pacing actually helps or instead synchronizes flows and makes things worse. Answer: it helps.
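The arithmetic of pacing is simple: instead of emitting a window's worth of packets back to back, spread them evenly over one RTT. A sketch with hypothetical numbers:

    # Pacing: spread one congestion window over one RTT instead of bursting.
    # The window, segment size, and RTT are hypothetical.
    cwnd_segments = 100
    segment_bytes = 1500
    rtt_s = 0.050                          # 50 ms round trip

    pacing_rate_bps = cwnd_segments * segment_bytes * 8 / rtt_s
    gap_s = rtt_s / cwnd_segments          # even inter-packet gap
    print(f"pacing rate: {pacing_rate_bps / 1e6:.0f} Mb/s")   # 24 Mb/s
    print(f"packet gap: {gap_s * 1e6:.0f} us")                # 500 us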

This may not be hopeless in practice. An LWN.net article, TSO sizing and the FQ scheduler, describes how it is possible to smooth out the dumping of packets into the net. This reference is cited by an ESnet host-tuning web page.
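As one concrete knob: Linux offers a per-socket pacing-rate cap that the fq qdisc honors. A minimal sketch, assuming a Linux host with fq installed as the root qdisc; the 1 Gb/s cap is arbitrary:

    import socket

    # SO_MAX_PACING_RATE is Linux-specific; older Python versions do not
    # expose it as a constant, so fall back to its Linux UAPI value (47).
    SO_MAX_PACING_RATE = getattr(socket, "SO_MAX_PACING_RATE", 47)

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # The option takes bytes per second: 125,000,000 B/s is 1 Gb/s.
    sock.setsockopt(socket.SOL_SOCKET, SO_MAX_PACING_RATE, 125_000_000)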

Conclusion

It is tempting to reason thusly:

  • Cisco, Juniper, and Arista all make routers with large buffers.
  • They wouldn't do that if there were no purpose to it.
  • Since there is a purpose, the prudent designer will get as much buffer as possible as a hedge against not having enough.

The downside to more-is-better is that memory in high-speed devices comes at considerable cost, and it consumes a lot of power. Having more than you need is wasteful.

I believe that some product specs are set by the marketing department. If customers tell the equipment manufacturers that they need really large buffers, that will be sufficient to make products appear. It is remarkable how persistent the 1994 Villamizar and Song recommendation has been.

To drive home a point made above, the effect of large buffers is not to eliminate packet loss. It is to keep bottleneck links full in spite of packet loss.

This page last edited Nov 30 2019