Tom Hutton from UCSD has suggested Understanding TCP Incast Throughput Collapse in Datacenter Networks as a good tutorial on incast.
In RFC 6298 (June 2011) the initial TCP retransmission timeout (RTO) was made shorter. The initial RTO value is used when TCP starts, until a measured RTT can be derived from the stream. The previous value was 3 seconds; the new recommendation is 1 second. This makes people who use TCP in datacenters for short RTT flows feel misunderstood and unappreciated.
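To make the mismatch concrete, here is a minimal sketch of the RTO computation described in RFC 6298 (initial RTO of 1 second, then SRTT and RTTVAR smoothing with alpha = 1/8, beta = 1/4, K = 4). The class name, the 1 ms clock granularity, and the configurable floor `min_rto` are assumptions made for illustration and do not reflect any particular stack's implementation.

```python
from dataclasses import dataclass
from typing import Optional

ALPHA = 1 / 8   # SRTT gain, per RFC 6298
BETA = 1 / 4    # RTTVAR gain, per RFC 6298
K = 4           # variance multiplier, per RFC 6298


@dataclass
class RtoEstimator:
    """Illustrative RTO estimator; all times are in seconds."""
    initial_rto: float = 1.0     # RFC 6298 initial value (was 3.0 before)
    min_rto: float = 1.0         # RFC 6298 floor; assumed tunable here
    granularity: float = 0.001   # assumed clock granularity G
    srtt: Optional[float] = None
    rttvar: Optional[float] = None

    def rto(self) -> float:
        # Before any RTT sample exists, only the initial RTO is available.
        if self.srtt is None:
            return self.initial_rto
        raw = self.srtt + max(self.granularity, K * self.rttvar)
        return max(self.min_rto, raw)

    def sample(self, rtt: float) -> float:
        """Feed one RTT measurement and return the updated RTO."""
        if self.srtt is None:
            # First measurement initializes both estimators.
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # RTTVAR is updated first, using the previous SRTT.
            self.rttvar = (1 - BETA) * self.rttvar + BETA * abs(self.srtt - rtt)
            self.srtt = (1 - ALPHA) * self.srtt + ALPHA * rtt
        return self.rto()


if __name__ == "__main__":
    wan = RtoEstimator()
    dc = RtoEstimator(min_rto=0.001)   # hypothetical datacenter tuning
    print("RTO before any sample:", wan.rto())
    print("RTO after a 100 us sample, 1 s floor:", wan.sample(0.0001))
    print("RTO after a 100 us sample, 1 ms floor:", dc.sample(0.0001))
```

With a 100 microsecond RTT, the computed value is on the order of a millisecond, so the 1 second initial value and floor dominate unless they are lowered.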
Within a data center, an adjusted initial RTO was shown by the authors of Understanding... to go a long way toward addressing the ills caused by incast. Clearly, ultra-short initial RTOs would not be a good idea for science data set transfer over WAN distances. The incast problem is that the nature of the work (e.g. MapReduce, distributed file system) causes client responders to synchronize their messages, which in turn creates an overload on the link carrying results back to the server querier.
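The overload described above can be illustrated with back-of-the-envelope arithmetic. The sketch below uses a deliberately simple fluid model: every responder starts sending a fixed-size burst at the same instant, each at its own link rate, and the switch drains the shared egress port at line rate. The function names and all numbers (8 responders, 64,000-byte bursts, a 256,000-byte port buffer) are hypothetical, and the model ignores TCP window dynamics and shared-buffer allocation.

```python
def incast_peak_queue(senders: int, burst_bytes: int,
                      in_rate_bps: float = 1e9,
                      out_rate_bps: float = 1e9) -> float:
    """Peak egress queue depth in bytes under the fluid model."""
    arrived = senders * burst_bytes
    # Bytes the egress link drains while the bursts are still arriving.
    drained = burst_bytes * out_rate_bps / in_rate_bps
    return max(0.0, arrived - drained)


def incast_dropped(senders: int, burst_bytes: int, buffer_bytes: int,
                   in_rate_bps: float = 1e9,
                   out_rate_bps: float = 1e9) -> float:
    """Bytes that do not fit in the port buffer and are dropped."""
    peak = incast_peak_queue(senders, burst_bytes, in_rate_bps, out_rate_bps)
    return max(0.0, peak - buffer_bytes)


if __name__ == "__main__":
    for n in (1, 4, 8, 16):
        peak = incast_peak_queue(n, 64_000)
        lost = incast_dropped(n, 64_000, 256_000)
        print(f"{n:2d} senders: peak queue {peak:9.0f} B, dropped {lost:9.0f} B")
```

Under these assumptions the buffer absorbs the burst up to some number of concurrent senders and drops traffic beyond it; each drop that has to be recovered by timeout then costs a full RTO.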
Understanding... did not seriously consider large buffers as a solution to incast. They said:
"The authors also attempted a variety of non-TCP work-arounds and identified the shortcomings of each. Increasing the size of switch and router buffers delays the onset problem to configurations with more concurrent senders. However, switches and routers with large buffers are expensive, and even large buffers may be filled up quickly with ever higher speed links."
Understanding... used Nortel 5500 switches for their experimental work. These switches are included in the buffer tabulation table.
An important point is that mega-datacenters have voted with their purchases, and merchant silicon switch chips with integrated buffers have won the TOR beauty contest. One can only conclude that while the kilobytes in older switches were insufficient, the megabytes in current SoC switches solve the problem.
Broadcom addressed the data center situation in Broadcom Smart-Buffer Technology in Data Center Switches in April 2012. As commercial white papers go, this one is less painful than most.
Although the authors of Understanding... reject really large buffers [>1000 MB] as a solution to this problem, that has clearly not stopped vendors from offering them.
Cisco put together a 2011 white paper on networking for big data that shows buffer use at the stages of MapReduce.