In a former (professional) life, I had the pleasure of hunting down a nasty network problem with Mr. Wlfrth.

tl;dr: If you're doing IP routing on a VMware virtual machine, make sure to disable Large Receive Offload (LRO) at the VMware level, or it will fail in interesting ways.

For some (not so) good reasons, we were setting up a virtual machine on VMware that would route packets between roughly 10,000 nodes and the rest of our client's infrastructure (and the internet). In initial testing, everything seemed to work fine, so the customer decided to route some (all...) of their nodes' packets through the VM. After a while, their 1 Gbit/s internet pipe was completely saturated, yet we were not seeing much traffic on the client-node side of the router...

After quite a bit of investigation, we found out that we were getting frames like this (yy.yy.yy.yy is a node behind our router, xx.xx.xx.xx is a server on the internet and zz.zz.zz.zz is the router):

 IP xx.xx.xx.xx.80 > yy.yy.yy.yy.54731: Flags [.], ack 9593, win 2048, length 10220
 IP zz.zz.zz.zz > xx.xx.xx.xx: ICMP yy.yy.yy.yy unreachable - need to frag (mtu 1500), length 556
 IP xx.xx.xx.xx.80 > yy.yy.yy.yy.54731: Flags [.], ack 9593, win 2048, length 10220
 IP zz.zz.zz.zz > xx.xx.xx.xx: ICMP yy.yy.yy.yy unreachable - need to frag (mtu 1500), length 556
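
Frames like these can be picked out of a text capture mechanically. Below is a minimal Python sketch, assuming tcpdump-style lines that end in "length N" and a 1500-byte MTU; the names oversized_lines and LENGTH_RE are made up for this illustration.

```python
import re

MTU = 1500  # assumed MTU of the router's interfaces, as in the trace above

# Matches the trailing "length N" field that tcpdump prints for each packet.
LENGTH_RE = re.compile(r"length (\d+)\s*$")


def oversized_lines(lines, mtu=MTU):
    """Return the capture lines whose reported length exceeds the MTU."""
    hits = []
    for line in lines:
        match = LENGTH_RE.search(line)
        # For TCP, tcpdump reports the payload length; a payload larger than
        # the MTU can never fit in a single frame on that interface.
        if match and int(match.group(1)) > mtu:
            hits.append(line.strip())
    return hits


capture = [
    "IP xx.xx.xx.xx.80 > yy.yy.yy.yy.54731: Flags [.], ack 9593, win 2048, length 10220",
    "IP zz.zz.zz.zz > xx.xx.xx.xx: ICMP yy.yy.yy.yy unreachable - need to frag (mtu 1500), length 556",
]

for line in oversized_lines(capture):
    print(line)
```

Only the TCP line is flagged; the ICMP error itself is well under the MTU.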

Here's what was happening:

  • xx.xx.xx.xx.80 sends us a reply that is 10261 bytes long, with the Don't Fragment (DF) bit set.
  • Our router tries to forward this packet to yy.yy.yy.yy.54731 and fails, since the MTU of the interface is a standard 1500 bytes.
  • Our router sends back an ICMP "unreachable - need to frag" error (because DF is set) to the original server (xx.xx.xx.xx.80).
  • That server, upon receiving the ICMP error, retransmits the packet(s).
  • Repeat
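
The router's decision in the steps above can be sketched in a few lines of Python. This is a simplified model, not the Linux kernel's actual forwarding code: Packet, forward and simulate are hypothetical names, and the sizes are the ones from the trace.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    length: int          # total packet length in bytes
    dont_fragment: bool  # state of the DF bit


def forward(packet, egress_mtu=1500):
    """Return what a router does with a packet leaving via an interface with this MTU."""
    if packet.length <= egress_mtu:
        return "forward"
    if packet.dont_fragment:
        # Too big and not allowed to fragment: drop it and tell the sender.
        return f"icmp unreachable - need to frag (mtu {egress_mtu})"
    return "fragment and forward"


def simulate(rounds, length=10261):
    """Count the bytes received and dropped while the sender keeps retransmitting."""
    wasted = 0
    for _ in range(rounds):
        # Each retransmission shows up as the same oversized packet with DF set.
        if forward(Packet(length, dont_fragment=True)) != "forward":
            wasted += length
    return wasted


print(forward(Packet(10261, dont_fragment=True)))
print(simulate(3))
```

Every round costs inbound bandwidth and delivers nothing to the node, which is why the internet side filled up while the client side stayed quiet.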

This loop generated a lot of traffic, saturating the customer's 1 Gbit/s internet pipe.

Great success...

A few hours of debugging later, we found that VMware, being the helpful software that it is, was doing TCP segment reassembly (Large Receive Offload, or LRO) and passing huge packets to the Linux kernel, which triggered the behaviour described above.

This feature was developed to improve throughput on interfaces faster than 1 Gbit/s (we were using 10 Gbit/s NICs). It would probably have worked properly in that environment if the magic reassembly hadn't set the DF bit. But since it was causing these problems (and we didn't really have time to investigate further on the VMware side of things...), we simply disabled packet reassembly in VMware.

To do so, follow the official doc:

Log in to the ESXi host or vCenter Server by using the vSphere Client.
Navigate to the host in the inventory tree, and on the Configuration tab click Advanced Settings under Software.
Select Net and scroll down until you reach parameters starting with Vmxnet.
Set the following LRO parameters from 1 to 0:

 Net.VmxnetSwLROSL
 Net.Vmxnet3SwLRO
 Net.Vmxnet3HwLRO
 Net.Vmxnet2SwLRO
 Net.Vmxnet2HwLRO

Reboot the ESXi/ESX host to apply the changes.
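
If you'd rather script this than click through the client, the same parameters can probably be set from the ESXi shell. The Python sketch below only builds the command lines; it assumes that esxcli's "system settings advanced set" addresses an option by its name with dots replaced by slashes (Net.Vmxnet3SwLRO becomes /Net/Vmxnet3SwLRO), so check that against the documentation for your ESXi version before running anything. The helper name esxcli_command is made up for this example.

```python
# The LRO parameters listed above.
LRO_PARAMETERS = [
    "Net.VmxnetSwLROSL",
    "Net.Vmxnet3SwLRO",
    "Net.Vmxnet3HwLRO",
    "Net.Vmxnet2SwLRO",
    "Net.Vmxnet2HwLRO",
]


def esxcli_command(name, value=0):
    """Build the esxcli line that sets one advanced integer option."""
    # Assumption: the option path is the parameter name with "." turned into "/".
    option = "/" + name.replace(".", "/")
    return f"esxcli system settings advanced set -o {option} -i {value}"


for name in LRO_PARAMETERS:
    print(esxcli_command(name))
```

The script prints one command per parameter and does not execute anything itself; the reboot from the last step is still needed.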