VMware: Routing and Large Receive Offload considered harmful
In a former (professional) life, I had the pleasure of hunting a nasty network problem with Mr. Wlfrth.
tl;dr: If you're doing IP routing on a VMware virtual machine, make sure to disable Large Receive Offload (LRO) at the VMware level, or it will fail in interesting ways.
For some (not so) good reasons, we were setting up a virtual machine on VMware that would route packets between roughly 10,000 nodes and the rest of our client's infrastructure (and the internet). In the initial testing everything seemed to work fine, so the customer decided to route some (all...) of its nodes' packets through the VM. After a while, its internet pipe (1 Gbit/s) was completely saturated, but we were not seeing much traffic on the client-nodes side of the router...
After quite a bit of investigation, we found out that we were getting frames like this (yy.yy.yy.yy is a node behind our router, xx.xx.xx.xx is a server on the internet and zz.zz.zz.zz is the router):
```
IP xx.xx.xx.xx.80 > yy.yy.yy.yy.54731: Flags [.], ack 9593, win 2048, length 10220
IP zz.zz.zz.zz > xx.xx.xx.xx: ICMP yy.yy.yy.yy unreachable - need to frag (mtu 1500), length 556
IP xx.xx.xx.xx.80 > yy.yy.yy.yy.54731: Flags [.], ack 9593, win 2048, length 10220
IP zz.zz.zz.zz > xx.xx.xx.xx: ICMP yy.yy.yy.yy unreachable - need to frag (mtu 1500), length 556
```
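Frames like these can be spotted with a tcpdump filter that matches both the oversized TCP segments and the ICMP errors they trigger (a sketch; `eth0` is a placeholder for the router's actual interface name):

```shell
# Match ICMP unreachable errors, plus any TCP packet whose IP total
# length (bytes 2-3 of the IP header) exceeds the standard 1500-byte MTU.
tcpdump -ni eth0 'icmp[icmptype] == icmp-unreach or (tcp and ip[2:2] > 1500)'
```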
Here's what's happening:

- xx.xx.xx.xx.80 sends us a reply that is 10261 bytes long, with the don't fragment (DF) bit set.
- Our router tries to forward this packet down to yy.yy.yy.yy.54731 and fails to do so, since the MTU of the interface is a standard 1500 bytes.
- Our router sends back an ICMP error, unreachable - need to fragment (DF is set), to the original server (xx.xx.xx.xx).
- That server, upon receiving the ICMP error, retransmits the packet(s), and the cycle starts over.
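The same failure mode can be reproduced by hand by forcing a DF-marked packet larger than the path MTU (a sketch; yy.yy.yy.yy stands in for a node behind the router, and the exact error text depends on the ping implementation):

```shell
# -M do sets the DF bit and disables local fragmentation (Linux iputils);
# -s 2000 makes the packet larger than the 1500-byte MTU.
ping -M do -s 2000 yy.yy.yy.yy
# On a typical Linux host this fails with something like:
#   ping: local error: message too long, mtu=1500
```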
This loop was generating lots of traffic, saturating the 1gbps internet pipe of the customer.
A few hours of debugging later, we found that VMware, being the helpful software that it is, was doing TCP segment reassembly (Large Receive Offload) and passing huge packets to the Linux kernel, which triggered the behaviour described above.
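You can see whether the guest is receiving coalesced segments by checking the offload flags on the virtual NIC from inside the Linux guest (a sketch; `eth0` is a placeholder for the vmxnet interface):

```shell
# Show the receive-offload settings for the interface
ethtool -k eth0 | grep -E 'large-receive-offload|generic-receive-offload'

# LRO can also be turned off from the guest side (when not marked [fixed])
ethtool -K eth0 lro off
```

Note that the guest-side toggle only helps if the hypervisor honours it, which is why we ultimately disabled LRO at the VMware level.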
This feature was developed to achieve better throughput on >1 Gbit/s interfaces (we were using 10 Gbit/s NICs). It would probably have worked properly in that environment if the magic reassembly hadn't set the DF bit. But since it was causing these problems (and we didn't really have time to investigate further on the VMware side of things...), we simply disabled the packet reassembly on VMware.
To do so, follow the official doc:

1. Log in to the ESXi host or vCenter Server by using the vSphere Client.
2. Navigate to the host in the inventory tree, and on the Configuration tab click Advanced Settings under Software.
3. Select Net and scroll down until you reach parameters starting with Vmxnet.
4. Set the following LRO parameters from 1 to 0:
   - Net.VmxnetSwLROSL
   - Net.Vmxnet3SwLRO
   - Net.Vmxnet3HwLRO
   - Net.Vmxnet2SwLRO
   - Net.Vmxnet2HwLRO
5. Reboot the ESXi/ESX host to apply the changes.
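If you prefer the command line, the same parameters can be set from an ESXi shell with esxcli (a sketch; the option paths are assumed to mirror the parameter names above, so double-check them with `esxcli system settings advanced list` on your host):

```shell
# Disable software and hardware LRO for the vmxnet adapter family
esxcli system settings advanced set -o /Net/VmxnetSwLROSL -i 0
esxcli system settings advanced set -o /Net/Vmxnet3SwLRO  -i 0
esxcli system settings advanced set -o /Net/Vmxnet3HwLRO  -i 0
esxcli system settings advanced set -o /Net/Vmxnet2SwLRO  -i 0
esxcli system settings advanced set -o /Net/Vmxnet2HwLRO  -i 0
```

A reboot of the host is still needed for the changes to take effect.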