When Kubernetes met VMware: look at the size of those packets

Whether your software counts as "microservices" or not, if it has more than three distinct components, Kubernetes is one way of deploying it on-site to a large number of customers. Assuming you can get them over the Linux-distribution-specific humps of installing Docker, everything from there upwards is then standard and matches what you're getting up to in your own deployments on AWS/Azure/...

Like all abstractions, though, this one does occasionally spring a leak. What follows is one of the most vicious and difficult-to-diagnose leaks I've encountered.

The application in question consisted of a pile of services running in Kubernetes. These were exposed via HTTP to another component running on a separate Windows server. The traffic between the two sides was bidirectional, i.e. each side could call the other over HTTP.

This had been deployed for several dozen customers when we hit a strange problem: HTTP requests from services running inside Kubernetes would take 4 seconds or so to complete. That's 4 seconds for transferring a trivial amount of data across a LAN with all hosts on the same switch. You can well imagine the impact on a modern, reasonably chatty web application.

We quickly established that running those same requests from the command prompt on one of the Kubernetes nodes completed in the expected sub-second time.

At this point (with help from certain colleagues with much more networking knowledge than me), we broke out Wireshark and started looking at traffic captures. You can collect these on a Linux node with the command-line tool "tcpdump" and then open them in Wireshark.

We eventually got to the bottom of the problem: traffic flowing into the Kubernetes pod network from the outside was encountering severe packet fragmentation, with most requests having to be retried until the packets broke down small enough to work.

We fixed (or at least worked around) the problem by increasing the MTU on the Kubernetes pod network to 65,404.

Say what now?

Let me explain. The Maximum Transmission Unit, or MTU, of a network link is the size in bytes of the largest packet which can be transmitted over it in one piece. This is normally set to around 1,500 bytes - and something close to this is the default in the common Kubernetes pod network providers. That's also about what you'd expect on a physical link, be it within your own LAN or something like an ADSL connection to your ISP.
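To put rough numbers on that, here is a minimal sketch of the arithmetic. It assumes IPv4 and TCP headers of 20 bytes each with no options, and the function names are made up for illustration rather than taken from any library.

```python
import math

IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options


def max_segment_payload(mtu: int) -> int:
    """Largest chunk of application data that fits in one packet."""
    return mtu - IPV4_HEADER - TCP_HEADER


def packets_needed(payload_bytes: int, mtu: int) -> int:
    """Number of packets needed to carry payload_bytes at a given MTU."""
    return math.ceil(payload_bytes / max_segment_payload(mtu))


print(max_segment_payload(1_500))      # 1460
print(packets_needed(64_000, 1_500))   # 44
print(packets_needed(64_000, 65_404))  # 1
```

So a response that fits in a single packet at the enormous MTU has to become dozens of packets on a link with a conventional one.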

In this case, the Kubernetes nodes and the Windows box were all virtual machines on the same VMware host. Nothing wrong with that, but of course it meant that the network between them was also virtual rather than physical. No actual transmission over the wire took place, and all packets were routed by the OS on the underlying host.

It would normally be the job of the operating system to break TCP transmissions up into packets and respect the MTU of the link while doing so. However, modern OSes and network cards include a feature called "TCP segmentation offload" (TSO) which allows the OS to simply fling the biggest possible packets the standard allows at the hardware, and leave the network card to split them up for transmission.

You can probably now see where this is going: with segmentation offload enabled (it's the default), the OS at each end would happily write packets at the maximum size an IP packet can have (65,535 bytes) to the network card. It would appear that - at least for transmission between VMs on the same host - VMware's virtual network cards don't bother to break these up, which results in the enormous packets arriving intact at the other end.

That's not "wrong" of VMware - after all, with no physical network link to impose limitations of accuracy/speed, why bother to send things in smaller units?

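Unfortunately, in this case, the receiving VM has to forward those packets onwards into a pod network with a more normal-sized MTU (again, the default). They don't fit, and TCP packets are normally sent with the "don't fragment" flag set, so the VM can't simply chop them up. Instead it drops them and replies with an ICMP "fragmentation needed" message, which is the networking equivalent of "sod off".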
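That rejection can be sketched as a toy model. This is not how any real network stack is written: the `Packet` class and both functions are hypothetical, and the sketch assumes the sender reacts to the rejection by retrying at the size the next hop says it can take.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    size: int                   # total packet size in bytes
    dont_fragment: bool = True  # assumed to be set, as is usual for TCP


def forward(packet: Packet, next_hop_mtu: int) -> str:
    """Decide what a forwarding host does with a packet."""
    if packet.size <= next_hop_mtu:
        return "forwarded"
    if packet.dont_fragment:
        # The packet is dropped and the sender is told why.
        return "fragmentation needed"
    return "fragmented by router"


def send_with_retry(size: int, next_hop_mtu: int) -> list:
    """Retry at the reported MTU until the packet gets through."""
    attempts = []
    while True:
        outcome = forward(Packet(size), next_hop_mtu)
        attempts.append((size, outcome))
        if outcome == "forwarded":
            return attempts
        size = next_hop_mtu  # shrink to what the next hop accepts


for size, outcome in send_with_retry(65_535, 1_500):
    print(size, outcome)
# 65535 fragmentation needed
# 1500 forwarded
```

The model only costs one extra round trip, whereas the delays we saw were far longer, so treat it as an illustration of the mechanism rather than of the timing.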

Cranking the MTU up on the pod network (to slightly less than the theoretical max, to allow for the overheads of Weave, our pod network provider) thus allows the mega packets to pass onwards to their destination container intact.
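As a sanity check on "slightly less than the theoretical max", the sketch below works out how much headroom our chosen value leaves under the largest possible IPv4 packet. The exact number of bytes Weave adds per packet is not something I'll vouch for here, so the `overhead` argument is a placeholder and `fits_after_encapsulation` is a hypothetical helper.

```python
IPV4_MAX_PACKET = 65_535  # largest size the 16-bit IPv4 length field allows
POD_NETWORK_MTU = 65_404  # the value we set on the pod network

# Bytes left over for whatever the overlay wraps around each packet.
headroom = IPV4_MAX_PACKET - POD_NETWORK_MTU
print(headroom)  # 131


def fits_after_encapsulation(inner_size: int, overhead: int) -> bool:
    """True if a pod packet plus overlay overhead is still a legal IPv4 packet."""
    return inner_size + overhead <= IPV4_MAX_PACKET


print(fits_after_encapsulation(POD_NETWORK_MTU, headroom))      # True
print(fits_after_encapsulation(POD_NETWORK_MTU, headroom + 1))  # False
```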

This was not a fun one to work out.