VMware NSX and the dropped packets tale

Day two of a VMware NSX implementation, and I was surrounded by angry network guys asking me: “What have you done?”

As scary as it looked, I kept my cool and started working with them to understand what was going on.

What they had seen was an increase in dropped packets on some network ports connected to physical servers and, of course, they concluded that the NSX implementation was causing the problem... well, yes and no! Let me explain why.

Back to the dropped packets: using a tool like Wireshark, they were seeing packets originating from a MAC address with a VMware prefix and EtherType 0x8922. It was clear that something on the VMware side was generating those packets, so I could not blame them for pointing the finger at us.
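If you want to spot these frames yourself, the Wireshark display filter `eth.type == 0x8922` isolates them. The same check can be sketched in a few lines of Python. This is only an illustration: the sample frame is made up, the function names are my own, and the two VMware MAC prefixes listed are well-known ones but not an exhaustive list.

```python
import struct

ETHERTYPE_VMWARE_PROBE = 0x8922          # EtherType seen in the capture
VMWARE_OUIS = {"00:50:56", "00:0c:29"}   # two well-known VMware prefixes (not exhaustive)


def format_mac(raw: bytes) -> str:
    """Render 6 raw bytes as a lowercase colon-separated MAC address."""
    return ":".join(f"{b:02x}" for b in raw)


def describe_frame(frame: bytes) -> dict:
    """Parse the 14-byte Ethernet II header of an untagged frame."""
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
    src_mac = format_mac(src)
    return {
        "dst": format_mac(dst),
        "src": src_mac,
        "ethertype": ethertype,
        "from_vmware": src_mac[:8] in VMWARE_OUIS,
        "is_probe": ethertype == ETHERTYPE_VMWARE_PROBE,
    }


# Made-up frame: broadcast destination, VMware source prefix, EtherType 0x8922.
sample = bytes.fromhex("ffffffffffff" "005056aabbcc" "8922") + bytes(46)
info = describe_frame(sample)
print(info)
```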

(Note: I could not get the original screenshot, so I am using this one as an example; in the original you would see the MAC address of a physical server as the destination.)

EtherType 0x8922 is generally associated with the Beacon Probing failure detection mechanism, so we checked the vDS to make sure it was not using Beacon Probing. It was not.

Then we realized that, as part of the NSX implementation, we had increased the MTU to 1600 on the ESXi hosts' network ports, as required for VXLAN, and had changed the MTU of their vDS accordingly.
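For anyone wondering why VXLAN needs the bigger MTU, here is the arithmetic as a small sketch. The header sizes are the standard ones for VXLAN over IPv4; the function name and parameters are just mine for illustration. The result shows that 1600 leaves some headroom above the strict minimum.

```python
# Bytes added in front of the original payload when a frame is VXLAN-encapsulated.
INNER_ETHERNET_HEADER = 14  # the original frame's own Ethernet header
VXLAN_HEADER = 8
UDP_HEADER = 8
OUTER_IPV4_HEADER = 20
DOT1Q_TAG = 4               # only if the inner frame keeps its 802.1Q tag


def required_underlay_mtu(inner_mtu: int = 1500, inner_vlan_tag: bool = False) -> int:
    """Smallest underlay IP MTU that carries a full-size inner frame unfragmented."""
    overhead = INNER_ETHERNET_HEADER + VXLAN_HEADER + UDP_HEADER + OUTER_IPV4_HEADER
    if inner_vlan_tag:
        overhead += DOT1Q_TAG
    return inner_mtu + overhead


print(required_underlay_mtu())                     # 1550
print(required_underlay_mtu(inner_vlan_tag=True))  # 1554
```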

The client’s architecture is a single vDS with two 10Gb uplinks handling all port groups and VMkernel adapters. vSphere Distributed Switch Health Check is also enabled on the vDS for VLAN and MTU.

vSphere Distributed Switch Health Check works by sending broadcast packets the size of the MTU configured on the vDS to all its port groups through the ESXi hosts' uplinks.

Since we had increased the MTU only on the ESXi hosts (the client's decision), physical hosts connected to the same segment (VLAN ID) as some port groups were receiving those bigger packets and, as they could not handle them, they dropped them. BINGO!
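To make the mechanics concrete, here is a toy model of what was happening. The host names, VLAN IDs and the `Endpoint` type are all made up for the example; the only rule it encodes is that an endpoint on the probed VLAN drops a probe larger than its own MTU.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    vlan: int
    mtu: int


def endpoints_dropping_probe(endpoints, probe_vlan, probe_size):
    """Names of endpoints on the probed VLAN whose MTU is smaller than the probe."""
    return [
        e.name
        for e in endpoints
        if e.vlan == probe_vlan and e.mtu < probe_size
    ]


# Hypothetical inventory: one ESXi host already at 1600, two physical servers at 1500.
inventory = [
    Endpoint("esxi-01", vlan=100, mtu=1600),
    Endpoint("physical-db-01", vlan=100, mtu=1500),
    Endpoint("physical-app-01", vlan=200, mtu=1500),
]

droppers = endpoints_dropping_probe(inventory, probe_vlan=100, probe_size=1600)
print(droppers)  # ['physical-db-01']
```

Only the physical server sharing the VLAN with the port group shows drops; the one on another VLAN never sees the probe.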

In reality, it was not causing any issue other than polluting their monitoring tool's screen.

To help you avoid or remediate a situation like this, here is some advice:

1 – Keep the MTU size the same across your whole environment. (It's a fix.)

2 – Increase the health check interval from its default; it will reduce the number of dropped packets you see over time. (It only mitigates.)

3 – Disable vSphere Distributed Switch Health Check on the vDS. This is not desirable, as you will lose the ability to be warned when something is broken on the underlay network. (It's a fix.)

4 – Create a vDS only for VXLAN traffic. This is harder to accomplish, as it requires spare uplinks on the hosts. (It's a fix.)
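On option 2, a rough back-of-the-envelope sketch shows why a longer interval only mitigates. The number of probes per cycle is an assumption (I use 1 as a placeholder; the real figure depends on your uplinks and VLANs) and the interval values are illustrative. The point is simply that the drop count scales with how often the check runs and never reaches zero.

```python
def oversized_probes_per_day(interval_minutes: float, probes_per_cycle: int = 1) -> float:
    """Oversized probes an affected port would see per day.

    probes_per_cycle is a placeholder assumption, not a documented value.
    """
    cycles_per_day = (24 * 60) / interval_minutes
    return cycles_per_day * probes_per_cycle


for interval in (1, 5, 30):
    print(f"every {interval} min -> {oversized_probes_per_day(interval):.0f} probes/day")
```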

As you can see, NSX itself was not the culprit, but some of the changes NSX requires led to the situation.

See you guys
