Day two of a VMware NSX implementation and I was surrounded by angry network guys asking me: “What have you done?”
As scary as it looked, I kept my composure and started working with them to understand what was going on.
What they had seen was an increase in dropped packets on some network ports connected to physical servers and, of course, they concluded that the NSX implementation was causing the problem... well, yes and no! Let me explain why.
Back to the dropped packets: using a tool like Wireshark, they were seeing packets originating from a MAC address with a VMware prefix and an EtherType of 0x8922. It was clear that something on the VMware side was generating those packets, so I could not blame them for pointing the finger at us.
(Note: I could not get the original screenshot, so I am using this one as an example; on the original you would see the MAC address of a physical server as the destination.)
EtherType 0x8922 is generally related to the Beacon Probing failure detection mechanism, so we checked the vDS to make sure it was not using Beacon Probing. It was not.
Then we realized that, as part of the NSX implementation, we had increased the MTU size to 1600 on the ESXi hosts' network ports to enable VXLAN, as required, and had changed the MTU size of their vDS accordingly.
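For anyone wondering where 1600 comes from, here is a rough sketch of the VXLAN overhead arithmetic. It assumes an IPv4 underlay and the standard header sizes; the function names are mine, for illustration only. The encapsulation adds 50 bytes on top of a 1500-byte guest MTU (54 with an inner VLAN tag), so 1600 simply leaves some headroom.

```python
# Header sizes in bytes for a VXLAN-encapsulated frame (IPv4 underlay assumed).
INNER_ETHERNET = 14  # Ethernet header of the guest frame carried in the tunnel
VXLAN_HEADER = 8
OUTER_UDP = 8
OUTER_IPV4 = 20


def vxlan_overhead(inner_vlan_tag: bool = False) -> int:
    """Bytes the encapsulation adds on top of the guest's IP MTU."""
    overhead = INNER_ETHERNET + VXLAN_HEADER + OUTER_UDP + OUTER_IPV4
    if inner_vlan_tag:
        overhead += 4  # optional 802.1Q tag on the inner frame
    return overhead


def min_underlay_mtu(guest_mtu: int = 1500, inner_vlan_tag: bool = False) -> int:
    """Smallest underlay MTU that carries a full-size guest packet unfragmented."""
    return guest_mtu + vxlan_overhead(inner_vlan_tag)


if __name__ == "__main__":
    print(min_underlay_mtu())                     # 1550
    print(min_underlay_mtu(inner_vlan_tag=True))  # 1554
```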
The client’s architecture is a single vDS with two 10Gb uplinks handling all port groups and VMkernel adapters. vSphere Distributed Switch Health Check is also enabled on the vDS for VLAN and MTU.
The way vSphere Distributed Switch Health Check works is by sending broadcast packets the size of the MTU configured on the vDS to all its port groups through the ESXi hosts' uplinks.
Since we increased the MTU size only on the ESXi hosts (the client's decision), physical hosts connected to the same segment (VLAN ID) as some port groups were receiving those bigger packets and, as they could not handle them, started to drop them. BINGO!
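The logic behind those drops can be sketched in a few lines. This is a simplified model, not anything from the VMware API: `Endpoint` and `endpoints_dropping_probe` are hypothetical names, and it treats a probe as dropped whenever it is larger than the receiver's MTU, ignoring frame-header details.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    vlan: int
    mtu: int


def endpoints_dropping_probe(probe_size, probe_vlan, endpoints):
    """Names of endpoints on the probe's VLAN that cannot accept a probe this big."""
    return [
        e.name
        for e in endpoints
        if e.vlan == probe_vlan and e.mtu < probe_size
    ]


if __name__ == "__main__":
    # Made-up inventory: one ESXi host and two physical servers.
    segment = [
        Endpoint("esxi-01", vlan=100, mtu=1600),
        Endpoint("physical-db-01", vlan=100, mtu=1500),
        Endpoint("physical-app-01", vlan=200, mtu=1500),
    ]
    # A health check probe sized to the vDS MTU (1600) broadcast on VLAN 100.
    print(endpoints_dropping_probe(1600, 100, segment))  # ['physical-db-01']
    # With the vDS back at 1500 nobody drops anything.
    print(endpoints_dropping_probe(1500, 100, segment))  # []
```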
In reality, it was not causing any issue other than cluttering their monitoring tool's screen.
To help you avoid or remediate a situation like this, here is some advice:
1 – Keep the MTU size the same across your whole environment. (It's a fix.)
2 – Lengthen the default health check interval; it will reduce the number of dropped packets you see over time. (It only mitigates.)
3 – Disable vSphere Distributed Switch Health Check on the vDS. Not desirable, as you will lose the ability to be warned if something is broken on the underlay network. (It's a fix.)
4 – Create a vDS only for VXLAN traffic. Harder to accomplish, as it requires spare uplinks on the hosts. (It's a fix.)
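If you want to check the first item before turning anything on, a small audit over whatever inventory you have can flag the VLANs where MTUs differ. This is only a sketch: the `(name, vlan, mtu)` tuples and the function name are assumptions of mine, and you would have to feed it data exported from your own tooling.

```python
from collections import defaultdict


def vlans_with_mixed_mtu(inventory):
    """Return {vlan: {mtu: [names]}} for every VLAN that has more than one MTU.

    inventory is an iterable of (name, vlan, mtu) tuples.
    """
    by_vlan = defaultdict(lambda: defaultdict(list))
    for name, vlan, mtu in inventory:
        by_vlan[vlan][mtu].append(name)
    return {
        vlan: {mtu: sorted(names) for mtu, names in mtus.items()}
        for vlan, mtus in by_vlan.items()
        if len(mtus) > 1
    }


if __name__ == "__main__":
    # Made-up example data.
    inventory = [
        ("esxi-01", 100, 1600),
        ("physical-db-01", 100, 1500),
        ("physical-app-01", 200, 1500),
        ("physical-app-02", 200, 1500),
    ]
    for vlan, mtus in vlans_with_mixed_mtu(inventory).items():
        print(f"VLAN {vlan} has mixed MTUs: {mtus}")
```

Any VLAN this reports is a segment where MTU-sized health check probes from the larger side will be dropped by the smaller side.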
As you can see, it was not NSX itself, but some changes required by NSX led to the situation.
See you guys