Tuesday, April 19, 2016

VMware NSX and the dropped packets tale

Day two of a VMware NSX implementation, and I was surrounded by angry network guys asking me: “What have you done?”
As scary as it looked, I kept my cool and started working with them to understand what was going on.

What they had seen was an increase in dropped packets on some network ports connected to physical servers and, of course, they concluded that the NSX implementation was causing the problem... well, yes and no! Let me explain why.

Back to the dropped packets: using a tool like Wireshark, they were seeing packets of type 0x8922 originating from a MAC address with a VMware prefix. It was clear something on the VMware side was generating those packets, so I could not blame them for pointing at us.

(Note: I could not get the original screenshot, so I am using this one as an example; in the original you would see the MAC address of a physical server as the destination.)

Type 0x8922 is generally associated with the Beacon Probing failure detection mechanism, so we checked the vDS to make sure it was not using Beacon Probing. It was not.
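If you want to run the same check on a raw capture yourself, here is a minimal sketch in Python. It only parses the 14-byte Ethernet header of an untagged frame. The OUI list is an assumption (a few prefixes commonly assigned to VMware, not an exhaustive list), and the sample frame and its MAC addresses are made up for illustration.

```python
import struct

# Assumption: a few OUI prefixes commonly assigned to VMware (not exhaustive).
VMWARE_OUIS = {"00:50:56", "00:0c:29", "00:05:69"}
PROBE_ETHERTYPE = 0x8922  # the type seen in the capture


def format_mac(raw: bytes) -> str:
    """Render 6 raw bytes as a lowercase colon-separated MAC address."""
    return ":".join(f"{b:02x}" for b in raw)


def classify_frame(frame: bytes) -> dict:
    """Parse the Ethernet header of an untagged frame and flag 0x8922 probes."""
    if len(frame) < 14:
        raise ValueError("frame shorter than an Ethernet header")
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
    src_mac = format_mac(src)
    return {
        "src": src_mac,
        "dst": format_mac(dst),
        "ethertype": ethertype,
        "from_vmware": src_mac[:8] in VMWARE_OUIS,
        "is_probe": ethertype == PROBE_ETHERTYPE,
    }


# Hypothetical frame: physical server as destination, VMware MAC as source.
sample = bytes.fromhex("001122334455" "005056aabbcc" "8922") + b"\x00" * 46
info = classify_frame(sample)
print(info)
```

A frame with both flags set matches what the network team was seeing, but it does not say which feature sent it. That still has to be checked on the vDS.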

Then we realized that, as part of the NSX implementation, we had increased the MTU size to 1600 on the ESXi hosts' network ports, as required to enable VXLAN. We had also changed the MTU size of the vDS accordingly.
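For those wondering where the bigger MTU comes from, the arithmetic can be sketched as below. It assumes a standard 1500-byte guest MTU and an IPv4 underlay; the header sizes are the usual Ethernet, VXLAN, UDP and IPv4 ones, and the function name is just mine.

```python
# Header sizes in bytes (IPv4 underlay assumed).
INNER_ETHERNET = 14   # guest frame header carried inside the tunnel
INNER_VLAN_TAG = 4    # only if the guest frame keeps an 802.1Q tag
VXLAN_HEADER = 8
OUTER_UDP = 8
OUTER_IPV4 = 20


def required_underlay_mtu(inner_mtu: int = 1500, inner_vlan_tag: bool = False) -> int:
    """IP MTU the underlay needs to carry a full-size guest frame over VXLAN."""
    size = inner_mtu + INNER_ETHERNET + VXLAN_HEADER + OUTER_UDP + OUTER_IPV4
    if inner_vlan_tag:
        size += INNER_VLAN_TAG
    return size


untagged = required_underlay_mtu()
tagged = required_underlay_mtu(inner_vlan_tag=True)
print(untagged, tagged)                 # 1550 1554
print(1600 - untagged, 1600 - tagged)   # headroom left with MTU 1600: 50 46
```

So 1600 is simply a round number with some headroom above the roughly 1550 bytes VXLAN really needs.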

The client’s architecture is a single vDS with two 10Gb uplinks handling all port groups and VMkernel ports. vSphere Distributed Switch Health Check is also enabled on the vDS for VLAN and MTU.

vSphere Distributed Switch Health Check works by sending broadcast packets the size of the MTU configured on the vDS to all its port groups through the ESXi hosts' uplinks.

Since we increased the MTU size only on the ESXi hosts (the client's decision), physical hosts connected to the same segment (VLAN ID) as some port groups were receiving those bigger packets and, as they could not handle them, they started dropping them. BINGO!
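The situation can be modeled with a toy sketch like the one below. It is a simplification: it treats the probe size as equal to the vDS MTU and ignores header details, and all names, VLAN IDs and MTU values are made up.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    vlan: int
    mtu: int


def endpoints_dropping(probe_size: int, vlan: int, endpoints: list) -> list:
    """Names of endpoints on the VLAN whose MTU is smaller than the probe."""
    return [e.name for e in endpoints if e.vlan == vlan and e.mtu < probe_size]


# Hypothetical inventory: only the ESXi host had its MTU raised.
inventory = [
    Endpoint("esxi-01", vlan=100, mtu=1600),
    Endpoint("physical-db-01", vlan=100, mtu=1500),
    Endpoint("physical-web-01", vlan=200, mtu=1500),
]

dropping = endpoints_dropping(probe_size=1600, vlan=100, endpoints=inventory)
print(dropping)
```

Only the physical server sharing VLAN 100 with the port group shows drops; the one on VLAN 200 never sees the probe at all.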

In reality, it was not causing any issue other than polluting their monitoring tool's screen.

To help you avoid or remediate a situation like this, here is some advice:

1 – Keep the MTU size the same across your whole environment. (it's a fix)
2 – Increase the default health check interval; it will reduce the number of dropped packets you see over time. (it just remediates)
3 – Disable vSphere Distributed Switch Health Check on the vDS. Not desirable, as you will lose the ability to be warned if something is broken on the underlay network. (it's a fix)
4 – Create a vDS only for VXLAN traffic. Harder to accomplish, as it requires spare uplinks on the hosts. (it's a fix)
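To show why option 2 only remediates, here is a rough back-of-the-envelope sketch. It assumes one oversized probe reaches a given physical port per health check interval, which is an assumption for illustration and not a documented figure; the intervals are examples too.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def probes_per_day(interval_minutes: int) -> int:
    """Probes per day, assuming one probe per interval (illustrative only)."""
    if interval_minutes <= 0:
        raise ValueError("interval must be positive")
    return SECONDS_PER_DAY // (interval_minutes * 60)


for minutes in (1, 5, 30, 60):
    print(f"{minutes:>2} min interval -> {probes_per_day(minutes)} drops/day per port")
```

A longer interval makes the counters grow more slowly, but they never stop growing. Only options 1, 3 and 4 bring them to zero.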

As you can see, it was not NSX itself, but some changes required by NSX led to the situation.
See you guys!


Anonymous said...

Not sure how this is not really related to NSX, because it is. It's one of the takeaways when dealing with an SDN that works only with an ESXi environment, without considering the underlay or any third-party standard networks in the industry.

Eduardo Meirelles da Rocha said...

Let me explain why it's not NSX related.
The issue is different MTU sizes on the network ports connected to the same vDS where vDS Health Check is enabled. Even if NSX were not installed, the behavior would be the same.
Clear?

Who am I

I’m an IT specialist with over 15 years of experience, ranging from IT infrastructure to management products, with troubleshooting and project management skills in medium to large environments. Nowadays I'm working for VMware as a Consulting Architect, helping customers embrace the Cloud Era and succeed on their journey. Although I'm a VMware employee, these postings reflect my own opinion and do not represent VMware's positions, strategies or opinions. Reach me at @dumeirell