Monday, June 13, 2022

Cloud Assembly - Kubernetes EXITED

For the past few days my vRealize Automation Cloud has been broken, mainly because there was an error with my Cloud Proxy preventing it from connecting back to my on-premises vCenter.

Checking the Cloud Proxy details, I could see that Cloud Assembly - Kubernetes (cloudassembly-cmx-agent) had an EXITED status.

Even though the UI logs provide a clear error message, "Error generating auth token, status code: 400", I still had no idea how to fix it.

Checking the container's log directly on the Cloud Proxy showed the same message.

I was running out of ideas since my searches turned up nothing: no public KB, internal notes, documentation, or blog posts out there... nothing related to this error and how to fix it.

Of course I tried restarting the container, rebooting the appliance, and even provisioning a few extra Cloud Proxies, all with the same error.

At this point I started to think it was something else, maybe environmental... that's when it struck me: my whole lab lives inside a bubble, including my internal NTP server.

Checking this baby, I realized it was 5 hours behind... which hadn't caused any issue with my internal systems, but since the Cloud Proxy connects back to the external world... that might be it. Without much hope, I adjusted my NTP server's time and synchronized everything back to it.
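To illustrate why a large clock skew can break token generation, here's a toy Python sketch. The 5-minute tolerance and all names here are my assumptions for illustration, not the actual Cloud Proxy logic:

```python
import datetime

# Assumed tolerance: many token-based auth schemes reject requests whose
# timestamp drifts more than a few minutes from the server clock
MAX_SKEW = datetime.timedelta(minutes=5)

def validate_timestamp(request_time, server_time, max_skew=MAX_SKEW):
    """Accept a request only if its clock is within the allowed skew."""
    return abs(server_time - request_time) <= max_skew

server_now = datetime.datetime(2022, 6, 13, 12, 0, 0)

# A client in sync (30 seconds off) passes the check
print(validate_timestamp(server_now + datetime.timedelta(seconds=30), server_now))

# A client 5 hours behind (like my lab NTP server) is rejected, which
# would surface to the caller as an auth failure such as an HTTP 400
print(validate_timestamp(server_now - datetime.timedelta(hours=5), server_now))
```

The point is that the skew check fails on the absolute difference, so it doesn't matter whether the clock is ahead or behind.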

As you might guess, cloudassembly-cmx-agent was back up and running.

Yeah, I know... sometimes it's the basics. The whole point of this post is to document that such an unusual error message could come down to something as simple as your time settings, and hopefully it will save you some precious troubleshooting hours.

See you guys

Friday, April 1, 2022

vRealize Automation fails to remove machines from Ansible Inventory

Recently I've been working with one of my customers to create a fully automated offering on their Cloud Management Portal for their end-users to consume.
vRealize Automation (vRA) is their cloud management choice, not only because of its multi-cloud, governance, and ease-of-use capabilities, but also because of its powerful extensibility options, providing all the integrations and automation to deliver fully compliant, customized workloads ready for production.

For their configuration management they decided to use Ansible playbooks, not a problem for vRA and its native Ansible integration. So when a VM gets created, some playbooks run to harden the VM and apply some configuration. So far, so good.

But when deleting the VM we got an error: the deletion failed because the VM could not be removed from the Ansible inventory first.
Checking the vRA deployment logs we can see: Unable to parse inventory to obtain existing groups JSON for host: "hostname" in inventory "inventory_path". Ensure inventory is valid and host exists. Refer to logs located at: /var/tmp/vmware/provider/user_defined_script/<Deployment ID> on Ansible Control Machine for more details.

Checking the Ansible inventory, we confirmed the VM was indeed still there, and the vRA deployment could not proceed with deleting it.

First we made sure all the requirements were in place. They were!

But what stood out was the message saying it could not parse the JSON. Was there anything wrong with the JSON?

So we went back to Ansible and ran the callback manually to make sure it was returning the right information.
We ran: ANSIBLE_STDOUT_CALLBACK=json ANSIBLE_LOAD_CALLBACK_PLUGINS=true ansible "VM" -m debug -a var=group_names -i "inventory_path_file"

To our surprise, there was an extra line outside of the JSON with the timer information.
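To see why that extra line matters, here's a minimal Python sketch reproducing the parse failure vRA hits (the JSON payload, hostname, and timer text are made up for illustration):

```python
import json

# Simulated callback output: valid JSON followed by a stray summary line
# appended by the "timer" callback plugin ("web01" is a made-up hostname)
raw_output = """{
    "plays": [],
    "stats": {"web01": {"ok": 1, "failures": 0}}
}
Playbook run took 0 days, 0 hours, 0 minutes, 12 seconds"""

try:
    json.loads(raw_output)
except json.JSONDecodeError as err:
    # One trailing non-JSON line is enough to make the whole stdout
    # invalid as a JSON document
    print(f"parse failed: {err.msg}")
```

Any consumer that expects the callback's stdout to be pure JSON, as vRA does, will fail the same way.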

It might be something in Ansible's configuration!
After some serious analysis and testing, we found the configuration section for callback plugins, and one of the enabled plugins was timer.

So we removed timer from the callback_whitelist option.
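The change, roughly, in ansible.cfg on the Ansible Control Machine (the file location and whether any other plugins belong in the list depend on your install):

```ini
[defaults]
# Before (the stray summary line came from the timer plugin):
#   callback_whitelist = timer
# After: remove timer from the list, or drop the option entirely
# if no other callback plugins are needed
callback_whitelist =
```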

Running the callback command again, we confirmed the JSON came out clean, and the vRA deletion worked as expected.

Curiously enough, this requirement is not in the vRA Ansible requirements documentation. To be honest, I'm not sure whether it was something specific to this customer's implementation or Ansible version, but I'll mention it internally, possibly for a bug fix. Either way, now you know how to fix it.

A shout-out to my buddy Sean Leahy, who worked with us all the way on this journey.

Wednesday, February 16, 2022

Tanzu Kubernetes Cluster creation gets stuck


I've been playing with Tanzu Kubernetes Clusters (TKC) on vSphere with Tanzu since vSphere 7.0 GA. Recently, to be honest, for a few months now, I could not create any Guest Clusters anymore. It does not matter if I'm using the v1alpha1 or the new v1alpha2 API, and it does not matter if my environment is based on NSX or vDS.

When I try to create my Guest Cluster, the control plane gets provisioned and customized successfully, but nothing else happens: my worker nodes are never provisioned and the cluster status remains in the creating phase.

The only message I see is in vCenter: error creating client and cache for remote cluster: error creating dynamic rest mapper for remote cluster: Get "": dial tcp: connect: connection refused.

I did countless tests until I finally found the issue.
In my descriptor file, I was using a custom VM Class. You might remember, I wrote about it too.
It turns out there's a bug when using custom VM Classes with Guest Clusters; when I went back to using the built-in ones, my cluster got created successfully.

Until this bug is fixed, make sure you use the built-in VM Classes instead of custom ones.
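For reference, a trimmed TanzuKubernetesCluster spec pinned to a built-in VM Class might look like this (cluster name, namespace, storage class, and replica counts are placeholders; check the field names against your release's v1alpha2 API):

```yaml
apiVersion: run.tanzu.vmware.com/v1alpha2
kind: TanzuKubernetesCluster
metadata:
  name: tkc-demo                    # placeholder cluster name
  namespace: demo-ns                # placeholder vSphere Namespace
spec:
  topology:
    controlPlane:
      replicas: 1
      vmClass: best-effort-small    # built-in class, not a custom one
      storageClass: vsan-default    # placeholder storage class
    nodePools:
      - name: workers
        replicas: 3
        vmClass: best-effort-small  # built-in class here too
        storageClass: vsan-default
```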
I hope this post helps someone, it took me literally months to figure this out.

See you next time

Friday, February 4, 2022

VMware Identity Manager and Delegate IP

While working with one of my customers to deploy a new automation platform (vRealize Automation), which will provision and manage multi-cloud resources on AWS, Google Cloud, and vSphere for hundreds of end-users through a real self-service portal that gives them freedom and agility, we decided it was a good idea to make the solution highly available.

You might recall when I talked about scaling out VMware Identity Manager (vIDM) to provide high availability. At that time I covered mostly the load balancer health checks for the services, but there's an extra requirement: the delegate IP.

First things first, what is the delegate IP?

When you have vIDM in cluster mode, it also clusters its internal Postgres database. The delegate IP is the active IP receiving requests, and it will float between the nodes when needed.

So far so good, but what's the problem?

What was not clear is whether this delegate IP needs an external load balancer or not. In fact, the documentation points to the Identity Manager load balancing documentation... and to your surprise, there's no mention of any requirements to set up this service.

Even more detailed documentation about vIDM load balancing requirements shows no evidence of the need for one.

So, to settle anyone's doubt:

There's NO need for an external load balancer for the delegate IP; the nodes themselves manage it.

You do still need an extra free IP on the same segment where your vIDM nodes are provisioned.

Be safe, people!

Who am I

I’m an IT specialist with over 15 years of experience, working from IT infrastructure to management products, with troubleshooting and project management skills in medium to large environments. Nowadays I'm working for VMware as a Consulting Architect, helping customers embrace the Cloud Era and make them successful on their journey. Despite the fact that I'm a VMware employee, these postings reflect my own opinion and do not represent VMware's positions, strategies, or opinions. Reach me at @dumeirell
