As environments grow in size and complexity, maintaining manual operations processes becomes increasingly challenging and costly. Even the most mature companies struggle to achieve the ultimate goal of operations: self-healing autonomous remediation.
Today, let's explore how VMware Aria Operations can create autonomous remediation operations for VMware Cloud Foundation, thereby reducing downtime, increasing availability, and lowering operational costs.
A Use Case for Self-Healing Operations
To demonstrate how to achieve self-healing operations, let's consider a simple use case:
- Monitoring File System Growth: We need to monitor the file system growth of virtual machines;
- Threshold for Action: We want to act when the disk is more than 85% full;
- Automated Cleanup: Delete unnecessary files, such as temporary files, dumps, and installer files, to prevent a disk-full incident;
- Human Interaction: No action should require human interaction.
In this guide, we'll leverage VMware Aria Automation Orchestrator workflows to automatically remediate issues, such as deleting unnecessary files. While the specific workflow will depend on the action you want to perform, the general steps remain the same. Aria Orchestrator provides hundreds of out-of-the-box workflows to cover most use cases.
1. Define the Workflow:
Start by creating an Orchestrator workflow that will handle the remediation, in this case, deleting unnecessary files. The workflow should support a single input representing the object it will act upon, such as VC:VirtualMachine.
Make the workflow available in VMware Aria Operations by mapping your package and binding the workflow to the desired object type. Check my previous post Extending Aria Operations Actions to learn how.
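What the workflow actually deletes is up to you. As a rough sketch of the kind of cleanup logic such a remediation might run inside the guest, assuming hypothetical file patterns and hypothetical helper names (`find_candidates`, `cleanup`); Python is used here purely for illustration and is not taken from the original workflow:

```python
import fnmatch
import os

# Hypothetical patterns for disposable files; adjust to your environment.
CLEANUP_PATTERNS = ("*.tmp", "*.dmp", "*.msi")


def find_candidates(root, patterns=CLEANUP_PATTERNS):
    """Return paths under root whose file name matches any pattern."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if any(fnmatch.fnmatch(name, p) for p in patterns):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


def cleanup(root, patterns=CLEANUP_PATTERNS, dry_run=True):
    """Delete matching files and return the number of bytes reclaimed.

    With dry_run=True nothing is deleted; the size is only estimated.
    """
    reclaimed = 0
    for path in find_candidates(root, patterns):
        reclaimed += os.path.getsize(path)
        if not dry_run:
            os.remove(path)
    return reclaimed
```

Defaulting to `dry_run=True` is a deliberate choice in this sketch: since the final goal is to run without a human in the loop, it is worth checking what would be deleted before letting the workflow delete anything for real.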
2. Define a Recommendation:
Recommendations tell operators how to resolve an issue; in this case, the recommendation will also call the workflow that performs the fix.
Navigate to Configure > Alerts and click Recommendations.
Create a new recommendation, provide a description, and set the adapter type to VMware Aria Automation Orchestrator.
For Action, select the workflow that provides the fix for the issue.
3. Define the Symptom:
Let's move to the Symptoms, where we will describe the problem we are looking for:
Navigate to Configure > Alerts and click Symptom Definitions.
Name the new Symptom, select the metric to monitor (e.g., Guest File System Utilization (%)), and set the threshold, in our case, higher than 85%. Save your settings.
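Conceptually, the symptom is just a threshold condition evaluated against each collected sample. A minimal sketch, assuming a hypothetical `SymptomDefinition` class (this is not the Aria Operations API, only an illustration of the condition being configured):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SymptomDefinition:
    """Hypothetical model of a static-threshold symptom."""
    name: str
    metric: str
    threshold: float  # percent

    def is_triggered(self, value):
        # "Higher than 85%" is a strict comparison: exactly 85.0 does not trigger.
        return value > self.threshold


disk_symptom = SymptomDefinition(
    name="Guest file system above 85%",
    metric="Guest File System Utilization (%)",
    threshold=85.0,
)
```

Leaving some headroom below 100% matters here: the remediation needs time to run before the disk actually fills up.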
4. Define the Alert:
Now it's time to create an Alert, which will combine the Symptom with the Recommendation:
Navigate to Configure > Alerts and click Alert Definition.
Name the new Alert, provide a description, and select the base object type (e.g., Virtual Machine).
Drag and drop the Symptom we created earlier into the appropriate left panel.
Drag and drop the recommendation we created earlier into the recommendations panel.
Attach the Alert to the policy monitoring your source environments and save it.
At this stage, VMware Aria Operations will monitor the VMs for the specified Symptom and create Alerts, providing operators the option to initiate the remediation workflow manually by just clicking the button provided.
It's a great way to help your operations team fix issues more easily while staying in control of when remediation happens.
But it's not autonomous remediation yet. Let's fix that.
5. Enable Automated Remediation:
Automated remediation is controlled by Policies.
To fully automate the remediation process:
Navigate to Configure > Policies and click Policy Definition.
Edit the policy associated with your source environment where you want to enable automation.
Click on Alerts and Symptoms.
Select the object type (e.g., Virtual Machine), find the alert definition we created, and change the Automated column from Deactivated to Activated. Save the policy.
Once configured, VMware Aria Operations will automatically trigger the remediation workflow when the alert is triggered, ensuring continuous monitoring and timely intervention to prevent disk full incidents.
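To make the difference between the manual and automated modes concrete, here is a small sketch of the decision the policy setting controls. The `Alert`, `Policy`, and `handle_alert` names are hypothetical and only model the behavior described above; they are not part of any Aria Operations API:

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Alert:
    """Hypothetical alert: a triggered symptom plus its recommended action."""
    name: str
    resource: str
    action: Callable[[str], None]  # stands in for the Orchestrator workflow


@dataclass
class Policy:
    """Hypothetical policy holding the per-alert 'Automated' setting."""
    automated_alerts: set = field(default_factory=set)

    def is_automated(self, alert_name):
        return alert_name in self.automated_alerts


def handle_alert(alert, policy, pending):
    """Run the action at once if automated; otherwise queue it for an operator."""
    if policy.is_automated(alert.name):
        alert.action(alert.resource)
        return "automated"
    pending.append(alert)
    return "manual"
```

The same alert definition serves both modes; only the policy decides whether the action runs immediately or waits for someone to click the button.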
Here's how it works in practice:
As you can see, I created a dummy file that fills up my file system. Within minutes, Operations identified the issue and remediated it automatically, keeping my system healthy.
There's also a record of executions you can audit:
From the Administration page, check Recent Tasks.
You can filter by the Automated column and check the details of each execution.
This approach brings us closer to the nirvana of self-healing operations, enabling IT teams to focus on more strategic initiatives.