[EGO][System] - Improve cluster restart speed on large clusters with unavailable hosts

Restarting the master in a large-scale cluster takes a long time, even when no workload is running. The problem is more pronounced in clusters with many scavenged hosts.
For example, in a cluster with ~2000 scavenging hosts, it can take more than 30 minutes to restart EGO and see the cluster become available. This is unacceptable for a production cluster and a significant burden for the maintenance team.
Our investigation so far has found that the following points can be improved:
1) On restart, the master LIM checks all compute hosts first, before trying to start EGO.
We saw the master LIM go through each dynamic host in the hostcache before trying to claim mastership and start VEMKD. Because many hosts in the hostcache file may be unavailable, this operation can take a very long time, especially when there are many scavenging hosts.

We think management hosts should be started first, before the low-priority dynamic compute hosts are checked.
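
The ordering we have in mind can be sketched as follows. This is only an illustration of the idea, not how LIM is implemented: the `Host` record, the role labels and the `ROLE_PRIORITY` table are all hypothetical names made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    name: str
    role: str  # hypothetical labels: "management", "compute", "scavenging"


# Lower number = contacted earlier during restart (assumed priorities).
ROLE_PRIORITY = {"management": 0, "compute": 1, "scavenging": 2}


def startup_order(hosts):
    """Return hosts sorted so that management hosts are checked first.

    sorted() is stable, so hosts keep their hostcache order within a role.
    Unknown roles are placed last.
    """
    return sorted(hosts, key=lambda h: ROLE_PRIORITY.get(h.role, len(ROLE_PRIORITY)))


# Example hostcache content (invented host names).
hostcache = [
    Host("desk042", "scavenging"),
    Host("mgmt01", "management"),
    Host("srv17", "compute"),
    Host("desk007", "scavenging"),
]

ordered_names = [h.name for h in startup_order(hostcache)]
print(ordered_names)  # ['mgmt01', 'srv17', 'desk042', 'desk007']
```

With this order, mastership and VEMKD startup would depend only on the small set of management hosts, and the slow pass over scavenging hosts could happen afterwards.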

2) Automatically remove unavailable hosts from the hostcache file based on Resource Group
EGO_DYNAMIC_HOST_TIMEOUT can help by automatically removing unavailable hosts from the hostcache file, but we can't use it in a cluster that contains both servers and scavenging desktops.

We only want to remove unavailable scavenging hosts, not servers.

If unavailable servers are removed, a 100% utilization graph no longer tells us that all servers are in a good state, because the removed servers are no longer counted.

What we need is to set this parameter at the Resource Group level instead of the cluster level. For example, we would set it only for the Scavenging Resource Group, not for the ComputeHosts or ManagementHosts RG.
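
A minimal sketch of the behaviour we are asking for is below. It does not reflect any existing EGO configuration syntax: `CacheEntry`, `prune_hostcache` and the per-group timeout table are hypothetical, and the hour unit is chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CacheEntry:
    name: str
    resource_group: str
    hours_unavailable: float  # 0 means the host is currently available


def prune_hostcache(entries: List[CacheEntry],
                    timeout_hours_by_group: Dict[str, float]) -> List[CacheEntry]:
    """Drop entries that exceeded the timeout of their own resource group.

    Groups without a configured timeout are never pruned, so servers stay
    in the hostcache even while they are unavailable.
    """
    kept = []
    for entry in entries:
        timeout = timeout_hours_by_group.get(entry.resource_group)
        if timeout is not None and entry.hours_unavailable >= timeout:
            continue  # stale host in a group that allows removal
        kept.append(entry)
    return kept


# Hypothetical setting: only the Scavenging group has a timeout.
timeouts = {"Scavenging": 24.0}

entries = [
    CacheEntry("desk001", "Scavenging", 48.0),   # removed: past the timeout
    CacheEntry("desk002", "Scavenging", 2.0),    # kept: recently seen
    CacheEntry("srv01", "ComputeHosts", 72.0),   # kept: group has no timeout
    CacheEntry("mgmt01", "ManagementHosts", 0.0),
]

remaining = [e.name for e in prune_hostcache(entries, timeouts)]
print(remaining)  # ['desk002', 'srv01', 'mgmt01']
```

Here `srv01` stays in the hostcache although it has been down longer than any desktop, so an unavailable server remains visible to monitoring.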

3) Improve LIM polling performance for unavailable hosts
So far, it has consistently taken 2-3 seconds for the master LIM to poll each unavailable host. Because hundreds of scavenging hosts are typically unavailable at any time, going through all of them takes a long time. We would like EGO to reduce the time spent on unavailable hosts.
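
To put numbers on this: at 2-3 seconds each, 500 unavailable hosts (an assumed count, for illustration) would cost roughly 1000-1500 seconds if polled one after another. One possible direction, shown below purely as a sketch, is to poll hosts concurrently so that the timeouts overlap. We do not know how LIM polls internally; `poll_host` is a stand-in that simulates a timeout with `time.sleep`, and the timeout is shortened so the example runs quickly.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def poll_host(name, unavailable, timeout_s):
    """Stand-in for a poll: an unavailable host blocks for the full timeout."""
    if name in unavailable:
        time.sleep(timeout_s)
        return name, False
    return name, True


def poll_all(hosts, unavailable, timeout_s, workers):
    """Poll every host with up to `workers` polls in flight at once."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda h: poll_host(h, unavailable, timeout_s), hosts)
        return dict(results)


hosts = [f"desk{i:03d}" for i in range(20)]
unavailable = set(hosts[:10])  # half of the hosts never answer
timeout_s = 0.05               # shortened stand-in for the 2-3 s we observed

start = time.perf_counter()
sequential = poll_all(hosts, unavailable, timeout_s, workers=1)
sequential_s = time.perf_counter() - start

start = time.perf_counter()
parallel = poll_all(hosts, unavailable, timeout_s, workers=10)
parallel_s = time.perf_counter() - start

print(f"sequential: {sequential_s:.2f}s, parallel: {parallel_s:.2f}s")
```

Both runs report the same host states; only the elapsed time differs, since in the concurrent run the ten timeouts overlap instead of adding up.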

Guest | Jun 22, 2020

The enhancement has been available since the Symphony 7.2 release.

Guest | Jan 24, 2018

This RFE's Headline was changed after submission to reflect the headline of an internal request we were already considering, but will now track here.

Guest | Aug 20, 2014

Creating a new RFE based on Community RFE #57026 in product Platform Symphony.
