IBM Data and AI Ideas Portal for Customers


Status: Future consideration
Workspace: Spectrum LSF
Components: Scheduling
Created by: Guest
Created on: Feb 1, 2022

Improve Various Components of the Current Scheduler Algorithm

Currently, although the scheduler supports threads, the processing is not one thread per bucket; rather, threads are used on a per-bucket basis and the buckets themselves are still handled serially, which has led us to host-match times of around 400 seconds. What makes more sense to me is a few optimizations that I think would help speed up the host-match phase:

1) For boolean resources, the scheduler should build a shared-memory table, indexed by a numeric hash of the hostname, once at reconfig or restart time. Doing this prevents the table from being rebuilt every scheduling cycle and saves resources. Accommodations would have to be made for dynamic hosts, of course, but no worries. Because the table lives in shared memory, it could be built with multiple threads, speeding up the process.

2) For ELIMs and other variable resources, each scheduling interval should create one table per resource per host for host-based resources, keyed by the same numeric hash of the matching hostnames. In our case, "healthy" is dynamic and appears on every queue and every bucket.

3) Add short-circuit logic to sorting: once you reach an index in the sort order whose values vary widely (r15s, mem, etc.), IGNORE all sort fields after it and stop sorting. This saves time in the SORT phase of matching. Metrics such as -slots take values between 0 and maxCpus, where the probability of two hosts having the same value is much higher, so you should sort beyond them; but metrics like free memory and r15s have such a diverse range of values that it makes no sense to keep sorting once they have been sorted on.
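The per-host boolean-resource table in point 1 could be sketched as follows. This is a Python illustration only: `multiprocessing.shared_memory` stands in for whatever SHMEM API LSF uses internally (which I have no visibility into), the host names are made up, and collision handling plus dynamic-host updates are omitted for brevity.

```python
from multiprocessing import shared_memory
import zlib

NSLOTS = 64  # hash-table slots; a real table would be sized to limit collisions

def slot(hostname: str) -> int:
    # Numeric hash standing in for repeated string compares on hostnames.
    return zlib.crc32(hostname.encode()) % NSLOTS

# Built once at reconfig/restart time, not on every scheduling cycle.
shm = shared_memory.SharedMemory(create=True, size=NSLOTS)
for h in ("hostA", "hostB"):      # hosts that have the boolean resource
    shm.buf[slot(h)] = 1

def has_resource(hostname: str) -> bool:
    # Any scheduler thread or process attached to the segment can answer
    # this lookup by hash index, without touching hostname strings.
    return shm.buf[slot(hostname)] == 1
```

Any worker can attach to the same named segment, so the lookup cost is shared across threads rather than repeated per scheduling cycle.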
4) Instead of having multiple threads work on one bucket at a time, change the algorithm to process X buckets at a time using the SHMEM API and the various tables within it, so that each thread can access the memory tables above for all the various resources, intersecting the resources by their numeric hash rather than by string hostname, which would be dramatically faster.

5) At submission time, normalize the combinedResreq to remove "duplicate" resource requirements like "type == any && type == any" and to strip any "extra bracketing", for example:

((type == any && health == ok) && type == LINUX64 && (((mem > 5000))))

If this is normalized first, the number of comparison operations can be reduced, improving performance. The combinedResreq above should become:

type == any && health == ok && type == LINUX64 && mem > 5000

In fact, type == any should simply be ignored, since it is assumed when absent. LSF should also report the "optimized" combinedResreq to the user. It's so ugly the way it is today. Part of this is not IBM's fault, of course.
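A minimal sketch of the normalization in point 5, assuming a purely &&-joined requirement string (it deliberately ignores ||, select[] wrappers, and LSF's real resreq grammar): drop redundant parentheses, duplicate clauses, and the no-op "type == any".

```python
import re

def normalize(resreq: str) -> str:
    """Flatten a purely &&-joined resource requirement: strip redundant
    parentheses, duplicate clauses, and the no-op 'type == any'.
    A sketch only -- it does not handle '||' or full select[] syntax."""
    # Tokenize into parens, '&&', and atomic comparisons like 'mem > 5000'.
    tokens = re.findall(r"\(|\)|&&|[^()&\s][^()&]*", resreq)
    clauses, seen = [], set()
    for tok in tokens:
        tok = tok.strip()
        if tok in ("(", ")", "&&", ""):
            continue  # bracketing is irrelevant for a pure conjunction
        canon = re.sub(r"\s+", " ", tok)
        if canon == "type == any" or canon in seen:
            continue  # drop the no-op and duplicate clauses
        seen.add(canon)
        clauses.append(canon)
    return " && ".join(clauses)
```

Running it on the example above yields "health == ok && type == LINUX64 && mem > 5000", with every redundant comparison gone before the scheduler ever evaluates it.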
Needed By: Yesterday (Let's go already!)
  • Guest
    Feb 3, 2022

    This tool needs an EDIT button. I'll have Sun Yi log that.

  • Guest
    Feb 3, 2022

    Giving this more thought, it would be interesting to see how this works in the context of a database: you could combine the lshosts and lsload structures and then use database constructs. I'm not sure whether SQLite tables can span threads the way, say, MySQL or MariaDB can, but it's something to think about. A database with X connections lets developers focus on SQL instead of memory-table APIs. But that does not mean you have to use one.
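    To make the database idea concrete, here is a toy sketch using Python's built-in sqlite3 with an in-memory database. The column names and host data are invented for illustration; the point is just that a select[] string maps naturally onto a SQL WHERE over the joined lshosts/lsload tables.

```python
import sqlite3

# Toy in-memory schema; columns are illustrative, not LSF's actual fields.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE lshosts (host TEXT PRIMARY KEY, type TEXT, maxmem INT);
    CREATE TABLE lsload  (host TEXT PRIMARY KEY, r15s REAL, mem INT);
""")
db.executemany("INSERT INTO lshosts VALUES (?,?,?)",
               [("h1", "LINUX64", 128000), ("h2", "LINUX64", 64000)])
db.executemany("INSERT INTO lsload VALUES (?,?,?)",
               [("h1", 0.2, 90000), ("h2", 3.5, 5000)])

# The select[] string becomes a plain SQL WHERE over the joined tables;
# the query planner handles clause ordering for us.
rows = db.execute("""
    SELECT h.host FROM lshosts h JOIN lsload l USING (host)
    WHERE h.type = 'LINUX64' AND l.mem > 10000
""").fetchall()
```

    Only h1 survives the WHERE here. Whether an embedded database could keep up with scheduler-loop rates is a separate question, but it shows how much selection logic a query engine would absorb.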

    Gave the SELECT phase some more thought. In a merge-style SQL WHERE context, but handled with hash-based memory tables, you should always keep track of how many rows are in each matching resource; then, when preparing to down-select hosts, reorder the select (in a legal way) to place the resource matching the smallest number of hosts on the left, and work your way right. This is likely what databases are doing under the covers to handle a SQL WHERE anyway.
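    The reordering idea can be sketched in a few lines. This is a hypothetical illustration: the clause strings and host sets are made up, and the real scheduler would be intersecting hash-table rows rather than Python sets.

```python
def order_by_selectivity(matches: dict[str, set]) -> list:
    """Order resource clauses so the one matching the fewest hosts
    comes first; later intersections then touch far fewer rows."""
    return sorted(matches, key=lambda clause: len(matches[clause]))

def down_select(matches: dict[str, set]) -> set:
    order = order_by_selectivity(matches)
    hosts = set(matches[order[0]])
    for clause in order[1:]:
        hosts &= matches[clause]
        if not hosts:
            break  # short-circuit: nothing can match anymore
    return hosts

# Invented example: row counts per clause drive the evaluation order.
matches = {
    "health == ok":    {"h1", "h2", "h3", "h4"},
    "type == LINUX64": {"h2", "h3", "h4"},
    "mem > 10000":     {"h3"},
}
```

    With the counts above, "mem > 10000" is evaluated first, so the two broader clauses each intersect a one-element set instead of the whole cluster.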

    LSF should have an option to push back on dubious syntax too, something like "select[resourcea && resourceb || resourcec]". Even though this is not syntactically incorrect, it's indicative of the user either not thinking or relying on left-to-right processing. Just a thought.

    Also, before the scheduler allocates threads (and I think this is important), it should look for similar matching resource requirements and create a set of tables that process each of the major patterns into separate memory tables.

    That way, when a bucket is processed, the scheduler can check whether its resource requirements match a pattern and, if so, take the pre-processed results from the pattern-matching memory table instead of reprocessing.

    For example, if you have 10,000 buckets, and all 10,000 have the following:

    "select[type == any && health == ok && mem > 10000 [ && blah ]]"

    Then the matching results of the select, up to the blah, should be placed in a table at the beginning of the process, before dispatching threads to complete the allocations for each bucket.
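    The pattern table amounts to memoizing the expensive host match per canonical select prefix. A small sketch, with `evaluate` standing in for the real matching routine (which I have no visibility into) and invented host names:

```python
# Hypothetical cache: canonical select prefix -> pre-computed host set.
pattern_table: dict[str, frozenset] = {}

def hosts_for(select: str, evaluate) -> frozenset:
    """Compute the matching hosts for a shared select prefix once per
    scheduling cycle, no matter how many buckets reuse it."""
    if select not in pattern_table:
        pattern_table[select] = frozenset(evaluate(select))
    return pattern_table[select]

calls = []
def evaluate(select):
    calls.append(select)          # count how often we really match
    return {"h1", "h2"}

prefix = "type == any && health == ok && mem > 10000"
a = hosts_for(prefix, evaluate)
b = hosts_for(prefix, evaluate)   # second bucket: served from the table
```

    With 10,000 buckets sharing one prefix, the match runs once instead of 10,000 times; the buckets only diverge on whatever follows the shared prefix.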

    So, in summary, we have this in the scheduler loop:

    1. Create tables for each "dynamic" resource

    2. Order the resource requirements by the number of matching rows, fewest on the left and most on the right

    3. Run one sweep to create a list of patterns and mark them in a mapping table

    4. Use X threads to process each of the patterns into their own table

    5. For each bucket, dispatch a group of threads up to X, maintaining X until all buckets are processed

    Then, inside the bucket process:

    1. From the pre-ordered group of resource requirements, search for pre-processed lists and decompose the bucket data as required

    2. For resources that don't match a pattern, obtain the results via a query of shared memory

    3. Process the select string and make an allocation.
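    The scheduler loop summarized above could be skeletonized like this. Every callable here is a placeholder for scheduler internals; `ThreadPoolExecutor` stands in for whatever thread model LSF actually uses, and X is an invented tuning knob, not a real LSF parameter.

```python
from concurrent.futures import ThreadPoolExecutor

X = 4  # worker-thread count; purely illustrative

def schedule_cycle(buckets, dynamic_resources,
                   build_table, find_patterns, process_bucket):
    """Skeleton of the proposed loop; the callables are stand-ins."""
    tables = {r: build_table(r) for r in dynamic_resources}   # step 1
    patterns = find_patterns(buckets)                         # steps 3-4
    with ThreadPoolExecutor(max_workers=X) as pool:           # step 5
        return list(pool.map(
            lambda b: process_bucket(b, tables, patterns), buckets))

# Toy run with trivial stand-ins, just to show the shape of the loop.
results = schedule_cycle(
    buckets=[1, 2, 3],
    dynamic_resources=["mem"],
    build_table=lambda r: set(),
    find_patterns=lambda bs: {},
    process_bucket=lambda b, tables, patterns: b * 2,
)
```

    The key property is that the tables and pattern results are built once, before the pool fans out, so the per-bucket work is read-only against shared state.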

    I have no visibility into what the algorithm is today; these are just my thoughts on how you can:

    1. Not have to perform a down-select of hosts more than once

    2. Minimize wasted cycles

    3. Parallelize the selection/allocation processing time

    4. Make the best use of memory

    That's enough for this morning.

  • Guest
    Feb 2, 2022

    Man, all that nice formatting was lost. I was going to add another RFE, but instead, I'm just going to type it in here.

    LIM fidelity should also be improved by putting the ls_info, ls_host, and ls_load structures in shared memory, accessible to MBD/MBSCHED. Since the shmem API supports locks, there is no reason that MBD could not use these structures.

    In addition, if you were to handle every LIM host update or registration in its own thread, registering/updating the shmem tables as it goes, you could increase the fidelity and frequency of LIM updates, allowing, say, a 30-second update frequency from a 10,000-node cluster.

    Doing the math, that would mean roughly 333 updates per second for the master LIM, and if the master LIM used 33 threads, that would be about 10 per second per thread, which is not a big number actually. Using shmem and locking, each transaction would take about 40 ns or so, which means that the "effective" rate for a 10,000-node cluster with 33 threads could be even higher.
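    The arithmetic above checks out directly (the figures are the ones from this comment, not measured values):

```python
nodes = 10_000
interval_s = 30        # target full-cluster LIM update period
threads = 33

updates_per_sec = nodes / interval_s       # load on the master LIM (~333/s)
per_thread = updates_per_sec / threads     # work per thread (~10/s)
```

    At roughly 10 updates per second per thread, each thread has about 100 ms per update, which leaves enormous headroom if a locked shmem write really costs on the order of 40 ns.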

    If 40 ns is in fact the per-host time required to update the shmem database, we could get a theoretical update frequency of twice a second. Now, that's a bit much, but that would be the upside, the upper limit on how low the sampling interval could be pushed. Scary shit.

    All that is needed is to measure, with TIMEIT(), the per-host time to make the database update, and then do the math. Based on that time, you can calculate the upper limit on the number of threads you could theoretically support, since these updates lock at the table level (shared memory named segment). Math is fun.