IBM Data and AI Ideas Portal for Customers


Status: Future consideration
Workspace: Spectrum LSF
Components: Scheduling
Created by: Guest
Created on: Feb 1, 2022

Improve Various Components of the Current Scheduler Algorithm

Currently, although the scheduler supports threads, the processing is not one thread per bucket; rather, threads are used on a per-bucket basis and the buckets themselves are still handled serially, which has led us to host-match times of around 400 seconds. What makes more sense to me is a few optimizations that I think would help speed up the host-match phase:

1) For boolean resources, the scheduler should build a shared-memory table, indexed by a numeric hash of the hostname, once at reconfig or restart time. Doing this prevents the table from being rebuilt every scheduling cycle and saves resources. Accommodations would have to be made for dynamic hosts, of course, but no worries. Because the table lives in shared memory, it could be built with multiple threads, speeding up the process.

2) For ELIMs and other variable resources, each scheduling interval should create one table per resource per host for host-based resources, keyed by the same numeric hash of the matching hostnames. In our case, "healthy" is dynamic and appears on every queue and every bucket.

3) Add short-circuit logic to sorting: once you reach an index in the sort order whose values vary widely (r15s, mem, etc.), IGNORE all sort fields after it and stop sorting. This saves time in the SORT phase of matching. Metrics such as -slots take values between 0 and maxCpus, where the probability of two hosts having the same value is much higher, so you should sort beyond them; but metrics like free memory and r15s have such a diverse range of values that it makes no sense to keep sorting once they have been sorted on.
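The per-host boolean-resource table in point 1 could be sketched as follows. This is a Python illustration only: `multiprocessing.shared_memory` stands in for whatever SHMEM API LSF uses internally (which I have no visibility into), the host names are made up, and collision handling plus dynamic-host updates are omitted for brevity.

```python
from multiprocessing import shared_memory
import zlib

NSLOTS = 64  # hash-table slots; a real table would be sized to limit collisions

def slot(hostname: str) -> int:
    # Numeric hash standing in for repeated string compares on hostnames.
    return zlib.crc32(hostname.encode()) % NSLOTS

# Built once at reconfig/restart time, not on every scheduling cycle.
shm = shared_memory.SharedMemory(create=True, size=NSLOTS)
for h in ("hostA", "hostB"):      # hosts that have the boolean resource
    shm.buf[slot(h)] = 1

def has_resource(hostname: str) -> bool:
    # Any scheduler thread or process attached to the segment can answer
    # this lookup by hash index, without touching hostname strings.
    return shm.buf[slot(hostname)] == 1
```

Any worker can attach to the same named segment, so the lookup cost is shared across threads rather than repeated per scheduling cycle.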
4) Instead of having multiple threads work on one bucket at a time, change the algorithm to process X buckets at a time using the SHMEM API and the various tables within it, so that each thread can access the memory tables above for all the various resources, intersecting the resources by their numeric hash rather than by string hostname, which would be dramatically faster.

5) At submission time, normalize the combinedResreq to remove "duplicate" resource requirements like "type == any && type == any" and to strip any "extra bracketing", for example:

((type == any && health == ok) && type == LINUX64 && (((mem > 5000))))

If this is normalized first, the number of comparison operations can be reduced, improving performance. The combinedResreq above should become:

type == any && health == ok && type == LINUX64 && mem > 5000

In fact, type == any should simply be ignored, since it is assumed when absent. LSF should also report the "optimized" combinedResreq to the user. It's so ugly the way it is today. Part of this is not IBM's fault, of course.
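A minimal sketch of the normalization in point 5, assuming a purely &&-joined requirement string (it deliberately ignores ||, select[] wrappers, and LSF's real resreq grammar): drop redundant parentheses, duplicate clauses, and the no-op "type == any".

```python
import re

def normalize(resreq: str) -> str:
    """Flatten a purely &&-joined resource requirement: strip redundant
    parentheses, duplicate clauses, and the no-op 'type == any'.
    A sketch only -- it does not handle '||' or full select[] syntax."""
    # Tokenize into parens, '&&', and atomic comparisons like 'mem > 5000'.
    tokens = re.findall(r"\(|\)|&&|[^()&\s][^()&]*", resreq)
    clauses, seen = [], set()
    for tok in tokens:
        tok = tok.strip()
        if tok in ("(", ")", "&&", ""):
            continue  # bracketing is irrelevant for a pure conjunction
        canon = re.sub(r"\s+", " ", tok)
        if canon == "type == any" or canon in seen:
            continue  # drop the no-op and duplicate clauses
        seen.add(canon)
        clauses.append(canon)
    return " && ".join(clauses)
```

Running it on the example above yields "health == ok && type == LINUX64 && mem > 5000", with every redundant comparison gone before the scheduler ever evaluates it.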
Needed By: Yesterday (Let's go already!)
  • Guest
    Feb 3, 2022

    This tool needs an EDIT button. I'll have Sun Yi log that.

  • Guest
    Feb 3, 2022

    Giving this more thought, it would be interesting to see how this works in the context of a database: you could combine the lshosts and lsload structures and then use database constructs. I'm not sure whether SQLite tables can span threads the way, say, MySQL or MariaDB can, but it's something to think about. A database with X connections lets developers focus on SQL instead of memory-table APIs. But that does not mean you have to use one.
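    To make the database idea concrete, here is a toy sketch using Python's built-in sqlite3 with an in-memory database. The column names and host data are invented for illustration; the point is just that a select[] string maps naturally onto a SQL WHERE over the joined lshosts/lsload tables.

```python
import sqlite3

# Toy in-memory schema; columns are illustrative, not LSF's actual fields.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE lshosts (host TEXT PRIMARY KEY, type TEXT, maxmem INT);
    CREATE TABLE lsload  (host TEXT PRIMARY KEY, r15s REAL, mem INT);
""")
db.executemany("INSERT INTO lshosts VALUES (?,?,?)",
               [("h1", "LINUX64", 128000), ("h2", "LINUX64", 64000)])
db.executemany("INSERT INTO lsload VALUES (?,?,?)",
               [("h1", 0.2, 90000), ("h2", 3.5, 5000)])

# The select[] string becomes a plain SQL WHERE over the joined tables;
# the query planner handles clause ordering for us.
rows = db.execute("""
    SELECT h.host FROM lshosts h JOIN lsload l USING (host)
    WHERE h.type = 'LINUX64' AND l.mem > 10000
""").fetchall()
```

    Only h1 survives the WHERE here. Whether an embedded database could keep up with scheduler-loop rates is a separate question, but it shows how much selection logic a query engine would absorb.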

    Gave the SELECT phase some more thought. In a merge-style SQL WHERE context, but handled with hash-based memory tables, you should always keep track of how many rows are in each matching resource; then, when preparing to down-select hosts, reorder the select (in a legal way) to place the resource matching the smallest number of hosts on the left, and work your way right. This is likely what databases are doing under the covers to handle a SQL WHERE anyway.
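    The reordering idea can be sketched in a few lines. This is a hypothetical illustration: the clause strings and host sets are made up, and the real scheduler would be intersecting hash-table rows rather than Python sets.

```python
def order_by_selectivity(matches: dict[str, set]) -> list:
    """Order resource clauses so the one matching the fewest hosts
    comes first; later intersections then touch far fewer rows."""
    return sorted(matches, key=lambda clause: len(matches[clause]))

def down_select(matches: dict[str, set]) -> set:
    order = order_by_selectivity(matches)
    hosts = set(matches[order[0]])
    for clause in order[1:]:
        hosts &= matches[clause]
        if not hosts:
            break  # short-circuit: nothing can match anymore
    return hosts

# Invented example: row counts per clause drive the evaluation order.
matches = {
    "health == ok":    {"h1", "h2", "h3", "h4"},
    "type == LINUX64": {"h2", "h3", "h4"},
    "mem > 10000":     {"h3"},
}
```

    With the counts above, "mem > 10000" is evaluated first, so the two broader clauses each intersect a one-element set instead of the whole cluster.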

    LSF should have an option to push back on dubious syntax too, something like "select[resourcea && resourceb || resourcec]". Even though this is not syntactically incorrect, it's indicative of the user either not thinking or relying on left-to-right processing. Just a thought.

    Also, before the scheduler allocates threads (and I think this is important), it should look for similar matching resource requirements and create a set of tables that process each of the major patterns into separate memory tables.

    That way, when a bucket is processed, the scheduler can check whether its resource requirements match a pattern and, if so, take the pre-processed results from the pattern-matching memory table instead of reprocessing.

    For example, if you have 10,000 buckets, and all 10,000 have the following:

    "select[type == any && health == ok && mem > 10000 [ && blah ]]"

    Then the matching results of the select, up to the blah, should be placed in a table at the beginning of the process, before dispatching threads to complete the allocations for each bucket.
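    The pattern table amounts to memoizing the expensive host match per canonical select prefix. A small sketch, with `evaluate` standing in for the real matching routine (which I have no visibility into) and invented host names:

```python
# Hypothetical cache: canonical select prefix -> pre-computed host set.
pattern_table: dict[str, frozenset] = {}

def hosts_for(select: str, evaluate) -> frozenset:
    """Compute the matching hosts for a shared select prefix once per
    scheduling cycle, no matter how many buckets reuse it."""
    if select not in pattern_table:
        pattern_table[select] = frozenset(evaluate(select))
    return pattern_table[select]

calls = []
def evaluate(select):
    calls.append(select)          # count how often we really match
    return {"h1", "h2"}

prefix = "type == any && health == ok && mem > 10000"
a = hosts_for(prefix, evaluate)
b = hosts_for(prefix, evaluate)   # second bucket: served from the table
```

    With 10,000 buckets sharing one prefix, the match runs once instead of 10,000 times; the buckets only diverge on whatever follows the shared prefix.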

    So, in summary, we have this in the scheduler loop:

    1. Create tables for each "dynamic" resource

    2. Order the resource requirements by the number of matching rows, fewest on the left and most on the right

    3. Run one sweep to create a list of patterns and mark them in a mapping table

    4. Use X threads to process each of the patterns into their own table

    5. For each bucket, dispatch a group of threads up to X, maintaining X until all buckets are processed

    Then, inside the bucket process:

    1. From the pre-ordered group of resource requirements, search for pre-processed lists and decompose the bucket data as required

    2. For resources that don't match a pattern, obtain the results via a query of shared memory

    3. Process the select string and make an allocation.
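    The scheduler loop summarized above could be skeletonized like this. Every callable here is a placeholder for scheduler internals; `ThreadPoolExecutor` stands in for whatever thread model LSF actually uses, and X is an invented tuning knob, not a real LSF parameter.

```python
from concurrent.futures import ThreadPoolExecutor

X = 4  # worker-thread count; purely illustrative

def schedule_cycle(buckets, dynamic_resources,
                   build_table, find_patterns, process_bucket):
    """Skeleton of the proposed loop; the callables are stand-ins."""
    tables = {r: build_table(r) for r in dynamic_resources}   # step 1
    patterns = find_patterns(buckets)                         # steps 3-4
    with ThreadPoolExecutor(max_workers=X) as pool:           # step 5
        return list(pool.map(
            lambda b: process_bucket(b, tables, patterns), buckets))

# Toy run with trivial stand-ins, just to show the shape of the loop.
results = schedule_cycle(
    buckets=[1, 2, 3],
    dynamic_resources=["mem"],
    build_table=lambda r: set(),
    find_patterns=lambda bs: {},
    process_bucket=lambda b, tables, patterns: b * 2,
)
```

    The key property is that the tables and pattern results are built once, before the pool fans out, so the per-bucket work is read-only against shared state.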

    I have no visibility into what the algorithm is today; these are just my thoughts on how you can:

    1. Not have to perform a down-select of hosts more than once

    2. Minimize wasted cycles

    3. Parallelize the selection/allocation processing time

    4. Make the best use of memory

    That's enough for this morning.

  • Guest
    Feb 2, 2022

    Man, all that nice formatting was lost. I was going to add another RFE, but instead, I'm just going to type it in here.

    LIM fidelity should also be improved by putting the ls_info, ls_host, and ls_load structures in shared memory, accessible to MBD/MBSCHED. Since the shmem API supports locks, there is no reason that MBD could not use these structures.

    In addition, if you were to handle every LIM host update or registration in its own thread, registering/updating the shmem tables as it goes, you could increase the fidelity and frequency of LIM updates, allowing, say, a 30-second update frequency from a 10,000-node cluster.

    Doing the math, that would mean roughly 333 updates per second for the master LIM, and if the master LIM used 33 threads, that would be about 10 per second per thread, which is not a big number actually. Using shmem and locking, each transaction would take about 40 ns or so, which means that the "effective" rate for a 10,000-node cluster with 33 threads could be even higher.
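    The arithmetic above checks out directly (the figures are the ones from this comment, not measured values):

```python
nodes = 10_000
interval_s = 30        # target full-cluster LIM update period
threads = 33

updates_per_sec = nodes / interval_s       # load on the master LIM (~333/s)
per_thread = updates_per_sec / threads     # work per thread (~10/s)
```

    At roughly 10 updates per second per thread, each thread has about 100 ms per update, which leaves enormous headroom if a locked shmem write really costs on the order of 40 ns.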

    If 40 ns is in fact the per-host time required to update the shmem database, we could get a theoretical update frequency of twice a second. Now, that's a bit much, but that would be the upside, the upper limit on how low the sampling interval could be pushed. Scary shit.

    All that is needed is to measure, with TIMEIT(), the per-host time to make the database update, and then do the math. Based on that time, you can calculate the upper limit on the number of threads you could theoretically support, since these updates lock at the table level (shared memory named segment). Math is fun.