IBM Data and AI Ideas Portal for Customers


Use this portal to open public enhancement requests against products and services offered by the IBM Data & AI organization. To view all of your ideas submitted to IBM, create and manage groups of ideas, or create an idea explicitly set to be either visible to all (public) or visible only to you and IBM (private), use the IBM Unified Ideas Portal (https://ideas.ibm.com).


Shape the future of IBM!

We invite you to shape the future of IBM, including product roadmaps, by submitting ideas that matter to you the most. Here's how it works:


Search existing ideas

Start by searching and reviewing ideas and requests to enhance a product or service. Take a look at ideas others have posted, and add a comment, vote, or subscribe to updates on them if they matter to you. If you can't find what you are looking for, post a new idea.


Post your ideas

Post ideas and requests to enhance a product or service. Take a look at ideas others have posted and upvote them if they matter to you:

  1. Post an idea

  2. Upvote ideas that matter most to you

  3. Get feedback from the IBM team to refine your idea


Specific links you will want to bookmark for future use

Welcome to the IBM Ideas Portal (https://www.ibm.com/ideas) - Use this site to find additional information and details about the IBM Ideas process and statuses.

IBM Unified Ideas Portal (https://ideas.ibm.com) - Use this site to view all of your ideas, create new ideas for any IBM product, or search for ideas across all of IBM.

ideasibm@us.ibm.com - Use this email to suggest enhancements to the Ideas process or request help from IBM for submitting your Ideas.

IBM Employees should enter Ideas at https://ideas.ibm.com


Status: Not under consideration
Workspace: Spectrum LSF
Components: Scheduling
Created by: Guest
Created on: Feb 22, 2023

Allow preemption of shared GPU jobs (j_exclusive=no) by higher-priority job queues.

We need LSF support for preemption of shared GPU jobs. Currently, GPU preemption is supported only for jobs submitted with j_exclusive=yes.

This is a big problem for us because it wastes a significant amount of GPU memory: if a lower-priority job needs just 10GB of GPU memory, it is wasteful for it to take a whole 80GB GPU instead of sharing that GPU with other 10GB jobs.

The ideal solution would be to allow preemption of lower-priority j_exclusive=no jobs by higher-priority shared or non-shared GPU jobs.
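For context, this is roughly how such jobs are submitted today (a minimal sketch; the queue names, job scripts, and memory values are illustrative, and depending on LSF_UNIT_FOR_LIMITS the gmem values may need to be given in MB):

    # Shared GPU job on the low-priority queue, requesting 10GB of GPU memory.
    bsub -q gpu-lowpriority -gpu "num=1:mode=shared:gmem=10G:j_exclusive=no" ./low_prio_job.sh

    # Higher-priority shared GPU job. Today it cannot preempt the job above,
    # because GPU preemption applies only to j_exclusive=yes jobs.
    bsub -q gpu-highpriority -gpu "num=1:mode=shared:gmem=30G:j_exclusive=no" ./high_prio_job.sh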

For example:

On the lower-priority queue gpu-lowpriority, there are three jobs running on node1, gpu1:

job1 with gmem=10GB and j_exclusive=no

job2 with gmem=20GB and j_exclusive=no

job3 with gmem=20GB and j_exclusive=no

 

Now, on the higher-priority queue gpu-highpriority, a user submits a job that needs 30GB of GPU memory (regardless of whether it is j_exclusive=yes or no).

What should happen is that this high-priority GPU job will:

preempt job1 and job2 if the high-priority job is j_exclusive=no

preempt job1, job2, and job3 if the high-priority job is j_exclusive=yes
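A minimal sketch of the queue setup this example assumes, using standard lsb.queues preemption between the two queues (the priority values are illustrative). The request is that this preemption also take effect for j_exclusive=no GPU jobs, not only for j_exclusive=yes jobs:

    # lsb.queues (sketch)
    Begin Queue
    QUEUE_NAME = gpu-highpriority
    PRIORITY   = 80
    PREEMPTION = PREEMPTIVE[gpu-lowpriority]
    End Queue

    Begin Queue
    QUEUE_NAME = gpu-lowpriority
    PRIORITY   = 20
    PREEMPTION = PREEMPTABLE[gpu-highpriority]
    End Queue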

 

Today, there is no support for preempting j_exclusive=no jobs at all. Because of this, we are considering switching to another scheduler that does support higher-priority preemption of shared GPU jobs.

Needed By: Yesterday (Let's go already!)
Admin response from Bill McMillan, Mar 8, 2023:

Having discussed this with Nvidia, their recommendation is to use MIG to subdivide the GPU for better memory utilization.
There is no resource control when multiple jobs are running in shared mode, which means that preempting a given job does not guarantee that resources would actually be available for the preempting job to start. This would lead to seemingly random failures of the preempting job.
This can be revisited as and when there is support in the Nvidia API.
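A rough sketch of the MIG approach mentioned above, assuming an NVIDIA data-center GPU that supports MIG; the available instance profiles and their IDs depend on the GPU model and driver, so the profile IDs below are placeholders:

    # Enable MIG mode on GPU 0 (the GPU must be idle; a reset may be required).
    nvidia-smi -i 0 -mig 1

    # List the GPU instance profiles this GPU and driver support.
    nvidia-smi mig -lgip

    # Create GPU instances (and matching compute instances via -C) from the
    # profile IDs reported above, e.g. several small instances for 10-20GB jobs.
    nvidia-smi mig -i 0 -cgi <profile_id>,<profile_id> -C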