This portal is for opening public enhancement requests against products and services offered by the IBM Data & AI organization. To view all of your ideas submitted to IBM, create and manage groups of ideas, or create an idea explicitly set to be either visible to all (public) or visible only to you and IBM (private), use the IBM Unified Ideas Portal (https://ideas.ibm.com).
Shape the future of IBM!
We invite you to shape the future of IBM, including product roadmaps, by submitting ideas that matter to you the most. Here's how it works:
Search existing ideas
Start by searching and reviewing ideas and requests to enhance a product or service. Take a look at ideas others have posted, and add a comment, vote, or subscribe to updates on them if they matter to you. If you can't find what you are looking for, post your own.
Post your ideas
Post ideas and requests to enhance a product or service. Take a look at ideas others have posted and upvote them if they matter to you.
Post an idea
Upvote ideas that matter most to you
Get feedback from the IBM team to refine your idea
Specific links you will want to bookmark for future use
Welcome to the IBM Ideas Portal (https://www.ibm.com/ideas) - Use this site to find out additional information and details about the IBM Ideas process and statuses.
IBM Unified Ideas Portal (https://ideas.ibm.com) - Use this site to view all of your ideas, create new ideas for any IBM product, or search for ideas across all of IBM.
ideasibm@us.ibm.com - Use this email to suggest enhancements to the Ideas process or request help from IBM for submitting your Ideas.
IBM Employees should enter Ideas at https://ideas.ibm.com
We need LSF support for preemption of shared GPU jobs. Currently, GPU preemption is supported only for jobs submitted with j_exclusive=yes.
This is a major problem for us because it wastes GPU memory: if a lower-priority job needs only 10GB of GPU memory, it is wasteful for it to occupy an entire 80GB GPU instead of sharing the device with other small GPU-memory jobs.
The ideal solution would be to allow lower-priority j_exclusive=no jobs to be preempted by higher-priority GPU jobs, whether shared or exclusive.
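For reference, a minimal lsb.queues sketch of how the two queues used in the example below might be defined, assuming standard LSF queue-based preemption (queue names, priorities, and the PREEMPTION settings are illustrative; as noted above, today this relationship only preempts GPU jobs that were submitted with j_exclusive=yes):

Begin Queue
QUEUE_NAME = gpu-highpriority
PRIORITY   = 80
PREEMPTION = PREEMPTIVE[gpu-lowpriority]
End Queue

Begin Queue
QUEUE_NAME = gpu-lowpriority
PRIORITY   = 20
PREEMPTION = PREEMPTABLE[gpu-highpriority]
End Queue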
For example:
On the lower-priority queue gpu-lowpriority, there are 3 jobs running on node1, gpu1:
job1 with gmem=10GB and j_exclusive=no
job2 with gmem=20GB and j_exclusive=no
job3 with gmem=20GB and j_exclusive=no
Now, on the higher-priority queue gpu-highpriority, a user submits a job that needs 30GB of GPU memory (no matter whether it is j_exclusive=yes or no).
What should happen is that this high-priority GPU job will (see the submission sketch after this example):
preempt job1 and job2 IF the high-priority job is j_exclusive=no
preempt job1, job2, and job3 IF the high-priority job is j_exclusive=yes
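A hedged submission sketch of this scenario follows. The gmem= keyword in the -gpu string assumes a recent LSF 10.1 fix pack that accepts GPU memory requests there, and the job commands (./job1, ./job2, ./job3, ./hipri_job) are placeholders:

# Low-priority shared jobs landing on node1, gpu1
bsub -q gpu-lowpriority -gpu "num=1:j_exclusive=no:gmem=10G" ./job1
bsub -q gpu-lowpriority -gpu "num=1:j_exclusive=no:gmem=20G" ./job2
bsub -q gpu-lowpriority -gpu "num=1:j_exclusive=no:gmem=20G" ./job3
# High-priority job needing 30GB of GPU memory; today it cannot preempt
# any of the jobs above because they run with j_exclusive=no
bsub -q gpu-highpriority -gpu "num=1:j_exclusive=no:gmem=30G" ./hipri_job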
Today, there is no support for preempting j_exclusive=no jobs at all. Because of this, we are considering switching to another scheduler that does support higher-priority preemption of shared GPU jobs.
Needed By: Yesterday (Let's go already!)
We discussed this with NVIDIA; their recommendation is to use MIG to subdivide the GPU for better memory utilization.
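For context, a minimal sketch of the MIG approach on a single GPU, assuming an NVIDIA A100 80GB and a recent driver (profile names and availability vary by GPU model and driver version, and enabling MIG mode requires draining the GPU and may require a reset):

# Enable MIG mode on GPU 0
nvidia-smi -i 0 -mig 1
# Create one 2g.20gb and two 1g.10gb GPU instances, plus matching compute instances (-C)
nvidia-smi mig -i 0 -cgi 2g.20gb,1g.10gb,1g.10gb -C
# List the resulting GPU instances
nvidia-smi mig -i 0 -lgi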
There is no resource control when multiple jobs run on a GPU in shared mode. This means that preempting a given job does not guarantee that resources would actually be available for another job to start, which would lead to seemingly random failures of the preempting job.
This can be revisited if and when there is support in the NVIDIA API.