Optimizing WML Deep Learning Resource Allocation and Scheduling in watsonx.ai

Resource allocation limits: Set GPU or resource limits for each watsonx.ai project to ensure efficient resource usage across multiple projects. This will prevent any one project from monopolizing resources and ensure that other projects can run smoothly.
Dynamic GPU resource preemption & scheduling: Allow high-priority training jobs to preempt and acquire GPU resources from lower-priority jobs. This ensures that critical workloads are processed first and helps optimize the overall usage of resources based on the importance and urgency of tasks.
Enhanced UI for visual management: Provide a user-friendly interface that allows users to visualize and manage GPU allocation, task priority, and real-time resource usage. Features like drag-and-drop or slider controls will make resource management easy and accessible without deep technical knowledge.
Multi-level resource allocation strategy: Implement different resource allocation strategies based on task needs. For instance, urgent tasks can be allocated higher priority resources, while longer-running models can be assigned lower-priority resources, balancing the overall workload.
Automated resource management: Automatically adjust resource allocation based on the priority and importance of the training jobs. This will enable efficient, real-time resource distribution, ensuring that deep learning tasks always have access to the most suitable resources.
Manual GPU adjustment during training: Allow the CPD admin to manually adjust the GPU allocation while a training job is running. This gives the admin direct control over resource allocation based on real-time needs, helping maintain performance during training.
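To make the requested behavior concrete, here is a minimal sketch of how per-project GPU limits, priority-based preemption, and manual resizing could interact. It is only an illustration of the request, not watsonx.ai or WML code: the `Job` and `Scheduler` classes, their fields, and the convention that a larger `priority` number means a more important job are all assumptions made for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    project: str
    priority: int  # assumed convention: larger number = more important
    gpus: int


@dataclass
class Scheduler:
    """Hypothetical scheduler; not an existing watsonx.ai or WML API."""

    total_gpus: int
    project_limits: dict[str, int]  # assumed per-project GPU caps
    running: list[Job] = field(default_factory=list)

    def used(self, project: str | None = None) -> int:
        """GPUs in use, cluster-wide or for a single project."""
        return sum(
            j.gpus for j in self.running
            if project is None or j.project == project
        )

    def submit(self, job: Job) -> list[Job]:
        """Start `job` and return the jobs that were preempted to make room."""
        limit = self.project_limits.get(job.project, self.total_gpus)
        # Conservative check: counts the project's current usage even if
        # some of its own jobs could later be preempted.
        if self.used(job.project) + job.gpus > limit:
            raise RuntimeError(f"{job.project} would exceed its GPU limit")

        free = self.total_gpus - self.used()
        victims: list[Job] = []
        # Take GPUs from the lowest-priority jobs first, and only from
        # jobs with strictly lower priority than the new one.
        for cand in sorted(self.running, key=lambda j: j.priority):
            if free >= job.gpus:
                break
            if cand.priority < job.priority:
                victims.append(cand)
                free += cand.gpus
        if free < job.gpus:
            raise RuntimeError("not enough GPUs, even after preemption")

        for v in victims:
            self.running.remove(v)
        self.running.append(job)
        return victims

    def resize(self, name: str, gpus: int) -> None:
        """Manually change the GPU count of a running job (admin action)."""
        job = next(j for j in self.running if j.name == name)
        delta = gpus - job.gpus
        limit = self.project_limits.get(job.project, self.total_gpus)
        if delta > self.total_gpus - self.used():
            raise RuntimeError("not enough free GPUs")
        if self.used(job.project) + delta > limit:
            raise RuntimeError(f"{job.project} would exceed its GPU limit")
        job.gpus = gpus


sched = Scheduler(total_gpus=8, project_limits={"llm": 6, "vision": 4})
sched.submit(Job("vision-train", "vision", priority=1, gpus=4))
sched.submit(Job("llm-sweep", "llm", priority=1, gpus=4))
# The urgent job needs 2 GPUs on a full cluster, so a lower-priority job is preempted.
preempted = sched.submit(Job("llm-urgent", "llm", priority=5, gpus=2))
print([j.name for j in preempted])
```

In this sketch a preempted job is simply removed from the running set. A real implementation would presumably checkpoint and requeue it, which is left out here to keep the example short.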

Background of my client:

My client is one of Taiwan's most important research institutions. Their primary projects focus on military applications and AI solutions for central and local governments. They have procured a large number of AI servers and GPUs and are preparing to establish a national-level computing center and AI training platform. They have multiple teams working on AI research, including LLMs, machine learning, and deep learning. I have been engaged with them for nearly 6 months and initially received very positive feedback on watsonx.ai. However, when it comes to deep learning capabilities and GPU management, they have found several limitations that do not meet their requirements.
Specifically, they have identified issues such as:

The lack of dynamic manual GPU resource allocation, which prevents users from adjusting GPU distribution in real time.
The inability to predefine GPU resource limits at the project level through the user interface.

Due to these shortcomings in GPU management and deep learning capabilities, our platform appears less comprehensive than those of other vendors. As a result, they are actively considering alternative solutions from competitors for their deep learning workloads.

Impact and Significance:

If we successfully sell watsonx.ai to this client, it would be a big milestone for IBM’s AI platform in Taiwan’s government sector. This achievement would not only strengthen IBM’s influence in government AI applications but also establish a solid foundation of trust for expanding into other government agencies in the future.

Needed By

Quarter
