Asynchronous API and prioritisation rules vs. synchronous requests

Because GPUs are in short supply and very expensive, clients want to optimize generative AI queries by:

Maximising GPU utilisation over time
Maintaining the best quality of service, in particular latency and token throughput

That's why they want watsonx.ai to expose two types of API:

Synchronous APIs are the ones that already exist, with or without streaming. Clients expect to use these endpoints when an end user is waiting for the model output as soon as possible. Watsonx.ai should prioritize this kind of request.
Asynchronous APIs are for all requests that do not require an immediate response. For instance, a meeting summary could be generated when some capacity is available on the LLM. Similarly, a batch could be handled as a series of asynchronous requests. Many generative AI tasks, such as document management, claim summarisation or classification, can be handled asynchronously.
- In the standard asynchronous case, we could have an HTTP/gRPC endpoint that publishes the request to a topic and sends back a requestId to the requestor. Watsonx.ai could also allow the requestor to behave like an event producer by exposing a topic to which it can publish its requests. The requestor can then either poll an endpoint to get the generated output or subscribe to a topic to receive it when it is ready. If using Kafka, the requestId could be the key of the Kafka event so that it is cheaper to filter on it.
- In the batch case, the results would be appended to a file that the consumer can then fetch through an API or by checking an S3 repository. Watsonx.ai could slice the input file of queries into a dedicated Kafka topic so that it can prioritize unitary asynchronous requests over batches during the day and prioritize batches at night.
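The submit-then-poll flow described above can be sketched as follows. This is only an illustration and not an actual watsonx.ai API: the class `AsyncGenerationService` and its methods `submit`, `poll` and `drain` are hypothetical names, an in-memory deque stands in for the topic (Kafka or otherwise), and `generate` is a placeholder for the call to the LLM.

```python
import uuid
from collections import deque
from typing import Callable, Deque, Dict, Optional, Tuple


class AsyncGenerationService:
    """In-memory stand-in for an asynchronous endpoint backed by a topic."""

    def __init__(self, generate: Callable[[str], str]) -> None:
        self._generate = generate  # hypothetical call into the LLM
        self._topic: Deque[Tuple[str, str]] = deque()  # (request_id, prompt)
        self._results: Dict[str, str] = {}  # request_id -> generated output

    def submit(self, prompt: str) -> str:
        """Publish the request to the topic and return its id immediately."""
        request_id = str(uuid.uuid4())
        self._topic.append((request_id, prompt))
        return request_id

    def poll(self, request_id: str) -> Optional[str]:
        """Return the generated output if it is ready, otherwise None."""
        return self._results.get(request_id)

    def drain(self, max_requests: int) -> int:
        """Process up to max_requests queued requests, oldest first."""
        processed = 0
        while self._topic and processed < max_requests:
            request_id, prompt = self._topic.popleft()
            self._results[request_id] = self._generate(prompt)
            processed += 1
        return processed
```

In a real deployment, `drain` would be called by a worker only when the LLM has spare capacity, and the request id would play the role of the Kafka event key mentioned above.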

This would have a positive impact on:

Synchronous consumers, since their performance would not be impacted by asynchronous requests when the LLM does not have enough capacity
GPU load, which would be optimized
LLM protection: the mechanism would behave like a circuit breaker, stopping the flow of asynchronous requests when the LLM is overloaded (by monitoring token throughput, for instance) and closing the circuit again when the LLM has some free capacity
Client flexibility: clients could free up some GPU capacity to handle more use cases and manage their LLM deployments more flexibly
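As a rough sketch of the circuit-breaker behaviour mentioned above, the gate below opens (stops asynchronous traffic) when the load is high and closes again once it drops. Everything here is an assumption made for illustration: the class name `AsyncCircuitBreaker`, the `load` metric (a 0 to 1 utilisation figure that could be derived from token throughput), the two thresholds, and the use of two different thresholds to avoid flapping.

```python
from dataclasses import dataclass


@dataclass
class AsyncCircuitBreaker:
    """Gate for asynchronous traffic driven by an observed load metric."""

    open_above: float   # load at or above which async traffic is stopped (assumed)
    close_below: float  # load at or below which async traffic resumes (assumed)
    is_open: bool = False

    def update(self, load: float) -> None:
        """Record the latest load measurement and switch state if needed."""
        if not self.is_open and load >= self.open_above:
            self.is_open = True   # overloaded: stop the asynchronous flow
        elif self.is_open and load <= self.close_below:
            self.is_open = False  # capacity is back: close the circuit

    def allows_async(self) -> bool:
        """Asynchronous requests flow only while the circuit is closed."""
        return not self.is_open
```

Synchronous requests would bypass this gate entirely, so only asynchronous and batch traffic is held back while the LLM is busy.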

In short, the expected features would be:

HTTP/gRPC endpoints that behave asynchronously (publish to a topic and send back an id)
Prioritisation of synchronous vs. asynchronous vs. batch requests based on LLM/GPU utilisation
Configurable prioritisation rules
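One possible shape for configurable prioritisation rules is sketched below. The rule tables `DAY_RULES` and `NIGHT_RULES`, the 07:00 to 20:00 day window, and the function names are all hypothetical; they only illustrate the day/night behaviour requested above (synchronous first at all times, unitary asynchronous requests before batches during the day, batches before asynchronous requests at night).

```python
from typing import Dict, List, Tuple

# Lower number means higher priority. Both tables are illustrative assumptions.
DAY_RULES: Dict[str, int] = {"sync": 0, "async": 1, "batch": 2}
NIGHT_RULES: Dict[str, int] = {"sync": 0, "batch": 1, "async": 2}


def rules_for_hour(hour: int) -> Dict[str, int]:
    """Pick the rule table for a given hour (day window 07:00-20:00 is assumed)."""
    return DAY_RULES if 7 <= hour < 20 else NIGHT_RULES


def schedule(pending: List[Tuple[str, str]], rules: Dict[str, int]) -> List[str]:
    """Order (kind, payload) requests by rule priority.

    sorted() is stable, so requests of the same kind keep arrival order.
    """
    ordered = sorted(pending, key=lambda item: rules[item[0]])
    return [payload for _, payload in ordered]


pending = [("batch", "b1"), ("async", "a1"), ("sync", "s1"), ("async", "a2")]
day_order = schedule(pending, rules_for_hour(10))
night_order = schedule(pending, rules_for_hour(23))
```

With these tables, `day_order` is `["s1", "a1", "a2", "b1"]` and `night_order` is `["s1", "b1", "a1", "a2"]`. A rule table could equally be keyed on LLM/GPU utilisation rather than on the hour.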

Needed By

Quarter
