This topic is about too many endpoint for inferencing in foundation models through watsonx which makes application development harder to do.

See this idea on ideas.ibm.com

We are facing a big workload with a customer (Banco do Brasil) due to too many endpoint for inferencing watsonx.ai models. The client needs chat endpoint and we had to create a transformation layer to convert text generation APIs to chat generation APIs. The main workload is in the tags required for each model architecture (example: llama need a kind of tags for chat application, mistral other, granite other, etc) and the management of deployed models. Because they are deployed models but each application (agent, byom, prompt) has a different endpoint.

For BYOM the endpoint is (Completion endpoint – to use in chat application you must to perform all prompt transformation based on the model architecture. If you are using multiple models you must to build prompt transformation for every single architecture)): https://us-south.ml.cloud.ibm.com/ml/v1/deployments/{deploy_id}/text/generation?version=2021-05-01

For Agents deployed the endpoint is (already a chat endpoint why to create a new one? Why not a text/chat endpoint with deployment id as a parameter?): https://us-south.ml.cloud.ibm.com/ml/v4/deployments/{deploy_id}/ai_service?version=2021-05-01

For default models text generation: https://us-south.ml.cloud.ibm.com/ml/v1/text/generation?version=2023-05-29

For default models text chat: https://us-south.ml.cloud.ibm.com/ml/v1/text/chat?version=2023-05-29

PS: As we can see in the development hub there is a chat endpoint where I can perform tool calling: https://www.ibm.com/watsonx/developer/

It should be only two endpoints:

1 – text generation, only for models that is impossible to perform chat generation.
2- chat generation, for all models where is possible to perform chat generation even when the customer needs a single iteration with the model (one user input, one assistant output and a system input if required). For BYOM models we can use the chat structure based on the model architecture and deployment can be a parameter just as model is. The same should be considered for agent deployment and prompt deployment.

Needed By

Yesterday (Let's go already!)

Post comment

By clicking the "Post Comment" or "Submit Idea" button, you are agreeing to the IBM Ideas Portal Terms of Use.
Do not place IBM confidential, company confidential, or personal information into any field.

Please enter your email address

RELATED IDEAS

This topic is about too many endpoint for inferencing in foundation models through watsonx which makes application development harder to do.