Backend

Deploying AI models involves a variety of considerations, especially when it comes to backend infrastructure. The backend is the engine that powers your AI application, handling the complex computations and data processing that your models require. When setting up your backend, you need to consider factors such as latency, model performance, and compute resources.

  • Latency: This refers to the delay between a user's action and the system's response. In AI applications, low latency is crucial for a smooth user experience (a simple way to measure it is sketched below).

  • Model Performance: Your AI model should be readily available to process requests at the quality your end users need. If it doesn't give sufficiently reasonable results, it will not be adopted.

  • Compute Resources: AI models, especially large ones, require significant computational resources. You need to ensure that your backend has enough processing power and memory to handle your model's requirements.

For more information on compute resources, refer to our computation guide.
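
To make the latency consideration concrete, here is a minimal Python sketch that measures the average end-to-end latency of a model served behind an HTTP endpoint. The endpoint URL, payload shape, and function name are placeholders for illustration, not the API of any particular serving library.

```python
import time

import requests  # third-party HTTP client: pip install requests

# Hypothetical inference endpoint; substitute the URL and payload
# shape of whatever serving stack you actually deploy.
ENDPOINT = "http://localhost:8080/predict"


def average_latency_ms(payload: dict, runs: int = 20) -> float:
    """Send the same request several times and return the mean latency in milliseconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        requests.post(ENDPOINT, json=payload, timeout=30)
        timings.append((time.perf_counter() - start) * 1000)
    return sum(timings) / len(timings)


if __name__ == "__main__":
    print(f"Average latency: {average_latency_ms({'inputs': 'Hello, world'}):.1f} ms")
```

Averaging over several runs smooths out network jitter; for a fuller picture you would also track tail latencies (p95/p99), which matter more than the mean for user experience.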

Libraries for Backend Deployment

There are several libraries available that can help you deploy your AI models on the backend. These libraries provide tools and functionalities that simplify the process of setting up and managing your backend infrastructure.

  • FlexFlow: A low-latency, high-performance LLM serving library.

  • llm: A CLI utility and Python library for interacting with Large Language Models, including OpenAI, PaLM, and local models installed on your own machine.

  • vLLM: This library uses PagedAttention to manage attention keys and values, enabling up to 24x higher throughput than HuggingFace Transformers without any model architecture changes (see the sketch after this list).

  • Text Generation Inference: An open-source implementation forked from Hugging Face (HF). It is a Rust, Python, and gRPC server for text generation inference.

  • Lit-GPT: A hackable implementation of state-of-the-art open-source large language models.

  • TorchServe: This library enables efficient serving of PyTorch models.

  • Triton Inference Server: Part of NVIDIA AI Inference, this server provides a robust solution for deploying AI models.

  • litellm by BerriAI: This library provides a single, OpenAI-style interface (and a proxy server) for calling many different LLM providers, simplifying deployments.

  • OpenRouter: Provides a unified way to call open- and closed-source models from Python or curl, with rate and usage tracking.
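
To show what using one of these libraries looks like in practice, the sketch below uses vLLM's offline generation API to batch a few prompts through a small model. The model name, prompts, and sampling settings are examples only, and a CUDA-capable GPU is assumed for most checkpoints.

```python
# pip install vllm  (most models require a CUDA-capable GPU)
from vllm import LLM, SamplingParams

prompts = [
    "The key trade-off when serving large language models is",
    "PagedAttention improves throughput because",
]

# Sampling settings are illustrative; tune them for your use case.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Example model; swap in any vLLM-supported checkpoint you have access to.
llm = LLM(model="facebook/opt-125m")

# generate() batches the prompts together and returns one result per prompt.
for output in llm.generate(prompts, sampling_params):
    print(output.prompt)
    print("->", output.outputs[0].text)
```

The same engine can also be run as an OpenAI-compatible HTTP server, which is the more common setup for production backends.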

Platforms for Backend Deployment

Several platforms provide infrastructure and services that can help you deploy your AI models on the backend.

  • Azure-Chat-GPT: This platform allows you to run GPT on Azure services.

  • Amazon SageMaker: Part of the AWS suite, SageMaker provides managed infrastructure for training, deploying, and serving AI models (a minimal deployment sketch follows this list).

  • Lamini: This platform provides tools and services for building and fine-tuning your own LLM applications.
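
As a rough illustration of the SageMaker path mentioned above, the sketch below deploys a Hugging Face model to a real-time endpoint with the sagemaker Python SDK. The model ID, framework versions, and instance type are examples only; the code assumes AWS credentials and an IAM role with SageMaker permissions.

```python
# pip install sagemaker
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()  # works inside SageMaker; otherwise pass a role ARN

# Example model pulled straight from the Hugging Face Hub.
hub_config = {
    "HF_MODEL_ID": "distilbert-base-uncased-finetuned-sst-2-english",
    "HF_TASK": "text-classification",
}

# Framework versions below are illustrative; match them to a combination
# that the SageMaker Hugging Face containers actually provide.
model = HuggingFaceModel(
    env=hub_config,
    role=role,
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
)

# Spin up a managed real-time endpoint and send one test request.
predictor = model.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")
print(predictor.predict({"inputs": "Deploying this model was painless."}))

predictor.delete_endpoint()  # clean up to avoid ongoing charges
```

Note that the endpoint bills for as long as it is running, so deleting it (or using serverless/asynchronous inference) matters for cost control.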

Tutorials

For more hands-on guidance, you can refer to the following tutorials:

  • GCP Tutorial: This tutorial provides a step-by-step guide to deploying large deep learning models to production on Google Cloud Platform.