model-serving

Here are 130 public repositories matching this topic...

bentoml / BentoML

The easiest way to serve AI/ML models in production - Build Model Inference Service, LLM APIs, Multi-model Inference Graph/Pipelines, LLM/RAG apps, and more!

python machine-learning deep-learning model-serving multimodal mlops ml-engineering llm generative-ai llmops llm-serving model-inference-service llm-inference inference-platform

Updated Jun 12, 2024
Python

logicalclocks / machine-learning-api

Hopsworks Machine Learning API 🚀 Model management with a model registry and model serving

model-serving model-registry

Updated Jun 12, 2024
Python

vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

amd cuda inference pytorch transformer llama gpt rocm model-serving mlops llm inferentia llmops llm-serving trainium

Updated Jun 12, 2024
Python

kserve / kserve

Standardized Serverless ML Inference Platform on Kubernetes

Updated Jun 12, 2024
Python

alibaba / rtp-llm

RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.

inference llama gpt model-serving llm llmops llm-serving

Updated Jun 12, 2024
C++

ModelTC / lightllm

LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.

nlp deep-learning llama gpt model-serving llm openai-triton

Updated Jun 12, 2024
Python

intel / xFasterTransformer

intel inference transformer xeon llama model-serving llm chatglm qwen

Updated Jun 12, 2024
C++

predibase / lorax

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs

transformers pytorch llama gpt lora model-serving fine-tuning llm llmops llm-serving llm-inference

Updated Jun 12, 2024
Python

FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI jobs on any GPU cloud or on-premise cluster. Built on this library, TensorOpera AI (https://TensorOpera.ai) is your generative AI platform at scale.

machine-learning deep-learning inference-engine model-deployment model-serving distributed-training federated-learning mlops edge-ai ai-agent on-device-training

Updated Jun 12, 2024
Python

google / jetstream-pytorch

PyTorch/XLA integration with JetStream (https://github.com/google/JetStream) for LLM inference

inference pytorch batching attention llama gemma model-serving tpu llm llm-inference llama2

Updated Jun 12, 2024
Python

microsoft / aici

AICI: Prompts as (Wasm) Programs

rust ai wasm inference transformer language-model model-serving wasmtime llm llmops llm-serving llm-inference llm-framework

Updated Jun 11, 2024
Rust

basetenlabs / truss

The simplest way to serve AI/ML models in production

open-source machine-learning packaging artificial-intelligence falcon easy-to-use whisper inference-server model-serving inference-api stable-diffusion wizardlm

Updated Jun 12, 2024
Python

openvinotoolkit / model_server

A scalable inference server for models optimized with OpenVINO™

kubernetes machine-learning cloud ai deep-learning inference edge dag model-serving serving openvino

Updated Jun 12, 2024
C++

mlrun / mlrun

MLRun is an open source MLOps platform for quickly building and managing continuous ML applications across their lifecycle. MLRun integrates into your development and CI/CD environment and automates the delivery of production data, ML pipelines, and online applications.

python kubernetes workflow data-science machine-learning data-engineering model-serving mlops experiment-tracking mlops-workflow