

Serve thousands of fine-tuned LLM adapters on a single GPU
LoRAX enables dynamic loading of LoRA adapters for LLMs, delivering high-throughput, low-latency inference for thousands of fine-tuned models on a single GPU, with Docker and Kubernetes support.

LoRAX is designed for ML engineers, DevOps teams, and AI product groups that need to expose many fine‑tuned LLM variants without replicating hardware. By sharing a single base model and loading LoRA adapters on demand, it cuts serving costs while keeping latency and throughput stable.
The framework provides just‑in‑time adapter loading from HuggingFace, Predibase, or local storage, heterogeneous continuous batching that mixes requests across adapters, and an adapter‑exchange scheduler that prefetches weights between GPU and CPU. Optimizations such as tensor parallelism, flash‑attention, quantization, and token streaming ensure production‑grade performance. An OpenAI‑compatible API, multi‑turn chat support, private tenant isolation, and structured JSON output round out the feature set.
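As a concrete illustration, here is a minimal sketch of per-request adapter selection using the project's Python client (`pip install lorax-client`); the server URL and adapter ID are illustrative placeholders, and a LoRAX server is assumed to be running already:

```python
from lorax import Client  # pip install lorax-client

# Assumes a LoRAX server is already running (URL is illustrative).
client = Client("http://127.0.0.1:8080")

prompt = "[INST] Summarize the quarterly report in one sentence. [/INST]"

# Request served by the shared base model.
print(client.generate(prompt, max_new_tokens=64).generated_text)

# Same server, same batch engine, but routed through a LoRA adapter
# that is loaded just-in-time. The adapter ID is a placeholder for
# any adapter compatible with the base model.
print(
    client.generate(
        prompt,
        max_new_tokens=64,
        adapter_id="my-org/my-finetuned-lora",
    ).generated_text
)
```

Because requests for different adapters are packed into the same continuous batch, the two calls above can run side by side with requests for other adapters without separate deployments.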
LoRAX ships as a pre‑built Docker image and Helm charts for Kubernetes, with Prometheus metrics and OpenTelemetry tracing. It requires an Nvidia GPU of Ampere generation or newer, CUDA 11.8 or later, and Linux, and can also be launched via SkyPilot or run locally for development.
Personalized chatbot deployment
Serve a distinct LoRA adapter per customer to adjust tone or knowledge without restarting the server.
A/B testing of model variants
Switch adapters on the fly to compare performance of different fine‑tuned versions in production.
Multi‑tenant SaaS AI platform
Isolate each tenant's adapters while sharing a single base model, dramatically reducing GPU costs; a code sketch of this pattern follows this list.
Rapid prototyping of research adapters
Load new adapters from HuggingFace instantly for experimentation without rebuilding the inference service.
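To make the multi‑tenant pattern concrete, here is a minimal sketch assuming a running LoRAX server and the `lorax-client` package; the tenant names and adapter IDs are hypothetical:

```python
from lorax import Client

# Hypothetical per-tenant adapter registry; the IDs are placeholders
# for each tenant's fine-tuned LoRA adapter.
TENANT_ADAPTERS = {
    "acme": "acme/support-tone-lora",
    "globex": "globex/contract-summaries-lora",
}

client = Client("http://127.0.0.1:8080")  # assumed local LoRAX deployment

def generate_for_tenant(tenant: str, prompt: str) -> str:
    """Route a request through the tenant's adapter on the shared base model."""
    return client.generate(
        prompt,
        adapter_id=TENANT_ADAPTERS[tenant],
        max_new_tokens=128,
    ).generated_text

print(generate_for_tenant("acme", "Draft a friendly reply to a refund request."))
```

Every tenant's traffic hits the same GPU and base model; only the lightweight adapter weights differ per request.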
How does LoRAX load adapters?
Adapters are fetched on demand from HuggingFace, Predibase, or a local path and merged with the base model just‑in‑time, without blocking other requests.
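As a sketch, the client's `adapter_source` parameter selects where an adapter is fetched from, matching the sources named above; the paths and IDs are illustrative:

```python
from lorax import Client

client = Client("http://127.0.0.1:8080")
prompt = "[INST] Classify this ticket: 'My invoice is wrong.' [/INST]"

# From the HuggingFace Hub (ID is illustrative).
client.generate(prompt, adapter_id="my-org/hub-lora", adapter_source="hub")

# From a path on the server's local disk.
client.generate(prompt, adapter_id="/data/adapters/ticket-lora", adapter_source="local")

# From Predibase (private adapters may also need an API token).
client.generate(prompt, adapter_id="my-pbase-adapter/1", adapter_source="pbase")
```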
What hardware and software does LoRAX require?
An Nvidia GPU of Ampere generation or newer, CUDA 11.8 or later, a Linux OS, and Docker for the containerized deployment.
Can LoRAX run in containers and on Kubernetes?
Yes, pre‑built Docker images and Helm charts are provided, and the service can be orchestrated alongside other containers.
Is LoRAX compatible with OpenAI clients?
LoRAX exposes an OpenAI‑compatible endpoint for both completions and chat, allowing drop‑in use of standard OpenAI SDKs.
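For example, a minimal sketch using the official `openai` Python SDK against a local LoRAX deployment; in the OpenAI‑compatible API the adapter ID travels in the `model` field, and the URL and adapter ID below are illustrative:

```python
from openai import OpenAI

# Point the standard OpenAI SDK at the LoRAX server (URL is illustrative).
client = OpenAI(api_key="EMPTY", base_url="http://127.0.0.1:8080/v1")

resp = client.chat.completions.create(
    model="my-org/my-finetuned-lora",  # adapter ID goes in the model field
    messages=[
        {"role": "system", "content": "You are a concise support assistant."},
        {"role": "user", "content": "How do I reset my password?"},
    ],
    max_tokens=100,
)
print(resp.choices[0].message.content)
```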
What license does LoRAX use?
LoRAX is released under the Apache 2.0 license, permitting free commercial use.
Project at a glance
Status: Stable. Last synced 4 days ago.