
LoRAX

Serve thousands of fine-tuned LLM adapters on a single GPU

LoRAX enables dynamic loading of LoRA adapters for LLMs, delivering high-throughput, low-latency inference for thousands of fine-tuned models on a single GPU, with Docker and Kubernetes support.


Overview

LoRAX is designed for ML engineers, DevOps teams, and AI product groups that need to expose many fine‑tuned LLM variants without replicating hardware. By sharing a single base model and loading LoRA adapters on demand, it cuts serving costs while keeping latency and throughput stable.

Core Capabilities

The framework provides just‑in‑time adapter loading from HuggingFace, Predibase, or local storage; heterogeneous continuous batching that mixes requests for different adapters in the same batch; and an adapter‑exchange scheduler that asynchronously moves adapter weights between GPU and CPU memory. Optimizations such as tensor parallelism, flash attention, quantization, and token streaming keep performance at production grade. An OpenAI‑compatible API, multi‑turn chat support, private tenant isolation, and structured JSON output round out the feature set.
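As a rough illustration of the per‑request adapter workflow, the sketch below posts to a locally running LoRAX server and names an adapter at request time. The endpoint path, adapter id, and `adapter_source` values are assumptions based on the project's TGI‑style REST API; check the LoRAX docs for the exact schema.

```python
# Minimal sketch of per-request adapter selection against a running LoRAX server.
# Assumes the server listens on localhost:8080 and that the adapter repo id below
# exists on the HuggingFace Hub -- substitute your own.
import requests

LORAX_URL = "http://127.0.0.1:8080/generate"  # assumed default REST endpoint

payload = {
    "inputs": "Summarize: LoRAX serves many LoRA adapters on one GPU.",
    "parameters": {
        "max_new_tokens": 64,
        # Hypothetical adapter repo id; LoRAX loads it just-in-time on first use.
        "adapter_id": "your-org/your-lora-adapter",
        "adapter_source": "hub",  # assumed values: "hub", "local", or "pbase"
    },
}

resp = requests.post(LORAX_URL, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["generated_text"])
```

Because the base model stays resident, two concurrent requests that name different adapters can share the same batch; only the small LoRA weights differ between them.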

Deployment

LoRAX ships as a pre‑built Docker image and Helm charts for Kubernetes, with Prometheus metrics and OpenTelemetry tracing built in. It requires an Nvidia Ampere‑generation or newer GPU, CUDA 11.8+, and Linux, and can also be launched via SkyPilot or run locally for development.
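The sketch below shows one way to launch the pre‑built container from Python (equivalent to a `docker run` in a shell). The image name/tag, port mapping, and volume path are assumptions to confirm against the LoRAX deployment docs.

```python
# Sketch: launch the LoRAX container with a chosen base model.
import subprocess

model_id = "mistralai/Mistral-7B-Instruct-v0.1"  # any supported base model
volume = "/data"                                  # host directory for the weights cache

subprocess.run(
    [
        "docker", "run", "--gpus", "all", "--shm-size", "1g",
        "-p", "8080:80",                 # expose the HTTP API on localhost:8080
        "-v", f"{volume}:/data",         # persist downloaded weights between runs
        "ghcr.io/predibase/lorax:main",  # assumed image name/tag
        "--model-id", model_id,
    ],
    check=True,
)
```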

Highlights

Dynamic per‑request LoRA adapter loading from HF, Predibase, or filesystem
Heterogeneous continuous batching keeps latency constant across adapters
Adapter exchange scheduling asynchronously pages adapter weights between GPU and CPU memory
Production‑ready tooling: Docker, Helm, Prometheus, OpenTelemetry, OpenAI‑compatible API

Pros

  • Supports thousands of adapters on a single GPU
  • Low latency and high throughput thanks to advanced CUDA kernels
  • Works with major base models (Llama, Mistral, Qwen) and quantization
  • Apache 2.0 license allows free commercial use

Considerations

  • Requires Nvidia Ampere‑generation GPU and Linux environment
  • Depends on CUDA 11.8+ and compatible drivers
  • Only LoRA adapters compatible with PEFT/Ludwig are supported
  • Advanced scaling may need Kubernetes expertise

Managed products teams compare with

When teams consider LoRAX, these hosted platforms usually appear on the same shortlist.


Amazon SageMaker

Fully managed machine learning service to build, train, and deploy ML models at scale


Anyscale

Ray-powered platform for scalable LLM training and inference.


BentoML

Open-source model serving framework to ship AI applications.

Looking for a hosted option? These are the services engineering teams benchmark against before choosing open source.

Fit guide

Great for

  • Enterprises delivering multi‑tenant LLM services
  • Researchers testing many fine‑tuned variants quickly
  • SaaS platforms that need per‑customer model customization
  • Teams with GPU‑rich on‑prem clusters seeking cost‑effective serving

Not ideal when

  • CPU‑only or low‑end GPU deployments
  • Projects where a single adapter suffices
  • Users unfamiliar with Docker/Kubernetes orchestration
  • Workflows that rely on non‑LoRA fine‑tuning methods

How teams use it

Personalized chatbot deployment

Serve a distinct LoRA adapter per customer to adjust tone or knowledge without restarting the server.
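To make this pattern concrete, here is a minimal routing sketch, assuming the lorax-client Python package (`pip install lorax-client`) and its `Client.generate(..., adapter_id=...)` interface; the customer names and adapter ids are hypothetical, and the exact client API should be verified against the LoRAX docs.

```python
# Sketch of per-customer adapter routing over a single shared base model.
from lorax import Client

client = Client("http://127.0.0.1:8080")

# Hypothetical mapping from customer/tenant to their fine-tuned adapter.
CUSTOMER_ADAPTERS = {
    "acme": "acme-corp/support-tone-lora",
    "globex": "globex/legal-summaries-lora",
}

def answer(customer: str, prompt: str) -> str:
    # The base model stays loaded; only the customer's LoRA weights differ.
    adapter_id = CUSTOMER_ADAPTERS[customer]
    response = client.generate(prompt, adapter_id=adapter_id, max_new_tokens=128)
    return response.generated_text

print(answer("acme", "Draft a friendly reply about a delayed shipment."))
```

The same call shape covers the A/B testing and multi‑tenant use cases below: routing logic picks the adapter_id, and the server handles loading and batching.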

A/B testing of model variants

Switch adapters on the fly to compare performance of different fine‑tuned versions in production.

Multi‑tenant SaaS AI platform

Isolate each tenant's adapters while sharing a single base model, dramatically reducing GPU costs.

Rapid prototyping of research adapters

Load new adapters from HuggingFace instantly for experimentation without rebuilding the inference service.

Tech snapshot

Python 69%
Rust 20%
CUDA 8%
C++ 2%
Dockerfile 1%
Shell 1%

Tags

llama, gpt, model-serving, fine-tuning, llm, pytorch, llm-serving, transformers, llm-inference, lora, llmops

Frequently asked questions

How does LoRAX load adapters at inference time?

Adapters are fetched on demand from HuggingFace, Predibase, or a local path and applied to the shared base model just‑in‑time, without blocking other in‑flight requests.

What hardware is required to run LoRAX?

An Nvidia GPU of Ampere generation or newer, CUDA 11.8 or later, Linux OS, and Docker for the containerized deployment.

Can LoRAX be integrated with existing Docker or Kubernetes workflows?

Yes, pre‑built Docker images and Helm charts are provided, and the service can be orchestrated alongside other containers.

Is the API compatible with OpenAI client libraries?

LoRAX exposes an OpenAI‑compatible endpoint for both completions and chat, allowing drop‑in use of standard OpenAI SDKs.
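For example, a standard OpenAI Python SDK (openai>=1.0) client can be pointed at a LoRAX deployment. The base_url path and the convention of passing the adapter id in the `model` field are assumptions to verify against the LoRAX API docs.

```python
# Sketch: calling a LoRAX deployment through the OpenAI-compatible chat endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8080/v1",  # assumed OpenAI-compatible route on LoRAX
    api_key="not-needed-for-local",       # the SDK requires a value; a local server ignores it
)

completion = client.chat.completions.create(
    model="your-org/your-lora-adapter",   # hypothetical adapter id selecting the fine-tune
    messages=[{"role": "user", "content": "Give me three taglines for a bakery."}],
    max_tokens=64,
)
print(completion.choices[0].message.content)
```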

What license governs LoRAX?

LoRAX is released under the Apache 2.0 license, permitting free commercial use.

Project at a glance

Stable
Stars: 3,679
Watchers: 3,679
Forks: 304
License: Apache-2.0
Repo age: 2 years old
Last commit: 8 months ago
Primary language: Python

Last synced 2 hours ago