GPUStack

Unified GPU cluster manager for scalable AI inference

GPUStack orchestrates heterogeneous GPU resources across Linux, macOS, and Windows, delivering OpenAI‑compatible APIs for LLMs, VLMs, diffusion, audio, and embedding models.

Overview

GPUStack is a lightweight manager that unifies GPUs from NVIDIA, Apple Metal, AMD ROCm, Ascend CANN, and other accelerators into a single inference platform. It supports a broad catalog of models—including large language, vision‑language, diffusion, audio, and embedding models—through flexible backends such as vLLM, llama‑box, Ascend MindIE, and vox‑box. Users interact via a web UI or OpenAI‑compatible endpoints, benefiting from automatic resource evaluation, load balancing, and real‑time GPU monitoring.
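
As a quick illustration, the sketch below calls a GPUStack server through the standard `openai` Python client. The server address, API key, and model name are placeholders for your own deployment; the `/v1-openai` path follows the route GPUStack documents for its OpenAI‑compatible APIs.

```python
# Illustrative only: endpoint, key, and model name are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://your-gpustack-server/v1-openai",  # your server's address
    api_key="your-api-key",  # generated from the GPUStack UI
)

response = client.chat.completions.create(
    model="qwen3",  # any LLM deployed on your cluster
    messages=[{"role": "user", "content": "Say hello from GPUStack."}],
)
print(response.choices[0].message.content)
```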

Deployment

Deployments are container‑based on Linux (Docker with the NVIDIA Container Toolkit) and available as desktop installers for macOS and Windows. Adding worker nodes or GPUs scales out the cluster as soon as they register, while multi‑version backend support lets different models run on their optimal runtimes. API keys and user management secure access, making GPUStack suitable for internal AI services, multi‑tenant SaaS, or research clusters.
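
A minimal post‑deployment smoke test, assuming default settings, is to list the models the server exposes. The hostname and key below are placeholders for your own installation.

```python
# Post-deployment check: list the models the server currently serves.
import requests

BASE_URL = "http://localhost"  # assumes the server listens on the default HTTP port
API_KEY = "your-api-key"       # created on the GPUStack UI's API Keys page

resp = requests.get(
    f"{BASE_URL}/v1-openai/models",
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=10,
)
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])
```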

Highlights

Broad GPU compatibility across major vendors and OSes
Multi‑version backend support for diverse model runtimes
Distributed inference on heterogeneous multi‑node clusters
OpenAI‑compatible APIs with built‑in user and key management

Pros

  • Runs on a wide range of accelerators
  • Flexible integration with multiple inference backends
  • Scalable architecture for adding GPUs or nodes
  • Real‑time GPU performance monitoring

Considerations

  • Requires Docker and NVIDIA Container Toolkit for NVIDIA GPUs
  • Limited to inference; no native training support
  • Multi‑node setup may need additional networking configuration
  • Depends on supported accelerators and OS versions

Managed products teams compare with

When teams consider GPUStack, these hosted platforms usually appear on the same shortlist.

Amazon SageMaker

Fully managed machine learning service to build, train, and deploy ML models at scale

Anyscale

Ray-powered platform for scalable LLM training and inference.

BentoML

Open-source model serving framework to ship AI applications.

Looking for a hosted option? These are the services engineering teams benchmark against before choosing open source.

Fit guide

Great for

  • Enterprises needing on‑prem AI inference at scale
  • Teams with heterogeneous GPU hardware across vendors
  • Developers wanting OpenAI‑compatible endpoints on private infrastructure
  • Researchers requiring multi‑node, multi‑GPU model serving

Not ideal when

  • Workflows focused on model training rather than inference
  • Environments without Docker or container runtime support
  • Small single‑GPU deployments where overhead outweighs benefits
  • Organizations preferring fully managed cloud inference services

How teams use it

Internal chatbot powered by LLMs

Deploys Qwen3 or LLaMA models behind OpenAI‑compatible APIs for secure, low‑latency employee assistance.
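
A hedged sketch of the usual chatbot pattern: a streaming chat completion keeps time‑to‑first‑token low for interactive UIs. Server URL, key, and model name are placeholders.

```python
# Streaming delivers tokens incrementally instead of one final payload,
# which keeps perceived latency low for chat frontends. All names are
# placeholders for your own deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://your-gpustack-server/v1-openai",
    api_key="your-api-key",
)

stream = client.chat.completions.create(
    model="qwen3",  # or a deployed LLaMA variant
    messages=[{"role": "user", "content": "How do I reset my VPN token?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```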

On‑prem image generation service

Runs Stable Diffusion or FLUX across multiple GPUs, delivering high‑throughput image creation for design teams.
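
One way such a request could look, assuming the deployed diffusion model is reachable through the OpenAI‑style images endpoint; the model name and server address are hypothetical.

```python
# Assumes a diffusion model is deployed and exposed via the OpenAI-style
# images API; every name below is a placeholder.
import base64

from openai import OpenAI

client = OpenAI(
    base_url="http://your-gpustack-server/v1-openai",
    api_key="your-api-key",
)

result = client.images.generate(
    model="stable-diffusion-3.5",  # hypothetical deployed model name
    prompt="Isometric illustration of a GPU rack, flat pastel palette",
    n=1,
    response_format="b64_json",  # return the image inline as base64
)
with open("render.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```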

Speech‑to‑text transcription pipeline

Hosts Whisper models, exposing transcription endpoints that scale with added GPU nodes.
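
Assuming a Whisper model is deployed, a transcription request could look like this sketch; the model name and audio file are placeholders.

```python
# Transcribe a local audio file through the OpenAI-compatible audio API.
from openai import OpenAI

client = OpenAI(
    base_url="http://your-gpustack-server/v1-openai",  # placeholder
    api_key="your-api-key",                            # placeholder
)

with open("meeting.wav", "rb") as audio:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",  # hypothetical deployed model name
        file=audio,
    )
print(transcript.text)
```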

Multi‑tenant SaaS inference platform

Provides isolated API keys and load‑balanced inference for diverse customer models on shared GPU clusters.
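
A sketch of the isolation pattern: issue each tenant its own key from the GPUStack UI and scope every request to that key, so usage can be metered and revoked per customer. Keys, URL, and model below are placeholders.

```python
# Per-tenant API keys: each customer's traffic carries its own credential,
# so access can be revoked or tracked independently. All values are
# placeholders for illustration.
from openai import OpenAI

TENANT_KEYS = {
    "acme": "key-issued-to-acme",      # hypothetical keys from the UI
    "globex": "key-issued-to-globex",
}

def client_for(tenant: str) -> OpenAI:
    """Return an OpenAI-compatible client bound to one tenant's key."""
    return OpenAI(
        base_url="http://your-gpustack-server/v1-openai",
        api_key=TENANT_KEYS[tenant],
    )

reply = client_for("acme").chat.completions.create(
    model="qwen3",
    messages=[{"role": "user", "content": "Ping from tenant acme"}],
)
print(reply.choices[0].message.content)
```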

Tech snapshot

Python 92%
Shell 2%
PowerShell 2%
Jinja 2%
Dockerfile 1%
Makefile 1%

Tags

llama, mindie, inference, qwen, ascend, llm, llm-serving, maas, deepseek, cuda, llm-inference, sglang, genai, vllm, openai, distributed-inference, rocm, high-performance-inference

Frequently asked questions

How do I install GPUStack on a Linux server?

Install Docker and the NVIDIA Container Toolkit, then run the provided `docker run` command to start the GPUStack server.

Which GPU accelerators are supported?

GPUStack supports NVIDIA CUDA, Apple Metal, AMD ROCm, Ascend CANN, Hygon DTK, Moore Threads MUSA, Iluvatar Corex, and Cambricon MLU.

Can I add new models not listed in the catalog?

Yes, you can deploy models from Hugging Face, ModelScope, or a local file path by following the UI deployment workflow.

Does GPUStack handle model training?

GPUStack focuses on inference; training workflows are not provided out of the box.

How is API access secured?

API keys are generated per user, displayed only once, and can be managed via the UI's API Keys page.

Project at a glance

Status: Active
Stars: 4,398
Watchers: 4,398
Forks: 447
License: Apache-2.0
Repo age: 1 year old
Last commit: yesterday
Primary language: Python

Last synced yesterday