GPUStack

Unified GPU cluster manager for scalable AI inference

GPUStack orchestrates heterogeneous GPU resources across Linux, macOS, and Windows, delivering OpenAI‑compatible APIs for LLMs, VLMs, diffusion, audio, and embedding models.

Overview

GPUStack is a lightweight manager that unifies GPUs from NVIDIA, Apple Metal, AMD ROCm, Ascend CANN, and other accelerators into a single inference platform. It supports a broad catalog of models—including large language, vision‑language, diffusion, audio, and embedding models—through flexible backends such as vLLM, llama‑box, Ascend MindIE, and vox‑box. Users interact via a web UI or OpenAI‑compatible endpoints, benefiting from automatic resource evaluation, load balancing, and real‑time GPU monitoring.
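
As a quick illustration, the sketch below calls a GPUStack server through the standard `openai` Python client. The server address, API key, and model name are placeholders for your own deployment; the `/v1-openai` path follows the route GPUStack documents for its OpenAI‑compatible APIs.

```python
# Illustrative only: endpoint, key, and model name are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://your-gpustack-server/v1-openai",  # your server's address
    api_key="your-api-key",  # generated from the GPUStack UI
)

response = client.chat.completions.create(
    model="qwen3",  # any LLM deployed on your cluster
    messages=[{"role": "user", "content": "Say hello from GPUStack."}],
)
print(response.choices[0].message.content)
```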

Deployment

Deployments are container‑based on Linux (Docker with the NVIDIA Container Toolkit) and available as desktop installers for macOS and Windows. Adding worker nodes or GPUs scales out the cluster as soon as they register, while multi‑version backend support lets different models run on their optimal runtimes. API keys and user management secure access, making GPUStack suitable for internal AI services, multi‑tenant SaaS, or research clusters.
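
A minimal post‑deployment smoke test, assuming default settings, is to list the models the server exposes. The hostname and key below are placeholders for your own installation.

```python
# Post-deployment check: list the models the server currently serves.
import requests

BASE_URL = "http://localhost"  # assumes the server listens on the default HTTP port
API_KEY = "your-api-key"       # created on the GPUStack UI's API Keys page

resp = requests.get(
    f"{BASE_URL}/v1-openai/models",
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=10,
)
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])
```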

Highlights

Broad GPU compatibility across major vendors and OSes
Multi‑version backend support for diverse model runtimes
Distributed inference on heterogeneous multi‑node clusters
OpenAI‑compatible APIs with built‑in user and key management

Pros

  • Runs on a wide range of accelerators
  • Flexible integration with multiple inference backends
  • Scalable architecture for adding GPUs or nodes
  • Real‑time GPU performance monitoring

Considerations

  • Requires Docker and NVIDIA Container Toolkit for NVIDIA GPUs
  • Limited to inference; no native training support
  • Multi‑node setup may need additional networking configuration
  • Depends on supported accelerators and OS versions

Managed products teams compare with

When teams consider GPUStack, these hosted platforms usually appear on the same shortlist.

Amazon SageMaker

Fully managed machine learning service to build, train, and deploy ML models at scale

Anyscale

Ray-powered platform for scalable LLM training and inference.

BentoML

Open-source model serving framework to ship AI applications.

Looking for a hosted option? These are the services engineering teams benchmark against before choosing open source.

Fit guide

Great for

  • Enterprises needing on‑prem AI inference at scale
  • Teams with heterogeneous GPU hardware across vendors
  • Developers wanting OpenAI‑compatible endpoints on private infrastructure
  • Researchers requiring multi‑node, multi‑GPU model serving

Not ideal when

  • Workflows focused on model training rather than inference
  • Environments without Docker or container runtime support
  • Small single‑GPU deployments where overhead outweighs benefits
  • Organizations preferring fully managed cloud inference services

How teams use it

Internal chatbot powered by LLMs

Deploys Qwen3 or LLaMA models behind OpenAI‑compatible APIs for secure, low‑latency employee assistance.
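
A hedged sketch of the usual chatbot pattern: a streaming chat completion keeps time‑to‑first‑token low for interactive UIs. Server URL, key, and model name are placeholders.

```python
# Streaming delivers tokens incrementally instead of one final payload,
# which keeps perceived latency low for chat frontends. All names are
# placeholders for your own deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://your-gpustack-server/v1-openai",
    api_key="your-api-key",
)

stream = client.chat.completions.create(
    model="qwen3",  # or a deployed LLaMA variant
    messages=[{"role": "user", "content": "How do I reset my VPN token?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```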

On‑prem image generation service

Runs Stable Diffusion or FLUX across multiple GPUs, delivering high‑throughput image creation for design teams.
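
One way such a request could look, assuming the deployed diffusion model is reachable through the OpenAI‑style images endpoint; the model name and server address are hypothetical.

```python
# Assumes a diffusion model is deployed and exposed via the OpenAI-style
# images API; every name below is a placeholder.
import base64

from openai import OpenAI

client = OpenAI(
    base_url="http://your-gpustack-server/v1-openai",
    api_key="your-api-key",
)

result = client.images.generate(
    model="stable-diffusion-3.5",  # hypothetical deployed model name
    prompt="Isometric illustration of a GPU rack, flat pastel palette",
    n=1,
    response_format="b64_json",  # return the image inline as base64
)
with open("render.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```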

Speech‑to‑text transcription pipeline

Hosts Whisper models, exposing transcription endpoints that scale with added GPU nodes.
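
Assuming a Whisper model is deployed, a transcription request could look like this sketch; the model name and audio file are placeholders.

```python
# Transcribe a local audio file through the OpenAI-compatible audio API.
from openai import OpenAI

client = OpenAI(
    base_url="http://your-gpustack-server/v1-openai",  # placeholder
    api_key="your-api-key",                            # placeholder
)

with open("meeting.wav", "rb") as audio:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",  # hypothetical deployed model name
        file=audio,
    )
print(transcript.text)
```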

Multi‑tenant SaaS inference platform

Provides isolated API keys and load‑balanced inference for diverse customer models on shared GPU clusters.
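
A sketch of the isolation pattern: issue each tenant its own key from the GPUStack UI and scope every request to that key, so usage can be metered and revoked per customer. Keys, URL, and model below are placeholders.

```python
# Per-tenant API keys: each customer's traffic carries its own credential,
# so access can be revoked or tracked independently. All values are
# placeholders for illustration.
from openai import OpenAI

TENANT_KEYS = {
    "acme": "key-issued-to-acme",      # hypothetical keys from the UI
    "globex": "key-issued-to-globex",
}

def client_for(tenant: str) -> OpenAI:
    """Return an OpenAI-compatible client bound to one tenant's key."""
    return OpenAI(
        base_url="http://your-gpustack-server/v1-openai",
        api_key=TENANT_KEYS[tenant],
    )

reply = client_for("acme").chat.completions.create(
    model="qwen3",
    messages=[{"role": "user", "content": "Ping from tenant acme"}],
)
print(reply.choices[0].message.content)
```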

Tech snapshot

Python 92%
Shell 2%
PowerShell 2%
Jinja 2%
Dockerfile 1%
Makefile 1%

Tags

llama, mindie, inference, qwen, ascend, llm, llm-serving, maas, deepseek, cuda, llm-inference, sglang, genai, vllm, openai, distributed-inference, rocm, high-performance-inference

Frequently asked questions

How do I install GPUStack on a Linux server?

Install Docker and the NVIDIA Container Toolkit, then run the provided `docker run` command to start the GPUStack server.

Which GPU accelerators are supported?

GPUStack supports NVIDIA CUDA, Apple Metal, AMD ROCm, Ascend CANN, Hygon DTK, Moore Threads MUSA, Iluvatar Corex, and Cambricon MLU.

Can I add new models not listed in the catalog?

Yes, you can deploy models from Hugging Face, ModelScope, or a local file path by following the UI deployment workflow.

Does GPUStack handle model training?

GPUStack focuses on inference; training workflows are not provided out of the box.

How is API access secured?

API keys are generated per user, displayed only once, and can be managed via the UI's API Keys page.

Project at a glance

Status: Active
Stars: 4,398
Watchers: 4,398
Forks: 447
License: Apache-2.0
Repo age: 1 year old
Last commit: yesterday
Primary language: Python

Last synced yesterday