NanoFlow

High‑throughput LLM serving with intra‑device parallelism and asynchronous CPU scheduling

NanoFlow delivers up to 1.91× higher throughput than TensorRT‑LLM by overlapping compute‑, memory‑, and network‑bound operations on a single GPU, and supports Llama2/3 and Qwen2 models up to 72B.

Overview

NanoFlow is a throughput‑oriented serving framework that maximizes GPU utilization through intra‑device parallelism. By breaking requests into nano‑batches and co‑scheduling compute, memory, and network operations, it keeps the critical compute path busy while overlapping other resource‑bound stages.
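
To make the overlap concrete, here is a minimal conceptual sketch in Python (not NanoFlow's actual implementation; the stage functions, lane structure, and timings are invented for illustration). Each resource gets a single-worker "lane", and nano-batches are pipelined through the lanes so the compute lane stays busy while memory- and network-bound work for other nano-batches proceeds alongside it.

```python
# Conceptual sketch of nano-batch overlap (illustrative only; stage names and
# timings are invented, not taken from NanoFlow). One single-worker executor
# per resource approximates three co-scheduled "execution units" on one GPU.
from concurrent.futures import ThreadPoolExecutor
import time

NANO_BATCHES = 8  # a large request batch split into 8 nano-batches

def memory_stage(nb):   # stand-in for memory-bound work (e.g. KV-cache reads)
    time.sleep(0.02); return nb

def compute_stage(nb):  # stand-in for compute-bound work (e.g. dense GEMMs)
    time.sleep(0.03); return nb

def network_stage(nb):  # stand-in for network-bound work (e.g. all-reduce)
    time.sleep(0.01); return nb

with ThreadPoolExecutor(1) as mem, ThreadPoolExecutor(1) as comp, ThreadPoolExecutor(1) as net:
    futures = []
    for nb in range(NANO_BATCHES):
        f = mem.submit(memory_stage, nb)                             # lane 1
        f = comp.submit(lambda fut=f: compute_stage(fut.result()))   # lane 2
        f = net.submit(lambda fut=f: network_stage(fut.result()))    # lane 3
        futures.append(f)
    print("pipelined nano-batches:", [f.result() for f in futures])
```

After warm-up the compute lane runs back to back, which is the property NanoFlow targets: the compute-bound critical path never idles waiting on memory- or network-bound stages.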

Capabilities & Deployment

The system integrates state‑of‑the‑art kernels (CUTLASS, FlashInfer, MSCCL++) and provides a C++ backend with a Python demo frontend. It supports Llama2‑70B, Llama3‑70B/8B, Llama3.1‑70B/8B, and Qwen2‑72B, and includes scripts for environment setup and benchmark reproduction. Deployment is typically done via Docker on NVIDIA GPUs (e.g., A100 80 GB) and requires system configuration such as enabling huge pages and io_uring.

Target Audience

NanoFlow is aimed at enterprises, research labs, and SaaS providers that need to serve large‑scale LLM workloads with high throughput while maintaining reasonable latency. It excels in scenarios where multiple requests can be batched and where GPU resources are the primary bottleneck.

Highlights

Intra‑device parallelism with nano‑batching and execution unit scheduling
Asynchronous CPU scheduling for KV‑cache management and batch formation (sketched after this list)
Integration with CUTLASS, FlashInfer, and MSCCL++ kernel libraries
Support for Llama2/3 and Qwen2 models up to 72 B parameters
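
The asynchronous CPU scheduling highlighted above can be pictured with a small hypothetical sketch (not NanoFlow's scheduler): a CPU thread forms the next batch and reserves made-up KV-cache page IDs ahead of time, so batch formation stays off the GPU's critical path.

```python
# Hypothetical sketch of asynchronous CPU scheduling (illustrative only):
# while the "GPU" loop executes iteration t, a CPU thread prepares the batch
# and KV-cache bookkeeping for iteration t + 1.
import queue
import threading
import time

ready_batches = queue.Queue(maxsize=2)   # batches prepared ahead of the GPU

def cpu_scheduler(num_iters):
    for t in range(num_iters):
        batch = {"iter": t, "requests": list(range(4)),
                 "kv_pages": [t * 4 + i for i in range(4)]}  # invented page IDs
        time.sleep(0.005)                # stand-in for batch formation work
        ready_batches.put(batch)
    ready_batches.put(None)              # sentinel: no more work

def gpu_loop():
    while (batch := ready_batches.get()) is not None:
        time.sleep(0.02)                 # stand-in for one GPU forward pass
        print(f"iter {batch['iter']}: {len(batch['requests'])} requests")

threading.Thread(target=cpu_scheduler, args=(8,), daemon=True).start()
gpu_loop()   # CPU-side scheduling overlaps the "GPU" iterations
```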

Pros

  • Achieves up to 1.91× higher throughput than TensorRT‑LLM
  • Efficient GPU utilization through overlapping resource usage
  • Low CPU overhead thanks to async control flow
  • Open‑source C++ backend with Python demo

Considerations

  • Best performance observed on high‑end NVIDIA GPUs (e.g., A100)
  • Complex installation requiring Docker, huge pages, and io_uring
  • Currently limited to models up to 72 B parameters
  • No built‑in quantization or speculative decoding features

Managed products teams compare with

When teams consider NanoFlow, these hosted platforms usually appear on the same shortlist.

Amazon SageMaker

Fully managed machine learning service to build, train, and deploy ML models at scale

Anyscale

Ray-powered platform for scalable LLM training and inference.

BentoML

Open-source model serving framework to ship AI applications.

Looking for a hosted option? These are the services engineering teams benchmark against before choosing open source.

Fit guide

Great for

  • Enterprises needing high‑throughput LLM inference at scale
  • Researchers benchmarking serving performance of large models
  • Teams deploying Llama2/3 or Qwen2 models in multi‑GPU clusters
  • Workloads with mixed compute, memory, and network demands

Not ideal when

  • Small‑scale deployments on consumer‑grade GPUs
  • Users requiring out‑of‑the‑box quantized or compressed models
  • Environments without Docker or root access for system tweaks
  • Ultra‑low‑latency, single‑request scenarios where per‑request latency matters more than throughput

How teams use it

High‑volume chat service

Sustains higher request rates with low per‑token latency for thousands of concurrent users

Batch inference for data labeling

Processes large corpora of text quickly, reducing total labeling time

Multi‑tenant SaaS LLM API

Provides isolated KV‑cache handling and efficient throughput across tenants

Offline token generation for fine‑tuning pipelines

Generates training data at scale while offloading KV‑cache to SSDs to save GPU memory

Tech snapshot

Jupyter Notebook 48%
Python 41%
Cuda 7%
C++ 3%
CMake 1%
Shell 1%

Tags

model-serving, inference, llm, llm-serving, cuda, llama2

Frequently asked questions

What hardware is required to run NanoFlow effectively?

NanoFlow is optimized for NVIDIA GPUs such as the A100 80 GB; other GPUs work but may not achieve the same throughput gains.

How is NanoFlow installed?

The recommended method is to use Docker with the provided CUDA image, install the required libraries (pybind11, liburing, libopenmpi), and configure huge pages and io_uring.

Does NanoFlow support model quantization?

Quantization is not built into NanoFlow; users must apply quantized model weights before loading if needed.

How does KV‑cache offloading work?

The KV‑cache entries of finished requests are copied to the host SSD in parallel with ongoing inference, using a layer‑by‑layer transfer that requires only modest bandwidth (~5 GB/s for LLaMA2‑70B).
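
As a rough back-of-envelope check (not a figure from NanoFlow itself), the sketch below multiplies the per-token KV-cache footprint of LLaMA2-70B (standard shape constants: 80 layers, 8 grouped-query KV heads, head dimension 128, fp16) by an assumed rate of retired tokens; the result lands in the same few-GB/s range quoted above.

```python
# Back-of-envelope KV-cache offload bandwidth (illustrative; the token rate
# below is an assumption, not a NanoFlow measurement).
layers        = 80      # LLaMA2-70B transformer layers
kv_heads      = 8       # grouped-query attention KV heads
head_dim      = 128
bytes_per_val = 2       # fp16

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_val  # K + V
finished_tokens_per_s = 16_000          # assumed tokens retired per second

bandwidth = kv_bytes_per_token * finished_tokens_per_s
print(f"KV per token : {kv_bytes_per_token / 2**10:.0f} KiB")
print(f"Offload rate : {bandwidth / 1e9:.1f} GB/s")  # ≈ 5 GB/s at this rate
```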

Can NanoFlow be used with existing model formats?

Yes, it accepts standard model checkpoints compatible with the integrated kernel libraries.

Project at a glance

Status: Active
Stars: 940
Watchers: 940
Forks: 46
Repo age: 1 year old
Last commit: 3 months ago
Primary language: Jupyter Notebook
