

High‑throughput LLM serving with intra‑device parallelism and asynchronous CPU scheduling
NanoFlow delivers up to 1.91× higher throughput than TensorRT‑LLM by overlapping compute, memory, and network operations on a single GPU, and supports Llama2, Llama3/3.1, and Qwen2 models up to 72B.

NanoFlow is a throughput‑oriented serving framework that maximizes GPU utilization through intra‑device parallelism. By breaking requests into nano‑batches and co‑scheduling compute, memory, and network operations, it keeps the critical compute path busy while overlapping other resource‑bound stages.
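The overlap pattern can be pictured with a short PyTorch sketch (an illustration of the idea, not NanoFlow's C++ scheduler): a batch is split into nano-batches, and the compute-bound stage of each nano-batch is issued on one CUDA stream while a memory-bound stage runs on another, so both proceed concurrently on a single device. The tensor shapes, stream names, and the stand-in memory stage are assumptions made for this example.

```python
# Illustration of the nano-batch overlap idea (not NanoFlow's scheduler):
# split a batch into nano-batches and issue compute-bound and
# memory-bound stages on separate CUDA streams of one GPU.
import torch

assert torch.cuda.is_available(), "example requires an NVIDIA GPU"
device = torch.device("cuda")

compute_stream = torch.cuda.Stream()   # GEMM-style, compute-bound work
aux_stream = torch.cuda.Stream()       # memory/network-bound work

weight = torch.randn(4096, 4096, device=device, dtype=torch.float16)
batch = torch.randn(64, 4096, device=device, dtype=torch.float16)

# Both streams must wait for the inputs created on the default stream.
compute_stream.wait_stream(torch.cuda.current_stream())
aux_stream.wait_stream(torch.cuda.current_stream())

outputs = []
for nano_batch in batch.chunk(4, dim=0):      # 4 nano-batches of 16 rows
    with torch.cuda.stream(compute_stream):
        outputs.append(nano_batch @ weight)   # compute-bound stage
    with torch.cuda.stream(aux_stream):
        # Stand-in for a memory-bound stage (e.g., KV-cache traffic);
        # pinned host buffers would be needed for a truly async copy.
        nano_batch.to("cpu", non_blocking=True)

torch.cuda.synchronize()
print(torch.cat(outputs).shape)               # torch.Size([64, 4096])
```

In NanoFlow itself the overlapped stages are real attention, GEMM, and network kernels co-scheduled by the C++ backend rather than the toy copy above.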
The system integrates state‑of‑the‑art kernels (CUTLASS, FlashInfer, MSCCL++) and provides a C++ backend with a Python demo frontend. It supports Llama2‑70B, Llama3‑70B/8B, Llama3.1‑70B/8B, and Qwen2‑72B, and includes scripts for environment setup and benchmark reproduction. Deployment is typically done via Docker on NVIDIA GPUs (e.g., A100 80 GB), with required system settings such as huge pages and io_uring enabled.
NanoFlow is aimed at enterprises, research labs, and SaaS providers that need to serve large‑scale LLM workloads with high throughput while maintaining reasonable latency. It excels in scenarios where multiple requests can be batched and where GPU resources are the primary bottleneck.
When teams consider NanoFlow, these hosted platforms usually appear on the same shortlist; they are the services engineering teams benchmark against before choosing open source.
Amazon SageMaker
Fully managed machine learning service to build, train, and deploy ML models at scale
High‑volume chat service
Sustains higher request rates with low per‑token latency for thousands of concurrent users
Batch inference for data labeling
Processes large corpora of text quickly, reducing total labeling time
Multi‑tenant SaaS LLM API
Provides isolated KV‑cache handling and efficient throughput across tenants
Offline token generation for fine‑tuning pipelines
Generates training data at scale while offloading KV‑cache to SSDs to save GPU memory
NanoFlow is optimized for NVIDIA GPUs such as the A100 80 GB; other GPUs work but may not achieve the same throughput gains.
The recommended setup uses Docker with the provided CUDA image, then installs the required libraries (pybind11, liburing, libopenmpi) and enables huge pages and io_uring.
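A small preflight script, sketched below under the assumption of a Linux host, can confirm the documented prerequisites before launching the container: an NVIDIA GPU is visible, huge pages are reserved, and the kernel is new enough for io_uring (Linux 5.1+). The script and its checks are illustrative, not part of NanoFlow.

```python
# Illustrative preflight check for the documented prerequisites
# (NVIDIA GPU visible, huge pages reserved, io_uring-capable kernel).
# Not part of NanoFlow; adjust the checks to your own setup.
import platform
import shutil
import subprocess

def gpu_visible() -> bool:
    """True if nvidia-smi is installed and lists at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    return result.returncode == 0 and "GPU" in result.stdout

def huge_pages_reserved() -> int:
    """Number of huge pages currently reserved (HugePages_Total)."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("HugePages_Total:"):
                return int(line.split()[1])
    return 0

def io_uring_capable() -> bool:
    """io_uring was introduced in Linux 5.1; check the running kernel."""
    major, minor = platform.release().split(".")[:2]
    return (int(major), int(minor)) >= (5, 1)

if __name__ == "__main__":
    print("NVIDIA GPU visible:", gpu_visible())
    print("Huge pages reserved:", huge_pages_reserved())
    print("io_uring-capable kernel:", io_uring_capable())
```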
Quantization is not built into NanoFlow; users must apply quantized model weights before loading if needed.
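For teams that do want quantized weights, that step happens entirely outside NanoFlow. The sketch below shows generic symmetric per-channel int8 weight quantization in PyTorch purely as an illustration; the checkpoint format NanoFlow expects is not described here, so the function names and layout are assumptions.

```python
# Generic offline per-channel int8 weight quantization (illustration only;
# the checkpoint format NanoFlow expects is not specified here).
import torch

def quantize_per_channel_int8(weight: torch.Tensor):
    """Symmetric per-output-channel int8 quantization of a 2-D weight."""
    # One scale per output row so large rows don't clip small ones.
    max_abs = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)
    scale = max_abs / 127.0
    q = torch.clamp(torch.round(weight / scale), -127, 127).to(torch.int8)
    return q, scale.squeeze(1)

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate fp16 weight for verification."""
    return (q.to(torch.float32) * scale.unsqueeze(1)).to(torch.float16)

if __name__ == "__main__":
    w = torch.randn(4096, 4096, dtype=torch.float16)
    q, s = quantize_per_channel_int8(w.float())
    err = (dequantize(q, s).float() - w.float()).abs().max()
    print(f"max reconstruction error: {err:.4f}")
```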
The KV‑cache entries of finished requests are copied to the host SSD in parallel with ongoing inference, using a layer‑by‑layer transfer that requires only modest bandwidth (~5 GB/s for Llama2‑70B).
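The pattern can be approximated with the PyTorch sketch below (an illustration, not NanoFlow's C++ implementation): a finished request's per-layer K/V tensors are staged into pinned host buffers on a side CUDA stream, so the transfer overlaps decoding of other requests, and then written to an SSD-backed file. The function name, shapes, and model geometry are assumptions.

```python
# Illustrative layer-by-layer KV-cache offload for one finished request
# (not NanoFlow's implementation; names, shapes, and geometry are assumed).
import torch

NUM_LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128        # Llama2-70B-like geometry

def offload_request(kv_cache, path):
    """Copy one finished request's per-layer (K, V) tensors to an SSD file."""
    side_stream = torch.cuda.Stream()
    # The side stream must see the cache written by the decoding stream.
    side_stream.wait_stream(torch.cuda.current_stream())
    staged = []
    with torch.cuda.stream(side_stream):
        for k, v in kv_cache:                       # layer by layer
            k_host = torch.empty(k.shape, dtype=k.dtype, pin_memory=True)
            v_host = torch.empty(v.shape, dtype=v.dtype, pin_memory=True)
            k_host.copy_(k, non_blocking=True)      # async device-to-host copy
            v_host.copy_(v, non_blocking=True)
            staged.append((k_host, v_host))
    side_stream.synchronize()                       # all copies landed in host RAM
    torch.save(staged, path)                        # write-out hits the SSD

if __name__ == "__main__":
    assert torch.cuda.is_available(), "example requires an NVIDIA GPU"
    seq_len = 512
    cache = [
        (torch.randn(seq_len, KV_HEADS, HEAD_DIM, device="cuda", dtype=torch.float16),
         torch.randn(seq_len, KV_HEADS, HEAD_DIM, device="cuda", dtype=torch.float16))
        for _ in range(NUM_LAYERS)
    ]
    # Back-of-envelope: 2 * 80 * 8 * 128 * 2 bytes ≈ 0.31 MB of KV cache per
    # token for this geometry, so ~5 GB/s covers roughly 15k generated tokens/s.
    offload_request(cache, "/tmp/request_0_kv.pt")
```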
NanoFlow accepts standard model checkpoints compatible with the integrated kernel libraries.
Project at a glance
Active · Last synced 4 days ago