

High‑throughput LLM serving with intra‑device parallelism and asynchronous CPU scheduling
NanoFlow delivers up to 1.91× higher throughput than TensorRT‑LLM by overlapping compute, memory, and network operations on a single GPU, and supports Llama2, Llama3/3.1, and Qwen2 models up to 72B.

NanoFlow is a throughput‑oriented serving framework that maximizes GPU utilization through intra‑device parallelism. By breaking requests into nano‑batches and co‑scheduling compute, memory, and network operations, it keeps the critical compute path busy while overlapping other resource‑bound stages.
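The overlap pattern can be pictured with a short PyTorch sketch (an illustration of the idea, not NanoFlow's C++ scheduler): a batch is split into nano-batches, and the compute-bound stage of each nano-batch is issued on one CUDA stream while a memory-bound stage runs on another, so both proceed concurrently on a single device. The tensor shapes, stream names, and the stand-in memory stage are assumptions made for this example.

```python
# Illustration of the nano-batch overlap idea (not NanoFlow's scheduler):
# split a batch into nano-batches and issue compute-bound and
# memory-bound stages on separate CUDA streams of one GPU.
import torch

assert torch.cuda.is_available(), "example requires an NVIDIA GPU"
device = torch.device("cuda")

compute_stream = torch.cuda.Stream()   # GEMM-style, compute-bound work
aux_stream = torch.cuda.Stream()       # memory/network-bound work

weight = torch.randn(4096, 4096, device=device, dtype=torch.float16)
batch = torch.randn(64, 4096, device=device, dtype=torch.float16)

# Both streams must wait for the inputs created on the default stream.
compute_stream.wait_stream(torch.cuda.current_stream())
aux_stream.wait_stream(torch.cuda.current_stream())

outputs = []
for nano_batch in batch.chunk(4, dim=0):      # 4 nano-batches of 16 rows
    with torch.cuda.stream(compute_stream):
        outputs.append(nano_batch @ weight)   # compute-bound stage
    with torch.cuda.stream(aux_stream):
        # Stand-in for a memory-bound stage (e.g., KV-cache traffic);
        # pinned host buffers would be needed for a truly async copy.
        nano_batch.to("cpu", non_blocking=True)

torch.cuda.synchronize()
print(torch.cat(outputs).shape)               # torch.Size([64, 4096])
```

In NanoFlow itself the overlapped stages are real attention, GEMM, and network kernels co-scheduled by the C++ backend rather than the toy copy above.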
The system integrates state‑of‑the‑art kernels (CUTLASS, FlashInfer, MSCCL++) and provides a C++ backend with a Python demo frontend. It supports Llama2‑70B, Llama3‑70B/8B, Llama3.1‑70B/8B, and Qwen2‑72B, and includes scripts for environment setup and benchmark reproduction. Deployment is typically done via Docker on NVIDIA GPUs (e.g., A100 80 GB), with required system settings such as huge pages and io_uring enabled.
NanoFlow is aimed at enterprises, research labs, and SaaS providers that need to serve large‑scale LLM workloads with high throughput while maintaining reasonable latency. It excels in scenarios where multiple requests can be batched and where GPU resources are the primary bottleneck.
When teams consider NanoFlow, these hosted platforms usually appear on the same shortlist; they are the services engineering teams benchmark against before choosing open source.
Amazon SageMaker
Fully managed machine learning service to build, train, and deploy ML models at scale
High‑volume chat service
Sustains higher request rates with low per‑token latency for thousands of concurrent users
Batch inference for data labeling
Processes large corpora of text quickly, reducing total labeling time
Multi‑tenant SaaS LLM API
Provides isolated KV‑cache handling and efficient throughput across tenants
Offline token generation for fine‑tuning pipelines
Generates training data at scale while offloading KV‑cache to SSDs to save GPU memory
NanoFlow is optimized for NVIDIA GPUs such as the A100 80 GB; other GPUs work but may not achieve the same throughput gains.
The recommended setup uses Docker with the provided CUDA image, then installs the required libraries (pybind11, liburing, libopenmpi) and enables huge pages and io_uring.
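A small preflight script, sketched below under the assumption of a Linux host, can confirm the documented prerequisites before launching the container: an NVIDIA GPU is visible, huge pages are reserved, and the kernel is new enough for io_uring (Linux 5.1+). The script and its checks are illustrative, not part of NanoFlow.

```python
# Illustrative preflight check for the documented prerequisites
# (NVIDIA GPU visible, huge pages reserved, io_uring-capable kernel).
# Not part of NanoFlow; adjust the checks to your own setup.
import platform
import shutil
import subprocess

def gpu_visible() -> bool:
    """True if nvidia-smi is installed and lists at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    return result.returncode == 0 and "GPU" in result.stdout

def huge_pages_reserved() -> int:
    """Number of huge pages currently reserved (HugePages_Total)."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("HugePages_Total:"):
                return int(line.split()[1])
    return 0

def io_uring_capable() -> bool:
    """io_uring was introduced in Linux 5.1; check the running kernel."""
    major, minor = platform.release().split(".")[:2]
    return (int(major), int(minor)) >= (5, 1)

if __name__ == "__main__":
    print("NVIDIA GPU visible:", gpu_visible())
    print("Huge pages reserved:", huge_pages_reserved())
    print("io_uring-capable kernel:", io_uring_capable())
```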
Quantization is not built into NanoFlow; users must apply quantized model weights before loading if needed.
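For teams that do want quantized weights, that step happens entirely outside NanoFlow. The sketch below shows generic symmetric per-channel int8 weight quantization in PyTorch purely as an illustration; the checkpoint format NanoFlow expects is not described here, so the function names and layout are assumptions.

```python
# Generic offline per-channel int8 weight quantization (illustration only;
# the checkpoint format NanoFlow expects is not specified here).
import torch

def quantize_per_channel_int8(weight: torch.Tensor):
    """Symmetric per-output-channel int8 quantization of a 2-D weight."""
    # One scale per output row so large rows don't clip small ones.
    max_abs = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)
    scale = max_abs / 127.0
    q = torch.clamp(torch.round(weight / scale), -127, 127).to(torch.int8)
    return q, scale.squeeze(1)

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate fp16 weight for verification."""
    return (q.to(torch.float32) * scale.unsqueeze(1)).to(torch.float16)

if __name__ == "__main__":
    w = torch.randn(4096, 4096, dtype=torch.float16)
    q, s = quantize_per_channel_int8(w.float())
    err = (dequantize(q, s).float() - w.float()).abs().max()
    print(f"max reconstruction error: {err:.4f}")
```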
The KV‑cache entries of finished requests are copied to the host SSD in parallel with ongoing inference, using a layer‑by‑layer transfer that requires only modest bandwidth (~5 GB/s for Llama2‑70B).
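The pattern can be approximated with the PyTorch sketch below (an illustration, not NanoFlow's C++ implementation): a finished request's per-layer K/V tensors are staged into pinned host buffers on a side CUDA stream, so the transfer overlaps decoding of other requests, and then written to an SSD-backed file. The function name, shapes, and model geometry are assumptions.

```python
# Illustrative layer-by-layer KV-cache offload for one finished request
# (not NanoFlow's implementation; names, shapes, and geometry are assumed).
import torch

NUM_LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128        # Llama2-70B-like geometry

def offload_request(kv_cache, path):
    """Copy one finished request's per-layer (K, V) tensors to an SSD file."""
    side_stream = torch.cuda.Stream()
    # The side stream must see the cache written by the decoding stream.
    side_stream.wait_stream(torch.cuda.current_stream())
    staged = []
    with torch.cuda.stream(side_stream):
        for k, v in kv_cache:                       # layer by layer
            k_host = torch.empty(k.shape, dtype=k.dtype, pin_memory=True)
            v_host = torch.empty(v.shape, dtype=v.dtype, pin_memory=True)
            k_host.copy_(k, non_blocking=True)      # async device-to-host copy
            v_host.copy_(v, non_blocking=True)
            staged.append((k_host, v_host))
    side_stream.synchronize()                       # all copies landed in host RAM
    torch.save(staged, path)                        # write-out hits the SSD

if __name__ == "__main__":
    assert torch.cuda.is_available(), "example requires an NVIDIA GPU"
    seq_len = 512
    cache = [
        (torch.randn(seq_len, KV_HEADS, HEAD_DIM, device="cuda", dtype=torch.float16),
         torch.randn(seq_len, KV_HEADS, HEAD_DIM, device="cuda", dtype=torch.float16))
        for _ in range(NUM_LAYERS)
    ]
    # Back-of-envelope: 2 * 80 * 8 * 128 * 2 bytes ≈ 0.31 MB of KV cache per
    # token for this geometry, so ~5 GB/s covers roughly 15k generated tokens/s.
    offload_request(cache, "/tmp/request_0_kv.pt")
```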
NanoFlow accepts standard model checkpoints compatible with the integrated kernel libraries.
Project at a glance
Active · Last synced 4 days ago