TensorRT LLM

Accelerated LLM inference with NVIDIA TensorRT optimizations

TensorRT LLM is a high‑performance inference toolkit that maximizes throughput and minimizes latency for large language models on NVIDIA GPUs, offering expert parallelism, speculative decoding, and edge‑ready Jetson support.

Overview

TensorRT LLM provides a comprehensive toolbox for deploying large language models (LLMs) at production scale on NVIDIA GPUs. By leveraging expert parallelism, KV‑cache reuse, and multiblock attention, it delivers industry‑leading token throughput while keeping latency low.

Capabilities & Deployment

The framework supports a wide range of open‑weight models, including GPT‑OSS, Llama, DeepSeek, and EXAONE, through checkpoint conversion and TensorRT engine generation. Advanced decoding strategies such as speculative and guided decoding can roughly triple the number of tokens produced per step. Pre‑built Docker containers and Jetson AGX Orin wheels simplify deployment on both data‑center clusters and edge devices, enabling developers to scale from a single GPU to multi‑node HGX systems.
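For orientation, the snippet below is a minimal offline‑inference sketch using the high‑level Python LLM API. The model name, sampling values, and the tensor_parallel_size setting are illustrative placeholders, and the exact API surface varies by release, so treat this as a sketch and consult the official documentation for your version.

```python
# Minimal offline-inference sketch with the high-level LLM API.
# Model name and sampling values are illustrative only; check the
# TensorRT LLM docs for the exact API surface of your release.
from tensorrt_llm import LLM, SamplingParams

prompts = [
    "Summarize the benefits of KV-cache reuse in one sentence.",
    "Explain speculative decoding to a new engineer.",
]

# Sampling settings are workload-dependent; these are just examples.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

# Loads a Hugging Face checkpoint and prepares an optimized engine.
# tensor_parallel_size > 1 would shard the model across multiple GPUs.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=1)

for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
```

Raising tensor_parallel_size is the usual way a single node scales across several GPUs before moving to multi‑node deployments.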

Getting Started

Comprehensive documentation, example scripts, and a roadmap guide users through model conversion, performance tuning, and auto‑scaling on platforms like AWS EKS. The open‑source nature encourages community contributions and rapid adoption of the latest NVIDIA GPU features.

Highlights

Expert parallelism for multi‑GPU scaling
Speculative and guided decoding to triple token throughput
KV‑cache reuse and multiblock attention for long sequences
Pre‑built containers and Jetson AGX Orin wheels for easy deployment

Pros

  • Industry‑leading throughput on NVIDIA GPUs
  • Supports a wide range of open‑weight LLMs (GPT‑OSS, Llama, DeepSeek, EXAONE)
  • Open‑source with extensive documentation and examples
  • Optimized for both data‑center and edge (Jetson) deployments

Considerations

  • Requires NVIDIA GPU hardware
  • Performance tuning may need deep expertise
  • Limited to model architectures supported by TensorRT LLM's conversion workflow
  • Container images can be large

Managed products teams compare with

When teams consider TensorRT LLM, these hosted platforms usually appear on the same shortlist.

Amazon SageMaker

Fully managed machine learning service to build, train, and deploy ML models at scale

Anyscale

Ray-powered platform for scalable LLM training and inference.

BentoML

Open-source model serving framework to ship AI applications.

Looking for a hosted option? These are the services engineering teams benchmark against before choosing open source.

Fit guide

Great for

  • Enterprises needing high‑QPS LLM serving
  • Developers building real‑time AI applications on NVIDIA GPUs
  • Researchers optimizing inference for large models
  • Edge AI projects on Jetson platforms

Not ideal when

  • CPU‑only environments
  • Non‑NVIDIA hardware deployments
  • Small‑scale inference where overhead outweighs benefits
  • Teams without GPU engineering expertise

How teams use it

High‑throughput chatbot service

Delivers >40,000 tokens/s per GPU, handling millions of user queries daily with sub‑10 ms latency.

Batch document summarization

Processes terabytes of text overnight using expert parallelism across multiple GPUs, reducing total runtime by 70%.

AI‑enhanced search on e‑commerce

Provides low‑latency query generation using speculative decoding, improving search relevance while keeping cost per token low.

Robotics perception on Jetson

Runs LLM‑driven language commands on Jetson AGX Orin with pre‑compiled wheels, enabling on‑device inference without cloud latency.

Tech snapshot

C++ 45%
Python 43%
CUDA 11%
Groovy 1%
CMake 1%
Shell 1%

Tags

moe, pytorch, blackwell, llm-serving, cuda

Frequently asked questions

Which NVIDIA GPUs are supported?

TensorRT LLM runs on all NVIDIA GPUs supported by TensorRT, including A100, H100, Blackwell‑generation GPUs such as B200, and Jetson AGX Orin.

Do I need to convert my model to ONNX?

Not necessarily. TensorRT LLM ships checkpoint‑conversion scripts that build engines directly from common Hugging Face checkpoints, and recent releases also provide a PyTorch‑based workflow, so a separate ONNX export step is generally not required for supported architectures.

Is there a ready‑to‑use container?

Official Docker images with TensorRT LLM pre‑installed are published on NVIDIA NGC and can be pulled directly.

How does speculative decoding improve performance?

It uses a lightweight draft mechanism to propose several tokens that the main model then verifies in a single step, which can roughly triple token throughput while preserving output quality.
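For intuition, here is a toy, self‑contained Python sketch of the draft‑then‑verify idea behind speculative decoding. The ToyModel class and token vocabulary are stand‑ins invented for illustration; this is not TensorRT LLM's implementation.

```python
import random

# Toy illustration of draft-then-verify speculative decoding.
# ToyModel and VOCAB are invented stand-ins, not TensorRT LLM code.
VOCAB = ["the", "cat", "sat", "on", "mat", "."]

class ToyModel:
    """Stand-in for an LLM: returns a pseudo-random next token."""
    def __init__(self, seed):
        self.rng = random.Random(seed)

    def next_token(self, context):
        # A real model would run a forward pass over `context` here.
        return self.rng.choice(VOCAB)

def speculative_step(target, draft, context, k=4):
    # 1) A cheap draft model proposes k candidate tokens sequentially.
    proposed, ctx = [], list(context)
    for _ in range(k):
        tok = draft.next_token(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2) The expensive target model checks the proposals (in a real
    #    system this verification is one batched GPU forward pass).
    accepted = []
    for tok in proposed:
        expected = target.next_token(context + accepted)
        if expected == tok:
            accepted.append(tok)       # proposal accepted, keep going
        else:
            accepted.append(expected)  # first mismatch: take the
            break                      # target's token and stop

    # Several tokens can be emitted per target-model step instead of
    # one, which is where the throughput gain comes from.
    return accepted

print(speculative_step(ToyModel(0), ToyModel(1), ["the"]))
```

Because a mismatch simply falls back to the target model's own token, output quality is preserved while the average number of tokens emitted per step goes up.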

Can I run TensorRT LLM on edge devices?

Support for Jetson AGX Orin is available via pre‑compiled wheels and containers, enabling on‑device inference.

Project at a glance

Status: Active
Stars: 12,689
Watchers: 12,689
Forks: 2,026
Repo age: 2 years old
Last commit: yesterday
Primary language: Python

Last synced yesterday