TensorRT LLM

Accelerated LLM inference with NVIDIA TensorRT optimizations

TensorRT LLM is a high‑performance inference toolkit that maximizes throughput and minimizes latency for large language models on NVIDIA GPUs, offering expert parallelism, speculative decoding, and edge‑ready Jetson support.

Overview

TensorRT LLM provides a comprehensive toolbox for deploying large language models (LLMs) at production scale on NVIDIA GPUs. By leveraging expert parallelism, KV‑cache reuse, and multiblock attention, it delivers industry‑leading token throughput while keeping latency low.

Capabilities & Deployment

The framework supports a wide range of open‑weight models, including GPT‑OSS, Llama, DeepSeek, and EXAONE, through checkpoint conversion and TensorRT engine generation. Advanced decoding strategies such as speculative and guided decoding can roughly triple the number of tokens produced per step. Pre‑built Docker containers and Jetson AGX Orin wheels simplify deployment on both data‑center clusters and edge devices, enabling developers to scale from a single GPU to multi‑node HGX systems.
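For orientation, the snippet below is a minimal offline‑inference sketch using the high‑level Python LLM API. The model name, sampling values, and the tensor_parallel_size setting are illustrative placeholders, and the exact API surface varies by release, so treat this as a sketch and consult the official documentation for your version.

```python
# Minimal offline-inference sketch with the high-level LLM API.
# Model name and sampling values are illustrative only; check the
# TensorRT LLM docs for the exact API surface of your release.
from tensorrt_llm import LLM, SamplingParams

prompts = [
    "Summarize the benefits of KV-cache reuse in one sentence.",
    "Explain speculative decoding to a new engineer.",
]

# Sampling settings are workload-dependent; these are just examples.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

# Loads a Hugging Face checkpoint and prepares an optimized engine.
# tensor_parallel_size > 1 would shard the model across multiple GPUs.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=1)

for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
```

Raising tensor_parallel_size is the usual way a single node scales across several GPUs before moving to multi‑node deployments.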

Getting Started

Comprehensive documentation, example scripts, and a roadmap guide users through model conversion, performance tuning, and auto‑scaling on platforms like AWS EKS. The open‑source nature encourages community contributions and rapid adoption of the latest NVIDIA GPU features.

Highlights

Expert parallelism for multi‑GPU scaling
Speculative and guided decoding to triple token throughput
KV‑cache reuse and multiblock attention for long sequences
Pre‑built containers and Jetson AGX Orin wheels for easy deployment

Pros

  • Industry‑leading throughput on NVIDIA GPUs
  • Supports a wide range of open‑weight LLMs (GPT‑OSS, Llama, DeepSeek, EXAONE)
  • Open‑source with extensive documentation and examples
  • Optimized for both data‑center and edge (Jetson) deployments

Considerations

  • Requires NVIDIA GPU hardware
  • Performance tuning may need deep expertise
  • Limited to model architectures supported by TensorRT LLM's conversion workflow
  • Container images can be large

Managed products teams compare with

When teams consider TensorRT LLM, these hosted platforms usually appear on the same shortlist.

Amazon SageMaker

Fully managed machine learning service to build, train, and deploy ML models at scale

Anyscale

Ray-powered platform for scalable LLM training and inference.

BentoML

Open-source model serving framework to ship AI applications.

Looking for a hosted option? These are the services engineering teams benchmark against before choosing open source.

Fit guide

Great for

  • Enterprises needing high‑QPS LLM serving
  • Developers building real‑time AI applications on NVIDIA GPUs
  • Researchers optimizing inference for large models
  • Edge AI projects on Jetson platforms

Not ideal when

  • CPU‑only environments
  • Non‑NVIDIA hardware deployments
  • Small‑scale inference where overhead outweighs benefits
  • Teams without GPU engineering expertise

How teams use it

High‑throughput chatbot service

Delivers >40,000 tokens/s per GPU, handling millions of user queries daily with sub‑10 ms latency.

Batch document summarization

Processes terabytes of text overnight using expert parallelism across multiple GPUs, reducing total runtime by 70%.

AI‑enhanced search on e‑commerce

Provides low‑latency query generation using speculative decoding, improving search relevance while keeping cost per token low.

Robotics perception on Jetson

Runs LLM‑driven language commands on Jetson AGX Orin with pre‑compiled wheels, enabling on‑device inference without cloud latency.

Tech snapshot

C++ 45%
Python 43%
CUDA 11%
Groovy 1%
CMake 1%
Shell 1%

Tags

moe, pytorch, blackwell, llm-serving, cuda

Frequently asked questions

Which NVIDIA GPUs are supported?

TensorRT LLM runs on all NVIDIA GPUs supported by TensorRT, including A100, H100, Blackwell‑generation GPUs such as B200, and Jetson AGX Orin.

Do I need to convert my model to ONNX?

Not necessarily. TensorRT LLM ships checkpoint‑conversion scripts that build engines directly from common Hugging Face checkpoints, and recent releases also provide a PyTorch‑based workflow, so a separate ONNX export step is generally not required for supported architectures.

Is there a ready‑to‑use container?

Official Docker images with TensorRT LLM pre‑installed are published on NVIDIA NGC and can be pulled directly.

How does speculative decoding improve performance?

It uses a lightweight draft mechanism to propose several tokens that the main model then verifies in a single step, which can roughly triple token throughput while preserving output quality.
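For intuition, here is a toy, self‑contained Python sketch of the draft‑then‑verify idea behind speculative decoding. The ToyModel class and token vocabulary are stand‑ins invented for illustration; this is not TensorRT LLM's implementation.

```python
import random

# Toy illustration of draft-then-verify speculative decoding.
# ToyModel and VOCAB are invented stand-ins, not TensorRT LLM code.
VOCAB = ["the", "cat", "sat", "on", "mat", "."]

class ToyModel:
    """Stand-in for an LLM: returns a pseudo-random next token."""
    def __init__(self, seed):
        self.rng = random.Random(seed)

    def next_token(self, context):
        # A real model would run a forward pass over `context` here.
        return self.rng.choice(VOCAB)

def speculative_step(target, draft, context, k=4):
    # 1) A cheap draft model proposes k candidate tokens sequentially.
    proposed, ctx = [], list(context)
    for _ in range(k):
        tok = draft.next_token(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2) The expensive target model checks the proposals (in a real
    #    system this verification is one batched GPU forward pass).
    accepted = []
    for tok in proposed:
        expected = target.next_token(context + accepted)
        if expected == tok:
            accepted.append(tok)       # proposal accepted, keep going
        else:
            accepted.append(expected)  # first mismatch: take the
            break                      # target's token and stop

    # Several tokens can be emitted per target-model step instead of
    # one, which is where the throughput gain comes from.
    return accepted

print(speculative_step(ToyModel(0), ToyModel(1), ["the"]))
```

Because a mismatch simply falls back to the target model's own token, output quality is preserved while the average number of tokens emitted per step goes up.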

Can I run TensorRT LLM on edge devices?

Support for Jetson AGX Orin is available via pre‑compiled wheels and containers, enabling on‑device inference.

Project at a glance

Status: Active
Stars: 12,689
Watchers: 12,689
Forks: 2,026
Repo age: 2 years old
Last commit: yesterday
Primary language: Python

Last synced yesterday