

Accelerated LLM inference with NVIDIA TensorRT optimizations
TensorRT LLM is a high‑performance inference toolkit that maximizes throughput and minimizes latency for large language models on NVIDIA GPUs, offering expert parallelism, speculative decoding, and edge‑ready Jetson support.

TensorRT LLM provides a comprehensive toolbox for deploying large language models (LLMs) at production scale on NVIDIA GPUs. By leveraging expert parallelism, KV‑cache reuse, and multiblock attention, it delivers industry‑leading token throughput while keeping latency low.
The framework supports a wide range of open‑weight models, including GPT‑OSS, Llama, DeepSeek, and EXAONE, through checkpoint conversion and TensorRT engine generation. Advanced decoding strategies such as speculative decoding can roughly triple the tokens produced per step, while guided decoding constrains outputs to a required format. Pre‑built Docker containers and Jetson AGX Orin wheels simplify deployment on both data‑center clusters and edge devices, enabling developers to scale from a single GPU to multi‑node HGX systems.
Comprehensive documentation, example scripts, and a roadmap guide users through model conversion, performance tuning, and auto‑scaling on platforms like AWS EKS. The open‑source nature encourages community contributions and rapid adoption of the latest NVIDIA GPU features.
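
As a concrete starting point, the high‑level Python LLM API can load a supported model and run inference in a few lines. The sketch below is a minimal example, not a definitive recipe: the model identifier and sampling values are illustrative, and the exact API surface may vary between releases.

    from tensorrt_llm import LLM, SamplingParams

    # Constructing the LLM loads (or builds) a TensorRT engine for the model.
    # The model id is illustrative; any supported checkpoint should work.
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

    prompts = ["Explain KV-cache reuse in one sentence."]
    params = SamplingParams(temperature=0.8, top_p=0.95)

    for output in llm.generate(prompts, params):
        print(output.outputs[0].text)

On a single GPU this is already enough to serve requests; the same API exposes the parallelism options used in the multi‑GPU scenarios below.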
When teams consider TensorRT LLM, hosted platforms usually appear on the same shortlist; these are the services engineering teams benchmark against before choosing open source.
Amazon SageMaker
Fully managed machine learning service to build, train, and deploy ML models at scale.
High‑throughput chatbot service
Delivers >40,000 tokens/s per GPU, handling millions of user queries daily with sub‑10 ms latency.
Batch document summarization
Processes terabytes of text overnight using expert parallelism across multiple GPUs, reducing total runtime by 70% (a multi‑GPU sketch follows these use cases).
AI‑enhanced search on e‑commerce
Provides low‑latency query generation using speculative decoding, improving search relevance while keeping cost per token low.
Robotics perception on Jetson
Runs LLM‑driven language commands on Jetson AGX Orin with pre‑compiled wheels, enabling on‑device inference without cloud latency.
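
To ground the batch‑summarization scenario, here is a hedged sketch of multi‑GPU offline generation with the Python LLM API. The model id, GPU count, and prompt template are assumptions; the sketch uses tensor parallelism as the parallelism mode, since expert parallelism applies specifically to mixture‑of‑experts models and is configured through a separate option in recent releases.

    from tensorrt_llm import LLM, SamplingParams

    # Placeholder inputs; in practice these would be loaded from storage.
    documents = ["<document 1 text>", "<document 2 text>", "<document 3 text>"]
    prompts = [f"Summarize the following document:\n\n{d}" for d in documents]

    # tensor_parallel_size shards the model across 4 GPUs on one node.
    # (Expert parallelism for MoE models is enabled through a similar option;
    # check the documentation of your release for the exact parameter.)
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=4)

    params = SamplingParams(temperature=0.2, top_p=0.9)

    # The runtime batches prompts with in-flight batching, so submitting the
    # whole list at once is typically faster than looping one prompt at a time.
    for result in llm.generate(prompts, params):
        print(result.outputs[0].text)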
Which GPUs does TensorRT LLM support?
TensorRT LLM runs on all NVIDIA GPUs supported by TensorRT, including A100, H100, Blackwell‑generation B200, and Jetson AGX Orin.
Do models need to be converted before deployment?
Yes: checkpoints are converted into TensorRT engines before serving, and the toolkit includes scripts to assist the conversion.
Are pre‑built Docker images available?
Official Docker images with TensorRT LLM pre‑installed are published and can be pulled from NVIDIA NGC (a client‑side example follows this list).
How does speculative decoding improve throughput?
It generates and verifies multiple tokens per GPU step, effectively tripling token throughput while preserving model accuracy.
Can it run on edge devices?
Support for Jetson AGX Orin is available via pre‑compiled wheels and containers, enabling on‑device inference.
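
Once a model is served from one of these containers (for example via the toolkit's OpenAI‑compatible serving entry point), any OpenAI‑style SDK can act as the client. The sketch below assumes a local deployment; the endpoint URL, port, and model name are assumptions, not fixed values.

    from openai import OpenAI

    # Assumed local endpoint of an OpenAI-compatible TensorRT LLM server;
    # adjust host, port, and model name to match your deployment.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model name
        messages=[{"role": "user", "content": "Summarize this release note in two sentences."}],
        max_tokens=128,
    )
    print(response.choices[0].message.content)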
Project at a glance
Status: Active. Last synced 4 days ago.