TensorZero

Unified, high-performance gateway for industrial-grade LLM applications

TensorZero provides a fast, extensible stack for LLM applications: a gateway plus observability, optimization, evaluation, and experimentation, with support for dozens of providers, streaming, multimodal inputs, and high-throughput workloads.

Overview

TensorZero is a modular stack that lets developers access every major LLM provider through a single, high-performance gateway. Built in Rust, the gateway adds less than 1 ms of p99 latency overhead and sustains over 10,000 queries per second, while supporting streaming, tool use, batch inference, embeddings, multimodal inputs, and caching.
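
For illustration, a single chat inference through the gateway's HTTP API might look like the following Python sketch (the gateway URL, the /inference path, and the "my_chat" function name are assumptions for illustration; adjust to your deployment and configuration):

```python
# Minimal sketch of one chat inference through the gateway, assuming the
# gateway runs on localhost:3000 and a function named "my_chat" is defined
# in your TensorZero configuration (both are assumptions).
import requests

GATEWAY = "http://localhost:3000"

resp = requests.post(
    f"{GATEWAY}/inference",
    json={
        "function_name": "my_chat",  # hypothetical function from your config
        "input": {
            "messages": [
                {"role": "user", "content": "Summarize this release note."},
            ]
        },
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # includes an inference id you can reference in feedback later
```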

Observability & Optimization

All inferences and optional feedback are stored in a user‑provided database (e.g., ClickHouse) and can be inspected via the TensorZero UI or programmatically. The platform automatically builds datasets, replays historic calls with new prompts or models, and exports OpenTelemetry traces. Integrated metrics and human‑feedback loops enable prompt, model, and strategy optimization.
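
As a hedged sketch, recording feedback against an earlier inference could look like this (the /feedback path, the field names, and the "thumbs_up" metric are assumptions; metrics are defined in your own configuration):

```python
# Hedged sketch: attach a boolean feedback value to a prior inference.
# "thumbs_up" is a hypothetical metric from your config; the inference_id
# is the id returned by the earlier /inference call.
import requests

requests.post(
    "http://localhost:3000/feedback",
    json={
        "metric_name": "thumbs_up",
        "inference_id": "00000000-0000-0000-0000-000000000000",  # replace with a real id
        "value": True,
    },
    timeout=10,
).raise_for_status()
```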

Experimentation & Deployment

TensorZero includes out‑of‑the‑box A/B testing, routing, retries, fallbacks, and granular rate limiting. It deploys with Docker and can be reached from a Python client, a patched OpenAI SDK, or any HTTP client, making it language‑agnostic. Teams can adopt individual components incrementally and combine them with existing tooling.

Highlights

Unified API accesses 30+ LLM providers with a single client
Sub‑millisecond overhead enables >10k QPS at scale
Built‑in observability stores inferences and feedback with UI and OpenTelemetry export
Experimentation layer offers A/B testing, routing, retries, and fallback strategies out of the box

Pros

  • High performance and low latency, suitable for production workloads
  • Broad provider support including major cloud and self‑hosted models
  • Language‑agnostic access via HTTP, Python, and OpenAI SDK patches
  • Integrated observability and experimentation tools reduce third‑party dependencies

Considerations

  • Self‑hosting required; users must manage Docker and a database like ClickHouse
  • Configuration can be complex for small or quick‑start projects
  • Advanced features such as spend tracking are not yet implemented
  • Custom OpenAI‑compatible integrations may need additional setup

Managed products teams compare with

When teams consider TensorZero, these hosted platforms usually appear on the same shortlist.

Comet

Experiment tracking, model registry & production monitoring for ML teams

DagsHub

Git/DVC-based platform with MLflow experiment tracking and model registry.

Neptune

Experiment tracking and model registry to log, compare, and manage ML runs.

Looking for a hosted option? These are the services engineering teams benchmark against before choosing open source.

Fit guide

Great for

  • Teams building production LLM services needing consistent multi‑provider access
  • Applications with strict latency or throughput requirements
  • Organizations that want full control over data, logging, and feedback loops
  • Developers who prefer a single stack for inference, monitoring, and experimentation

Not ideal when

  • Hobby projects that only need a single provider and minimal setup
  • Environments where managed SaaS gateways are preferred over self‑hosting
  • Teams lacking ops resources to maintain Docker/ClickHouse deployments
  • Use cases that require built‑in spend tracking or billing features (not yet available)

How teams use it

Real‑time chat assistant with multi‑model fallback

Route requests between OpenAI and Anthropic with sub‑millisecond gateway overhead and automatic retries on failure.

Batch embedding generation for recommendation engine

Process millions of texts through the gateway’s batch endpoint, store embeddings in ClickHouse, and monitor throughput in the UI.

A/B testing new prompt designs

Deploy two prompt variants, collect user feedback, and use built‑in metrics to identify the higher‑performing version.

Debugging and replaying production inferences

Query historical calls from the UI, edit prompts, and re‑run them to evaluate model updates without affecting live traffic.

Tech snapshot

Rust 75%
TypeScript 17%
Python 6%
Jupyter Notebook 1%
Shell 1%
Go 1%

Tags

llama, ml, mlops, gpt, ai, llms, generative-ai, llm, machine-learning, artificial-intelligence, ai-engineering, python, anthropic, rust, ml-engineering, genai, deep-learning, large-language-models, openai, llmops

Frequently asked questions

How do I add a new LLM provider?

Add the provider in the TensorZero configuration; any OpenAI‑compatible endpoint can be registered, and many major providers are supported out of the box.
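
As a rough sketch, a supported provider can often be called without touching the function config by passing a provider-prefixed model name (the "provider::model" shorthand shown here is an assumption and may differ across versions):

```python
# Hedged sketch: call a provider directly via a provider-prefixed model name
# instead of a configured function (shorthand syntax is an assumption).
import requests

resp = requests.post(
    "http://localhost:3000/inference",
    json={
        "model_name": "openai::gpt-4o-mini",  # hypothetical provider::model shorthand
        "input": {"messages": [{"role": "user", "content": "Hello!"}]},
    },
    timeout=30,
)
print(resp.json())
```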

What storage backend is used for observability?

You configure your own database (e.g., ClickHouse) where inferences, metrics, and feedback are persisted.

Can I use TensorZero with existing OpenAI SDK code?

Yes, you can patch the OpenAI client or point the SDK to the gateway’s base URL to route calls through TensorZero.
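
A minimal sketch of the base-URL approach, assuming the gateway exposes an OpenAI-compatible endpoint at /openai/v1 and maps functions via a prefixed model name (both details may vary by version):

```python
# Hedged sketch: route existing OpenAI SDK code through the TensorZero gateway.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3000/openai/v1",  # assumed OpenAI-compatible path
    api_key="not-used-by-the-gateway",           # the SDK requires some value
)

resp = client.chat.completions.create(
    model="tensorzero::function_name::my_chat",  # assumed mapping to a TensorZero function
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```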

How does rate limiting work?

Custom rate limits can be defined with granular scopes such as user tags, and the gateway enforces them per request.
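
For example, a request might carry tags that a rate-limit scope matches on; a hedged sketch (the "tags" field and the scope semantics are assumptions based on the description above):

```python
# Hedged sketch: pass request tags that granular rate-limit scopes can target.
import requests

requests.post(
    "http://localhost:3000/inference",
    json={
        "function_name": "my_chat",  # hypothetical function from your config
        "input": {"messages": [{"role": "user", "content": "Hi!"}]},
        "tags": {"user_id": "user_123"},  # scope key a rate limit might match
    },
    timeout=30,
)
```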

Is there a managed hosting option?

Currently TensorZero is self‑hosted via Docker; no managed SaaS offering is provided.

Project at a glance

Status: Active
Stars: 10,847
Watchers: 10,847
Forks: 752
License: Apache-2.0
Repo age: 1 year old
Last commit: 11 hours ago
Primary language: Rust

Last synced 4 hours ago