
Phoenix

AI observability platform for tracing, evaluation, and prompt management

Phoenix lets you trace LLM calls, benchmark performance, version datasets, run experiments, and manage prompts—all vendor‑agnostic and deployable locally, in containers, or in the cloud.


Overview

Phoenix is an AI observability platform that centralizes tracing, evaluation, dataset versioning, experiment tracking, and prompt management for LLM‑driven applications. It targets ML engineers, data scientists, and product teams who need reproducible experimentation and deep insight into model behavior.

Capabilities & Deployment

Built on OpenTelemetry and the OpenInference ecosystem, Phoenix automatically instruments popular frameworks such as LlamaIndex, LangChain, Haystack, and DSPy, while supporting a wide range of LLM providers (OpenAI, Bedrock, MistralAI, VertexAI, LiteLLM, Google GenAI, etc.). The platform can run on a local machine, within a Jupyter notebook, as a Docker container, or at scale on Kubernetes, and a hosted cloud instance is also available. Python and TypeScript SDKs provide lightweight clients and evaluation libraries, enabling seamless integration into existing pipelines.
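
As a rough sketch of what instrumentation looks like with the Python SDK, the snippet below registers a tracer against a locally running Phoenix collector and auto-instruments LangChain via OpenInference. The project name and endpoint are placeholders, and package and argument names may vary by version.

```python
# Tracing setup sketch: assumes `arize-phoenix-otel` and the OpenInference
# LangChain instrumentor are installed, and a Phoenix server is listening
# on the default local port (6006).
from phoenix.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor

# Point OpenTelemetry at the Phoenix collector.
tracer_provider = register(
    project_name="my-llm-app",                   # placeholder project name
    endpoint="http://localhost:6006/v1/traces",  # default local collector
)

# Auto-instrument LangChain so chain and LLM calls are exported as spans.
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)
```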

Why Choose Phoenix

By offering a vendor‑agnostic, extensible observability stack, Phoenix helps teams iterate faster, compare model variants, and debug production issues without locking into a single provider or framework.

Highlights

  • Unified tracing via OpenTelemetry
  • LLM‑specific evaluation suite
  • Versioned dataset and experiment tracking
  • Prompt management with version control

Pros

  • Vendor‑agnostic across frameworks and providers
  • Extensible Python and TypeScript SDKs
  • Runs anywhere from notebooks to Kubernetes
  • Rich UI for playground and experiment comparison

Considerations

  • Requires instrumentation of your code
  • Full platform may need container/K8s setup for scaling
  • Feature set still evolving (e.g., TypeScript evals in alpha)
  • Learning curve for OpenTelemetry concepts

Managed products teams compare with

When teams consider Phoenix, these hosted platforms usually appear on the same shortlist.


Confident AI

DeepEval-powered LLM evaluation platform to test, benchmark, and safeguard apps


InsightFinder

AIOps platform for streaming anomaly detection, root cause analysis, and incident prediction


LangSmith Observability

LLM/agent observability with tracing, monitoring, and alerts

Looking for a hosted option? These are the services engineering teams benchmark against before choosing open source.

Fit guide

Great for

  • ML engineers building LLM applications needing observability
  • Data scientists experimenting with prompts and models
  • Teams that want reproducible evaluation pipelines
  • Organizations preferring self‑hosted AI monitoring

Not ideal when

  • Simple scripts with no tracing needs
  • Projects locked to a single LLM provider
  • Teams without capacity to manage container/K8s deployments
  • Use cases requiring real‑time alerting out of the box

How teams use it

Prompt Optimization

Iteratively test prompt variations, compare model responses, and select the best performing version.

RAG Performance Benchmarking

Run retrieval and answer relevance evaluations to quantify improvements across index updates.
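
A hedged sketch of a retrieval-relevance check with the Phoenix evals library; the judge model, sample data, and column names are illustrative, and argument names can differ between library versions.

```python
# RAG relevance evaluation sketch (assumes `arize-phoenix-evals` is installed
# and an OpenAI API key is configured in the environment).
import pandas as pd
from phoenix.evals import (
    OpenAIModel,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    llm_classify,
)

# Each row pairs a user query with one retrieved document to judge.
docs_df = pd.DataFrame(
    {
        "input": ["What is Phoenix?"],
        "reference": ["Phoenix is an AI observability platform for LLM apps."],
    }
)

rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())  # e.g. relevant / unrelated
relevance_df = llm_classify(
    dataframe=docs_df,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=OpenAIModel(model="gpt-4o-mini"),  # judge model is an assumption
    rails=rails,
)
print(relevance_df["label"].value_counts())
```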

Dataset Versioning for Fine‑Tuning

Create immutable snapshots of training examples, track changes, and feed them into fine‑tuning pipelines.
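
One way this might look with the Python client, assuming a reachable Phoenix instance; the dataset name and columns are illustrative.

```python
# Dataset snapshot sketch: uploads a DataFrame as a named, versioned dataset.
import pandas as pd
import phoenix as px

examples = pd.DataFrame(
    {
        "question": ["What does Phoenix trace?"],
        "answer": ["LLM calls, spans, and evaluation results."],
    }
)

dataset = px.Client().upload_dataset(
    dataset_name="fine-tuning-examples-v1",  # illustrative name
    dataframe=examples,
    input_keys=["question"],   # columns treated as inputs
    output_keys=["answer"],    # columns treated as expected outputs
)
```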

End‑to‑End LLM Debugging

Trace runtime calls, view input/output spans, and replay failures directly in the Playground.
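
For offline inspection, collected spans can also be pulled back out through the Python client. A sketch, assuming traces already exist in a local instance and using an illustrative filter expression:

```python
# Span export sketch: fetch traced spans as a DataFrame for debugging.
import phoenix as px

client = px.Client()

# Optionally narrow to LLM spans; the filter syntax here is illustrative.
spans_df = client.get_spans_dataframe("span_kind == 'LLM'")
print(spans_df.columns.tolist())  # inspect available span fields
print(spans_df.head())
```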

Tech snapshot

Jupyter Notebook: 56%
Python: 28%
TypeScript: 16%
Shell: 1%
JavaScript: 1%
PLpgSQL: 1%

Tags

llms, smolagents, llamaindex, ai-monitoring, agents, llm-evaluation, ai, engineering, langchain, anthropic, datasets, prompt-engineering, ai-observability, llm-eval, evals, openai, llmops

Frequently asked questions

Do I need to run a Phoenix server?

A Phoenix instance needs to be reachable somewhere: you can use the hosted instance at app.phoenix.arize.com, self‑host via Docker/Kubernetes, or launch a local server directly from Python or a notebook.
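
For quick local use, a minimal sketch of the in-process option (assumes the `arize-phoenix` package is installed; the Docker image name in the comment is the commonly published one and should be checked against current docs):

```python
# Lightest-weight option: launch Phoenix in-process from Python or a notebook.
import phoenix as px

session = px.launch_app()  # starts a local Phoenix server
print(session.url)         # open the UI in a browser

# Self-hosted alternative (shell), image name assumed:
#   docker run -p 6006:6006 arizephoenix/phoenix:latest
```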

Which languages are supported?

Core SDKs are available for Python and TypeScript; tracing works for any language that can emit OpenTelemetry spans.

How does Phoenix integrate with existing LLM frameworks?

Instrumentation packages are provided for LlamaIndex, LangChain, Haystack, DSPy, and others via OpenInference.

Is there a cost to use Phoenix?

The platform is open source and free; cloud hosting by Arize AI is a paid service.

Can I export evaluation results?

Results can be accessed through the REST API or client libraries and exported to CSV/JSON for downstream analysis.
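
A sketch of the client-library route, using traced spans to illustrate the DataFrame export pattern; evaluation results follow the same export path, the file names are placeholders, and available columns depend on what was logged.

```python
# Export sketch: pull results through the Python client and write CSV/JSON.
import phoenix as px

client = px.Client()
results_df = client.get_spans_dataframe()  # traced spans as a DataFrame

results_df.to_csv("phoenix_results.csv", index=False)
results_df.to_json("phoenix_results.json", orient="records")
```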

Project at a glance

Status: Active
Stars: 8,292
Watchers: 8,292
Forks: 684
Repo age: 3 years
Last commit: 2 days ago
Primary language: Jupyter Notebook

Last synced 2 days ago