

AI observability platform for tracing, evaluation, and prompt management
Phoenix lets you trace LLM calls, benchmark performance, version datasets, run experiments, and manage prompts—all vendor‑agnostic and deployable locally, in containers, or in the cloud.

Phoenix is an AI observability platform that centralizes tracing, evaluation, dataset versioning, experiment tracking, and prompt management for LLM‑driven applications. It targets ML engineers, data scientists, and product teams who need reproducible experimentation and deep insight into model behavior.
Built on OpenTelemetry and the OpenInference ecosystem, Phoenix automatically instruments popular frameworks such as LlamaIndex, LangChain, Haystack, and DSPy, while supporting a wide range of LLM providers (OpenAI, Bedrock, MistralAI, VertexAI, LiteLLM, Google GenAI, etc.). The platform can run on a local machine, within a Jupyter notebook, as a Docker container, or at scale on Kubernetes, and a hosted cloud instance is also available. Python and TypeScript SDKs provide lightweight clients and evaluation libraries, enabling seamless integration into existing pipelines.
By offering a vendor‑agnostic, extensible observability stack, Phoenix helps teams iterate faster, compare model variants, and debug production issues without locking into a single provider or framework.
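As a concrete illustration, here is a minimal Python sketch of the local workflow described above. It assumes the arize-phoenix, openinference-instrumentation-openai, and openai packages are installed and an OpenAI API key is configured; the px.launch_app, phoenix.otel.register, and OpenAIInstrumentor entry points follow current documentation and may differ between releases.

```python
# Minimal tracing sketch: launch Phoenix locally and auto-instrument OpenAI calls.
import phoenix as px
from openai import OpenAI
from openinference.instrumentation.openai import OpenAIInstrumentor
from phoenix.otel import register

px.launch_app()  # start the local Phoenix UI (http://localhost:6006 by default)
tracer_provider = register(project_name="demo")  # route OpenTelemetry export to Phoenix
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)  # trace OpenAI calls automatically

client = OpenAI()
client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize OpenTelemetry in one sentence."}],
)
# The completed call now appears as a span in the Phoenix UI.
```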
Prompt Optimization
Iteratively test prompt variations, compare model responses, and select the best performing version.
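One lightweight way to do this programmatically, sketched below, is to send each prompt variant through an already-instrumented client so the responses arrive in Phoenix as traces that can be compared side by side; the variant names, templates, and model are illustrative, and tracing is assumed to be set up as in the earlier example.

```python
# Sketch: run two prompt variants through an instrumented OpenAI client so
# both responses show up as traces in Phoenix for side-by-side comparison.
from openai import OpenAI

client = OpenAI()
variants = {
    "v1-terse": "Answer in one sentence: {question}",
    "v2-stepwise": "Think step by step, then answer: {question}",
}
question = "Why does retrieval help with factual questions?"

for name, template in variants.items():
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": template.format(question=question)}],
    )
    print(name, response.choices[0].message.content[:80])
```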
RAG Performance Benchmarking
Run retrieval and answer relevance evaluations to quantify improvements across index updates.
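A minimal sketch with phoenix.evals is shown below. It assumes the llm_classify helper, the OpenAIModel wrapper, and the built-in RAG_RELEVANCY_PROMPT_TEMPLATE and RAG_RELEVANCY_PROMPT_RAILS_MAP constants are available under these names in the installed version; the two example rows are made up for illustration.

```python
# Sketch: score retrieved chunks for relevance with phoenix.evals.
import pandas as pd
from phoenix.evals import (
    OpenAIModel,
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    llm_classify,
)

# Each row pairs a query ("input") with one retrieved chunk ("reference");
# the column names must match the template's variables.
df = pd.DataFrame(
    {
        "input": ["What is Phoenix?", "What is Phoenix?"],
        "reference": [
            "Phoenix is an open-source AI observability platform.",
            "The phoenix is a mythical bird that regenerates from its ashes.",
        ],
    }
)

results = llm_classify(
    dataframe=df,
    model=OpenAIModel(model="gpt-4o-mini"),
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    rails=list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),  # e.g. ["relevant", "unrelated"]
)
print(results)  # one label per row; aggregate these to track changes across index updates
```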
Dataset Versioning for Fine‑Tuning
Create immutable snapshots of training examples, track changes, and feed them into fine‑tuning pipelines.
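The sketch below assumes the Python client exposes an upload_dataset method with the keyword arguments shown (check the installed SDK for the exact signature); the dataset name and example rows are invented for illustration.

```python
# Sketch: push a snapshot of examples to Phoenix as a versioned dataset.
import pandas as pd
import phoenix as px

examples = pd.DataFrame(
    {
        "question": ["What does Phoenix trace?", "Is Phoenix vendor-agnostic?"],
        "answer": ["LLM calls and spans via OpenTelemetry.", "Yes."],
    }
)

dataset = px.Client().upload_dataset(
    dataset_name="fine-tuning-examples",
    dataframe=examples,
    input_keys=["question"],   # columns treated as inputs
    output_keys=["answer"],    # columns treated as expected outputs
)
# The uploaded snapshot can then be referenced from experiments or exported
# into a fine-tuning pipeline.
```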
End‑to‑End LLM Debugging
Trace runtime calls, view input/output spans, and replay failures directly in the Playground.
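For steps that no auto-instrumentor covers, a custom span can be emitted with the standard OpenTelemetry tracer returned by phoenix.otel.register. The sketch below assumes that entry point and the OpenInference-style input.value and output.value attribute names; the answer function stands in for real application code.

```python
# Sketch: wrap an application step in a custom span so failures show up
# as traces that can be inspected (and replayed) in Phoenix.
from phoenix.otel import register

tracer_provider = register(project_name="debugging-demo")
tracer = tracer_provider.get_tracer(__name__)

def answer(question: str) -> str:
    with tracer.start_as_current_span("answer") as span:
        span.set_attribute("input.value", question)
        result = "stubbed response"  # call your LLM or retrieval chain here
        span.set_attribute("output.value", result)
        return result

answer("Why did answer quality regress after the last deployment?")
```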
You can use the hosted instance at app.phoenix.arize.com or self‑host via Docker/Kubernetes.
Core SDKs are available for Python and TypeScript; tracing works for any language that can emit OpenTelemetry spans.
Instrumentation packages are provided for LlamaIndex, LangChain, Haystack, DSPy, and others via OpenInference.
The platform is open source and free; cloud hosting by Arize AI is a paid service.
Results can be accessed through the REST API or client libraries and exported to CSV/JSON for downstream analysis.
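As a sketch of the client-library path, assuming the Python client exposes a get_spans_dataframe method and that a project named "demo" has already received traces:

```python
# Sketch: pull traced spans into pandas and export them for offline analysis.
import phoenix as px

spans = px.Client().get_spans_dataframe(project_name="demo")
spans.to_csv("phoenix_spans.csv", index=False)          # CSV export
spans.to_json("phoenix_spans.json", orient="records")   # JSON export
print(spans.head())
```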
Project at a glance
Active · Last synced 4 days ago