phoenix Manual - Doramagic.ai

Doramagic Project Pack · Human Manual

phoenix

AI Observability & Evaluation

Platform Overview and Architecture

Related topics: Tracing, Spans, and Observability, Datasets, Experiments, and Evaluation, PXI Agent, CLI, and MCP Extensibility


Phoenix is an open-source AI observability platform that provides tracing, evaluation, and experimentation capabilities for LLM-based applications and AI agents. The platform spans multiple language SDKs (Python and TypeScript), a backend API server, a React-based UI, and a rich set of examples and developer tooling.

High-Level Architecture

The platform is organized into several major subsystems that work together to capture, store, evaluate, and visualize AI application behavior.

flowchart TB
    subgraph SDKs["Client SDKs"]
        TS["TypeScript SDK<br/>(@phoenix-evals)"]
        PY["Python SDK<br/>(tracing + evals)"]
    end

    subgraph Backend["Phoenix Server"]
        API["GraphQL + REST API"]
        Eval["Evaluation Engine"]
        Store["Trace & Experiment Store"]
    end

    subgraph UI["Phoenix Web UI"]
        Traces["Traces & Sessions"]
        Exps["Experiments"]
        Evals["Evaluation Views"]
    end

    subgraph Tooling["Developer Tooling"]
        Helm["Helm Chart"]
        Docker["Docker / Docker Compose"]
        Sphinx["Sphinx API Reference"]
    end

    TS -->|OTel spans| API
    PY -->|OTel spans| API
    API --> Store
    API --> Eval
    Eval --> Store
    Store --> Traces
    Store --> Exps
    Store --> Evals
    Helm --> Backend
    Docker --> Backend
    Sphinx -.->|documents| API

The architecture separates concerns between instrumentation (SDKs), storage and query (server), presentation (UI), and deployment and documentation tooling (Helm, Docker, Sphinx). Source: api_reference/README.md:1-30, internal_docs/README.md:1-5.

Core Subsystems

Tracing and Observability

Phoenix captures OpenTelemetry-compatible traces from instrumented applications. Spans are organized into traces, and traces can be grouped into sessions for multi-turn conversations. Annotation types supported on spans include user_feedback (thumbs up/down), tool_result (success/error), and retrieval_relevance (LLM evaluation). Sessions receive conversation_coherence and resolution_status annotations on their final turn. Source: js/examples/apps/tracing-tutorial/README.md:1-20.

Evaluation Framework

The phoenix-evals package ships a set of generated, ready-to-use classification evaluator configurations. These evaluators share a common type definition and are exported from a single index module.

export type ClassificationEvaluatorConfig = {
  name: string;
  description: string;
  optimizationDirection: "MINIMIZE" | "MAXIMIZE" | "NEUTRAL";
  template: PromptTemplate;
  choices: Record<string, number>;
};

Source: js/packages/phoenix-evals/src/__generated__/types.ts:1-8. The following default evaluators are pre-built and exported:

Evaluator	Purpose	Direction
`correctness`	General correctness and completeness of outputs	MAXIMIZE
`conciseness`	Whether outputs are concise and free of filler	MAXIMIZE
`document_relevance`	Whether a document answers a question	MAXIMIZE
`hallucination`	Detect hallucinated content	MAXIMIZE
`refusal`	Detect model refusals	MAXIMIZE
`faithfulness`	Faithfulness to source context	MAXIMIZE
`tool_invocation`	Correctness of tool arguments and formatting	MAXIMIZE
`tool_selection`	Whether the right tool was chosen	MAXIMIZE
`tool_response_handling`	How the agent processed tool results	MAXIMIZE

Sources: js/packages/phoenix-evals/src/__generated__/default_templates/index.ts:1-9, js/packages/phoenix-evals/src/__generated__/default_templates/CORRECTNESS_CLASSIFICATION_EVALUATOR_CONFIG.ts:1-5, js/packages/phoenix-evals/src/__generated__/default_templates/TOOL_INVOCATION_CLASSIFICATION_EVALUATOR_CONFIG.ts:1-5.
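To make the shape concrete, here is a minimal sketch of one config object and of how a label returned by a judge maps to a numeric score. It is an illustration under stated assumptions, not the package's code: `PromptTemplateSketch` (a plain string) stands in for the real `PromptTemplate` type, and `exampleConfig`, its template text, and `scoreForLabel` are hypothetical names.

```typescript
// Simplified stand-in for the real PromptTemplate type (assumption).
type PromptTemplateSketch = string;

type ClassificationEvaluatorConfigSketch = {
  name: string;
  description: string;
  optimizationDirection: "MINIMIZE" | "MAXIMIZE" | "NEUTRAL";
  template: PromptTemplateSketch;
  choices: Record<string, number>;
};

// Hypothetical config; the template text is illustrative, not the generated prompt.
const exampleConfig: ClassificationEvaluatorConfigSketch = {
  name: "correctness",
  description: "General correctness and completeness of outputs",
  optimizationDirection: "MAXIMIZE",
  template: "Question: {{input}}\nAnswer: {{output}}\nIs the answer correct or incorrect?",
  choices: { correct: 1, incorrect: 0 },
};

// Map a label to its score; labels outside the choices map yield undefined.
function scoreForLabel(
  config: ClassificationEvaluatorConfigSketch,
  label: string,
): number | undefined {
  return Object.prototype.hasOwnProperty.call(config.choices, label)
    ? config.choices[label]
    : undefined;
}
```

Returning `undefined` for an unknown label, rather than defaulting to 0, keeps a malformed judge response from being counted as a failing score.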

Agent Development and Experimentation

Phoenix provides reference implementations for common agent patterns. The repository includes end-to-end examples for Retrieval-Augmented Generation, Code Generation, and CLI-based agents. These examples use LangChain, OpenAI, Anthropic Claude, and the Vercel AI SDK, and are auto-instrumented for Phoenix tracing. The RAG example demonstrates tool usage (web search, response analysis) and vector store initialization; the code-gen example demonstrates analysis, execution, generation, and merging tools. Source: examples/rag_agent/README.md:1-15, examples/code_gen_agent/README.md:1-15, js/examples/apps/cli-agent-starter-kit/README.md:1-15.

Experiments are a first-class concept. The demo-document-relevancy-experiment example creates a dataset, runs an application function against each example, applies an LLM-based evaluator, and sends the results to Phoenix for analysis. Source: js/examples/apps/demo-document-relevancy-experiment/README.md:1-5.

Developer Tooling and Operations

The repository includes a Helm chart for Kubernetes deployment, a Docker / docker-compose stack for local development, and a Sphinx-based API reference generated from Python source. The scripts directory contains CI helpers, DDL generation for PostgreSQL, benchmark notebooks, and a local devops stack (OIDC, LDAP, SMTP, Grafana, Prometheus, Toxiproxy, Vite). Source: scripts/README.md:1-15. The Sphinx configuration separates auto-generated API files (source/output) from manually edited ones (source/api) to prevent overwrites. Source: api_reference/README.md:1-30

Common Failure Modes and Operational Notes

Several issues recur in community discussions. A known bug causes PHOENIX_OAUTH2_{IDP}_ROLE_ATTRIBUTE_PATH to overwrite manually set roles on every OIDC login (Issue #13783). Requested enhancements include support for custom span costs independent of LLM token pricing (Issue #13655) and a backend bash emulator for agents (Issue #13675). Release notes for versions 17.3.0 through 17.8.1 indicate ongoing work on agent features, token detail metrics, the GraphQL skill, and Prometheus middleware compatibility with FastAPI 0.137. Source: internal_docs/README.md:1-5, scripts/README.md:1-15.

Tracing, Spans, and Observability

Related topics: Platform Overview and Architecture, Datasets, Experiments, and Evaluation


Overview

Phoenix is an observability platform for LLM applications. It ingests OpenTelemetry spans as its core data unit and rolls them up into traces. Observability in Phoenix revolves around three core entities:

Spans — the atomic unit of work, emitted by every instrumented LLM call, tool execution, retriever, or agent step.
Traces — collections of spans sharing a trace_id, forming a single end-to-end execution.
Annotations — first-class metadata attached to spans (human feedback, LLM-as-judge scores, correctness labels).

These entities power Phoenix's dashboards, evaluations, sessions, and the new agent skills released across v17.3–v17.8. Tracing is what links every other Phoenix feature together: without spans and traces, evaluations have nothing to attach to and sessions have no turns to group.

Source: js/packages/phoenix-mcp/src/traceUtils.ts:1-50

Spans: The Atomic Unit of Observability

A span is the smallest piece of telemetry emitted by an instrumented application. In Phoenix's TypeScript SDK and MCP integration, spans are typed directly from the generated OpenAPI schemas:

export type Span = componentsV1["schemas"]["Span"];
export type SpanAnnotation = componentsV1["schemas"]["SpanAnnotation"];
export type SpanWithAnnotations = Span & { annotations?: SpanAnnotation[] };

Source: js/packages/phoenix-mcp/src/traceUtils.ts:5-7

Each span carries an OpenTelemetry context (context.trace_id, context.span_id) plus the standard fields: name, start and end timestamps, kind, attributes, status, and parent ID. The SpanWithAnnotations extension is what lets human or LLM-judge evaluations be attached to a span after the fact: an evaluation of a traced span is stored as a span annotation. The Python client exposes this through client.spans.add_span_annotation for single annotations and client.spans.log_span_annotations for bulk workflows.

Source: packages/phoenix-client/README.md:30-60
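The "attach after the fact" idea can be sketched as a join on span_id. The shapes below are local stand-ins rather than the generated OpenAPI types, and every field name on `SpanAnnotationSketch` as well as the `attachAnnotations` helper are assumptions made for illustration.

```typescript
// Minimal local shapes (assumptions); the real types come from the generated schema.
type SpanSketch = { name: string; context: { trace_id: string; span_id: string } };
type SpanAnnotationSketch = {
  span_id: string;
  name: string;
  annotator_kind: "HUMAN" | "LLM";
  label?: string;
  score?: number;
};
type SpanWithAnnotationsSketch = SpanSketch & { annotations?: SpanAnnotationSketch[] };

// Join annotations onto spans by span_id without mutating the inputs.
function attachAnnotations(
  spans: SpanSketch[],
  annotations: SpanAnnotationSketch[],
): SpanWithAnnotationsSketch[] {
  const bySpanId = new Map<string, SpanAnnotationSketch[]>();
  for (const annotation of annotations) {
    const existing = bySpanId.get(annotation.span_id) ?? [];
    existing.push(annotation);
    bySpanId.set(annotation.span_id, existing);
  }
  return spans.map((span) => {
    const matched = bySpanId.get(span.context.span_id);
    // Spans with no annotations are returned without an annotations field.
    return matched ? { ...span, annotations: matched } : { ...span };
  });
}
```

Because the join key is the span ID alone, annotations can arrive long after the span was recorded, which is what makes delayed human review or batch LLM judging possible.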

A common pattern is to instrument spans with semantic conventions such as the OpenInference attributes, which label spans as LLM, TOOL, CHAIN, RETRIEVER, or AGENT. The RAG agent example wires this up end-to-end:

Auto-instrumentation with OpenInference decorators to fully instrument the agent. End-to-end tracing with Phoenix to track agent performance.

Source: examples/rag_agent/README.md:8-10

The code generation agent example follows the same pattern: every code analysis, execution, generation, and merging tool call is emitted as a typed span. The TypeScript tracing tutorial traces a support agent through query classification, tool calls, retrieval, and the final LLM response, all nested under one root span.

Source: examples/code_gen_agent/README.md:9-11, js/examples/apps/tracing-tutorial/README.md:30-50

Default Evaluator Templates on Spans

The generated evaluator configurations in js/packages/phoenix-evals/src/__generated__/default_templates/ describe how Phoenix labels spans out of the box. Each config defines a name, optimizationDirection, classification choices, and a prompt template that runs against span attributes such as {{input}} and {{output}} (correctness) or {{input}} and {{documentText}} (document relevance).

Source: js/packages/phoenix-evals/src/__generated__/default_templates/CORRECTNESS_CLASSIFICATION_EVALUATOR_CONFIG.ts:1-60, js/packages/phoenix-evals/src/__generated__/default_templates/DOCUMENT_RELEVANCE_CLASSIFICATION_EVALUATOR_CONFIG.ts:1-50, js/packages/phoenix-evals/src/__generated__/default_templates/CONCISENESS_CLASSIFICATION_EVALUATOR_CONFIG.ts:1-60
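As a rough illustration of how `{{input}}`-style placeholders get filled from span attributes, the sketch below performs a plain string substitution. It is a simplified stand-in, not the package's template engine; `renderTemplate` and the template text are hypothetical.

```typescript
// Replace {{name}} placeholders with values. Unknown placeholders are left
// untouched so that a missing variable stays visible in the rendered prompt.
function renderTemplate(template: string, variables: Record<string, string>): string {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match: string, key: string) =>
    Object.prototype.hasOwnProperty.call(variables, key) ? (variables[key] ?? match) : match,
  );
}

// Hypothetical template text, not the generated correctness prompt.
const renderedPrompt = renderTemplate("Input: {{input}}\nOutput: {{output}}", {
  input: "What is the capital of France?",
  output: "Paris",
});
```

Leaving unresolved placeholders in place is a deliberate choice here: a prompt that still contains `{{documentText}}` is easier to spot in a trace than one where the gap was silently filled with an empty string.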

The built-in evaluators exported from the index are: conciseness, correctness, document_relevance, faithfulness, hallucination, refusal, tool_invocation, tool_response_handling, and tool_selection. Helpers such as createDocumentRelevanceEvaluator let callers pass any model and reuse the default promptTemplate and choices unless overridden:

const evaluator = createDocumentRelevanceEvaluator({ model: openai("gpt-4o-mini") });

Source: js/packages/phoenix-evals/src/__generated__/default_templates/index.ts:1-9, js/packages/phoenix-evals/src/llm/createDocumentRelevanceEvaluator.ts:1-50

Traces: Grouping Spans Into End-to-End Executions

A trace is a logical grouping of spans sharing the same trace_id. The MCP server's traceUtils.ts ships the canonical grouping logic used across Phoenix's TypeScript surfaces:

export function groupSpansByTrace({
  spans,
}: {
  spans: SpanWithAnnotations[];
}): Map<string, SpanWithAnnotations[]> { ... }

Source: js/packages/phoenix-mcp/src/traceUtils.ts:25-40

For each trace the helper derives a rootSpan (the span with no parent), the earliest startTime and latest endTime across all spans, a wall-clock duration in milliseconds, and a rolled-up status of "ERROR" if any span has an error status or error attribute, otherwise "OK".

Source: js/packages/phoenix-mcp/src/traceUtils.ts:8-50
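The grouping and rollup described above can be sketched as follows. This is not the body of the helper in traceUtils.ts: `SpanLite`, `TraceSummary`, `groupByTrace`, and `summarizeTrace` are hypothetical names, the field names are assumptions, and the error check looks only at a status code rather than at error attributes as well.

```typescript
// Minimal local span shape (assumption); field names are illustrative.
type SpanLite = {
  name: string;
  context: { trace_id: string; span_id: string };
  parent_id?: string | null;
  start_time: string; // ISO 8601
  end_time: string; // ISO 8601
  status_code?: "OK" | "ERROR" | "UNSET";
};

type TraceSummary = {
  traceId: string;
  rootSpan: SpanLite | undefined;
  startTime: string;
  endTime: string;
  durationMs: number;
  status: "OK" | "ERROR";
};

// Group spans by trace_id, preserving the first-seen order of traces.
function groupByTrace(spans: SpanLite[]): Map<string, SpanLite[]> {
  const groups = new Map<string, SpanLite[]>();
  for (const span of spans) {
    const group = groups.get(span.context.trace_id) ?? [];
    group.push(span);
    groups.set(span.context.trace_id, group);
  }
  return groups;
}

// Roll one trace's spans up into summary fields. Assumes a non-empty span list.
function summarizeTrace(traceId: string, spans: SpanLite[]): TraceSummary {
  const start = Math.min(...spans.map((s) => Date.parse(s.start_time)));
  const end = Math.max(...spans.map((s) => Date.parse(s.end_time)));
  return {
    traceId,
    rootSpan: spans.find((s) => s.parent_id == null), // the span with no parent
    startTime: new Date(start).toISOString(),
    endTime: new Date(end).toISOString(),
    durationMs: end - start,
    status: spans.some((s) => s.status_code === "ERROR") ? "ERROR" : "OK",
  };
}
```

Duration is taken from the earliest start to the latest end across all spans, so it reflects wall-clock time for the trace rather than the sum of span durations.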

This is the foundation for trace-level dashboards, cost aggregation, and the trace ID copy action shipped in v17.3.0. Sessions are themselves derived from spans — Phoenix uses conversation IDs on individual spans to roll up conversation_coherence and resolution_status evaluations across turns.

Source: js/examples/apps/tracing-tutorial/README.md:40-60
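A minimal sketch of the session rollup, assuming one root span per conversational turn: turns are grouped by a session identifier and ordered by start time, and the last turn is picked out as the place where session-level annotations land. `TurnSpan`, its `sessionId` field, and both helpers are hypothetical and do not reflect Phoenix's server-side implementation.

```typescript
// Minimal local shape (assumption): one root span per turn with a
// hypothetical sessionId field.
type TurnSpan = { traceId: string; sessionId?: string; startTime: string };

// Group turns by session and order each session's turns chronologically.
function groupTurnsBySession(turns: TurnSpan[]): Map<string, TurnSpan[]> {
  const sessions = new Map<string, TurnSpan[]>();
  for (const turn of turns) {
    if (turn.sessionId === undefined) continue; // turns without a session are skipped
    const existing = sessions.get(turn.sessionId) ?? [];
    existing.push(turn);
    sessions.set(turn.sessionId, existing);
  }
  sessions.forEach((list) => {
    list.sort((a, b) => Date.parse(a.startTime) - Date.parse(b.startTime));
  });
  return sessions;
}

// The final turn of a session, or undefined for an empty list.
function lastTurn(turns: TurnSpan[]): TurnSpan | undefined {
  return turns[turns.length - 1];
}
```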

flowchart TD
    Root["Root Span<br/>(trace_id = T1)"]
    LLM1["LLM: classify_query"]
    Tool1["TOOL: order_lookup"]
    Ret1["RETRIEVER: faq_search"]
    LLM2["LLM: synthesize_answer"]
    Root --> LLM1
    LLM1 --> Tool1
    LLM1 --> Ret1
    Tool1 --> LLM2
    Ret1 --> LLM2

Instrumentation and Agent Use Cases

Phoenix supports tracing from both Python and TypeScript applications. The RAG agent and code-gen agent examples both rely on the same OpenInference-style decorators: spans are emitted automatically for every LangChain / LangGraph node, and Phoenix rolls them up server-side. For TypeScript-only stacks, the CLI agent starter kit demonstrates a parallel pattern using the AI SDK: Phoenix is auto-started with pnpm dev, tool definitions are wired through the phoenix-otel register helper, and all LLM calls, tool executions, and MCP interactions are emitted as spans to http://localhost:6006.

Source: examples/rag_agent/README.md:12-18, examples/code_gen_agent/README.md:14-22, js/examples/apps/cli-agent-starter-kit/README.md:8-30

The tracing tutorial covers three chapters: producing your first traces, attaching annotations and LLM-as-judge evaluations, and grouping multi-turn conversations into sessions.

Source: js/examples/apps/tracing-tutorial/README.md:40-60

Operational Notes and Known Gaps

Custom span costs are not yet first-class. As of v17.8.x, cost tracking is still centred on LLM token pricing, and non-LLM spans (Tool, Agent, Retriever, Custom) cannot yet emit a cost attribute that aggregates into trace-level dashboards. This is tracked as an open enhancement request.
Synthetic trace generation. scripts/data/generate_traces.py emits synthetic LLM traces for benchmarks and load testing — useful for validating that the span → trace rollup logic is correct before pointing production traffic at Phoenix.
Server-side schema. Phoenix exposes its API as a generated OpenAPI schema via scripts/ci/compile_openapi_schema.py, and the Python client mirrors that schema in resource classes such as client.spans.add_span_annotation and client.sessions.get_session_turns.

Source: scripts/README.md:14-22, packages/phoenix-client/README.md:20-80

Datasets, Experiments, and Evaluation

Related topics: Tracing, Spans, and Observability, PXI Agent, CLI, and MCP Extensibility


Overview

Phoenix provides a unified surface for managing datasets, running experiments, and applying evaluations against LLM-based applications. Datasets store representative inputs (and, when applicable, expected outputs) used to drive repeatable runs. Experiments invoke a task against each example in a dataset and capture the results. Evaluators then score those results so practitioners can compare task versions side-by-side.

The TypeScript implementation ships a curated set of default classification evaluators under js/packages/phoenix-evals/src/__generated__/default_templates/, exported through a single barrel file. Source: js/packages/phoenix-evals/src/__generated__/default_templates/index.ts:1-11. Each default evaluator pairs a generated prompt template with a choices map that maps label strings to numeric scores, so results can be aggregated directly in dashboards. Source: js/packages/phoenix-evals/src/__generated__/default_templates/CORRECTNESS_CLASSIFICATION_EVALUATOR_CONFIG.ts:1-6.

The same surface is exposed programmatically through the phoenix-client (Python) and the JavaScript phoenix-client packages, which let users upload datasets, kick off experiments, and log span annotations against live traces. Source: packages/phoenix-client/README.md:1-100.

Default Evaluator Catalog

The generated default templates expose a consistent shape — name, description, optimizationDirection, a single user-role template, and a choices map — that the experiments engine consumes. Source: js/packages/phoenix-evals/src/__generated__/default_templates/CONCISENESS_CLASSIFICATION_EVALUATOR_CONFIG.ts:1-6.

Evaluator	File	Optimization	Choices
`correctness`	`CORRECTNESS_CLASSIFICATION_EVALUATOR_CONFIG.ts`	MAXIMIZE	`correct=1`, `incorrect=0`
`conciseness`	`CONCISENESS_CLASSIFICATION_EVALUATOR_CONFIG.ts`	MAXIMIZE	`concise=1`, `verbose=0`
`document_relevance`	`DOCUMENT_RELEVANCE_CLASSIFICATION_EVALUATOR_CONFIG.ts`	MAXIMIZE	`relevant=1`, `unrelated=0`
`tool_selection`	`TOOL_SELECTION_CLASSIFICATION_EVALUATOR_CONFIG.ts`	MAXIMIZE	`correct=1`, `incorrect=0`
`tool_invocation`	`TOOL_INVOCATION_CLASSIFICATION_EVALUATOR_CONFIG.ts`	MAXIMIZE	`correct=1`, `incorrect=0`
`tool_response_handling`	`TOOL_RESPONSE_HANDLING_CLASSIFICATION_EVALUATOR_CONFIG.ts`	MAXIMIZE	`correct=1`, `incorrect=0`
`hallucination`, `refusal`, `faithfulness`	exported via the same barrel	MAXIMIZE	label-keyed

Source: js/packages/phoenix-evals/src/__generated__/default_templates/index.ts:1-11, js/packages/phoenix-evals/src/__generated__/default_templates/TOOL_SELECTION_CLASSIFICATION_EVALUATOR_CONFIG.ts:1-50, js/packages/phoenix-evals/src/__generated__/default_templates/TOOL_INVOCATION_CLASSIFICATION_EVALUATOR_CONFIG.ts:1-50, js/packages/phoenix-evals/src/__generated__/default_templates/TOOL_RESPONSE_HANDLING_CLASSIFICATION_EVALUATOR_CONFIG.ts:1-50.

The optimizationDirection of "MAXIMIZE" means higher scores are better; experiments can therefore be ranked without re-mapping labels. Source: js/packages/phoenix-evals/src/__generated__/default_templates/CORRECTNESS_CLASSIFICATION_EVALUATOR_CONFIG.ts:4-6.
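To show why a declared direction removes the need to re-map labels, the sketch below ranks experiments best-first by mean score. `ExperimentScores`, `mean`, and `rankExperiments` are hypothetical helpers, and treating NEUTRAL as "keep the input order" is an assumption of this sketch rather than documented Phoenix behavior.

```typescript
type Direction = "MINIMIZE" | "MAXIMIZE" | "NEUTRAL";
type ExperimentScores = { experiment: string; scores: number[] };

// Arithmetic mean; NaN for an empty list.
function mean(values: number[]): number {
  return values.length === 0 ? NaN : values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Rank experiments best-first by mean score. NEUTRAL keeps the input order
// because no direction is preferred (assumption of this sketch).
function rankExperiments(runs: ExperimentScores[], direction: Direction): string[] {
  if (direction === "NEUTRAL") return runs.map((r) => r.experiment);
  const sign = direction === "MAXIMIZE" ? -1 : 1;
  return runs
    .slice() // copy so the caller's array is not reordered
    .sort((a, b) => sign * (mean(a.scores) - mean(b.scores)))
    .map((r) => r.experiment);
}
```

With `correct=1` and `incorrect=0`, the mean is simply the fraction of examples judged correct, so under MAXIMIZE the experiment with the higher pass rate sorts first.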

Evaluator Semantics

General-purpose evaluators

correctness checks factual accuracy, completeness, and logical consistency of a model output against an input, and returns correct or incorrect. Source: js/packages/phoenix-evals/src/__generated__/default_templates/CORRECTNESS_CLASSIFICATION_EVALUATOR_CONFIG.ts:8-50. conciseness is intentionally scoped to verbosity only — it does not assess correctness or helpfulness, so a response can be both concise and incorrect. Source: js/packages/phoenix-evals/src/__generated__/default_templates/CONCISENESS_CLASSIFICATION_EVALUATOR_CONFIG.ts:8-50.

document_relevance compares a question to a candidate document and emits relevant or unrelated, useful for RAG retrieval evaluation. Source: js/packages/phoenix-evals/src/__generated__/default_templates/DOCUMENT_RELEVANCE_CLASSIFICATION_EVALUATOR_CONFIG.ts:1-30.

Agent and tool-use evaluators

The three tool-related evaluators separate orthogonal concerns:

tool_selection judges whether the right tool was picked, allowing the model to skip tools when none are needed. Source: js/packages/phoenix-evals/src/__generated__/default_templates/TOOL_SELECTION_CLASSIFICATION_EVALUATOR_CONFIG.ts:1-50.
tool_invocation judges whether the chosen tool was called correctly with valid arguments; it explicitly disregards whether the choice itself was optimal, and evaluates every tool call in multi-tool responses independently. Source: js/packages/phoenix-evals/src/__generated__/default_templates/TOOL_INVOCATION_CLASSIFICATION_EVALUATOR_CONFIG.ts:1-50.
tool_response_handling judges what happens after a tool returns — error handling, data extraction, and safe disclosure of the tool's output. Source: js/packages/phoenix-evals/src/__generated__/default_templates/TOOL_RESPONSE_HANDLING_CLASSIFICATION_EVALUATOR_CONFIG.ts:1-50.

This split lets experiment dashboards report failure modes granularly (e.g., "wrong tool" vs. "bad arguments") rather than collapsing them into a single pass/fail.
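One way to turn the three labels into a single reportable failure mode is sketched below. The type and function names are hypothetical, and reporting the earliest failing stage first is a design choice of this sketch, not something the evaluators prescribe.

```typescript
type Verdict = "correct" | "incorrect";
type ToolEvalLabels = {
  tool_selection: Verdict;
  tool_invocation: Verdict;
  tool_response_handling: Verdict;
};
type FailureMode = "none" | "wrong tool" | "bad arguments" | "mishandled response";

// Report the earliest failing stage so each run yields one primary failure mode.
function primaryFailureMode(labels: ToolEvalLabels): FailureMode {
  if (labels.tool_selection === "incorrect") return "wrong tool";
  if (labels.tool_invocation === "incorrect") return "bad arguments";
  if (labels.tool_response_handling === "incorrect") return "mishandled response";
  return "none";
}

// Count failure modes across runs, e.g. for a dashboard breakdown.
function tallyFailureModes(runs: ToolEvalLabels[]): Record<FailureMode, number> {
  const tally: Record<FailureMode, number> = {
    none: 0,
    "wrong tool": 0,
    "bad arguments": 0,
    "mishandled response": 0,
  };
  for (const run of runs) tally[primaryFailureMode(run)] += 1;
  return tally;
}
```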

End-to-End Example Workflows

The demo-document-relevancy-experiment example illustrates the canonical dataset→experiment→evaluation loop. It builds a small dataset of space-related questions, runs each example through a spaceKnowledgeApplication that calls OpenAI for retrieval, applies the document_relevance evaluator, and forwards the run to Phoenix Cloud for analysis. Source: js/examples/apps/demo-document-relevancy-experiment/README.md:1-25.
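The loop itself can be sketched in a few lines. This is an in-memory stand-in and not the Phoenix client API: all names are hypothetical, the task and evaluator are synchronous toy functions (real tasks and LLM-based evaluators are asynchronous model calls), and nothing is sent to a Phoenix server.

```typescript
// Local stand-in types (assumptions); real tasks and LLM evaluators are asynchronous.
type Example = { question: string };
type TaskFn = (example: Example) => string;
type Evaluator = { name: string; evaluate: (example: Example, output: string) => number };
type RunResult = { example: Example; output: string; scores: Record<string, number> };

// Run the task on every example, then score each output with every evaluator.
function runExperimentSketch(
  dataset: Example[],
  task: TaskFn,
  evaluators: Evaluator[],
): RunResult[] {
  return dataset.map((example) => {
    const output = task(example);
    const scores: Record<string, number> = {};
    for (const evaluator of evaluators) {
      scores[evaluator.name] = evaluator.evaluate(example, output);
    }
    return { example, output, scores };
  });
}

// Toy task and keyword-based evaluator standing in for an LLM call and an LLM judge.
const toyTask: TaskFn = (example) =>
  example.question.includes("Mars") ? "Mars is the fourth planet from the Sun." : "I am not sure.";
const mentionsPlanet: Evaluator = {
  name: "mentions_planet",
  evaluate: (_example, output) => (output.includes("planet") ? 1 : 0),
};

const experimentResults = runExperimentSketch(
  [{ question: "Where is Mars?" }, { question: "What is a quasar?" }],
  toyTask,
  [mentionsPlanet],
);
```

Keeping the task and the evaluators as separate arguments mirrors the separation in the text: the same dataset can be re-run against a new task version while the evaluators stay fixed, which is what makes runs comparable.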

The rag_agent and code_gen_agent examples show the same pattern for full agent workflows: each ships a requirements.txt, an app.py entrypoint, an agent.py that wires LangChain / LangGraph, and a tools.py that exposes the agent's capabilities. Both rely on OpenInference auto-instrumentation so that Phoenix can record spans and feed them to evaluators without manual wiring. Source: examples/rag_agent/README.md:1-50, examples/code_gen_agent/README.md:1-50.

The TypeScript tracing tutorial extends this to live sessions and trace-level evaluation: spans can be annotated with user_feedback, tool_result, and retrieval_relevance, and conversations can be filtered by conversation_coherence or resolution_status annotations on the last turn. Source: js/examples/apps/tracing-tutorial/README.md:1-50.

Annotations and the Python Client

Beyond experiments, the phoenix-client Python package supports direct span and session annotation — useful for human-in-the-loop review of production traces:

client.spans.add_span_annotation(...) records a single HUMAN or LLM annotation with optional score and explanation. Source: packages/phoenix-client/README.md:1-100.
client.spans.log_span_annotations(span_annotations=[...]) bulk-loads annotations across many spans at once. Source: packages/phoenix-client/README.md:1-100.
client.sessions.list(project_name=...) and client.sessions.get_session_turns(session_id=...) let you audit full conversation threads; annotation names can be filtered to narrow the query. Source: packages/phoenix-client/README.md:1-100.

A canonical hallucination_eval_benchmark notebook under scripts/benchmarks/ exercises the hallucination evaluator end-to-end for reproducibility. Source: scripts/README.md:1-100

Common Pitfalls

Mixing evaluator scopes. tool_invocation deliberately ignores selection; a model that picks the wrong tool but invokes it correctly will still receive a correct judgment. Use both tool_selection and tool_invocation when you need a complete picture. Source: js/packages/phoenix-evals/src/__generated__/default_templates/TOOL_INVOCATION_CLASSIFICATION_EVALUATOR_CONFIG.ts:1-50.
Confusing conciseness with quality. Conciseness is a length/stylistic signal only; pair it with correctness or hallucination to capture substance. Source: js/packages/phoenix-evals/src/__generated__/default_templates/CONCISENESS_CLASSIFICATION_EVALUATOR_CONFIG.ts:8-50.
Generated files are not hand-editable. The __generated__/default_templates/ directory is emitted by a build step, so direct edits will be lost on regeneration. Source: js/packages/phoenix-evals/src/__generated__/default_templates/CORRECTNESS_CLASSIFICATION_EVALUATOR_CONFIG.ts:1-2.

PXI Agent, CLI, and MCP Extensibility

Related topics: Platform Overview and Architecture, Datasets, Experiments, and Evaluation


Phoenix's extensibility story rests on three coordinated layers: the PXI Agent built into the Phoenix server, the standalone px CLI for scripting and automation, and Model Context Protocol (MCP) integrations that let external agents (e.g. Claude Code, CLI agent starter kits) consume Phoenix programmatically. Together they let teams run, observe, and improve LLM applications without leaving their preferred surfaces.

1. Purpose and Scope

The PXI Agent is the in-product assistant embedded in the Phoenix UI. It has been actively expanded across the v17.3.0 – v17.8.1 release line:

Version	Date	PXI Agent / Tooling Feature	Tracking
17.3.0	2026-06-10	"Copy trace ID" chat action	release
17.4.0	2026-06-11	Local slash commands in chat menu	#13683
17.5.0	2026-06-12	Subagents toggle in assistant settings	#13733
17.6.0	2026-06-15	Experiment editing and eval skills	#13704
17.7.0	2026-06-16	Agent eval skill additions	release
17.8.0	2026-06-17	Agent GraphQL skill	#13732

Source: internal_docs/README.md describes feature specifications that drive the agent roadmap.

The companion surfaces, the px CLI (js/packages/phoenix-cli/README.md) and the phoenix-client Python SDK (packages/phoenix-client/README.md), give headless access to the same capabilities for CI, notebooks, and external agent runtimes.

2. The `px` CLI

The CLI ships as a Node.js package and is invoked as px <command>.

2.1 Prompt formatting

formatPromptOutput renders a fetched PromptVersion (typed against componentsV1["schemas"]["PromptVersion"]) in pretty, json, raw, or text form. The text mode emits an XML-tagged chat template so the output can be piped to other tools without color codes. Source: js/packages/phoenix-cli/src/commands/formatPrompt.ts.

2.2 Sessions and annotations

px session annotate my-session-id --name reviewer --label pass
px session add-note my-session-id --text "needs follow-up"
px session-annotations delete --identifier "$PHOENIX_CODING_IDENTIFIER" --all -y
px annotation-config list --format raw --no-progress | jq '.[].name'

The add-note subcommand requires Phoenix server 14.17.0 or newer, and the deletion commands require either --all or both --start-time and --end-time. Source: js/packages/phoenix-cli/README.md.
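The deletion rule can be expressed as a small guard. This is a sketch of the stated rule only, not the CLI's implementation: `DeleteArgs`, `validateDeleteArgs`, and the error text are hypothetical, and accepting `--all` together with a time range is an assumption, since the README wording does not say whether the two are mutually exclusive.

```typescript
// Hypothetical parsed-argument shape for a delete command.
type DeleteArgs = { all?: boolean; startTime?: string; endTime?: string };

// Returns an error message, or undefined when the arguments satisfy the rule:
// either --all, or both --start-time and --end-time.
function validateDeleteArgs(args: DeleteArgs): string | undefined {
  const hasRange = args.startTime !== undefined && args.endTime !== undefined;
  if (args.all === true || hasRange) return undefined;
  return "Pass --all, or both --start-time and --end-time.";
}
```

Requiring an explicit scope before a destructive command runs means an accidental invocation with no flags fails fast instead of deleting everything that matches the identifier.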

2.3 Authorization model

Mutating commands are subject to the same permission classes used by the GraphQL layer, IsNotReadOnly and IsNotViewer, which is why deletion commands have explicit gates. The CI script scripts/ci/ensure_graphql_mutations_have_permission_classes.py enforces this rule for server-side mutations. Source: scripts/README.md.

3. CLI Agent Starter Kit

The starter kit demonstrates how an external agent (Anthropic Claude via the Vercel AI SDK) can register Phoenix as both an MCP server and an OpenTelemetry observability backend.

src/
├── cli.ts              # Entry point
├── agent/              # Agent factory
├── tools/
│   ├── index.ts        # Tool exports
│   ├── datetime.ts     # Utility tool
│   └── mcp.ts          # Phoenix docs MCP
├── prompts/            # System instructions
└── ui/                 # CLI interface

Adding a tool follows a three-step pattern: create src/tools/mytool.ts with tool({ description, inputSchema: z.object({...}), execute }), re-export from src/tools/index.ts, then register in src/cli.ts. Requirements are Node.js 22+, pnpm, Docker Desktop, and an Anthropic API key; Phoenix auto-starts and serves the UI on http://localhost:6006. Source: js/examples/apps/cli-agent-starter-kit/README.md.

4. MCP Extensibility and Evaluators

Phoenix exposes its datasets, prompts, traces, and evals through the Model Context Protocol so any MCP-compatible runtime (Claude Code, Cursor, the CLI starter kit above) can pull Phoenix context directly. The PXI Agent itself consumes MCP — the v17.8.0 release shipped an Agent GraphQL skill (#13732) so the in-product agent can answer structured queries without leaving the chat surface.

The JS evals package mirrors this on the programmatic side: a generated barrel file re-exports ClassificationEvaluatorConfig templates for conciseness, correctness, document_relevance, faithfulness, hallucination, refusal, tool_invocation, tool_response_handling, and tool_selection. Each template declares an optimizationDirection (e.g. "MAXIMIZE") and a numeric choices map. Source: js/packages/phoenix-evals/src/__generated__/default_templates/index.ts.

A canonical "tool selection" rubric illustrates the contract: return "correct" only when the chosen tool exists in the available tools list, is safe, and matches the user's intent; otherwise return "incorrect". Source: js/packages/phoenix-evals/src/__generated__/default_templates/TOOL_SELECTION_CLASSIFICATION_EVALUATOR_CONFIG.ts.

flowchart LR
    A[External Agent<br/>Claude Code / CLI Kit] -->|MCP| B(Phoenix MCP Server)
    PXI[PXI Agent<br/>in Phoenix UI] -->|GraphQL Skill| B
    PX[px CLI] -->|REST + OpenAPI| B
    SDK[phoenix-client<br/>Python SDK] -->|REST + OpenAPI| B
    B --> D[(Phoenix Store<br/>Traces, Datasets,<br/>Prompts, Evals)]

5. Known Constraints

OIDC role overrides are not persisted when PHOENIX_OAUTH2_{IDP}_ROLE_ATTRIBUTE_PATH is set: the IDP claim overwrites any role assigned in the UI on every login (#13783). Workaround: do not rely on UI role edits while the env var is configured.
Cost tracking is token-centric: non-LLM spans (Tool, Agent, Retriever, Custom) cannot yet expose a custom cost attribute that rolls into project-level dashboards (#13655).
Agent FS / bash access is being expanded behind a backend emulator (#13675), so file- and shell-shaped tools should be treated as unavailable until a release enables them.
FastAPI 0.137 _IncludedRouter: a 17.8.1 hot-fix re-resolved the Prometheus middleware route path (#13822); pin server versions if you write custom Prometheus scrapes.

Doramagic Pitfall Log

Source-linked risks stay visible on the manual page so the preview does not read like a recommendation.


Found 22 structured pitfall item(s), including 3 high/blocking item(s). Top priority: Runtime risk - Runtime risk requires verification.

1. Runtime risk: Runtime risk requires verification

Severity: high
Finding: Project evidence flags a runtime risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/Arize-ai/phoenix/issues/13837

2. Security or permission risk: Security or permission risk requires verification

Severity: high
Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/Arize-ai/phoenix/issues/13783

3. Security or permission risk: Security or permission risk requires verification

Severity: high
Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: community_evidence:github | https://github.com/Arize-ai/phoenix/issues/13655

4. Configuration risk: Configuration risk requires verification

Severity: medium
Finding: Developers should check this configuration risk before relying on the project: a reported bug in which ROLE_ATTRIBUTE_PATH unconditionally overwrites manually set roles on every OIDC login, so manual UI overrides do not persist.
User impact: Roles assigned manually in the UI may be reset at the next OIDC login, which can leave credentials, environment, or host setup misconfigured.
Recommended check: Before packaging this project, run the relevant install/config/quickstart check for this issue. Context: Observed during installation or first-run setup.
Evidence: failure_mode_cluster:github_issue | https://github.com/Arize-ai/phoenix/issues/13783
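
To make the reported behavior concrete, the following minimal sketch contrasts two role-sync policies at login time. It is an illustration only, not Phoenix's implementation: the `User` record, the `manually_overridden` flag, and both function names are hypothetical and were introduced here for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class User:
    """Hypothetical user record; not Phoenix's data model."""
    role: str
    manually_overridden: bool = False


def sync_role_always(user: User, claim_role: Optional[str]) -> User:
    """Policy described in the issue title: the OIDC claim wins on every login."""
    if claim_role is not None:
        user.role = claim_role
    return user


def sync_role_respecting_override(user: User, claim_role: Optional[str]) -> User:
    """Alternative policy: keep a role that an admin set by hand in the UI."""
    if claim_role is not None and not user.manually_overridden:
        user.role = claim_role
    return user


# An admin promoted the user in the UI, but the identity provider still reports MEMBER.
overwritten = sync_role_always(User("ADMIN", manually_overridden=True), "MEMBER")
preserved = sync_role_respecting_override(User("ADMIN", manually_overridden=True), "MEMBER")
print(overwritten.role, preserved.role)
```

When reproducing the issue, the useful check is the same as in the sketch: set a role manually, log in again through OIDC, and compare the role before and after.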

5. Configuration risk: Configuration risk requires verification

Severity: medium
Finding: Developers should check this configuration risk before relying on the project: the arize-phoenix-client v2.8.0 release.
User impact: Upgrading or migrating to arize-phoenix-client v2.8.0 may change expected behavior.
Recommended check: Before packaging this project, run the relevant install/config/quickstart check against arize-phoenix-client v2.8.0. Context: The source discussion did not expose a precise runtime context.
Evidence: failure_mode_cluster:github_release | https://github.com/Arize-ai/phoenix/releases/tag/arize-phoenix-client-v2.8.0
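
For upgrade-related risks like this one, it helps to record which versions are actually installed before and after the change. The sketch below uses only the Python standard library; it assumes the release names `arize-phoenix` and `arize-phoenix-client` match the installed distribution names, and the helper `installed_version` is a hypothetical name introduced for this example.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Assumed distribution names, taken from the release titles in this log.
for name in ("arize-phoenix", "arize-phoenix-client"):
    found = installed_version(name)
    print(f"{name}: {found or 'not installed'}")
```

Running this inside the isolated environment used for the quickstart check gives a baseline to compare against the release notes linked in the evidence line.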

6. Capability evidence risk: Capability evidence risk requires verification

Severity: medium
Finding: README/documentation is current enough for a first validation pass.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: capability.assumptions | https://github.com/Arize-ai/phoenix

7. Maintenance risk: Maintenance risk requires verification

Severity: medium
Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: evidence.maintainer_signals | https://github.com/Arize-ai/phoenix

8. Security or permission risk: Security or permission risk requires verification

Severity: medium
Finding: Flagged as no_demo (no demo was available for validation).
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: downstream_validation.risk_items | https://github.com/Arize-ai/phoenix

9. Security or permission risk: Security or permission risk requires verification

Severity: medium
Finding: Flagged as no_demo (no demo was available for validation).
User impact: May increase setup, validation, or first-run risk for the user.
Recommended check: Reproduce the official install and quickstart path in an isolated environment.
Evidence: risks.scoring_risks | https://github.com/Arize-ai/phoenix

10. Capability evidence risk: Capability evidence risk requires verification

Severity: low
Finding: Developers should check this capability risk before relying on the project: [ENHANCEMENT]: Support custom span costs independent of LLM token pricing.
User impact: Developers may hit a documented, source-backed failure mode: [ENHANCEMENT]: Support custom span costs independent of LLM token pricing.
Recommended check: Before packaging this project, run the relevant install/config/quickstart check for this enhancement request. Context: Observed when using CUDA.
Evidence: failure_mode_cluster:github_issue | https://github.com/Arize-ai/phoenix/issues/13655

11. Capability evidence risk: Capability evidence risk requires verification

Severity: low
Finding: Developers should check this capability risk before relying on the project: [agents] backend bash emulator / FS access
User impact: Developers may hit a documented source-backed failure mode: [agents] backend bash emulator / FS access
Recommended check: Before packaging this project, run the relevant install/config/quickstart check for: [agents] backend bash emulator / FS access. Context: Source discussion did not expose a precise runtime context.
Evidence: failure_mode_cluster:github_issue | https://github.com/Arize-ai/phoenix/issues/13675

12. Runtime risk: Runtime risk requires verification

Severity: low
Finding: Developers should check this performance risk before relying on the project: the arize-phoenix v17.5.0 release.
User impact: Upgrading or migrating to arize-phoenix v17.5.0 may change expected behavior.
Recommended check: Before packaging this project, run the relevant install/config/quickstart check against arize-phoenix v17.5.0. Context: The source discussion did not expose a precise runtime context.
Evidence: failure_mode_cluster:github_release | https://github.com/Arize-ai/phoenix/releases/tag/arize-phoenix-v17.5.0

Source: Doramagic discovery, validation, and Project Pack records

Community Discussion Evidence

These external discussion links are review inputs, not standalone proof that the project is production-ready.

Sources: 12. Count of project-level external discussion links exposed on this manual page.

Use: Review before install. Open the linked issues or discussions before treating the pack as ready for your environment.

Doramagic exposes project-level community discussion separately from official documentation. Review these links before using phoenix with real data or production workflows.

[[bug][agents] Inconsistent /clear invocation](https://github.com/Arize-ai/phoenix/issues/13837) - github / github_issue
[[ENHANCEMENT]: Support custom span costs independent of LLM token pricin](https://github.com/Arize-ai/phoenix/issues/13655) - github / github_issue
[[agents] backend bash emulator / FS access](https://github.com/Arize-ai/phoenix/issues/13675) - github / github_issue
[[BUG]: ROLE_ATTRIBUTE_PATH unconditionally overwrites manually set roles](https://github.com/Arize-ai/phoenix/issues/13783) - github / github_issue
arize-phoenix: v17.9.0 - github / github_release
arize-phoenix: v17.8.1 - github / github_release
arize-phoenix: v17.8.0 - github / github_release
arize-phoenix: v17.7.0 - github / github_release
arize-phoenix: v17.6.0 - github / github_release
arize-phoenix: v17.5.0 - github / github_release
arize-phoenix-client: v2.9.0 - github / github_release
arize-phoenix: v17.4.0 - github / github_release

Source: Project Pack community evidence and pitfall evidence
