# Pitfall Log

Project: confident-ai/deepeval

Summary: Found 28 structured pitfall items, including 4 high/blocking items. Top priority: Installation risk (requires verification).

## 1. Installation risk - Installation risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/confident-ai/deepeval/issues/1235
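
One way to start verifying an installation risk is to confirm what is actually installed in the target environment. The snippet below is a minimal sketch using only the Python standard library. The helper name `installed_version` is hypothetical, and the distribution name `deepeval` is assumed from the project name.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "deepeval" is assumed to be the distribution name; adjust if it differs.
    version = installed_version("deepeval")
    if version is None:
        print("deepeval is not installed in this environment")
    else:
        print(f"deepeval {version} is installed")
```

Recording the reported version next to the linked issue makes it easier to tell whether a reported problem applies to the environment being validated.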

## 2. Installation risk - Installation risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags an installation risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/confident-ai/deepeval/issues/2508

## 3. Security or permission risk - Security or permission risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Developers should check this security or permission risk before relying on the project. The linked issue is titled "Security: request for a submitting security vulnerabilities".
- User impact: Developers may expose sensitive permissions or credentials. See the linked issue for details.
- Evidence: failure_mode_cluster:github_issue | https://github.com/confident-ai/deepeval/issues/2744

## 4. Security or permission risk - Security or permission risk requires verification

- Severity: high
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/confident-ai/deepeval/issues/2594

## 5. Configuration risk - Configuration risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a configuration risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: capability.host_targets | github_repo:676829188 | https://github.com/confident-ai/deepeval

## 6. Configuration risk - Configuration risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this configuration risk before relying on the project: 🎉 New Interfaces, Reduce ETL Code < 50%!
- User impact: Upgrade or migration may change expected behavior: 🎉 New Interfaces, Reduce ETL Code < 50%!
- Evidence: failure_mode_cluster:github_release | https://github.com/confident-ai/deepeval/releases/tag/v3.7.2

## 7. Configuration risk - Configuration risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this configuration risk before relying on the project: 🔥 DeepEval 4.0: Eval Harness for Coding Agents, 1-line integrations, TUI for trace inspection!
- User impact: Upgrade or migration may change expected behavior: 🔥 DeepEval 4.0: Eval Harness for Coding Agents, 1-line integrations, TUI for trace inspection!
- Evidence: failure_mode_cluster:github_release | https://github.com/confident-ai/deepeval/releases/tag/v4.0.2
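
The release tags cited in this log include both 3.x and 4.x versions (for example v3.7.2 and v4.0.2), so an upgrade may cross a major-version boundary. The following is a small sketch for flagging that case before upgrading. The helper names `major_version` and `crosses_major` are hypothetical, and the parsing assumes plain `MAJOR.MINOR.PATCH` strings with an optional leading `v`.

```python
def major_version(version: str) -> int:
    """Return the leading integer of a version string such as '4.0.2' or 'v3.7.2'."""
    return int(version.lstrip("v").split(".")[0])


def crosses_major(current: str, target: str) -> bool:
    """Return True when moving from `current` to `target` changes the major version."""
    return major_version(current) != major_version(target)


if __name__ == "__main__":
    # Version strings taken from the release tags cited in this log.
    print(crosses_major("v3.7.2", "v4.0.2"))
    print(crosses_major("v4.0.2", "v4.0.3"))
```

When the check reports a major-version change, reading the linked release notes before upgrading is the safer path.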

## 8. Capability evidence risk - Capability evidence risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: This assessment assumes the README/documentation is current enough for a first validation pass. Confirm that assumption against the repository before relying on it.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: capability.assumptions | github_repo:676829188 | https://github.com/confident-ai/deepeval

## 9. Runtime risk - Runtime risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this runtime risk before relying on the project: ConfidentInstrumentationSettings with pydantic-ai: tools_called, expected_tools, and actual_output are all None when using OpenAIResponsesModel
- User impact: Developers may hit a documented source-backed failure mode: ConfidentInstrumentationSettings with pydantic-ai: tools_called, expected_tools, and actual_output are all None when using OpenAIResponsesModel
- Evidence: failure_mode_cluster:github_issue | https://github.com/confident-ai/deepeval/issues/2508
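
A defensive check can surface this failure mode early instead of letting `None` values flow into an evaluation. The sketch below assumes the captured values are available as a plain mapping, which is a hypothetical shape rather than DeepEval's actual data model. Only the three field names come from the linked issue title.

```python
from typing import Any, Iterable, List, Mapping

# Field names taken from the linked issue title.
REQUIRED_FIELDS = ("tools_called", "expected_tools", "actual_output")


def missing_fields(
    record: Mapping[str, Any],
    required: Iterable[str] = REQUIRED_FIELDS,
) -> List[str]:
    """Return the names in `required` that are absent from `record` or set to None."""
    return [name for name in required if record.get(name) is None]


if __name__ == "__main__":
    # Hypothetical captured record, for illustration only.
    captured = {"tools_called": None, "actual_output": "hello"}
    problems = missing_fields(captured)
    if problems:
        print("Missing or None fields:", ", ".join(problems))
```

Failing fast on a non-empty result keeps a silent instrumentation gap from being mistaken for a genuinely poor evaluation score.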

## 10. Maintenance risk - Maintenance risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Developers should check this migration risk before relying on the project: 🎉 New Decision Graph Logic for Granular Simulation Control
- User impact: Upgrade or migration may change expected behavior: 🎉 New Decision Graph Logic for Granular Simulation Control
- Evidence: failure_mode_cluster:github_release | https://github.com/confident-ai/deepeval/releases/tag/v4.0.3

## 11. Maintenance risk - Maintenance risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a maintenance risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | github_repo:676829188 | https://github.com/confident-ai/deepeval

## 12. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: No demo was found for this project (`no_demo`).
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: downstream_validation.risk_items | github_repo:676829188 | https://github.com/confident-ai/deepeval

## 13. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: No demo was found for this project (`no_demo`).
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: risks.scoring_risks | github_repo:676829188 | https://github.com/confident-ai/deepeval

## 14. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/confident-ai/deepeval/issues/2741

## 15. Security or permission risk - Security or permission risk requires verification

- Severity: medium
- Evidence strength: source_linked
- Finding: Project evidence flags a security or permission risk. Review the linked source before relying on this workflow.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: community_evidence:github | https://github.com/confident-ai/deepeval/issues/2746

## 16. Capability evidence risk - Capability evidence risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this capability risk before relying on the project: CLI improvement: option to display only failed tests
- User impact: Developers may hit a documented source-backed failure mode: CLI improvement: option to display only failed tests
- Evidence: failure_mode_cluster:github_issue | https://github.com/confident-ai/deepeval/issues/1235

## 17. Capability evidence risk - Capability evidence risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this capability risk before relying on the project: Feature: support cached input tokens in LLM span cost tracking
- User impact: Developers may hit a documented source-backed failure mode: Feature: support cached input tokens in LLM span cost tracking
- Evidence: failure_mode_cluster:github_issue | https://github.com/confident-ai/deepeval/issues/2741

## 18. Capability evidence risk - Capability evidence risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this conceptual risk before relying on the project: DeepEval for Typescript
- User impact: Developers may hit a documented source-backed failure mode: DeepEval for Typescript
- Evidence: failure_mode_cluster:github_issue | https://github.com/confident-ai/deepeval/issues/2734

## 19. Runtime risk - Runtime risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this performance risk before relying on the project: Contextual Precision over-penalizes overlapping chunks in financial-document RAG
- User impact: Developers may hit a documented source-backed failure mode: Contextual Precision over-penalizes overlapping chunks in financial-document RAG
- Evidence: failure_mode_cluster:github_issue | https://github.com/confident-ai/deepeval/issues/2594
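
To see why a rank-sensitive precision score is sensitive to how individual chunks are judged, it helps to work through the arithmetic. The sketch below assumes the metric behaves like a weighted cumulative precision over 0/1 relevance verdicts. It is an illustration under that assumption, not DeepEval's implementation, and the function name is hypothetical.

```python
from typing import Sequence


def weighted_cumulative_precision(relevance: Sequence[int]) -> float:
    """Score a ranked list of 0/1 relevance verdicts.

    Each relevant position k contributes (relevant items in the top k) / k,
    and the sum is divided by the total number of relevant items.
    """
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    score = 0.0
    seen = 0
    for k, verdict in enumerate(relevance, start=1):
        if verdict:
            seen += 1
            score += seen / k
    return score / total_relevant


if __name__ == "__main__":
    # All three retrieved chunks judged relevant.
    print(weighted_cumulative_precision([1, 1, 1]))
    # Same ranking, but the second chunk is judged irrelevant.
    print(weighted_cumulative_precision([1, 0, 1]))
```

Under this formula, three relevant chunks score 1.0, while marking the middle chunk irrelevant drops the score to about 0.83. If an overlapping chunk is judged irrelevant because it repeats an earlier one, the score falls even though retrieval quality has not changed, which is the kind of effect the linked issue describes.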

## 20. Runtime risk - Runtime risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this performance risk before relying on the project: LLM tokens not displayed when using custom OpenTelemetry / OpenInference OTLP export
- User impact: Developers may hit a documented source-backed failure mode: LLM tokens not displayed when using custom OpenTelemetry / OpenInference OTLP export
- Evidence: failure_mode_cluster:github_issue | https://github.com/confident-ai/deepeval/issues/2746
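
When token counts do not show up after a custom OTLP export, a first diagnostic step is to check whether the exported span carries any token-count attributes at all. The sketch below looks for attribute names from the OpenTelemetry GenAI and OpenInference semantic conventions. Which keys the receiving backend actually reads is an assumption to verify against the linked issue, and the helper name is hypothetical.

```python
from typing import Any, List, Mapping

# Attribute names from two public conventions. Which of them a given
# backend reads is an assumption to verify.
TOKEN_ATTRIBUTE_KEYS = (
    "gen_ai.usage.input_tokens",   # OpenTelemetry GenAI semantic conventions
    "gen_ai.usage.output_tokens",
    "llm.token_count.prompt",      # OpenInference semantic conventions
    "llm.token_count.completion",
)


def token_attributes_present(attributes: Mapping[str, Any]) -> List[str]:
    """Return the known token-count keys that appear in a span's attributes."""
    return [key for key in TOKEN_ATTRIBUTE_KEYS if key in attributes]


if __name__ == "__main__":
    # Hypothetical span attributes, for illustration only.
    span_attributes = {"llm.token_count.prompt": 12, "llm.model_name": "example"}
    print(token_attributes_present(span_attributes))
```

An empty result points at the exporter or instrumentation. A non-empty result suggests the attributes are exported but not recognized downstream.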

## 21. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: `issue_or_pr_quality=unknown`.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | github_repo:676829188 | https://github.com/confident-ai/deepeval

## 22. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: `release_recency=unknown`.
- User impact: May increase setup, validation, or first-run risk for the user.
- Evidence: evidence.maintainer_signals | github_repo:676829188 | https://github.com/confident-ai/deepeval

## 23. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this maintenance risk before relying on the project: Opus 4.8: Day 0 Support
- User impact: Upgrade or migration may change expected behavior: Opus 4.8: Day 0 Support
- Evidence: failure_mode_cluster:github_release | https://github.com/confident-ai/deepeval/releases/tag/v4.0.5

## 24. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this maintenance risk before relying on the project: 🎉 Metrics for AI agents, multi-turn synthetic data generation, and more!
- User impact: Upgrade or migration may change expected behavior: 🎉 Metrics for AI agents, multi-turn synthetic data generation, and more!
- Evidence: failure_mode_cluster:github_release | https://github.com/confident-ai/deepeval/releases/tag/v3.9.9

## 25. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this maintenance risk before relying on the project: 🎉 New Arena GEval Metric, for Pairwise Comparisons
- User impact: Upgrade or migration may change expected behavior: 🎉 New Arena GEval Metric, for Pairwise Comparisons
- Evidence: failure_mode_cluster:github_release | https://github.com/confident-ai/deepeval/releases/tag/v3.1.9

## 26. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this maintenance risk before relying on the project: 🎉 New Conversational Evaluation, LiteLLM Integration
- User impact: Upgrade or migration may change expected behavior: 🎉 New Conversational Evaluation, LiteLLM Integration
- Evidence: failure_mode_cluster:github_release | https://github.com/confident-ai/deepeval/releases/tag/v3.0.8

## 27. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this maintenance risk before relying on the project: 🎉 New Multimodal Metrics, with Platform Support
- User impact: Upgrade or migration may change expected behavior: 🎉 New Multimodal Metrics, with Platform Support
- Evidence: failure_mode_cluster:github_release | https://github.com/confident-ai/deepeval/releases/tag/v3.1.5

## 28. Maintenance risk - Maintenance risk requires verification

- Severity: low
- Evidence strength: source_linked
- Finding: Developers should check this maintenance risk before relying on the project: 🎉 Renewed datasets, single vs multi-turn
- User impact: Upgrade or migration may change expected behavior: 🎉 Renewed datasets, single vs multi-turn
- Evidence: failure_mode_cluster:github_release | https://github.com/confident-ai/deepeval/releases/tag/v3.2.6