Evaluating AI Agent Quality with Agent Evaluation
Who this is for:
Architecture / Concept Overview: Evaluating AI Agent Quality with Agent Evaluation
Agent Evaluation combines automated LLM-based judging with human review to produce quality scores that gate deployment and feed production monitoring.
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
flowchart LR
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
AGENT[Deployed Agent] -->|Run| EVAL_SET[Evaluation Dataset]
EVAL_SET -->|Generate| RESPONSES[Agent Responses]
RESPONSES -->|Score| LLM_JUDGE[LLM-as-Judge]
RESPONSES -->|Review| HUMAN[Human Review App]
LLM_JUDGE -->|Metrics| REPORT[Quality Report]
HUMAN -->|Feedback| REPORT
REPORT -->|Gate| DEPLOY_GATE[Deployment Gate]
REPORT -->|Monitor| PROD_MON[Production Monitoring]
AGENT:::serving
EVAL_SET:::source
RESPONSES:::ingestion
LLM_JUDGE:::processing
HUMAN:::governance
REPORT:::storage
DEPLOY_GATE:::serving
PROD_MON:::governance
*Agent Evaluation pipeline: agents run against evaluation sets, LLM judges and humans score responses, and quality gates control deployment.*
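The flow in the diagram can be sketched in plain Python. This is a minimal illustration, not the Agent Evaluation API: `EvalCase`, `ScoredResponse`, `toy_judge`, `run_evaluation`, `quality_report`, and `passes_gate` are hypothetical names, the judge is a word-overlap heuristic standing in for a real LLM judge, and the 0.8 gate threshold is an arbitrary assumption.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List, Optional


@dataclass
class EvalCase:
    request: str
    expected_response: str


@dataclass
class ScoredResponse:
    request: str
    response: str
    judge_score: float                   # 0.0 to 1.0, from the automated judge
    human_score: Optional[float] = None  # set only when a reviewer rated it


def toy_judge(response: str, expected: str) -> float:
    """Stand-in for an LLM judge: share of expected words present in the response."""
    expected_words = set(expected.lower().split())
    if not expected_words:
        return 0.0
    return len(expected_words & set(response.lower().split())) / len(expected_words)


def run_evaluation(agent: Callable[[str], str], cases: List[EvalCase],
                   judge: Callable[[str, str], float] = toy_judge) -> List[ScoredResponse]:
    """Run the agent on every case and attach a judge score to each response."""
    scored = []
    for case in cases:
        response = agent(case.request)
        scored.append(ScoredResponse(case.request, response,
                                     judge(response, case.expected_response)))
    return scored


def quality_report(scored: List[ScoredResponse]) -> dict:
    """Aggregate judge scores and any human feedback into one report."""
    if not scored:
        raise ValueError("no scored responses to report on")
    human = [s.human_score for s in scored if s.human_score is not None]
    return {
        "n_cases": len(scored),
        "mean_judge_score": mean([s.judge_score for s in scored]),
        "mean_human_score": mean(human) if human else None,
    }


def passes_gate(report: dict, min_judge_score: float = 0.8) -> bool:
    """Deployment gate: block release when the mean judge score is too low."""
    return report["mean_judge_score"] >= min_judge_score


# Hypothetical agent that only knows one answer.
def demo_agent(request: str) -> str:
    answers = {"What is Delta Lake?": "Delta Lake is a storage layer"}
    return answers.get(request, "I do not know")


cases = [
    EvalCase("What is Delta Lake?", "Delta Lake storage layer"),
    EvalCase("What is Unity Catalog?", "governance catalog"),
]
scored = run_evaluation(demo_agent, cases)
scored[0].human_score = 1.0  # simulated reviewer feedback on the first response
report = quality_report(scored)
gate_open = passes_gate(report)
```

Keeping the judge as a swappable callable means the same loop works whether scores come from a heuristic, an LLM endpoint, or a mix of both; only the gate threshold needs tuning per project.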
%%{init: {"theme":"base","themeVariables":{"background":"#0B0E14","primaryTextColor":"#E0E6ED","lineColor":"#5D6470","darkMode":true,"primaryColor":"#2E4A4A","secondaryColor":"#374151","secondaryTextColor":"#E0E6ED","tertiaryColor":"#111827","tertiaryTextColor":"#E0E6ED","edgeLabelBackground":"#1f2937"}}}%%
graph TD
classDef source fill:#3F4B59,stroke:#9CA3AF,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef ingestion fill:#5A4B36,stroke:#C9A86B,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef processing fill:#535072,stroke:#8E82B4,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef storage fill:#2E4A4A,stroke:#5FAFA8,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef serving fill:#3D5550,stroke:#6BB7AA,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
classDef governance fill:#5A3F52,stroke:#C28BB0,stroke-width:2px,rx:8,ry:8,color:#E0E6ED
METRICS[Agent Quality Metrics] --> GROUND[Groundedness]
METRICS --> REL[Relevance]
METRICS --> SAFE[Safety]
METRICS --> CORRECT[Correctness]
METRICS --> COMPLETE[Task Completion]
METRICS --> LATENCY_M[Latency]
GROUND --> CONTEXT_SUPPORT[Response supported by context]
REL --> ANSWER_QUESTION[Response addresses the question]
SAFE --> NO_HARMFUL[No harmful content]
CORRECT --> FACTUAL[Factually accurate]
COMPLETE --> TASK_DONE[Task fully accomplished]
METRICS:::governance
GROUND:::processing
REL:::processing
SAFE:::storage
CORRECT:::serving
COMPLETE:::serving
LATENCY_M:::ingestion
CONTEXT_SUPPORT:::source
ANSWER_QUESTION:::source
NO_HARMFUL:::source
FACTUAL:::source
TASK_DONE:::source
*Agent quality metric taxonomy: from groundedness to task completion.*
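One way to hold the metrics from this taxonomy is a per-response record with a pass/fail flag for each quality dimension plus a latency measurement. The sketch below is illustrative only: `Assessment`, `QUALITY_METRICS`, and `summarize` are hypothetical names, and treating each quality metric as a boolean is a simplifying assumption.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List

# Quality dimensions from the taxonomy; latency is tracked separately as a number.
QUALITY_METRICS = ("groundedness", "relevance", "safety", "correctness", "task_completion")


@dataclass
class Assessment:
    groundedness: bool     # response supported by the retrieved context
    relevance: bool        # response addresses the question
    safety: bool           # no harmful content
    correctness: bool      # factually accurate
    task_completion: bool  # task fully accomplished
    latency_seconds: float


def summarize(assessments: List[Assessment]) -> Dict[str, float]:
    """Return the pass rate for each quality metric and the mean latency."""
    if not assessments:
        raise ValueError("at least one assessment is required")
    total = len(assessments)
    summary = {
        name: sum(getattr(a, name) for a in assessments) / total
        for name in QUALITY_METRICS
    }
    summary["mean_latency_seconds"] = mean([a.latency_seconds for a in assessments])
    return summary


assessments = [
    Assessment(True, True, True, True, True, latency_seconds=1.0),
    Assessment(False, True, True, True, False, latency_seconds=3.0),
]
summary = summarize(assessments)
```

Reporting each dimension separately, rather than as one blended score, makes it visible when an agent is relevant and safe but poorly grounded.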
Key Terms
Prerequisites and Setup
- `databricks-agents` package installed.
- A deployed agent or model to evaluate.
- An evaluation dataset with test questions and expected answers.
- Access to Foundation Model APIs for LLM-as-judge scoring.
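As a rough illustration of the evaluation dataset prerequisite, the sketch below builds a tiny set of test questions with expected answers and checks it before use. The field names `request` and `expected_response`, the sample rows, and the `validate_evaluation_set` helper are assumptions made for this example; confirm the schema your installed package version expects.

```python
from typing import Dict, List

# Assumed field names for illustration; verify against the package's documented schema.
REQUIRED_FIELDS = ("request", "expected_response")

evaluation_set: List[Dict[str, str]] = [
    {
        "request": "How do I reset my password?",
        "expected_response": "Use the 'Forgot password' link on the sign-in page.",
    },
    {
        "request": "Which plans include priority support?",
        "expected_response": "Priority support is included in the Enterprise plan.",
    },
]


def validate_evaluation_set(rows: List[Dict[str, str]]) -> List[str]:
    """Return a list of problems; an empty list means every row has the required fields."""
    problems = []
    for index, row in enumerate(rows):
        for field in REQUIRED_FIELDS:
            value = row.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"row {index}: missing or empty '{field}'")
    return problems


problems = validate_evaluation_set(evaluation_set)
broken = validate_evaluation_set([{"request": "Only a question"}])
```

Running a check like this before evaluation catches rows the judge cannot score, such as a question with no expected answer.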
Step-by-Step Implementation
Configuration Reference
| Parameter | Default | Description |
|---|---|---|
| `model_name` | (none) | Unity Catalog model to evaluate |
| `evaluation_set` | (none) | DataFrame or table with test cases |
| `metrics` | all | Specific metrics to compute |
| `enable_review_app` | false | Launch human review interface |
| `judge_model` | default | LLM endpoint for automated judging |
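The parameters and defaults in the table can be captured in a small configuration object. This is a hedged sketch rather than the package's own interface: the `EvalConfig` class, its validation rules, and the model name `main.agents.support_bot` are hypothetical, and only the parameter names and defaults come from the table.

```python
from dataclasses import dataclass
from typing import Any, Sequence, Union


@dataclass(frozen=True)
class EvalConfig:
    model_name: str                            # Unity Catalog model to evaluate
    evaluation_set: Any                        # DataFrame or table with test cases
    metrics: Union[str, Sequence[str]] = "all" # "all" or a list of metric names
    enable_review_app: bool = False            # launch the human review interface
    judge_model: str = "default"               # LLM endpoint for automated judging

    def __post_init__(self) -> None:
        # Parameters without a default in the table are treated as required here.
        if not self.model_name:
            raise ValueError("model_name is required")
        if self.evaluation_set is None:
            raise ValueError("evaluation_set is required")
        if isinstance(self.metrics, str) and self.metrics != "all":
            raise ValueError("metrics must be 'all' or a sequence of metric names")


config = EvalConfig(
    model_name="main.agents.support_bot",  # hypothetical model name
    evaluation_set=[{"request": "How do I reset my password?"}],
    metrics=["groundedness", "safety"],
)
```

Freezing the dataclass keeps a configuration from changing between the evaluation run and the report that describes it.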