Evaluation Framework¶
Prela's evaluation framework provides a comprehensive system for testing AI agents with assertions, test suites, and CI/CD integration.
Overview¶
The evaluation framework helps you:
- Define test cases with inputs and expected outputs
- Run assertions to validate agent behavior
- Execute test suites sequentially or in parallel
- Generate reports in multiple formats (console, JSON, JUnit)
- Integrate with CI/CD for automated testing
graph LR
A[Test Cases] --> B[Test Suite]
B --> C[Runner]
C --> D[Agent]
D --> E[Output]
E --> F[Assertions]
F --> G[Results]
G --> H[Reporter]
H --> I[Console/JSON/JUnit]
Quick Start¶
from prela.evals import EvalCase, EvalInput, EvalExpected, EvalSuite, EvalRunner
# Define test case
case = EvalCase(
id="test_qa",
name="Basic QA Test",
input=EvalInput(query="What is 2+2?"),
expected=EvalExpected(contains=["4"])
)
# Create suite
suite = EvalSuite(name="Math Tests", cases=[case])
# Define your agent
def my_agent(input_data):
query = input_data.get("query", "")
if "2+2" in query:
return "The answer is 4"
return "I don't know"
# Run evaluation
runner = EvalRunner(suite, my_agent)
result = runner.run()
# View results
print(result.summary())
# Evaluation Suite: Math Tests
# Total Cases: 1
# Passed: 1 (100.0%)
Core Components¶
1. EvalCase¶
A test case defines: - Input: Data to pass to your agent - Expected: What the output should contain/match - Assertions: Validation rules - Metadata: Tags, timeout, description
from prela.evals import EvalCase, EvalInput, EvalExpected
case = EvalCase(
id="test_greeting",
name="Greeting Test",
input=EvalInput(
query="Say hello",
context={"language": "english"}
),
expected=EvalExpected(
contains=["hello", "hi"],
not_contains=["error"]
),
tags=["greeting", "basic"],
timeout_seconds=5.0
)
2. EvalSuite¶
A collection of test cases with optional setup/teardown:
from prela.evals import EvalSuite
suite = EvalSuite(
name="Agent Test Suite",
cases=[case1, case2, case3],
setup=lambda: print("Setting up..."),
teardown=lambda: print("Cleaning up..."),
default_assertions=[
{"type": "latency", "max_ms": 2000}
]
)
3. Assertions¶
Validation rules that check agent outputs:
# Simple expectations
expected = EvalExpected(
contains=["success"], # Must contain
not_contains=["error"], # Must not contain
)
# Advanced assertions
assertions = [
{"type": "contains", "value": "result"},
{"type": "regex", "pattern": r"\d+"},
{"type": "json_valid"},
{"type": "semantic_similarity", "threshold": 0.8, "reference": "..."},
{"type": "tool_called", "tool_name": "search"},
]
4. EvalRunner¶
Executes test suites and collects results:
from prela.evals import EvalRunner
runner = EvalRunner(
suite=suite,
agent_fn=my_agent,
parallel=True, # Run tests in parallel
max_workers=4, # Parallel workers
tracer=tracer # Optional: trace test execution
)
result = runner.run()
5. Reporters¶
Generate results in different formats:
from prela.evals.reporters import ConsoleReporter, JSONReporter, JUnitReporter
# Console output
ConsoleReporter(verbosity="verbose").report(result)
# JSON file
JSONReporter("results.json").report(result)
# JUnit XML (for CI/CD)
JUnitReporter("junit.xml").report(result)
Assertion Types¶
Structural Assertions¶
- contains: Text must be present
- not_contains: Text must not be present
- regex: Match regex pattern
- length: Output length constraints
- json_valid: Valid JSON output
- latency: Response time limits
Tool Assertions¶
- tool_called: Specific tool was invoked
- tool_args: Tool called with correct arguments
- tool_sequence: Tools called in order
Semantic Assertions¶
- semantic_similarity: Semantic similarity to reference
YAML/JSON Support¶
Define test suites in YAML or JSON:
# tests.yaml
name: Agent Test Suite
cases:
- id: test_greeting
name: Greeting Test
input:
query: "Say hello"
expected:
contains: ["hello"]
tags: ["greeting"]
- id: test_math
name: Math Test
input:
query: "What is 5+3?"
expected:
contains: ["8"]
tags: ["math"]
Load and run:
from prela.evals import EvalSuite
suite = EvalSuite.from_yaml("tests.yaml")
runner = EvalRunner(suite, my_agent)
result = runner.run()
CLI Integration¶
Run evaluations from the command line:
# Run test suite
prela eval run tests.yaml --agent my_agent.py
# With custom reporter
prela eval run tests.yaml --agent my_agent.py --format junit --output results.xml
# Parallel execution
prela eval run tests.yaml --agent my_agent.py --parallel --workers 8
CI/CD Integration¶
GitHub Actions¶
name: Run Evaluations
on: [push, pull_request]
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- uses: actions/setup-python@v2
- run: pip install prela
- name: Run evaluations
run: |
prela eval run tests.yaml --agent agent.py --format junit --output results.xml
- name: Publish test results
uses: EnricoMi/publish-unit-test-result-action@v2
if: always()
with:
files: results.xml
GitLab CI¶
test:
script:
- pip install prela
- prela eval run tests.yaml --agent agent.py --format junit --output results.xml
artifacts:
reports:
junit: results.xml
Observability Integration¶
Trace test execution with Prela's tracer:
import prela
from prela.evals import EvalRunner
# Initialize tracer
tracer = prela.init(service_name="eval-runner", exporter="console")
# Run with tracing
runner = EvalRunner(suite, my_agent, tracer=tracer)
result = runner.run()
# Each test case execution is captured as a span
# View complete trace of agent behavior during tests
Best Practices¶
1. Organize Tests by Feature¶
qa_suite = EvalSuite(name="QA Tests", cases=qa_cases, tags=["qa"])
rag_suite = EvalSuite(name="RAG Tests", cases=rag_cases, tags=["rag"])
tool_suite = EvalSuite(name="Tool Tests", cases=tool_cases, tags=["tools"])
2. Use Tags for Filtering¶
case = EvalCase(
id="test_1",
tags=["smoke", "critical", "qa"]
)
# Filter by tags (future feature)
# runner.run(tags=["smoke"])
3. Set Appropriate Timeouts¶
# Fast operations
case = EvalCase(..., timeout_seconds=2.0)
# LLM calls
case = EvalCase(..., timeout_seconds=30.0)
# Complex workflows
case = EvalCase(..., timeout_seconds=120.0)
4. Combine Multiple Assertions¶
case = EvalCase(
id="test_comprehensive",
expected=EvalExpected(
contains=["result"],
not_contains=["error"]
),
assertions=[
{"type": "latency", "max_ms": 5000},
{"type": "json_valid"},
{"type": "length", "min": 10, "max": 500},
]
)
5. Use Setup/Teardown¶
def setup():
"""Initialize test environment."""
db.connect()
cache.clear()
def teardown():
"""Clean up after tests."""
db.disconnect()
suite = EvalSuite(
name="Integration Tests",
cases=cases,
setup=setup,
teardown=teardown
)
Example Workflows¶
Regression Testing¶
# Define expected behaviors
regression_cases = [
EvalCase(
id=f"regression_{i}",
input=EvalInput(query=query),
expected=EvalExpected(output=known_good_output)
)
for i, (query, known_good_output) in enumerate(test_data)
]
suite = EvalSuite(name="Regression Tests", cases=regression_cases)
result = EvalRunner(suite, agent).run()
# Fail CI if regressions detected
assert result.pass_rate == 1.0, f"Regressions detected: {result.failed} failures"
A/B Testing¶
# Test two agent versions
result_v1 = EvalRunner(suite, agent_v1).run()
result_v2 = EvalRunner(suite, agent_v2).run()
print(f"V1 pass rate: {result_v1.pass_rate:.1%}")
print(f"V2 pass rate: {result_v2.pass_rate:.1%}")
# Deploy version with better performance
Performance Benchmarking¶
# Add latency assertions to all cases
suite.default_assertions = [
{"type": "latency", "max_ms": 2000}
]
result = EvalRunner(suite, agent, parallel=True, max_workers=10).run()
# Analyze performance
for case_result in result.case_results:
print(f"{case_result.case_name}: {case_result.duration_ms:.0f}ms")
Next Steps¶
- Learn Writing Tests
- Explore Assertions
- See Running Evaluations
- Set up CI Integration