Assertions¶
Assertions validate agent outputs against expected behavior. Prela provides 21 built-in assertion types across five categories.
Structural Assertions¶
ContainsAssertion¶
Checks if output contains specific text.
from prela.evals.assertions import ContainsAssertion
assertion = ContainsAssertion(text="success", case_sensitive=False)
result = assertion.evaluate(output="Operation completed successfully!")
print(result.passed) # True
Configuration:
{
"type": "contains",
"value": "expected text",
"case_sensitive": False # Optional, default: False
}
NotContainsAssertion¶
Checks if output does NOT contain specific text.
from prela.evals.assertions import NotContainsAssertion
assertion = NotContainsAssertion(text="error", case_sensitive=True)
result = assertion.evaluate(output="All tests passed!")
print(result.passed) # True
Configuration:
RegexAssertion¶
Matches output against a regex pattern.
from prela.evals.assertions import RegexAssertion
assertion = RegexAssertion(pattern=r"\\d{3}-\\d{3}-\\d{4}")
result = assertion.evaluate(output="Call me at 555-123-4567")
print(result.passed) # True
Configuration:
LengthAssertion¶
Validates output length is within bounds.
from prela.evals.assertions import LengthAssertion
assertion = LengthAssertion(min_length=10, max_length=100)
result = assertion.evaluate(output="This is a medium length response.")
print(result.passed) # True
Configuration:
JSONValidAssertion¶
Validates output is valid JSON.
from prela.evals.assertions import JSONValidAssertion
assertion = JSONValidAssertion()
result = assertion.evaluate(output='{"key": "value"}')
print(result.passed) # True
Configuration:
LatencyAssertion¶
Validates response time is under threshold.
from prela.evals.assertions import LatencyAssertion
from prela.core.span import Span, SpanType
from datetime import datetime, timezone, timedelta
assertion = LatencyAssertion(max_ms=5000)
# Create span with timing
span = Span(
name="test",
span_type=SpanType.LLM,
started_at=datetime.now(timezone.utc),
ended_at=datetime.now(timezone.utc) + timedelta(milliseconds=1234)
)
result = assertion.evaluate(output="", trace=[span])
print(result.passed) # True (1234ms < 5000ms)
Configuration:
Tool Assertions¶
ToolCalledAssertion¶
Validates that a specific tool was called.
from prela.evals.assertions import ToolCalledAssertion
from prela.core.span import Span, SpanType
assertion = ToolCalledAssertion(tool_name="search")
# Create span with tool call
span = Span(name="tool.search", span_type=SpanType.TOOL)
span.set_attribute("tool.name", "search")
result = assertion.evaluate(output="", trace=[span])
print(result.passed) # True
Configuration:
ToolArgsAssertion¶
Validates tool was called with correct arguments.
from prela.evals.assertions import ToolArgsAssertion
assertion = ToolArgsAssertion(
tool_name="calculator",
args={"x": 5, "y": 3}
)
# Create span with tool args
span = Span(name="tool.calculator", span_type=SpanType.TOOL)
span.set_attribute("tool.name", "calculator")
span.set_attribute("tool.input", {"x": 5, "y": 3})
result = assertion.evaluate(output="", trace=[span])
print(result.passed) # True
Configuration:
ToolSequenceAssertion¶
Validates tools were called in specific order.
from prela.evals.assertions import ToolSequenceAssertion
assertion = ToolSequenceAssertion(sequence=["search", "summarize", "format"])
# Create spans for each tool
spans = [
Span(name="tool.search", span_type=SpanType.TOOL),
Span(name="tool.summarize", span_type=SpanType.TOOL),
Span(name="tool.format", span_type=SpanType.TOOL)
]
for span, tool in zip(spans, ["search", "summarize", "format"]):
span.set_attribute("tool.name", tool)
result = assertion.evaluate(output="", trace=spans)
print(result.passed) # True
Configuration:
Semantic Assertions¶
SemanticSimilarityAssertion¶
Validates semantic similarity to reference text.
Requirements:
from prela.evals.assertions import SemanticSimilarityAssertion
assertion = SemanticSimilarityAssertion(
reference="The capital of France is Paris",
threshold=0.8
)
result = assertion.evaluate(output="Paris is the capital city of France")
print(result.passed) # True (similarity > 0.8)
Configuration:
{
"type": "semantic_similarity",
"reference": "Expected meaning",
"threshold": 0.8, # 0.0 to 1.0
"model": "all-MiniLM-L6-v2" # Optional
}
Security Assertions¶
NoPIIAssertion¶
Validates that output contains no personally identifiable information. Detects emails, phone numbers, SSNs, credit card numbers, and API keys (AWS, Stripe, GitHub, OpenAI, Slack, Google).
from prela.evals.assertions import NoPIIAssertion
assertion = NoPIIAssertion()
result = assertion.evaluate(output="Contact [email protected] for details")
print(result.passed) # False -- contains email
# Allow specific PII types
assertion = NoPIIAssertion(allow_emails=True, allow_phones=True)
result = assertion.evaluate(output="Email [email protected]")
print(result.passed) # True -- emails allowed
Configuration:
{
"type": "no_pii",
"allow_emails": False, # Optional, default: False
"allow_phones": False # Optional, default: False
}
NoInjectionAssertion¶
Validates that output does not contain prompt injection patterns. Scans for 5 categories of injection attempts:
- Instruction overrides (critical) -- "ignore previous instructions", "override system prompt"
- Jailbreak attempts (high) -- DAN mode, developer mode, "act without restrictions"
- Role confusion (high) -- injected
[SYSTEM],<|assistant|>,<system>markers - Encoded injection (medium) -- base64 decode, eval/exec calls
- Delimiter injection (medium) -- closing prompt tags, end markers
from prela.evals.assertions import NoInjectionAssertion
assertion = NoInjectionAssertion()
result = assertion.evaluate(output="Ignore all previous instructions")
print(result.passed) # False -- injection pattern detected
# Only flag high and critical severity
assertion = NoInjectionAssertion(min_severity="high")
Configuration:
{
"type": "no_injection",
"min_severity": "medium" # Optional: "low", "medium", "high", "critical"
}
CustomRuleAssertion¶
Flexible regex-based assertion for user-defined content rules.
from prela.evals.assertions import CustomRuleAssertion
# Forbid certain words in output
assertion = CustomRuleAssertion(
pattern=r"\b(password|secret|token)\b",
must_match=False,
description="No secrets in output",
)
result = assertion.evaluate(output="Your password is 1234")
print(result.passed) # False -- contains "password"
# Require output to match a pattern
assertion = CustomRuleAssertion(
pattern=r"\bJSON\b",
must_match=True,
description="Output must mention JSON",
)
Configuration:
{
"type": "custom_rule",
"pattern": r"\b(password|secret)\b",
"must_match": False, # Optional, default: False
"case_sensitive": False, # Optional, default: False
"description": "rule name" # Optional
}
AI-Scored Assertions¶
LLMJudgeAssertion¶
Uses an LLM to score agent outputs against custom rubrics. The LLM returns a score (0-1) and reasoning, enabling evaluation of qualities like factual accuracy, helpfulness, or tone.
Subscription Required
LLM-as-Judge assertions require a Lunch Money tier subscription or higher.
from prela.evals.assertions import LLMJudgeAssertion
assertion = LLMJudgeAssertion(
rubric="Score 0-1 on factual accuracy and completeness",
threshold=0.7,
)
result = assertion.evaluate(output="Paris is the capital of France")
print(result.passed) # True if score >= 0.7
print(result.score) # e.g., 0.85
print(result.details) # {"score": 0.85, "reasoning": "...", ...}
Providers:
# Anthropic (default)
assertion = LLMJudgeAssertion(
rubric="Is this response helpful?",
provider="anthropic",
model="claude-haiku-4-5-20251001",
)
# OpenAI
assertion = LLMJudgeAssertion(
rubric="Is this response helpful?",
provider="openai",
model="gpt-4o-mini",
)
Configuration:
{
"type": "llm_judge",
"rubric": "Score 0-1 on factual accuracy", # Required
"threshold": 0.7, # Optional, default: 0.7
"model": "claude-haiku-4-5-20251001", # Optional
"provider": "anthropic", # Optional: "anthropic" or "openai"
"system_prompt": "..." # Optional: override judge system prompt
}
Using Assertions¶
In Test Cases¶
from prela.evals import EvalCase, EvalInput
case = EvalCase(
id="test_1",
input=EvalInput(query="What is 2+2?"),
assertions=[
{"type": "contains", "value": "4"},
{"type": "latency", "max_ms": 3000},
{"type": "length", "min": 5, "max": 100}
]
)
Programmatically¶
from prela.evals.assertions import ContainsAssertion, LengthAssertion
assertions = [
ContainsAssertion(text="success"),
LengthAssertion(min_length=10, max_length=500)
]
for assertion in assertions:
result = assertion.evaluate(output=agent_output)
if not result.passed:
print(f"Failed: {result.message}")
With create_assertion Factory¶
from prela.evals.runner import create_assertion
# Create from config
config = {"type": "contains", "value": "hello"}
assertion = create_assertion(config)
result = assertion.evaluate(output="Hello, world!")
Best Practices¶
1. Combine Multiple Assertions¶
assertions = [
{"type": "contains", "value": "result"},
{"type": "not_contains", "value": "error"},
{"type": "json_valid"},
{"type": "latency", "max_ms": 5000}
]
2. Use Semantic Similarity for Fuzzy Matching¶
# Instead of exact match
{"type": "contains", "value": "Paris is the capital of France"}
# Use semantic similarity
{
"type": "semantic_similarity",
"threshold": 0.85,
"reference": "Paris is the capital of France"
}
3. Validate Tool Usage¶
# Ensure tool was called
{"type": "tool_called", "tool_name": "search"}
# Ensure correct arguments
{"type": "tool_args", "tool_name": "search", "args": {"query": "expected"}}
# Ensure correct sequence
{"type": "tool_sequence", "sequence": ["retrieve", "process", "respond"]}
4. Set Realistic Latency Thresholds¶
# Fast operations
{"type": "latency", "max_ms": 1000}
# LLM calls
{"type": "latency", "max_ms": 10000}
# Complex workflows
{"type": "latency", "max_ms": 30000}
Multi-Agent Assertions¶
Specialized assertions for testing multi-agent systems (CrewAI, AutoGen, LangGraph, Swarm):
AgentUsedAssertion¶
Verify that a specific agent was invoked during execution:
from prela.evals.assertions import AgentUsedAssertion
# Verify agent participated
AgentUsedAssertion(agent_name="Researcher", min_invocations=1)
Use Cases: - Verify agent participation in multi-agent workflows - Ensure critical agents are used - Test agent selection logic
TaskCompletedAssertion¶
Verify that a task completed successfully (CrewAI):
from prela.evals.assertions import TaskCompletedAssertion
# Verify task completion
TaskCompletedAssertion(task_description="Research AI trends")
Use Cases: - Verify task completion in CrewAI crews - Ensure all workflow steps execute - Test task orchestration
DelegationOccurredAssertion¶
Verify agent-to-agent delegation (CrewAI):
from prela.evals.assertions import DelegationOccurredAssertion
# Verify specific delegation
DelegationOccurredAssertion(from_agent="Manager", to_agent="Worker")
# Verify any delegation to agent
DelegationOccurredAssertion(to_agent="Worker")
Use Cases: - Test hierarchical crew processes - Verify delegation logic - Ensure proper task routing
HandoffOccurredAssertion¶
Verify agent handoffs (Swarm):
from prela.evals.assertions import HandoffOccurredAssertion
# Verify specific handoff
HandoffOccurredAssertion(from_agent="Triage", to_agent="Billing")
# Verify any handoff from agent
HandoffOccurredAssertion(from_agent="Triage")
Use Cases: - Test Swarm routing logic - Verify specialist assignment - Ensure handoff triggers work
AgentCollaborationAssertion¶
Verify minimum number of agents participated:
from prela.evals.assertions import AgentCollaborationAssertion
# Require at least 3 agents
AgentCollaborationAssertion(min_agents=3)
Use Cases: - Ensure multi-agent collaboration - Verify sufficient agent participation - Test collaborative workflows
ConversationTurnsAssertion¶
Verify conversation length (AutoGen):
from prela.evals.assertions import ConversationTurnsAssertion
# Verify turn count range
ConversationTurnsAssertion(min_turns=2, max_turns=10)
Use Cases: - Test conversation flow - Verify termination conditions - Ensure efficient dialogues
NoCircularDelegationAssertion¶
Detect circular delegation loops:
from prela.evals.assertions import NoCircularDelegationAssertion
# Verify no circular delegation
NoCircularDelegationAssertion()
Use Cases: - Prevent infinite delegation loops - Verify workflow correctness - Ensure proper delegation graphs
Example: Multi-Agent Test¶
from prela.evals import EvalCase, EvalSuite, EvalRunner
from prela.evals.assertions import (
AgentUsedAssertion,
AgentCollaborationAssertion,
DelegationOccurredAssertion,
NoCircularDelegationAssertion
)
# Test multi-agent workflow
test_case = EvalCase(
id="test_research_crew",
name="Research crew with delegation",
input={"topic": "AI agents"},
assertions=[
# Verify all agents used
AgentUsedAssertion(agent_name="Manager", min_invocations=1),
AgentUsedAssertion(agent_name="Researcher", min_invocations=1),
AgentUsedAssertion(agent_name="Writer", min_invocations=1),
# Verify collaboration
AgentCollaborationAssertion(min_agents=3),
# Verify delegation flow
DelegationOccurredAssertion(from_agent="Manager", to_agent="Researcher"),
DelegationOccurredAssertion(from_agent="Manager", to_agent="Writer"),
# Verify no circular delegation
NoCircularDelegationAssertion()
]
)
suite = EvalSuite(name="Multi-Agent Tests", cases=[test_case])
runner = EvalRunner(suite, my_crew_function)
result = runner.run()
For framework-specific examples: - CrewAI Integration - AutoGen Integration - LangGraph Integration - Swarm Integration
Next Steps¶
- See Writing Tests for test case creation
- Learn Running Evaluations
- Explore CI Integration