Assertions¶

Assertions validate agent outputs against expected behavior. Prela provides 21 built-in assertion types across six categories: structural, tool, semantic, security, AI-scored, and multi-agent.

Structural Assertions¶

ContainsAssertion¶

Checks if output contains specific text.

from prela.evals.assertions import ContainsAssertion

assertion = ContainsAssertion(text="success", case_sensitive=False)
result = assertion.evaluate(output="Operation completed successfully!")
print(result.passed)  # True

Configuration:

{
    "type": "contains",
    "value": "expected text",
    "case_sensitive": False  # Optional, default: False
}

NotContainsAssertion¶

Checks if output does NOT contain specific text.

from prela.evals.assertions import NotContainsAssertion

assertion = NotContainsAssertion(text="error", case_sensitive=True)
result = assertion.evaluate(output="All tests passed!")
print(result.passed)  # True

Configuration:

{
    "type": "not_contains",
    "value": "forbidden text",
    "case_sensitive": True  # Optional
}

RegexAssertion¶

Matches output against a regex pattern.

from prela.evals.assertions import RegexAssertion

assertion = RegexAssertion(pattern=r"\d{3}-\d{3}-\d{4}")
result = assertion.evaluate(output="Call me at 555-123-4567")
print(result.passed)  # True

Configuration:

{
    "type": "regex",
    "pattern": r"\d{3}-\d{3}-\d{4}"
}

LengthAssertion¶

Validates output length is within bounds.

from prela.evals.assertions import LengthAssertion

assertion = LengthAssertion(min_length=10, max_length=100)
result = assertion.evaluate(output="This is a medium length response.")
print(result.passed)  # True

Configuration:

{
    "type": "length",
    "min": 10,    # Optional
    "max": 100    # Optional
}

JSONValidAssertion¶

Validates output is valid JSON.

from prela.evals.assertions import JSONValidAssertion

assertion = JSONValidAssertion()
result = assertion.evaluate(output='{"key": "value"}')
print(result.passed)  # True

Configuration:

{"type": "json_valid"}

LatencyAssertion¶

Validates that response time is under a threshold.

from prela.evals.assertions import LatencyAssertion
from prela.core.span import Span, SpanType
from datetime import datetime, timezone, timedelta

assertion = LatencyAssertion(max_ms=5000)

# Create a span with explicit timing (capture "now" once so the duration is exactly 1234 ms)
started = datetime.now(timezone.utc)
span = Span(
    name="test",
    span_type=SpanType.LLM,
    started_at=started,
    ended_at=started + timedelta(milliseconds=1234)
)

result = assertion.evaluate(output="", trace=[span])
print(result.passed)  # True (1234ms < 5000ms)

Configuration:

{
    "type": "latency",
    "max_ms": 5000
}
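The latency check comes down to comparing a span's duration against `max_ms`. The following is a minimal sketch of that arithmetic using only the standard library. The helper names `duration_ms` and `within_latency` are hypothetical and not part of Prela, and treating the boundary as inclusive is an assumption.

```python
from datetime import datetime, timedelta, timezone


def duration_ms(started_at: datetime, ended_at: datetime) -> float:
    """Return the elapsed time between two timestamps in milliseconds."""
    return (ended_at - started_at).total_seconds() * 1000.0


def within_latency(started_at: datetime, ended_at: datetime, max_ms: float) -> bool:
    """Hypothetical check: True when the duration does not exceed max_ms."""
    return duration_ms(started_at, ended_at) <= max_ms


# Capture "now" once so the two timestamps differ by a known amount.
start = datetime.now(timezone.utc)
end = start + timedelta(milliseconds=1234)

print(within_latency(start, end, max_ms=5000))
print(within_latency(start, end, max_ms=1000))
```

Deriving both timestamps from a single `now` value keeps the duration deterministic, which is useful when writing tests for latency thresholds.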

Tool Assertions¶

ToolCalledAssertion¶

Validates that a specific tool was called.

from prela.evals.assertions import ToolCalledAssertion
from prela.core.span import Span, SpanType

assertion = ToolCalledAssertion(tool_name="search")

# Create span with tool call
span = Span(name="tool.search", span_type=SpanType.TOOL)
span.set_attribute("tool.name", "search")

result = assertion.evaluate(output="", trace=[span])
print(result.passed)  # True

Configuration:

{
    "type": "tool_called",
    "tool_name": "search"
}

ToolArgsAssertion¶

Validates that a tool was called with the expected arguments.

from prela.evals.assertions import ToolArgsAssertion
from prela.core.span import Span, SpanType

assertion = ToolArgsAssertion(
    tool_name="calculator",
    args={"x": 5, "y": 3}
)

# Create span with tool args
span = Span(name="tool.calculator", span_type=SpanType.TOOL)
span.set_attribute("tool.name", "calculator")
span.set_attribute("tool.input", {"x": 5, "y": 3})

result = assertion.evaluate(output="", trace=[span])
print(result.passed)  # True

Configuration:

{
    "type": "tool_args",
    "tool_name": "calculator",
    "args": {"x": 5, "y": 3}
}

ToolSequenceAssertion¶

Validates that tools were called in a specific order.

from prela.evals.assertions import ToolSequenceAssertion
from prela.core.span import Span, SpanType

assertion = ToolSequenceAssertion(sequence=["search", "summarize", "format"])

# Create spans for each tool
spans = [
    Span(name="tool.search", span_type=SpanType.TOOL),
    Span(name="tool.summarize", span_type=SpanType.TOOL),
    Span(name="tool.format", span_type=SpanType.TOOL)
]

for span, tool in zip(spans, ["search", "summarize", "format"]):
    span.set_attribute("tool.name", tool)

result = assertion.evaluate(output="", trace=spans)
print(result.passed)  # True

Configuration:

{
    "type": "tool_sequence",
    "sequence": ["search", "summarize", "format"]
}
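The source does not state whether the sequence must match exactly or whether other tool calls may be interleaved. As a sketch of the two plausible interpretations, the hypothetical helpers below (not part of Prela) compare an expected sequence with the list of tool names observed in a trace.

```python
from typing import Sequence


def is_ordered_subsequence(expected: Sequence[str], actual: Sequence[str]) -> bool:
    """True if every expected name appears in `actual` in the same order.

    Other tool calls may be interleaved between the expected ones.
    """
    remaining = iter(actual)
    # `name in remaining` consumes the iterator, which enforces the order.
    return all(name in remaining for name in expected)


def is_exact_sequence(expected: Sequence[str], actual: Sequence[str]) -> bool:
    """True only if the observed calls match the expected list exactly."""
    return list(expected) == list(actual)


expected = ["search", "summarize", "format"]
observed = ["search", "lookup", "summarize", "format"]

print(is_ordered_subsequence(expected, observed))
print(is_exact_sequence(expected, observed))
```

If interleaved calls matter for a test, it is worth checking which rule the assertion applies before relying on it.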

Semantic Assertions¶

SemanticSimilarityAssertion¶

Validates semantic similarity to reference text.

Requirements:

pip install sentence-transformers

from prela.evals.assertions import SemanticSimilarityAssertion

assertion = SemanticSimilarityAssertion(
    reference="The capital of France is Paris",
    threshold=0.8
)

result = assertion.evaluate(output="Paris is the capital city of France")
print(result.passed)  # True (similarity > 0.8)

Configuration:

{
    "type": "semantic_similarity",
    "reference": "Expected meaning",
    "threshold": 0.8,  # 0.0 to 1.0
    "model": "all-MiniLM-L6-v2"  # Optional
}
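To build intuition for the `threshold` value: embedding-based similarity is commonly computed as the cosine of the angle between two embedding vectors. Whether Prela uses cosine similarity internally is an assumption here. The sketch below uses hand-written toy vectors in place of real sentence embeddings, and the function names are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat a zero vector as having no similarity.
    return dot / (norm_a * norm_b)


def passes_threshold(a: Sequence[float], b: Sequence[float], threshold: float) -> bool:
    """Hypothetical pass rule: similarity must reach the threshold."""
    return cosine_similarity(a, b) >= threshold


# Toy "embeddings"; a real model produces vectors with hundreds of dimensions.
reference = [0.9, 0.1, 0.3]
close = [0.8, 0.2, 0.35]
far = [0.1, 0.9, -0.2]

print(passes_threshold(reference, close, threshold=0.8))
print(passes_threshold(reference, far, threshold=0.8))
```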

Security Assertions¶

NoPIIAssertion¶

Validates that output contains no personally identifiable information. Detects emails, phone numbers, SSNs, credit card numbers, and API keys (AWS, Stripe, GitHub, OpenAI, Slack, Google).

from prela.evals.assertions import NoPIIAssertion

assertion = NoPIIAssertion()
result = assertion.evaluate(output="Contact user@example.com for details")
print(result.passed)  # False -- contains email

# Allow specific PII types
assertion = NoPIIAssertion(allow_emails=True, allow_phones=True)
result = assertion.evaluate(output="Email user@example.com")
print(result.passed)  # True -- emails allowed

Configuration:

{
    "type": "no_pii",
    "allow_emails": False,  # Optional, default: False
    "allow_phones": False   # Optional, default: False
}
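Pattern-based PII detection of this kind can be sketched with the standard `re` module. The patterns and the `find_pii` helper below are simplified, hypothetical illustrations and are not Prela's actual detectors, which cover more types and are likely stricter.

```python
import re

# Deliberately simplified patterns for illustration only.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_pii(text: str, allow_emails: bool = False) -> dict:
    """Return {pii_type: [matches]} for every non-allowed type found in text."""
    found = {}
    for pii_type, pattern in PII_PATTERNS.items():
        if pii_type == "email" and allow_emails:
            continue  # Skip types the caller has explicitly allowed.
        matches = pattern.findall(text)
        if matches:
            found[pii_type] = matches
    return found


print(find_pii("Contact user@example.com for details"))
print(find_pii("Contact user@example.com for details", allow_emails=True))
```

An assertion built on this idea would pass only when the returned dictionary is empty.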

NoInjectionAssertion¶

Validates that output does not contain prompt injection patterns. Scans for 5 categories of injection attempts:

Instruction overrides (critical) -- "ignore previous instructions", "override system prompt"
Jailbreak attempts (high) -- DAN mode, developer mode, "act without restrictions"
Role confusion (high) -- injected [SYSTEM], <|assistant|>, <system> markers
Encoded injection (medium) -- base64 decode, eval/exec calls
Delimiter injection (medium) -- closing prompt tags, end markers

from prela.evals.assertions import NoInjectionAssertion

assertion = NoInjectionAssertion()
result = assertion.evaluate(output="Ignore all previous instructions")
print(result.passed)  # False -- injection pattern detected

# Only flag high and critical severity
assertion = NoInjectionAssertion(min_severity="high")

Configuration:

{
    "type": "no_injection",
    "min_severity": "medium"  # Optional: "low", "medium", "high", "critical"
}
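The effect of `min_severity` can be pictured as a filter over severity-tagged rules: a rule only counts if its severity is at or above the configured floor. The rule list and `detect` function below are hypothetical and far smaller than a real rule set; they only sketch the filtering idea.

```python
import re

# Severity levels from least to most severe.
SEVERITY_ORDER = ["low", "medium", "high", "critical"]

# Hypothetical rules: (category, severity, pattern).
RULES = [
    ("instruction_override", "critical",
     re.compile(r"ignore (all )?previous instructions", re.IGNORECASE)),
    ("role_confusion", "high", re.compile(r"\[SYSTEM\]")),
    ("encoded_injection", "medium", re.compile(r"\bbase64\b", re.IGNORECASE)),
]


def detect(text: str, min_severity: str = "medium") -> list:
    """Return (category, severity) for each matching rule at or above the floor."""
    floor = SEVERITY_ORDER.index(min_severity)
    return [
        (category, severity)
        for category, severity, pattern in RULES
        if SEVERITY_ORDER.index(severity) >= floor and pattern.search(text)
    ]


print(detect("Ignore all previous instructions"))
print(detect("please base64 decode this", min_severity="high"))
```

Raising the floor to `"high"` trades recall for fewer false positives on medium-severity patterns such as encoded content.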

CustomRuleAssertion¶

Flexible regex-based assertion for user-defined content rules.

from prela.evals.assertions import CustomRuleAssertion

# Forbid certain words in output
assertion = CustomRuleAssertion(
    pattern=r"\b(password|secret|token)\b",
    must_match=False,
    description="No secrets in output",
)
result = assertion.evaluate(output="Your password is 1234")
print(result.passed)  # False -- contains "password"

# Require output to match a pattern
assertion = CustomRuleAssertion(
    pattern=r"\bJSON\b",
    must_match=True,
    description="Output must mention JSON",
)

Configuration:

{
    "type": "custom_rule",
    "pattern": r"\b(password|secret)\b",
    "must_match": False,        # Optional, default: False
    "case_sensitive": False,    # Optional, default: False
    "description": "rule name"  # Optional
}

AI-Scored Assertions¶

LLMJudgeAssertion¶

Uses an LLM to score agent outputs against custom rubrics. The LLM returns a score (0-1) and reasoning, enabling evaluation of qualities like factual accuracy, helpfulness, or tone.

Subscription Required

LLM-as-Judge assertions require a Lunch Money tier subscription or higher.

from prela.evals.assertions import LLMJudgeAssertion

assertion = LLMJudgeAssertion(
    rubric="Score 0-1 on factual accuracy and completeness",
    threshold=0.7,
)
result = assertion.evaluate(output="Paris is the capital of France")
print(result.passed)   # True if score >= 0.7
print(result.score)    # e.g., 0.85
print(result.details)  # {"score": 0.85, "reasoning": "...", ...}

Providers:

# Anthropic (default)
assertion = LLMJudgeAssertion(
    rubric="Is this response helpful?",
    provider="anthropic",
    model="claude-haiku-4-5-20251001",
)

# OpenAI
assertion = LLMJudgeAssertion(
    rubric="Is this response helpful?",
    provider="openai",
    model="gpt-4o-mini",
)

Configuration:

{
    "type": "llm_judge",
    "rubric": "Score 0-1 on factual accuracy",  # Required
    "threshold": 0.7,           # Optional, default: 0.7
    "model": "claude-haiku-4-5-20251001",  # Optional
    "provider": "anthropic",    # Optional: "anthropic" or "openai"
    "system_prompt": "..."      # Optional: override judge system prompt
}
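How the score and threshold interact can be sketched without calling any LLM. The example below assumes, purely for illustration, that the judge replies with a JSON object containing `score` and `reasoning`; the real reply format and parsing in Prela are not shown in this document, and `parse_judge_reply` is a hypothetical helper.

```python
import json


def parse_judge_reply(reply: str, threshold: float = 0.7) -> dict:
    """Parse a hypothetical JSON judge reply and apply the pass threshold."""
    data = json.loads(reply)
    score = float(data["score"])
    # Clamp to the documented 0-1 range in case the model drifts outside it.
    score = min(1.0, max(0.0, score))
    return {
        "passed": score >= threshold,
        "score": score,
        "reasoning": str(data.get("reasoning", "")),
    }


reply = '{"score": 0.85, "reasoning": "Accurate and complete."}'
result = parse_judge_reply(reply, threshold=0.7)
print(result["passed"], result["score"])
```

Because judge scores vary between runs and models, thresholds are usually best tuned against a small set of hand-labeled outputs.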

Using Assertions¶

In Test Cases¶

from prela.evals import EvalCase, EvalInput

case = EvalCase(
    id="test_1",
    input=EvalInput(query="What is 2+2?"),
    assertions=[
        {"type": "contains", "value": "4"},
        {"type": "latency", "max_ms": 3000},
        {"type": "length", "min": 5, "max": 100}
    ]
)

Programmatically¶

from prela.evals.assertions import ContainsAssertion, LengthAssertion

assertions = [
    ContainsAssertion(text="success"),
    LengthAssertion(min_length=10, max_length=500)
]

agent_output = "Operation completed with success."  # Replace with your agent's real output

for assertion in assertions:
    result = assertion.evaluate(output=agent_output)
    if not result.passed:
        print(f"Failed: {result.message}")

With create_assertion Factory¶

from prela.evals.runner import create_assertion

# Create from config
config = {"type": "contains", "value": "hello"}
assertion = create_assertion(config)

result = assertion.evaluate(output="Hello, world!")
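A config-driven factory like this is often implemented as a registry that maps the `"type"` string to a class and passes the remaining keys as constructor arguments. The sketch below illustrates that pattern with two toy classes. `REGISTRY`, `build_assertion`, `Contains`, and `JsonValid` are hypothetical stand-ins, not Prela's implementation.

```python
import json
from dataclasses import dataclass


@dataclass
class Contains:
    """Toy stand-in for a 'contains' assertion."""
    value: str
    case_sensitive: bool = False

    def evaluate(self, output: str) -> bool:
        if self.case_sensitive:
            return self.value in output
        return self.value.lower() in output.lower()


@dataclass
class JsonValid:
    """Toy stand-in for a 'json_valid' assertion."""

    def evaluate(self, output: str) -> bool:
        try:
            json.loads(output)
        except ValueError:
            return False
        return True


REGISTRY = {"contains": Contains, "json_valid": JsonValid}


def build_assertion(config: dict):
    """Look up the class for config['type'] and build it from the other keys."""
    params = dict(config)  # Copy so the caller's config is not mutated.
    type_name = params.pop("type")
    try:
        cls = REGISTRY[type_name]
    except KeyError:
        raise ValueError(f"unknown assertion type: {type_name!r}") from None
    return cls(**params)


check = build_assertion({"type": "contains", "value": "hello"})
print(check.evaluate("Hello, world!"))
```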

Best Practices¶

1. Combine Multiple Assertions¶

assertions = [
    {"type": "contains", "value": "result"},
    {"type": "not_contains", "value": "error"},
    {"type": "json_valid"},
    {"type": "latency", "max_ms": 5000}
]

2. Use Semantic Similarity for Fuzzy Matching¶

# Instead of exact match
{"type": "contains", "value": "Paris is the capital of France"}

# Use semantic similarity
{
    "type": "semantic_similarity",
    "threshold": 0.85,
    "reference": "Paris is the capital of France"
}

3. Validate Tool Usage¶

# Ensure tool was called
{"type": "tool_called", "tool_name": "search"}

# Ensure correct arguments
{"type": "tool_args", "tool_name": "search", "args": {"query": "expected"}}

# Ensure correct sequence
{"type": "tool_sequence", "sequence": ["retrieve", "process", "respond"]}

4. Set Realistic Latency Thresholds¶

# Fast operations
{"type": "latency", "max_ms": 1000}

# LLM calls
{"type": "latency", "max_ms": 10000}

# Complex workflows
{"type": "latency", "max_ms": 30000}

Multi-Agent Assertions¶

Specialized assertions for testing multi-agent systems (CrewAI, AutoGen, LangGraph, Swarm):

AgentUsedAssertion¶

Verify that a specific agent was invoked during execution:

from prela.evals.assertions import AgentUsedAssertion

# Verify agent participated
AgentUsedAssertion(agent_name="Researcher", min_invocations=1)

Use Cases:

- Verify agent participation in multi-agent workflows
- Ensure critical agents are used
- Test agent selection logic

TaskCompletedAssertion¶

Verify that a task completed successfully (CrewAI):

from prela.evals.assertions import TaskCompletedAssertion

# Verify task completion
TaskCompletedAssertion(task_description="Research AI trends")

Use Cases:

- Verify task completion in CrewAI crews
- Ensure all workflow steps execute
- Test task orchestration

DelegationOccurredAssertion¶

Verify agent-to-agent delegation (CrewAI):

from prela.evals.assertions import DelegationOccurredAssertion

# Verify specific delegation
DelegationOccurredAssertion(from_agent="Manager", to_agent="Worker")

# Verify any delegation to agent
DelegationOccurredAssertion(to_agent="Worker")

Use Cases:

- Test hierarchical crew processes
- Verify delegation logic
- Ensure proper task routing

HandoffOccurredAssertion¶

Verify agent handoffs (Swarm):

from prela.evals.assertions import HandoffOccurredAssertion

# Verify specific handoff
HandoffOccurredAssertion(from_agent="Triage", to_agent="Billing")

# Verify any handoff from agent
HandoffOccurredAssertion(from_agent="Triage")

Use Cases:

- Test Swarm routing logic
- Verify specialist assignment
- Ensure handoff triggers work

AgentCollaborationAssertion¶

Verify that a minimum number of agents participated:

from prela.evals.assertions import AgentCollaborationAssertion

# Require at least 3 agents
AgentCollaborationAssertion(min_agents=3)

Use Cases:

- Ensure multi-agent collaboration
- Verify sufficient agent participation
- Test collaborative workflows

ConversationTurnsAssertion¶

Verify conversation length (AutoGen):

from prela.evals.assertions import ConversationTurnsAssertion

# Verify turn count range
ConversationTurnsAssertion(min_turns=2, max_turns=10)

Use Cases:

- Test conversation flow
- Verify termination conditions
- Ensure efficient dialogues

NoCircularDelegationAssertion¶

Detect circular delegation loops:

from prela.evals.assertions import NoCircularDelegationAssertion

# Verify no circular delegation
NoCircularDelegationAssertion()

Use Cases:

- Prevent infinite delegation loops
- Verify workflow correctness
- Ensure proper delegation graphs
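Detecting a circular delegation is a cycle-detection problem on a directed graph whose edges are `(from_agent, to_agent)` pairs. The sketch below uses a depth-first search for this. `has_delegation_cycle` is a hypothetical helper, and how Prela extracts delegation edges from a trace is not covered here.

```python
from typing import Dict, Iterable, List, Set, Tuple


def has_delegation_cycle(delegations: Iterable[Tuple[str, str]]) -> bool:
    """Return True if the (from_agent, to_agent) edges contain a cycle."""
    graph: Dict[str, List[str]] = {}
    for src, dst in delegations:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])

    visiting: Set[str] = set()  # Agents on the current DFS path.
    done: Set[str] = set()      # Agents whose subgraph is fully explored.

    def visit(agent: str) -> bool:
        if agent in visiting:
            return True  # Reached an agent already on the path: a cycle.
        if agent in done:
            return False
        visiting.add(agent)
        found = any(visit(nxt) for nxt in graph[agent])
        visiting.discard(agent)
        done.add(agent)
        return found

    return any(visit(agent) for agent in list(graph))


print(has_delegation_cycle([("Manager", "Researcher"), ("Manager", "Writer")]))
print(has_delegation_cycle([("A", "B"), ("B", "C"), ("C", "A")]))
```

The recursive form is fine for the handful of agents in a typical crew; a very deep delegation chain would call for an iterative traversal instead.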

Example: Multi-Agent Test¶

from prela.evals import EvalCase, EvalSuite, EvalRunner
from prela.evals.assertions import (
    AgentUsedAssertion,
    AgentCollaborationAssertion,
    DelegationOccurredAssertion,
    NoCircularDelegationAssertion
)

# Test multi-agent workflow
test_case = EvalCase(
    id="test_research_crew",
    name="Research crew with delegation",
    input={"topic": "AI agents"},
    assertions=[
        # Verify all agents used
        AgentUsedAssertion(agent_name="Manager", min_invocations=1),
        AgentUsedAssertion(agent_name="Researcher", min_invocations=1),
        AgentUsedAssertion(agent_name="Writer", min_invocations=1),

        # Verify collaboration
        AgentCollaborationAssertion(min_agents=3),

        # Verify delegation flow
        DelegationOccurredAssertion(from_agent="Manager", to_agent="Researcher"),
        DelegationOccurredAssertion(from_agent="Manager", to_agent="Writer"),

        # Verify no circular delegation
        NoCircularDelegationAssertion()
    ]
)

suite = EvalSuite(name="Multi-Agent Tests", cases=[test_case])
# my_crew_function is a placeholder for the callable that runs your agent or crew
runner = EvalRunner(suite, my_crew_function)
result = runner.run()

For framework-specific examples:

- CrewAI Integration
- AutoGen Integration
- LangGraph Integration
- Swarm Integration

Next Steps¶

See Writing Tests for test case creation
Learn Running Evaluations
Explore CI Integration