Evaluations API

Framework for testing and evaluating AI agent behavior.

Test Case Definition

EvalInput

prela.evals.case.EvalInput dataclass

Input data for an eval case.

Represents what goes into the agent being tested. Can be a simple query, a list of messages, or custom context data.

Attributes:

Name Type Description
query str | None

Simple string query/prompt (for basic use cases)

messages list[dict] | None

List of message dicts (for chat-based agents)

context dict[str, Any] | None

Additional context data (e.g., retrieved documents, metadata)

Example

Simple query

input1 = EvalInput(query="What is the capital of France?")

Chat messages

input2 = EvalInput(messages=[
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hello!"},
])

Query with context

input3 = EvalInput(
    query="Summarize the document",
    context={"document": "Long text here..."},
)

Functions

__post_init__()

Validate that at least one input type is provided.

to_agent_input()

Convert to format that agent expects.

Returns:

Type Description
dict[str, Any]

Dictionary with all non-None input fields.

Example

input = EvalInput(query="Hello", context={"user_id": "123"})
input.to_agent_input()
# {'query': 'Hello', 'context': {'user_id': '123'}}

from_dict(data) classmethod

Create EvalInput from dictionary.

Parameters:

Name Type Description Default
data dict[str, Any]

Dictionary with 'query', 'messages', and/or 'context' keys

required

Returns:

Type Description
EvalInput

EvalInput instance

Example

data = {"query": "Hello", "context": {"key": "value"}}
input = EvalInput.from_dict(data)

to_dict()

Convert to dictionary for serialization.

Returns:

Type Description
dict[str, Any]

Dictionary representation of the input.
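The validation and conversion behaviour described for EvalInput can be sketched with a plain dataclass. This is an illustrative stand-in, not the prela source: the class name `SketchInput`, the use of `ValueError`, and the rule that any one of the three fields is enough are all assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class SketchInput:
    """Hypothetical stand-in for EvalInput; not the prela implementation."""

    query: str | None = None
    messages: list | None = None
    context: dict | None = None

    def __post_init__(self) -> None:
        # Assumption: an input with no field set is rejected with ValueError.
        if self.query is None and self.messages is None and self.context is None:
            raise ValueError("provide at least one of query, messages, context")

    def to_agent_input(self) -> dict[str, Any]:
        # Keep only the fields that were actually set.
        fields = {"query": self.query, "messages": self.messages, "context": self.context}
        return {key: value for key, value in fields.items() if value is not None}

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "SketchInput":
        # Missing keys simply become None.
        return cls(
            query=data.get("query"),
            messages=data.get("messages"),
            context=data.get("context"),
        )


sample = SketchInput.from_dict({"query": "Hello", "context": {"user_id": "123"}})
agent_input = sample.to_agent_input()
```

Dropping unset fields keeps the dictionary handed to the agent free of `None` placeholders, which matches the `to_agent_input()` example output shown earlier.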

EvalExpected

prela.evals.case.EvalExpected dataclass

Expected output for an eval case.

Defines what the agent's output should look like. Supports multiple validation strategies:
- Exact output match
- Contains/not_contains substring checks
- Tool call validation
- Custom metadata checks

Attributes:

Name Type Description
output str | None

Exact expected output string

contains list[str] | None

List of substrings that must appear in output

not_contains list[str] | None

List of substrings that must NOT appear in output

tool_calls list[dict[str, Any]] | None

Expected tool calls (list of dicts with 'name', 'args', etc.)

metadata dict[str, Any] | None

Expected metadata fields (e.g., final_answer, confidence)

Example

Exact match

expected1 = EvalExpected(output="The answer is 42")

Substring checks

expected2 = EvalExpected(
    contains=["Paris", "capital"],
    not_contains=["London", "Berlin"],
)

Tool call validation

expected3 = EvalExpected(tool_calls=[
    {"name": "search", "args": {"query": "weather"}},
])

Functions

__post_init__()

Validate that at least one expectation is provided.

from_dict(data) classmethod

Create EvalExpected from dictionary.

Parameters:

Name Type Description Default
data dict[str, Any]

Dictionary with expected output specifications

required

Returns:

Type Description
EvalExpected

EvalExpected instance

Example

data = {"contains": ["Paris"], "not_contains": ["London"]}
expected = EvalExpected.from_dict(data)

to_dict()

Convert to dictionary for serialization.

Returns:

Type Description
dict[str, Any]

Dictionary representation of the expected output.
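To make the exact-match and substring strategies concrete, here is a minimal sketch of how such expectations could be checked against an agent's output. The function `check_expected` and its failure messages are hypothetical, and the case-sensitive comparison is an assumption; how prela evaluates EvalExpected internally is not documented here.

```python
from __future__ import annotations


def check_expected(
    output: str,
    exact: str | None = None,
    contains: list[str] | None = None,
    not_contains: list[str] | None = None,
) -> list[str]:
    """Return failure messages; an empty list means every expectation held."""
    failures = []
    if exact is not None and output != exact:
        failures.append(f"expected exact output {exact!r}")
    for needle in contains or []:
        if needle not in output:
            failures.append(f"missing substring {needle!r}")
    for needle in not_contains or []:
        if needle in output:
            failures.append(f"forbidden substring {needle!r} present")
    return failures


ok = check_expected(
    "Paris is the capital of France",
    contains=["Paris", "capital"],
    not_contains=["London", "Berlin"],
)
bad = check_expected("London", contains=["Paris"], not_contains=["London"])
```

Returning a list of messages rather than a single boolean lets a report show every violated expectation at once.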

EvalCase

prela.evals.case.EvalCase dataclass

Complete evaluation test case.

Represents a single test case with input, expected output, and assertions. Eval cases are the building blocks of eval suites.

Attributes:

Name Type Description
id str

Unique identifier for this test case

name str

Human-readable test case name

input EvalInput

Input data for the agent

expected EvalExpected | None

Expected output (optional, can use assertions instead)

assertions list[dict[str, Any]] | None

List of assertion configurations (dicts with 'type', 'value', etc.)

tags list[str]

Tags for filtering/grouping test cases

timeout_seconds float

Maximum execution time for this test case

metadata dict[str, Any]

Additional metadata for this test case

Example

case = EvalCase(
    id="test_basic_qa",
    name="Basic factual question",
    input=EvalInput(query="What is the capital of France?"),
    expected=EvalExpected(contains=["Paris"]),
    assertions=[
        {"type": "contains", "value": "Paris"},
        {"type": "semantic_similarity", "threshold": 0.8},
    ],
    tags=["qa", "geography"],
    timeout_seconds=10.0,
)

Functions

__init__(id, name, input, expected=None, assertions=None, tags=list(), timeout_seconds=30.0, metadata=dict())

to_dict()

Convert to dictionary for serialization.

Returns:

Type Description
dict[str, Any]

Dictionary representation of the test case.

Example

case = EvalCase(
    id="test_1",
    name="Test",
    input=EvalInput(query="Hello"),
    expected=EvalExpected(contains=["Hi"]),
)
data = case.to_dict()
data["id"]
# 'test_1'

from_dict(data) classmethod

Create EvalCase from dictionary.

Parameters:

Name Type Description Default
data dict[str, Any]

Dictionary with test case specification

required

Returns:

Type Description
EvalCase

EvalCase instance

Example

data = {
    "id": "test_1",
    "name": "Test case 1",
    "input": {"query": "Hello"},
    "expected": {"contains": ["Hi"]},
    "tags": ["greeting"],
}
case = EvalCase.from_dict(data)
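The `tags=list()` and `metadata=dict()` defaults in the `__init__` signature presumably come from dataclass default factories, so each case gets its own fresh list and dict. The sketch below illustrates that pattern together with a `from_dict` that fills in defaults; `SketchCase` is a hypothetical stand-in reduced to a few fields, not the prela class.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class SketchCase:
    """Hypothetical stand-in for EvalCase with only the defaulted fields."""

    id: str
    name: str
    tags: list[str] = field(default_factory=list)
    timeout_seconds: float = 30.0
    metadata: dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "SketchCase":
        # Required keys raise KeyError if absent; optional keys fall back to defaults.
        return cls(
            id=data["id"],
            name=data["name"],
            tags=list(data.get("tags", [])),
            timeout_seconds=float(data.get("timeout_seconds", 30.0)),
            metadata=dict(data.get("metadata", {})),
        )


first = SketchCase(id="a", name="A")
second = SketchCase.from_dict({"id": "b", "name": "B", "tags": ["greeting"]})
first.tags.append("qa")  # mutating one case must not leak into another
third = SketchCase(id="c", name="C")
```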

Test Suite

EvalSuite

prela.evals.suite.EvalSuite dataclass

Collection of eval cases with shared configuration.

An eval suite organizes multiple test cases with:
- Shared setup/teardown hooks
- Default assertions applied to all cases
- YAML serialization for easy configuration
- Tagging and filtering capabilities

Attributes:

Name Type Description
name str

Suite name (e.g., "RAG Quality Suite")

description str

Human-readable description of what this suite tests

cases list[EvalCase]

List of eval cases in this suite

default_assertions list[dict[str, Any]] | None

Assertions applied to all cases (unless overridden)

setup Callable[[], None] | None

Callable run before executing the suite (e.g., start services)

teardown Callable[[], None] | None

Callable run after executing the suite (e.g., cleanup)

metadata dict[str, Any]

Additional metadata for the suite

Example

suite = EvalSuite(
    name="RAG Quality Suite",
    description="Tests for RAG pipeline quality",
    cases=[
        EvalCase(
            id="test_basic_qa",
            name="Basic factual question",
            input=EvalInput(query="What is the capital of France?"),
            expected=EvalExpected(contains=["Paris"]),
        )
    ],
    default_assertions=[
        {"type": "latency", "max_ms": 5000},
        {"type": "no_errors"},
    ],
)

Functions

__init__(name, description='', cases=list(), default_assertions=None, setup=None, teardown=None, metadata=dict())

add_case(case)

Add a test case to the suite.

Parameters:

Name Type Description Default
case EvalCase

Eval case to add

required

Example

suite = EvalSuite(name="My Suite")
case = EvalCase(
    id="test_1",
    name="Test",
    input=EvalInput(query="Hello"),
    expected=EvalExpected(contains=["Hi"]),
)
suite.add_case(case)

filter_by_tags(tags)

Filter test cases by tags.

Returns cases that have ALL specified tags.

Parameters:

Name Type Description Default
tags list[str]

List of tags to filter by

required

Returns:

Type Description
list[EvalCase]

List of matching test cases

Example

suite = EvalSuite(name="My Suite", cases=[...])
qa_cases = suite.filter_by_tags(["qa"])
geography_qa = suite.filter_by_tags(["qa", "geography"])
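The "ALL specified tags" rule amounts to a subset test. A minimal sketch, using plain dicts in place of EvalCase objects (the helper name `filter_by_all_tags` and the dict layout are illustrative assumptions):

```python
def filter_by_all_tags(cases, tags):
    """Keep the cases whose tag list includes every requested tag."""
    wanted = set(tags)
    return [case for case in cases if wanted.issubset(case["tags"])]


cases = [
    {"id": "t1", "tags": ["qa", "geography"]},
    {"id": "t2", "tags": ["qa"]},
    {"id": "t3", "tags": ["math"]},
]

qa_ids = [case["id"] for case in filter_by_all_tags(cases, ["qa"])]
geo_qa_ids = [case["id"] for case in filter_by_all_tags(cases, ["qa", "geography"])]
```

In this sketch an empty tag list matches every case, because the empty set is a subset of any set; whether the library behaves the same way is not stated in this reference.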

to_yaml(path)

Save eval suite to YAML file.

Parameters:

Name Type Description Default
path str | Path

Path to save YAML file

required

Raises:

Type Description
ImportError

If PyYAML is not installed

Example

suite = EvalSuite(name="My Suite", cases=[...])
suite.to_yaml("suite.yaml")

from_yaml(path) classmethod

Load eval suite from YAML file.

Parameters:

Name Type Description Default
path str | Path

Path to YAML file

required

Returns:

Type Description
EvalSuite

EvalSuite instance

Raises:

Type Description
ImportError

If PyYAML is not installed

FileNotFoundError

If file doesn't exist

YAMLError

If YAML parsing fails

Example

suite = EvalSuite.from_yaml("tests/suite.yaml")
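Both YAML methods raise ImportError when PyYAML is missing, which suggests the dependency is imported lazily. One common way to do that is sketched below; the helper `require_module` and its message format are hypothetical and only illustrate the pattern, not prela's actual code.

```python
import importlib


def require_module(module_name: str, install_hint: str):
    """Import an optional dependency on demand, with an actionable error message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is required for this feature; try: {install_hint}"
        ) from exc


# A module from the standard library imports fine.
json_module = require_module("json", "n/a")

# A missing module produces the friendlier error.
try:
    require_module("surely_not_an_installed_module_xyz", "pip install something")
    error_message = None
except ImportError as exc:
    error_message = str(exc)
```

Deferring the import to call time means users who never touch YAML files do not need PyYAML installed.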

Test Execution

EvalRunner

prela.evals.runner.EvalRunner

Runner for executing evaluation suites against AI agents.

The runner executes test cases, runs assertions, captures traces, and aggregates results. Supports parallel execution with thread pools.

Example

from prela.evals import EvalSuite, EvalRunner
from prela import get_tracer

suite = EvalSuite.from_yaml("tests.yaml")
tracer = get_tracer()

def my_agent(input_data):
    # Your agent logic here
    return "agent output"

runner = EvalRunner(suite, my_agent, tracer=tracer)
result = runner.run()
print(result.summary())

Functions

__init__(suite, agent, tracer=None, parallel=False, max_workers=4, on_case_complete=None)

Initialize the evaluation runner.

Parameters:

Name Type Description Default
suite EvalSuite

The evaluation suite to run.

required
agent Callable[[EvalInput], Any]

Callable that takes an EvalInput and returns agent output.

required
tracer Tracer | None

Optional tracer for capturing execution traces.

None
parallel bool

Whether to run cases in parallel using a thread pool.

False
max_workers int

Maximum number of worker threads if parallel=True.

4
on_case_complete Callable[[CaseResult], None] | None

Optional callback invoked after each case completes.

None

run()

Run all test cases in the evaluation suite.

Executes setup/teardown hooks, runs all cases (sequentially or in parallel), executes assertions, and aggregates results.

Returns:

Type Description
EvalRunResult

EvalRunResult with aggregated statistics and individual case results.

run_case(case)

Run a single test case.

Executes the agent with the case input, runs all assertions, captures the trace ID if a tracer is configured, and returns aggregated results.

Parameters:

Name Type Description Default
case EvalCase

The test case to run.

required

Returns:

Type Description
CaseResult

CaseResult with pass/fail status and assertion results.
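The interplay of `parallel`, `max_workers`, and `on_case_complete` can be sketched with the standard library's thread pool. Everything here is a simplified assumption: `run_cases`, the dict-based cases and results, and the decision to record agent exceptions instead of re-raising them are illustrative, not the EvalRunner implementation.

```python
from __future__ import annotations

import time
from concurrent.futures import ThreadPoolExecutor


def run_cases(cases, agent, parallel=False, max_workers=4, on_case_complete=None):
    """Run the agent on each case and return result dicts in input order."""

    def run_one(case):
        start = time.perf_counter()
        try:
            output, error = agent(case["input"]), None
        except Exception as exc:  # record the failure instead of aborting the run
            output, error = None, repr(exc)
        result = {
            "id": case["id"],
            "output": output,
            "error": error,
            "duration_ms": (time.perf_counter() - start) * 1000,
        }
        if on_case_complete is not None:
            on_case_complete(result)  # called from a worker thread when parallel
        return result

    if not parallel:
        return [run_one(case) for case in cases]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, whatever the finish order.
        return list(pool.map(run_one, cases))


def echo_agent(text):
    """Toy agent: upper-cases its input and fails on the word 'boom'."""
    if text == "boom":
        raise RuntimeError("agent failed")
    return text.upper()


completed = []
results = run_cases(
    [{"id": "a", "input": "x"}, {"id": "b", "input": "boom"}],
    echo_agent,
    parallel=True,
    on_case_complete=lambda result: completed.append(result["id"]),
)
```

Because the callback may run on several worker threads at once in this sketch, anything it touches should be safe for concurrent use.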

CaseResult

prela.evals.runner.CaseResult dataclass

Result of running a single eval case.

Functions

__post_init__()

Validate fields.

EvalRunResult

prela.evals.runner.EvalRunResult dataclass

Result of running an evaluation suite.

Functions

summary()

Return human-readable summary of the evaluation run.

Returns:

Type Description
str

Multi-line string with summary statistics and case results.

create_assertion

prela.evals.runner.create_assertion(config)

Factory function to create assertion instances from configuration.

This maps assertion type strings to concrete assertion classes and instantiates them with the provided configuration.

Parameters:

Name Type Description Default
config dict

Dictionary with "type" key and type-specific parameters.

required

Returns:

Type Description
BaseAssertion

Instantiated assertion object.

Raises:

Type Description
ValueError

If assertion type is unknown or configuration is invalid.

Example

assertion = create_assertion({
    "type": "contains",
    "text": "hello",
    "case_sensitive": False,
})
result = assertion.evaluate("Hello world", None, None)
assert result.passed
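A factory that maps type strings to classes is often built around a small registry. The sketch below shows that idea under stated assumptions: `REGISTRY`, `build_assertion`, and `ContainsCheck` are hypothetical names, and the real `create_assertion` may be organized differently.

```python
from __future__ import annotations


class ContainsCheck:
    """Hypothetical assertion class used only to demonstrate the factory."""

    def __init__(self, text: str, case_sensitive: bool = True):
        self.text = text
        self.case_sensitive = case_sensitive

    @classmethod
    def from_config(cls, config: dict) -> "ContainsCheck":
        if "text" not in config:
            raise ValueError("'contains' requires a 'text' key")
        return cls(config["text"], config.get("case_sensitive", True))

    def evaluate(self, output: str) -> bool:
        if self.case_sensitive:
            return self.text in output
        return self.text.lower() in output.lower()


# Type string -> class; new assertion types are added by registering them here.
REGISTRY = {"contains": ContainsCheck}


def build_assertion(config: dict):
    kind = config.get("type")
    if kind not in REGISTRY:
        raise ValueError(f"unknown assertion type: {kind!r}")
    params = {key: value for key, value in config.items() if key != "type"}
    return REGISTRY[kind].from_config(params)


check = build_assertion({"type": "contains", "text": "hello", "case_sensitive": False})
```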

Assertions

Base Assertion

prela.evals.assertions.base.BaseAssertion

Bases: ABC

Base class for all assertions.

Assertions evaluate agent outputs and traces to determine if they meet expected criteria. Subclasses should implement the evaluate() method to perform the actual check.

Functions

evaluate(output, expected, trace) abstractmethod

Evaluate the assertion against the output and trace.

Parameters:

Name Type Description Default
output Any

The actual output from the agent/function under test

required
expected Any | None

The expected output (format depends on assertion type)

required
trace list[Span] | None

Optional list of spans from the traced execution

required

Returns:

Type Description
AssertionResult

AssertionResult with pass/fail status and details

from_config(config) abstractmethod classmethod

Create an assertion instance from configuration dict.

Parameters:

Name Type Description Default
config dict[str, Any]

Configuration dictionary with assertion-specific parameters

required

Returns:

Type Description
BaseAssertion

Configured assertion instance

Raises:

Type Description
ValueError

If configuration is invalid

prela.evals.assertions.base.AssertionResult dataclass

Result of an assertion evaluation.

Attributes:

Name Type Description
passed bool

Whether the assertion passed

assertion_type str

Type of assertion (e.g., "contains", "semantic_similarity")

message str

Human-readable message describing the result

score float | None

Optional score between 0-1 for partial credit assertions

expected Any

Expected value (if applicable)

actual Any

Actual value that was evaluated

details dict[str, Any]

Additional details about the evaluation

Functions

__str__()

Human-readable string representation.
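Writing a custom assertion means implementing both abstract methods. The sketch below mirrors the documented shape with local stand-ins, since it does not import prela: `SketchResult`, `SketchAssertion`, and `StartsWithAssertion` are hypothetical, and only a subset of the AssertionResult fields is modelled.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SketchResult:
    """Stand-in for AssertionResult with a subset of its fields."""

    passed: bool
    assertion_type: str
    message: str
    score: float | None = None
    details: dict[str, Any] = field(default_factory=dict)


class SketchAssertion(ABC):
    """Stand-in for BaseAssertion: subclasses must supply both methods."""

    @abstractmethod
    def evaluate(self, output, expected, trace) -> SketchResult: ...

    @classmethod
    @abstractmethod
    def from_config(cls, config: dict) -> "SketchAssertion": ...


class StartsWithAssertion(SketchAssertion):
    def __init__(self, prefix: str):
        self.prefix = prefix

    def evaluate(self, output, expected, trace) -> SketchResult:
        passed = str(output).startswith(self.prefix)
        verdict = "starts" if passed else "does not start"
        return SketchResult(
            passed=passed,
            assertion_type="starts_with",
            message=f"output {verdict} with {self.prefix!r}",
        )

    @classmethod
    def from_config(cls, config: dict) -> "StartsWithAssertion":
        if "prefix" not in config:
            raise ValueError("'starts_with' requires a 'prefix' key")
        return cls(config["prefix"])


outcome = StartsWithAssertion.from_config({"prefix": "Hello"}).evaluate("Hello world", None, None)
```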

Structural Assertions

prela.evals.assertions.structural.ContainsAssertion

Bases: BaseAssertion

Assert that output contains specified text.

Example

assertion = ContainsAssertion(text="error", case_sensitive=False)
result = assertion.evaluate(output="Error occurred", expected=None, trace=None)
assert result.passed

Functions

__init__(text, case_sensitive=True)

Initialize contains assertion.

Parameters:

Name Type Description Default
text str

Text that must be present in output

required
case_sensitive bool

Whether to perform case-sensitive matching

True

evaluate(output, expected, trace)

Check if output contains the specified text.

from_config(config) classmethod

Create from configuration.

Config format

{
    "text": "required text",
    "case_sensitive": true  # optional, default: true
}

prela.evals.assertions.structural.NotContainsAssertion

Bases: BaseAssertion

Assert that output does NOT contain specified text.

Example

assertion = NotContainsAssertion(text="error")
result = assertion.evaluate(output="Success!", expected=None, trace=None)
assert result.passed

Functions

__init__(text, case_sensitive=True)

Initialize not-contains assertion.

Parameters:

Name Type Description Default
text str

Text that must NOT be present in output

required
case_sensitive bool

Whether to perform case-sensitive matching

True

evaluate(output, expected, trace)

Check if output does not contain the specified text.

from_config(config) classmethod

Create from configuration.

Config format

{
    "text": "forbidden text",
    "case_sensitive": true  # optional, default: true
}

prela.evals.assertions.structural.RegexAssertion

Bases: BaseAssertion

Assert that output matches a regular expression pattern.

Example

assertion = RegexAssertion(pattern=r"\d{3}-\d{4}")
result = assertion.evaluate(output="Call 555-1234", expected=None, trace=None)
assert result.passed

Functions

__init__(pattern, flags=0)

Initialize regex assertion.

Parameters:

Name Type Description Default
pattern str

Regular expression pattern to match

required
flags int

Optional regex flags (e.g., re.IGNORECASE)

0

evaluate(output, expected, trace)

Check if output matches the regex pattern.

from_config(config) classmethod

Create from configuration.

Config format

{
    "pattern": "\d{3}-\d{4}",
    "flags": 2  # optional, e.g., re.IGNORECASE
}
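The class example passes on "Call 555-1234", where the pattern matches in the middle of the string, which suggests search semantics rather than a full-string match. The sketch below assumes that reading and also shows where the integer `2` in the config comes from; `regex_matches` is a hypothetical helper.

```python
import re


def regex_matches(pattern: str, output: str, flags: int = 0) -> bool:
    """True if the pattern is found anywhere in the output (search, not fullmatch)."""
    return re.search(pattern, output, flags) is not None


# Flags are plain integers, so they can be stored in a config file.
ignorecase_value = int(re.IGNORECASE)

found_phone = regex_matches(r"\d{3}-\d{4}", "Call 555-1234")
found_with_flag = regex_matches("error", "ERROR occurred", flags=ignorecase_value)
found_without_flag = regex_matches("error", "ERROR occurred")
```

Several flags can be combined with bitwise OR (for example `re.IGNORECASE | re.MULTILINE`) and stored as the resulting integer.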

prela.evals.assertions.structural.LengthAssertion

Bases: BaseAssertion

Assert that output length is within specified bounds.

Example

assertion = LengthAssertion(min_length=10, max_length=100)
result = assertion.evaluate(output="Hello, world!", expected=None, trace=None)
assert result.passed

Functions

__init__(min_length=None, max_length=None)

Initialize length assertion.

Parameters:

Name Type Description Default
min_length int | None

Minimum acceptable length (inclusive)

None
max_length int | None

Maximum acceptable length (inclusive)

None

Raises:

Type Description
ValueError

If both min_length and max_length are None

evaluate(output, expected, trace)

Check if output length is within bounds.

from_config(config) classmethod

Create from configuration.

Config format

{
    "min_length": 10,  # optional
    "max_length": 100  # optional
}
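Both bounds are documented as inclusive, and at least one must be given. A minimal sketch of that rule, with `length_in_bounds` as a hypothetical helper that measures length with `len()` (an assumption about how the library counts):

```python
from __future__ import annotations


def length_in_bounds(
    output: str, min_length: int | None = None, max_length: int | None = None
) -> bool:
    """Inclusive bounds check; at least one bound is required."""
    if min_length is None and max_length is None:
        raise ValueError("specify min_length, max_length, or both")
    size = len(output)
    if min_length is not None and size < min_length:
        return False
    if max_length is not None and size > max_length:
        return False
    return True


greeting_ok = length_in_bounds("Hello, world!", min_length=10, max_length=100)
exactly_at_max = length_in_bounds("abc", max_length=3)
below_min = length_in_bounds("abc", min_length=4)
```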

prela.evals.assertions.structural.JSONValidAssertion

Bases: BaseAssertion

Assert that output is valid JSON, optionally matching a schema.

Example

assertion = JSONValidAssertion()
result = assertion.evaluate(output='{"key": "value"}', expected=None, trace=None)
assert result.passed

Functions

__init__(schema=None)

Initialize JSON validation assertion.

Parameters:

Name Type Description Default
schema dict[str, Any] | None

Optional JSON schema to validate against (using jsonschema library)

None

evaluate(output, expected, trace)

Check if output is valid JSON and optionally matches schema.

from_config(config) classmethod

Create from configuration.

Config format

{
    "schema": {  # optional
        "type": "object",
        "properties": {
            "name": {"type": "string"}
        }
    }
}
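The validity half of this check can be sketched with the standard library alone. The helper `parse_json_output` is hypothetical; schema validation is left out because it depends on the third-party jsonschema package.

```python
import json


def parse_json_output(output):
    """Return (True, parsed_value) for valid JSON, else (False, error_message)."""
    try:
        return True, json.loads(output)
    except (json.JSONDecodeError, TypeError) as exc:
        # TypeError covers non-string outputs such as None.
        return False, str(exc)


valid = parse_json_output('{"key": "value"}')
invalid = parse_json_output("{key: value}")
not_a_string = parse_json_output(None)
```

Keeping the parser's error message is useful for the failure report, since it usually points at the offending character position.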

Tool Assertions

prela.evals.assertions.tool.ToolCalledAssertion

Bases: BaseAssertion

Assert that a specific tool was called during execution.

This assertion examines the trace to verify that a tool span with the specified name exists.

Example

assertion = ToolCalledAssertion(tool_name="web_search")
result = assertion.evaluate(output=None, expected=None, trace=spans)
assert result.passed

Functions

__init__(tool_name)

Initialize tool called assertion.

Parameters:

Name Type Description Default
tool_name str

Name of the tool that should have been called

required

evaluate(output, expected, trace)

Check if the specified tool was called in the trace.

from_config(config) classmethod

Create from configuration.

Config format

{ "tool_name": "web_search" }

prela.evals.assertions.tool.ToolArgsAssertion

Bases: BaseAssertion

Assert that a tool was called with expected arguments.

This assertion verifies both that the tool was called and that it was called with specific argument values.

Example

assertion = ToolArgsAssertion(
    tool_name="web_search",
    expected_args={"query": "Python tutorial"},
)
result = assertion.evaluate(output=None, expected=None, trace=spans)
assert result.passed

Functions

__init__(tool_name, expected_args, partial_match=True)

Initialize tool args assertion.

Parameters:

Name Type Description Default
tool_name str

Name of the tool to check

required
expected_args dict[str, Any]

Expected argument key-value pairs

required
partial_match bool

If True, only check that expected_args are present (allow additional args). If False, require exact match.

True

evaluate(output, expected, trace)

Check if tool was called with expected arguments.

from_config(config) classmethod

Create from configuration.

Config format

{
    "tool_name": "web_search",
    "expected_args": {"query": "Python"},
    "partial_match": true  # optional, default: true
}
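The difference between partial and exact matching is easiest to see on plain dictionaries. In this sketch `args_match` is a hypothetical helper and values are compared with `==`; how the library extracts arguments from spans is not covered.

```python
def args_match(actual: dict, expected: dict, partial_match: bool = True) -> bool:
    """Compare tool-call arguments, optionally ignoring extra keys in `actual`."""
    if not partial_match:
        return actual == expected
    return all(key in actual and actual[key] == value for key, value in expected.items())


actual_args = {"query": "Python tutorial", "max_results": 5}

partial_ok = args_match(actual_args, {"query": "Python tutorial"})
exact_ok = args_match(actual_args, {"query": "Python tutorial"}, partial_match=False)
wrong_value = args_match(actual_args, {"query": "Rust tutorial"})
```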

prela.evals.assertions.tool.ToolSequenceAssertion

Bases: BaseAssertion

Assert that tools were called in a specific order.

This assertion verifies that tools appear in the trace in the expected sequence, though other tools may appear between them.

Example

assertion = ToolSequenceAssertion(
    sequence=["web_search", "calculator", "summarize"],
)
result = assertion.evaluate(output=None, expected=None, trace=spans)
assert result.passed

Functions

__init__(sequence, strict=False)

Initialize tool sequence assertion.

Parameters:

Name Type Description Default
sequence list[str]

Expected sequence of tool names

required
strict bool

If True, no other tools can appear between expected ones. If False, other tools are allowed between expected sequence.

False

evaluate(output, expected, trace)

Check if tools were called in the expected sequence.

from_config(config) classmethod

Create from configuration.

Config format

{
    "sequence": ["tool1", "tool2", "tool3"],
    "strict": false  # optional, default: false
}
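Non-strict ordering is a subsequence test, while strict ordering is read here as "the expected tools appear as one contiguous run". That strict interpretation is an assumption, as is the helper name `sequence_matches`; the function works on plain lists of tool names rather than spans.

```python
def sequence_matches(called: list, expected: list, strict: bool = False) -> bool:
    """Check that `expected` occurs in `called` in order."""
    if strict:
        # Assumed meaning of strict: a contiguous run with nothing in between.
        width = len(expected)
        return any(
            called[start:start + width] == expected
            for start in range(len(called) - width + 1)
        )
    # Non-strict: each `in` test consumes the iterator up to the match,
    # so later names must appear after earlier ones.
    remaining = iter(called)
    return all(name in remaining for name in expected)


calls = ["web_search", "fetch_page", "calculator", "summarize"]
wanted = ["web_search", "calculator", "summarize"]

loose = sequence_matches(calls, wanted)
tight = sequence_matches(calls, wanted, strict=True)
```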

Semantic Assertions

prela.evals.assertions.semantic.SemanticSimilarityAssertion

Bases: BaseAssertion

Assert that output is semantically similar to expected text.

Uses sentence embeddings to compare semantic meaning rather than exact text matching. Useful for evaluating LLM outputs where phrasing varies but meaning should be consistent.

Example

assertion = SemanticSimilarityAssertion(
    expected_text="The weather is nice today",
    threshold=0.8,
)
result = assertion.evaluate(
    output="Today has beautiful weather",
    expected=None,
    trace=None,
)
assert result.passed  # High similarity despite different wording

Requires

pip install sentence-transformers

Functions

__init__(expected_text, threshold=0.8, model_name='all-MiniLM-L6-v2')

Initialize semantic similarity assertion.

Parameters:

Name Type Description Default
expected_text str

Text to compare against

required
threshold float

Minimum cosine similarity score (0-1) to pass

0.8
model_name str

Sentence transformer model to use (default: all-MiniLM-L6-v2, fast and accurate)

'all-MiniLM-L6-v2'

Raises:

Type Description
ImportError

If sentence-transformers is not installed

ValueError

If threshold is not between 0 and 1

evaluate(output, expected, trace)

Check if output is semantically similar to expected text.

from_config(config) classmethod

Create from configuration.

Config format

{
    "expected_text": "The expected output",
    "threshold": 0.8,  # optional, default: 0.8
    "model_name": "all-MiniLM-L6-v2"  # optional
}

clear_cache() classmethod

Clear the embedding cache. Useful for testing or memory management.

get_cache_size() classmethod

Get the number of cached embeddings.
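The pass/fail decision reduces to comparing a cosine similarity with the threshold. The sketch below uses small hand-written vectors in place of real sentence embeddings, so it runs without sentence-transformers; `cosine_similarity` and `passes_threshold` are hypothetical helpers, not library functions.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def passes_threshold(a, b, threshold: float = 0.8) -> bool:
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    return cosine_similarity(a, b) >= threshold


# Toy vectors standing in for embeddings of two texts.
same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
close_enough = passes_threshold([1.0, 1.0], [1.0, 0.9])
```

Cosine similarity ignores vector length, which is why `[1.0, 0.0]` and `[2.0, 0.0]` count as identical in direction.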

Reporters

ConsoleReporter

prela.evals.reporters.console.ConsoleReporter

Reporter that pretty-prints evaluation results to the console.

Uses the rich library for colored output if available, and falls back to plain text formatting otherwise. Provides:
- Summary statistics (pass rate, duration)
- List of all test cases with pass/fail status
- Detailed failure information for failed cases
- Color coding (green=pass, red=fail, yellow=warning)

Example

from prela.evals import EvalRunner
from prela.evals.reporters import ConsoleReporter

runner = EvalRunner(suite, agent)
result = runner.run()

reporter = ConsoleReporter(verbose=True, use_colors=True)
reporter.report(result)

✓ Geography QA Suite
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total: 10 | Passed: 9 (90.0%) | Failed: 1
Duration: 2.5s
...

Functions

__init__(verbose=True, use_colors=True)

Initialize the console reporter.

Parameters:

Name Type Description Default
verbose bool

If True, show detailed failure information. If False, only show summary statistics and failed case names.

True
use_colors bool

If True and rich is available, use colored output. If False or rich unavailable, use plain text.

True

report(result)

Print the evaluation results to the console.

Parameters:

Name Type Description Default
result EvalRunResult

The evaluation run result to report.

required

JSONReporter

prela.evals.reporters.json.JSONReporter

Reporter that writes evaluation results to a JSON file.

Outputs a structured JSON file containing all evaluation data:
- Suite metadata (name, timestamps, duration)
- Summary statistics (total, passed, failed, pass rate)
- Individual case results with assertion details
- Full error messages and stack traces

The JSON format is designed for:
- Programmatic analysis of test results
- Integration with data processing pipelines
- Historical comparison of evaluation runs
- CI/CD artifact storage

Example

from prela.evals import EvalRunner
from prela.evals.reporters import JSONReporter

runner = EvalRunner(suite, agent)
result = runner.run()

reporter = JSONReporter("results/eval_run_123.json")
reporter.report(result)
# Creates results/eval_run_123.json with full results

Functions

__init__(output_path, indent=2)

Initialize the JSON reporter.

Parameters:

Name Type Description Default
output_path str | Path

Path where the JSON file will be written. Parent directories will be created if they don't exist.

required
indent int

Number of spaces for JSON indentation (default: 2). Set to None for compact output.

2

report(result)

Write the evaluation results to a JSON file.

Creates parent directories if they don't exist. Overwrites any existing file at the output path.

Parameters:

Name Type Description Default
result EvalRunResult

The evaluation run result to write.

required

Raises:

Type Description
OSError

If unable to write to the output path.
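The file-handling behaviour described for `report()` (create parent directories, overwrite any existing file, honour `indent`) can be sketched with `pathlib` and `json`. The helper `write_json_report` and the summary keys are hypothetical; the actual layout of prela's JSON output is not specified in this reference.

```python
import json
import tempfile
from pathlib import Path


def write_json_report(summary: dict, output_path, indent=2) -> Path:
    """Write `summary` as JSON, creating parent directories as needed."""
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "w" truncates, so an existing file at this path is overwritten.
    with path.open("w", encoding="utf-8") as handle:
        json.dump(summary, handle, indent=indent)
    return path


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    report_path = write_json_report(
        {"total": 10, "passed": 9, "failed": 1},
        Path(tmp) / "results" / "eval_run.json",
    )
    loaded = json.loads(report_path.read_text(encoding="utf-8"))
    compact_path = write_json_report({"total": 1}, Path(tmp) / "compact.json", indent=None)
    compact_text = compact_path.read_text(encoding="utf-8")
```

Passing `indent=None` to `json.dump` puts the whole document on a single line, which matches the "compact output" note on the `indent` parameter.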

JUnitReporter

prela.evals.reporters.junit.JUnitReporter

Reporter that generates JUnit XML format for CI/CD integration.

Creates a JUnit XML file that can be consumed by continuous integration systems for test result visualization, trend analysis, and failure reporting.

The XML format follows the JUnit schema with:
- <testsuite> root element with summary statistics
- <testcase> elements for each test case
- <failure> elements for failed assertions
- <error> elements for execution errors
- <system-out> for additional output/trace information

Supported CI/CD platforms:
- Jenkins (JUnit plugin)
- GitLab CI/CD (junit report artifacts)
- GitHub Actions (test reporters)
- Azure DevOps (publish test results)
- CircleCI (store_test_results)

Example

from prela.evals import EvalRunner
from prela.evals.reporters import JUnitReporter

runner = EvalRunner(suite, agent)
result = runner.run()

reporter = JUnitReporter("test-results/junit.xml")
reporter.report(result)
# Creates JUnit XML at test-results/junit.xml

Functions

__init__(output_path)

Initialize the JUnit XML reporter.

Parameters:

Name Type Description Default
output_path str | Path

Path where the JUnit XML file will be written. Parent directories will be created if they don't exist.

required

report(result)

Generate and write JUnit XML for the evaluation results.

Creates parent directories if they don't exist. Overwrites any existing file at the output path.

Parameters:

Name Type Description Default
result EvalRunResult

The evaluation run result to convert to JUnit XML.

required

Raises:

Type Description
OSError

If unable to write to the output path.
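For readers unfamiliar with the JUnit XML layout, the sketch below builds a minimal document with the standard library. It follows the commonly used `testsuite` / `testcase` / `failure` element names; `build_junit_xml`, the input dict layout, and the choice of attributes are illustrative assumptions, not the reporter's actual output.

```python
import xml.etree.ElementTree as ET


def build_junit_xml(suite_name: str, cases: list) -> str:
    """cases: dicts with 'name', 'duration_s', and an optional 'failure' message."""
    failures = sum(1 for case in cases if case.get("failure"))
    suite = ET.Element(
        "testsuite",
        name=suite_name,
        tests=str(len(cases)),
        failures=str(failures),
    )
    for case in cases:
        node = ET.SubElement(
            suite, "testcase", name=case["name"], time=f"{case['duration_s']:.3f}"
        )
        if case.get("failure"):
            failure = ET.SubElement(node, "failure", message=case["failure"])
            failure.text = case["failure"]
    return ET.tostring(suite, encoding="unicode")


xml_text = build_junit_xml(
    "Geography QA Suite",
    [
        {"name": "capital_of_france", "duration_s": 0.25},
        {"name": "capital_of_spain", "duration_s": 0.4, "failure": "missing substring 'Madrid'"},
    ],
)
root = ET.fromstring(xml_text)
```

ElementTree escapes attribute values and text on serialization, so failure messages containing quotes or angle brackets stay well-formed.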