Evaluations API¶

Framework for testing and evaluating AI agent behavior.

Test Case Definition¶

EvalInput¶

`prela.evals.case.EvalInput` `dataclass` ¶

Input data for an eval case.

Represents what goes into the agent being tested. Can be a simple query, a list of messages, or custom context data.

Attributes:

Name	Type	Description
`query`	`str \| None`	Simple string query/prompt (for basic use cases)
`messages`	`list[dict] \| None`	List of message dicts (for chat-based agents)
`context`	`dict[str, Any] \| None`	Additional context data (e.g., retrieved documents, metadata)

Example

    # Simple query
    input1 = EvalInput(query="What is the capital of France?")

    # Chat messages
    input2 = EvalInput(messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "Hello!"},
    ])

    # Query with context
    input3 = EvalInput(
        query="Summarize the document",
        context={"document": "Long text here..."},
    )

Functions¶

`__post_init__()` ¶

Validate that at least one input type is provided.

`to_agent_input()` ¶

Convert to the format the agent expects.

Returns:

Type	Description
`dict[str, Any]`	Dictionary with all non-None input fields.

Example

    input = EvalInput(query="Hello", context={"user_id": "123"})
    input.to_agent_input()
    # {'query': 'Hello', 'context': {'user_id': '123'}}
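The two behaviours described so far (rejecting an empty input and emitting only the fields that were set) can be sketched with a stand-in dataclass. `SketchInput` is hypothetical and is not prela's implementation. Raising `ValueError`, and treating `context` alone as sufficient, are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SketchInput:
    """Hypothetical stand-in for EvalInput; not the library's code."""

    query: Optional[str] = None
    messages: Optional[list] = None
    context: Optional[dict] = None

    def __post_init__(self) -> None:
        # Assumption: any one non-None field counts as "an input type".
        if self.query is None and self.messages is None and self.context is None:
            raise ValueError("provide at least one of query, messages, context")

    def to_agent_input(self) -> dict:
        # Keep only the fields that were actually set.
        fields: dict = {
            "query": self.query,
            "messages": self.messages,
            "context": self.context,
        }
        return {key: value for key, value in fields.items() if value is not None}


agent_input: Any = SketchInput(query="Hello", context={"user_id": "123"}).to_agent_input()
print(agent_input)
```

Dropping `None` fields keeps the payload passed to the agent free of keys it would otherwise have to ignore.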

`from_dict(data)` `classmethod` ¶

Create EvalInput from dictionary.

Parameters:

Name	Type	Description	Default
`data`	`dict[str, Any]`	Dictionary with 'query', 'messages', and/or 'context' keys	required

Returns:

Type	Description
`EvalInput`	EvalInput instance

Example

    data = {"query": "Hello", "context": {"key": "value"}}
    input = EvalInput.from_dict(data)

`to_dict()` ¶

Convert to dictionary for serialization.

Returns:

Type	Description
`dict[str, Any]`	Dictionary representation of the input.

EvalExpected¶

`prela.evals.case.EvalExpected` `dataclass` ¶

Expected output for an eval case.

Defines what the agent's output should look like. Supports multiple validation strategies:

- Exact output match
- Contains/not_contains substring checks
- Tool call validation
- Custom metadata checks

Attributes:

Name	Type	Description
`output`	`str \| None`	Exact expected output string
`contains`	`list[str] \| None`	List of substrings that must appear in output
`not_contains`	`list[str] \| None`	List of substrings that must NOT appear in output
`tool_calls`	`list[dict[str, Any]] \| None`	Expected tool calls (list of dicts with 'name', 'args', etc.)
`metadata`	`dict[str, Any] \| None`	Expected metadata fields (e.g., final_answer, confidence)

Example

    # Exact match
    expected1 = EvalExpected(output="The answer is 42")

    # Substring checks
    expected2 = EvalExpected(
        contains=["Paris", "capital"],
        not_contains=["London", "Berlin"],
    )

    # Tool call validation
    expected3 = EvalExpected(tool_calls=[
        {"name": "search", "args": {"query": "weather"}},
    ])
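As an illustration of how `contains` and `not_contains` expectations might be checked against an output string, here is a minimal sketch. The helper `check_substrings` and its failure messages are hypothetical; the source does not show how prela performs these checks.

```python
from typing import Optional


def check_substrings(
    output: str,
    contains: Optional[list] = None,
    not_contains: Optional[list] = None,
) -> list:
    """Return a list of failure messages; an empty list means all checks passed."""
    failures = []
    for needle in contains or []:
        # Every required substring must be present.
        if needle not in output:
            failures.append(f"missing required substring: {needle!r}")
    for needle in not_contains or []:
        # No forbidden substring may be present.
        if needle in output:
            failures.append(f"found forbidden substring: {needle!r}")
    return failures


ok = check_substrings(
    "Paris is the capital of France",
    contains=["Paris", "capital"],
    not_contains=["London", "Berlin"],
)
bad = check_substrings("London", contains=["Paris"], not_contains=["London"])
```

Collecting every failure, rather than stopping at the first, gives a fuller report when several expectations are violated at once.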

Functions¶

`__post_init__()` ¶

Validate that at least one expectation is provided.

`from_dict(data)` `classmethod` ¶

Create EvalExpected from dictionary.

Parameters:

Name	Type	Description	Default
`data`	`dict[str, Any]`	Dictionary with expected output specifications	required

Returns:

Type	Description
`EvalExpected`	EvalExpected instance

Example

    data = {"contains": ["Paris"], "not_contains": ["London"]}
    expected = EvalExpected.from_dict(data)

`to_dict()` ¶

Convert to dictionary for serialization.

Returns:

Type	Description
`dict[str, Any]`	Dictionary representation of the expected output.

EvalCase¶

`prela.evals.case.EvalCase` `dataclass` ¶

Complete evaluation test case.

Represents a single test case with input, expected output, and assertions. Eval cases are the building blocks of eval suites.

Attributes:

Name	Type	Description
`id`	`str`	Unique identifier for this test case
`name`	`str`	Human-readable test case name
`input`	`EvalInput`	Input data for the agent
`expected`	`EvalExpected \| None`	Expected output (optional, can use assertions instead)
`assertions`	`list[dict[str, Any]] \| None`	List of assertion configurations (dicts with 'type', 'value', etc.)
`tags`	`list[str]`	Tags for filtering/grouping test cases
`timeout_seconds`	`float`	Maximum execution time for this test case
`metadata`	`dict[str, Any]`	Additional metadata for this test case

Example

    case = EvalCase(
        id="test_basic_qa",
        name="Basic factual question",
        input=EvalInput(query="What is the capital of France?"),
        expected=EvalExpected(contains=["Paris"]),
        assertions=[
            {"type": "contains", "value": "Paris"},
            {"type": "semantic_similarity", "threshold": 0.8},
        ],
        tags=["qa", "geography"],
        timeout_seconds=10.0,
    )

Functions¶

`__init__(id, name, input, expected=None, assertions=None, tags=list(), timeout_seconds=30.0, metadata=dict())` ¶

`to_dict()` ¶

Convert to dictionary for serialization.

Returns:

Type	Description
`dict[str, Any]`	Dictionary representation of the test case.

Example

    case = EvalCase(
        id="test_1",
        name="Test",
        input=EvalInput(query="Hello"),
        expected=EvalExpected(contains=["Hi"]),
    )
    data = case.to_dict()
    data["id"]
    # 'test_1'

`from_dict(data)` `classmethod` ¶

Create EvalCase from dictionary.

Parameters:

Name	Type	Description	Default
`data`	`dict[str, Any]`	Dictionary with test case specification	required

Returns:

Type	Description
`EvalCase`	EvalCase instance

Example

    data = {
        "id": "test_1",
        "name": "Test case 1",
        "input": {"query": "Hello"},
        "expected": {"contains": ["Hi"]},
        "tags": ["greeting"],
    }
    case = EvalCase.from_dict(data)

Test Suite¶

EvalSuite¶

`prela.evals.suite.EvalSuite` `dataclass` ¶

Collection of eval cases with shared configuration.

An eval suite organizes multiple test cases with:

- Shared setup/teardown hooks
- Default assertions applied to all cases
- YAML serialization for easy configuration
- Tagging and filtering capabilities

Attributes:

Name	Type	Description
`name`	`str`	Suite name (e.g., "RAG Quality Suite")
`description`	`str`	Human-readable description of what this suite tests
`cases`	`list[EvalCase]`	List of eval cases in this suite
`default_assertions`	`list[dict[str, Any]] \| None`	Assertions applied to all cases (unless overridden)
`setup`	`Callable[[], None] \| None`	Callable run before executing the suite (e.g., start services)
`teardown`	`Callable[[], None] \| None`	Callable run after executing the suite (e.g., cleanup)
`metadata`	`dict[str, Any]`	Additional metadata for the suite

Example

    suite = EvalSuite(
        name="RAG Quality Suite",
        description="Tests for RAG pipeline quality",
        cases=[
            EvalCase(
                id="test_basic_qa",
                name="Basic factual question",
                input=EvalInput(query="What is the capital of France?"),
                expected=EvalExpected(contains=["Paris"]),
            )
        ],
        default_assertions=[
            {"type": "latency", "max_ms": 5000},
            {"type": "no_errors"},
        ],
    )

Functions¶

`__init__(name, description='', cases=list(), default_assertions=None, setup=None, teardown=None, metadata=dict())` ¶

`add_case(case)` ¶

Add a test case to the suite.

Parameters:

Name	Type	Description	Default
`case`	`EvalCase`	Eval case to add	required

Example

    suite = EvalSuite(name="My Suite")
    case = EvalCase(
        id="test_1",
        name="Test",
        input=EvalInput(query="Hello"),
        expected=EvalExpected(contains=["Hi"]),
    )
    suite.add_case(case)

`filter_by_tags(tags)` ¶

Filter test cases by tags.

Returns cases that have ALL specified tags.

Parameters:

Name	Type	Description	Default
`tags`	`list[str]`	List of tags to filter by	required

Returns:

Type	Description
`list[EvalCase]`	List of matching test cases

Example

    suite = EvalSuite(name="My Suite", cases=[...])
    qa_cases = suite.filter_by_tags(["qa"])
    geography_qa = suite.filter_by_tags(["qa", "geography"])
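The "ALL specified tags" rule is a subset test: a case matches when the requested tags are a subset of the case's own tags. A minimal sketch, using plain dicts as hypothetical stand-ins for `EvalCase` objects:

```python
def filter_by_all_tags(cases: list, tags: list) -> list:
    """Return the cases whose tag list contains every requested tag."""
    wanted = set(tags)
    return [case for case in cases if wanted.issubset(case["tags"])]


# Hypothetical cases; only the "tags" key matters for this sketch.
cases = [
    {"id": "capital_fr", "tags": ["qa", "geography"]},
    {"id": "greeting", "tags": ["qa"]},
    {"id": "sum", "tags": ["math"]},
]

qa_ids = [c["id"] for c in filter_by_all_tags(cases, ["qa"])]
geo_qa_ids = [c["id"] for c in filter_by_all_tags(cases, ["qa", "geography"])]
```

Under this reading, adding tags to the filter can only narrow the result, and an empty tag list matches every case. Whether prela treats an empty list the same way is not stated in the source.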

`to_yaml(path)` ¶

Save eval suite to YAML file.

Parameters:

Name	Type	Description	Default
`path`	`str \| Path`	Path to save YAML file	required

Raises:

Type	Description
`ImportError`	If PyYAML is not installed

Example

    suite = EvalSuite(name="My Suite", cases=[...])
    suite.to_yaml("suite.yaml")

`from_yaml(path)` `classmethod` ¶

Load eval suite from YAML file.

Parameters:

Name	Type	Description	Default
`path`	`str \| Path`	Path to YAML file	required

Returns:

Type	Description
`EvalSuite`	EvalSuite instance

Raises:

Type	Description
`ImportError`	If PyYAML is not installed
`FileNotFoundError`	If file doesn't exist
`YAMLError`	If YAML parsing fails

Example

suite = EvalSuite.from_yaml("tests/suite.yaml")

Test Execution¶

EvalRunner¶

`prela.evals.runner.EvalRunner` ¶

Runner for executing evaluation suites against AI agents.

The runner executes test cases, runs assertions, captures traces, and aggregates results. Supports parallel execution with thread pools.

Example

    from prela.evals import EvalSuite, EvalRunner
    from prela import get_tracer

    suite = EvalSuite.from_yaml("tests.yaml")
    tracer = get_tracer()

    def my_agent(input_data):
        # Your agent logic here
        return "agent output"

    runner = EvalRunner(suite, my_agent, tracer=tracer)
    result = runner.run()
    print(result.summary())

Functions¶

`__init__(suite, agent, tracer=None, parallel=False, max_workers=4, on_case_complete=None)` ¶

Initialize the evaluation runner.

Parameters:

Name	Type	Description	Default
`suite`	`EvalSuite`	The evaluation suite to run.	required
`agent`	`Callable[[EvalInput], Any]`	Callable that takes an EvalInput and returns agent output.	required
`tracer`	`Tracer \| None`	Optional tracer for capturing execution traces.	`None`
`parallel`	`bool`	Whether to run cases in parallel using a thread pool.	`False`
`max_workers`	`int`	Maximum number of worker threads if parallel=True.	`4`
`on_case_complete`	`Callable[[CaseResult], None] \| None`	Optional callback invoked after each case completes.	`None`

`run()` ¶

Run all test cases in the evaluation suite.

Executes setup/teardown hooks, runs all cases (sequentially or in parallel), executes assertions, and aggregates results.

Returns:

Type	Description
`EvalRunResult`	EvalRunResult with aggregated statistics and individual case results.
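To illustrate how the `parallel`, `max_workers`, and `on_case_complete` options could interact, here is a minimal sketch built on the standard library's `ThreadPoolExecutor`. The function `run_cases` and the dict-shaped results are hypothetical; this is not the runner's actual implementation, and setup/teardown hooks and assertions are omitted.

```python
from concurrent.futures import ThreadPoolExecutor


def run_cases(cases, agent, parallel=False, max_workers=4, on_case_complete=None):
    """Run agent on each case, sequentially or in a thread pool."""

    def run_one(case):
        try:
            result = {"id": case["id"], "output": agent(case["input"]), "error": None}
        except Exception as exc:
            # Record the failure instead of aborting the whole run.
            result = {"id": case["id"], "output": None, "error": str(exc)}
        if on_case_complete is not None:
            on_case_complete(result)
        return result

    if not parallel:
        return [run_one(case) for case in cases]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order, even if cases finish out of order.
        return list(pool.map(run_one, cases))


cases = [{"id": f"t{i}", "input": i} for i in range(5)]
completed = []
results = run_cases(
    cases, lambda x: x * 2, parallel=True, max_workers=2,
    on_case_complete=completed.append,
)
```

In this sketch the callback runs on worker threads when `parallel=True`, so a callback that touches shared state would need to be thread-safe.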

`run_case(case)` ¶

Run a single test case.

Executes the agent with the case input, runs all assertions, captures the trace ID if a tracer is configured, and returns aggregated results.

Parameters:

Name	Type	Description	Default
`case`	`EvalCase`	The test case to run.	required

Returns:

Type	Description
`CaseResult`	CaseResult with pass/fail status and assertion results.

CaseResult¶

`prela.evals.runner.CaseResult` `dataclass` ¶

Result of running a single eval case.

Functions¶

`__post_init__()` ¶

Validate fields.

EvalRunResult¶

`prela.evals.runner.EvalRunResult` `dataclass` ¶

Result of running an evaluation suite.

Functions¶

`summary()` ¶

Return human-readable summary of the evaluation run.

Returns:

Type	Description
`str`	Multi-line string with summary statistics and case results.

create_assertion¶

`prela.evals.runner.create_assertion(config)` ¶

Factory function to create assertion instances from configuration.

This maps assertion type strings to concrete assertion classes and instantiates them with the provided configuration.

Parameters:

Name	Type	Description	Default
`config`	`dict`	Dictionary with "type" key and type-specific parameters.	required

Returns:

Type	Description
`BaseAssertion`	Instantiated assertion object.

Raises:

Type	Description
`ValueError`	If assertion type is unknown or configuration is invalid.

Example

    assertion = create_assertion({
        "type": "contains",
        "text": "hello",
        "case_sensitive": False,
    })
    result = assertion.evaluate("Hello world", None, None)
    assert result.passed
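A factory of this kind is commonly built as a registry that maps type strings to classes. The sketch below shows that pattern with hypothetical names (`ContainsSketch`, `_REGISTRY`, `create_assertion_sketch`); it is an assumption about the general approach, not prela's source.

```python
class ContainsSketch:
    """Hypothetical stand-in for a 'contains' assertion."""

    def __init__(self, text, case_sensitive=True):
        self.text = text
        self.case_sensitive = case_sensitive

    @classmethod
    def from_config(cls, config):
        if "text" not in config:
            raise ValueError("'text' is required for a contains assertion")
        return cls(config["text"], config.get("case_sensitive", True))

    def passes(self, output):
        if self.case_sensitive:
            return self.text in output
        return self.text.lower() in output.lower()


# Registry: assertion type string -> class that knows how to build itself.
_REGISTRY = {"contains": ContainsSketch}


def create_assertion_sketch(config):
    kind = config.get("type")
    if kind not in _REGISTRY:
        raise ValueError(f"unknown assertion type: {kind!r}")
    # Pass everything except the dispatch key on to the class.
    params = {key: value for key, value in config.items() if key != "type"}
    return _REGISTRY[kind].from_config(params)


assertion = create_assertion_sketch(
    {"type": "contains", "text": "hello", "case_sensitive": False}
)
```

Keeping the dispatch table separate from the classes means a new assertion type needs only one new registry entry.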

Assertions¶

Base Assertion¶

`prela.evals.assertions.base.BaseAssertion` ¶

Bases: ABC

Base class for all assertions.

Assertions evaluate agent outputs and traces to determine if they meet expected criteria. Subclasses should implement the evaluate() method to perform the actual check.

Functions¶

`evaluate(output, expected, trace)` `abstractmethod` ¶

Evaluate the assertion against the output and trace.

Parameters:

Name	Type	Description	Default
`output`	`Any`	The actual output from the agent/function under test	required
`expected`	`Any \| None`	The expected output (format depends on assertion type)	required
`trace`	`list[Span] \| None`	Optional list of spans from the traced execution	required

Returns:

Type	Description
`AssertionResult`	AssertionResult with pass/fail status and details

`from_config(config)` `abstractmethod` `classmethod` ¶

Create an assertion instance from configuration dict.

Parameters:

Name	Type	Description	Default
`config`	`dict[str, Any]`	Configuration dictionary with assertion-specific parameters	required

Returns:

Type	Description
`BaseAssertion`	Configured assertion instance

Raises:

Type	Description
`ValueError`	If configuration is invalid

`prela.evals.assertions.base.AssertionResult` `dataclass` ¶

Result of an assertion evaluation.

Attributes:

Name	Type	Description
`passed`	`bool`	Whether the assertion passed
`assertion_type`	`str`	Type of assertion (e.g., "contains", "semantic_similarity")
`message`	`str`	Human-readable message describing the result
`score`	`float \| None`	Optional score between 0-1 for partial credit assertions
`expected`	`Any`	Expected value (if applicable)
`actual`	`Any`	Actual value that was evaluated
`details`	`dict[str, Any]`	Additional details about the evaluation

Functions¶

`__str__()` ¶

Human-readable string representation.
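The contract above (an abstract `evaluate()` plus an abstract `from_config()` classmethod, returning a result record) can be sketched with stand-in classes. `ResultSketch`, `AssertionSketch`, and `StartsWithSketch` are hypothetical; they mirror the documented fields and methods but are not the library's classes.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ResultSketch:
    """Stand-in carrying the fields documented for AssertionResult."""

    passed: bool
    assertion_type: str
    message: str
    score: Optional[float] = None
    expected: Any = None
    actual: Any = None
    details: dict = field(default_factory=dict)


class AssertionSketch(ABC):
    """Stand-in base class with the two abstract methods."""

    @abstractmethod
    def evaluate(self, output, expected, trace):
        ...

    @classmethod
    @abstractmethod
    def from_config(cls, config):
        ...


class StartsWithSketch(AssertionSketch):
    """Hypothetical custom assertion: output must start with a prefix."""

    def __init__(self, prefix):
        self.prefix = prefix

    def evaluate(self, output, expected, trace):
        passed = str(output).startswith(self.prefix)
        verb = "starts" if passed else "does not start"
        return ResultSketch(
            passed=passed,
            assertion_type="starts_with",
            message=f"output {verb} with {self.prefix!r}",
            expected=self.prefix,
            actual=output,
        )

    @classmethod
    def from_config(cls, config):
        if "prefix" not in config:
            raise ValueError("'prefix' is required")
        return cls(config["prefix"])


result = StartsWithSketch.from_config({"prefix": "Paris"}).evaluate(
    "Paris is the capital", None, None
)
```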

Structural Assertions¶

`prela.evals.assertions.structural.ContainsAssertion` ¶

Bases: BaseAssertion

Assert that output contains specified text.

Example

    assertion = ContainsAssertion(text="error", case_sensitive=False)
    result = assertion.evaluate(output="Error occurred", expected=None, trace=None)
    assert result.passed

Functions¶

`__init__(text, case_sensitive=True)` ¶

Initialize contains assertion.

Parameters:

Name	Type	Description	Default
`text`	`str`	Text that must be present in output	required
`case_sensitive`	`bool`	Whether to perform case-sensitive matching	`True`

`evaluate(output, expected, trace)` ¶

Check if output contains the specified text.

`from_config(config)` `classmethod` ¶

Create from configuration.

Config format

    {
        "text": "required text",
        "case_sensitive": true  # optional, default: true
    }

`prela.evals.assertions.structural.NotContainsAssertion` ¶

Bases: BaseAssertion

Assert that output does NOT contain specified text.

Example

    assertion = NotContainsAssertion(text="error")
    result = assertion.evaluate(output="Success!", expected=None, trace=None)
    assert result.passed

Functions¶

`__init__(text, case_sensitive=True)` ¶

Initialize not-contains assertion.

Parameters:

Name	Type	Description	Default
`text`	`str`	Text that must NOT be present in output	required
`case_sensitive`	`bool`	Whether to perform case-sensitive matching	`True`

`evaluate(output, expected, trace)` ¶

Check if output does not contain the specified text.

`from_config(config)` `classmethod` ¶

Create from configuration.

Config format

    {
        "text": "forbidden text",
        "case_sensitive": true  # optional, default: true
    }

`prela.evals.assertions.structural.RegexAssertion` ¶

Bases: BaseAssertion

Assert that output matches a regular expression pattern.

Example

    assertion = RegexAssertion(pattern=r"\d{3}-\d{4}")
    result = assertion.evaluate(output="Call 555-1234", expected=None, trace=None)
    assert result.passed

Functions¶

`__init__(pattern, flags=0)` ¶

Initialize regex assertion.

Parameters:

Name	Type	Description	Default
`pattern`	`str`	Regular expression pattern to match	required
`flags`	`int`	Optional regex flags (e.g., re.IGNORECASE)	`0`

`evaluate(output, expected, trace)` ¶

Check if output matches the regex pattern.

`from_config(config)` `classmethod` ¶

Create from configuration.

Config format

    {
        "pattern": "\d{3}-\d{4}",
        "flags": 2  # optional, e.g., re.IGNORECASE
    }
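Because `flags` is a plain integer, a config file has to carry the numeric value of the `re` flag rather than its name. The snippet below uses only the standard `re` module to show where the `2` comes from and how several flags combine into one integer. The pattern and input text are illustrative only.

```python
import re

# re.IGNORECASE is an integer flag whose value is 2.
ignorecase_value = int(re.IGNORECASE)

# Flags combine with bitwise OR; the result is still a single integer.
combined = int(re.IGNORECASE | re.MULTILINE)

# An integer read from a config file can be passed straight to re.compile().
pattern = re.compile(r"^call \d{3}-\d{4}$", flags=combined)
match = pattern.search("note\nCALL 555-1234")
```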

`prela.evals.assertions.structural.LengthAssertion` ¶

Bases: BaseAssertion

Assert that output length is within specified bounds.

Example

    assertion = LengthAssertion(min_length=10, max_length=100)
    result = assertion.evaluate(output="Hello, world!", expected=None, trace=None)
    assert result.passed

Functions¶

`__init__(min_length=None, max_length=None)` ¶

Initialize length assertion.

Parameters:

Name	Type	Description	Default
`min_length`	`int \| None`	Minimum acceptable length (inclusive)	`None`
`max_length`	`int \| None`	Maximum acceptable length (inclusive)	`None`

Raises:

Type	Description
`ValueError`	If both min_length and max_length are None

`evaluate(output, expected, trace)` ¶

Check if output length is within bounds.

`from_config(config)` `classmethod` ¶

Create from configuration.

Config format

    {
        "min_length": 10,  # optional
        "max_length": 100  # optional
    }

`prela.evals.assertions.structural.JSONValidAssertion` ¶

Bases: BaseAssertion

Assert that output is valid JSON, optionally matching a schema.

Example

    assertion = JSONValidAssertion()
    result = assertion.evaluate(output='{"key": "value"}', expected=None, trace=None)
    assert result.passed

Functions¶

`__init__(schema=None)` ¶

Initialize JSON validation assertion.

Parameters:

Name	Type	Description	Default
`schema`	`dict[str, Any] \| None`	Optional JSON schema to validate against (using jsonschema library)	`None`

`evaluate(output, expected, trace)` ¶

Check if output is valid JSON and optionally matches schema.

`from_config(config)` `classmethod` ¶

Create from configuration.

Config format

    {
        "schema": {  # optional
            "type": "object",
            "properties": {
                "name": {"type": "string"}
            }
        }
    }

Tool Assertions¶

`prela.evals.assertions.tool.ToolCalledAssertion` ¶

Bases: BaseAssertion

Assert that a specific tool was called during execution.

This assertion examines the trace to verify that a tool span with the specified name exists.

Example

    assertion = ToolCalledAssertion(tool_name="web_search")
    result = assertion.evaluate(output=None, expected=None, trace=spans)
    assert result.passed

Functions¶

`__init__(tool_name)` ¶

Initialize tool called assertion.

Parameters:

Name	Type	Description	Default
`tool_name`	`str`	Name of the tool that should have been called	required

`evaluate(output, expected, trace)` ¶

Check if the specified tool was called in the trace.

`from_config(config)` `classmethod` ¶

Create from configuration.

Config format

{ "tool_name": "web_search" }

`prela.evals.assertions.tool.ToolArgsAssertion` ¶

Bases: BaseAssertion

Assert that a tool was called with expected arguments.

This assertion verifies both that the tool was called and that it was called with specific argument values.

Example

    assertion = ToolArgsAssertion(
        tool_name="web_search",
        expected_args={"query": "Python tutorial"},
    )
    result = assertion.evaluate(output=None, expected=None, trace=spans)
    assert result.passed

Functions¶

`__init__(tool_name, expected_args, partial_match=True)` ¶

Initialize tool args assertion.

Parameters:

Name	Type	Description	Default
`tool_name`	`str`	Name of the tool to check	required
`expected_args`	`dict[str, Any]`	Expected argument key-value pairs	required
`partial_match`	`bool`	If True, only check that expected_args are present (allow additional args). If False, require exact match.	`True`

`evaluate(output, expected, trace)` ¶

Check if tool was called with expected arguments.

`from_config(config)` `classmethod` ¶

Create from configuration.

Config format

    {
        "tool_name": "web_search",
        "expected_args": {"query": "Python"},
        "partial_match": true  # optional, default: true
    }
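The difference between partial and exact matching can be shown on plain dicts. The helper `args_match` is hypothetical and assumes arguments are compared with `==`; how prela extracts the actual arguments from a span is not covered here.

```python
def args_match(actual: dict, expected: dict, partial_match: bool = True) -> bool:
    """Compare the arguments a tool received against the expected ones."""
    if partial_match:
        # Every expected key must be present with an equal value;
        # extra keys in actual are allowed.
        return all(key in actual and actual[key] == value
                   for key, value in expected.items())
    # Exact match: same keys and same values, nothing extra.
    return actual == expected


actual_args = {"query": "Python", "max_results": 5}
partial_ok = args_match(actual_args, {"query": "Python"})
exact_ok = args_match(actual_args, {"query": "Python"}, partial_match=False)
```

Partial matching is the more forgiving default: a test keeps passing when the agent starts sending an additional optional argument.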

`prela.evals.assertions.tool.ToolSequenceAssertion` ¶

Bases: BaseAssertion

Assert that tools were called in a specific order.

This assertion verifies that tools appear in the trace in the expected sequence, though other tools may appear between them.

Example

    assertion = ToolSequenceAssertion(
        sequence=["web_search", "calculator", "summarize"]
    )
    result = assertion.evaluate(output=None, expected=None, trace=spans)
    assert result.passed

Functions¶

`__init__(sequence, strict=False)` ¶

Initialize tool sequence assertion.

Parameters:

Name	Type	Description	Default
`sequence`	`list[str]`	Expected sequence of tool names	required
`strict`	`bool`	If True, no other tools can appear between expected ones. If False, other tools are allowed between expected sequence.	`False`

`evaluate(output, expected, trace)` ¶

Check if tools were called in the expected sequence.

`from_config(config)` `classmethod` ¶

Create from configuration.

Config format

    {
        "sequence": ["tool1", "tool2", "tool3"],
        "strict": false  # optional, default: false
    }
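One way to read the two modes: non-strict checks that the expected names form a subsequence of the called tools, while strict requires them to appear as a contiguous run. The helper `in_order` below sketches that interpretation on plain lists of tool names. It is hypothetical, and the library's exact strict semantics may differ (for example, whether tools before or after the run are allowed).

```python
def in_order(called: list, sequence: list, strict: bool = False) -> bool:
    """Check whether sequence occurs, in order, within the called tool names."""
    called, sequence = list(called), list(sequence)
    if strict:
        # Contiguous run: no other tool may sit between expected ones.
        width = len(sequence)
        return any(called[i:i + width] == sequence
                   for i in range(len(called) - width + 1))
    # Subsequence: each membership test consumes the iterator up to the match,
    # so later names must be found after earlier ones.
    remaining = iter(called)
    return all(name in remaining for name in sequence)


calls = ["web_search", "fetch_page", "calculator", "summarize"]
expected_sequence = ["web_search", "calculator", "summarize"]
loose = in_order(calls, expected_sequence)
tight = in_order(calls, expected_sequence, strict=True)
```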

Semantic Assertions¶

`prela.evals.assertions.semantic.SemanticSimilarityAssertion` ¶

Bases: BaseAssertion

Assert that output is semantically similar to expected text.

Uses sentence embeddings to compare semantic meaning rather than exact text matching. Useful for evaluating LLM outputs where phrasing varies but meaning should be consistent.

Example

    assertion = SemanticSimilarityAssertion(
        expected_text="The weather is nice today",
        threshold=0.8,
    )
    result = assertion.evaluate(
        output="Today has beautiful weather",
        expected=None,
        trace=None,
    )
    assert result.passed  # High similarity despite different wording

Requires

    pip install sentence-transformers

Functions¶

`__init__(expected_text, threshold=0.8, model_name='all-MiniLM-L6-v2')` ¶

Initialize semantic similarity assertion.

Parameters:

Name	Type	Description	Default
`expected_text`	`str`	Text to compare against	required
`threshold`	`float`	Minimum cosine similarity score (0-1) to pass	`0.8`
`model_name`	`str`	Sentence transformer model to use (default: all-MiniLM-L6-v2, fast and accurate)	`'all-MiniLM-L6-v2'`

Raises:

Type	Description
`ImportError`	If sentence-transformers is not installed
`ValueError`	If threshold is not between 0 and 1

`evaluate(output, expected, trace)` ¶

Check if output is semantically similar to expected text.

`from_config(config)` `classmethod` ¶

Create from configuration.

Config format

{ "expected_text": "The expected output", "threshold": 0.8, # optional, default: 0.8 "model_name": "all-MiniLM-L6-v2" # optional }
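The `threshold` is compared against the cosine similarity of two embedding vectors. The sketch below computes that score in pure Python on small made-up vectors, so it runs without `sentence-transformers`; real embeddings would come from the configured model. The helper names are hypothetical.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def passes_threshold(a: list, b: list, threshold: float = 0.8) -> bool:
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    return cosine_similarity(a, b) >= threshold


# Toy 2-D "embeddings", for illustration only.
close_score = cosine_similarity([3.0, 4.0], [4.0, 3.0])   # 24 / 25
far_score = cosine_similarity([1.0, 0.0], [0.0, 1.0])     # orthogonal vectors
```

Cosine similarity depends only on direction, not on vector length, so scaling an embedding leaves the score unchanged.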

`clear_cache()` `classmethod` ¶

Clear the embedding cache. Useful for testing or memory management.

`get_cache_size()` `classmethod` ¶

Get the number of cached embeddings.

Reporters¶

ConsoleReporter¶

`prela.evals.reporters.console.ConsoleReporter` ¶

Reporter that pretty-prints evaluation results to the console.

Uses the rich library for colored output if available, and falls back to plain-text formatting otherwise. Provides:

- Summary statistics (pass rate, duration)
- List of all test cases with pass/fail status
- Detailed failure information for failed cases
- Color coding (green=pass, red=fail, yellow=warning)

Example

    from prela.evals import EvalRunner
    from prela.evals.reporters import ConsoleReporter

    runner = EvalRunner(suite, agent)
    result = runner.run()

    reporter = ConsoleReporter(verbose=True, use_colors=True)
    reporter.report(result)
    # ✓ Geography QA Suite
    # ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    # Total: 10 | Passed: 9 (90.0%) | Failed: 1
    # Duration: 2.5s
    # ...

Functions¶

`__init__(verbose=True, use_colors=True)` ¶

Initialize the console reporter.

Parameters:

Name	Type	Description	Default
`verbose`	`bool`	If True, show detailed failure information. If False, only show summary statistics and failed case names.	`True`
`use_colors`	`bool`	If True and rich is available, use colored output. If False or rich unavailable, use plain text.	`True`

`report(result)` ¶

Print the evaluation results to the console.

Parameters:

Name	Type	Description	Default
`result`	`EvalRunResult`	The evaluation run result to report.	required

JSONReporter¶

`prela.evals.reporters.json.JSONReporter` ¶

Reporter that writes evaluation results to a JSON file.

Outputs a structured JSON file containing all evaluation data:

- Suite metadata (name, timestamps, duration)
- Summary statistics (total, passed, failed, pass rate)
- Individual case results with assertion details
- Full error messages and stack traces

The JSON format is designed for:

- Programmatic analysis of test results
- Integration with data processing pipelines
- Historical comparison of evaluation runs
- CI/CD artifact storage

Example

    from prela.evals import EvalRunner
    from prela.evals.reporters import JSONReporter

    runner = EvalRunner(suite, agent)
    result = runner.run()

    reporter = JSONReporter("results/eval_run_123.json")
    reporter.report(result)
    # Creates results/eval_run_123.json with full results

Functions¶

`__init__(output_path, indent=2)` ¶

Initialize the JSON reporter.

Parameters:

Name	Type	Description	Default
`output_path`	`str \| Path`	Path where the JSON file will be written. Parent directories will be created if they don't exist.	required
`indent`	`int`	Number of spaces for JSON indentation (default: 2). Set to None for compact output.	`2`

`report(result)` ¶

Write the evaluation results to a JSON file.

Creates parent directories if they don't exist. Overwrites any existing file at the output path.

Parameters:

Name	Type	Description	Default
`result`	`EvalRunResult`	The evaluation run result to write.	required

Raises:

Type	Description
`OSError`	If unable to write to the output path.

JUnitReporter¶

`prela.evals.reporters.junit.JUnitReporter` ¶

Reporter that generates JUnit XML format for CI/CD integration.

Creates a JUnit XML file that can be consumed by continuous integration systems for test result visualization, trend analysis, and failure reporting.

The XML format follows the JUnit schema with:

- `<testsuite>` root element with summary statistics
- `<testcase>` elements for each test case
- `<failure>` elements for failed assertions
- `<error>` elements for execution errors
- `<system-out>` for additional output/trace information

Supported CI/CD platforms:

- Jenkins (JUnit plugin)
- GitLab CI/CD (junit report artifacts)
- GitHub Actions (test reporters)
- Azure DevOps (publish test results)
- CircleCI (store_test_results)

Example

    from prela.evals import EvalRunner
    from prela.evals.reporters import JUnitReporter

    runner = EvalRunner(suite, agent)
    result = runner.run()

    reporter = JUnitReporter("test-results/junit.xml")
    reporter.report(result)
    # Creates JUnit XML at test-results/junit.xml

Functions¶

`__init__(output_path)` ¶

Initialize the JUnit XML reporter.

Parameters:

Name	Type	Description	Default
`output_path`	`str \| Path`	Path where the JUnit XML file will be written. Parent directories will be created if they don't exist.	required

`report(result)` ¶

Generate and write JUnit XML for the evaluation results.

Creates parent directories if they don't exist. Overwrites any existing file at the output path.

Parameters:

Name	Type	Description	Default
`result`	`EvalRunResult`	The evaluation run result to convert to JUnit XML.	required

Raises:

Type	Description
`OSError`	If unable to write to the output path.
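For readers unfamiliar with the JUnit layout, the sketch below builds a small report with the standard library's `xml.etree.ElementTree`. The function `build_junit_xml`, the dict-shaped results, and the choice of attributes are assumptions for illustration; the reporter's real output may contain more elements and attributes.

```python
import xml.etree.ElementTree as ET


def build_junit_xml(suite_name: str, results: list) -> str:
    """Render pass/fail results as a JUnit-style XML string."""
    failures = sum(1 for r in results if not r["passed"])
    suite = ET.Element(
        "testsuite",
        name=suite_name,
        tests=str(len(results)),
        failures=str(failures),
    )
    for r in results:
        case = ET.SubElement(
            suite, "testcase", name=r["name"], time=f"{r['seconds']:.3f}"
        )
        if not r["passed"]:
            # A failed case carries a <failure> child describing what went wrong.
            failure = ET.SubElement(case, "failure", message=r["message"])
            failure.text = r["message"]
    return ET.tostring(suite, encoding="unicode")


# Hypothetical results, for illustration only.
xml_text = build_junit_xml(
    "Geography QA Suite",
    [
        {"name": "capital_fr", "passed": True, "seconds": 0.25, "message": ""},
        {"name": "capital_de", "passed": False, "seconds": 0.5,
         "message": "missing 'Berlin'"},
    ],
)
```

CI systems typically read the `tests` and `failures` counts from the root element and list each `<testcase>` individually.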
