Evaluations API
Framework for testing and evaluating AI agent behavior.
Test Case Definition
EvalInput
prela.evals.case.EvalInput
dataclass
Input data for an eval case.
Represents what goes into the agent being tested. Can be a simple query, a list of messages, or custom context data.
Attributes:
| Name | Type | Description |
|---|---|---|
| query | str \| None | Simple string query/prompt (for basic use cases) |
| messages | list[dict] \| None | List of message dicts (for chat-based agents) |
| context | dict[str, Any] \| None | Additional context data (e.g., retrieved documents, metadata) |
Example
# Simple query
input1 = EvalInput(query="What is the capital of France?")
# Chat messages
input2 = EvalInput(messages=[
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hello!"}
])
# Query with context
input3 = EvalInput(
    query="Summarize the document",
    context={"document": "Long text here..."}
)
Functions
__post_init__()
Validate that at least one input type is provided.
to_agent_input()
Convert to format that agent expects.
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Dictionary with all non-None input fields. |
Example
input = EvalInput(query="Hello", context={"user_id": "123"})
input.to_agent_input()  # {'query': 'Hello', 'context': {'user_id': '123'}}
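The validation and conversion behavior described above can be sketched with a plain dataclass. This is a minimal stand-in and not prela's implementation: `SketchInput` is a hypothetical name, and raising `ValueError` for an empty input is an assumption, since the source only says that validation happens.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class SketchInput:
    """Hypothetical stand-in for EvalInput; not the library class."""

    query: str | None = None
    messages: list[dict] | None = None
    context: dict[str, Any] | None = None

    def __post_init__(self) -> None:
        # Assumption: an input with no query, messages, or context is rejected.
        if self.query is None and self.messages is None and self.context is None:
            raise ValueError("Provide at least one of query, messages, or context")

    def to_agent_input(self) -> dict[str, Any]:
        # Keep only the fields that were actually set.
        fields = {"query": self.query, "messages": self.messages, "context": self.context}
        return {name: value for name, value in fields.items() if value is not None}


sketch = SketchInput(query="Hello", context={"user_id": "123"})
agent_input = sketch.to_agent_input()
```

Here `agent_input` holds only the `query` and `context` keys, because `messages` was never set.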
from_dict(data)
classmethod
Create EvalInput from dictionary.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | dict[str, Any] | Dictionary with 'query', 'messages', and/or 'context' keys | required |
Returns:
| Type | Description |
|---|---|
| EvalInput | EvalInput instance |
Example
data = {"query": "Hello", "context": {"key": "value"}}
input = EvalInput.from_dict(data)
to_dict()
Convert to dictionary for serialization.
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Dictionary representation of the input. |
EvalExpected
prela.evals.case.EvalExpected
dataclass
Expected output for an eval case.
Defines what the agent's output should look like. Supports multiple validation strategies:
- Exact output match
- Contains/not_contains substring checks
- Tool call validation
- Custom metadata checks
Attributes:
| Name | Type | Description |
|---|---|---|
| output | str \| None | Exact expected output string |
| contains | list[str] \| None | List of substrings that must appear in output |
| not_contains | list[str] \| None | List of substrings that must NOT appear in output |
| tool_calls | list[dict[str, Any]] \| None | Expected tool calls (list of dicts with 'name', 'args', etc.) |
| metadata | dict[str, Any] \| None | Expected metadata fields (e.g., final_answer, confidence) |
Example
# Exact match
expected1 = EvalExpected(output="The answer is 42")
# Substring checks
expected2 = EvalExpected(
    contains=["Paris", "capital"],
    not_contains=["London", "Berlin"]
)
# Tool call validation
expected3 = EvalExpected(tool_calls=[
    {"name": "search", "args": {"query": "weather"}}
])
Functions
__post_init__()
Validate that at least one expectation is provided.
from_dict(data)
classmethod
Create EvalExpected from dictionary.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | dict[str, Any] | Dictionary with expected output specifications | required |
Returns:
| Type | Description |
|---|---|
| EvalExpected | EvalExpected instance |
Example
data = {"contains": ["Paris"], "not_contains": ["London"]}
expected = EvalExpected.from_dict(data)
to_dict()
Convert to dictionary for serialization.
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Dictionary representation of the expected output. |
EvalCase
prela.evals.case.EvalCase
dataclass
Complete evaluation test case.
Represents a single test case with input, expected output, and assertions. Eval cases are the building blocks of eval suites.
Attributes:
| Name | Type | Description |
|---|---|---|
| id | str | Unique identifier for this test case |
| name | str | Human-readable test case name |
| input | EvalInput | Input data for the agent |
| expected | EvalExpected \| None | Expected output (optional, can use assertions instead) |
| assertions | list[dict[str, Any]] \| None | List of assertion configurations (dicts with 'type', 'value', etc.) |
| tags | list[str] | Tags for filtering/grouping test cases |
| timeout_seconds | float | Maximum execution time for this test case |
| metadata | dict[str, Any] | Additional metadata for this test case |
Example
case = EvalCase(
    id="test_basic_qa",
    name="Basic factual question",
    input=EvalInput(query="What is the capital of France?"),
    expected=EvalExpected(contains=["Paris"]),
    assertions=[
        {"type": "contains", "value": "Paris"},
        {"type": "semantic_similarity", "threshold": 0.8}
    ],
    tags=["qa", "geography"],
    timeout_seconds=10.0
)
Functions
__init__(id, name, input, expected=None, assertions=None, tags=list(), timeout_seconds=30.0, metadata=dict())
to_dict()
Convert to dictionary for serialization.
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Dictionary representation of the test case. |
Example
case = EvalCase(
    id="test_1",
    name="Test",
    input=EvalInput(query="Hello"),
    expected=EvalExpected(contains=["Hi"])
)
data = case.to_dict()
data["id"]  # 'test_1'
from_dict(data)
classmethod
Create EvalCase from dictionary.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | dict[str, Any] | Dictionary with test case specification | required |
Returns:
| Type | Description |
|---|---|
| EvalCase | EvalCase instance |
Example
data = {
    "id": "test_1",
    "name": "Test case 1",
    "input": {"query": "Hello"},
    "expected": {"contains": ["Hi"]},
    "tags": ["greeting"]
}
case = EvalCase.from_dict(data)
Test Suite
EvalSuite
prela.evals.suite.EvalSuite
dataclass
Collection of eval cases with shared configuration.
An eval suite organizes multiple test cases with:
- Shared setup/teardown hooks
- Default assertions applied to all cases
- YAML serialization for easy configuration
- Tagging and filtering capabilities
Attributes:
| Name | Type | Description |
|---|---|---|
| name | str | Suite name (e.g., "RAG Quality Suite") |
| description | str | Human-readable description of what this suite tests |
| cases | list[EvalCase] | List of eval cases in this suite |
| default_assertions | list[dict[str, Any]] \| None | Assertions applied to all cases (unless overridden) |
| setup | Callable[[], None] \| None | Callable run before executing the suite (e.g., start services) |
| teardown | Callable[[], None] \| None | Callable run after executing the suite (e.g., cleanup) |
| metadata | dict[str, Any] | Additional metadata for the suite |
Example
suite = EvalSuite(
    name="RAG Quality Suite",
    description="Tests for RAG pipeline quality",
    cases=[
        EvalCase(
            id="test_basic_qa",
            name="Basic factual question",
            input=EvalInput(query="What is the capital of France?"),
            expected=EvalExpected(contains=["Paris"])
        )
    ],
    default_assertions=[
        {"type": "latency", "max_ms": 5000},
        {"type": "no_errors"}
    ]
)
Functions
__init__(name, description='', cases=list(), default_assertions=None, setup=None, teardown=None, metadata=dict())
add_case(case)
Add a test case to the suite.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| case | EvalCase | Eval case to add | required |
Example
suite = EvalSuite(name="My Suite")
case = EvalCase(
    id="test_1",
    name="Test",
    input=EvalInput(query="Hello"),
    expected=EvalExpected(contains=["Hi"])
)
suite.add_case(case)
filter_by_tags(tags)
Filter test cases by tags.
Returns cases that have ALL specified tags.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| tags | list[str] | List of tags to filter by | required |
Returns:
| Type | Description |
|---|---|
| list[EvalCase] | List of matching test cases |
Example
suite = EvalSuite(name="My Suite", cases=[...])
qa_cases = suite.filter_by_tags(["qa"])
geography_qa = suite.filter_by_tags(["qa", "geography"])
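The "ALL specified tags" rule amounts to a subset test. The sketch below illustrates it with a hypothetical `TaggedCase` stand-in and a free function `filter_by_all_tags`; neither name comes from prela, and the library's actual implementation may differ.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TaggedCase:
    """Hypothetical stand-in for EvalCase with only the fields needed here."""

    id: str
    tags: list[str] = field(default_factory=list)


def filter_by_all_tags(cases: list[TaggedCase], tags: list[str]) -> list[TaggedCase]:
    # A case matches only if every requested tag is present on it.
    wanted = set(tags)
    return [case for case in cases if wanted.issubset(case.tags)]


cases = [
    TaggedCase("capital_fr", tags=["qa", "geography"]),
    TaggedCase("arithmetic", tags=["qa", "math"]),
    TaggedCase("untagged"),
]
qa_ids = [c.id for c in filter_by_all_tags(cases, ["qa"])]
geo_qa_ids = [c.id for c in filter_by_all_tags(cases, ["qa", "geography"])]
```

In this sketch an empty tag list matches every case, because the empty set is a subset of any tag set. Whether prela behaves the same way is not stated in the source.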
to_yaml(path)
Save eval suite to YAML file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| Path | Path to save YAML file | required |
Raises:
| Type | Description |
|---|---|
| ImportError | If PyYAML is not installed |
Example
suite = EvalSuite(name="My Suite", cases=[...])
suite.to_yaml("suite.yaml")
from_yaml(path)
classmethod
Load eval suite from YAML file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| Path | Path to YAML file | required |
Returns:
| Type | Description |
|---|---|
| EvalSuite | EvalSuite instance |
Raises:
| Type | Description |
|---|---|
| ImportError | If PyYAML is not installed |
| FileNotFoundError | If file doesn't exist |
| YAMLError | If YAML parsing fails |
Example
suite = EvalSuite.from_yaml("tests/suite.yaml")
Test Execution
EvalRunner
prela.evals.runner.EvalRunner
Runner for executing evaluation suites against AI agents.
The runner executes test cases, runs assertions, captures traces, and aggregates results. Supports parallel execution with thread pools.
Example
from prela.evals import EvalSuite, EvalRunner
from prela import get_tracer

suite = EvalSuite.from_yaml("tests.yaml")
tracer = get_tracer()

def my_agent(input_data):
    # Your agent logic here
    return "agent output"

runner = EvalRunner(suite, my_agent, tracer=tracer)
result = runner.run()
print(result.summary())
Functions
__init__(suite, agent, tracer=None, parallel=False, max_workers=4, on_case_complete=None)
Initialize the evaluation runner.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| suite | EvalSuite | The evaluation suite to run. | required |
| agent | Callable[[EvalInput], Any] | Callable that takes an EvalInput and returns agent output. | required |
| tracer | Tracer \| None | Optional tracer for capturing execution traces. | None |
| parallel | bool | Whether to run cases in parallel using a thread pool. | False |
| max_workers | int | Maximum number of worker threads if parallel=True. | 4 |
| on_case_complete | Callable[[CaseResult], None] \| None | Optional callback invoked after each case completes. | None |
run()
Run all test cases in the evaluation suite.
Executes setup/teardown hooks, runs all cases (sequentially or in parallel), executes assertions, and aggregates results.
Returns:
| Type | Description |
|---|---|
| EvalRunResult | EvalRunResult with aggregated statistics and individual case results. |
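The control flow described for `run()` (setup hook, sequential or thread-pool execution, teardown hook) can be sketched with the standard library. `run_cases` is a hypothetical helper, not part of prela. Running teardown in a `finally` block is an assumption, and the sketch omits assertions and result aggregation.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable


def run_cases(
    inputs: list[Any],
    agent: Callable[[Any], Any],
    parallel: bool = False,
    max_workers: int = 4,
    setup: Callable[[], None] | None = None,
    teardown: Callable[[], None] | None = None,
) -> list[Any]:
    """Hypothetical runner loop; returns agent outputs in input order."""
    if setup is not None:
        setup()
    try:
        if parallel:
            with ThreadPoolExecutor(max_workers=max_workers) as pool:
                # Executor.map yields results in the order of the inputs.
                return list(pool.map(agent, inputs))
        return [agent(item) for item in inputs]
    finally:
        # Assumption: teardown runs even if a case raises.
        if teardown is not None:
            teardown()


events: list[str] = []
outputs = run_cases(
    ["a", "b", "c"],
    agent=str.upper,
    parallel=True,
    setup=lambda: events.append("setup"),
    teardown=lambda: events.append("teardown"),
)
```

Because threads share memory, an agent passed to a parallel runner of this shape needs to be safe to call concurrently.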
run_case(case)
Run a single test case.
Executes the agent with the case input, runs all assertions, captures the trace ID if a tracer is configured, and returns aggregated results.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| case | EvalCase | The test case to run. | required |
Returns:
| Type | Description |
|---|---|
| CaseResult | CaseResult with pass/fail status and assertion results. |
CaseResult
prela.evals.runner.CaseResult
dataclass
EvalRunResult
prela.evals.runner.EvalRunResult
dataclass
create_assertion
prela.evals.runner.create_assertion(config)
Factory function to create assertion instances from configuration.
This maps assertion type strings to concrete assertion classes and instantiates them with the provided configuration.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | dict | Dictionary with "type" key and type-specific parameters. | required |
Returns:
| Type | Description |
|---|---|
| BaseAssertion | Instantiated assertion object. |
Raises:
| Type | Description |
|---|---|
| ValueError | If assertion type is unknown or configuration is invalid. |
Example
assertion = create_assertion({
    "type": "contains",
    "text": "hello",
    "case_sensitive": False
})
result = assertion.evaluate("Hello world", None, None)
assert result.passed
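A factory of this kind is often built around a registry that maps type strings to classes. The sketch below shows that pattern under stated assumptions: `ContainsSketch`, `REGISTRY`, and `create_assertion_sketch` are hypothetical names, and prela's internal structure is not described in the source.

```python
from __future__ import annotations

from typing import Any


class ContainsSketch:
    """Hypothetical assertion class mirroring the 'contains' config keys."""

    def __init__(self, text: str, case_sensitive: bool = True) -> None:
        self.text = text
        self.case_sensitive = case_sensitive

    @classmethod
    def from_config(cls, config: dict[str, Any]) -> ContainsSketch:
        if "text" not in config:
            raise ValueError("'contains' requires a 'text' key")
        return cls(config["text"], config.get("case_sensitive", True))

    def passes(self, output: str) -> bool:
        if self.case_sensitive:
            return self.text in output
        return self.text.lower() in output.lower()


# Assumed structure: a registry from type string to assertion class.
REGISTRY = {"contains": ContainsSketch}


def create_assertion_sketch(config: dict[str, Any]) -> ContainsSketch:
    kind = config.get("type")
    if kind not in REGISTRY:
        raise ValueError(f"Unknown assertion type: {kind!r}")
    # Everything except "type" is handed to the class-specific constructor.
    params = {key: value for key, value in config.items() if key != "type"}
    return REGISTRY[kind].from_config(params)


check = create_assertion_sketch(
    {"type": "contains", "text": "hello", "case_sensitive": False}
)
```

With this layout, supporting a new assertion type means adding one entry to the registry; the factory itself does not change.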
Assertions
Base Assertion
prela.evals.assertions.base.BaseAssertion
Bases: ABC
Base class for all assertions.
Assertions evaluate agent outputs and traces to determine if they meet expected criteria. Subclasses should implement the evaluate() method to perform the actual check.
Functions
evaluate(output, expected, trace)
abstractmethod
Evaluate the assertion against the output and trace.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| output | Any | The actual output from the agent/function under test | required |
| expected | Any \| None | The expected output (format depends on assertion type) | required |
| trace | list[Span] \| None | Optional list of spans from the traced execution | required |
Returns:
| Type | Description |
|---|---|
| AssertionResult | AssertionResult with pass/fail status and details |
from_config(config)
abstractmethod
classmethod
Create an assertion instance from configuration dict.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | dict[str, Any] | Configuration dictionary with assertion-specific parameters | required |
Returns:
| Type | Description |
|---|---|
| BaseAssertion | Configured assertion instance |
Raises:
| Type | Description |
|---|---|
| ValueError | If configuration is invalid |
prela.evals.assertions.base.AssertionResult
dataclass
Result of an assertion evaluation.
Attributes:
| Name | Type | Description |
|---|---|---|
| passed | bool | Whether the assertion passed |
| assertion_type | str | Type of assertion (e.g., "contains", "semantic_similarity") |
| message | str | Human-readable message describing the result |
| score | float \| None | Optional score between 0 and 1 for partial-credit assertions |
| expected | Any | Expected value (if applicable) |
| actual | Any | Actual value that was evaluated |
| details | dict[str, Any] | Additional details about the evaluation |
Structural Assertions
prela.evals.assertions.structural.ContainsAssertion
Bases: BaseAssertion
Assert that output contains specified text.
Example
assertion = ContainsAssertion(text="error", case_sensitive=False)
result = assertion.evaluate(output="Error occurred", expected=None, trace=None)
assert result.passed
Functions
__init__(text, case_sensitive=True)
Initialize contains assertion.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | Text that must be present in output | required |
| case_sensitive | bool | Whether to perform case-sensitive matching | True |
evaluate(output, expected, trace)
Check if output contains the specified text.
from_config(config)
classmethod
Create from configuration.
Config format
{
    "text": "required text",
    "case_sensitive": true  # optional, default: true
}
prela.evals.assertions.structural.NotContainsAssertion
Bases: BaseAssertion
Assert that output does NOT contain specified text.
Example
assertion = NotContainsAssertion(text="error")
result = assertion.evaluate(output="Success!", expected=None, trace=None)
assert result.passed
Functions
__init__(text, case_sensitive=True)
Initialize not-contains assertion.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | Text that must NOT be present in output | required |
| case_sensitive | bool | Whether to perform case-sensitive matching | True |
evaluate(output, expected, trace)
Check if output does not contain the specified text.
from_config(config)
classmethod
Create from configuration.
Config format
{
    "text": "forbidden text",
    "case_sensitive": true  # optional, default: true
}
prela.evals.assertions.structural.RegexAssertion
Bases: BaseAssertion
Assert that output matches a regular expression pattern.
Example
assertion = RegexAssertion(pattern=r"\d{3}-\d{4}")
result = assertion.evaluate(output="Call 555-1234", expected=None, trace=None)
assert result.passed
Functions
__init__(pattern, flags=0)
Initialize regex assertion.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| pattern | str | Regular expression pattern to match | required |
| flags | int | Optional regex flags (e.g., re.IGNORECASE) | 0 |
evaluate(output, expected, trace)
Check if output matches the regex pattern.
from_config(config)
classmethod
Create from configuration.
Config format
{
    "pattern": "\\d{3}-\\d{4}",
    "flags": 2  # optional, e.g., re.IGNORECASE
}
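The class example passes on "Call 555-1234", where the digits do not start the string, which suggests the pattern may match anywhere in the output. The sketch below assumes `re.search` semantics on that basis. `regex_passes` is a hypothetical helper, not the library's code.

```python
import re


def regex_passes(output: str, pattern: str, flags: int = 0) -> bool:
    # Assumption: search semantics (match anywhere in the output).
    return re.search(pattern, output, flags) is not None


phone_found = regex_passes("Call 555-1234", r"\d{3}-\d{4}")
case_insensitive = regex_passes("ERROR: disk full", r"error", flags=re.IGNORECASE)

# For contrast: re.match anchors at the start, so the same input would fail.
anchored = re.match(r"\d{3}-\d{4}", "Call 555-1234") is not None

# The integer 2 in the config format corresponds to re.IGNORECASE.
ignorecase_value = int(re.IGNORECASE)
```

Flags combine with bitwise OR, so a config value of `re.IGNORECASE | re.MULTILINE` is also a plain integer.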
prela.evals.assertions.structural.LengthAssertion
Bases: BaseAssertion
Assert that output length is within specified bounds.
Example
assertion = LengthAssertion(min_length=10, max_length=100)
result = assertion.evaluate(output="Hello, world!", expected=None, trace=None)
assert result.passed
Functions
__init__(min_length=None, max_length=None)
Initialize length assertion.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| min_length | int \| None | Minimum acceptable length (inclusive) | None |
| max_length | int \| None | Maximum acceptable length (inclusive) | None |
Raises:
| Type | Description |
|---|---|
| ValueError | If both min_length and max_length are None |
evaluate(output, expected, trace)
Check if output length is within bounds.
from_config(config)
classmethod
Create from configuration.
Config format
{
    "min_length": 10,  # optional
    "max_length": 100  # optional
}
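The inclusive bounds and the "at least one bound" rule can be illustrated in a few lines. `length_within` is a hypothetical function written for this sketch, and measuring length with `len()` on a string is an assumption about how the library counts.

```python
from __future__ import annotations


def length_within(
    output: str,
    min_length: int | None = None,
    max_length: int | None = None,
) -> bool:
    if min_length is None and max_length is None:
        raise ValueError("Provide min_length, max_length, or both")
    size = len(output)
    # Both bounds are inclusive; a missing bound is simply not checked.
    if min_length is not None and size < min_length:
        return False
    if max_length is not None and size > max_length:
        return False
    return True


in_range = length_within("Hello, world!", min_length=10, max_length=100)
at_boundary = length_within("abc", min_length=3, max_length=3)
too_long = length_within("abcd", max_length=3)
```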
prela.evals.assertions.structural.JSONValidAssertion
Bases: BaseAssertion
Assert that output is valid JSON, optionally matching a schema.
Example
assertion = JSONValidAssertion()
result = assertion.evaluate(output='{"key": "value"}', expected=None, trace=None)
assert result.passed
Functions
__init__(schema=None)
Initialize JSON validation assertion.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| schema | dict[str, Any] \| None | Optional JSON schema to validate against (using jsonschema library) | None |
evaluate(output, expected, trace)
Check if output is valid JSON and optionally matches schema.
from_config(config)
classmethod
Create from configuration.
Config format
{
    "schema": {  # optional
        "type": "object",
        "properties": {
            "name": {"type": "string"}
        }
    }
}
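The syntax half of this check can be sketched with the standard library alone. `is_valid_json` is a hypothetical helper. It deliberately leaves out schema validation, which the source says relies on the third-party jsonschema library.

```python
import json


def is_valid_json(output: object) -> bool:
    # Syntax check only; schema validation is out of scope for this sketch.
    try:
        json.loads(output)  # type: ignore[arg-type]
    except (json.JSONDecodeError, TypeError):
        # TypeError covers non-string outputs such as None.
        return False
    return True


valid = is_valid_json('{"key": "value"}')
unquoted_keys = is_valid_json("{key: value}")
not_a_string = is_valid_json(None)
```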
Tool Assertions
prela.evals.assertions.tool.ToolCalledAssertion
Bases: BaseAssertion
Assert that a specific tool was called during execution.
This assertion examines the trace to verify that a tool span with the specified name exists.
Example
assertion = ToolCalledAssertion(tool_name="web_search")
result = assertion.evaluate(output=None, expected=None, trace=spans)
assert result.passed
Functions
__init__(tool_name)
Initialize tool called assertion.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| tool_name | str | Name of the tool that should have been called | required |
evaluate(output, expected, trace)
Check if the specified tool was called in the trace.
from_config(config)
classmethod
Create from configuration.
Config format
{
    "tool_name": "web_search"
}
prela.evals.assertions.tool.ToolArgsAssertion
Bases: BaseAssertion
Assert that a tool was called with expected arguments.
This assertion verifies both that the tool was called and that it was called with specific argument values.
Example
assertion = ToolArgsAssertion(
    tool_name="web_search",
    expected_args={"query": "Python tutorial"}
)
result = assertion.evaluate(output=None, expected=None, trace=spans)
assert result.passed
Functions
__init__(tool_name, expected_args, partial_match=True)
Initialize tool args assertion.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| tool_name | str | Name of the tool to check | required |
| expected_args | dict[str, Any] | Expected argument key-value pairs | required |
| partial_match | bool | If True, only check that expected_args are present (allow additional args). If False, require exact match. | True |
evaluate(output, expected, trace)
Check if tool was called with expected arguments.
from_config(config)
classmethod
Create from configuration.
Config format
{
    "tool_name": "web_search",
    "expected_args": {"query": "Python"},
    "partial_match": true  # optional, default: true
}
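The difference between partial and exact matching comes down to whether extra arguments are tolerated. The sketch below compares plain dicts. `args_match` is a hypothetical helper, and how prela extracts arguments from trace spans is not covered by the source.

```python
from __future__ import annotations

from typing import Any


def args_match(
    actual: dict[str, Any],
    expected: dict[str, Any],
    partial_match: bool = True,
) -> bool:
    if partial_match:
        # Every expected key must be present with an equal value;
        # additional keys in the actual call are allowed.
        return all(
            key in actual and actual[key] == value
            for key, value in expected.items()
        )
    # Exact mode: the two argument dicts must be identical.
    return actual == expected


call_args = {"query": "Python", "max_results": 5}
partial_ok = args_match(call_args, {"query": "Python"})
exact_ok = args_match(call_args, {"query": "Python"}, partial_match=False)
wrong_value = args_match(call_args, {"query": "Rust"})
```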
prela.evals.assertions.tool.ToolSequenceAssertion
Bases: BaseAssertion
Assert that tools were called in a specific order.
This assertion verifies that tools appear in the trace in the expected sequence, though other tools may appear between them.
Example
assertion = ToolSequenceAssertion(
    sequence=["web_search", "calculator", "summarize"]
)
result = assertion.evaluate(output=None, expected=None, trace=spans)
assert result.passed
Functions
__init__(sequence, strict=False)
Initialize tool sequence assertion.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sequence | list[str] | Expected sequence of tool names | required |
| strict | bool | If True, no other tools can appear between expected ones. If False, other tools are allowed between expected sequence. | False |
evaluate(output, expected, trace)
Check if tools were called in the expected sequence.
from_config(config)
classmethod
Create from configuration.
Config format
{
    "sequence": ["tool1", "tool2", "tool3"],
    "strict": false  # optional, default: false
}
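Non-strict ordering is a subsequence test, and strict ordering can be read as a contiguous-run test. The sketch below works on a plain list of tool names. `sequence_matches` is a hypothetical helper. Allowing extra tools before or after the strict run is an assumption, since the source only rules out tools between the expected ones.

```python
from __future__ import annotations


def sequence_matches(called: list[str], expected: list[str], strict: bool = False) -> bool:
    if strict:
        # Assumption: strict means `expected` appears as one contiguous run.
        width = len(expected)
        return any(
            called[start:start + width] == expected
            for start in range(len(called) - width + 1)
        )
    # Non-strict: `expected` must be a subsequence of `called`.
    # `name in remaining` advances the iterator past each match,
    # so later names are only searched for after earlier ones.
    remaining = iter(called)
    return all(name in remaining for name in expected)


expected_tools = ["web_search", "calculator", "summarize"]
trace_tools = ["web_search", "fetch_page", "calculator", "summarize"]

loose_ok = sequence_matches(trace_tools, expected_tools)
strict_ok = sequence_matches(trace_tools, expected_tools, strict=True)
wrong_order = sequence_matches(["calculator", "web_search", "summarize"], expected_tools)
```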
Semantic Assertions
prela.evals.assertions.semantic.SemanticSimilarityAssertion
Bases: BaseAssertion
Assert that output is semantically similar to expected text.
Uses sentence embeddings to compare semantic meaning rather than exact text matching. Useful for evaluating LLM outputs where phrasing varies but meaning should be consistent.
Example
assertion = SemanticSimilarityAssertion(
    expected_text="The weather is nice today",
    threshold=0.8
)
result = assertion.evaluate(
    output="Today has beautiful weather",
    expected=None,
    trace=None
)
assert result.passed  # High similarity despite different wording
Requires
pip install sentence-transformers
Functions
__init__(expected_text, threshold=0.8, model_name='all-MiniLM-L6-v2')
Initialize semantic similarity assertion.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| expected_text | str | Text to compare against | required |
| threshold | float | Minimum cosine similarity score (0-1) to pass | 0.8 |
| model_name | str | Sentence transformer model to use (default: all-MiniLM-L6-v2, fast and accurate) | 'all-MiniLM-L6-v2' |
Raises:
| Type | Description |
|---|---|
| ImportError | If sentence-transformers is not installed |
| ValueError | If threshold is not between 0 and 1 |
evaluate(output, expected, trace)
Check if output is semantically similar to expected text.
from_config(config)
classmethod
Create from configuration.
Config format
{
    "expected_text": "The expected output",
    "threshold": 0.8,  # optional, default: 0.8
    "model_name": "all-MiniLM-L6-v2"  # optional
}
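The pass/fail decision rests on cosine similarity between two embedding vectors. The sketch below shows only that arithmetic on toy vectors and does not load a sentence-transformers model. `cosine_similarity` and `passes_threshold` are hypothetical helpers, and treating a score equal to the threshold as a pass is an assumption drawn from the word "minimum".

```python
from __future__ import annotations

import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


def passes_threshold(a: list[float], b: list[float], threshold: float = 0.8) -> bool:
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    # Assumption: a score exactly at the threshold passes.
    return cosine_similarity(a, b) >= threshold


# Toy 2-D vectors standing in for real sentence embeddings.
close_score = cosine_similarity([3.0, 4.0], [4.0, 3.0])   # 24 / 25
orthogonal_score = cosine_similarity([1.0, 0.0], [0.0, 1.0])
close_passes = passes_threshold([3.0, 4.0], [4.0, 3.0], threshold=0.8)
```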
clear_cache()
classmethod
Clear the embedding cache. Useful for testing or memory management.
get_cache_size()
classmethod
Get the number of cached embeddings.
Reporters
ConsoleReporter
prela.evals.reporters.console.ConsoleReporter
Reporter that pretty-prints evaluation results to the console.
Uses rich library for colored output if available, falls back to plain text formatting otherwise. Provides:
- Summary statistics (pass rate, duration)
- List of all test cases with pass/fail status
- Detailed failure information for failed cases
- Color coding (green=pass, red=fail, yellow=warning)
Example
from prela.evals import EvalRunner
from prela.evals.reporters import ConsoleReporter

runner = EvalRunner(suite, agent)
result = runner.run()

reporter = ConsoleReporter(verbose=True, use_colors=True)
reporter.report(result)
✓ Geography QA Suite
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total: 10 | Passed: 9 (90.0%) | Failed: 1
Duration: 2.5s
...
Functions
__init__(verbose=True, use_colors=True)
Initialize the console reporter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| verbose | bool | If True, show detailed failure information. If False, only show summary statistics and failed case names. | True |
| use_colors | bool | If True and rich is available, use colored output. If False or rich unavailable, use plain text. | True |
report(result)
Print the evaluation results to the console.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| result | EvalRunResult | The evaluation run result to report. | required |
JSONReporter
prela.evals.reporters.json.JSONReporter
Reporter that writes evaluation results to a JSON file.
Outputs a structured JSON file containing all evaluation data:
- Suite metadata (name, timestamps, duration)
- Summary statistics (total, passed, failed, pass rate)
- Individual case results with assertion details
- Full error messages and stack traces
The JSON format is designed for:
- Programmatic analysis of test results
- Integration with data processing pipelines
- Historical comparison of evaluation runs
- CI/CD artifact storage
Example
from prela.evals import EvalRunner
from prela.evals.reporters import JSONReporter

runner = EvalRunner(suite, agent)
result = runner.run()

reporter = JSONReporter("results/eval_run_123.json")
reporter.report(result)
# Creates results/eval_run_123.json with full results
Functions
__init__(output_path, indent=2)
Initialize the JSON reporter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| output_path | str \| Path | Path where the JSON file will be written. Parent directories will be created if they don't exist. | required |
| indent | int | Number of spaces for JSON indentation (default: 2). Set to None for compact output. | 2 |
report(result)
Write the evaluation results to a JSON file.
Creates parent directories if they don't exist. Overwrites any existing file at the output path.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| result | EvalRunResult | The evaluation run result to write. | required |
Raises:
| Type | Description |
|---|---|
| OSError | If unable to write to the output path. |
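The file-handling behavior described for `report()` (create parent directories, overwrite an existing file, optional compact output) can be sketched with `pathlib` and `json`. `write_json_report` is a hypothetical helper, and the dictionary written here is made-up sample data rather than prela's real report schema.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path
from typing import Any


def write_json_report(
    data: dict[str, Any],
    output_path: str | Path,
    indent: int | None = 2,
) -> Path:
    path = Path(output_path)
    # Create missing parent directories; do nothing if they already exist.
    path.parent.mkdir(parents=True, exist_ok=True)
    # write_text replaces any existing content at this path.
    path.write_text(json.dumps(data, indent=indent), encoding="utf-8")
    return path


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "results" / "eval_run_123.json"
    write_json_report({"total": 10, "passed": 9}, target)
    # A second call overwrites the first; indent=None gives compact output.
    write_json_report({"total": 10, "passed": 10}, target, indent=None)
    written = target.read_text(encoding="utf-8")
```

Failures from `mkdir` or `write_text`, such as a permission error, surface as `OSError` subclasses, which matches the Raises table above.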
JUnitReporter
prela.evals.reporters.junit.JUnitReporter
Reporter that generates JUnit XML format for CI/CD integration.
Creates a JUnit XML file that can be consumed by continuous integration systems for test result visualization, trend analysis, and failure reporting.
The XML format follows the JUnit schema with:
-
Supported CI/CD platforms:
- Jenkins (JUnit plugin)
- GitLab CI/CD (junit report artifacts)
- GitHub Actions (test reporters)
- Azure DevOps (publish test results)
- CircleCI (store_test_results)
Example
from prela.evals import EvalRunner
from prela.evals.reporters import JUnitReporter

runner = EvalRunner(suite, agent)
result = runner.run()

reporter = JUnitReporter("test-results/junit.xml")
reporter.report(result)
# Creates JUnit XML at test-results/junit.xml
Functions
__init__(output_path)
Initialize the JUnit XML reporter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| output_path | str \| Path | Path where the JUnit XML file will be written. Parent directories will be created if they don't exist. | required |
report(result)
Generate and write JUnit XML for the evaluation results.
Creates parent directories if they don't exist. Overwrites any existing file at the output path.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| result | EvalRunResult | The evaluation run result to convert to JUnit XML. | required |
Raises:
| Type | Description |
|---|---|
| OSError | If unable to write to the output path. |
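JUnit-style XML conventionally wraps `testcase` elements in a `testsuite` element and marks failures with a nested `failure` element. The sketch below builds that shape with the standard library. The source does not state which elements and attributes prela emits, so `build_junit_xml` and its `(name, passed, message)` tuple input are hypothetical.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET


def build_junit_xml(suite_name: str, cases: list[tuple[str, bool, str]]) -> str:
    """Build a minimal JUnit-style document from (name, passed, message) tuples."""
    failures = sum(1 for _, passed, _ in cases if not passed)
    suite = ET.Element(
        "testsuite",
        name=suite_name,
        tests=str(len(cases)),
        failures=str(failures),
    )
    for name, passed, message in cases:
        case = ET.SubElement(suite, "testcase", name=name)
        if not passed:
            # A failed case carries a nested <failure> element.
            ET.SubElement(case, "failure", message=message)
    return ET.tostring(suite, encoding="unicode")


xml_text = build_junit_xml(
    "Geography QA Suite",
    [
        ("capital_fr", True, ""),
        ("capital_de", False, "output did not contain 'Berlin'"),
    ],
)
```

CI systems generally read the `tests` and `failures` counts from the `testsuite` attributes, so the sketch keeps them consistent with the nested cases.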