Deterministic Replay¶

Re-execute captured traces with parameter modifications and compare results.

What is Replay?¶

Replay enables you to re-execute previously captured traces in two modes:

Exact Replay: Deterministic re-execution using cached data (no API calls, identical results)
Modified Replay: Re-execution with parameter changes (makes real API calls for modified spans)

This is powerful for:

A/B Testing: Compare different models (GPT-4 vs Claude)
Parameter Tuning: Test temperature, max_tokens, system prompts
Regression Testing: Ensure new versions produce similar outputs
Cost Optimization: Experiment with cheaper models
Debugging: Reproduce issues from production traces

How Replay Works¶

Replay Capture (Automatic)¶

When you enable replay capture, Prela automatically records additional data:

import prela

# Enable replay capture
tracer = prela.init(
    service_name="my-agent",
    exporter="file",
    file_path="traces.jsonl",
    capture_for_replay=True  # ← Enable replay
)

Captured data includes:

LLM Requests: Model, temperature, max_tokens, system_prompt, messages
LLM Responses: Full response text, token usage, finish_reason
Tool Calls: Function names, arguments, results
Retrieval Operations: Queries, documents, similarity scores
Agent State: Memory, context, configuration

Replay Execution¶

Load and replay a trace:

from prela.replay import ReplayEngine
from prela.replay.loader import TraceLoader

# Load trace from file
trace = TraceLoader.from_file("traces.jsonl")

# Create replay engine
engine = ReplayEngine(trace)

# Exact replay (no API calls)
result = engine.replay_exact()

print(f"Duration: {result.total_duration_ms}ms")
print(f"Tokens: {result.total_tokens}")
print(f"Cost: ${result.total_cost_usd:.4f}")

Replay Modes¶

1. Exact Replay¶

Re-execute using captured data without API calls:

# Deterministic, fast, free
result = engine.replay_exact()

Characteristics:

✅ Deterministic: Always produces identical results
✅ Fast: ~1ms per span (no network calls)
✅ Free: No API costs
✅ Offline: Works without API access

Use Cases:

Verify trace completeness
Measure baseline performance
Test comparison engine
Debugging without costs

2. Modified Replay¶

Re-execute with parameter changes (makes real API calls):

# Change model and temperature
result = engine.replay_with_modifications(
    model="gpt-4o",
    temperature=0.7
)

Available Modifications:

Parameter	Description	Example
`model`	Change LLM model	`"gpt-4o"`, `"claude-sonnet-4"`
`temperature`	Adjust randomness	`0.0` (deterministic) to `1.0` (creative)
`system_prompt`	Override system instructions	`"You are a helpful assistant"`
`max_tokens`	Change output length limit	`512`, `1024`, `4096`
`mock_tool_responses`	Override tool outputs	`{"search": {"results": [...]}}`
`mock_retrieval_results`	Override retrieval results	`{"query": {"documents": [...]}}`

Selective Re-execution:

Only modified spans make real API calls. Unmodified spans use cached data:

# Only LLM spans with gpt-4 → gpt-4o will call API
result = engine.replay_with_modifications(model="gpt-4o")

# If original trace used 3 LLM calls:
# - 3 API calls are made (one per modified span)
# - Tool calls use cached data
# - Retrieval uses cached data

Comparing Replays¶

Compare two replay results to see differences:

from prela.replay import compare_replays

# Exact replay (baseline)
original = engine.replay_exact()

# Modified replay (experiment)
modified = engine.replay_with_modifications(
    model="gpt-4o",
    temperature=0.5
)

# Compare
comparison = compare_replays(original, modified)

# Print summary
print(comparison.generate_summary())

Comparison Output¶

Replay Comparison Summary
========================

Total Spans: 5
Spans with Differences: 3

Changes:
- Output differences: 3 spans
- Token changes: 2 spans (+150 tokens)
- Cost changes: 2 spans (+$0.0045)
- Duration changes: 3 spans (+234ms)

Span-by-Span Differences:

1. anthropic.messages.create
   ✗ Output changed (semantic similarity: 85.3%)
   ✗ Tokens: 450 → 600 (+150)
   ✗ Cost: $0.0090 → $0.0135 (+$0.0045)
   ✗ Duration: 823ms → 1057ms (+234ms)

2. langchain.tool.search
   ✓ No differences (cached data used)

3. anthropic.messages.create
   ✗ Output changed (semantic similarity: 92.1%)
   ✓ Tokens unchanged (cached)
   ✗ Duration: 756ms → 891ms (+135ms)

Difference Types¶

Difference	Description	When It Appears
Output	Response text changed	Modified LLM spans
Input	Request changed	Modified prompts
Tokens	Token usage changed	Model change, output length change
Cost	API cost changed	Model change, token change
Duration	Execution time changed	Real API calls vs cached
Status	Success → Error (or vice versa)	API failures, timeouts
Semantic Similarity	Cosine similarity of embeddings	Requires sentence-transformers

Semantic Similarity¶

Compare text outputs semantically (requires optional dependency):

pip install sentence-transformers

The comparison engine uses embeddings to measure similarity:

comparison = compare_replays(original, modified)

for diff in comparison.differences:
    if diff.semantic_similarity:
        if diff.semantic_similarity > 0.9:
            print(f"{diff.span_name}: Highly similar ({diff.semantic_similarity:.1%})")
        elif diff.semantic_similarity > 0.7:
            print(f"{diff.span_name}: Moderately similar ({diff.semantic_similarity:.1%})")
        else:
            print(f"{diff.span_name}: Low similarity ({diff.semantic_similarity:.1%})")

Interpretation:

> 90%: Nearly identical meaning (paraphrases)
70-90%: Similar concepts, different wording
50-70%: Related but divergent responses
< 50%: Significantly different outputs

Automatic Retry Logic¶

Prela automatically retries failed API calls with exponential backoff.

How It Works¶

When an API call fails with a transient error (rate limit, timeout, connection issue), Prela:

Detects if error is retryable
Waits with exponential backoff
Retries up to configured maximum
Tracks retry count per span

Retryable Errors:

HTTP 429 (Rate Limit)
HTTP 503 (Service Unavailable)
HTTP 502 (Bad Gateway)
Connection timeouts
Network errors

Non-Retryable Errors:

Authentication failures (401, 403)
Invalid requests (400)
Not found (404)

Configuration¶

from prela.replay import ReplayEngine

# Default: 3 retries, 1s initial delay, 60s max
engine = ReplayEngine(trace)

# Custom: More aggressive retry for flaky networks
engine = ReplayEngine(
    trace,
    max_retries=5,
    retry_initial_delay=2.0,
    retry_max_delay=120.0,
    retry_exponential_base=2.0,
)

# Fast-fail: Minimal retries
engine = ReplayEngine(
    trace,
    max_retries=1,
    retry_initial_delay=0.5,
)

Exponential Backoff¶

Delays double with each retry (capped at max_delay):

Attempt 0: No delay (initial request)
Attempt 1: 1.0s delay
Attempt 2: 2.0s delay
Attempt 3: 4.0s delay
Attempt 4: 8.0s delay (capped at max_delay)

Monitoring Retries¶

result = engine.replay_with_modifications(model="gpt-4o")

# Check which spans required retries
for span in result.spans:
    if span.retry_count > 0:
        print(f"⚠️  {span.name} required {span.retry_count} retries")

Semantic Similarity Fallback¶

Prela provides intelligent fallback when sentence-transformers is unavailable.

Fallback Strategy¶

Without sentence-transformers (fallback):

Exact Match (fastest) - Returns 1.0 for identical strings
difflib.SequenceMatcher (primary) - Edit distance-based similarity (0.0-1.0)
Jaccard Word Similarity (secondary) - Word overlap measurement (0.0-1.0)

With sentence-transformers (best accuracy):

Uses all-MiniLM-L6-v2 embedding model
Computes cosine similarity between embeddings
Better for paraphrasing and semantic equivalence

Performance Comparison¶

Method	Speed	Accuracy	Use Case
Exact match	Instant	Perfect for identical text	Quick check
difflib	~1-5ms	Good for typos, minor edits	General use
Jaccard	~1-5ms	Good for word reordering	Paraphrasing
Embeddings	~10-50ms	Best for semantic similarity	Production

difflib Accuracy¶

# Same text, different case
"Hello World" vs "hello world" → 0.82 (82%)

# Minor edit
"brown fox" vs "red fox" → 0.85 (85%)

# Word reorder
"cat dog bird" vs "dog bird cat" → 0.67 (67%)

# Completely different
"apple" vs "orange" → 0.0 (0%)

When to Install sentence-transformers¶

Install for production use when you need:

Paraphrase detection ("quick" vs "fast")
Semantic equivalence ("start" vs "begin")
High accuracy requirements

pip install prela[similarity]  # Installs sentence-transformers (~500MB)

Checking Availability¶

from prela.replay import compare_replays

comparison = compare_replays(original, modified)

if comparison.semantic_similarity_available:
    print(f"Using embeddings: {comparison.semantic_similarity_model}")
else:
    print("Using fallback: difflib + Jaccard")

Tool Re-execution¶

Re-execute tools during replay instead of using cached data.

3-Tier Priority System¶

For tool spans, Prela uses this priority order:

Mock responses (highest) - Always used if provided
Real execution - Used if enabled, mocks not provided
Cached data (default) - Original captured output

This prevents accidental execution while allowing controlled testing.

Basic Usage¶

# Define tool functions
def my_calculator(input_data):
    return {"result": input_data["a"] + input_data["b"]}

def my_search(input_data):
    # Actual search implementation
    return {"results": [...]}

# Create tool registry
tool_registry = {
    "calculator": my_calculator,
    "search": my_search,
}

# Re-execute tools
result = engine.replay_with_modifications(
    enable_tool_execution=True,
    tool_registry=tool_registry,
)

Safety Controls¶

Allowlist (only execute specific tools):

result = engine.replay_with_modifications(
    enable_tool_execution=True,
    tool_execution_allowlist=["calculator", "search"],  # Only these
    tool_registry=tool_registry,
)

Blocklist (never execute specific tools):

result = engine.replay_with_modifications(
    enable_tool_execution=True,
    tool_execution_blocklist=["delete_file", "shutdown"],  # Block these
    tool_registry=tool_registry,
)

Note: Blocklist takes precedence over allowlist.

Use Cases¶

Testing with Different Tool Implementations:

# Original used production API
# Replay with mock API for testing
def mock_api(input_data):
    return {"status": "success", "data": "test"}

tool_registry = {"api_call": mock_api}

result = engine.replay_with_modifications(
    enable_tool_execution=True,
    tool_execution_allowlist=["api_call"],
    tool_registry=tool_registry,
)

Debugging with Fresh Data:

# Re-run search tool to see if results changed
def fresh_search(input_data):
    # Query current database
    return {...}

result = engine.replay_with_modifications(
    enable_tool_execution=True,
    tool_execution_allowlist=["search"],
    tool_registry={"search": fresh_search},
)

Retrieval Re-execution¶

Re-query vector databases during replay to test with updated data.

3-Tier Priority System¶

For retrieval spans, Prela uses this priority order:

Mock results (highest) - Always used if provided
Real execution - Re-queries if enabled, mocks not provided
Cached data (default) - Original retrieved documents

Supported Vector Databases¶

✅ ChromaDB - Fully implemented
⚠️ Pinecone - Requires embedding model (placeholder)
⚠️ Qdrant - Requires embedding model (placeholder)
⚠️ Weaviate - Requires class name (placeholder)

ChromaDB Example¶

import chromadb

# Setup ChromaDB client
client = chromadb.Client()
collection = client.create_collection("my_docs")

# Add some documents
collection.add(
    documents=["Updated document 1", "Updated document 2"],
    ids=["1", "2"],
)

# Re-query with current data
result = engine.replay_with_modifications(
    enable_retrieval_execution=True,
    retrieval_client=collection,
)

Query Override¶

Change retrieval query during replay:

# Original query: "What is Python?"
# Test with different query
result = engine.replay_with_modifications(
    enable_retrieval_execution=True,
    retrieval_client=collection,
    retrieval_query_override="What is JavaScript?",
)

Use Cases¶

Testing with Updated Vector Store:

# Original trace used old embeddings
# Replay with newly indexed documents
new_client = chromadb.Client()
# ... add updated documents ...

result = engine.replay_with_modifications(
    enable_retrieval_execution=True,
    retrieval_client=new_client,
)

A/B Testing Retrieval Strategies:

queries = [
    "Direct question",
    "Rephrased question",
    "Keywords only",
]

results = {}
for query in queries:
    results[query] = engine.replay_with_modifications(
        enable_retrieval_execution=True,
        retrieval_client=client,
        retrieval_query_override=query,
    )

Cost Estimation¶

Estimate costs without making API calls:

# Load trace
trace = TraceLoader.from_file("traces.jsonl")
engine = ReplayEngine(trace)

# Estimate cost of replay with different model
result = engine.replay_exact()  # No API calls

print(f"Original cost: ${result.total_cost_usd:.4f}")
print(f"Total tokens: {result.total_tokens}")

# Estimate new cost by inspecting token usage
# (Actual API call costs will vary slightly)

Supported Models:

OpenAI: gpt-4, gpt-4o, gpt-3.5-turbo, o1-preview, o1-mini
Anthropic: claude-3-opus, claude-3-sonnet, claude-3-haiku, claude-sonnet-4

Loading Traces¶

From File¶

from prela.replay.loader import TraceLoader

# JSON file (single trace)
trace = TraceLoader.from_file("trace.json")

# JSONL file (picks first trace)
trace = TraceLoader.from_file("traces.jsonl")

From Dictionary¶

# From exported trace dict
trace_dict = {
    "trace_id": "abc-123",
    "spans": [...]
}
trace = TraceLoader.from_dict(trace_dict)

From Span List¶

# From list of Span objects
from prela.core import Span

spans = [span1, span2, span3]
trace = TraceLoader.from_spans(spans)

CLI Usage¶

Replay traces from the command line:

# Exact replay
prela replay trace.json

# Modified replay with comparison
prela replay trace.json --model gpt-4o --compare

# Override multiple parameters
prela replay trace.json \
  --model claude-sonnet-4 \
  --temperature 0.7 \
  --system-prompt "You are an expert assistant" \
  --output result.json

# Save comparison report
prela replay trace.json --model gpt-4o --compare --output comparison.json

CLI Options:

Option	Description	Example
`--model`	Override model	`--model gpt-4o`
`--temperature`	Set temperature	`--temperature 0.7`
`--system-prompt`	Override system prompt	`--system-prompt "Be concise"`
`--max-tokens`	Set max tokens	`--max-tokens 1024`
`--compare`	Compare with original	`--compare`
`--output`	Save result to file	`--output result.json`

Architecture¶

Trace Tree Reconstruction¶

Traces are loaded and organized into a tree structure:

graph TD
    A[Root Span: Agent] --> B[Span: LLM Call]
    A --> C[Span: Tool Call]
    C --> D[Span: LLM Call]
    A --> E[Span: Final LLM Call]

    style A fill:#4F46E5
    style B fill:#6366F1
    style C fill:#818CF8
    style D fill:#A5B4FC
    style E fill:#6366F1

Depth-First Execution:

Spans are replayed in depth-first order to match original execution:

Root span starts
First child executes completely (including its children)
Second child executes completely
And so on...

This ensures parent-child dependencies are respected.

Selective API Calls¶

The replay engine determines which spans need real API calls:

def _span_needs_modification(span):
    if span.span_type != SpanType.LLM:
        return False  # Only LLM spans can be modified

    if modifications.get("model") and span.model != modifications["model"]:
        return True  # Model changed

    if modifications.get("temperature") and span.temperature != modifications["temperature"]:
        return True  # Temperature changed

    # ... check other parameters

    return False  # Use cached data

Optimization:

Only modified spans call APIs
Cached data used for unchanged spans
Significant cost and latency savings

Best Practices¶

1. Enable Replay Selectively¶

Only enable replay capture when needed:

# Production: No replay (minimal overhead)
prela.init(service_name="prod", capture_for_replay=False)

# Development: Enable replay
prela.init(service_name="dev", capture_for_replay=True)

2. Use Exact Replay First¶

Always start with exact replay to verify completeness:

engine = ReplayEngine(trace)
result = engine.replay_exact()

if not result.spans:
    print("Warning: Trace has no replay data")
    return

3. Compare Semantically¶

Use semantic similarity for meaningful comparisons:

comparison = compare_replays(original, modified)

for diff in comparison.differences:
    if diff.field == "output" and diff.semantic_similarity:
        if diff.semantic_similarity < 0.7:
            print(f"⚠️  Significant divergence in {diff.span_name}")

4. Batch Experiments¶

Test multiple configurations efficiently:

models = ["gpt-4", "gpt-4o", "claude-sonnet-4"]
temperatures = [0.0, 0.5, 1.0]

results = {}
for model in models:
    for temp in temperatures:
        key = f"{model}_temp{temp}"
        results[key] = engine.replay_with_modifications(
            model=model,
            temperature=temp
        )

# Compare all results
for key, result in results.items():
    print(f"{key}: ${result.total_cost_usd:.4f}, {result.total_tokens} tokens")

5. Store Comparisons¶

Save comparison reports for analysis:

comparison = compare_replays(original, modified)

# Save to JSON
import json
with open("comparison.json", "w") as f:
    json.dump({
        "summary": comparison.generate_summary(),
        "differences": [
            {
                "span": diff.span_name,
                "field": diff.field,
                "similarity": diff.semantic_similarity
            }
            for diff in comparison.differences
        ]
    }, f, indent=2)

Limitations¶

1. Tool Side Effects¶

Tools with side effects cannot be replayed safely:

# ❌ Cannot replay safely
def send_email(to, subject, body):
    smtp.send(to, subject, body)  # Side effect!

# ✅ Use mocking for side effects
result = engine.replay_with_modifications(
    mock_tool_responses={
        "send_email": {"status": "sent", "message_id": "mock-123"}
    }
)

2. Streaming Responses¶

Streaming is supported with real-time output:

from prela.replay import ReplayEngine
from prela.replay.loader import TraceLoader

# Load trace
trace = TraceLoader.from_file("trace.json")
engine = ReplayEngine(trace)

# Replay with streaming enabled
def on_chunk(chunk: str):
    print(chunk, end="", flush=True)

result = engine.replay_with_modifications(
    model="gpt-4o",
    stream=True,
    stream_callback=on_chunk
)

Note: Streaming works for both OpenAI and Anthropic models during replay execution.

3. Vendor Support¶

Currently supports OpenAI and Anthropic only:

✅ OpenAI (gpt-, o1-)
✅ Anthropic (claude-*)
❌ Other vendors (use exact replay only)

Performance¶

Exact Replay¶

Speed: ~1ms per span
Memory: O(n) where n = number of spans
Cost: $0 (no API calls)

Modified Replay¶

Speed: Depends on API latency
Memory: O(n) + API response buffers
Cost: Only modified LLM spans charged

Comparison¶

Speed: ~10-50ms per comparison (with embeddings)
Memory: O(n) for difference list
Accuracy: 70%+ similarity for semantically similar texts

API Reference¶

See Replay API Documentation for detailed API reference.

Next Steps¶

Replay Examples: Practical examples and recipes
CLI Commands: Command-line replay reference
API Reference: Complete API documentation