Advanced Replay Features¶
Prela's replay engine includes advanced features for handling transient failures, computing similarity without heavy dependencies, and re-executing tools and retrievals for comprehensive testing.
Overview¶
Advanced replay capabilities enable:
- API Retry Logic - Automatic recovery from transient API failures
- Semantic Similarity Fallback - Text comparison without 500MB dependencies
- Tool Re-execution - Execute tools during replay with safety controls
- Retrieval Re-execution - Query vector databases for fresh results
Together, these features let a replay recover from transient API failures, compare outputs without heavy dependencies, and test against live tools and current data.
API Retry Logic¶
Exponential Backoff¶
The replay engine automatically retries failed API calls using exponential backoff:
from prela.replay import ReplayEngine, TraceLoader
trace = TraceLoader.from_file("trace.json")
engine = ReplayEngine(
trace,
max_retries=3, # Maximum retry attempts (default: 3)
retry_initial_delay=1.0, # Initial delay in seconds (default: 1.0)
retry_max_delay=60.0, # Maximum delay cap (default: 60.0)
retry_exponential_base=2.0, # Exponential base (default: 2.0)
)
result = engine.replay_with_modifications(model="gpt-4o")
Retry Pattern:
- Attempt 0: No delay (initial request)
- Attempt 1: 1.0s delay (2^0 × 1.0)
- Attempt 2: 2.0s delay (2^1 × 1.0)
- Attempt 3: 4.0s delay (2^2 × 1.0)
- Capped at retry_max_delay (60s default)
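The schedule above follows one formula: the delay before retry n is retry_initial_delay × retry_exponential_base^(n-1), capped at retry_max_delay. The following is a minimal sketch of that arithmetic. `backoff_delay` is an illustrative helper name, not part of Prela's API, and the engine's internal implementation may differ (for example, it might add jitter).

```python
def backoff_delay(
    attempt: int,
    initial_delay: float = 1.0,
    max_delay: float = 60.0,
    base: float = 2.0,
) -> float:
    """Delay in seconds before `attempt` (attempt 0 is the initial request)."""
    if attempt <= 0:
        return 0.0  # The initial request is sent immediately
    # Exponential growth, capped at max_delay
    return min(initial_delay * base ** (attempt - 1), max_delay)


schedule = [backoff_delay(n) for n in range(4)]
print(schedule)  # [0.0, 1.0, 2.0, 4.0]
```

With the defaults, the cap only takes effect from the seventh retry onward (2^6 × 1.0 = 64s, capped to 60s).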
Retryable Errors¶
The engine automatically retries these error types:
HTTP Status Codes:
- 429 - Rate limit exceeded
- 503 - Service temporarily unavailable
- 502 - Bad gateway
Exception Types:
- Timeout errors (connection timeout, read timeout)
- Connection errors (network issues)
- API responses containing "try again" messages
Non-Retryable Errors:
- 401 - Authentication errors (fail immediately)
- 403 - Permission errors (fail immediately)
- 400 - Bad request errors (fail immediately)
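The classification above can be expressed as a small predicate. This is only a sketch of the rules as listed: `is_retryable` is a hypothetical helper, and it uses Python's built-in TimeoutError and ConnectionError as stand-ins, whereas a real HTTP client library raises its own exception types.

```python
RETRYABLE_STATUS = {429, 502, 503}
NON_RETRYABLE_STATUS = {400, 401, 403}


def is_retryable(status_code=None, exc=None, message=""):
    """Return True if a failure looks transient under the rules listed above."""
    if status_code in NON_RETRYABLE_STATUS:
        return False  # Auth, permission, and bad-request errors fail immediately
    if status_code in RETRYABLE_STATUS:
        return True
    # Assumption: built-in exception types stand in for client-specific ones
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return True
    return "try again" in message.lower()


print(is_retryable(status_code=429))  # True
print(is_retryable(status_code=401))  # False
```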
Retry Count Tracking¶
Each replayed span includes retry count information:
result = engine.replay_with_modifications(model="gpt-4o")
for span in result.spans:
    if span.retry_count > 0:
        print(f"{span.name} required {span.retry_count} retries")
# Example output:
# openai.chat.completions.create required 2 retries
Use Cases:
- Monitor API reliability
- Identify rate limit issues
- Optimize retry configuration
- Debug transient failures
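To illustrate where a per-call retry count could come from, here is a hedged sketch of a retry loop that returns the count alongside the result. `call_with_retries` and `flaky` are hypothetical names for illustration; the sketch does not reproduce Prela's internals, and the `sleep` parameter is injectable only so the example runs without waiting.

```python
import time


def call_with_retries(fn, max_retries=3, initial_delay=1.0,
                      max_delay=60.0, base=2.0, sleep=time.sleep):
    """Call fn(); retry transient errors. Returns (result, retry_count)."""
    retry_count = 0
    while True:
        try:
            return fn(), retry_count
        except (ConnectionError, TimeoutError):
            if retry_count >= max_retries:
                raise  # Retries exhausted: surface the last error
            # Delay before retry n+1 is initial_delay * base**n, capped
            sleep(min(initial_delay * base ** retry_count, max_delay))
            retry_count += 1


attempts = {"n": 0}


def flaky():
    """Fails twice with a transient error, then succeeds."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("temporary network issue")
    return "ok"


delays = []
result, retries = call_with_retries(flaky, sleep=delays.append)
print(result, retries, delays)  # ok 2 [1.0, 2.0]
```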
Configuration Examples¶
Aggressive Retries (Development):
engine = ReplayEngine(
trace,
max_retries=5, # More attempts
retry_initial_delay=0.5, # Faster retries
retry_max_delay=30.0, # Lower cap
)
Conservative Retries (Production):
engine = ReplayEngine(
trace,
max_retries=2, # Fewer attempts
retry_initial_delay=2.0, # Slower retries
retry_max_delay=120.0, # Higher cap
)
No Retries: pass max_retries=0 to the ReplayEngine constructor so that failed calls are not retried.
Semantic Similarity Fallback¶
Overview¶
Replay comparison uses semantic similarity to compare original vs replayed outputs. By default, this requires sentence-transformers (~500MB). The fallback system enables comparison without this dependency.
Fallback Strategy¶
Three-tier fallback when sentence-transformers is unavailable:
Tier 1: Exact Match (Fastest)
Tier 2: difflib SequenceMatcher (Primary)
import difflib
ratio = difflib.SequenceMatcher(None, original_text, replayed_text).ratio()
# Returns a float in [0.0, 1.0]: 2*M/T, where M is the number of matching
# characters and T is the combined length of both strings
Tier 3: Jaccard Word Similarity (Secondary)
words1 = set(original_text.lower().split())
words2 = set(replayed_text.lower().split())
intersection = len(words1 & words2)
union = len(words1 | words2)
# Guard against two empty strings (empty union)
jaccard = intersection / union if union > 0 else 0.0
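The three tiers can be combined into one function using only the standard library. This is a sketch under a stated assumption: this page does not specify how Prela combines the difflib and Jaccard scores, so `fallback_similarity` (a hypothetical name) simply takes the higher of the two.

```python
import difflib


def fallback_similarity(original_text: str, replayed_text: str) -> float:
    """Similarity in [0.0, 1.0] without sentence-transformers."""
    # Tier 1: exact match short-circuits the more expensive comparisons
    if original_text == replayed_text:
        return 1.0

    # Tier 2: character-level matching via difflib
    ratio = difflib.SequenceMatcher(None, original_text, replayed_text).ratio()

    # Tier 3: Jaccard similarity over lowercased word sets
    words1 = set(original_text.lower().split())
    words2 = set(replayed_text.lower().split())
    union = words1 | words2
    jaccard = len(words1 & words2) / len(union) if union else 0.0

    # Assumption: take the higher score; Prela's actual rule may differ
    return max(ratio, jaccard)


print(fallback_similarity("hello world", "hello world"))  # 1.0
print(fallback_similarity("hello world", "world hello"))  # 1.0
```

The second call scores 1.0 because the word sets are identical, which shows how Jaccard ignores word order while difflib does not.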
Performance Comparison¶
| Method | Installation Size | Speed | Accuracy |
|---|---|---|---|
| sentence-transformers | ~500MB | 10-50ms | High (0.9+ for similar) |
| difflib (fallback) | 0MB (built-in) | 1-5ms | Medium (0.7+ for similar) |
| Jaccard (fallback) | 0MB (built-in) | <1ms | Low (0.5+ for similar) |
difflib Behavior¶
Exact match:
Case change:
Minor edit:
Word reorder:
Completely different:
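The cases listed above can be checked directly with the standard library. The sample strings below are illustrative choices, not taken from Prela's documentation or test suite.

```python
import difflib


def ratio(a: str, b: str) -> float:
    """Character-level similarity as computed by difflib."""
    return difflib.SequenceMatcher(None, a, b).ratio()


# Illustrative sample pairs, one per case listed above
cases = {
    "exact match": ("hello world", "hello world"),
    "case change": ("Hello World", "hello world"),
    "minor edit": ("The quick brown fox", "The quick brown fix"),
    "word reorder": ("hello world", "world hello"),
    "completely different": ("abc", "xyz"),
}

scores = {label: ratio(a, b) for label, (a, b) in cases.items()}
for label, score in scores.items():
    print(f"{label}: {score:.2f}")
```

An exact match scores 1.0 and strings with no characters in common score 0.0. SequenceMatcher is case-sensitive and order-sensitive, so a case change or word reorder lowers the score even when a human reader would consider the texts equivalent.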
Availability Flags¶
Comparison results include flags indicating which method was used:
comparison = engine.compare_replay(original_result, replayed_result)
print(f"Semantic similarity available: {comparison.semantic_similarity_available}")
print(f"Model used: {comparison.semantic_similarity_model}")
# Output (with sentence-transformers):
# Semantic similarity available: True
# Model used: all-MiniLM-L6-v2
# Output (without sentence-transformers):
# Semantic similarity available: False
# Model used: None
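To check ahead of time which path a comparison is likely to take, you can test whether the sentence_transformers package is importable. This sketch uses only the standard library; `has_sentence_transformers` is a hypothetical helper, and it does not claim to mirror how Prela sets the flags above.

```python
import importlib.util


def has_sentence_transformers() -> bool:
    """True if the sentence_transformers package can be found for import."""
    return importlib.util.find_spec("sentence_transformers") is not None


method = "sentence-transformers" if has_sentence_transformers() else "difflib (fallback)"
print(f"Comparison method: {method}")
```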
Installation Options¶
Minimal (fallback only):
pip install prela
# Uses difflib + Jaccard, no heavy dependencies
# Fast installation, 0 additional storage
Full (best accuracy):
pip install prela[similarity]
# Downloads sentence-transformers (~500MB first time)
# Better accuracy for semantic comparison
When to Use Each Method¶
Use Fallback (difflib) When:
- Installation size matters (containers, edge devices)
- Comparing structured output (JSON, code)
- Fast installation required (CI/CD)
- Exact or near-exact matches expected
Use sentence-transformers When:
- Comparing natural language text
- Semantic meaning matters more than exact wording
- High accuracy required
- Storage/bandwidth not constrained
Tool Re-execution¶
Overview¶
Instead of replaying cached tool outputs, you can re-execute tools during replay to test with fresh data.
Basic Usage¶
from prela.replay import ReplayEngine, TraceLoader
# Define tool functions
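def calculator(expression: str) -> str:
    """Calculator tool. eval() is unsafe on untrusted input; illustration only."""
    return str(eval(expression))
def search_api(query: str) -> str:
    """Search API tool."""
    import requests
    # Pass the query via params so that it is URL-encoded
    response = requests.get("https://api.example.com/search", params={"q": query})
    # Return text to match the declared str return type
    return response.text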
# Create tool registry
tool_registry = {
"calculator": calculator,
"search_api": search_api,
}
trace = TraceLoader.from_file("trace.json")
engine = ReplayEngine(trace, tool_registry=tool_registry, enable_tool_execution=True)
result = engine.replay_with_modifications(model="gpt-4o")
Safety Controls¶
Allowlist (Recommended):
engine = ReplayEngine(
trace,
tool_registry=tool_registry,
enable_tool_execution={
"allowlist": ["calculator", "search_api"], # Only these tools
}
)
Blocklist:
engine = ReplayEngine(
trace,
tool_registry=tool_registry,
enable_tool_execution={
"blocklist": ["delete_file", "send_email"], # Block dangerous tools
}
)
All Tools:
engine = ReplayEngine(
trace,
tool_registry=tool_registry,
enable_tool_execution=True, # Enable all tools in registry
)
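The three forms of enable_tool_execution shown above (allowlist, blocklist, and True) can be read as a single decision rule. The sketch below is one plausible interpretation, not Prela's implementation: `is_tool_enabled` is a hypothetical helper, and it assumes that a blocklist entry wins when a tool appears in both lists.

```python
def is_tool_enabled(name, enable_tool_execution) -> bool:
    """Decide whether a tool may be re-executed under the given setting."""
    if enable_tool_execution is True:
        return True  # All tools in the registry are enabled
    if not enable_tool_execution:
        return False  # False, None, or an empty dict: execution disabled
    allowlist = enable_tool_execution.get("allowlist")
    blocklist = enable_tool_execution.get("blocklist", [])
    # Assumption: the blocklist takes precedence over the allowlist
    if name in blocklist:
        return False
    # With no allowlist, anything not blocked is permitted
    return allowlist is None or name in allowlist


print(is_tool_enabled("calculator", {"allowlist": ["calculator"]}))   # True
print(is_tool_enabled("send_email", {"blocklist": ["send_email"]}))   # False
```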
Priority System¶
When a tool call is encountered, the engine uses this priority:
- Mocks (highest priority) - If mock provided via tool_mocks parameter
- Execution (medium priority) - If enabled and tool in registry
- Cached (lowest priority) - Original output from trace
engine = ReplayEngine(
trace,
tool_registry={"calculator": calculator_fn},
enable_tool_execution=True,
tool_mocks={"calculator": "42"}, # Mock overrides execution
)
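The priority order described above can be sketched as a small resolver. `resolve_tool_output` and `add` are hypothetical names used for illustration, and the source labels in the return value are not part of Prela's API.

```python
def resolve_tool_output(name, args, cached_output,
                        tool_mocks=None, tool_registry=None,
                        execution_enabled=False):
    """Return (source, output) following mock > execution > cached."""
    tool_mocks = tool_mocks or {}
    tool_registry = tool_registry or {}
    if name in tool_mocks:
        return "mock", tool_mocks[name]  # Highest priority
    if execution_enabled and name in tool_registry:
        return "executed", tool_registry[name](**args)  # Medium priority
    return "cached", cached_output  # Fall back to the recorded output


def add(a: int, b: int) -> str:
    """Toy tool used only for this example."""
    return str(a + b)


registry = {"add": add}
call = ("add", {"a": 2, "b": 3}, "recorded-5")

print(resolve_tool_output(*call, tool_mocks={"add": "42"},
                          tool_registry=registry, execution_enabled=True))
# ('mock', '42')
print(resolve_tool_output(*call, tool_registry=registry, execution_enabled=True))
# ('executed', '5')
print(resolve_tool_output(*call, tool_registry=registry))
# ('cached', 'recorded-5')
```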
Error Handling¶
Tool errors are captured safely:
def risky_tool(input: str) -> str:
    """Tool that might fail."""
    if input == "error":
        raise ValueError("Invalid input")
    return f"Processed: {input}"
tool_registry = {"risky_tool": risky_tool}
engine = ReplayEngine(
trace,
tool_registry=tool_registry,
enable_tool_execution=True,
)
result = engine.replay_with_modifications(model="gpt-4o")
# Errors captured in span status, replay continues
for span in result.spans:
    if span.span_type == "tool" and span.status == "error":
        print(f"Tool {span.name} failed: {span.attributes.get('error.message')}")
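The general pattern behind "errors captured in span status, replay continues" is to wrap each tool call and record the exception instead of propagating it. The following standalone sketch uses a plain dict in place of a span; `execute_tool_safely` is a hypothetical helper, and only the error.message attribute key is taken from the example above.

```python
def execute_tool_safely(tool, **kwargs):
    """Run a tool and record failure details instead of raising."""
    try:
        return {"status": "ok", "output": tool(**kwargs), "attributes": {}}
    except Exception as exc:
        # Record the failure so the caller can continue with other spans
        return {
            "status": "error",
            "output": None,
            "attributes": {"error.message": str(exc)},
        }


def risky_tool(input: str) -> str:
    """Tool that might fail."""
    if input == "error":
        raise ValueError("Invalid input")
    return f"Processed: {input}"


good = execute_tool_safely(risky_tool, input="data")
bad = execute_tool_safely(risky_tool, input="error")
print(good["status"], good["output"])                   # ok Processed: data
print(bad["status"], bad["attributes"]["error.message"])  # error Invalid input
```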
Use Cases¶
Integration Testing:
# Test with real APIs
tool_registry = {
"github_api": github.search_repositories,
"slack_api": slack.post_message,
}
engine = ReplayEngine(
trace,
tool_registry=tool_registry,
enable_tool_execution={"allowlist": ["github_api"]}, # Only test GitHub
)
Regression Testing:
# Compare cached vs fresh results
result1 = engine.replay_with_modifications(enable_tool_execution=False) # Use cached
result2 = engine.replay_with_modifications(enable_tool_execution=True) # Re-execute
comparison = engine.compare_replay(result1, result2)
print(f"Tool output consistency: {comparison.output_similarity}")
Controlled Testing:
# Test subset of tools
engine = ReplayEngine(
trace,
tool_registry=all_tools,
enable_tool_execution={
"allowlist": ["safe_tool_1", "safe_tool_2"],
"blocklist": ["dangerous_tool"], # Extra safety
}
)
Retrieval Re-execution¶
Overview¶
Re-execute vector database queries during replay to test with current data:
from prela.replay import ReplayEngine, TraceLoader
import chromadb
# Initialize vector database client
client = chromadb.Client()
collection = client.get_or_create_collection("my_docs")
trace = TraceLoader.from_file("trace.json")
engine = ReplayEngine(
trace,
retrieval_client=client,
enable_retrieval_execution=True,
)
result = engine.replay_with_modifications(model="gpt-4o")
Supported Vector Databases¶
ChromaDB (Fully Implemented):
import chromadb
client = chromadb.Client()
collection = client.get_or_create_collection("docs")
engine = ReplayEngine(
trace,
retrieval_client=client,
enable_retrieval_execution=True,
)
Pinecone (Placeholder):
import pinecone
pinecone.init(api_key="...")
index = pinecone.Index("my-index")
engine = ReplayEngine(
trace,
retrieval_client=index,
enable_retrieval_execution=True,
)
Qdrant (Placeholder):
from qdrant_client import QdrantClient
client = QdrantClient(url="http://localhost:6333")
engine = ReplayEngine(
trace,
retrieval_client=client,
enable_retrieval_execution=True,
)
Weaviate (Placeholder):
import weaviate
client = weaviate.Client(url="http://localhost:8080")
engine = ReplayEngine(
trace,
retrieval_client=client,
enable_retrieval_execution=True,
)
Query Override¶
Override the original query:
engine = ReplayEngine(
trace,
retrieval_client=client,
enable_retrieval_execution=True,
retrieval_query_override="Updated query text",
)
# All retrieval spans will use the new query
result = engine.replay_with_modifications(model="gpt-4o")
Priority System¶
When a retrieval operation is encountered:
- Execution (if enabled and client provided) - Query vector database
- Cached (fallback) - Original documents from trace
# Test with fresh data
result1 = engine.replay_with_modifications(enable_retrieval_execution=True)
# Test with cached data
result2 = engine.replay_with_modifications(enable_retrieval_execution=False)
# Compare consistency
comparison = engine.compare_replay(result1, result2)
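The retrieval priority, together with the query override described earlier, can be sketched without any vector database. Here `resolve_documents` is a hypothetical helper and `keyword_retriever` is a toy stand-in for a real client; neither reflects ChromaDB's or Prela's actual interfaces.

```python
def resolve_documents(query, cached_documents, retriever=None,
                      execution_enabled=False, query_override=None):
    """Return (source, documents): re-execute if possible, else use cache."""
    if execution_enabled and retriever is not None:
        # An override, when given, replaces the query recorded in the trace
        return "executed", retriever(query_override or query)
    return "cached", cached_documents


# Toy in-memory corpus and retriever, for illustration only
CORPUS = ["replay engine overview", "retry configuration guide", "tool safety notes"]


def keyword_retriever(query: str):
    """Return documents sharing at least one word with the query."""
    terms = set(query.lower().split())
    return [doc for doc in CORPUS if terms & set(doc.split())]


cached = ["old replay doc"]
print(resolve_documents("replay basics", cached))
# ('cached', ['old replay doc'])
print(resolve_documents("replay basics", cached, keyword_retriever, True))
# ('executed', ['replay engine overview'])
print(resolve_documents("replay basics", cached, keyword_retriever, True,
                        query_override="retry guide"))
# ('executed', ['retry configuration guide'])
```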
Use Cases¶
Data Freshness Testing:
# Verify agent works with current data
engine = ReplayEngine(
trace,
retrieval_client=chroma_client,
enable_retrieval_execution=True,
)
result = engine.replay_with_modifications(model="gpt-4o")
print(f"Documents retrieved: {len(result.spans[0].documents)}")
RAG Pipeline Testing:
# Test retrieval → generation pipeline
engine = ReplayEngine(
trace,
retrieval_client=chroma_client,
enable_retrieval_execution=True,
enable_tool_execution=True, # Also re-execute tools
)
result = engine.replay_with_modifications(
model="gpt-4o",
temperature=0.0, # Deterministic generation
)
Query Sensitivity Testing:
# Test different query variations
queries = [
"original query",
"rephrased query",
"shorter query",
]
results = []
for query in queries:
    engine = ReplayEngine(
        trace,
        retrieval_client=client,
        enable_retrieval_execution=True,
        retrieval_query_override=query,
    )
    results.append(engine.replay_with_modifications(model="gpt-4o"))
# Compare outputs across query variations
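One way to compare outputs across query variations is a pairwise similarity table. The sketch below uses difflib so that it runs without extra dependencies. `pairwise_similarity` is a hypothetical helper, and the output strings are placeholders standing in for text extracted from each replay result, since this page does not show how to extract it.

```python
import difflib
from itertools import combinations


def pairwise_similarity(outputs: dict) -> dict:
    """Map each pair of labels to the difflib ratio of their outputs."""
    return {
        (a, b): difflib.SequenceMatcher(None, outputs[a], outputs[b]).ratio()
        for a, b in combinations(outputs, 2)
    }


# Placeholder outputs keyed by the query that produced them
outputs = {
    "original query": "The service retries failed calls three times.",
    "rephrased query": "The service retries failed calls up to three times.",
    "shorter query": "Retries are enabled.",
}

for (a, b), score in pairwise_similarity(outputs).items():
    print(f"{a!r} vs {b!r}: {score:.2f}")
```

Low scores between variations suggest that the final answer is sensitive to how the retrieval query is phrased.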
Combining Advanced Features¶
Complete Example¶
from prela.replay import ReplayEngine, TraceLoader
import chromadb
# Define tools
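def calculator(expr: str) -> str:
    # eval() is unsafe on untrusted input; illustration only
    return str(eval(expr))
def search_api(query: str) -> str:
    import requests
    # Pass the query via params so that it is URL-encoded; return text to match -> str
    return requests.get("https://api.example.com/search", params={"q": query}).text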
# Initialize vector database
chroma_client = chromadb.Client()
collection = chroma_client.get_or_create_collection("docs")
# Load trace
trace = TraceLoader.from_file("trace.json")
# Create engine with all advanced features
engine = ReplayEngine(
trace,
# API retry configuration
max_retries=3,
retry_initial_delay=1.0,
retry_max_delay=60.0,
# Tool re-execution
tool_registry={"calculator": calculator, "search_api": search_api},
enable_tool_execution={"allowlist": ["calculator"]},
# Retrieval re-execution
retrieval_client=chroma_client,
enable_retrieval_execution=True,
)
# Replay with modifications
result = engine.replay_with_modifications(
model="gpt-4o",
temperature=0.7,
)
# Analyze results
print(f"Replayed {len(result.spans)} spans")
print(f"Retries required: {sum(s.retry_count for s in result.spans)}")
print(f"Tools executed: {sum(1 for s in result.spans if s.span_type == 'tool')}")
print(f"Retrievals executed: {sum(1 for s in result.spans if s.span_type == 'retrieval')}")
# Compare with original (using fallback similarity)
comparison = engine.compare_replay(
original_result=trace,
replayed_result=result,
)
print(f"\nSemantic similarity available: {comparison.semantic_similarity_available}")
print(f"Similarity model: {comparison.semantic_similarity_model or 'difflib (fallback)'}")
print(f"Output similarity: {comparison.output_similarity:.2%}")
Best Practices¶
1. Start with Defaults¶
Use the default retry configuration (max_retries=3, retry_initial_delay=1.0, retry_max_delay=60.0, retry_exponential_base=2.0) unless you have specific needs.
2. Use Allowlists for Tool Execution¶
Always specify which tools are safe to execute:
engine = ReplayEngine(
trace,
tool_registry=all_tools,
enable_tool_execution={"allowlist": ["safe_tool_1", "safe_tool_2"]},
)
3. Monitor Retry Counts¶
Track which spans require retries:
retry_spans = [s for s in result.spans if s.retry_count > 0]
if retry_spans:
    print(f"Warning: {len(retry_spans)} spans required retries")
4. Fallback is Usually Sufficient¶
Use difflib fallback unless semantic understanding is critical:
# No need to install sentence-transformers for most use cases
pip install prela # Fallback is fast and accurate enough
5. Combine Features Carefully¶
Enable only features you need:
# Development: Enable everything
engine = ReplayEngine(
trace,
max_retries=5,
enable_tool_execution=True,
enable_retrieval_execution=True,
)
# Production: Conservative settings
engine = ReplayEngine(
trace,
max_retries=2,
enable_tool_execution={"allowlist": ["read_only_tool"]},
enable_retrieval_execution=False, # Use cached data
)
Troubleshooting¶
Issue: Retries Not Working¶
Symptoms: API calls fail immediately without retrying.
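Solutions:
1. Check that the error type is retryable: 429, 502, 503, timeouts, and connection errors are retried; 400, 401, and 403 fail immediately.
2. Verify that max_retries > 0.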
Issue: Tool Execution Failing¶
Symptoms: Tools not executing during replay.
Solutions:
1. Verify that the tool is in tool_registry.
2. Check that the allowlist includes the tool and that the blocklist does not.
3. Ensure the tool function signature is correct.
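A signature mismatch can be detected before replay by binding the recorded arguments against the tool function. This sketch relies only on the standard library's inspect module; `accepts_arguments` is a hypothetical helper, and the argument dictionaries are illustrative rather than taken from a real trace.

```python
import inspect


def accepts_arguments(tool, arguments: dict) -> bool:
    """True if `tool` can be called with `arguments` as keyword arguments."""
    try:
        inspect.signature(tool).bind(**arguments)
        return True
    except TypeError:
        # Raised for missing, unexpected, or duplicated parameters
        return False


def calculator(expression: str) -> str:
    """Placeholder tool; only its signature matters here."""
    return expression


print(accepts_arguments(calculator, {"expression": "1 + 1"}))  # True
print(accepts_arguments(calculator, {"expr": "1 + 1"}))        # False
```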
Issue: Retrieval Client Not Working¶
Symptoms: Retrieval not re-executing, using cached data.
Solutions:
1. Verify that the client type is supported: ChromaDB is fully implemented; Pinecone, Qdrant, and Weaviate are placeholders.
2. Check that enable_retrieval_execution is True.
Next Steps¶
- Replay Multi-Agent Examples - Replay with CrewAI, AutoGen, LangGraph, Swarm
- Replay with Tools Examples - Tool re-execution patterns
- Basic Replay - Core replay concepts
- CLI Replay Commands - Command-line replay interface