Production-Validated Test Scenarios¶
These test scenarios validate all core Prela SDK features with real Anthropic Claude API calls. All 21 features have been validated in Phase 4 of the SDK testing process.
Validation Status
✅ 21/21 features validated (100%)
✅ 4/4 performance criteria met (100%)
✅ 4/4 documentation checks passed (100%)
Overview¶
The test scenarios directory contains 6 production-ready scripts that demonstrate and validate:
- File Exporter - Traces saved to the ./test_traces directory
- Console Exporter - Colored tree-structured output
- Anthropic Instrumentation - Automatic LLM call tracing
- Span Hierarchy - Parent-child span relationships
- Streaming - Streaming response capture
- Tool Calling - Tool use event capture
- Error Handling - Error status and attributes
- Replay Engine - Model switching and comparison
- Evaluation Framework - Systematic testing with assertions
- CLI Commands - All 11 CLI commands validated
Quick Start¶
Prerequisites¶
# Set API key
export ANTHROPIC_API_KEY="sk-ant-..."
# Install SDK (replace the path with your local checkout)
cd /Users/gw/prela/sdk
pip install -e .
Run All Scenarios¶
cd /Users/gw/prela/sdk/examples/test_scenarios
# Run each scenario
python 01_simple_success.py
python 02_multi_step.py
python 03_rate_limit_failure.py
python 04_streaming.py
python 05_tool_calling.py
python 06_evaluation.py
Scenario 1: Simple Success¶
File: 01_simple_success.py
Validates basic LLM tracing with file exporter.
import prela
from anthropic import Anthropic
# Initialize with file exporter
tracer = prela.init(
    service_name="simple-success",
    exporter="file",
    file_path="./test_traces"
)
# Make API call - automatically traced
client = Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=100,
    messages=[{"role": "user", "content": "What is 2+2?"}]
)
print(f"Response: {response.content[0].text}")
Validates:
- ✅ File exporter creates the ./test_traces/ directory
- ✅ Traces saved in JSONL format
- ✅ Anthropic instrumentation captures all LLM calls
- ✅ Token usage recorded (llm.input_tokens, llm.output_tokens)
- ✅ Span attributes include model, provider, latency
Expected Output:
✓ Prela initialized
✓ Trace file: ./test_traces/traces-2026-01-30-001.jsonl
✓ Making simple Claude API call...
✓ Response: 2 + 2 equals 4.
✓ Tokens: 20 in, 14 out
✓ Trace saved with 1 span
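To check the token counts in a saved trace by hand, the JSONL file can be read one record per line. The sketch below assumes each line is a JSON object with an `attributes` mapping that holds `llm.input_tokens` and `llm.output_tokens`. That record layout and the `sum_tokens` helper are illustrative assumptions, not the documented Prela schema.

```python
import json

def sum_tokens(jsonl_text):
    """Return (input_tokens, output_tokens) summed over all span records."""
    total_in = total_out = 0
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        attrs = json.loads(line).get("attributes", {})
        total_in += attrs.get("llm.input_tokens", 0)
        total_out += attrs.get("llm.output_tokens", 0)
    return total_in, total_out

# Hypothetical record shaped like the span described above
sample = json.dumps({
    "name": "anthropic.messages.create",
    "attributes": {"llm.input_tokens": 20, "llm.output_tokens": 14},
})
print(sum_tokens(sample))
```

With a real trace, the text of a file under ./test_traces would be passed in place of `sample`.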
Scenario 2: Multi-Step Workflow¶
File: 02_multi_step.py
Validates span hierarchy with parent-child relationships.
import prela
from anthropic import Anthropic
tracer = prela.init(service_name="multi-step")
def research_step():
    with tracer.span("step_1_research"):
        client = Anthropic()
        response = client.messages.create(...)
        return response.content[0].text
def analysis_step():
    with tracer.span("step_2_analysis"):
        ...  # similar to research_step
def summary_step():
    with tracer.span("step_3_summary"):
        ...  # similar to research_step
# Parent span wraps all steps
with tracer.span("research_workflow"):
    results = []
    results.append(research_step())
    results.append(analysis_step())
    results.append(summary_step())
Validates:
- ✅ Span hierarchy with nested operations
- ✅ Parent-child relationships via parent_span_id
- ✅ Context propagation across functions
- ✅ Tree visualization with prela show
CLI Validation:
$ prela show <trace_id>
└─ research_workflow (3.5s) ✓
   ├─ step_1_research (1.2s) ✓
   ├─ step_2_analysis (1.1s) ✓
   └─ step_3_summary (0.8s) ✓
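The tree above follows directly from the parent_span_id links. As a rough illustration, the sketch below rebuilds the nesting from a flat list of spans. The `span_id` and `name` field names and the `render_tree` helper are assumptions made for this example, not Prela's internal representation.

```python
from collections import defaultdict

def render_tree(spans):
    """Return indented lines showing each span under its parent."""
    children = defaultdict(list)
    for span in spans:
        children[span.get("parent_span_id")].append(span)

    lines = []

    def walk(parent_id, depth):
        for span in children[parent_id]:
            lines.append("  " * depth + span["name"])
            walk(span["span_id"], depth + 1)

    walk(None, 0)  # root spans have no parent
    return lines

# Hypothetical flat span list matching the workflow above
spans = [
    {"span_id": "a", "parent_span_id": None, "name": "research_workflow"},
    {"span_id": "b", "parent_span_id": "a", "name": "step_1_research"},
    {"span_id": "c", "parent_span_id": "a", "name": "step_2_analysis"},
    {"span_id": "d", "parent_span_id": "a", "name": "step_3_summary"},
]
print("\n".join(render_tree(spans)))
```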
Scenario 3: Rate Limit Handling¶
File: 03_rate_limit_failure.py
Validates error capture and status tracking.
import prela
from anthropic import Anthropic
tracer = prela.init(service_name="rate-limit-test")
try:
    # An invalid key forces the API call to fail so the error path is traced
    client = Anthropic(api_key="invalid-key")
    response = client.messages.create(...)
except Exception as e:
    print(f"Error captured: {e}")
Validates:
- ✅ Error handling for API failures
- ✅ Span status set to "error"
- ✅ Error attributes: error.type, error.message, error.stack_trace
- ✅ CLI prela errors command shows failed traces
CLI Validation:
$ prela errors --limit 5
Showing 1 error trace (from last 50):
╭────────────┬──────────────┬──────────┬────────┬───────┬──────────────────────╮
│ Trace ID   │ Root Span    │ Duration │ Status │ Spans │ Time                 │
├────────────┼──────────────┼──────────┼────────┼───────┼──────────────────────┤
│ abc-123... │ llm call     │ 52ms     │ error  │ 1     │ 2026-01-30 12:34:56  │
╰────────────┴──────────────┴──────────┴────────┴───────┴──────────────────────╯
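One way to picture how a failing call ends up with an "error" status and the three error attributes is a context manager that records the exception and then re-raises it. This is a minimal sketch using a plain dict as the span. The `record_errors` helper is hypothetical and is not Prela's implementation.

```python
import traceback
from contextlib import contextmanager

@contextmanager
def record_errors(span):
    """Mark `span` as failed if the wrapped block raises, then re-raise."""
    try:
        yield span
        span["status"] = "success"
    except Exception as exc:
        span["status"] = "error"
        span["attributes"] = {
            "error.type": type(exc).__name__,
            "error.message": str(exc),
            "error.stack_trace": traceback.format_exc(),
        }
        raise

span = {"name": "llm call"}
try:
    with record_errors(span):
        raise RuntimeError("simulated API failure")
except RuntimeError:
    pass  # the caller still sees the original exception
print(span["status"], span["attributes"]["error.type"])
```

Re-raising keeps the application's own error handling intact while the span still records what went wrong.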
Scenario 4: Streaming Responses¶
File: 04_streaming.py
Validates streaming LLM response capture.
import prela
from anthropic import Anthropic
tracer = prela.init(service_name="streaming-test")
client = Anthropic()
with client.messages.stream(
    model="claude-sonnet-4-20250514",
    max_tokens=100,
    messages=[{"role": "user", "content": "Tell a story"}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
Validates:
- ✅ Streaming response capture
- ✅ llm.stream=true attribute
- ✅ Token usage from final message
- ✅ Text content aggregation
Span Attributes:
{
  "llm.stream": true,
  "llm.prompt_tokens": 15,
  "llm.completion_tokens": 89,
  "llm.latency_ms": 1234.5
}
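Text content aggregation can be thought of as joining the streamed chunks while timing the whole iteration. The sketch below does this for any iterable of strings. The `consume_stream` helper and the `llm.response_text` key are assumptions made for illustration. Only `llm.stream` and `llm.latency_ms` appear in the attributes shown above.

```python
import time

def consume_stream(chunks):
    """Collect text chunks and return span-style attributes."""
    start = time.perf_counter()
    parts = []
    for text in chunks:
        parts.append(text)
    latency_ms = (time.perf_counter() - start) * 1000
    return {
        "llm.stream": True,
        "llm.response_text": "".join(parts),  # hypothetical attribute name
        "llm.latency_ms": latency_ms,
    }

# A plain iterator stands in for stream.text_stream
attrs = consume_stream(iter(["Once ", "upon ", "a time"]))
print(attrs["llm.response_text"])
```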
Scenario 5: Tool Calling¶
File: 05_tool_calling.py
Validates LLM tool/function calling.
import prela
from anthropic import Anthropic
tracer = prela.init(service_name="tool-test")
tools = [{
    "name": "get_weather",
    "description": "Get weather for a location",
    "input_schema": {
        "type": "object",
        "properties": {
            "location": {"type": "string"}
        }
    }
}]
client = Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=100,
    tools=tools,
    messages=[{"role": "user", "content": "What's the weather in SF?"}]
)
Validates:
- ✅ Tool use detection
- ✅ Stop reason = "tool_use"
- ✅ Tool call events with tool.id, tool.name, tool.input
Span Events:
{
  "events": [
    {
      "name": "tool_call",
      "attributes": {
        "tool.id": "toolu_123",
        "tool.name": "get_weather",
        "tool.input": {"location": "San Francisco"}
      }
    }
  ]
}
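In the Anthropic Messages API, a tool request arrives as a content block of type `tool_use` carrying `id`, `name`, and `input`. The sketch below shows how such blocks could be mapped to the event shape above. It works on plain dicts rather than SDK objects, and the `tool_call_events` helper is a hypothetical illustration, not Prela's instrumentation code.

```python
def tool_call_events(content_blocks):
    """Build tool_call events from tool_use content blocks (as dicts)."""
    events = []
    for block in content_blocks:
        if block.get("type") != "tool_use":
            continue  # text blocks carry no tool call
        events.append({
            "name": "tool_call",
            "attributes": {
                "tool.id": block["id"],
                "tool.name": block["name"],
                "tool.input": block["input"],
            },
        })
    return events

# Hypothetical response content, mirroring the values shown above
content = [
    {"type": "text", "text": "Let me check the weather."},
    {"type": "tool_use", "id": "toolu_123", "name": "get_weather",
     "input": {"location": "San Francisco"}},
]
events = tool_call_events(content)
print(events[0]["attributes"]["tool.name"])
```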
Scenario 6: Evaluation Framework¶
File: 06_evaluation.py
Validates systematic testing with assertions.
import prela
from prela.evals import EvalCase, EvalSuite, EvalRunner
from prela.evals.assertions import ContainsAssertion, RegexAssertion
# Define test cases
cases = [
    EvalCase(
        id="test_addition",
        name="Addition test",
        input={"query": "What is 5+3?"},
        assertions=[
            ContainsAssertion(text="8")
        ]
    ),
    # ... more cases
]
# Create suite
suite = EvalSuite(name="Math QA Tests", cases=cases)
# Run evaluation
runner = EvalRunner(suite, agent_function)
result = runner.run()
print(result.summary())
Validates:
- ✅ Eval framework (EvalCase, EvalSuite, EvalRunner)
- ✅ Assertions execute correctly
- ✅ Tracer integration during eval runs
- ✅ Summary report generation
Expected Output:
Evaluation Suite: Math QA Tests
Total Cases: 3
Passed: 3 (100.0%)
Failed: 0 (0.0%)
Case Results:
✓ Addition test (842ms)
✓ Complex calculation (1231ms)
✓ JSON format test (923ms)
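The pass/fail logic behind a report like this can be sketched without the SDK. The example below uses plain functions as stand-ins for substring and regex checks and a canned agent so it runs offline. The names `contains`, `matches`, `run_cases`, and `fake_agent` are hypothetical and are not part of the prela.evals API.

```python
import re

def contains(text):
    """Check that `text` appears in the output."""
    return lambda output: text in output

def matches(pattern):
    """Check that `pattern` matches somewhere in the output."""
    return lambda output: re.search(pattern, output) is not None

def run_cases(cases, agent):
    """Return {case name: passed} for (name, query, checks) tuples."""
    results = {}
    for name, query, checks in cases:
        output = agent(query)
        results[name] = all(check(output) for check in checks)
    return results

def fake_agent(query):
    # Stand-in for a real model call so the sketch runs offline
    return "5 + 3 equals 8."

cases = [
    ("Addition test", "What is 5+3?", [contains("8")]),
    ("Digit test", "What is 5+3?", [matches(r"\d+ \+ \d+")]),
    ("Failing test", "What is 5+3?", [contains("nine")]),
]
results = run_cases(cases, fake_agent)
passed = sum(results.values())
print(f"Passed: {passed}/{len(results)}")
```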
CLI Validation¶
After running scenarios, verify all CLI commands:
List Traces¶
$ prela list
Showing 22 traces (from last 50):
╭────────────┬──────────────┬──────────┬────────┬───────┬──────────────────────╮
│ Trace ID   │ Root Span    │ Duration │ Status │ Spans │ Time                 │
├────────────┼──────────────┼──────────┼────────┼───────┼──────────────────────┤
│ abc-123... │ simple call  │ 1234ms   │ success│ 1     │ 2026-01-30 12:34:56  │
│ def-456... │ workflow     │ 3456ms   │ success│ 4     │ 2026-01-30 12:33:21  │
╰────────────┴──────────────┴──────────┴────────┴───────┴──────────────────────╯
Show Trace Details¶
$ prela show abc-123
Trace: abc-123 @ 12:34:56
Service: simple-success
Status: success
Duration: 1234ms
Spans: 1
└─ anthropic.messages.create (1234ms) ✓
   llm.model: claude-sonnet-4-20250514
   llm.input_tokens: 20
   llm.output_tokens: 14
   llm.latency_ms: 1234.5
Compact Mode¶
$ prela show abc-123 --compact
└─ anthropic.messages.create (1234ms) ✓
💡 Tip: Run without --compact to see full span details and events
Most Recent Trace¶
$ prela last
# Shows most recent trace with full details
# Equivalent to: prela list | head -1 | prela show
Filter Errors¶
$ prela errors --limit 5
# Lists only failed traces; see Scenario 3 for sample output
Real-Time Monitoring¶
$ prela tail --compact
Watching for new traces (Ctrl+C to stop)...
[12:34:56] └─ simple call (1234ms) ✓
[12:35:12] └─ workflow (3456ms) ✓
[12:35:45] └─ streaming (2345ms) ✓
Performance Validation¶
All performance criteria validated:
SDK Overhead¶
- Target: < 5% of request time
- Actual: < 100ms instrumentation overhead (~1-2% for 1-2 second API calls)
- Status: ✅ PASS
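The target is a percentage while the measurement is quoted in milliseconds, so converting between the two makes the comparison easier to follow. The sketch below computes the millisecond budget implied by the 5% target and the percentage for a given overhead. The 20 ms overhead is an illustrative input, not a figure from the validation run, and both helper names are hypothetical.

```python
def overhead_percent(overhead_ms, request_ms):
    """Instrumentation overhead as a percentage of total request time."""
    return 100.0 * overhead_ms / request_ms

def budget_ms(request_ms, target_percent=5):
    """Largest overhead in ms that still meets the percentage target."""
    return request_ms * target_percent / 100

# Compare a 1 second and a 2 second API call
for request_ms in (1000, 2000):
    print(request_ms, budget_ms(request_ms), overhead_percent(20, request_ms))
```

Under these assumptions, a 5% target allows 50 ms on a 1 second call and 100 ms on a 2 second call, and an overhead of 20 ms corresponds to 2% and 1% respectively.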
Trace File Writes¶
- Target: Non-blocking
- Actual: Async file I/O, scripts complete without waiting
- Status: ✅ PASS
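One common way to keep trace writes off the request path is to hand records to a background thread through a queue. The sketch below shows that pattern with an in-memory stream. It is a thread-based illustration only. The `BackgroundWriter` class is hypothetical, and the source does not say which mechanism Prela uses for its async file I/O.

```python
import io
import json
import queue
import threading

class BackgroundWriter:
    """Queue records and write them as JSONL on a worker thread."""

    def __init__(self, stream):
        self._stream = stream
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def write(self, record):
        self._queue.put(record)  # returns immediately; no I/O on this thread

    def _drain(self):
        while True:
            record = self._queue.get()
            if record is None:  # sentinel: stop the worker
                break
            self._stream.write(json.dumps(record) + "\n")

    def close(self):
        """Flush pending records and stop the worker thread."""
        self._queue.put(None)
        self._thread.join()

buffer = io.StringIO()  # stands in for an open trace file
writer = BackgroundWriter(buffer)
writer.write({"name": "simple call", "status": "success"})
writer.close()
print(buffer.getvalue(), end="")
```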
CLI Commands Response¶
- Target: < 1 second
- Actual: < 100ms for list/show/search
- Status: ✅ PASS
Replay Engine¶
- Target: Reasonable time
- Actual: ~2 seconds for API call replay
- Status: ✅ PASS
Documentation Validation¶
All documentation criteria validated:
Test Scenario Comments¶
- Target: Clear docstrings
- Actual: All 6 scenarios have detailed docstrings
- Status: ✅ PASS
Expected Outputs¶
- Target: Documented
- Actual: SDK_LOCAL_TESTING.md documents all expected outputs
- Status: ✅ PASS
Error Messages¶
- Target: Helpful and actionable
- Actual: All errors include clear messages and suggestions
- Status: ✅ PASS
CLI Help Text¶
- Target: Accurate
- Actual: prela --help shows complete, accurate help
- Status: ✅ PASS
Full Validation Report¶
See the complete Phase 4 validation report for the supporting evidence.
Summary:
- Total Features Validated: 21/21 (100%)
- Performance Criteria Met: 4/4 (100%)
- Documentation Quality: 4/4 (100%)
- Overall Status: ✅ COMPLETE
Next Steps¶
After validating these scenarios:
- Explore Advanced Examples: See sdk/examples/ for more patterns
- Read Integration Guides: Check Integrations for framework-specific usage
- Build Your Agent: Apply these patterns to production applications
- Deploy Observability: Use file exporter or OTLP exporter for production monitoring
Troubleshooting¶
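API Key Not Set¶
# Set the key in the current shell (same command as in Prerequisites)
export ANTHROPIC_API_KEY="sk-ant-..."
Module Not Found¶
# Reinstall the SDK in editable mode (same commands as in Prerequisites)
cd /Users/gw/prela/sdk
pip install -e .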
No Traces Generated¶
# Check directory exists
ls -la ./test_traces/
# Verify JSONL contents
cat ./test_traces/traces-*.jsonl | jq .
CLI Command Not Found¶
# Ensure CLI tools installed
pip install -e ".[cli]"
# Verify installation
which prela
prela --version