Monitoring, Evaluation, and Continuous Improvement
This guide explains how to monitor your agents’ performance in PromptOwl, collect and analyze feedback, run systematic evaluations, and use AI to continuously improve your prompts.
Table of Contents
- The Improvement Lifecycle
- Collecting Quality Feedback
- Monitoring Performance
- Creating Evaluation Sets
- Running Evaluations
- Using AI Judge
- AI-Assisted Improvement
- Best Practices
The Improvement Lifecycle
Continuous improvement follows a cyclical process:
```
┌──────────┐     ┌──────────┐     ┌──────────┐
│  DEPLOY  │────▶│ MONITOR  │────▶│ COLLECT  │
│  Agent   │     │  Usage   │     │ Feedback │
└──────────┘     └──────────┘     └──────────┘
     ▲                                  │
     │                                  ▼
┌──────────┐     ┌──────────┐     ┌──────────┐
│ PUBLISH  │◀────│ IMPROVE  │◀────│ EVALUATE │
│  Update  │     │  Prompt  │     │ Quality  │
└──────────┘     └──────────┘     └──────────┘
```
The Six Steps
| Step | Action | Purpose |
|---|---|---|
| 1. Deploy | Publish your agent to production | Make available to users |
| 2. Monitor | Track usage, tokens, conversations | Understand real-world behavior |
| 3. Collect | Gather annotations and feedback | Identify improvement areas |
| 4. Evaluate | Test systematically with eval sets | Measure quality objectively |
| 5. Improve | Use AI to refine prompts | Generate better versions |
| 6. Publish | Release improvements | Complete the cycle |
Collecting Quality Feedback
Types of Feedback
PromptOwl supports two levels of feedback:
1. Message-Level Annotations
Feedback on individual AI responses:
- When to use: Rating specific answers
- Components: Sentiment (thumbs up/down) + detailed text
- Best for: Identifying specific issues
```
User: "What's your return policy?"
AI: "Our return policy allows returns within 30 days..."
[👍 Thumbs Up]
Annotation: "Accurate and complete answer. Good tone."
```
2. Conversation-Level Annotations
Feedback on entire conversations:
- When to use: Rating overall experience
- Components: Sentiment + summary feedback
- Best for: Holistic assessment
```
Overall Conversation Rating: [👎 Thumbs Down]
Annotation: "Agent was helpful but took too many turns to resolve the issue."
```
Encouraging Quality Feedback
For Internal Teams
Train your team to provide actionable annotations:
Good annotation:
“The response was accurate but too technical for our customer audience. Should use simpler language and avoid jargon like ‘API endpoint’.”
Poor annotation:
“Bad response”
Annotation Guidelines
| Do | Don’t |
|---|---|
| Be specific about what’s wrong | Use vague terms like “bad” or “wrong” |
| Suggest how to improve | Just criticize without direction |
| Note what worked well | Only focus on negatives |
| Include context if relevant | Assume reader knows the situation |
Sentiment Best Practices
Use sentiment consistently:
| Sentiment | When to Use |
|---|---|
| 👍 Positive | Response was helpful, accurate, appropriate |
| 👎 Negative | Response was wrong, unhelpful, inappropriate |
| — Neutral | Response was acceptable but could be better |
Monitoring Performance
Accessing the Monitor
- Open your prompt from the Dashboard
- Click Monitor in the top navigation
- View the conversation history and analytics
Monitor Interface Overview
The Monitor has two main tabs:
History Tab
Shows all conversations with your agent:
| Column | Description |
|---|---|
| User | Who had the conversation |
| Start Time | When it began |
| Duration | How long it lasted |
| Source | Where it came from |
| Topic | Conversation subject |
All Annotations Tab
Aggregates all feedback:
| Column | Description |
|---|---|
| User | Who provided feedback |
| Question | What was asked |
| Response | What the AI answered |
| Annotation | The feedback text |
| Sentiment | Thumbs up/down/neutral |
| Date | When feedback was given |
Key Metrics to Track
Conversation Metrics
- Total Conversations: Overall usage volume
- Average Duration: How long users engage
- Messages per Conversation: Conversation depth
Quality Metrics
- Satisfaction Score: Average sentiment (0-100)
- Positive Rate: Percentage of thumbs up
- Annotation Volume: How much feedback collected
Usage Metrics
- Total Tokens Used: API consumption
- Model Distribution: Which models are used
- Peak Usage Times: When most active
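These metrics can be derived from raw conversation records. Below is a minimal sketch; the record shape (`messages`, `tokens`, `sentiment` fields) is illustrative, not PromptOwl's actual export format:

```python
# Aggregate the conversation, quality, and usage metrics listed above
# from raw conversation records (illustrative record shape).
def aggregate_metrics(conversations):
    total = len(conversations)
    messages = sum(len(c["messages"]) for c in conversations)
    tokens = sum(c["tokens"] for c in conversations)
    positive = sum(1 for c in conversations if c["sentiment"] == "positive")
    negative = sum(1 for c in conversations if c["sentiment"] == "negative")
    rated = positive + negative
    return {
        "total_conversations": total,
        "messages_per_conversation": messages / total if total else 0.0,
        "total_tokens": tokens,
        "positive_rate": positive / rated if rated else None,
    }

convs = [
    {"messages": ["q", "a"], "tokens": 120, "sentiment": "positive"},
    {"messages": ["q", "a", "q", "a"], "tokens": 310, "sentiment": "negative"},
    {"messages": ["q", "a"], "tokens": 95, "sentiment": "positive"},
]
metrics = aggregate_metrics(convs)
```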
Filtering and Search
Find specific conversations:
- Search: Type to filter by user, topic, or content
- Date Range: Focus on specific time periods
- Eval Filter: Show only test conversations
Viewing Conversation Details
Click any conversation to see:
- Complete message history
- User questions and AI responses
- Block outputs (for sequential/supervisor)
- Citations shown
- Annotations provided
Creating Evaluation Sets
Evaluation sets (Eval Sets) are collections of test cases used to systematically measure prompt quality.
What’s in an Eval Set?
Each eval set contains:
| Component | Description | Required |
|---|---|---|
| Name | Descriptive identifier | Yes |
| Input | Test question/query | Yes |
| Expected Output | Desired response (annotation) | Recommended |
Method 1: Create from Annotations
Convert quality feedback into test cases:
- Go to Monitor → All Annotations tab
- Check boxes next to high-quality annotations
- Click Save to Eval Set
- Name your eval set
- Click Create
Best annotations for eval sets:
- Clear positive examples (what good looks like)
- Clear negative examples (what to avoid)
- Edge cases users actually encountered
- Common question patterns
Method 2: Upload CSV
Import test cases from a spreadsheet:
- Go to Eval tab on your prompt
- Click Upload CSV
- Select your file with columns:
  - `question` or `input`: the test query
  - `result` or `annotation`: the expected response
- Name the eval set
- Click Upload
CSV Format Example:
```
question,result
"What is your return policy?","We accept returns within 30 days of purchase..."
"How do I reset my password?","Click 'Forgot Password' on the login page..."
"What are your business hours?","We're open Monday-Friday 9am-5pm EST..."
```
Method 3: Manual Entry
Add test cases one at a time:
- Go to Eval tab
- Click Add Test Case
- Enter the input question
- Enter the expected output
- Save
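If your test cases already live in code or another system, the upload file for Method 2 can be generated with Python's standard `csv` module, which handles the quoting around commas automatically (the file name and cases here are illustrative):

```python
import csv

# Illustrative test cases; the header must use column names the
# uploader recognizes ("question"/"input" and "result"/"annotation").
cases = [
    ("What is your return policy?",
     "We accept returns within 30 days of purchase..."),
    ("How do I reset my password?",
     "Click 'Forgot Password' on the login page..."),
]

with open("eval_set.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["question", "result"])  # header row
    writer.writerows(cases)                  # quoted automatically
```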
Organizing Eval Sets
Create multiple eval sets for different purposes:
| Eval Set Name | Purpose |
|---|---|
| "Core FAQs" | Basic functionality testing |
| "Edge Cases" | Unusual or difficult queries |
| "Regression Tests" | Ensure fixes don’t break existing behavior |
| "New Feature Tests" | Validate new capabilities |
Running Evaluations
The Evaluation Process
```
Eval Set (Test Cases)
          ↓
┌───────────────────────────────────┐
│      Select Prompt Version        │
│    (v1, v2, production, etc.)     │
└─────────────────┬─────────────────┘
                  ↓
┌───────────────────────────────────┐
│        Run Each Test Case         │
│     Input → Prompt → Response     │
└─────────────────┬─────────────────┘
                  ↓
┌───────────────────────────────────┐
│           Store Results           │
│   Save all responses for review   │
└─────────────────┬─────────────────┘
                  ↓
┌───────────────────────────────────┐
│         Compare & Analyze         │
│    Review outputs vs expected     │
└───────────────────────────────────┘
```
Running an Evaluation
- Go to your prompt’s Eval tab
- Select an Eval Set from the dropdown
- Select the Version to test
- Click Run Eval
- Wait for all test cases to complete
- Review results in the table
Understanding Results
After running, you’ll see:
| Column | Description |
|---|---|
| Input | The test question |
| Expected Output | What you wanted |
| Eval Response | What the AI actually said |
| Version | Which version was tested |
Comparing Versions
Test multiple versions to find the best:
- Run eval with Version 1
- Make prompt improvements → Create Version 2
- Run eval with Version 2
- Compare results in the Runs tab
The Runs tab shows historical comparisons:
| Eval Set | Version | Date | Count | Avg Score | Pass Rate |
|---|---|---|---|---|---|
| Core FAQs | v1 | Dec 28 | 25 | 3.2 | 60% |
| Core FAQs | v2 | Dec 29 | 25 | 4.1 | 84% |
| Core FAQs | v3 | Dec 30 | 25 | 4.5 | 92% |
![Screenshot: Runs Comparison]
Using AI Judge
AI Judge automatically evaluates response quality using another AI model.
What AI Judge Does
Instead of manually reviewing every response, AI Judge:
- Reads the input question
- Reads the expected output
- Reads the actual response
- Scores quality (1-5 scale)
- Explains its reasoning
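Conceptually, the judge receives all three pieces in a single grading prompt and replies with a score plus reasoning. A minimal sketch of that flow; the template and parser below are illustrative, not PromptOwl's actual judge prompt:

```python
import re

# Hypothetical grading-prompt template (not PromptOwl's actual one).
JUDGE_TEMPLATE = """You are grading an AI response on a 1-5 scale.

Question: {question}
Expected output: {expected}
Actual response: {actual}

Reply with "Score: <1-5>" followed by your reasoning."""

def parse_judge_reply(reply):
    """Pull the numeric 1-5 score out of a judge reply."""
    match = re.search(r"Score:\s*([1-5])", reply)
    if not match:
        raise ValueError("no score found in judge reply")
    return int(match.group(1))

prompt = JUDGE_TEMPLATE.format(
    question="What is your return policy?",
    expected="We accept returns within 30 days...",
    actual="Returns are accepted within 30 days of purchase.",
)
score = parse_judge_reply("Score: 4\nReasoning: Accurate, but omits the refund method.")
```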
Judge Evaluation Criteria
The AI Judge evaluates based on:
| Criterion | Description |
|---|---|
| Accuracy | Does the response match expected output? |
| Completeness | Does it cover all necessary points? |
| Quality | Is it clear, helpful, and well-structured? |
Scoring Scale
| Score | Meaning | Description |
|---|---|---|
| 5 | Excellent | Exceeds expectations |
| 4 | Good | Meets expectations well |
| 3 | Average | Acceptable, room for improvement |
| 2 | Below Average | Missing key elements |
| 1 | Poor | Significantly wrong or unhelpful |
Running Judge Evaluation
- First, run a regular evaluation
- Click Run Judge
- Wait for judge to score all results
- Review scores and reasoning
![Screenshot: Run Judge Button]
Understanding Judge Output
After judging, you’ll see:
| Column | Description |
|---|---|
| Judge | AI’s reasoning for the score |
| Score | Numeric rating (1-5) |
Example Judge Output:
```
Score: 4
Reasoning: The response accurately addresses the return policy question
and includes the 30-day window. It could be improved by mentioning
the refund method (original payment method). Overall, a good response
that covers the essential information.
```
Aggregate Metrics
Judge evaluation calculates:
- Average Score: Mean of all test case scores
- Pass Rate: Percentage scoring 3 or above
These appear in the Runs tab for easy comparison across versions.
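Both numbers are simple aggregates over the per-case scores; for example:

```python
# The two aggregate judge metrics: a case "passes" when it
# scores 3 or above, per the scale described above.
def judge_aggregates(scores):
    avg = sum(scores) / len(scores)
    pass_rate = sum(1 for s in scores if s >= 3) / len(scores)
    return avg, pass_rate

avg, pass_rate = judge_aggregates([5, 4, 3, 2, 4])
```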
AI-Assisted Improvement
PromptOwl can use AI to suggest improvements based on real conversations and feedback.
Accessing Improve with AI
- Open your prompt in edit mode
- Click Improve with AI button
- The improvement dialog opens
![Screenshot: Improve Button]
Improvement Sources
The AI can analyze:
1. Example Conversations
Paste or auto-load real conversations:
```
User: "How do I cancel my subscription?"
Assistant: "To cancel, go to Settings > Billing > Cancel Plan."
User: "But I don't see a Billing option"
Assistant: "The Billing option is under Account Settings..."
```
2. User Feedback
Your specific improvement requests:
```
Feedback: "The agent takes too many turns to answer simple questions.
Should provide complete answers in the first response.
Also needs to be more empathetic when users are frustrated."
```
3. Annotations
Auto-loaded feedback from production:
- Message-level annotations
- Conversation-level notes
- Sentiment data
How Improvement Works
```
                  IMPROVEMENT PROCESS

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Current   │     │   Example   │     │    User     │
│   Prompt    │  +  │ Conversation│  +  │  Feedback   │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │
       └───────────────────┼───────────────────┘
                           ↓
              ┌─────────────────────────┐
              │   AI Analysis Engine    │
              │ (Prompt-type specific)  │
              └────────────┬────────────┘
                           ↓
              ┌─────────────────────────┐
              │   Improved Variations   │
              │  + Change Explanations  │
              └─────────────────────────┘
```
Prompt-Type Specific Improvements
AI understands your prompt type and optimizes accordingly:
Simple Prompts
Focus areas:
- Clarity of instructions
- Response format consistency
- Tone and personality
- Edge case handling
Sequential Prompts
Focus areas:
- Block efficiency and necessity
- Data flow between blocks
- Inter-step coordination
- Output quality at each stage
Supervisor Prompts
Focus areas:
- Task routing logic
- Agent specialization clarity
- Delegation decisions
- Response synthesis
Applying Improvements
Improvements are applied non-destructively:
- Block Names: Updated if AI suggests better names
- Content: Saved as new variations (original preserved)
- Testing: A/B test original vs improved
![Screenshot: Improvement Results]
Improvement-to-Evaluation Loop
After applying improvements:
- New variations are created
- Run evaluation against the new version
- Compare scores with previous version
- If improved, publish; if not, iterate
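The loop above can be sketched in code. The `run_eval`, `improve`, and `publish` helpers here are placeholders for the manual UI steps, not a real PromptOwl API:

```python
# Sketch of the improvement-to-evaluation loop with placeholder
# helpers standing in for the UI actions described above.
def improvement_loop(baseline_version, run_eval, improve, publish, max_rounds=3):
    best_version, best_score = baseline_version, run_eval(baseline_version)
    for _ in range(max_rounds):
        candidate = improve(best_version)
        score = run_eval(candidate)
        if score > best_score:          # improved: keep and publish
            best_version, best_score = candidate, score
            publish(candidate)
        # otherwise iterate again from the current best version
    return best_version, best_score

# Toy stand-ins: each "improvement" bumps the version number, and the
# fake eval scores mirror the Runs table shown earlier.
scores = {"v1": 3.2, "v2": 4.1, "v3": 4.5, "v4": 4.4}
published = []
best, score = improvement_loop(
    "v1",
    run_eval=lambda v: scores[v],
    improve=lambda v: "v" + str(int(v[1:]) + 1),
    publish=published.append,
)
```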
Best Practices
Annotation Best Practices
Collecting Effective Feedback
| Practice | Why It Matters |
|---|---|
| Annotate immediately | Details are fresh |
| Be specific | Vague feedback isn’t actionable |
| Include examples | Show what you expected |
| Note patterns | Same issue multiple times = priority |
Building Annotation Culture
- Train team on annotation guidelines
- Celebrate improvements from feedback
- Share “before/after” examples
- Make annotation part of workflow
Monitoring Best Practices
What to Monitor Daily
- Conversation volume trends
- Negative sentiment spikes
- Unusual error patterns
- Token usage anomalies
What to Review Weekly
- Aggregate satisfaction scores
- Common failure patterns
- Top annotation themes
- Version performance comparison
Setting Up Alerts
Consider monitoring for:
- Satisfaction score drops below threshold
- Unusual volume changes
- High negative annotation rates
Evaluation Best Practices
Building Quality Eval Sets
| Guideline | Implementation |
|---|---|
| Cover common cases | 70% should be typical questions |
| Include edge cases | 20% should test boundaries |
| Add failure scenarios | 10% should test error handling |
| Update regularly | Add new cases from production |
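One way to sanity-check an eval set against the 70/20/10 mix above, assuming each case has been tagged with a hypothetical `category` field (not a built-in PromptOwl attribute):

```python
from collections import Counter

# Target mix from the guidelines above.
TARGETS = {"common": 0.70, "edge": 0.20, "failure": 0.10}

def composition(cases):
    """Fraction of cases in each target category."""
    counts = Counter(case["category"] for case in cases)
    total = len(cases)
    return {cat: counts.get(cat, 0) / total for cat in TARGETS}

cases = ([{"category": "common"}] * 7
         + [{"category": "edge"}] * 2
         + [{"category": "failure"}])
mix = composition(cases)
```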
Running Effective Evaluations
- Baseline first: Eval current version before changes
- One variable: Change one thing, then eval
- Statistical significance: Use enough test cases (25+)
- Judge consistently: Use same judge prompt
Interpreting Results
| Metric | Target | Action if Below |
|---|---|---|
| Pass Rate | >80% | Review failing cases |
| Avg Score | >4.0 | Focus on low scorers |
| Consistency | Low variance | Investigate outliers |
Improvement Best Practices
When to Use AI Improvement
- After identifying patterns in annotations
- When manual tweaking isn’t working
- To get a fresh perspective on the prompt
- When scaling improvement efforts
What to Include in Feedback
Helpful feedback:
```
"Users are asking about refunds but the agent only mentions returns.
Need to distinguish between refund (money back) and return (exchange).
Should ask clarifying question if unclear which they want."
```
Not helpful:
```
"Make it better"
```
Validating Improvements
- Don’t trust blindly: AI suggestions need human review
- Test systematically: Run eval before publishing
- Start with one change: Don’t apply all suggestions at once
- Monitor after publish: Watch for regressions
Continuous Improvement Workflow
Daily
- Check Monitor for new conversations
- Review any negative annotations
- Note patterns for later
Weekly
- Review aggregate metrics
- Analyze annotation themes
- Prioritize improvements needed
- Run eval on current version
Monthly
- Add new test cases from production
- Compare version performance trends
- Archive outdated eval sets
- Document what’s been learned
Troubleshooting
Annotations not appearing
- Check annotation feature is enabled (Enterprise Settings)
- Verify user has permission to annotate
- Refresh the Monitor view
- Check correct prompt is selected
Evaluation failing
- Verify eval set has test cases
- Check prompt version exists
- Ensure API keys are configured
- Try running single test case first
Judge scores seem wrong
- Review judge prompt (uses production version)
- Check expected outputs are realistic
- Verify input/output alignment in eval set
- Consider adjusting pass threshold
AI improvement suggestions unhelpful
- Provide more specific feedback
- Include actual conversation examples
- Describe what good looks like
- Try with different conversation samples
Metrics not updating
- Refresh the page
- Check date range filters
- Verify conversations are completed
- Allow time for aggregation
Quick Reference
Keyboard Shortcuts
| Action | Shortcut |
|---|---|
| Refresh Monitor | Ctrl/Cmd + R |
| Search | Ctrl/Cmd + F |
| Save annotation | Enter (in modal) |
Key Metrics Glossary
| Metric | Formula | Good Target |
|---|---|---|
| Pass Rate | (Scores ≥ 3) / Total | >80% |
| Avg Score | Sum(Scores) / Count | >4.0 |
| Satisfaction | Positive / (Positive + Negative) | >85% |
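The satisfaction formula from the table, expressed as a function (neutral ratings are excluded from the denominator, per the formula above):

```python
def satisfaction(positive, negative):
    """Satisfaction = positive / (positive + negative), as a percentage."""
    rated = positive + negative
    if rated == 0:
        return None  # no rated conversations yet
    return 100 * positive / rated

pct = satisfaction(positive=90, negative=10)
```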
Eval Set Size Guidelines
| Purpose | Recommended Size |
|---|---|
| Quick check | 10-15 cases |
| Standard eval | 25-50 cases |
| Comprehensive | 100+ cases |