Monitoring, Evaluation, and Continuous Improvement
This guide explains how to monitor your agents’ performance in PromptOwl, collect and analyze feedback, run systematic evaluations, and use AI to continuously improve your prompts.
Table of Contents
- The Improvement Lifecycle
- Collecting Quality Feedback
- Monitoring Performance
- Creating Evaluation Sets
- Running Evaluations
- Using AI Judge
- AI-Assisted Improvement
- Best Practices
The Improvement Lifecycle
Continuous improvement follows a cyclical process:
```
┌──────────┐     ┌──────────┐     ┌──────────┐
│  DEPLOY  │────▶│ MONITOR  │────▶│ COLLECT  │
│  Agent   │     │  Usage   │     │ Feedback │
└──────────┘     └──────────┘     └──────────┘
     ▲                                  │
     │                                  ▼
┌──────────┐     ┌──────────┐     ┌──────────┐
│ PUBLISH  │◀────│ IMPROVE  │◀────│ EVALUATE │
│  Update  │     │  Prompt  │     │ Quality  │
└──────────┘     └──────────┘     └──────────┘
```
The Six Steps
| Step | Action | Purpose |
|---|---|---|
| 1. Deploy | Publish your agent to production | Make available to users |
| 2. Monitor | Track usage, tokens, conversations | Understand real-world behavior |
| 3. Collect | Gather annotations and feedback | Identify improvement areas |
| 4. Evaluate | Test systematically with eval sets | Measure quality objectively |
| 5. Improve | Use AI to refine prompts | Generate better versions |
| 6. Publish | Release improvements | Complete the cycle |
Collecting Quality Feedback
Types of Feedback
PromptOwl supports two levels of feedback:
1. Message-Level Annotations
Feedback on individual AI responses:
- When to use: Rating specific answers
- Components: Sentiment (thumbs up/down) + detailed text
- Best for: Identifying specific issues
```
User: "What's your return policy?"
AI: "Our return policy allows returns within 30 days..."
[👍 Thumbs Up]
Annotation: "Accurate and complete answer. Good tone."
```
2. Conversation-Level Annotations
Feedback on entire conversations:
- When to use: Rating overall experience
- Components: Sentiment + summary feedback
- Best for: Holistic assessment
```
Overall Conversation Rating: [👎 Thumbs Down]
Annotation: "Agent was helpful but took too many turns to resolve the issue."
```
Encouraging Quality Feedback
For Internal Teams
Train your team to provide actionable annotations:
Good annotation:
“The response was accurate but too technical for our customer audience. Should use simpler language and avoid jargon like ‘API endpoint’.”
Poor annotation:
“Bad response”
Annotation Guidelines
| Do | Don’t |
|---|---|
| Be specific about what’s wrong | Use vague terms like “bad” or “wrong” |
| Suggest how to improve | Just criticize without direction |
| Note what worked well | Only focus on negatives |
| Include context if relevant | Assume reader knows the situation |
Sentiment Best Practices
Use sentiment consistently:
| Sentiment | When to Use |
|---|---|
| 👍 Positive | Response was helpful, accurate, appropriate |
| 👎 Negative | Response was wrong, unhelpful, inappropriate |
| — Neutral | Response was acceptable but could be better |
Monitoring Performance
Accessing the Monitor
- Open your prompt from the Dashboard
- Click Monitor in the top navigation
- View the conversation history and analytics
Monitor Interface Overview
The Monitor has two main tabs:
History Tab
Shows all conversations with your agent:
| Column | Description |
|---|---|
| User | Who had the conversation |
| Start Time | When it began |
| Duration | How long it lasted |
| Source | Where it came from |
| Topic | Conversation subject |
All Annotations Tab
Aggregates all feedback:
| Column | Description |
|---|---|
| User | Who provided feedback |
| Question | What was asked |
| Response | What the AI answered |
| Annotation | The feedback text |
| Sentiment | Thumbs up/down/neutral |
| Date | When feedback was given |
Key Metrics to Track
Conversation Metrics
- Total Conversations: Overall usage volume
- Average Duration: How long users engage
- Messages per Conversation: Conversation depth
Quality Metrics
- Satisfaction Score: Average sentiment (0-100)
- Positive Rate: Percentage of thumbs up
- Annotation Volume: How much feedback collected
Usage Metrics
- Total Tokens Used: API consumption
- Model Distribution: Which models are used
- Peak Usage Times: When most active
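These metrics can be derived from raw conversation records. Below is a minimal sketch; the record shape (`messages`, `tokens`, `sentiment` fields) is illustrative, not PromptOwl's actual export format:

```python
# Aggregate the conversation, quality, and usage metrics listed above
# from raw conversation records (illustrative record shape).
def aggregate_metrics(conversations):
    total = len(conversations)
    messages = sum(len(c["messages"]) for c in conversations)
    tokens = sum(c["tokens"] for c in conversations)
    positive = sum(1 for c in conversations if c["sentiment"] == "positive")
    negative = sum(1 for c in conversations if c["sentiment"] == "negative")
    rated = positive + negative
    return {
        "total_conversations": total,
        "messages_per_conversation": messages / total if total else 0.0,
        "total_tokens": tokens,
        "positive_rate": positive / rated if rated else None,
    }

convs = [
    {"messages": ["q", "a"], "tokens": 120, "sentiment": "positive"},
    {"messages": ["q", "a", "q", "a"], "tokens": 310, "sentiment": "negative"},
    {"messages": ["q", "a"], "tokens": 95, "sentiment": "positive"},
]
metrics = aggregate_metrics(convs)
```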
Filtering and Search
Find specific conversations:
- Search: Type to filter by user, topic, or content
- Date Range: Focus on specific time periods
- Eval Filter: Show only test conversations
Viewing Conversation Details
Click any conversation to see:
- Complete message history
- User questions and AI responses
- Block outputs (for sequential/supervisor)
- Citations shown
- Annotations provided
Creating Evaluation Sets
Evaluation sets (Eval Sets) are collections of test cases used to systematically measure prompt quality.
What’s in an Eval Set?
Each eval set contains:
| Component | Description | Required |
|---|---|---|
| Name | Descriptive identifier | Yes |
| Input | Test question/query | Yes |
| Expected Output | Desired response (annotation) | Recommended |
Method 1: Create from Annotations
Convert quality feedback into test cases:
- Go to Monitor → All Annotations tab
- Check boxes next to high-quality annotations
- Click Save to Eval Set
- Name your eval set
- Click Create
Best annotations for eval sets:
- Clear positive examples (what good looks like)
- Clear negative examples (what to avoid)
- Edge cases users actually encountered
- Common question patterns
Method 2: Upload CSV
Import test cases from a spreadsheet:
- Go to Eval tab on your prompt
- Click Upload CSV
- Select your file with columns:
  - `question` or `input`: the test query
  - `result` or `annotation`: the expected response
- Name the eval set
- Click Upload
CSV Format Example:
```
question,result
"What is your return policy?","We accept returns within 30 days of purchase..."
"How do I reset my password?","Click 'Forgot Password' on the login page..."
"What are your business hours?","We're open Monday-Friday 9am-5pm EST..."
```
Method 3: Manual Entry
Add test cases one at a time:
- Go to Eval tab
- Click Add Test Case
- Enter the input question
- Enter the expected output
- Save
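If your test cases already live in code or another system, the upload file for Method 2 can be generated with Python's standard `csv` module, which handles the quoting around commas automatically (the file name and cases here are illustrative):

```python
import csv

# Illustrative test cases; the header must use column names the
# uploader recognizes ("question"/"input" and "result"/"annotation").
cases = [
    ("What is your return policy?",
     "We accept returns within 30 days of purchase..."),
    ("How do I reset my password?",
     "Click 'Forgot Password' on the login page..."),
]

with open("eval_set.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["question", "result"])  # header row
    writer.writerows(cases)                  # quoted automatically
```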
Organizing Eval Sets
Create multiple eval sets for different purposes:
| Eval Set Name | Purpose |
|---|---|
| "Core FAQs" | Basic functionality testing |
| "Edge Cases" | Unusual or difficult queries |
| "Regression Tests" | Ensure fixes don’t break existing behavior |
| "New Feature Tests" | Validate new capabilities |
Running Evaluations
The Evaluation Process
```
Eval Set (Test Cases)
          ↓
┌───────────────────────────────────┐
│      Select Prompt Version        │
│    (v1, v2, production, etc.)     │
└─────────────────┬─────────────────┘
                  ↓
┌───────────────────────────────────┐
│        Run Each Test Case         │
│     Input → Prompt → Response     │
└─────────────────┬─────────────────┘
                  ↓
┌───────────────────────────────────┐
│           Store Results           │
│   Save all responses for review   │
└─────────────────┬─────────────────┘
                  ↓
┌───────────────────────────────────┐
│         Compare & Analyze         │
│    Review outputs vs expected     │
└───────────────────────────────────┘
```
Running an Evaluation
- Go to your prompt’s Eval tab
- Select an Eval Set from the dropdown
- Select the Version to test
- Click Run Eval
- Wait for all test cases to complete
- Review results in the table
Understanding Results
After running, you’ll see:
| Column | Description |
|---|---|
| Input | The test question |
| Expected Output | What you wanted |
| Eval Response | What the AI actually said |
| Version | Which version was tested |
Comparing Versions
Test multiple versions to find the best:
- Run eval with Version 1
- Make prompt improvements → Create Version 2
- Run eval with Version 2
- Compare results in the Runs tab
The Runs tab shows historical comparisons:
| Eval Set | Version | Date | Count | Avg Score | Pass Rate |
|---|---|---|---|---|---|
| Core FAQs | v1 | Dec 28 | 25 | 3.2 | 60% |
| Core FAQs | v2 | Dec 29 | 25 | 4.1 | 84% |
| Core FAQs | v3 | Dec 30 | 25 | 4.5 | 92% |
![Screenshot: Runs Comparison]
Using AI Judge
AI Judge automatically evaluates response quality using another AI model.
What AI Judge Does
Instead of manually reviewing every response, AI Judge:
- Reads the input question
- Reads the expected output
- Reads the actual response
- Scores quality (1-5 scale)
- Explains its reasoning
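Conceptually, the judge receives all three pieces in a single grading prompt and replies with a score plus reasoning. A minimal sketch of that flow; the template and parser below are illustrative, not PromptOwl's actual judge prompt:

```python
import re

# Hypothetical grading-prompt template (not PromptOwl's actual one).
JUDGE_TEMPLATE = """You are grading an AI response on a 1-5 scale.

Question: {question}
Expected output: {expected}
Actual response: {actual}

Reply with "Score: <1-5>" followed by your reasoning."""

def parse_judge_reply(reply):
    """Pull the numeric 1-5 score out of a judge reply."""
    match = re.search(r"Score:\s*([1-5])", reply)
    if not match:
        raise ValueError("no score found in judge reply")
    return int(match.group(1))

prompt = JUDGE_TEMPLATE.format(
    question="What is your return policy?",
    expected="We accept returns within 30 days...",
    actual="Returns are accepted within 30 days of purchase.",
)
score = parse_judge_reply("Score: 4\nReasoning: Accurate, but omits the refund method.")
```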
Judge Evaluation Criteria
The AI Judge evaluates based on:
| Criterion | Description |
|---|---|
| Accuracy | Does the response match expected output? |
| Completeness | Does it cover all necessary points? |
| Quality | Is it clear, helpful, and well-structured? |
Scoring Scale
| Score | Meaning | Description |
|---|---|---|
| 5 | Excellent | Exceeds expectations |
| 4 | Good | Meets expectations well |
| 3 | Average | Acceptable, room for improvement |
| 2 | Below Average | Missing key elements |
| 1 | Poor | Significantly wrong or unhelpful |
Running Judge Evaluation
- First, run a regular evaluation
- Click Run Judge
- Wait for judge to score all results
- Review scores and reasoning
![Screenshot: Run Judge Button]
Understanding Judge Output
After judging, you’ll see:
| Column | Description |
|---|---|
| Judge | AI’s reasoning for the score |
| Score | Numeric rating (1-5) |
Example Judge Output:
```
Score: 4
Reasoning: The response accurately addresses the return policy question
and includes the 30-day window. It could be improved by mentioning
the refund method (original payment method). Overall, a good response
that covers the essential information.
```
Aggregate Metrics
Judge evaluation calculates:
- Average Score: Mean of all test case scores
- Pass Rate: Percentage scoring 3 or above
These appear in the Runs tab for easy comparison across versions.
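Both numbers are simple aggregates over the per-case scores; for example:

```python
# The two aggregate judge metrics: a case "passes" when it
# scores 3 or above, per the scale described above.
def judge_aggregates(scores):
    avg = sum(scores) / len(scores)
    pass_rate = sum(1 for s in scores if s >= 3) / len(scores)
    return avg, pass_rate

avg, pass_rate = judge_aggregates([5, 4, 3, 2, 4])
```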
AI-Assisted Improvement
PromptOwl can use AI to suggest improvements based on real conversations and feedback.
Accessing Improve with AI
- Open your prompt in edit mode
- Click Improve with AI button
- The improvement dialog opens
![Screenshot: Improve Button]
Improvement Sources
The AI can analyze:
1. Example Conversations
Paste or auto-load real conversations:
```
User: "How do I cancel my subscription?"
Assistant: "To cancel, go to Settings > Billing > Cancel Plan."
User: "But I don't see a Billing option"
Assistant: "The Billing option is under Account Settings..."
```
2. User Feedback
Your specific improvement requests:
```
Feedback: "The agent takes too many turns to answer simple questions.
Should provide complete answers in the first response.
Also needs to be more empathetic when users are frustrated."
```
3. Annotations
Auto-loaded feedback from production:
- Message-level annotations
- Conversation-level notes
- Sentiment data
How Improvement Works
```
                  IMPROVEMENT PROCESS

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Current   │     │   Example   │     │    User     │
│   Prompt    │  +  │ Conversation│  +  │  Feedback   │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │
       └───────────────────┼───────────────────┘
                           ↓
              ┌─────────────────────────┐
              │   AI Analysis Engine    │
              │ (Prompt-type specific)  │
              └────────────┬────────────┘
                           ↓
              ┌─────────────────────────┐
              │   Improved Variations   │
              │  + Change Explanations  │
              └─────────────────────────┘
```
Prompt-Type Specific Improvements
AI understands your prompt type and optimizes accordingly:
Simple Prompts
Focus areas:
- Clarity of instructions
- Response format consistency
- Tone and personality
- Edge case handling
Sequential Prompts
Focus areas:
- Block efficiency and necessity
- Data flow between blocks
- Inter-step coordination
- Output quality at each stage
Supervisor Prompts
Focus areas:
- Task routing logic
- Agent specialization clarity
- Delegation decisions
- Response synthesis
Applying Improvements
Improvements are applied non-destructively:
- Block Names: Updated if AI suggests better names
- Content: Saved as new variations (original preserved)
- Testing: A/B test original vs improved
![Screenshot: Improvement Results]
Improvement-to-Evaluation Loop
After applying improvements:
- New variations are created
- Run evaluation against the new version
- Compare scores with previous version
- If improved, publish; if not, iterate
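The loop above can be sketched in code. The `run_eval`, `improve`, and `publish` helpers here are placeholders for the manual UI steps, not a real PromptOwl API:

```python
# Sketch of the improvement-to-evaluation loop with placeholder
# helpers standing in for the UI actions described above.
def improvement_loop(baseline_version, run_eval, improve, publish, max_rounds=3):
    best_version, best_score = baseline_version, run_eval(baseline_version)
    for _ in range(max_rounds):
        candidate = improve(best_version)
        score = run_eval(candidate)
        if score > best_score:          # improved: keep and publish
            best_version, best_score = candidate, score
            publish(candidate)
        # otherwise iterate again from the current best version
    return best_version, best_score

# Toy stand-ins: each "improvement" bumps the version number, and the
# fake eval scores mirror the Runs table shown earlier.
scores = {"v1": 3.2, "v2": 4.1, "v3": 4.5, "v4": 4.4}
published = []
best, score = improvement_loop(
    "v1",
    run_eval=lambda v: scores[v],
    improve=lambda v: "v" + str(int(v[1:]) + 1),
    publish=published.append,
)
```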
Best Practices
Annotation Best Practices
Collecting Effective Feedback
| Practice | Why It Matters |
|---|---|
| Annotate immediately | Details are fresh |
| Be specific | Vague feedback isn’t actionable |
| Include examples | Show what you expected |
| Note patterns | Same issue multiple times = priority |
Building Annotation Culture
- Train team on annotation guidelines
- Celebrate improvements from feedback
- Share “before/after” examples
- Make annotation part of workflow
Monitoring Best Practices
What to Monitor Daily
- Conversation volume trends
- Negative sentiment spikes
- Unusual error patterns
- Token usage anomalies
What to Review Weekly
- Aggregate satisfaction scores
- Common failure patterns
- Top annotation themes
- Version performance comparison
Setting Up Alerts
Consider monitoring for:
- Satisfaction score drops below threshold
- Unusual volume changes
- High negative annotation rates
Evaluation Best Practices
Building Quality Eval Sets
| Guideline | Implementation |
|---|---|
| Cover common cases | 70% should be typical questions |
| Include edge cases | 20% should test boundaries |
| Add failure scenarios | 10% should test error handling |
| Update regularly | Add new cases from production |
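One way to sanity-check an eval set against the 70/20/10 mix above, assuming each case has been tagged with a hypothetical `category` field (not a built-in PromptOwl attribute):

```python
from collections import Counter

# Target mix from the guidelines above.
TARGETS = {"common": 0.70, "edge": 0.20, "failure": 0.10}

def composition(cases):
    """Fraction of cases in each target category."""
    counts = Counter(case["category"] for case in cases)
    total = len(cases)
    return {cat: counts.get(cat, 0) / total for cat in TARGETS}

cases = ([{"category": "common"}] * 7
         + [{"category": "edge"}] * 2
         + [{"category": "failure"}])
mix = composition(cases)
```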
Running Effective Evaluations
- Baseline first: Eval current version before changes
- One variable: Change one thing, then eval
- Statistical significance: Use enough test cases (25+)
- Judge consistently: Use same judge prompt
Interpreting Results
| Metric | Target | Action if Below |
|---|---|---|
| Pass Rate | >80% | Review failing cases |
| Avg Score | >4.0 | Focus on low scorers |
| Consistency | Low variance | Investigate outliers |
Improvement Best Practices
When to Use AI Improvement
- After identifying patterns in annotations
- When manual tweaking isn’t working
- To get a fresh perspective on the prompt
- When scaling improvement efforts
What to Include in Feedback
Helpful feedback:
```
"Users are asking about refunds but the agent only mentions returns.
Need to distinguish between refund (money back) and return (exchange).
Should ask clarifying question if unclear which they want."
```
Not helpful:
```
"Make it better"
```
Validating Improvements
- Don’t trust blindly: AI suggestions need human review
- Test systematically: Run eval before publishing
- Start with one change: Don’t apply all suggestions at once
- Monitor after publish: Watch for regressions
Continuous Improvement Workflow
Daily
- Check Monitor for new conversations
- Review any negative annotations
- Note patterns for later
Weekly
- Review aggregate metrics
- Analyze annotation themes
- Prioritize improvements needed
- Run eval on current version
Monthly
- Add new test cases from production
- Compare version performance trends
- Archive outdated eval sets
- Document what’s been learned
Troubleshooting
Annotations not appearing
- Check annotation feature is enabled (Enterprise Settings)
- Verify user has permission to annotate
- Refresh the Monitor view
- Check correct prompt is selected
Evaluation failing
- Verify eval set has test cases
- Check prompt version exists
- Ensure API keys are configured
- Try running single test case first
Judge scores seem wrong
- Review judge prompt (uses production version)
- Check expected outputs are realistic
- Verify input/output alignment in eval set
- Consider adjusting pass threshold
AI improvement suggestions unhelpful
- Provide more specific feedback
- Include actual conversation examples
- Describe what good looks like
- Try with different conversation samples
Metrics not updating
- Refresh the page
- Check date range filters
- Verify conversations are completed
- Allow time for aggregation
Quick Reference
Keyboard Shortcuts
| Action | Shortcut |
|---|---|
| Refresh Monitor | Ctrl/Cmd + R |
| Search | Ctrl/Cmd + F |
| Save annotation | Enter (in modal) |
Key Metrics Glossary
| Metric | Formula | Good Target |
|---|---|---|
| Pass Rate | (Scores ≥ 3) / Total | >80% |
| Avg Score | Sum(Scores) / Count | >4.0 |
| Satisfaction | Positive / (Positive + Negative) | >85% |
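The satisfaction formula from the table, expressed as a function (neutral ratings are excluded from the denominator, per the formula above):

```python
def satisfaction(positive, negative):
    """Satisfaction = positive / (positive + negative), as a percentage."""
    rated = positive + negative
    if rated == 0:
        return None  # no rated conversations yet
    return 100 * positive / rated

pct = satisfaction(positive=90, negative=10)
```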
Eval Set Size Guidelines
| Purpose | Recommended Size |
|---|---|
| Quick check | 10-15 cases |
| Standard eval | 25-50 cases |
| Comprehensive | 100+ cases |