
Monitoring, Evaluation, and Continuous Improvement

This guide explains how to monitor your agents' performance in PromptOwl, collect and analyze feedback, run systematic evaluations, and use AI to continuously improve your prompts.


Table of Contents

  1. The Improvement Lifecycle
  2. Collecting Quality Feedback
  3. Monitoring Performance
  4. Creating Evaluation Sets
  5. Running Evaluations
  6. Using AI Judge
  7. AI-Assisted Improvement
  8. Best Practices

The Improvement Lifecycle

Continuous improvement follows a cyclical process:

```
┌──────────┐     ┌──────────┐     ┌──────────┐
│  DEPLOY  │────▶│ MONITOR  │────▶│ COLLECT  │
│  Agent   │     │  Usage   │     │ Feedback │
└──────────┘     └──────────┘     └──────────┘
      ▲                                 │
      │                                 ▼
┌──────────┐     ┌──────────┐     ┌──────────┐
│ PUBLISH  │◀────│ IMPROVE  │◀────│ EVALUATE │
│  Update  │     │  Prompt  │     │ Quality  │
└──────────┘     └──────────┘     └──────────┘
```

The Six Steps

| Step | Action | Purpose |
|------|--------|---------|
| 1. Deploy | Publish your agent to production | Make it available to users |
| 2. Monitor | Track usage, tokens, conversations | Understand real-world behavior |
| 3. Collect | Gather annotations and feedback | Identify improvement areas |
| 4. Evaluate | Test systematically with eval sets | Measure quality objectively |
| 5. Improve | Use AI to refine prompts | Generate better versions |
| 6. Publish | Release improvements | Complete the cycle |

Collecting Quality Feedback

Types of Feedback

PromptOwl supports two levels of feedback:

1. Message-Level Annotations

Feedback on individual AI responses:

  • When to use: Rating specific answers
  • Components: Sentiment (thumbs up/down) + detailed text
  • Best for: Identifying specific issues
```
User: "What's your return policy?"
AI: "Our return policy allows returns within 30 days..."

[👍 Thumbs Up]
Annotation: "Accurate and complete answer. Good tone."
```

2. Conversation-Level Annotations

Feedback on entire conversations:

  • When to use: Rating overall experience
  • Components: Sentiment + summary feedback
  • Best for: Holistic assessment
```
Overall Conversation Rating: [👎 Thumbs Down]
Annotation: "Agent was helpful but took too many turns to resolve the issue."
```

Encouraging Quality Feedback

For Internal Teams

Train your team to provide actionable annotations:

Good annotation:

“The response was accurate but too technical for our customer audience. Should use simpler language and avoid jargon like ‘API endpoint’.”

Poor annotation:

“Bad response”

Annotation Guidelines

| Do | Don't |
|----|-------|
| Be specific about what's wrong | Use vague terms like "bad" or "wrong" |
| Suggest how to improve | Just criticize without direction |
| Note what worked well | Only focus on negatives |
| Include context if relevant | Assume the reader knows the situation |

Sentiment Best Practices

Use sentiment consistently:

| Sentiment | When to Use |
|-----------|-------------|
| 👍 Positive | Response was helpful, accurate, appropriate |
| 👎 Negative | Response was wrong, unhelpful, inappropriate |
| — Neutral | Response was acceptable but could be better |

Monitoring Performance

Accessing the Monitor

  1. Open your prompt from the Dashboard
  2. Click Monitor in the top navigation
  3. View the conversation history and analytics

Monitor Interface Overview

The Monitor has two main tabs:

History Tab

Shows all conversations with your agent:

| Column | Description |
|--------|-------------|
| User | Who had the conversation |
| Start Time | When it began |
| Duration | How long it lasted |
| Source | Where it came from |
| Topic | Conversation subject |

All Annotations Tab

Aggregates all feedback:

| Column | Description |
|--------|-------------|
| User | Who provided feedback |
| Question | What was asked |
| Response | What the AI answered |
| Annotation | The feedback text |
| Sentiment | Thumbs up/down/neutral |
| Date | When feedback was given |

Key Metrics to Track

Conversation Metrics

  • Total Conversations: Overall usage volume
  • Average Duration: How long users engage
  • Messages per Conversation: Conversation depth

Quality Metrics

  • Satisfaction Score: Average sentiment (0-100)
  • Positive Rate: Percentage of thumbs up
  • Annotation Volume: How much feedback collected

Usage Metrics

  • Total Tokens Used: API consumption
  • Model Distribution: Which models are used
  • Peak Usage Times: When most active
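As a rough illustration, the quality metrics above can be derived from raw sentiment data. This is a minimal sketch assuming you export annotations with simple "positive"/"negative"/"neutral" labels; the record format is an assumption, not PromptOwl's actual export schema.

```python
# Sketch: deriving quality metrics from exported sentiment labels.
# The "positive"/"neutral"/"negative" strings are illustrative,
# not PromptOwl's actual export format.

def satisfaction_score(sentiments):
    """Average sentiment on a 0-100 scale (neutral counts as 50)."""
    values = {"positive": 100, "neutral": 50, "negative": 0}
    if not sentiments:
        return 0.0
    return sum(values[s] for s in sentiments) / len(sentiments)

def positive_rate(sentiments):
    """Percentage of thumbs-up among all rated responses."""
    if not sentiments:
        return 0.0
    return 100.0 * sentiments.count("positive") / len(sentiments)

feedback = ["positive", "positive", "negative", "neutral"]
print(satisfaction_score(feedback))  # 62.5
print(positive_rate(feedback))       # 50.0
```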

Filtering Conversations

Find specific conversations using:

  1. Search: Type to filter by user, topic, or content
  2. Date Range: Focus on specific time periods
  3. Eval Filter: Show only test conversations

Viewing Conversation Details

Click any conversation to see:

  • Complete message history
  • User questions and AI responses
  • Block outputs (for sequential/supervisor)
  • Citations shown
  • Annotations provided

Creating Evaluation Sets

Evaluation sets (Eval Sets) are collections of test cases used to systematically measure prompt quality.

What’s in an Eval Set?

Each eval set contains:

| Component | Description | Required |
|-----------|-------------|----------|
| Name | Descriptive identifier | Yes |
| Input | Test question/query | Yes |
| Expected Output | Desired response (annotation) | Recommended |

Method 1: Create from Annotations

Convert quality feedback into test cases:

  1. Go to MonitorAll Annotations tab
  2. Check boxes next to high-quality annotations
  3. Click Save to Eval Set
  4. Name your eval set
  5. Click Create

Best annotations for eval sets:

  • Clear positive examples (what good looks like)
  • Clear negative examples (what to avoid)
  • Edge cases users actually encountered
  • Common question patterns

Method 2: Upload CSV

Import test cases from a spreadsheet:

  1. Go to Eval tab on your prompt
  2. Click Upload CSV
  3. Select your file with columns:
    • question or input - The test query
    • result or annotation - Expected response
  4. Name the eval set
  5. Click Upload

CSV Format Example:

```
question,result
"What is your return policy?","We accept returns within 30 days of purchase..."
"How do I reset my password?","Click 'Forgot Password' on the login page..."
"What are your business hours?","We're open Monday-Friday 9am-5pm EST..."
```
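If you generate the file programmatically, Python's standard csv module handles the quoting of commas inside answers for you. A minimal sketch, assuming the question/result column convention above (the file name is arbitrary):

```python
import csv

# Sketch: writing an eval-set CSV in the question,result format shown above.
# The csv module quotes fields containing commas automatically.
test_cases = [
    ("What is your return policy?",
     "We accept returns within 30 days of purchase..."),
    ("How do I reset my password?",
     "Click 'Forgot Password' on the login page..."),
]

with open("eval_set.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["question", "result"])  # header row expected by the upload
    writer.writerows(test_cases)
```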

Method 3: Manual Entry

Add test cases one at a time:

  1. Go to Eval tab
  2. Click Add Test Case
  3. Enter the input question
  4. Enter the expected output
  5. Save

Organizing Eval Sets

Create multiple eval sets for different purposes:

| Eval Set Name | Purpose |
|---------------|---------|
| "Core FAQs" | Basic functionality testing |
| "Edge Cases" | Unusual or difficult queries |
| "Regression Tests" | Ensure fixes don't break existing behavior |
| "New Feature Tests" | Validate new capabilities |

Running Evaluations

The Evaluation Process

```
       Eval Set (Test Cases)
                  │
                  ▼
┌───────────────────────────────────┐
│       Select Prompt Version       │
│     (v1, v2, production, etc.)    │
└─────────────────┬─────────────────┘
                  ▼
┌───────────────────────────────────┐
│        Run Each Test Case         │
│     Input → Prompt → Response     │
└─────────────────┬─────────────────┘
                  ▼
┌───────────────────────────────────┐
│           Store Results           │
│   Save all responses for review   │
└─────────────────┬─────────────────┘
                  ▼
┌───────────────────────────────────┐
│         Compare & Analyze         │
│    Review outputs vs expected     │
└───────────────────────────────────┘
```

Running an Evaluation

  1. Go to your prompt’s Eval tab
  2. Select an Eval Set from the dropdown
  3. Select the Version to test
  4. Click Run Eval
  5. Wait for all test cases to complete
  6. Review results in the table

Understanding Results

After running, you’ll see:

| Column | Description |
|--------|-------------|
| Input | The test question |
| Expected Output | What you wanted |
| Eval Response | What the AI actually said |
| Version | Which version was tested |

Comparing Versions

Test multiple versions to find the best:

  1. Run eval with Version 1
  2. Make prompt improvements → Create Version 2
  3. Run eval with Version 2
  4. Compare results in the Runs tab

The Runs tab shows historical comparisons:

| Eval Set | Version | Date | Count | Avg Score | Pass Rate |
|----------|---------|------|-------|-----------|-----------|
| Core FAQs | v1 | Dec 28 | 25 | 3.2 | 60% |
| Core FAQs | v2 | Dec 29 | 25 | 4.1 | 84% |
| Core FAQs | v3 | Dec 30 | 25 | 4.5 | 92% |

![Screenshot: Runs Comparison]


Using AI Judge

AI Judge automatically evaluates response quality using another AI model.

What AI Judge Does

Instead of manually reviewing every response, AI Judge:

  1. Reads the input question
  2. Reads the expected output
  3. Reads the actual response
  4. Scores quality (1-5 scale)
  5. Explains its reasoning

Judge Evaluation Criteria

The AI Judge evaluates based on:

| Criterion | Description |
|-----------|-------------|
| Accuracy | Does the response match the expected output? |
| Completeness | Does it cover all necessary points? |
| Quality | Is it clear, helpful, and well-structured? |

Scoring Scale

| Score | Meaning | Description |
|-------|---------|-------------|
| 5 | Excellent | Exceeds expectations |
| 4 | Good | Meets expectations well |
| 3 | Average | Acceptable, room for improvement |
| 2 | Below Average | Missing key elements |
| 1 | Poor | Significantly wrong or unhelpful |

Running Judge Evaluation

  1. First, run a regular evaluation
  2. Click Run Judge
  3. Wait for judge to score all results
  4. Review scores and reasoning

![Screenshot: Run Judge Button]

Understanding Judge Output

After judging, you’ll see:

| Column | Description |
|--------|-------------|
| Judge | The AI's reasoning for the score |
| Score | Numeric rating (1-5) |

Example Judge Output:

```
Score: 4
Reasoning: The response accurately addresses the return policy question and
includes the 30-day window. It could be improved by mentioning the refund
method (original payment method). Overall, a good response that covers the
essential information.
```

Aggregate Metrics

Judge evaluation calculates:

  • Average Score: Mean of all test case scores
  • Pass Rate: Percentage scoring 3 or above

These appear in the Runs tab for easy comparison across versions.
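Both aggregates reduce to a few lines; this minimal Python sketch uses the 1-5 scale and the pass threshold of 3 described above:

```python
# Sketch: the aggregate metrics computed after a judge run.
# A "pass" is any test case scoring 3 or above on the 1-5 scale.

def average_score(scores):
    """Mean of all test-case scores."""
    return sum(scores) / len(scores)

def pass_rate(scores, threshold=3):
    """Percentage of test cases scoring at or above the threshold."""
    return 100.0 * sum(1 for s in scores if s >= threshold) / len(scores)

judge_scores = [5, 4, 3, 2, 4]
print(average_score(judge_scores))  # 3.6
print(pass_rate(judge_scores))      # 80.0
```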


AI-Assisted Improvement

PromptOwl can use AI to suggest improvements based on real conversations and feedback.

Accessing Improve with AI

  1. Open your prompt in edit mode
  2. Click Improve with AI button
  3. The improvement dialog opens

![Screenshot: Improve Button]

Improvement Sources

The AI can analyze:

1. Example Conversations

Paste or auto-load real conversations:

```
User: "How do I cancel my subscription?"
Assistant: "To cancel, go to Settings > Billing > Cancel Plan."
User: "But I don't see a Billing option"
Assistant: "The Billing option is under Account Settings..."
```

2. User Feedback

Your specific improvement requests:

```
Feedback: "The agent takes too many turns to answer simple questions. Should
provide complete answers in the first response. Also needs to be more
empathetic when users are frustrated."
```

3. Annotations

Auto-loaded feedback from production:

  • Message-level annotations
  • Conversation-level notes
  • Sentiment data

How Improvement Works

```
              IMPROVEMENT PROCESS

┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│   Current   │   │   Example   │   │    User     │
│   Prompt    │ + │ Conversation│ + │  Feedback   │
└──────┬──────┘   └──────┬──────┘   └──────┬──────┘
       └─────────────────┼─────────────────┘
                         ↓
            ┌─────────────────────────┐
            │   AI Analysis Engine    │
            │ (Prompt-type specific)  │
            └────────────┬────────────┘
                         ↓
            ┌─────────────────────────┐
            │  Improved Variations    │
            │ + Change Explanations   │
            └─────────────────────────┘
```

Prompt-Type Specific Improvements

AI understands your prompt type and optimizes accordingly:

Simple Prompts

Focus areas:

  • Clarity of instructions
  • Response format consistency
  • Tone and personality
  • Edge case handling

Sequential Prompts

Focus areas:

  • Block efficiency and necessity
  • Data flow between blocks
  • Inter-step coordination
  • Output quality at each stage

Supervisor Prompts

Focus areas:

  • Task routing logic
  • Agent specialization clarity
  • Delegation decisions
  • Response synthesis

Applying Improvements

Improvements are applied non-destructively:

  1. Block Names: Updated if AI suggests better names
  2. Content: Saved as new variations (original preserved)
  3. Testing: A/B test original vs improved

![Screenshot: Improvement Results]

Improvement-to-Evaluation Loop

After applying improvements:

  1. New variations are created
  2. Run evaluation against the new version
  3. Compare scores with previous version
  4. If improved, publish; if not, iterate
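The publish-or-iterate decision at the end of the loop can be sketched as a simple comparison. The thresholds below are illustrative, borrowed from the targets used elsewhere in this guide; they are not built-in PromptOwl behavior.

```python
# Sketch: a publish/iterate decision after re-running the eval.
# The 80% pass-rate floor is an example threshold, not a product default.

def should_publish(old_avg, new_avg, new_pass_rate, min_pass_rate=80.0):
    """Publish only if the new version scores higher AND clears the pass bar."""
    return new_avg > old_avg and new_pass_rate >= min_pass_rate

print(should_publish(old_avg=3.2, new_avg=4.1, new_pass_rate=84.0))  # True
print(should_publish(old_avg=4.1, new_avg=3.9, new_pass_rate=92.0))  # False
```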

Best Practices

Annotation Best Practices

Collecting Effective Feedback

| Practice | Why It Matters |
|----------|----------------|
| Annotate immediately | Details are fresh |
| Be specific | Vague feedback isn't actionable |
| Include examples | Show what you expected |
| Note patterns | The same issue multiple times = priority |

Building Annotation Culture

  • Train team on annotation guidelines
  • Celebrate improvements from feedback
  • Share “before/after” examples
  • Make annotation part of workflow

Monitoring Best Practices

What to Monitor Daily

  • Conversation volume trends
  • Negative sentiment spikes
  • Unusual error patterns
  • Token usage anomalies

What to Review Weekly

  • Aggregate satisfaction scores
  • Common failure patterns
  • Top annotation themes
  • Version performance comparison

Setting Up Alerts

Consider monitoring for:

  • Satisfaction score drops below threshold
  • Unusual volume changes
  • High negative annotation rates
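PromptOwl is not described here as having built-in alerting, so a check like the following assumes you export the metrics yourself and run it on a schedule; the field names and thresholds are examples only.

```python
# Sketch: a periodic alert check over self-exported metrics.
# The dict keys and thresholds are illustrative assumptions.

def check_alerts(metrics, satisfaction_floor=85.0, negative_rate_ceiling=20.0):
    """Return a list of alert messages for any threshold breaches."""
    alerts = []
    if metrics["satisfaction"] < satisfaction_floor:
        alerts.append(f"Satisfaction dropped to {metrics['satisfaction']}%")
    if metrics["negative_rate"] > negative_rate_ceiling:
        alerts.append(f"Negative annotation rate at {metrics['negative_rate']}%")
    return alerts

print(check_alerts({"satisfaction": 78.0, "negative_rate": 25.0}))
```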

Evaluation Best Practices

Building Quality Eval Sets

| Guideline | Implementation |
|-----------|----------------|
| Cover common cases | 70% should be typical questions |
| Include edge cases | 20% should test boundaries |
| Add failure scenarios | 10% should test error handling |
| Update regularly | Add new cases from production |

Running Effective Evaluations

  • Baseline first: Eval current version before changes
  • One variable: Change one thing, then eval
  • Statistical significance: Use enough test cases (25+)
  • Judge consistently: Use same judge prompt

Interpreting Results

| Metric | Target | Action if Below |
|--------|--------|-----------------|
| Pass Rate | >80% | Review failing cases |
| Avg Score | >4.0 | Focus on low scorers |
| Consistency | Low variance | Investigate outliers |
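The consistency check can be sketched with Python's statistics module. The standard-deviation cutoff of 1.0 is an illustrative choice, not a PromptOwl default.

```python
import statistics

# Sketch: spotting inconsistency and low scorers in a judge run.
# The max_stdev cutoff is an arbitrary example value.

def inconsistent(scores, max_stdev=1.0):
    """True if judge scores vary more than the chosen threshold."""
    return statistics.stdev(scores) > max_stdev

def low_scorers(results, threshold=3):
    """Test cases to review first: anything scoring below the pass bar."""
    return [r for r in results if r["score"] < threshold]

print(inconsistent([5, 5, 1, 4, 2]))  # True (scores swing wildly)
print(inconsistent([4, 4, 4, 4, 3]))  # False (stable run)
```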

Improvement Best Practices

When to Use AI Improvement

  • After identifying patterns in annotations
  • When manual tweaking isn’t working
  • To get fresh perspective on prompt
  • When scaling improvement efforts

What to Include in Feedback

Helpful feedback:

"Users are asking about refunds but the agent only mentions returns. Need to distinguish between refund (money back) and return (exchange). Should ask clarifying question if unclear which they want."

Not helpful:

"Make it better"

Validating Improvements

  1. Don’t trust blindly: AI suggestions need human review
  2. Test systematically: Run eval before publishing
  3. Start with one change: Don’t apply all suggestions at once
  4. Monitor after publish: Watch for regressions

Continuous Improvement Workflow

Daily

  • Check Monitor for new conversations
  • Review any negative annotations
  • Note patterns for later

Weekly

  • Review aggregate metrics
  • Analyze annotation themes
  • Prioritize improvements needed
  • Run eval on current version

Monthly

  • Add new test cases from production
  • Compare version performance trends
  • Archive outdated eval sets
  • Document what’s been learned

Troubleshooting

Annotations not appearing

  1. Check annotation feature is enabled (Enterprise Settings)
  2. Verify user has permission to annotate
  3. Refresh the Monitor view
  4. Check correct prompt is selected

Evaluation failing

  1. Verify eval set has test cases
  2. Check prompt version exists
  3. Ensure API keys are configured
  4. Try running single test case first

Judge scores seem wrong

  1. Review judge prompt (uses production version)
  2. Check expected outputs are realistic
  3. Verify input/output alignment in eval set
  4. Consider adjusting pass threshold

AI improvement suggestions unhelpful

  1. Provide more specific feedback
  2. Include actual conversation examples
  3. Describe what good looks like
  4. Try with different conversation samples

Metrics not updating

  1. Refresh the page
  2. Check date range filters
  3. Verify conversations are completed
  4. Allow time for aggregation

Quick Reference

Keyboard Shortcuts

| Action | Shortcut |
|--------|----------|
| Refresh Monitor | Ctrl/Cmd + R |
| Search | Ctrl/Cmd + F |
| Save annotation | Enter (in modal) |

Key Metrics Glossary

| Metric | Formula | Good Target |
|--------|---------|-------------|
| Pass Rate | (Scores ≥ 3) / Total | >80% |
| Avg Score | Sum(Scores) / Count | >4.0 |
| Satisfaction | Positive / (Positive + Negative) | >85% |

Eval Set Size Guidelines

| Purpose | Recommended Size |
|---------|------------------|
| Quick check | 10-15 cases |
| Standard eval | 25-50 cases |
| Comprehensive | 100+ cases |
