April 22, 2026 · 12 min read · Evaluation Guide

7 Ways to Evaluate Your Copilot Studio Agents With Practical Examples

Copilot Studio offers seven distinct evaluation methods. Each one tests something different. Here's what they do, when to use them, and how to combine them into a strategy that actually catches problems.

Why evaluation matters more than you think

Your Copilot Studio agent doesn't give the same answer twice. Ask it the same question ten times and you'll get ten slightly different responses. That's not a bug. It's the nature of language models. But it means traditional pass/fail testing doesn't work anymore.

Instead, you need a layered evaluation strategy. Different methods test different qualities: Is the answer correct? Did it use the right source? Does it contain the right keywords? Is it compliant with company policy?

Copilot Studio gives you seven evaluation methods. Let's break down each one with real examples so you know exactly which to use and when.

Method 1: General Quality

The "is this answer actually good?" test

Scoring: 0-100% | No expected answer needed

General Quality is the default evaluation method and the one every test set starts with. It uses a language model to judge your agent's response across four criteria: relevance (does it answer the question?), groundedness (is it based on actual source data?), completeness (does it cover everything?), and abstention (did it even try to answer?).

The beauty of General Quality is that you don't need to write expected answers. Just provide the test questions and let the evaluator judge the response. This makes it perfect for initial quality checks when you don't yet know what the "right" answer looks like.
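To make the four criteria concrete, here's a toy composite that averages them into a 0-100% score. The formula is invented purely for illustration; the real evaluator is an LLM judge, and Microsoft doesn't publish how the criteria combine:

```python
def general_quality(relevance: float, groundedness: float,
                    completeness: float, abstention: float) -> float:
    """Toy composite: average four sub-scores (each 0.0-1.0) and
    scale to the 0-100% range the evaluator reports.
    Illustrative only - the real judge is an LLM with no public formula."""
    criteria = [relevance, groundedness, completeness, abstention]
    for c in criteria:
        if not 0.0 <= c <= 1.0:
            raise ValueError("each criterion must be in [0, 1]")
    return 100 * sum(criteria) / len(criteria)

# A topically relevant but non-actionable answer (like the standing-desk
# example below) scores low on completeness and abstention:
print(general_quality(relevance=0.7, groundedness=0.2,
                      completeness=0.0, abstention=0.0))  # 22.5
```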

Example: IT helpdesk agent

Question: How do I reset my VPN password?
Response: To reset your VPN password, go to myaccess.company.com, click 'Forgot Password,' and follow the steps. You'll need your employee ID and registered email. The new password takes 15 minutes to sync.
Score: 92% - Pass

Question: How do I request a standing desk?
Response: Standing desks are a great option for ergonomic health!
Score: 18% - Fail (relevant but not complete, no actionable info)
Use for: Quick pulse checks · Early development. Not for: Precise answer validation.

Method 2: Compare Meaning

The "did it get the idea right?" test

Scoring: 0-100% | Pass threshold configurable (default 50%) | Expected answer required

Compare Meaning doesn't care about wording. It compares the intent and meaning behind the agent's response to your expected answer. If your agent says "submit a ticket via ServiceNow" and your expected answer says "create a request in the IT portal," that's still a match, because the meaning is the same.

This is ideal for knowledge-based agents where the same answer can be phrased in dozens of correct ways. Set your pass threshold based on how strict you need to be. For critical HR or compliance agents, try 70-80%. For general Q&A, 50% might be enough.

Example: HR benefits agent

Question: How many vacation days do I get?
Expected: Full-time employees receive 25 vacation days per year, accrued monthly.
Response: You're entitled to 25 days of annual leave. These are added to your balance each month.
Score: 94% - Pass (same meaning, different words)

Response: Please contact HR for information about your leave balance.
Score: 12% - Fail (deflected instead of answering)
Use for: Knowledge-based Q&A · Flexible answer formats. Not for: Exact data validation.

Method 3: Tool Use

The "did it use the right tools?" test

Scoring: Pass/Fail | Expected tools required

Tool Use doesn't evaluate the answer itself. It checks whether the agent triggered the right topics, actions, or connectors to get there. Did your agent call the ServiceNow connector when someone asked about a ticket? Did it route to the escalation topic when it should have?

This is essential for agents with complex orchestration. The answer might look correct, but if the agent got there by accident (or used the wrong data source), that's a problem waiting to happen.
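Conceptually, the check reduces to a set comparison between the tools you expected and the tools the agent actually triggered. A minimal sketch, which only requires the expected tools to appear (whether extra tool calls are penalized isn't stated here):

```python
def tool_use_check(expected_tools: set[str], actual_tools: set[str]) -> bool:
    """Pass when every expected tool/topic/connector was actually triggered.
    Sketch assumption: extra tool calls don't cause a failure."""
    return expected_tools <= actual_tools  # subset test

print(tool_use_check({"Order Lookup API"}, {"Order Lookup API"}))    # True
print(tool_use_check({"Returns & Refunds topic"}, {"General FAQ"}))  # False
```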

Example: order tracking agent

Question: Where is my order #45892?
Expected tool: Order Lookup API connector
Actual tool: Order Lookup API connector
Result: Pass - correct connector triggered

Question: I want to return this product
Expected tool: Returns & Refunds topic
Actual tool: General FAQ knowledge source
Result: Fail - wrong routing, used FAQ instead of returns flow
Use for: Routing validation · Connector/API testing. Not for: Answer quality.

Method 4: Keyword Match

The "did it mention the important stuff?" test

Scoring: Pass/Fail | Match mode: Any or All | Keywords required

Keyword Match is straightforward: does the response contain specific words or phrases you define? You can require all keywords to be present (strict mode) or just any of them (flexible mode).

This is your workhorse for compliance and safety checks. If your agent discusses medication, the disclaimer must appear. If it talks about pricing, the currency and effective date must be mentioned. Keyword Match catches those missing pieces that other methods might overlook.
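The Any/All logic is easy to picture in code. A minimal sketch, assuming case-insensitive substring matching (Copilot Studio's exact matching rules aren't documented here):

```python
def keyword_match(response: str, keywords: list[str], mode: str = "any") -> bool:
    """Pass/fail keyword check.

    mode="any": at least one keyword must appear (flexible mode).
    mode="all": every keyword must appear (strict mode).
    Assumes case-insensitive substring matching - an illustration,
    not the product's exact behavior.
    """
    text = response.lower()
    hits = [kw.lower() in text for kw in keywords]
    return any(hits) if mode == "any" else all(hits)

resp = ("Common side effects include headaches and nausea. "
        "Please consult your doctor before making any changes.")
print(keyword_match(resp, ["consult your doctor", "healthcare professional"]))  # True
print(keyword_match(resp, ["disclaimer", "not medical advice"], mode="all"))    # False
```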

Example: pharmaceutical support agent

Question: What are the side effects of Product X?
Keywords: "consult your doctor" + "healthcare professional" (Any)
Response: Common side effects include headaches and nausea. Please consult your doctor before making any changes to your medication.
Result: Pass - contains "consult your doctor"

Keywords: "disclaimer" + "not medical advice" (All)
Response: Side effects may include dizziness. This is not medical advice.
Result: Fail - missing "disclaimer" keyword
Use for: Compliance checks · Required disclaimers. Not for: Open-ended answers.

Method 5: Text Similarity

The "how close is the wording?" test

Scoring: 0-1 (cosine similarity) | Pass threshold configurable | Expected answer required

Text Similarity sits between Compare Meaning and Exact Match. It uses cosine similarity to compare the wording and structure of the response against your expected answer. Same meaning in completely different words scores lower here than with Compare Meaning.

Use this when you want the agent to not just get the idea right, but to phrase it in a reasonably similar way. For instance, when your agent needs to echo specific policy language or standard operating procedures.
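To build intuition for the 0-1 score, here's cosine similarity computed over simple word-count vectors. This only illustrates the metric itself; the actual evaluator almost certainly compares embedding vectors, which capture meaning far better than raw word counts:

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity over bag-of-words count vectors.
    Illustrative only - a production evaluator would use embeddings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

expected = "Customer data is retained for 3 years after the last interaction"
close = "We keep customer data for 3 years following the last interaction"
vague = "We follow European privacy regulations"
print(round(cosine_similarity(expected, close), 2))  # 0.73 - mostly shared wording
print(round(cosine_similarity(expected, vague), 2))  # 0.0 - no overlap at all
```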

Example: policy information agent

Question: What is our data retention policy?
Expected: Customer data is retained for 3 years after the last interaction, then permanently deleted per GDPR requirements.
Response: We keep customer data for 3 years following the last interaction. After that period, data is permanently removed in compliance with GDPR.
Score: 0.87 - Pass (similar wording and structure)

Response: We follow European privacy regulations for data handling.
Score: 0.31 - Fail (vague, missing specifics)
Use for: Policy language · Standard responses. Not for: Creative or varied answers.

Method 6: Exact Match

The "character for character" test

Scoring: Pass/Fail | Expected answer required

Exact Match is the strictest evaluation method. The response must match your expected answer character for character, word for word. One extra space? Fail. Different capitalization? Fail.

This sounds extreme, and it is. But it's exactly what you need for deterministic outputs like order numbers, reference codes, URLs, or short factual lookups where there is only one correct answer.
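Strictness like this is just string equality. A sketch under the strictest reading (no trimming, no case folding; whether the product normalizes whitespace isn't stated here):

```python
def exact_match(response: str, expected: str) -> bool:
    """Character-for-character comparison.
    Sketch assumption: no whitespace trimming or case normalization."""
    return response == expected

expected = "https://expenses.company.com"
print(exact_match("https://expenses.company.com", expected))   # True
print(exact_match("https://expenses.company.com ", expected))  # False: trailing space
print(exact_match("You can find the expense portal at expenses.company.com",
                  expected))                                   # False: extra text
```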

Example: internal tools agent

Question: What is the URL for the expense portal?
Expected: https://expenses.company.com
Response: https://expenses.company.com
Result: Pass - exact match

Response: You can find the expense portal at expenses.company.com
Result: Fail - extra text around the URL
Use for: URLs, codes, IDs · Fixed-format outputs. Not for: Natural language answers.

Method 7: Custom

The "your rules, your criteria" test

Scoring: Pass/Fail via custom labels | Evaluation instructions + labels required

Custom is the most powerful evaluation method. You write the evaluation criteria, define the labels, and the evaluator uses AI to classify each response accordingly. This is where you encode your organization's specific standards that no built-in method can capture.

Think HR compliance, brand tone of voice, safety protocols, legal disclaimers. Anything specific to how your organization expects agents to behave. You write the rules in plain language, define labels like "Compliant" vs "Non-Compliant," and let the evaluator do the rest.
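Under the hood, a custom evaluation amounts to handing an LLM your instructions, your labels, and the response to classify. The actual prompt format Copilot Studio uses is not public, so this assembly is purely illustrative:

```python
# Hypothetical sketch of how a custom evaluator prompt might be assembled.
# The real Copilot Studio prompt format is not public; everything here
# is illustrative, including the names.
INSTRUCTIONS = (
    "Evaluate whether the response protects employee privacy, avoids legal "
    "claims, and provides neutral HR-aligned guidance."
)
LABELS = ["Compliant", "Non-Compliant"]

def build_eval_prompt(question: str, response: str) -> str:
    """Combine plain-language rules and labels into a classification prompt."""
    return (
        f"{INSTRUCTIONS}\n"
        f"Classify the response with exactly one label: {', '.join(LABELS)}.\n"
        f"Question: {question}\n"
        f"Response: {response}\n"
        f"Label:"
    )

print(build_eval_prompt("Can I be fired for taking parental leave?",
                        "No, you are legally protected."))
```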

Example: HR compliance evaluation

Instructions: Evaluate whether the response protects employee privacy, avoids legal claims, and provides neutral HR-aligned guidance without discrimination or bias.
Labels: Compliant (pass) | Non-Compliant (fail)

Question: Can I be fired for taking parental leave?
Response: No, you are legally protected. Parental leave is a right under company policy and applicable labor law. If you have concerns, please reach out to your HR representative confidentially.
Label: Compliant - protects privacy, neutral, no legal claims

Response: Legally, they can't fire you, but honestly, some managers don't look favorably on extended leave. I'd document everything just in case.
Label: Non-Compliant - speculative, undermines trust, not HR-neutral
Use for: Compliance & policy · Brand tone checks · Safety validation.

Quick reference

Here's every method at a glance so you can quickly pick the right one for your scenario:

Method | Measures | Scoring | Needs expected answer?
General Quality | Overall answer quality | 0-100% | No
Compare Meaning | Intent & meaning alignment | 0-100% | Yes
Tool Use | Correct tools/topics triggered | Pass/Fail | Yes (tools)
Keyword Match | Required words present | Pass/Fail | Yes (keywords)
Text Similarity | Wording closeness | 0-1 | Yes
Exact Match | Character-perfect match | Pass/Fail | Yes
Custom | Your own criteria | Pass/Fail | No (uses instructions)

Combining methods: real-world strategies

No single method tells the full story. The power is in combining them. Here are four common agent types and the evaluation stack we recommend:

IT Helpdesk Agent

Needs to answer accurately, route to the right systems, and include links to self-service portals.

General Quality · Compare Meaning @ 70% · Tool Use · Keyword Match

HR Policy Agent

Must be compliant, neutral, and echo policy language accurately. No room for creative interpretation.

Text Similarity @ 0.7 · Custom (Compliance) · Keyword Match (All)

Customer-Facing FAQ

Flexible phrasing is fine, but answers must be grounded and complete. Tone matters.

General Quality · Compare Meaning @ 50% · Custom (Tone)

Order Tracking Agent

Must call the right APIs, return precise data, and never hallucinate tracking numbers.

Tool Use · Exact Match · Keyword Match
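Whatever the agent type, a layered stack boils down to a set of independent checks that must all pass. A toy sketch of that idea, where the predicates stand in for the real evaluators:

```python
def run_stack(response: str, checks: list[tuple]) -> dict:
    """Apply every (name, predicate) check; pass overall only if all pass.
    The predicates here are stand-ins, not Copilot Studio APIs."""
    results = {name: bool(check(response)) for name, check in checks}
    results["overall"] = all(results.values())
    return results

# Order-tracking stack with illustrative placeholder checks:
stack = [
    ("has_tracking_id", lambda r: "#45892" in r),
    ("keyword_match", lambda r: "delivery" in r.lower()),
]
print(run_stack("Order #45892 is out for delivery.", stack))
```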

Common mistakes to avoid

Using only General Quality. It's the default, and many teams never add anything else. General Quality is a great starting point, but it won't catch routing errors, missing disclaimers, or compliance violations. Layer it with at least one other method.

Setting pass thresholds too low. A 50% Compare Meaning score means half the meaning is off. For anything customer-facing or compliance-related, start at 70% and adjust from there.

Running evaluations only before launch. Your agent changes every time a knowledge source updates, a topic is edited, or a model version changes. Evaluations should run continuously, not once.

Forgetting Tool Use. An agent can give a perfect-sounding answer from the wrong source. Tool Use is the only method that catches this, and most teams skip it entirely.

Pro tip

Run each test set multiple times and average the results. Because language models are non-deterministic, a single run might pass or fail based on the phrasing of that particular response. Microsoft recommends aiming for an 80-90% pass rate across multiple runs, not 100% on a single run.
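A sketch of that averaging, where `run_fn` is a stand-in for whatever executes one full evaluation run and returns the fraction of test cases that passed:

```python
import statistics

def multi_run_pass_rate(run_fn, n_runs: int = 5) -> float:
    """Average the pass rate over several runs of a non-deterministic
    test set. `run_fn` is a placeholder for your actual run executor,
    returning a pass rate in [0.0, 1.0]."""
    rates = [run_fn() for _ in range(n_runs)]
    return statistics.mean(rates)

# Simulated results from five runs of the same test set:
results = iter([0.90, 0.85, 0.80, 0.95, 0.85])
avg = multi_run_pass_rate(lambda: next(results), n_runs=5)
print(f"Average pass rate: {avg:.0%}")  # Average pass rate: 87%
```

A single run at 80% and another at 95% tell you much less than five runs averaging 87%; the average is what you compare against the 80-90% target.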

From evaluation to monitoring

Evaluation methods tell you how your agent performs at a point in time. But agents don't stand still. Knowledge sources change, topics get edited, user behavior evolves, and model updates roll out.

The real challenge isn't running evaluations. It's running them continuously, automatically, and at scale. That's where one-time testing becomes ongoing monitoring, and where a purpose-built platform starts to make a lot more sense than manual test runs.

Make evaluation effortless

Agentowr runs your evaluations automatically, tracks results over time, and alerts you when quality drops. No manual test runs. No Dataverse setup.