Why evaluation matters more than you think
Your Copilot Studio agent doesn't give the same answer twice. Ask it the same question ten times and you'll get ten slightly different responses. That's not a bug. It's the nature of language models. But it means traditional pass/fail testing doesn't work anymore.
Instead, you need a layered evaluation strategy. Different methods test different qualities: Is the answer correct? Did it use the right source? Does it contain the right keywords? Is it compliant with company policy?
Copilot Studio gives you seven evaluation methods. Let's break down each one with real examples so you know exactly which to use and when.
The "is this answer actually good?" test
Scoring: 0-100% | No expected answer needed

General Quality is the default evaluation method and the one every test set starts with. It uses a language model to judge your agent's response across four criteria: relevance (does it answer the question?), groundedness (is it based on actual source data?), completeness (does it cover everything?), and abstention (did it answer when it should, and decline when it couldn't?).
The beauty of General Quality is that you don't need to write expected answers. Just provide the test questions and let the evaluator judge the response. This makes it perfect for initial quality checks when you don't yet know what the "right" answer looks like.
Example: an IT helpdesk agent. Ask "How do I reset my VPN password?" and General Quality scores the response for relevance, groundedness, completeness, and abstention before you've written a single expected answer.
The "did it get the idea right?" test
Scoring: 0-100% | Pass threshold configurable (default 50%) | Expected answer required

Compare Meaning doesn't care about wording. It compares the intent and meaning behind the agent's response to your expected answer. If your agent says "submit a ticket via ServiceNow" and your expected answer says "create a request in the IT portal," that's still a match, because the meaning is the same.
This is ideal for knowledge-based agents where the same answer can be phrased in dozens of correct ways. Set your pass threshold based on how strict you need to be. For critical HR or compliance agents, try 70-80%. For general Q&A, 50% might be enough.
Example: an HR benefits agent. "Dental coverage starts on your first day" and "You're covered for dental from day one" should both pass, because Compare Meaning scores the intent, not the phrasing.
The "did it use the right tools?" test
Scoring: Pass/Fail | Expected tools required

Tool Use doesn't evaluate the answer itself. It checks whether the agent triggered the right topics, actions, or connectors to get there. Did your agent call the ServiceNow connector when someone asked about a ticket? Did it route to the escalation topic when it should have?
This is essential for agents with complex orchestration. The answer might look correct, but if the agent got there by accident (or used the wrong data source), that's a problem waiting to happen.
Example: order tracking agent
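The underlying pass/fail logic is simple set membership: every expected tool must appear in the trace of what actually fired. A minimal sketch (the topic and connector names here are hypothetical, and the real evaluator works on Copilot Studio's internal activity trace):

```python
def tools_pass(expected: set[str], triggered: list[str]) -> bool:
    """Pass only if every expected tool/topic was actually triggered."""
    return expected.issubset(triggered)

# Hypothetical trace from one test run of an order tracking agent
triggered = ["OrderLookupTopic", "ServiceNowConnector"]

print(tools_pass({"ServiceNowConnector"}, triggered))  # the right connector fired
print(tools_pass({"EscalationTopic"}, triggered))      # escalation never ran: fail
```

Note that the answer text never enters this check; a polished response built from the wrong connector still fails, which is exactly the point.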
The "did it mention the important stuff?" test
Scoring: Pass/Fail | Match mode: Any or All | Keywords required

Keyword Match is straightforward: does the response contain specific words or phrases you define? You can require all keywords to be present (strict mode) or just any of them (flexible mode).
This is your workhorse for compliance and safety checks. If your agent discusses medication, the disclaimer must appear. If it talks about pricing, the currency and effective date must be mentioned. Keyword Match catches those missing pieces that other methods might overlook.
Example: pharmaceutical support agent
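The any/all distinction can be sketched in a few lines. This is a simplified illustration, not Copilot Studio's implementation; the sample response and keywords are invented:

```python
def keyword_match(response: str, keywords: list[str], mode: str = "all") -> bool:
    """Case-insensitive substring check.

    mode="all" is strict mode (every keyword required);
    mode="any" is flexible mode (one hit is enough).
    """
    text = response.lower()
    hits = [kw.lower() in text for kw in keywords]
    return all(hits) if mode == "all" else any(hits)

response = "Aspirin may interact with other medication. Consult your doctor before use."

# Strict mode: the disclaimer AND the topic word must both appear
print(keyword_match(response, ["consult your doctor", "medication"], mode="all"))  # True
# Flexible mode: none of the alternatives appear, so this fails
print(keyword_match(response, ["side effects", "adverse reactions"], mode="any"))  # False
```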
The "how close is the wording?" test
Scoring: 0-1 (cosine similarity) | Pass threshold configurable | Expected answer required

Text Similarity sits between Compare Meaning and Exact Match. It uses cosine similarity to compare the wording and structure of the response against your expected answer. Same meaning in completely different words scores lower here than with Compare Meaning.
Use this when you want the agent to not just get the idea right, but to phrase it in a reasonably similar way. For instance, when your agent needs to echo specific policy language or standard operating procedures.
Example: policy information agent
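To see why rewording hurts this metric, here is cosine similarity over simple bag-of-words vectors. This is deliberately crude (real text-similarity evaluators typically use embeddings rather than raw word counts), but it shows the behavior: near-identical wording scores high, a paraphrase scores low.

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two texts, in [0, 1]."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

expected = "Remote work requires manager approval and a signed agreement"
close = "Remote work requires manager approval and a signed agreement form"
reworded = "You can work from home if your boss signs off"

print(cosine_similarity(expected, close))     # high: wording nearly identical
print(cosine_similarity(expected, reworded))  # low: same idea, different words
```

With Compare Meaning, `reworded` would likely pass; with Text Similarity it fails, which is what you want when the agent must echo specific policy language.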
The "character for character" test
Scoring: Pass/Fail | Expected answer required

Exact Match is the strictest evaluation method. The response must match your expected answer character for character, word for word. One extra space? Fail. Different capitalization? Fail.
This sounds extreme, and it is. But it's exactly what you need for deterministic outputs like order numbers, reference codes, URLs, or short factual lookups where there is only one correct answer.
Example: internal tools agent
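The check is literal string equality, which makes the failure modes easy to demonstrate. The order-number format below is made up for illustration:

```python
def exact_match(response: str, expected: str) -> bool:
    """Character-for-character comparison: whitespace and case both matter."""
    return response == expected

print(exact_match("ORD-48213", "ORD-48213"))   # True
print(exact_match("ord-48213", "ORD-48213"))   # False: capitalization differs
print(exact_match("ORD-48213 ", "ORD-48213"))  # False: trailing space
```

That trailing-space failure is a feature, not a bug: if your agent is supposed to return a reference code verbatim, any deviation should be visible.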
The "your rules, your criteria" test
Scoring: Pass/Fail via custom labels | Evaluation instructions + labels required

Custom is the most powerful evaluation method. You write the evaluation criteria, define the labels, and the evaluator uses AI to classify each response accordingly. This is where you encode your organization's specific standards that no built-in method can capture.
Think HR compliance, brand tone of voice, safety protocols, legal disclaimers. Anything specific to how your organization expects agents to behave. You write the rules in plain language, define labels like "Compliant" vs "Non-Compliant," and let the evaluator do the rest.
Example: HR compliance evaluation
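Conceptually, a custom evaluation amounts to handing a judge model your plain-language rubric, your labels, and the response to classify. The sketch below only assembles such a prompt as a string; Copilot Studio builds and runs the actual classification internally, and the instructions, labels, and sample response here are all invented:

```python
def build_custom_eval_prompt(instructions: str, labels: list[str], response: str) -> str:
    """Assemble the kind of classification prompt a judge model would receive.

    Illustrative only: the real evaluator constructs and runs this itself.
    """
    return (
        f"Evaluate the agent response against these criteria:\n{instructions}\n\n"
        f"Respond with exactly one label: {', '.join(labels)}.\n\n"
        f"Agent response:\n{response}"
    )

prompt = build_custom_eval_prompt(
    instructions="The response must stay neutral and cite the HR policy handbook.",
    labels=["Compliant", "Non-Compliant"],
    response="Per the HR handbook, parental leave is 16 weeks.",
)
print(prompt)
```

The key design point: the rubric lives in plain language, so HR or legal can review and edit it without touching the agent itself.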
Quick reference
Here's every method at a glance so you can quickly pick the right one for your scenario:
| Method | Measures | Scoring | Needs expected answer? |
|---|---|---|---|
| General Quality | Overall answer quality | 0-100% | No |
| Compare Meaning | Intent & meaning alignment | 0-100% | Yes |
| Tool Use | Correct tools/topics triggered | Pass/Fail | Yes (tools) |
| Keyword Match | Required words present | Pass/Fail | Yes (keywords) |
| Text Similarity | Wording closeness | 0-1 | Yes |
| Exact Match | Character-perfect match | Pass/Fail | Yes |
| Custom | Your own criteria | Pass/Fail | No (uses instructions) |
Combining methods: real-world strategies
No single method tells the full story. The power is in combining them. Here are four common agent types and the evaluation stack we recommend:
IT Helpdesk Agent
Needs to answer accurately, route to the right systems, and include links to self-service portals. A natural stack: General Quality for overall answer quality, Tool Use for the routing, and Keyword Match for the required portal links.
HR Policy Agent
Must be compliant, neutral, and echo policy language accurately. No room for creative interpretation. A natural stack: Compare Meaning with a high threshold, Text Similarity for the policy wording, and a Custom compliance check.
Customer-Facing FAQ
Flexible phrasing is fine, but answers must be grounded and complete. Tone matters. A natural stack: General Quality for groundedness and completeness, plus Compare Meaning against curated expected answers.
Order Tracking Agent
Must call the right APIs, return precise data, and never hallucinate tracking numbers. A natural stack: Tool Use for the API calls and Exact Match for the returned identifiers.
Common mistakes to avoid
Using only General Quality. It's the default, and many teams never add anything else. General Quality is a great starting point, but it won't catch routing errors, missing disclaimers, or compliance violations. Layer it with at least one other method.
Setting pass thresholds too low. A 50% Compare Meaning score means half the meaning is off. For anything customer-facing or compliance-related, start at 70% and adjust from there.
Running evaluations only before launch. Your agent changes every time a knowledge source updates, a topic is edited, or a model version changes. Evaluations should run continuously, not once.
Forgetting Tool Use. An agent can give a perfect-sounding answer from the wrong source. Tool Use is the only method that catches this, and most teams skip it entirely.
Pro tip
Run each test set multiple times and average the results. Because language models are non-deterministic, a single run might pass or fail based on the phrasing of that particular response. Microsoft recommends aiming for an 80-90% pass rate across multiple runs, not 100% on a single run.
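Averaging across runs is just pooling pass/fail results. A small sketch with invented data, where every run has one failure yet the aggregate still lands in the target band:

```python
def aggregate_pass_rate(runs: list[list[bool]]) -> float:
    """Average pass rate pooled across several runs of the same test set."""
    total = sum(len(run) for run in runs)
    passed = sum(sum(run) for run in runs)
    return passed / total

# Three runs of the same 5-case test set; which case fails flips between runs
runs = [
    [True, True, False, True, True],
    [True, True, True, False, True],
    [False, True, True, True, True],
]

print(f"{aggregate_pass_rate(runs):.0%}")  # 80% — in the 80-90% band despite no perfect run
```

A single run here would tell you a case "failed"; the pooled rate tells you the agent is behaving about as consistently as a non-deterministic system can.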
From evaluation to monitoring
Evaluation methods tell you how your agent performs at a point in time. But agents don't stand still. Knowledge sources change, topics get edited, user behavior evolves, and model updates roll out.
The real challenge isn't running evaluations. It's running them continuously, automatically, and at scale. That's where one-time testing becomes ongoing monitoring, and where a purpose-built platform starts to make a lot more sense than manual test runs.