May 27, 2026 · 9 min read · Agent Evaluation

Your Copilot Studio Instructions Are a Contract. Are You Testing Them?

Most teams write agent instructions once and never look at them again. Then they wonder why the agent drifts. Instructions aren't a creative writing exercise. They're a behavioral contract, and contracts need verification.

Instructions are a contract, not a prompt

There's a mental model problem with how most people approach agent instructions in Copilot Studio. They treat instructions like a prompt: write something reasonable, hit publish, hope for the best.

But instructions aren't a suggestion to the agent. They're a behavioral specification. Every line you write is a claim about how your agent will behave in production. "Never discuss topics outside of product support" is a claim. "Escalate billing disputes to a human" is a claim. "Respond in plain language" is a claim.

The question is: are you verifying those claims, or just assuming they hold?

The contract mindset

Software engineers write unit tests to verify that code does what they expect. Agent instructions deserve the same discipline. If you can't write a test case for an instruction, the instruction is too vague to be useful. "Be helpful" is untestable. "When a user asks about return policies, cite the returns page and include the 30-day window" is testable. That distinction is everything.

This isn't about writing perfect instructions on day one. It's about building a system where you can detect when instructions stop working. Knowledge sources get updated. User behavior shifts. New tools get added. The instructions you wrote three months ago may no longer match reality. Without a verification loop, you won't know until users start complaining.

Five failure patterns hiding in your transcripts

We've analyzed thousands of agent conversations across dozens of Copilot Studio deployments. The same five failure patterns show up again and again. None of them is caused by broken connectors or missing knowledge sources. All of them trace back to instructions that are missing, vague, or untested.

📡

Scope creep

The agent answers things it was never meant to

From a real transcript

User

Can you help me compare your pricing to Acme Corp?

Agent

Sure! Compared to Acme Corp, our plans offer better value because we include priority support at no extra cost. Acme charges $29/month for a similar tier, while we start at $19/month...

What went wrong

The agent was built for product support, not competitive analysis. Nothing in the instructions explicitly excluded competitor discussions, so the agent improvised. It generated a comparison using hallucinated pricing data.

The testable instruction that prevents this

When asked about competitor products or pricing comparisons, the agent must decline and redirect to the sales team.

🔇

Silent fallback

The agent gives up when it actually has the answer

From a real transcript

User

How do I configure SSO for the mobile app?

Agent

I'm sorry, I don't have information about that. Please contact our support team for assistance.

What went wrong

The SSO configuration guide exists in the knowledge base. But the instructions said "only use the FAQ for general questions" without clarifying what counts as general. The orchestrator classified this as a technical question, skipped the FAQ, found nothing in the other sources, and gave up.

The testable instruction that prevents this

When a user asks about SSO configuration, the agent must search the technical documentation knowledge source and return the relevant setup steps.

🔀

Tool confusion

The agent calls the wrong action for the job

From a real transcript

User

I need to update my shipping address for order #4821.

Agent

I've created a new order with your updated address. Your new order number is #4837.

What went wrong

The agent had two tools: "Create Order" and "Update Order Details." The instructions mentioned both but never specified which one handles address changes. The orchestrator picked the wrong one. The customer now has a duplicate order.

The testable instruction that prevents this

When a user requests a change to an existing order, the agent must use the Update Order Details action. It must never create a new order to handle a modification request.
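An instruction like this can be checked mechanically once transcripts record which action was called. The sketch below is illustrative only: it assumes a hypothetical export where each turn is a dictionary with an `intent` label and a `tool_calls` list, which is not the actual Copilot Studio transcript schema.

```python
def check_order_modification(turns):
    """Classify one transcript against the 'use Update Order Details' rule.

    Assumes a hypothetical transcript shape: a list of dicts, each with an
    'intent' label and a 'tool_calls' list of action names.
    """
    relevant = [t for t in turns if t.get("intent") == "modify_order"]
    if not relevant:
        return "not_applicable"  # the trigger never fired
    for turn in relevant:
        calls = turn.get("tool_calls", [])
        if "Create Order" in calls:
            return "violated"  # boundary: never create a new order
        if "Update Order Details" not in calls:
            return "violated"  # expected behavior is missing
    return "followed"


# The failing exchange above, expressed in the assumed shape.
transcript = [{"intent": "modify_order", "tool_calls": ["Create Order"]}]
print(check_order_modification(transcript))  # violated
```

A production check would likely need to tolerate clarifying turns where the agent asks a question before calling any action; this version treats them as violations for simplicity.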

🎭

Tone drift

The agent changes personality mid-conversation

From a real transcript

User

This is the third time I've asked about this. Nobody is helping me.

Agent

I understand your concern. Let me look into that for you. Based on our records, your request was submitted on May 12th. Per our SLA guidelines, resolution times for P3 tickets are 5-7 business days. Your ticket is within the expected timeframe. Is there anything else I can help with?

What went wrong

The agent acknowledged frustration in one sentence, then immediately switched to cold, policy-heavy language. The instructions said "be empathetic" but never defined what empathy looks like in practice, especially when delivering news the user won't want to hear.

The testable instruction that prevents this

When a user expresses frustration, the agent must acknowledge the specific issue, apologize for the experience, and explain next steps in plain language without referencing internal SLA terminology.
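Tone is hard to test in full, but the boundary in this instruction (no internal SLA terminology) is easy to automate. A minimal sketch, assuming a hypothetical list of internal terms that you would replace with your own jargon:

```python
import re

# Hypothetical list of internal terms; extend it with your own jargon.
INTERNAL_TERMS = re.compile(r"\b(?:SLA|P\d)\b", re.IGNORECASE)


def internal_terms_used(reply):
    """Return the distinct internal terms found in an agent reply, sorted."""
    return sorted({match.group(0) for match in INTERNAL_TERMS.finditer(reply)})


reply = (
    "Per our SLA guidelines, resolution times for P3 tickets "
    "are 5-7 business days."
)
print(internal_terms_used(reply))  # ['P3', 'SLA']
```

This covers only the prohibited wording. Whether the agent acknowledged the specific issue and apologized still needs a human reviewer or a separate evaluator.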

🚧

Escalation failure

The agent tries to handle what it should hand off

From a real transcript

User

I think someone accessed my account without permission. There are purchases I didn't make.

Agent

I'm sorry to hear that. Let me help you secure your account. First, I'll reset your password. What email address is associated with your account?

What went wrong

A potential security breach should go straight to the security team. Instead, the agent tried to resolve it by resetting a password, which could actually make things worse by alerting the attacker or destroying forensic evidence. The instructions never defined escalation triggers for security incidents.

The testable instruction that prevents this

When a user reports unauthorized access, suspicious activity, or unrecognized transactions, the agent must immediately transfer to the security team without attempting any account modifications.
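This instruction is really a claim about ordering: the transfer must come before any account change. A rough sketch of that check, assuming a hypothetical event log of `(kind, name)` tuples and made-up action names:

```python
# Hypothetical action names; use the ones defined in your own agent.
ACCOUNT_MODIFYING_ACTIONS = {"Reset Password", "Update Email"}


def escalated_before_modifying(events, team="Security Team"):
    """Return True only if the transfer happens before any account change.

    `events` is an assumed, ordered list of (kind, name) tuples, such as
    ("action", "Reset Password") or ("transfer", "Security Team").
    """
    for kind, name in events:
        if kind == "transfer" and name == team:
            return True
        if kind == "action" and name in ACCOUNT_MODIFYING_ACTIONS:
            return False
    return False  # the conversation ended without a transfer


bad = [("action", "Reset Password"), ("transfer", "Security Team")]
good = [("transfer", "Security Team")]
print(escalated_before_modifying(bad), escalated_before_modifying(good))  # False True
```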

Writing instructions you can actually verify

Every failure pattern above has the same root cause: an instruction that was either missing or too vague to verify. The fix isn't to write longer instructions. It's to write instructions that double as test cases.

A testable instruction has three properties: a trigger (what situation activates it), an expected behavior (what the agent should do), and a boundary (what the agent must not do).
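Those three properties map naturally onto a small data structure. The following is a sketch, not a Copilot Studio feature: `InstructionSpec`, the exchange dictionary, and the keyword checks are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class InstructionSpec:
    """One instruction expressed as trigger, expected behavior, and boundary."""
    name: str
    trigger: Callable[[dict], bool]    # does this instruction apply?
    expected: Callable[[dict], bool]   # did the agent do what it should?
    boundary: Callable[[dict], bool]   # did the agent do what it must not?

    def verdict(self, exchange: dict) -> Optional[bool]:
        """True = followed, False = violated, None = not applicable."""
        if not self.trigger(exchange):
            return None
        return self.expected(exchange) and not self.boundary(exchange)


# Naive keyword checks stand in for whatever classifier you actually use.
competitor_rule = InstructionSpec(
    name="decline competitor comparisons",
    trigger=lambda ex: "acme" in ex["user"].lower(),
    expected=lambda ex: "sales team" in ex["agent"].lower(),
    boundary=lambda ex: "$" in ex["agent"],
)

exchange = {
    "user": "Can you help me compare your pricing to Acme Corp?",
    "agent": "Sure! Acme charges $29/month for a similar tier...",
}
print(competitor_rule.verdict(exchange))  # False
```

Returning `None` when the trigger does not fire keeps irrelevant conversations out of the pass rate, so a quiet week does not look like a perfect one.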

Vague

Help users with their orders.

Testable

When a user asks about an existing order, look up the order using the Order Lookup action. Provide the current status, estimated delivery date, and tracking link if available. If the order is not found, ask the user to verify the order number. Do not offer to create a new order.

Vague

Be careful with sensitive topics.

Testable

If a user mentions a data breach, privacy concern, or GDPR request, do not attempt to resolve it. Respond with: 'This requires our privacy team. I'm transferring you now.' Then transfer to the Privacy Escalation topic.

Vague

Use the knowledge base to answer questions.

Testable

For questions about product specifications, search the Product Documentation source first. If no result is found, search the Community FAQ. If neither source has an answer, tell the user you don't have that information and offer to connect them with a product specialist.

Notice the pattern. Each testable instruction tells you exactly what to check in a transcript. Did the agent use the right action? Did it follow the fallback chain? Did it avoid the prohibited behavior? If you can read a transcript and definitively say "this instruction was followed" or "this instruction was violated," you have a testable instruction.
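The fallback chain in the last example can be checked the same way. This sketch assumes the transcript exposes which knowledge sources were searched and in what order, recorded here as hypothetical `(source_name, found_answer)` pairs:

```python
EXPECTED_ORDER = ["Product Documentation", "Community FAQ"]


def follows_fallback_chain(searches, offered_specialist):
    """Check one answer against the documented search order.

    `searches` is an assumed, ordered list of (source_name, found_answer)
    pairs; `offered_specialist` says whether the agent offered a handoff.
    """
    sources = [source for source, _ in searches]
    # The sources searched must be a prefix of the expected order.
    if not sources or sources != EXPECTED_ORDER[:len(sources)]:
        return False
    for index, (_, found) in enumerate(searches):
        if found:
            # The agent should stop at the first source that has an answer.
            return index == len(searches) - 1
    # Nothing was found: both sources must be tried, then a handoff offered.
    return len(searches) == len(EXPECTED_ORDER) and offered_specialist


print(follows_fallback_chain([("Product Documentation", True)], False))  # True
print(follows_fallback_chain([("Community FAQ", True)], False))          # False
```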

The feedback loop most teams skip

Writing testable instructions is the first step. The second step is actually running those tests, not once, but continuously.

Most teams treat agent instructions as a "set and forget" artifact. They write them during development, maybe tweak them after the first round of user feedback, then move on to the next project. Meanwhile, the agent keeps running. Knowledge sources get refreshed with new documents. Seasonal content changes the mix of user questions. A tool gets updated with new required parameters.

Instruction drift is silent. Your agent doesn't throw an error when it stops following an instruction. It just starts behaving differently. The only way to catch it is to continuously evaluate agent behavior against your instructions.
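One simple way to make drift visible is to track the pass rate of each instruction per period and flag large drops. The sketch below assumes you already have per-conversation verdicts; the week labels, instruction name, and the 10-point threshold are arbitrary placeholders.

```python
from collections import defaultdict


def pass_rates(results):
    """Aggregate (week, instruction_name, passed) tuples into pass rates."""
    totals = defaultdict(lambda: [0, 0])  # (week, name) -> [passed, total]
    for week, name, passed in results:
        totals[(week, name)][0] += int(passed)
        totals[(week, name)][1] += 1
    return {key: passed / total for key, (passed, total) in totals.items()}


def drifted(rates, name, baseline_week, current_week, max_drop=0.10):
    """Flag an instruction whose pass rate fell by more than max_drop."""
    drop = rates[(baseline_week, name)] - rates[(current_week, name)]
    return drop > max_drop


# Illustrative data only: 9 of 10 passes one week, 6 of 10 the next.
results = (
    [("2026-W20", "escalate_security", True)] * 9
    + [("2026-W20", "escalate_security", False)] * 1
    + [("2026-W21", "escalate_security", True)] * 6
    + [("2026-W21", "escalate_security", False)] * 4
)
rates = pass_rates(results)
print(drifted(rates, "escalate_security", "2026-W20", "2026-W21"))  # True
```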

The verification cycle

1. Define: Write instructions as testable assertions with clear triggers, behaviors, and boundaries.

2. Evaluate: Run test conversations against each instruction. Check real transcripts for violations.

3. Refine: When a test fails, update the instruction. When behavior drifts, investigate why.

This is the gap between teams that ship agents and teams that ship agents that stay reliable. The first group writes instructions. The second group treats instructions as living assertions and builds a feedback loop around them.

Agentowr automates this cycle. It ingests your agent's transcripts, evaluates them against the behavior you defined, and flags the moment something stops matching. You don't need to manually read transcripts or wait for user complaints. You get a signal the moment an instruction is violated.

Start auditing, stop guessing

Go look at your agent's instructions right now. For each line, ask yourself: "Can I write a test for this?" If the answer is no, rewrite it until you can. Then ask the harder question: "Am I actually running that test?"

The agents that work in production aren't the ones with the most instructions or the cleverest prompts. They're the ones where someone is watching, measuring, and continuously verifying that the agent does what it was told to do.
