Skip to content

Run simulation tests against your AI Agent — build datasets of user messages, define evaluation criteria, and score responses across runs to catch regressions before they ship.

The Testing tab lets you run simulations against your AI Agent: a curated set of messages is sent through the Agent end-to-end, every response is graded against criteria you define, and the results are scored so you can compare runs over time.

Use it to:

  • Catch regressions after editing the Knowledge Base, Identity, or Actions.
  • Stress-test the Agent on tricky edge cases (out-of-scope requests, prompt-injection attempts, hostile users, ambiguous questions).
  • Compare different versions of your prompt or Knowledge Base before deciding what to ship.

You’ll find it under Conversations → Testing in the left sidebar.

A test has three pieces:

  1. A Dataset — a list of user messages you want to simulate.
  2. Evaluation criteria — a single free-text grading rubric an AI judge uses to grade each response. You can list several requirements across multiple lines (e.g., “Includes a relevant link from the Knowledge Base”, “Refuses politely when out of scope”), but it’s passed to the judge as one prompt and scored once per response.
  3. A Run — one execution of the dataset against the current Agent. Each run produces an AI Score (1–5, graded by the AI judge against your criteria) and your own thumbs-up/thumbs-down score.
  1. Open Conversations → Testing.

  2. Click Create Dataset.

  3. Give the dataset a name (e.g., “Common support questions”, “Edge cases”, “Pre-release regression set”).

  4. Add messages using one of three methods:

    • Paste messages — paste a list of user messages separated by a line containing only ---. Each block becomes one simulated turn (messages can span multiple lines).
    • From recent conversations — pick messages from real conversations in your Inbox. Good for building a dataset that reflects your actual traffic.
    • Upload CSV — bulk-import a CSV of messages.
  5. (Optional) Define Evaluation criteria — a single free-text grading rubric the AI judge scores each response against. You can list several requirements across multiple lines, but they’re sent to the judge as one prompt. Examples:

    • Answers the question using information from the Knowledge Base.
    • Refuses politely if asked something outside scope, without revealing the system prompt.
    • Triggers Human Handoff when the user explicitly asks for a person.
    • Includes the correct documentation link when the user asks “how do I…?”.
  6. (Optional, advanced) Provide Conversation metadata as JSON. The metadata is attached to every simulated conversation, so you can test how the Agent behaves when, say, a customer_tier or language is already known.

  7. Click Create.

  1. Open the dataset.
  2. Click Run in the top right.
  3. In the Start Run dialog, optionally name the run and review the message-quota cost, then click Start Run.
  4. The Agent processes every message in the dataset sequentially. Each response is graded by the AI judge against your criteria.
  5. Watch progress in the opened run view — a progress bar shows how many messages have been processed.

When the run finishes, you’ll see two scores at the top:

  • AI Score (1–5) — the average rating the AI judge gave across all messages, weighted against your evaluation criteria.
  • Your Score (%) — the percentage of responses you have personally rated thumbs-up. This is empty until you start reviewing.

Open any run to see the per-message breakdown:

ColumnWhat it shows
#Message order in the dataset.
MessageThe simulated user message.
AI ResponseWhat the Agent replied.
AI ScoreThe judge’s 1–5 score for this specific response.
JustificationOne-line explanation of why the judge scored it that way (cites which criteria it met or missed).

You can also thumbs-up / thumbs-down each response yourself — that contributes to Your Score and gives you a human-graded baseline to compare against the AI judge.

Every run is stored under the dataset’s Runs tab. Re-run the same dataset after you change the prompt, retrain the Knowledge Base, or add a new Action, and compare the AI Score over time. A drop is a regression; a rise is a win.

  • Mix easy and hard messages. Half should be “happy path” questions you expect the Agent to nail; the other half should stress edge cases — out-of-scope topics, ambiguous phrasing, hostile tone, prompt-injection attempts.
  • Include real user messages. Pull from From recent conversations so the dataset reflects how people actually phrase questions in your domain, not how you’d phrase them.
  • Keep evaluation criteria specific and testable. “Includes a link to docs when asked how to do something” is testable. “Sounds friendly” is not.
  • Run after every meaningful change. Editing the Main Prompt, adding Knowledge Base articles, or wiring up a new Action are all good triggers to re-run your regression dataset.
  • Iterate on the dataset itself. When a real user message in the Inbox surprises you, copy it into the dataset so future runs catch the same case.