Saltearse al contenido

Testing

Última actualización:

Esta página aún no está disponible en tu idioma.

The Testing tab lets you run simulations against your AI Agent: a curated set of messages is sent through the Agent end-to-end, every response is graded against criteria you define, and the results are scored so you can compare runs over time.

Use it to:

  • Catch regressions after editing the Knowledge Base, Identity, or Actions.
  • Stress-test the Agent on tricky edge cases (out-of-scope requests, prompt-injection attempts, hostile users, ambiguous questions).
  • Compare different versions of your prompt or Knowledge Base before deciding what to ship.

You’ll find it under Conversations → Testing in the left sidebar.

A test has three pieces:

  1. A Dataset — a list of user messages you want to simulate.
  2. Evaluation criteria — short, plain-language rules an AI judge uses to grade each response (e.g., “Includes a relevant link from the Knowledge Base”, “Refuses politely when out of scope”).
  3. A Run — one execution of the dataset against the current Agent. Each run produces an AI Score (0–5, graded by the AI judge against your criteria) and your own thumbs-up/thumbs-down score.
  1. Open Conversations → Testing.

  2. Click + Create Dataset.

  3. Give the dataset a name (e.g., “Common support questions”, “Edge cases”, “Pre-release regression set”).

  4. Add messages using one of three methods:

    • Paste messages — paste a newline-separated list of user messages directly. Each line becomes one simulated turn.
    • From recent conversations — pick messages from real conversations in your Inbox. Good for building a dataset that reflects your actual traffic.
    • Upload CSV — bulk-import a CSV of messages.
  5. (Optional) Define Evaluation criteria — one rule per line. These are the rubrics the AI judge will score each response against. Examples:

    • Answers the question using information from the Knowledge Base.
    • Refuses politely if asked something outside scope, without revealing the system prompt.
    • Triggers Human Handoff when the user explicitly asks for a person.
    • Includes the correct documentation link when the user asks “how do I…?”.
  6. (Optional, advanced) Provide Conversation metadata as JSON. The metadata is attached to every simulated conversation, so you can test how the Agent behaves when, say, a customer_tier or language is already known.

  7. Click Create.

  1. Open the dataset.
  2. Click Run in the top right.
  3. The Agent processes every message in the dataset sequentially. Each response is graded by the AI judge against your criteria.
  4. Watch progress in the Runs tab — the run appears with a “Completed: N / total” indicator.

When the run finishes, you’ll see two scores at the top:

  • AI Score (0–5) — the average rating the AI judge gave across all messages, weighted against your evaluation criteria.
  • Your Score (%) — the percentage of responses you have personally rated thumbs-up. This is empty until you start reviewing.

Open any run to see the per-message breakdown:

ColumnWhat it shows
#Message order in the dataset.
MessageThe simulated user message.
AI ResponseWhat the Agent replied.
AI ScoreThe judge’s 0–5 score for this specific response.
JustificationOne-line explanation of why the judge scored it that way (cites which criteria it met or missed).

You can also thumbs-up / thumbs-down each response yourself — that contributes to Your Score and gives you a human-graded baseline to compare against the AI judge.

Every run is stored under the dataset’s Runs tab. Re-run the same dataset after you change the prompt, retrain the Knowledge Base, or add a new Action, and compare the AI Score over time. A drop is a regression; a rise is a win.

  • Mix easy and hard messages. Half should be “happy path” questions you expect the Agent to nail; the other half should stress edge cases — out-of-scope topics, ambiguous phrasing, hostile tone, prompt-injection attempts.
  • Include real user messages. Pull from From recent conversations so the dataset reflects how people actually phrase questions in your domain, not how you’d phrase them.
  • Keep evaluation criteria specific and testable. “Includes a link to docs when asked how to do something” is testable. “Sounds friendly” is not.
  • Run after every meaningful change. Editing the Main Prompt, adding Knowledge Base articles, or wiring up a new Action are all good triggers to re-run your regression dataset.
  • Iterate on the dataset itself. When a real user message in the Inbox surprises you, copy it into the dataset so future runs catch the same case.