Testing
Zuletzt bearbeitet:
Dieser Inhalt ist noch nicht in deiner Sprache verfügbar.
The Testing tab lets you run simulations against your AI Agent: a curated set of messages is sent through the Agent end-to-end, every response is graded against criteria you define, and the results are scored so you can compare runs over time.
Use it to:
- Catch regressions after editing the Knowledge Base, Identity, or Actions.
- Stress-test the Agent on tricky edge cases (out-of-scope requests, prompt-injection attempts, hostile users, ambiguous questions).
- Compare different versions of your prompt or Knowledge Base before deciding what to ship.
You’ll find it under Conversations → Testing in the left sidebar.
How it works
Section titled “How it works”A test has three pieces:
- A Dataset — a list of user messages you want to simulate.
- Evaluation criteria — short, plain-language rules an AI judge uses to grade each response (e.g., “Includes a relevant link from the Knowledge Base”, “Refuses politely when out of scope”).
- A Run — one execution of the dataset against the current Agent. Each run produces an AI Score (0–5, graded by the AI judge against your criteria) and your own thumbs-up/thumbs-down score.
Create a dataset
Section titled “Create a dataset”-
Open Conversations → Testing.
-
Click + Create Dataset.
-
Give the dataset a name (e.g., “Common support questions”, “Edge cases”, “Pre-release regression set”).
-
Add messages using one of three methods:
- Paste messages — paste a newline-separated list of user messages directly. Each line becomes one simulated turn.
- From recent conversations — pick messages from real conversations in your Inbox. Good for building a dataset that reflects your actual traffic.
- Upload CSV — bulk-import a CSV of messages.
-
(Optional) Define Evaluation criteria — one rule per line. These are the rubrics the AI judge will score each response against. Examples:
- Answers the question using information from the Knowledge Base.
- Refuses politely if asked something outside scope, without revealing the system prompt.
- Triggers Human Handoff when the user explicitly asks for a person.
- Includes the correct documentation link when the user asks “how do I…?”.
-
(Optional, advanced) Provide Conversation metadata as JSON. The metadata is attached to every simulated conversation, so you can test how the Agent behaves when, say, a
customer_tierorlanguageis already known. -
Click Create.
Run a dataset
Section titled “Run a dataset”- Open the dataset.
- Click Run in the top right.
- The Agent processes every message in the dataset sequentially. Each response is graded by the AI judge against your criteria.
- Watch progress in the Runs tab — the run appears with a “Completed: N / total” indicator.
When the run finishes, you’ll see two scores at the top:
- AI Score (0–5) — the average rating the AI judge gave across all messages, weighted against your evaluation criteria.
- Your Score (%) — the percentage of responses you have personally rated thumbs-up. This is empty until you start reviewing.
Review a run
Section titled “Review a run”Open any run to see the per-message breakdown:
| Column | What it shows |
|---|---|
# | Message order in the dataset. |
| Message | The simulated user message. |
| AI Response | What the Agent replied. |
| AI Score | The judge’s 0–5 score for this specific response. |
| Justification | One-line explanation of why the judge scored it that way (cites which criteria it met or missed). |
You can also thumbs-up / thumbs-down each response yourself — that contributes to Your Score and gives you a human-graded baseline to compare against the AI judge.
Compare runs
Section titled “Compare runs”Every run is stored under the dataset’s Runs tab. Re-run the same dataset after you change the prompt, retrain the Knowledge Base, or add a new Action, and compare the AI Score over time. A drop is a regression; a rise is a win.
Tips for effective testing
Section titled “Tips for effective testing”- Mix easy and hard messages. Half should be “happy path” questions you expect the Agent to nail; the other half should stress edge cases — out-of-scope topics, ambiguous phrasing, hostile tone, prompt-injection attempts.
- Include real user messages. Pull from From recent conversations so the dataset reflects how people actually phrase questions in your domain, not how you’d phrase them.
- Keep evaluation criteria specific and testable. “Includes a link to docs when asked how to do something” is testable. “Sounds friendly” is not.
- Run after every meaningful change. Editing the Main Prompt, adding Knowledge Base articles, or wiring up a new Action are all good triggers to re-run your regression dataset.
- Iterate on the dataset itself. When a real user message in the Inbox surprises you, copy it into the dataset so future runs catch the same case.