# Testing

> Run simulation tests against your AI Agent — build datasets of user messages, define evaluation criteria, and score responses across runs to catch regressions before they ship.

The **Testing** tab lets you run **simulations** against your AI Agent: a curated set of messages is sent through the Agent end-to-end, every response is graded against criteria you define, and the results are scored so you can compare runs over time.

Use it to:

- Catch regressions after editing the Knowledge Base, Identity, or Actions.
- Stress-test the Agent on tricky edge cases (out-of-scope requests, prompt-injection attempts, hostile users, ambiguous questions).
- Compare different versions of your prompt or Knowledge Base before deciding what to ship.

You'll find it under **Conversations → Testing** in the left sidebar.

## How it works

A test has three pieces:

1. A **Dataset** — a list of user messages you want to simulate.
2. **Evaluation criteria** — short, plain-language rules an AI judge uses to grade each response (e.g., _"Includes a relevant link from the Knowledge Base"_, _"Refuses politely when out of scope"_).
3. A **Run** — one execution of the dataset against the current Agent. Each run produces an AI Score (0–5, graded by the AI judge against your criteria) and your own thumbs-up/thumbs-down score.

## Create a dataset

1. Open **Conversations → Testing**.
2. Click **+ Create Dataset**.
3. Give the dataset a **name** (e.g., _"Common support questions"_, _"Edge cases"_, _"Pre-release regression set"_).
4. Add messages using one of three methods:

   - **Paste messages** — paste a newline-separated list of user messages directly. Each line becomes one simulated turn.
   - **From recent conversations** — pick messages from real conversations in your Inbox. Good for building a dataset that reflects your actual traffic.
   - **Upload CSV** — bulk-import a CSV of messages.

5. (Optional) Define **Evaluation criteria** — one rule per line. These are the rubrics the AI judge will score each response against. Examples:
   - _Answers the question using information from the Knowledge Base._
   - _Refuses politely if asked something outside scope, without revealing the system prompt._
   - _Triggers Human Handoff when the user explicitly asks for a person._
   - _Includes the correct documentation link when the user asks "how do I...?"._

6. (Optional, advanced) Provide **Conversation metadata** as JSON. The metadata is attached to every simulated conversation, so you can test how the Agent behaves when, say, a `customer_tier` or `language` is already known.

7. Click **Create**.

:::tip
Start small. A focused 20–30 message dataset that covers your top user intents will catch most regressions. You can always add more messages later.
:::

## Run a dataset

1. Open the dataset.
2. Click **Run** in the top right.
3. The Agent processes every message in the dataset sequentially. Each response is graded by the AI judge against your criteria.
4. Watch progress in the **Runs** tab — the run appears with a "Completed: N / total" indicator.

When the run finishes, you'll see two scores at the top:

- **AI Score** (0–5) — the average rating the AI judge gave across all messages, weighted against your evaluation criteria.
- **Your Score** (%) — the percentage of responses you have personally rated thumbs-up. This is empty until you start reviewing.

## Review a run

Open any run to see the per-message breakdown:

| Column | What it shows |
|---|---|
| `#` | Message order in the dataset. |
| Message | The simulated user message. |
| AI Response | What the Agent replied. |
| AI Score | The judge's 0–5 score for this specific response. |
| Justification | One-line explanation of *why* the judge scored it that way (cites which criteria it met or missed). |

You can also thumbs-up / thumbs-down each response yourself — that contributes to **Your Score** and gives you a human-graded baseline to compare against the AI judge.

## Compare runs

Every run is stored under the dataset's **Runs** tab. Re-run the same dataset after you change the prompt, retrain the Knowledge Base, or add a new Action, and compare the AI Score over time. A drop is a regression; a rise is a win.

## Tips for effective testing

- **Mix easy and hard messages.** Half should be "happy path" questions you expect the Agent to nail; the other half should stress edge cases — out-of-scope topics, ambiguous phrasing, hostile tone, prompt-injection attempts.
- **Include real user messages.** Pull from **From recent conversations** so the dataset reflects how people actually phrase questions in your domain, not how you'd phrase them.
- **Keep evaluation criteria specific and testable.** _"Includes a link to docs when asked how to do something"_ is testable. _"Sounds friendly"_ is not.
- **Run after every meaningful change.** Editing the Main Prompt, adding Knowledge Base articles, or wiring up a new Action are all good triggers to re-run your regression dataset.
- **Iterate on the dataset itself.** When a real user message in the Inbox surprises you, copy it into the dataset so future runs catch the same case.