# Evaluations

> Measure and improve agent quality with test cases

Evaluations help you systematically test your agent. Create test cases with expected outcomes, run them against your agent, and track quality over time.

## Why Evaluate?

Manual testing in the Playground is useful for exploration, but evaluations provide:

* **Consistency** - Same tests run the same way every time
* **Regression detection** - Know immediately if changes break something
* **Quality metrics** - Track scores over time
* **Documentation** - Test cases describe expected behavior

## The Evaluate Page

Open any agent and go to the **Evaluate** section. You'll see:

* **Test Cases** - Your collection of input/expected output pairs
* **Run History** - Past evaluation runs and their results
* **Current Run** - Real-time progress when an evaluation is running

## Creating Test Cases

A test case defines:

| Field               | Description                         |
| ------------------- | ----------------------------------- |
| **Input**           | The user message to send            |
| **Expected Output** | What the agent should say or do     |
| **Criteria**        | Specific things to check (optional) |
| **Tags**            | Categories for organizing tests     |
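
As a rough illustration, the four fields above can be modelled as a small record. This is a minimal sketch: the class name `EvalCase` and the snake_case field names are assumptions made for the example, not the platform's actual schema.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class EvalCase:
    """Illustrative shape of one test case. Field names are assumptions."""
    input: str                                          # the user message to send
    expected_output: str                                # what the agent should say or do
    criteria: list[str] = field(default_factory=list)   # optional specific checks
    tags: list[str] = field(default_factory=list)       # labels for organizing tests


case = EvalCase(
    input="How do I reset my password?",
    expected_output=(
        "Step-by-step guide for resetting the password, "
        "including the link to the self-service portal."
    ),
    criteria=["contains_link"],
    tags=["password-reset"],
)

# Serialize the record so it can be stored or shared as JSON.
print(json.dumps(asdict(case), indent=2))
```

Only the input and expected output are required in this sketch; criteria and tags default to empty lists, mirroring the optional nature of criteria in the table.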

### From Scratch

1. Click **Create Test Case**
2. Enter the user input
3. Describe the expected output
4. Add any specific criteria
5. Save

### From Playground

When you have a good conversation in the Playground:

1. Note the user message and agent response
2. Go to Evaluate and create a test case
3. Use the conversation as your baseline

### Bulk Import

To add many test cases at once:

1. Click **Import**
2. Upload a JSON or CSV file with your test cases
3. Map the columns to fields
4. Import
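
If your test cases live in code or a spreadsheet export, a CSV file for upload could be generated along these lines. The column names here are hypothetical; because the import step lets you map columns to fields, they do not need to match any fixed schema.

```python
import csv
import io

# Hypothetical column names, chosen for this example only.
FIELDNAMES = ["input", "expected_output", "criteria", "tags"]

cases = [
    {
        "input": "How do I reset my password?",
        "expected_output": "Step-by-step reset guide with a link to the self-service portal.",
        "criteria": "contains_link",
        "tags": "password-reset",
    },
    {
        "input": "Can you book a flight for me?",
        "expected_output": "Politely explains that booking flights is out of scope.",
        "criteria": "polite_tone",
        "tags": "edge-case",
    },
]


def to_csv(rows):
    """Serialize test cases to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv(cases)
print(csv_text)
```

To produce a file for upload, write `csv_text` to disk, for example with `open("test_cases.csv", "w", newline="")`.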

## Writing Good Test Cases

### Be specific about expectations

Instead of a vague expectation such as "should be helpful", state what the response must contain:

```
Expected: Agent should search the knowledge base and provide 
a step-by-step guide for resetting the password. Should include 
the link to the self-service portal.
```

### Cover different scenarios

Build a diverse test suite:

* Happy paths (common, expected use)
* Edge cases (unusual inputs)
* Error handling (what happens when things go wrong)
* Out of scope (things the agent shouldn't do)

### Use criteria for precision

Criteria let you check specific aspects:

| Criterion          | What It Checks                |
| ------------------ | ----------------------------- |
| `contains_link`    | Response includes a URL       |
| `mentions_product` | Specific product name appears |
| `polite_tone`      | Response is professional      |
| `under_100_words`  | Response is under 100 words   |
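
To make the intent of two of these criteria concrete, here is a local sketch of what they might check. How the platform actually evaluates criteria is not described on this page, so the functions and the URL pattern below are assumptions for illustration only.

```python
import re

# Simple pattern for http(s) links; an assumption, not the platform's rule.
URL_PATTERN = re.compile(r"https?://\S+")


def contains_link(response: str) -> bool:
    """True if the response includes at least one http(s) URL."""
    return URL_PATTERN.search(response) is not None


def under_100_words(response: str) -> bool:
    """True if the response has fewer than 100 whitespace-separated words."""
    return len(response.split()) < 100


CHECKS = {
    "contains_link": contains_link,
    "under_100_words": under_100_words,
}


def check(response: str, criteria: list[str]) -> dict[str, bool]:
    """Run each named criterion against the response."""
    return {name: CHECKS[name](response) for name in criteria}


result = check(
    "Reset it at https://example.com/reset",
    ["contains_link", "under_100_words"],
)
print(result)
```

Criteria such as `polite_tone` cannot be reduced to a simple rule like this, which is one reason to pair criteria with a clear expected output.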

### Tag for organization

Use tags to group related tests:

* `password-reset` - All password-related tests
* `billing` - Payment and subscription tests
* `edge-case` - Unusual scenarios
* `regression` - Tests for fixed bugs

## Running Evaluations

### Run All Tests

1. Click **Run Evaluation**
2. Wait for all tests to complete
3. Review results

### Run Single Test

To test one case:

1. Find the test case
2. Click the play button
3. See the result inline

### Run with Different Models

Compare how different models perform:

1. Click **Run Evaluation**
2. Select a different model from the dropdown
3. Compare results to previous runs

## Understanding Results

After a run completes, you'll see:

### Summary

* **Score** - Overall pass rate (0-100%)
* **Passed/Failed** - Count of each
* **Duration** - How long the run took
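
Since the overall score is the pass rate, it can be reproduced from the per-test statuses. The sketch below assumes statuses are available as the strings `"pass"` and `"fail"`; that representation is hypothetical.

```python
def summarize(statuses):
    """Return passed/failed counts and the pass rate as a percentage."""
    passed = sum(1 for status in statuses if status == "pass")
    failed = len(statuses) - passed
    # Guard against an empty run to avoid dividing by zero.
    score = 100 * passed / len(statuses) if statuses else 0.0
    return {"passed": passed, "failed": failed, "score": round(score, 1)}


summary = summarize(["pass", "pass", "fail", "pass"])
print(summary)  # {'passed': 3, 'failed': 1, 'score': 75.0}
```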

### Per-Test Results

For each test case:

| Field             | Description                              |
| ----------------- | ---------------------------------------- |
| **Status**        | Pass or Fail                             |
| **Score**         | How well it matched expectations (0-100) |
| **Actual Output** | What the agent actually said             |
| **Feedback**      | Explanation of scoring                   |

### Regression Detection

When you run multiple evaluations, the system detects:

* **Improvements** - Tests that now pass
* **Regressions** - Tests that used to pass but now fail
* **Score changes** - How individual test scores changed
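
The three categories above amount to a diff between two runs. The following sketch shows that logic under the assumption that each run is a mapping from test name to a status and score; the data shape and the function name `diff_runs` are hypothetical, and the platform performs this comparison for you.

```python
def diff_runs(previous, current):
    """Compare two runs keyed by test name.

    Each run maps a test name to {"status": "pass" | "fail", "score": int}.
    Tests that appear in only one of the runs are ignored.
    """
    improvements, regressions, score_changes = [], [], {}
    for name in sorted(previous.keys() & current.keys()):
        before, after = previous[name], current[name]
        if before["status"] == "fail" and after["status"] == "pass":
            improvements.append(name)
        elif before["status"] == "pass" and after["status"] == "fail":
            regressions.append(name)
        delta = after["score"] - before["score"]
        if delta != 0:
            score_changes[name] = delta
    return {
        "improvements": improvements,
        "regressions": regressions,
        "score_changes": score_changes,
    }


previous = {
    "reset-password": {"status": "pass", "score": 90},
    "billing-refund": {"status": "fail", "score": 40},
    "out-of-scope": {"status": "pass", "score": 85},
}
current = {
    "reset-password": {"status": "fail", "score": 55},
    "billing-refund": {"status": "pass", "score": 80},
    "out-of-scope": {"status": "pass", "score": 85},
}

report = diff_runs(previous, current)
print(report)
```

Here `reset-password` is a regression, `billing-refund` is an improvement, and `out-of-scope` is unchanged, so it does not appear in the score changes.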

## Working with History

### Compare Runs

Select two runs to see a side-by-side comparison:

* Which tests changed status
* Score differences
* Which tests improved and which regressed

### Delete Runs

To remove old evaluation data:

1. Find the run in history
2. Click the delete button
3. Confirm

### Export Results

Export evaluation data for reporting:

1. Select a run
2. Click **Export**
3. Download as JSON or CSV

## Best Practices

<AccordionGroup>
  <Accordion title="Run before publishing">
    Always run your full test suite before publishing changes. Catch regressions before users do.
  </Accordion>

  <Accordion title="Add tests for bugs">
    When you find and fix a bug, add a test case to prevent it from returning.
  </Accordion>

  <Accordion title="Review failures carefully">
    A failing test might mean the agent is wrong, or it might mean the test expectation needs updating.
  </Accordion>

  <Accordion title="Keep tests maintainable">
    Overly specific expectations break easily. Focus on what matters, not exact wording.
  </Accordion>

  <Accordion title="Test tool usage">
    Include tests that verify the agent uses the right tools for the right tasks.
  </Accordion>
</AccordionGroup>

## Advanced: Tool Expectations

For agents with tools, you can specify expected tool usage:

```json
{
  "input": "What's the weather in Paris?",
  "expected_tools": [
    { "name": "weather_lookup", "arguments": { "city": "Paris" } }
  ],
  "forbidden_tools": ["calendar"]
}
```

This checks that the agent:

* Calls each expected tool with the specified arguments
* Doesn't call forbidden tools
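
The two checks can be sketched as follows. This is an illustration, not the platform's implementation: the shape of the recorded tool calls and the exact-match comparison of arguments are assumptions.

```python
def check_tools(test_case, actual_calls):
    """Return a list of failure messages; an empty list means the check passed.

    `actual_calls` is assumed to be a list of {"name": str, "arguments": dict}
    entries recorded during the agent's run.
    """
    failures = []

    # Every expected tool must have been called with matching arguments.
    for expected in test_case.get("expected_tools", []):
        matched = any(
            call["name"] == expected["name"]
            and call.get("arguments", {}) == expected.get("arguments", {})
            for call in actual_calls
        )
        if not matched:
            failures.append(f"missing expected call: {expected['name']}")

    # No forbidden tool may have been called at all.
    forbidden = set(test_case.get("forbidden_tools", []))
    for call in actual_calls:
        if call["name"] in forbidden:
            failures.append(f"forbidden tool called: {call['name']}")

    return failures


test_case = {
    "input": "What's the weather in Paris?",
    "expected_tools": [
        {"name": "weather_lookup", "arguments": {"city": "Paris"}}
    ],
    "forbidden_tools": ["calendar"],
}

good_run = [{"name": "weather_lookup", "arguments": {"city": "Paris"}}]
bad_run = [
    {"name": "weather_lookup", "arguments": {"city": "London"}},
    {"name": "calendar", "arguments": {}},
]

print(check_tools(test_case, good_run))
print(check_tools(test_case, bad_run))
```

The first run produces no failures. The second fails both checks: the weather lookup used the wrong city, and the forbidden `calendar` tool was called.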

## Next Steps

<CardGroup cols="2">
  <Card title="Configure settings" icon="gear" href="./settings">
    Set up retention, sharing, and safety guardrails
  </Card>

  <Card title="View analytics" icon="chart-line" href="./analytics">
    See how users interact with your published agent
  </Card>
</CardGroup>
