Framework Flow
A run moves from a release question through domain selection, target integration, seeded behavior coverage, and trace judging to evidence. The flow stays the same across domains; only the behavior language each domain supplies changes.
Question → Domain → Swarm → Trace → Judge → Evidence
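The run flow can be sketched as a minimal pipeline. All names below (`Trace`, `run`, the dict-shaped evidence) are illustrative assumptions, not the framework's actual API:

```python
from dataclasses import dataclass

# Hypothetical sketch of the flow: question -> swarm -> traces -> judge -> evidence.
# Class, function, and field names are illustrative, not the framework's real schema.

@dataclass
class Trace:
    actor: str      # seeded user or task that exercised the target
    prompt: str     # input sent to the target
    response: str   # recorded target output

def run(question: str, swarm: list[str], target, judge) -> dict:
    """Drive each seeded actor against the target, judge the traces, emit evidence."""
    traces = [Trace(actor, f"task:{actor}", target(actor)) for actor in swarm]
    scores = [judge(t) for t in traces]
    return {
        "question": question,
        "traces": [t.__dict__ for t in traces],
        "scores": scores,
        "pass_rate": sum(scores) / len(scores),
    }
```

With a stub `target` and `judge`, the same swarm can be replayed against each release and the resulting evidence dicts compared.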
Core Terms
| Term | Meaning |
|---|---|
| Release question | The behavior risk the team wants evidence for before launch. |
| Swarm | A repeatable set of users, queries, tasks, journeys, or scenarios used to exercise the target. |
| Target | The AI system under test: a service, callable, agent graph, or protocol endpoint. |
| Trace | The recorded interaction between a seeded actor/task and the target. |
| Judge | The domain-owned scorer that interprets completed traces. |
| Evidence | Human-readable and machine-readable artifacts used for release review. |
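To make the terms concrete, here is a hedged sketch of how a judged trace and a machine-readable evidence artifact might be shaped; the field names and JSON layout are assumptions, not the framework's real schema:

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical shapes for the core terms; real schemas belong to the framework.

@dataclass
class JudgedTrace:
    trace_id: str
    actor: str
    verdict: str   # domain-owned label, e.g. "safe" / "policy_violation"
    score: float

def to_evidence(release_question: str, judged: list[JudgedTrace]) -> str:
    """Serialize judged traces into a machine-readable evidence artifact."""
    return json.dumps({
        "release_question": release_question,
        "results": [asdict(j) for j in judged],
    }, indent=2)
```

The same structure can back the human-readable review view, since each result keeps the domain's own verdict language.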
What Makes It Different
- The run is replayable enough to compare releases, not a one-off prompt review.
- The judge is domain-shaped, so the evidence uses the right failure language.
- The artifacts preserve inputs, outputs, traces, manifests, and compare decisions.
- Generation is an optional coverage layer, not the source of truth for scoring.
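Because runs are replayable and artifacts keep manifests, two releases can be compared mechanically. A minimal sketch of such a compare decision, assuming a manifest with `run_id` and `pass_rate` fields and an illustrative regression threshold:

```python
# Illustrative release-over-release compare using run manifests.
# The manifest layout and the 0.05 threshold are assumptions, not the framework's format.

def compare_runs(baseline: dict, candidate: dict, regression_threshold: float = 0.05) -> dict:
    """Compare pass rates of two replayed runs and record the decision."""
    delta = candidate["pass_rate"] - baseline["pass_rate"]
    return {
        "baseline": baseline["run_id"],
        "candidate": candidate["run_id"],
        "delta": delta,
        "decision": "regression" if delta < -regression_threshold else "ok",
    }
```

Storing this decision alongside the traces and manifests gives release review a single artifact to inspect.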