Behavioral evidence for AI products

Find the AI behavior your tests miss.

Evidpath runs controlled domain trials before launch, then writes judged traces and evidence your team can review, rerun, and compare.

Hidden failures

trust collapse, freshness gaps, tool misuse

Rerunnable evidence

fixed seeds, traces, reports, manifests

Launch decisions

baseline vs candidate before release

The category shift

AI products do not just return outputs. They exhibit behavior.

Unit tests and thresholds still matter. They just do not tell you which users lose trust, which queries miss intent, which agents misuse tools, or whether a candidate changed behavior in a risky slice.

Testing gap
aggregate pass can hide behavior risk
Deterministic test shape
input -> function -> assert -> pass

Same input, same output. The test can be exact.

AI behavior shape
aggregate metric -> passes
low-patience user -> trust collapse
time-sensitive query -> freshness gap
agent task -> tool-use risk
candidate compare -> needs review

The average can pass while a user journey fails.

Client benefits

What changes before launch.

Evidpath gives teams a controlled way to find behavior failures, rerun the same coverage, compare release candidates, and leave a launch packet that can be reviewed by humans.

Catch hidden behavior failures

Find user trust loss, freshness gaps, tool misuse, and weak edge behavior before customers do.

Replace vibe checks

Turn release concerns into repeatable domain runs with saved traces, judges, and rerunnable evidence.

Compare release candidates

Run baseline and candidate systems through comparable coverage before approving a change.

Leave a launch packet

Give reviewers reports, JSON, trace ledgers, manifests, and domain-specific failure language.

Evidpath evidence run
DOMAIN PILOT
$evidpath audit --domain agents --scenario current-info-tool-use --driver-config-path ./driver_config.json

Run path

Domain -> Coverage -> Judge -> Evidence

run-28a8a5cb16eb
Domain output: product pack selected

Each domain gives the run its target shape, scenario grammar, judge, and report language.

Domain products

Choose the behavior trial for the system you ship.

Evidpath is the shared evidence engine. The domain product gives it the contract, scenario language, judge, and report vocabulary for a specific AI system.

Generated swarms: most mature

Evidpath for Recommenders

Run repeatable user coverage against recommendation slates before release changes reach real users.

Failure language

slate repetition / novelty drift / cold start / trust collapse / abandonment

Integration paths

native HTTP / schema-mapped HTTP / Python callable / Hugging Face adapter

Open domain guide
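Of those integration paths, the Python callable is the simplest to picture. A minimal sketch, assuming a contract of user context in, ranked slate out; the real adapter signature lives in the domain guide:

from typing import Any

CATALOG = ["item-a", "item-b", "item-c", "item-d", "item-e"]

def recommend(context: dict[str, Any]) -> list[str]:
    """Hypothetical callable target: one seeded actor step in, one slate out."""
    seen = set(context.get("history", []))
    # Toy policy: unseen items first, then truncate to the slate size.
    fresh = [item for item in CATALOG if item not in seen]
    stale = [item for item in CATALOG if item in seen]
    return (fresh + stale)[: context.get("slate_size", 3)]

# A run would call this once per actor step, e.g.
# recommend({"user_id": "agent-7", "history": ["item-a"], "slate_size": 3})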

Public audit and compare domain

Evidpath for Search

Run repeatable query coverage against rankers to surface relevance, freshness, ambiguity, and zero-result risks.

Failure language

relevance loss / freshness gaps / ambiguous intent / typo recovery / personalization drift

Integration paths

native HTTP / schema-mapped HTTP / Python callable

Open domain guide
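For the native HTTP path, the target is just an endpoint the run can call. A minimal sketch of a toy ranker; the route and payload shape are illustrative assumptions, not the documented contract:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    text: str

DOCS = [
    {"id": "doc-1", "title": "release notes 2024", "age_days": 3},
    {"id": "doc-2", "title": "release notes 2021", "age_days": 900},
]

@app.post("/search")
def search(query: Query) -> list[dict]:
    # Toy ranker: newest first. The run judges what comes back for
    # freshness, ambiguity, and zero-result behavior.
    return sorted(DOCS, key=lambda d: d["age_days"])

# Serve it locally (uvicorn search_target:app --port 8051),
# then point --target-url at it.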

Public trajectory domain

Evidpath for Agents

Run repeatable task coverage against agents to evaluate tool use, grounding, refusal behavior, state, and latency.

Failure language

tool misuse / ungrounded answer / refusal failure / state loss / latency cliff

Integration paths

Python/LangGraph / OpenAI-compatible / Anthropic / MCP stdio / HTTP session

Open domain guide
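For the MCP stdio path, the target exposes tools over standard I/O. A minimal sketch using FastMCP from the MCP Python SDK; the tool is a stand-in, and how the agents domain attaches to the server is assumed here:

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-tools")

@mcp.tool()
def current_time() -> str:
    """Current UTC time, so tool-use scenarios have something real to call."""
    from datetime import datetime, timezone
    return datetime.now(timezone.utc).isoformat()

if __name__ == "__main__":
    # stdio transport matches the MCP stdio integration path above.
    mcp.run(transport="stdio")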

The method

Turn a release question into a controlled behavior trial.

Evidpath does not stop at a manual spot check. It builds a domain-shaped run, calls the target through the right integration path, judges completed traces, then writes artifacts a team can inspect and rerun.

Command surface
public domains: recommender, search, agents
Repeatable run
evidpath audit --domain search --target-url http://127.0.0.1:8051 --scenario time-sensitive-query
Search evidence
freshness percentile low
ambiguous intent preserved
top bucket relevance 0.82
run_plan.json / run_manifest.json / report.md / results.json / traces.jsonl / regression_summary.json
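Reviewing that evidence can be as plain as reading the files. A sketch with hypothetical field names (trace_id, verdict, concern), not the documented trace schema:

import json

RUN_DIR = "runs/run-28a8a5cb16eb"  # hypothetical local layout

# traces.jsonl holds one judged trace per line.
with open(f"{RUN_DIR}/traces.jsonl") as f:
    traces = [json.loads(line) for line in f]

# Surface anything the judge flagged for human review.
for trace in traces:
    if trace.get("verdict") != "pass":
        print(trace.get("trace_id"), trace.get("concern"))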

Coverage model

Swarms are the repeatable coverage beneath each domain.

A swarm is coverage with memory: seeded actors, scenarios, journeys, tasks, and saved plans that can be rerun when a target changes. Generated coverage is currently strongest for the recommender domain.
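The seeding pattern underneath is ordinary and worth seeing. A sketch, with illustrative persona fields, of why a fixed seed makes the same swarm reappear on rerun:

import random

def build_swarm(seed: int, size: int) -> list[dict]:
    """Deterministically expand one seed into a replayable actor population."""
    rng = random.Random(seed)  # fixed seed -> identical swarm on rerun
    return [
        {
            "actor_id": f"agent-{i}",
            "patience": rng.choice(["low", "medium", "high"]),
            "intent": rng.choice(["browse", "time-sensitive", "cold-start"]),
        }
        for i in range(size)
    ]

# Same seed, same coverage: build_swarm(28, 50) == build_swarm(28, 50)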

Swarm coverage builder
seeded, replayable, domain-aware
Swarm shaped from a release question
1

Question

what could fail before launch?

2

Domain

recommender, search, or agents

3

Swarm

users, queries, or tasks

4

Run plan

seeded and replayable

Swarm trace preview
agent-7 / low patience / abandonment risk
query-12 / time sensitive / freshness gap
task-21 / tool use / grounding check
candidate / baseline compare / no material change

domain-shaped

artifact-backed

Launch evidence

The output is a release packet, not a slogan.

Reports should name the domain, the run, the trace, the concern, and the files behind the conclusion. That is the difference between “we tried it” and repeatable launch review.

Output guide
Launch evidence dossier
run_id run-28a8a5cb16eb

Executive summary

release question answered with trace-backed evidence

Domain finding

trust collapse, freshness gap, or tool-use regression

Trace ledger

seeded user/task interactions preserved step by step

Semantic advisory

optional explanation sidecar, non-gating

Regression summary

baseline -> candidate with pass, warn, or fail

Artifact manifest

environment, inputs, outputs, and content hashes

Artifact bundle
run_plan.json
run_manifest.json
report.md
results.json
traces.jsonl
regression_summary.json
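Because the manifest carries content hashes, a reviewer can confirm the bundle on disk is the bundle that was judged. A sketch that assumes a manifest layout of filename-to-sha256 pairs under an artifacts key, which is illustrative rather than the documented schema:

import hashlib
import json
import pathlib

bundle = pathlib.Path("runs/run-28a8a5cb16eb")

# Assumed shape: {"artifacts": {"report.md": "<sha256>", ...}}
manifest = json.loads((bundle / "run_manifest.json").read_text())

for name, expected in manifest["artifacts"].items():
    actual = hashlib.sha256((bundle / name).read_bytes()).hexdigest()
    print(name, "ok" if actual == expected else "CHANGED SINCE RUN")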

Platform shape

One evidence engine. Multiple domain products.

The platform gives every domain the same repeatable run engine, trace ledger, regression workflow, and artifact model. The domain product gives each customer the scenarios, contract, judge, and failure language that match the system they ship.

Maturity note

Generated swarm planning is most mature for recommenders today and is expanding across search and agent domains.

Shared platform capabilities

seeded runs / trace capture / domain judging / artifact manifests / baseline vs candidate compare / CI-ready evidence
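CI-ready can mean a gate as small as reading regression_summary.json and failing the pipeline. A sketch that assumes a top-level status field with pass, warn, or fail values; the real schema may differ:

import json
import sys

summary = json.load(open("runs/run-28a8a5cb16eb/regression_summary.json"))
status = summary.get("status", "fail")  # assumed field name

if status == "fail":
    sys.exit("candidate regressed against baseline; blocking release")
if status == "warn":
    print("warnings present; route to human review")
else:
    print("no material change; safe to proceed")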
Platform stack
shared core + domain modules
1

Evidence engine

Seeds, planning, execution

2

Domain product

Contract, scenarios, judge

3

Target run

HTTP, Python, protocol driver

4

Evidence

Traces, reports, manifests

5

Release

Compare, policy, review

Platform flow

1 Release question: What behavior could break before launch?
2 Domain coverage: Users, queries, or tasks shaped for the product line.
3 Target interaction: HTTP, Python, or protocol drivers call the system.
4 Judged traces: Domain judges score behavior and surface risks.
5 Launch evidence: Reports, JSON, traces, manifests, and compare decisions.

Repeatable behavior trials

Fixed seeds, scenarios, tasks, and reruns make behavior reviewable instead of anecdotal.

Domain product packs

Each domain owns target shape, scenario grammar, judge, metrics, and failure language.

Generated coverage where mature

Recommender briefs can become structured scenarios, populations, swarms, and reusable run plans.

Launch comparison

Compare runs, regression policy, saved plans, and CI-ready outputs turn testing into release review.

Product manual

Docs organized by the jobs teams need.

The docs explain the swarm model after the buyer story is clear: choose a domain product, pick an integration path, run workflows, and read the evidence packet.

Domain pilot

Have an AI system with behavior risk before launch?

Bring the target. We will help you choose the domain product, integration path, and first release question to turn into trace-backed evidence.

Request a domain pilot