Behavioral evidence for AI products

Find the AI behavior your tests miss.

Evidpath runs controlled domain trials before launch, then writes judged traces and evidence your team can review, rerun, and compare.

Hidden failures

trust collapse, freshness gaps, tool misuse

Rerunnable evidence

fixed seeds, traces, reports, manifests

Launch decisions

baseline vs candidate before release

The category shift

AI products do not just return outputs. They exhibit behavior.

Unit tests and thresholds still matter. They just do not tell you which users lose trust, which queries miss intent, which agents misuse tools, or whether a candidate changed behavior in a risky slice.

Testing gap
aggregate pass can hide behavior risk
Deterministic test shape
input -> function -> assert -> pass

Same input, same output. The test can be exact.

AI behavior shape
aggregate metric -> passes
low-patience user -> trust collapse
time-sensitive query -> freshness gap
agent task -> tool-use risk
candidate compare -> needs review

The average can pass while a user journey fails.

Client benefits

What changes before launch.

Evidpath gives teams a controlled way to find behavior failures, rerun the same coverage, compare release candidates, and leave a launch packet that can be reviewed by humans.

Catch hidden behavior failures

Find user trust loss, freshness gaps, tool misuse, and weak edge behavior before customers do.

Replace vibe checks

Turn release concerns into repeatable domain runs with saved traces, judges, and rerunnable evidence.

Compare release candidates

Run baseline and candidate systems through comparable coverage before approving a change.

Leave a launch packet

Give reviewers reports, JSON, trace ledgers, manifests, and domain-specific failure language.

Evidpath evidence run
DOMAIN PILOT
$evidpath audit --domain agents --scenario current-info-tool-use --driver-config-path ./driver_config.json

Run path

Domain -> Coverage -> Judge -> Evidence

run-28a8a5cb16eb
Domain output: product pack selected

Each domain gives the run its target shape, scenario grammar, judge, and report language.

Domain products

Choose the behavior trial for the system you ship.

Evidpath is the shared evidence engine. The domain product gives it the contract, scenario language, judge, and report vocabulary for a specific AI system.

Generated swarms: most mature

Evidpath for Recommenders

Run repeatable user coverage against recommendation slates before release changes reach real users.

Failure language

slate repetition / novelty drift / cold start / trust collapse / abandonment

Integration paths

native HTTP / schema-mapped HTTP / Python callable / Hugging Face adapter

Open domain guide
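Of those integration paths, the Python callable is the simplest to picture. A minimal sketch, assuming a contract of user context in, ranked slate out; the real adapter signature lives in the domain guide:

from typing import Any

CATALOG = ["item-a", "item-b", "item-c", "item-d", "item-e"]

def recommend(context: dict[str, Any]) -> list[str]:
    """Hypothetical callable target: one seeded actor step in, one slate out."""
    seen = set(context.get("history", []))
    # Toy policy: unseen items first, then truncate to the slate size.
    fresh = [item for item in CATALOG if item not in seen]
    stale = [item for item in CATALOG if item in seen]
    return (fresh + stale)[: context.get("slate_size", 3)]

# A run would call this once per actor step, e.g.
# recommend({"user_id": "agent-7", "history": ["item-a"], "slate_size": 3})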

Public audit and compare domain

Evidpath for Search

Run repeatable query coverage against rankers to surface relevance, freshness, ambiguity, and zero-result risks.

Failure language

relevance loss / freshness gaps / ambiguous intent / typo recovery / personalization drift

Integration paths

native HTTP / schema-mapped HTTP / Python callable

Open domain guide
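For the native HTTP path, the target is just an endpoint the run can call. A minimal sketch of a toy ranker; the route and payload shape are illustrative assumptions, not the documented contract:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    text: str

DOCS = [
    {"id": "doc-1", "title": "release notes 2024", "age_days": 3},
    {"id": "doc-2", "title": "release notes 2021", "age_days": 900},
]

@app.post("/search")
def search(query: Query) -> list[dict]:
    # Toy ranker: newest first. The run judges what comes back for
    # freshness, ambiguity, and zero-result behavior.
    return sorted(DOCS, key=lambda d: d["age_days"])

# Serve it locally (uvicorn search_target:app --port 8051),
# then point --target-url at it.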

Public trajectory domain

Evidpath for Agents

Run repeatable task coverage against agents to evaluate tool use, grounding, refusal behavior, state, and latency.

Failure language

tool misuse / ungrounded answer / refusal failure / state loss / latency cliff

Integration paths

Python/LangGraph / OpenAI-compatible / Anthropic / MCP stdio / HTTP session

Open domain guide
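For the MCP stdio path, the target exposes tools over standard I/O. A minimal sketch using FastMCP from the MCP Python SDK; the tool is a stand-in, and how the agents domain attaches to the server is assumed here:

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-tools")

@mcp.tool()
def current_time() -> str:
    """Current UTC time, so tool-use scenarios have something real to call."""
    from datetime import datetime, timezone
    return datetime.now(timezone.utc).isoformat()

if __name__ == "__main__":
    # stdio transport matches the MCP stdio integration path above.
    mcp.run(transport="stdio")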

The method

Turn a release question into a controlled behavior trial.

Evidpath does not stop at a manual spot check. It builds a domain-shaped run, calls the target through the right integration path, judges completed traces, then writes artifacts a team can inspect and rerun.

Command surface
public domains: recommender, search, agents
Repeatable run
evidpath audit --domain search --target-url http://127.0.0.1:8051 --scenario time-sensitive-query
Search evidence
freshness percentile low
ambiguous intent preserved
top bucket relevance 0.82
run_plan.json / run_manifest.json / report.md / results.json / traces.jsonl / regression_summary.json
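Reviewing that evidence can be as plain as reading the files. A sketch with hypothetical field names (trace_id, verdict, concern), not the documented trace schema:

import json

RUN_DIR = "runs/run-28a8a5cb16eb"  # hypothetical local layout

# traces.jsonl holds one judged trace per line.
with open(f"{RUN_DIR}/traces.jsonl") as f:
    traces = [json.loads(line) for line in f]

# Surface anything the judge flagged for human review.
for trace in traces:
    if trace.get("verdict") != "pass":
        print(trace.get("trace_id"), trace.get("concern"))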

Coverage model

Swarms are the repeatable coverage beneath each domain.

A swarm is coverage with memory: seeded actors, scenarios, journeys, tasks, and saved plans that can be rerun when a target changes. Generated coverage is currently strongest for the recommender domain.
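The seeding pattern underneath is ordinary and worth seeing. A sketch, with illustrative persona fields, of why a fixed seed makes the same swarm reappear on rerun:

import random

def build_swarm(seed: int, size: int) -> list[dict]:
    """Deterministically expand one seed into a replayable actor population."""
    rng = random.Random(seed)  # fixed seed -> identical swarm on rerun
    return [
        {
            "actor_id": f"agent-{i}",
            "patience": rng.choice(["low", "medium", "high"]),
            "intent": rng.choice(["browse", "time-sensitive", "cold-start"]),
        }
        for i in range(size)
    ]

# Same seed, same coverage: build_swarm(28, 50) == build_swarm(28, 50)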

Swarm coverage builder
seeded, replayable, domain-aware
Swarm shaped from a release question
1

Question

what could fail before launch?

2

Domain

recommender, search, or agents

3

Swarm

users, queries, or tasks

4

Run plan

seeded and replayable

Swarm trace preview
agent-7 / low patience / abandonment risk
query-12 / time sensitive / freshness gap
task-21 / tool use / grounding check
candidate / baseline compare / no material change

domain-shaped

artifact-backed

Launch evidence

The output is a release packet, not a slogan.

Reports should name the domain, the run, the trace, the concern, and the files behind the conclusion. That is the difference between “we tried it” and repeatable launch review.

Output guide
Launch evidence dossier
run_id run-28a8a5cb16eb

Executive summary

release question answered with trace-backed evidence

Domain finding

trust collapse, freshness gap, or tool-use regression

Trace ledger

seeded user/task interactions preserved step by step

Semantic advisory

optional explanation sidecar, non-gating

Regression summary

baseline -> candidate with pass, warn, or fail

Artifact manifest

environment, inputs, outputs, and content hashes

Artifact bundle
run_plan.json
run_manifest.json
report.md
results.json
traces.jsonl
regression_summary.json
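Because the manifest carries content hashes, a reviewer can confirm the bundle on disk is the bundle that was judged. A sketch that assumes a manifest layout of filename-to-sha256 pairs under an artifacts key, which is illustrative rather than the documented schema:

import hashlib
import json
import pathlib

bundle = pathlib.Path("runs/run-28a8a5cb16eb")

# Assumed shape: {"artifacts": {"report.md": "<sha256>", ...}}
manifest = json.loads((bundle / "run_manifest.json").read_text())

for name, expected in manifest["artifacts"].items():
    actual = hashlib.sha256((bundle / name).read_bytes()).hexdigest()
    print(name, "ok" if actual == expected else "CHANGED SINCE RUN")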

Platform shape

One evidence engine. Multiple domain products.

The platform gives every domain the same repeatable run engine, trace ledger, regression workflow, and artifact model. The domain product gives each customer the scenarios, contract, judge, and failure language that match the system they ship.

Maturity note

Generated swarm planning is most mature for recommenders today and is expanding across search and agent domains.

Shared platform capabilities

seeded runs / trace capture / domain judging / artifact manifests / baseline vs candidate compare / CI-ready evidence
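CI-ready can mean a gate as small as reading regression_summary.json and failing the pipeline. A sketch that assumes a top-level status field with pass, warn, or fail values; the real schema may differ:

import json
import sys

summary = json.load(open("runs/run-28a8a5cb16eb/regression_summary.json"))
status = summary.get("status", "fail")  # assumed field name

if status == "fail":
    sys.exit("candidate regressed against baseline; blocking release")
if status == "warn":
    print("warnings present; route to human review")
else:
    print("no material change; safe to proceed")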
Platform stack
shared core + domain modules
1

Evidence engine

Seeds, planning, execution

2

Domain product

Contract, scenarios, judge

3

Target run

HTTP, Python, protocol driver

4

Evidence

Traces, reports, manifests

5

Release

Compare, policy, review

Platform flow

1 Release question: What behavior could break before launch?
2 Domain coverage: Users, queries, or tasks shaped for the product line.
3 Target interaction: HTTP, Python, or protocol drivers call the system.
4 Judged traces: Domain judges score behavior and surface risks.
5 Launch evidence: Reports, JSON, traces, manifests, and compare decisions.

Repeatable behavior trials

Fixed seeds, scenarios, tasks, and reruns make behavior reviewable instead of anecdotal.

Domain product packs

Each domain owns target shape, scenario grammar, judge, metrics, and failure language.

Generated coverage where mature

Recommender briefs can become structured scenarios, populations, swarms, and reusable run plans.

Launch comparison

Compare runs, regression policy, saved plans, and CI-ready outputs turn testing into release review.

Product manual

Docs organized by the jobs teams need.

The docs explain the swarm model after the buyer story is clear: choose a domain product, pick an integration path, run workflows, and read the evidence packet.

Domain pilot

Have an AI system with behavior risk before launch?

Bring the target. We will help you choose the domain product, integration path, and first release question to turn into trace-backed evidence.

Request a domain pilot