Turn agent traces into deterministic test environments. Re-run every model, prompt, and tool change before production.
Private beta · For teams shipping autonomous, tool-heavy agents in production.
Your traces, replay environments, and signed reports stay under your control from ingestion through verdict.
Works above your tracing, model, prompt, tool, and MCP layers instead of replacing them.
Every model, prompt, or tool change resolves into quality, cost, latency, and reliability deltas.
Pinned seeds, suite versions, budgets, and trajectory replays make the verdict defensible.
Works with your current stack.
Your traces become a deterministic replay environment.
Swap a model, prompt, or tool. One variable changes.
Deploy / review / block. Signed in 72 hours.
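A minimal sketch of what that workflow reduces to, in Python. Everything here is hypothetical and illustrative (`PinnedConfig`, `run_replay`, the field names are not Taso's actual API); it shows the shape of a pinned, one-variable comparison under the assumptions above.

```python
from dataclasses import dataclass

# Hypothetical types for illustration only; not Taso's actual API.

@dataclass(frozen=True)
class PinnedConfig:
    suite_version: str   # exact suite revision the verdict is scored against
    seed: int            # pinned RNG seed so re-runs are directly comparable
    budget_usd: float    # hard spend cap per run
    model: str           # the one variable allowed to change between runs

def run_replay(config: PinnedConfig) -> dict:
    """Replay recorded traces under `config` and return aggregate metrics.
    Stubbed here; a real runner would re-execute each trajectory
    deterministically against the pinned suite."""
    return {"quality": 0.0, "cost_usd": 0.0, "p95_latency_s": 0.0, "failures": 0}

baseline = PinnedConfig(suite_version="2025.06.1", seed=42,
                        budget_usd=50.0, model="baseline-model")
candidate = PinnedConfig(suite_version="2025.06.1", seed=42,
                         budget_usd=50.0, model="candidate-model")  # one change

before, after = run_replay(baseline), run_replay(candidate)
# Quality, cost, latency, and reliability deltas attributable to the one change.
deltas = {k: after[k] - before[k] for k in before}
print(deltas)
```

The frozen config is the point: same suite version, same seed, same budget, so any metric delta traces back to the single field that changed.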
CHANGE IMPACT REPORT
Taso is the commercial product. The substrate underneath it is open and credible.
PUBLIC BENCHMARKS
Strategy Bench
Repeated match-play across planning, deception, cooperation, and risk tolerance produces a behavioral fingerprint, not a leaderboard. Hosted on ClashAI as a public benchmark.
RESEARCH
Identifies agents that optimize the metric instead of doing the work: the failure mode observability misses. NeurIPS 2026 submission.
OPEN SOURCE
MIT-licensed environments, adapters, and inspect_ai-compatible scorers behind every Taso report.
All three are built on the same substrate: pinned conditions for fair comparison, reproducible artifacts, deterministic re-execution. That substrate is the methodology behind every Taso report.
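For concreteness, here is what a scorer using the open-source inspect_ai interface looks like. This is a generic exact-match example of the Inspect scorer pattern, not one of Taso's shipped scorers:

```python
from inspect_ai.scorer import CORRECT, INCORRECT, Score, Target, accuracy, scorer
from inspect_ai.solver import TaskState

@scorer(metrics=[accuracy()])
def exact_match():
    """Marks a sample correct when the model's final completion
    exactly matches the target text."""
    async def score(state: TaskState, target: Target) -> Score:
        answer = state.output.completion.strip()
        return Score(
            value=CORRECT if answer == target.text.strip() else INCORRECT,
            answer=answer,
        )
    return score
```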
Currently in private beta.
We're working with a small set of design partners shipping agents in production. Two-week pilots, fixed fee, one workflow, one signed report. Earlier than that? If you're building agents and the validation gap is on your mind, tell us what you're working on. We read everything.