Procurement Chaos
Witness 6
Forged approval path
fetch_requisition → fetch_vendor_quotes → create_purchase_order
python -m trajectly repro procurement-chaos
Your agent passed the demo. Then CI killed it.
A GitHub-native challenge where bad agent PRs die with witness-level evidence: violated contract, repro command, shrink result, and the exact step where behavior broke.
The failure artifacts read like a CI death screen, not a benchmark spreadsheet.
The arena is backed by live workflow state, committed specs, and the same Trajectly gate you would use in a production repo.
You get the witness step, the violated contract, the repro command, and a minimized counterexample instead of a vague red X.
These eight scenarios cover six categories of silent failure that text-based checks cannot see.
Final-answer evals can pass broken agents.
If the behavior regressed, the agent is broken even when the final sentence still sounds right.
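To make that gap concrete, here is a minimal sketch in plain Python. It is not Trajectly's API: the `Run` dataclass and both check functions are hypothetical names used only for illustration, and the baseline call list (including `route_for_approval`) is an assumed example. Two runs end with the same final text, but one skips a tool call, so the output-only check passes while the trajectory check fails.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Run:
    """Hypothetical record of one agent run: ordered tool calls plus the final text."""
    tool_calls: Tuple[str, ...]
    final_answer: str


def final_answer_check(baseline: Run, candidate: Run) -> bool:
    # Output-only eval: compares nothing but the final text.
    return candidate.final_answer == baseline.final_answer


def trajectory_check(baseline: Run, candidate: Run) -> bool:
    # Behavior eval: the ordered tool calls must match the baseline exactly.
    return candidate.tool_calls == baseline.tool_calls


# Assumed baseline for illustration; not taken from the arena's committed specs.
baseline = Run(
    ("fetch_requisition", "fetch_vendor_quotes",
     "route_for_approval", "create_purchase_order"),
    "Purchase order created.",
)
# Same final sentence, but the approval step never ran.
candidate = Run(
    ("fetch_requisition", "fetch_vendor_quotes", "create_purchase_order"),
    "Purchase order created.",
)

print(final_answer_check(baseline, candidate))  # True
print(trajectory_check(baseline, candidate))    # False
```

Exact equality is the strictest possible comparison and is used here only to keep the sketch short; a real gate would likely allow some variation that still satisfies the contracts.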
Forged approval path
fetch_requisition → fetch_vendor_quotes → create_purchase_order
python -m trajectly repro procurement-chaos
Secret leaked in outbound payload
fetch_logs → summarize → post_summary
python -m trajectly repro secret-karaoke
Invite fired before room reservation
lookup_oncall → send_invite → reserve_room
python -m trajectly repro calendar-thunderdome
Start from a known-good trajectory and commit the baseline so future runs compare against real behavior, not vibes.
Replay with fixtures and contracts enabled so final-answer luck cannot hide missing calls, bad order, or unsafe branches.
Trajectly identifies the earliest event where behavior diverged and attaches a concrete violation code to that step.
Replay the failure locally with one command, then reduce it to the shortest counterexample that still dies the same way.
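The idea of "the earliest event where behavior diverged" can be sketched in a few lines of Python. This is an illustrative assumption about how such a witness index could be computed, not Trajectly's implementation: `first_divergence` is a hypothetical helper, and the toy traces contain only tool names, whereas a real event stream would carry more event types and arguments.

```python
from itertools import zip_longest
from typing import Optional, Sequence


def first_divergence(baseline: Sequence[str],
                     candidate: Sequence[str]) -> Optional[int]:
    """Return the index of the earliest differing event, or None if the traces match."""
    # zip_longest pads the shorter trace with None, so a missing or extra
    # trailing event also counts as a divergence.
    for index, (expected, actual) in enumerate(zip_longest(baseline, candidate)):
        if expected != actual:
            return index
    return None


# Toy traces for illustration only; not the arena's recorded trajectories.
baseline = ["fetch_requisition", "fetch_vendor_quotes",
            "route_for_approval", "create_purchase_order"]
candidate = ["fetch_requisition", "fetch_vendor_quotes",
             "create_purchase_order"]

print(first_divergence(baseline, candidate))  # 2
```

Reporting the earliest index rather than every difference keeps the failure pointed at the step where behavior first broke; later mismatches are often just fallout from that step.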
Each scenario demonstrates a trajectory failure that output-only checks can miss.
The final PO text can still look valid while the agent forges the approval path.
A calm support reply can hide the fact that the required escalation never happened.
The summary can read clean while the outbound payload still contains secret-like values.
The agent can claim the audit passed while quietly taking the disallowed tool branch.
"Bridge arranged" can still pass while reservation order is broken or invites fire twice.
The graph can finish and print success while a node argument silently violates contract.
The agent can report success even though it reached out to a forbidden domain.
The final text can stay identical while execution cost and tool usage quietly regress.
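Two of the failure shapes above, broken call order and a call that fires twice, can be expressed as small checks over the ordered tool calls. The sketch below is plain Python under stated assumptions: `violates_order` and `violates_at_most_once` are hypothetical helpers, not Trajectly contract syntax, and each returns the index of the offending call so the violation can be tied to a specific step.

```python
from typing import Optional, Sequence


def violates_order(calls: Sequence[str],
                   required_first: str, then: str) -> Optional[int]:
    """Return the index of the first `then` call made before any `required_first` call."""
    seen_required = False
    for index, name in enumerate(calls):
        if name == required_first:
            seen_required = True
        elif name == then and not seen_required:
            return index
    return None


def violates_at_most_once(calls: Sequence[str], name: str) -> Optional[int]:
    """Return the index of the second occurrence of `name`, or None if it runs at most once."""
    count = 0
    for index, call in enumerate(calls):
        if call == name:
            count += 1
            if count > 1:
                return index
    return None


# Call order taken from the "Invite fired before room reservation" card.
calls = ["lookup_oncall", "send_invite", "reserve_room"]
print(violates_order(calls, "reserve_room", "send_invite"))  # 1
print(violates_at_most_once(calls, "send_invite"))           # None
```

Neither check looks at the final message, which is why a reply such as "Bridge arranged" cannot mask the violation.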
Same engine, same contracts, same repro workflow.
python -m trajectly repro procurement-chaos
Witness index, violated contract, repro command, and shrink result map directly to the same GitHub review flow teams already use.
The debug loop is not theoretical. You can collapse a noisy failure into the smallest trace that still proves the regression.
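One way to picture shrinking is a greedy reduction that drops events one at a time for as long as the failure still reproduces. The sketch below is a simplified, delta-debugging-style loop in plain Python, offered as an assumption about the general technique rather than a description of what `trajectly shrink` does internally. The `shrink` function, the `forged_approval` predicate, and the noisy trace are all hypothetical.

```python
from typing import Callable, List, Sequence


def shrink(events: Sequence[str],
           still_fails: Callable[[Sequence[str]], bool]) -> List[str]:
    """Greedily remove single events while `still_fails` keeps returning True."""
    current = list(events)
    if not still_fails(current):
        raise ValueError("the starting trace must reproduce the failure")
    changed = True
    while changed:
        changed = False
        for index in range(len(current)):
            trial = current[:index] + current[index + 1:]
            if still_fails(trial):
                # Keep the smaller trace and rescan from the start.
                current = trial
                changed = True
                break
    return current


def forged_approval(trace: Sequence[str]) -> bool:
    """Hypothetical failure predicate: a PO is created before any approval routing."""
    for name in trace:
        if name == "route_for_approval":
            return False
        if name == "create_purchase_order":
            return True
    return False


# Illustrative noisy trace; not the arena's recorded run.
noisy = ["fetch_requisition", "fetch_vendor_quotes", "fetch_vendor_quotes",
         "fetch_requisition", "create_purchase_order"]
print(shrink(noisy, forged_approval))  # ['create_purchase_order']
```

The result is minimal only in the sense that removing any single remaining event makes the failure disappear; it is not guaranteed to be the globally shortest counterexample.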
Refinement and contract findings make missing calls and broken order legible in a way final-answer checks cannot.
14 events
3 events
- route_for_approval
Open the repo, run the arena, then inspect the failure like you would in a real CI gate.
git clone https://github.com/trajectly/trajectly-survival-arena.git
cd trajectly-survival-arena
pip install -r requirements.txt
python -m trajectly init
python -m trajectly run specs/challenges/*.agent.yaml --project-root .
python -m trajectly report
python -m trajectly repro
python -m trajectly shrink
Public survivors, compact and easy to scan.
The hall of fame appears once arena data is available.