Why simulated agents and not real LLM stacks?

Collisions happen at the storage layer, not the LLM. Real LLM tool-calls add wall-clock noise but do not change collision behavior. Stripping the LLM out isolates the metric this benchmark exists to publish. The harness is open source — point it at any external agent stack if you want to add LLM-induced jitter back in.

What does a double-commit look like, and why is the target zero?

A double-commit is two agents both succeeding on the same slot in the same round. The human ends up with overlapping events that each system treats as canonical. The target is zero because any positive value is a correctness bug in the engine, not a number to publish.

How do I reproduce this benchmark?

Clone the AgentDraft repo, start the database with docker compose up dynamodb, run uvicorn on port 8080, then run python scripts/benchmark/run.py --rounds 100 --agents 5. The harness writes a JSON report identical in shape to the one rendered on this page.

What does priority mean in this benchmark?

Each agent is assigned an integer priority (1 = highest). When two agents race for the same slot, the higher-priority agent wins deterministically. Equal-priority agents split wins roughly evenly under contention; that scenario is a separate run in the harness.

How does this relate to the ScheduleMe paper?

ScheduleMe (arxiv 2509.25693) studies negotiation between agents that share internal state. This benchmark studies what happens when agents do not share state — the production-shape problem. The two are complementary.

§ Benchmark · Edition № 01 · Run May 24, 2026

The AgentDraft Multi-Agent Collision Benchmark.

Name: AgentDraft Multi-Agent Collision Benchmark
Creator: AgentDraft
License: https://opensource.org/licenses/MIT

A reproducible, open-source benchmark for how a scheduling API behaves when independent AI agents write to one calendar at the same time. Every conflict resolves to exactly one winner. Re-run it yourself.

0 double-bookings across 500 concurrent agent attempts.

100 rounds · 5 agents/round · 100.0% one-winner · 100.0% rank-1 wins · p99 112 ms

When five independent scheduling agents fire at the same calendar slot simultaneously, AgentDraft elects exactly one winner — every round, deterministically, by priority. There were no double-commits, no rounds without a winner, and the highest-priority agent (rank 1) won every race it entered. This page is the receipts.

§ 01 Methodology

Setup. 5 simulated agents, each with a distinct priority (1 = highest), share one calendar. Each round, all agents fire POST /v1/bookings concurrently at the same 30-minute slot through AgentDraft's benchmark harness. The benchmark runs 100 rounds with a fresh future slot per round so no engine state leaks across rounds. Defaults: 30-second hold TTL, 30-second bump window.

base_url=reference://conflict-engine · commit=reference-fixture · duration 567s wall clock.

What counts. A commit is a 201 from the engine — the booking landed atomically. An outranked response is a 409 with the winner's identity. A double-commit is two 201s for the same slot in one round — the failure mode this whole engine exists to prevent. The benchmark's primary correctness invariant: double-commits must be zero on every run.
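The outcome definitions above can be sketched as a small classifier over the status codes one round returns. This is an illustrative sketch, not the harness's code: the function name `classify_round` and the outcome labels are hypothetical, and any status other than 201 or 409 is counted as an error.

```python
from collections import Counter


def classify_round(status_codes):
    """Classify one round from the HTTP status codes its agents received."""
    counts = Counter(status_codes)
    commits = counts[201]      # booking landed
    outranked = counts[409]    # lost to another agent
    errored = len(status_codes) - commits - outranked
    if commits == 1:
        outcome = "one-winner"
    elif commits >= 2:
        outcome = "double-commit"  # the invariant violation
    else:
        outcome = "no-winner"
    return {
        "commits": commits,
        "outranked": outranked,
        "errored": errored,
        "outcome": outcome,
    }


# A clean five-agent round, and a hypothetical broken one for contrast.
clean = classify_round([201, 409, 409, 409, 409])
broken = classify_round([201, 201, 409, 409, 409])
print(clean["outcome"], broken["outcome"])
```

A run passes the primary invariant only if no round classifies as "double-commit".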

The harness is uncoordinated — agents do not see each other's state and do not retry intelligently. That is the faithful proxy for a real-world stack where independent agents (a Cal.com routing handler, an inbox triage bot, a CrewAI assistant) write to the same calendar without knowing the others exist.
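The shape of an uncoordinated round can be sketched with asyncio: every agent fires at once, nobody retries, and each round uses a fresh slot. Everything here is an assumption for illustration. `StubEngine` is a hypothetical in-memory, first-writer-wins stand-in, deliberately simpler than the priority engine this page describes, and none of the names come from the real harness.

```python
import asyncio
import random


class StubEngine:
    """Hypothetical first-writer-wins stand-in for POST /v1/bookings."""

    def __init__(self):
        self.committed = {}  # slot -> agent name

    async def book(self, agent, slot):
        await asyncio.sleep(random.uniform(0, 0.002))  # simulated network jitter
        # No await between the check and the write, so on a single
        # event loop the check-and-write pair cannot be interleaved.
        if slot in self.committed:
            return 409
        self.committed[slot] = agent
        return 201


async def run_round(engine, agents, slot):
    # Fire all agents concurrently; no shared state, no retries.
    return await asyncio.gather(*(engine.book(a, slot) for a in agents))


async def run_benchmark(rounds, agents):
    engine = StubEngine()
    one_winner = 0
    double_commit = 0
    for r in range(rounds):
        slot = f"slot-{r}"  # fresh slot per round, so no state leaks across rounds
        codes = await run_round(engine, agents, slot)
        wins = codes.count(201)
        one_winner += int(wins == 1)
        double_commit += int(wins >= 2)
    return {"rounds": rounds, "one_winner": one_winner, "double_commit": double_commit}


AGENTS = ["sales-bot", "recruit-bot", "focus-blocker", "exec-ea", "ops-bot"]
report = asyncio.run(run_benchmark(20, AGENTS))
print(report)
```

Swapping `StubEngine.book` for a real HTTP call would turn the same loop into a test against a live deployment; the tallying logic would not need to change.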

§ 02 Results

Metric	Value
Total attempts	500
Committed (winners)	100
Outranked (HTTP 409)	400
Errored	0
Rounds with exactly one winner	100 / 100
Rounds with double-commit	0
Conflict-resolution accuracy	100.0%
Rank-1 win rate	100.0%
Latency p50	38 ms
Latency p99	112 ms
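The count rows in the table are mutually consistent, which a few lines of arithmetic can confirm. The variable names are illustrative, and the definition of accuracy as the share of rounds with exactly one winner is an assumption based on the figures on this page.

```python
# Values copied from the results table.
rounds = 100
agents_per_round = 5
committed = 100
outranked = 400
errored = 0
one_winner_rounds = 100
rank1_wins = 100

# Every attempt ends as a commit, a 409, or an error.
total_attempts = rounds * agents_per_round
assert total_attempts == committed + outranked + errored

# Assumed definitions: share of rounds with exactly one winner,
# and share of rounds won by the rank-1 agent.
accuracy_pct = 100.0 * one_winner_rounds / rounds
rank1_win_rate_pct = 100.0 * rank1_wins / rounds

# With one winner per round, the other four agents are outranked.
losers_per_round = outranked / rounds

print(total_attempts, f"{accuracy_pct:.1f}%", f"{rank1_win_rate_pct:.1f}%", losers_per_round)
```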

§ 03 Per-agent breakdown

Agent	Priority	Attempts	Wins	Losses
sales-bot	1	100	100	0
recruit-bot	2	100	0	100
focus-blocker	3	100	0	100
exec-ea	4	100	0	100
ops-bot	5	100	0	100

The story the table tells without commentary: a strict priority order produces a clean winner-take-all outcome. Equal-priority agents would split wins roughly evenly under contention — that scenario is a separate run in the harness.

§ 04 What this means

Double-commits are the existential failure mode. Latency is interesting. Resolution accuracy is interesting. Double-commits — two agents both succeeding on the same slot — are what makes a human's calendar useless. The right value is zero, and the storage layer has to enforce it, not the application. AgentDraft's conditional-write engine (app/conflict/engine.py) does that work — the check is the write.

Priority is identity. Two-phase commit, optimistic locking, and naïve compare-and-swap all elect a winner by network jitter: whichever attempt arrives first. AgentDraft elects a winner by writer identity. Each agent carries a per-user priority; the engine's ConditionExpression bakes that priority into the storage-level write. A higher-priority agent's commit can evict a lower-priority hold or a still-bumpable commit. That moves the decision out of the racetrack and into a place a human operator can reason about.
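A minimal in-memory sketch of the idea, assuming a store whose write succeeds only when the slot is empty or the writer outranks the current holder. This is not the code in app/conflict/engine.py: `SlotStore` and `put_if_outranks` are hypothetical names, the lock stands in for the atomicity a storage-level conditional write provides, and hold TTLs and the bump window are omitted. Ties go to the existing holder, which is also an assumption.

```python
import random
import threading


class SlotStore:
    """In-memory stand-in for a table that supports conditional writes."""

    def __init__(self):
        self._lock = threading.Lock()
        self._slots = {}  # slot -> (priority, agent); lower number = higher priority

    def put_if_outranks(self, slot, agent, priority):
        # The condition and the write happen under one lock: the check is the write.
        with self._lock:
            current = self._slots.get(slot)
            if current is not None and current[0] <= priority:
                return False  # outranked, or tied with the existing holder
            self._slots[slot] = (priority, agent)  # claim, or evict a lower-priority holder
            return True

    def holder(self, slot):
        with self._lock:
            return self._slots[slot][1]


def race(agents, slot, store):
    """Start every agent's write in a random order and return the final holder."""
    order = list(agents.items())
    random.shuffle(order)  # arrival order varies from race to race
    threads = [
        threading.Thread(target=store.put_if_outranks, args=(slot, name, prio))
        for name, prio in order
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store.holder(slot)


AGENTS = {"sales-bot": 1, "recruit-bot": 2, "focus-blocker": 3, "exec-ea": 4, "ops-bot": 5}
store = SlotStore()
winners = {race(AGENTS, f"slot-{i}", store) for i in range(50)}
print(winners)
```

Whatever order the writes arrive in, the slot ends up held by the priority-1 agent, because a lower-priority write can never replace a higher-priority one.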

Multi-agent collision rate is a thing you can measure. The deep version of the question is not "how fast is the commit?" — it's "as the agent population grows, does the outcome stay deterministic?" The benchmark above is the smallest credible version. At N=5 and 100 rounds the engine is correct. The harness scales to N=50; expect a follow-up edition with the higher-concurrency runs when they ship.

§ 05 Reproduce

git clone https://github.com/ryabinski-labs/agentdraft-benchmark
cd agentdraft-benchmark
pip install -r requirements.txt
# create 5 agents (free) at agentdraft.io, then:
AGENTDRAFT_BASE_URL=https://api.agentdraft.io \
AGENTDRAFT_API_KEYS=<key1,key2,...> \
  python run.py \
  --rounds 100 \
  --label agentdraft-prod

The harness is open source at ryabinski-labs/agentdraft-benchmark (MIT) and re-runs in a single command against any AgentDraft deployment. Prior art: ScheduleMe (Wang et al., arxiv:2509.25693) — academic framing for multi-agent calendar assistants; this benchmark is the production-shape version.

§ 06 Pre-registered: the Collision Census

The statistic this niche lacks: how often do real calendars with two or more agents actually collide? Nobody has published it. We are pre-registering the methodology now so the eventual number can't be accused of being tuned after the fact.

Method. Once the free read-only calendar audit has scanned n ≥ 30 calendars with ≥2 active agents, we will publish the share of those calendars showing at least one collision within a 30-day window, alongside per-calendar agent counts and the collision patterns observed. Aggregates only — never calendar contents. Until the threshold is met, no number is published; this paragraph is the commitment.
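The pre-registered statistic can be sketched as a function that refuses to return a number below the threshold. The record type `CalendarAudit` and its field names are hypothetical, and the data in the usage lines is synthetic, made up only to exercise the function.

```python
from dataclasses import dataclass
from typing import List, Optional

MIN_CALENDARS = 30  # publication threshold: n >= 30 eligible calendars
MIN_AGENTS = 2      # a calendar is eligible with at least 2 active agents


@dataclass
class CalendarAudit:
    active_agents: int
    collisions_30d: int  # collisions observed within the 30-day window


def collision_share(audits: List[CalendarAudit]) -> Optional[float]:
    """Share of eligible calendars with at least one collision, or None below threshold."""
    eligible = [a for a in audits if a.active_agents >= MIN_AGENTS]
    if len(eligible) < MIN_CALENDARS:
        return None  # threshold not met, so nothing is published
    collided = sum(1 for a in eligible if a.collisions_30d >= 1)
    return collided / len(eligible)


# Synthetic data only: 29 eligible calendars is one short of the threshold.
too_few = [CalendarAudit(active_agents=2, collisions_30d=1) for _ in range(29)]

# Synthetic data only: 40 eligible calendars, 10 with collisions,
# plus 5 single-agent calendars that are excluded from the count.
enough = (
    [CalendarAudit(active_agents=3, collisions_30d=2) for _ in range(10)]
    + [CalendarAudit(active_agents=2, collisions_30d=0) for _ in range(30)]
    + [CalendarAudit(active_agents=1, collisions_30d=4) for _ in range(5)]
)

print(collision_share(too_few), collision_share(enough))
```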

§ 07 Cite this

@misc{agentdraft_collision_2026,
  author       = {{AgentDraft Labs}},
  title        = {AgentDraft Multi-Agent Collision Benchmark},
  year         = {2026},
  url          = {https://agentdraft.io/benchmark},
  note         = {Run 2026-05-24, commit reference-fixture}
}

§ 08 Further reading

How a deterministic conflict engine resolves 8,217 collisions — the architecture under the numbers above.
Why AI scheduling agents collide — the thesis the benchmark is evidence for.
AgentDraft protocol specification — the storage-level conditional write, in spec form.

Curious if this is already happening to you?

Run the free calendar collision audit — no signup, no card. Point it at a calendar and see where two agents (or an agent and a human) already double-booked the same slot.
