kazi

Proof, not vibes

Real goals a naive pipeline leaves subtly broken — the file looks created, the answer looks right, the parallel split looks done — that kazi drove to objective convergence. Each case below is a genuine kazi apply --json run on a released binary, with before/after evidence and a command you can re-run.

No unverifiable claims. Every iteration count, cost, and wall-clock figure is copied verbatim from a real run recorded in the devlog; the full reproduction steps — and the metered cost_usd per run — live in the methodology doc. If a number can’t be traced to a captured run, it isn’t on this page.

  1. #1kazi v1.64.2released binary · real claude harness

    Exact-content file

    Goal: Create VERSION.txt whose contents are exactly 1.0.0

    The subtle break: “The file exists” looks done — but the bytes have to be exact. A naive run is happy with a stray trailing line, v1.0.0, or 1.0.0\n\n.

    Without kazi

    File created, declared done. Content is plausible but not byte-exact — the version string silently wrong.

    With kazi

    kazi’s predicate compares the content exactly (test -f VERSION.txt && [ "$(cat VERSION.txt)" = "1.0.0" ]). “Exists” is never mistaken for “done”; VERSION.txt verified to be the bytes 1.0.0.

    status
    converged
    iterations
    2
    wall clock
    18.5 s
    tokens
    39,712
    Reproduce
    $ kazi apply version.goal.toml --workspace <ws> --harness claude --json

    Source: docs/devlog.md — 2026-06-26 · T26.6

  2. #2kazi v1.64.1released binary · real claude harness

    Self-correcting against an opaque oracle

    Goal: Write solution.py printing the sum of n<1,000,000 that are palindromes in base 10, 2 AND 8 (answer 610)

    The subtle break: The agent’s first multi-step attempt looks plausible but is wrong. With no objective check, that wrong answer ships.

    Without kazi

    First dispatch prints a plausible-but-wrong number. A prose pipeline accepts it — there’s nothing grading the result.

    With kazi

    kazi grades the output against a one-way sha256 oracle (the model can’t read the answer out of the checker), catches the wrong first attempt, and re-drives. The cheap model (Haiku 4.5) self-corrected on iteration 2; solution.py verified to print 610.

    status
    converged
    iterations
    2
    wall clock
    39.3 s
    model
    claude-haiku-4-5
    Reproduce
    $ kazi apply ./goal.toml --workspace ./ws --harness claude --model claude-haiku-4-5 --json

    Source: docs/devlog.md — 2026-06-26 · T30.4 (Run A)

  3. #3kazi v1.64.2released binary · real claude harness

    A real cross-group dependency, parallelized correctly

    Goal: Build three Go capabilities (contract, /healthz, streaming) where streaming consumes the contract type

    The subtle break: Split the work across parallel agents naively and the streaming endpoint compiles against a Widget type that doesn’t exist yet — a broken build that “looks done” per-agent.

    Without kazi

    Naive parallel split → streaming dispatched before the contract exists → broken compile, or a serial run that gives up all the parallelism.

    With kazi

    kazi computes the wave schedule from authored needs edges: result-contract and health dispatch in the SAME millisecond (disjoint blast radius), streaming waits 0.222 s until result-contract objectively converges, then all three merge into one workspace. Collective converged, exit 0.

    collective
    converged
    groups
    3 (2 concurrent, 1 gated)
    blocked
    0
    scheduler
    single-node, NATS-free
    Reproduce
    $ kazi apply priv/examples/predicate_graph_waves.toml --workspace <scratch> --parallel --harness claude --json

    Committed goal-file: priv/examples/predicate_graph_waves.toml

    Source: docs/devlog.md — 2026-06-26 · T21.12 + T23.9

Why the “converged” verdicts mean something

The inverse case is the founding dogfood: a deterministic test where a naive fix makes one predicate pass while regressing another, and kazi catches the green→red regression instead of declaring success. Two coupled predicates — fixing a.txt breaks b.txt as a side effect — and kazi attributes the regression to the dispatch that caused it and never declares done. A single exit code would have hidden it.

$ mix test test/kazi/slice1_dogfood_test.exs

Deterministic — no model, no network. Source: test/kazi/slice1_dogfood_test.exs (the Slice-1 acceptance dogfood, T1.8). Labelled as a test, not a live kazi apply run.

Read the full methodology →

Each new converged dogfood becomes a new entry here. To add one: capture a real kazi apply --json envelope, record it in the devlog, and append a case — never a number without a run behind it.