DarwinPatch

DarwinPatch is a runnable prototype for making coding-agent repair loops more reliable. It does not try to replace a coding agent. Instead, it wraps candidate code patches with a disciplined controller that verifies, rejects, routes, archives, and reports the repair process.

The core idea is simple:

The coding agent should not just generate a patch. It should generate a patch, test it through hard gates, learn only from bounded evidence, route the next attempt deliberately, and leave behind an audit trail that explains why a candidate was rejected or promoted.

DarwinPatch demonstrates that idea in a controlled benchmark:

              
single_shot:            solve@budget = 0.2
linear_retry:           solve@budget = 0.6
clean_context_review:   solve@budget = 0.6
archive_no_routing:     solve@budget = 0.6
evidence_aware_review:  solve@budget = 0.9
full_darwinpatch:       solve@budget = 0.9

The most important result is not just that full_darwinpatch reaches 0.9. The stronger interpretation is:

Ordered retry is not enough.
Bounded evidence plus route-compatible candidate metadata provides the main repair-selection lift.
Full DarwinPatch turns that lift into an auditable reliability system: gates, evidence packets, candidate archives, lineage, failure fingerprints, policy selection, and static reports.

The default benchmark is deterministic and offline so the behavior is reproducible. Optional LLM mode can be used for the hero task, but the controlled benchmark uses curated candidate pools with explicit metadata. Note that this is a small prototype, not a full autonomous coding product.

The Problem

Modern coding agents can produce impressive patches, but long-horizon coding work has a reliability problem. A single generated patch may:

pass visible tests while breaking hidden behavior,
touch files outside the intended scope,
introduce syntax errors or secret-like material,
repeat the same failure mode,
give the user no clear evidence trail,
or require a human reviewer to reconstruct what happened after the fact.

The usual retry loop is often too weak:

generate patch -> run tests -> if failed, try again

That loop misses the real engineering issue. The question is not only "can the model eventually produce a working patch?" The better question is:

Can the system control the repair process so failed attempts become bounded evidence, unsafe patches are rejected, regressions stay hidden from the repair loop, and the final promotion is explainable?

DarwinPatch is a prototype answer to that question.

Design Goals

DarwinPatch is designed around five goals:

Hard verification before promotion

A patch is promoted only after scope checks, patch application, AST parsing, secret scanning, visible tests, and release-gate regression tests pass.

Bounded evidence, not raw failure leakage

The controller receives compact EvidencePacket objects. Visible failures expose limited details; release-gate regressions deliberately withhold full output.

Route-aware repair selection

After a failure, the controller maps the evidence to a repair route such as behavior_repair, regression_repair, scope_repair, or syntax_repair.

Auditability

Every candidate is recorded with route, status, score, parent, generation, touched files, verifier results, failure fingerprint, and evidence ID.

Controlled evaluation

The repo includes a deterministic benchmark, ablations, confidence intervals, policy selection, and claim-to-artifact reporting.

System Overview

At a high level, DarwinPatch is a reliability layer around candidate patches.

  flowchart LR
    A["Task spec and source repo"] --> B["Candidate patch"]
    B --> C["DarwinPatch gates"]
    C --> D{"All gates pass?"}
    D -- "yes" --> E["Promote candidate"]
    D -- "no" --> F["Build bounded EvidencePacket"]
    F --> G["Route next repair"]
    G --> H["Select route-compatible candidate"]
    H --> C
    C --> I["Candidate archive and traces"]
    I --> J["HTML report and benchmark artifacts"]

The flow has four conceptual stages. First, DarwinPatch receives a task specification, an isolated source workspace, and one or more candidate patches. Second, each candidate is executed through hard verification gates so unsafe, malformed, or regressive edits are rejected before promotion. Third, failed candidates are compressed into bounded evidence and mapped to a repair route, which determines how the next candidate is selected. Finally, the system records candidate lineage, failure fingerprints, evidence IDs, benchmark results, and static report artifacts so the repair process is inspectable after the run.
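As a rough illustration of the final stage, the sketch below models one archive entry and a lineage walk. The field list follows the Auditability goal above; the class name `CandidateRecord`, the `candidate_id` key, the helper names, and the JSON Lines serialization are assumptions made for illustration, not the repository's actual schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, Iterable, List, Optional


@dataclass
class CandidateRecord:
    """One archive entry (hypothetical schema based on the fields named in the text)."""
    candidate_id: str                 # hypothetical identifier, not named in the source
    route: Optional[str]              # e.g. "behavior_repair"; None for the first attempt
    status: str                       # e.g. "promoted" or "rejected"
    score: float
    parent: Optional[str]             # candidate_id of the attempt this one followed
    generation: int
    touched_files: List[str] = field(default_factory=list)
    verifier_results: Dict[str, bool] = field(default_factory=dict)
    failure_fingerprint: Optional[str] = None
    evidence_id: Optional[str] = None


def lineage(archive: Dict[str, CandidateRecord], candidate_id: str) -> List[str]:
    """Walk parent links back to the first attempt and return ids oldest-first."""
    chain: List[str] = []
    current: Optional[str] = candidate_id
    while current is not None:
        if current in chain:
            raise ValueError(f"cycle in lineage at {current}")
        chain.append(current)
        current = archive[current].parent
    return list(reversed(chain))


def to_jsonl(records: Iterable[CandidateRecord]) -> str:
    """Serialize records as JSON Lines so a run can be inspected afterwards."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)
```

Keeping `parent` and `generation` on every record is what lets a report explain a promoted patch as a chain of earlier rejections rather than as an isolated result.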

Repair Loop

The full DarwinPatch loop begins with an ordered first candidate. If it fails, the failure is classified and converted into bounded evidence. The critic maps that evidence to a route, and the next candidate is selected using declared candidate metadata such as compatible_routes, generator, and intent.

Importantly, route-aware selection does not inspect patch filenames to guess which candidate is correct. Path-only inputs remain supported for simple CLI use, but without route metadata they fall back to original order.
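The selection rule described above can be sketched as follows. The route names and the `compatible_routes` field come from the text; the gate keys, the `GATE_TO_ROUTE` table, and the function names are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Assumed mapping from failing gate to repair route. The route names appear in
# the text; the gate keys and this exact mapping are illustrative.
GATE_TO_ROUTE: Dict[str, str] = {
    "scope_guard": "scope_repair",
    "ast_parse": "syntax_repair",
    "visible_tests": "behavior_repair",
    "release_gate": "regression_repair",
}


@dataclass
class Candidate:
    patch: str
    compatible_routes: List[str] = field(default_factory=list)


def route_for(failing_gate: str) -> Optional[str]:
    """Map a failing gate to a route, or None when no route is known."""
    return GATE_TO_ROUTE.get(failing_gate)


def select_next(remaining: List[Candidate], route: Optional[str]) -> Optional[Candidate]:
    """Prefer the first candidate that declares the route; otherwise keep original order.

    Only declared metadata is consulted. The patch filename is never inspected.
    """
    if not remaining:
        return None
    if route is not None:
        for candidate in remaining:
            if route in candidate.compatible_routes:
                return candidate
    return remaining[0]
```

A path-only candidate is simply a `Candidate` with an empty `compatible_routes` list, so a pool made entirely of such entries degrades to ordered retry.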

sequenceDiagram
    participant C as Controller
    participant R as Runner
    participant G as Gates
    participant E as Evidence
    participant K as Critic
    participant A as Archive

    C->>R: Run candidate patch
    R->>G: Scope, apply, AST, secret scan, visible tests, release gate
    G-->>R: Gate results
    R-->>C: RunResult
    alt Candidate promoted
        C->>A: Record promoted candidate
        C-->>C: Stop with promoted status
    else Candidate rejected
        C->>E: Build bounded EvidencePacket
        E-->>C: Evidence summary and limited details
        C->>K: Route next attempt from evidence
        K-->>C: behavior_repair / regression_repair / scope_repair / ...
        C->>A: Record candidate, evidence, fingerprint, lineage
        C-->>C: Select next route-compatible candidate
    end

Verification Gates

DarwinPatch treats verification as a hard boundary. A candidate cannot be promoted unless every configured gate passes.

  flowchart TD
    A["Candidate patch"] --> B["Scope guard"]
    B --> C["Patch applies"]
    C --> D["Python AST parse"]
    D --> E["Secret scan"]
    E --> F["Visible developer tests"]
    F --> G["Release-gate regression tests"]
    G --> H["Promote"]

    B -- "fail" --> R["Reject and build evidence"]
    C -- "fail" --> R
    D -- "fail" --> R
    E -- "fail" --> R
    F -- "fail" --> R
    G -- "fail" --> R

The release gate is intentionally different from visible tests. Visible test failures can expose bounded stdout/stderr tails. Release-gate regression failures return limited diagnostics and withhold the full failure output from the repair loop. This simulates a realistic hidden-test boundary: the system can know that it regressed behavior without leaking the exact hidden assertion.
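A minimal sketch of the fail-fast gate ordering, covering only the three static gates that need no workspace (scope guard, AST parse, secret scan). Patch application and both test gates are omitted. The function names, the gate keys, and the deliberately tiny secret pattern are assumptions, not the project's actual checks.

```python
import ast
import re
from typing import Callable, List, Optional, Sequence, Tuple

# Assumption: a toy pattern for illustration. A real scanner would cover far more.
SECRET_PATTERN = re.compile(r"AKIA[0-9A-Z]{16}|-----BEGIN [A-Z ]*PRIVATE KEY-----")


def scope_guard(touched_files: Sequence[str], allowed_prefixes: Sequence[str]) -> bool:
    """Pass only if every touched path sits under an allowed prefix."""
    prefixes = tuple(allowed_prefixes)
    return all(path.startswith(prefixes) for path in touched_files)


def ast_parses(source: str) -> bool:
    """Pass only if the patched Python source parses."""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        return False


def no_secrets(source: str) -> bool:
    """Pass only if no secret-like material is found."""
    return SECRET_PATTERN.search(source) is None


def first_failing_gate(
    touched_files: Sequence[str],
    patched_source: str,
    allowed_prefixes: Sequence[str],
) -> Optional[str]:
    """Run gates in order and return the name of the first failure, or None."""
    gates: List[Tuple[str, Callable[[], bool]]] = [
        ("scope_guard", lambda: scope_guard(touched_files, allowed_prefixes)),
        ("ast_parse", lambda: ast_parses(patched_source)),
        ("secret_scan", lambda: no_secrets(patched_source)),
    ]
    for name, check in gates:
        if not check():
            return name  # later gates never run for a rejected candidate
    return None
```

Stopping at the first failure keeps the evidence unambiguous: each rejected candidate is attributed to exactly one gate, which is what the routing step consumes.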

Evidence Packets

Each failed candidate produces an EvidencePacket with:

evidence_id
failing gate
failure type
human-readable summary
bounded details
packet IDs describing what kind of context is admitted

Example evidence summaries include:

              
visible developer tests failed
release-gate regressions failed with limited diagnostics
patch touched blocked paths: ...
patched Python did not parse
secret-like material detected
candidate patch did not apply

This is the important reliability distinction: DarwinPatch does not feed arbitrary logs back into the next attempt. It turns verifier output into a small evidence object with controlled information flow.
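One way to picture that controlled information flow is the sketch below. The field names mirror the list above; the 400-character bound, the hashing scheme for `evidence_id`, the `packet_ids` values, and the `build_evidence` helper are all assumptions chosen for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import Tuple

MAX_DETAIL_CHARS = 400  # assumed bound; the source does not state the real limit


@dataclass(frozen=True)
class EvidencePacket:
    evidence_id: str
    failing_gate: str
    failure_type: str
    summary: str
    details: str                  # bounded; empty for release-gate failures
    packet_ids: Tuple[str, ...]   # what kind of context was admitted


def build_evidence(failing_gate: str, failure_type: str, summary: str,
                   raw_output: str) -> EvidencePacket:
    """Compress raw verifier output into a small, bounded evidence object."""
    if failing_gate == "release_gate":
        # Hidden regressions: report that the gate failed, withhold the output.
        details = ""
        packet_ids: Tuple[str, ...] = ("gate_status",)
    else:
        # Visible failures: admit only the tail of the output.
        details = raw_output[-MAX_DETAIL_CHARS:]
        packet_ids = ("gate_status", "output_tail")
    digest = hashlib.sha256(
        f"{failing_gate}|{failure_type}|{details}".encode("utf-8")
    ).hexdigest()
    return EvidencePacket(
        evidence_id=f"ev-{digest[:12]}",
        failing_gate=failing_gate,
        failure_type=failure_type,
        summary=summary,
        details=details,
        packet_ids=packet_ids,
    )
```

Because the packet is frozen and built in one place, nothing downstream can append raw logs to it after the fact.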

flowchart LR
    A["Raw gate result"] --> B["Failure classifier"]
    B --> C["EvidencePacket"]
    C --> D["Route decision"]
    C --> E["Candidate archive"]
    C --> F["HTML report"]

Candidate Metadata

The benchmark uses structured candidate metadata:

              
{
    "patch": "candidate.patch",
    "label": "spec_complete_repair",
    "generator": "synthetic_generator",
    "intent": "complete_spec_repair",
    "compatible_routes": [
      "behavior_repair",
      "regression_repair",
      "scope_repair"
    ]
}

This separation prevents the controller from inferring candidate quality from patch filenames. Route-aware selection relies only on declared candidate metadata, while generated reports record metadata-route matches for auditability.
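A small sketch of how structured and path-only inputs might be normalized into one shape before selection. The keys match the metadata example above; the `normalize_candidate` helper and its validation rules are hypothetical.

```python
from typing import Any, Dict, Union


def normalize_candidate(entry: Union[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Return a uniform candidate record from a bare patch path or a metadata dict.

    A bare path gets an empty route list, so it can only be chosen by original
    order. Nothing is inferred from the filename.
    """
    if isinstance(entry, str):
        return {"patch": entry, "label": None, "generator": None,
                "intent": None, "compatible_routes": []}
    if "patch" not in entry:
        raise ValueError("candidate metadata must name a patch")
    routes = entry.get("compatible_routes", [])
    if not isinstance(routes, list) or not all(isinstance(r, str) for r in routes):
        raise ValueError("compatible_routes must be a list of strings")
    return {
        "patch": entry["patch"],
        "label": entry.get("label"),
        "generator": entry.get("generator"),
        "intent": entry.get("intent"),
        "compatible_routes": list(routes),
    }
```

Normalizing early means the selector only ever sees one record shape, and a suggestively named file such as `fix_regression.patch` gains no routing advantage from its name.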

Baselines

The benchmark includes the following six baselines:

Baseline	What It Tests
`single_shot`	One candidate, no correction loop.
`linear_retry`	Ordered retry up to the same candidate budget.
`clean_context_review`	Ordered retry without DarwinPatch route-aware selection or evidence archive.
`archive_no_routing`	Archive-compatible retry without route-aware selection.
`evidence_aware_review`	Uses bounded evidence plus candidate metadata, but no archive, lineage, parent scoring, fingerprint stopping, or policy machinery.
`full_darwinpatch`	Full controller with evidence packets, failure fingerprints, archive, route-aware selection, hard gates, reports, and policy experiment support.

The evidence_aware_review baseline is especially important. It answers the skeptical question:

Is DarwinPatch only beating dumb ordered retry?

The answer is no. The stronger baseline shows that bounded evidence plus route metadata is the main selection lift. Full DarwinPatch then adds the engineering system around that lift.

Benchmark Design

The default benchmark is a 40-case deterministic repair-search study across the following four task families:

Markdown table parsing
string slugification
config loading
numeric range parsing

Each task family includes:

successful first attempts,
visible-test failures,
withheld-regression failures,
scope-policy failures,
route-decoy cases,
repeated unsolved failures.

This keeps the benchmark small enough to run locally but broad enough to exercise the reliability story.

Note that the default demo is offline and deterministic by design: this keeps the evaluation reproducible and makes the benchmark suitable for controlled comparison before adding live LLM variability.

  flowchart TD
    A["40 benchmark cases"] --> B["4 task families"]
    B --> C["Markdown parser"]
    B --> D["Slugifier"]
    B --> E["Config loader"]
    B --> F["Range parser"]

    A --> G["Failure modes"]
    G --> H["Visible test failure"]
    G --> I["Withheld regression failure"]
    G --> J["Scope violation"]
    G --> K["Route decoy"]
    G --> L["Repeated unsolved failure"]

Results

The current deterministic benchmark result shape is:

Baseline	solve@budget	95% CI	Avg Attempts	Evidence Packets	Route Metadata Matches
`single_shot`	0.2	0.105-0.3476	1.0	0	0
`linear_retry`	0.6	0.446-0.7365	1.8	0	0
`clean_context_review`	0.6	0.446-0.7365	1.8	0	0
`archive_no_routing`	0.6	0.446-0.7365	1.8	0	0
`evidence_aware_review`	0.9	0.7695-0.9604	1.8	36	28
`full_darwinpatch`	0.9	0.7695-0.9604	1.8	36	28
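The source does not name the method behind the 95% CI column. The reported bounds are consistent with a Wilson score interval over the 40 benchmark cases with z = 1.96, so the sketch below assumes that method. The function name is illustrative.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (assumed CI method)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Solve rates 0.2, 0.6 and 0.9 over 40 cases correspond to 8, 24 and 36 solved.
for solved in (8, 24, 36):
    low, high = wilson_interval(solved, 40)
    print(f"{solved / 40:.1f}  {low:.4f}-{high:.4f}")
```

Under this assumption the intervals for 0.6 and 0.9 overlap only slightly at the edges, which is worth keeping in mind when reading a lift measured on 40 cases.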

The result is intentionally interpreted with care:

single_shot shows that one attempt is not reliable enough.
Ordered-retry baselines improve solve rate but fail route-decoy cases.
evidence_aware_review shows that bounded evidence plus candidate metadata is the main repair-selection mechanism.
full_darwinpatch ties that solve rate while adding auditability, fingerprints, lineage, policy selection, and reportability.

That is a more honest result than claiming the full controller magically beats every ablation. DarwinPatch is a reliability system: the selection lift is isolated, and the full product value is in turning that lift into controlled, inspectable engineering behavior.

Evaluation Artifacts

After running the evaluation, the following artifacts are generated:

Policy Experiment

DarwinPatch also includes a guarded train/validation/test policy experiment.

  flowchart LR
    A["Source benchmark suite"] --> B["Train split"]
    A --> C["Validation split"]
    A --> D["Held-out test split"]

    B --> E["Policy tuning"]
    C --> F["Policy selection"]
    F --> G["Selected policy"]
    G --> D
    D --> H["Final report only"]

The included policies are:

Policy	Max Candidates	Max Repeated Failures
`one_attempt_conservative`	1	2
`two_attempt_budgeted`	2	2
`strict_repeat_stop`	3	1
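The two budget columns can be read as stopping rules. The sketch below encodes the three policies from the table and one possible interpretation of them: stop when the candidate budget is spent, or when any single failure fingerprint has been seen more times than "Max Repeated Failures" allows. That interpretation, and the `attempts_run` helper, are assumptions; the source does not spell out the exact semantics.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Policy:
    name: str
    max_candidates: int
    max_repeated_failures: int


# Values copied from the policy table.
POLICIES = [
    Policy("one_attempt_conservative", 1, 2),
    Policy("two_attempt_budgeted", 2, 2),
    Policy("strict_repeat_stop", 3, 1),
]


def attempts_run(policy: Policy, outcomes: Iterable[Optional[str]]) -> int:
    """Count attempts executed before the policy stops the loop.

    `outcomes` yields one item per attempt in order: a failure fingerprint
    string for a rejected candidate, or None for a promoted one.
    """
    seen: Dict[str, int] = {}
    attempts = 0
    for fingerprint in outcomes:
        if attempts >= policy.max_candidates:
            break  # candidate budget exhausted
        attempts += 1
        if fingerprint is None:
            break  # promoted, nothing left to repair
        seen[fingerprint] = seen.get(fingerprint, 0) + 1
        if seen[fingerprint] > policy.max_repeated_failures:
            break  # same failure mode keeps recurring (assumed rule)
    return attempts
```

Under this reading, `strict_repeat_stop` has the largest candidate budget but gives up as soon as a failure mode repeats once.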

The selected policy is:

two_attempt_budgeted

The held-out test path runs only the selected policy and is not used for tuning or policy selection.
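The guard can be sketched as a deterministic split plus a selection step that only ever sees validation cases. The 50/25/25 proportions, the seed, the tie-breaking rule, and the `evaluate` callable are assumptions; the source does not give the actual split sizes or selection criterion.

```python
import random
from typing import Callable, List, Sequence, Tuple


def split_cases(case_ids: Sequence[str], seed: int = 0) -> Tuple[List[str], List[str], List[str]]:
    """Deterministically split cases into train, validation, and held-out test.

    Proportions are an assumed 50/25/25.
    """
    ids = sorted(case_ids)
    random.Random(seed).shuffle(ids)
    n_train = len(ids) // 2
    n_val = len(ids) // 4
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


def select_policy(
    policy_names: Sequence[str],
    validation_cases: Sequence[str],
    evaluate: Callable[[str, Sequence[str]], float],
) -> str:
    """Pick the policy with the best validation score; ties go to the earlier policy.

    `evaluate` is a hypothetical callable returning solve@budget for a policy
    on the given cases. Test cases are never passed to it here.
    """
    if not policy_names:
        raise ValueError("no policies to choose from")
    best_name = policy_names[0]
    best_score = evaluate(best_name, validation_cases)
    for name in policy_names[1:]:
        score = evaluate(name, validation_cases)
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```

Only the name returned by `select_policy` would then be run on the test split, once, for the final report.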

What This Demonstrates

DarwinPatch demonstrates a compact but meaningful reliability pattern for coding agents:

Self-correction should be verifier-grounded: The next attempt should be routed from actual failure evidence, not just a generic "try again" prompt.
Hidden regression gates should stay hidden: A system can know that a release gate failed without leaking the entire hidden test output into the repair loop.
Repair attempts should be auditable: A promoted patch should come with a trail: what failed, what evidence was admitted, what route was chosen, what candidate was selected, and which gates passed.
Ablations matter: The project does not only compare against single-shot generation. It includes ordered retry, archive without routing, and evidence-aware review.
Such MVPs can be honest and still impressive: The benchmark is curated and deterministic. That limitation is stated directly. The result is not "we solved autonomous software engineering." The result is "we built and measured a small reliability controller that makes repair-search behavior inspectable and reproducible."

The same verification-grounded repair pattern applies directly to long-horizon coding-agent loops, where a subagent may propose file edits that pass surface tests but regress hidden behavior. That is exactly the failure mode DarwinPatch's release-gate boundary is designed to contain.

Limitations

DarwinPatch is still a controlled MVP. The main limitations are:

The multi-task benchmark uses curated candidate patch pools rather than live generated candidates.
evidence_aware_review ties full_darwinpatch on solve@budget, so the full controller's added value is auditability and reliability infrastructure rather than a higher controlled solve rate.
The current tasks are small Python repair tasks, not large multi-file product changes.
The optional LLM mode is demonstrated on the hero task, not yet across the full benchmark suite.
Cost and token metrics are intentionally left out for now.

These limitations are acceptable for the current goal: build a strong, reproducible technical demo that tells a clear story and can be extended into a larger agent reliability system.

Future Work

The next steps are straightforward:

Add a cached live-LLM benchmark subset across multiple tasks.
Compare different model families on the same hard gates and report artifacts.
Add token and wall-clock cost accounting.
Expand tasks to larger multi-file repairs.
Add richer candidate generators so metadata comes from real generation traces instead of curated benchmark specs.

Conclusion

DarwinPatch is a small project, but the design target is serious: make coding-agent self-correction measurable, bounded, and inspectable.

The strongest version of the claim is:

DarwinPatch shows that a coding agent repair loop can be wrapped with hard verification, bounded evidence, route-aware selection, and auditable reporting. In the controlled benchmark, bounded evidence plus route metadata lifts solve@budget from ordered retry's 0.6 to 0.9, while the full DarwinPatch controller turns that behavior into a reproducible reliability system.

That is the story DarwinPatch is meant to tell.

BibTeX

@article{darwinpatch,
  title={DarwinPatch: A Budgeted Repair-Search Controller for Reliable Coding Agents},
  author={Taneem Ullah Jan},
  year={2026}
}