DarwinPatch is a runnable prototype for making coding-agent repair loops more reliable. It does not
try to replace a coding agent. Instead, it wraps candidate code patches with a disciplined controller that
verifies, rejects, routes, archives, and reports the repair process.
The core idea is simple:
The coding agent should not just generate a patch. It should generate a patch, test it through hard gates, learn only from bounded evidence, route the next attempt deliberately, and leave behind an audit trail that explains why a candidate was rejected or promoted.
DarwinPatch demonstrates that idea in a controlled benchmark:
single_shot: solve@budget = 0.2
linear_retry: solve@budget = 0.6
clean_context_review: solve@budget = 0.6
archive_no_routing: solve@budget = 0.6
evidence_aware_review: solve@budget = 0.9
full_darwinpatch: solve@budget = 0.9
The most important result is not just that full_darwinpatch reaches 0.9. The
stronger interpretation is:
Modern coding agents can produce impressive patches, but long-horizon coding work has a reliability problem. A single generated patch may:
The usual retry loop is often too weak:
generate patch -> run tests -> if failed, try again
That loop misses the real engineering issue. The question is not only "can the model eventually produce a working patch?" The better question is:
Can the system control the repair process so failed attempts become bounded evidence, unsafe patches are rejected, regressions stay hidden from the repair loop, and the final promotion is explainable?
DarwinPatch is the answer to that question.
DarwinPatch is designed around five goals:
A patch is promoted only after scope checks, patch application, AST parsing, secret scanning, visible tests, and release-gate regression tests pass.
The controller receives compact EvidencePacket objects. Visible failures expose limited details; release-gate regressions deliberately withhold full output.
After a failure, the controller maps the evidence to a repair route such as behavior_repair, regression_repair, scope_repair, or syntax_repair.
Every candidate is recorded with route, status, score, parent, generation, touched files, verifier results, failure fingerprint, and evidence ID.
The repo includes a deterministic benchmark, ablations, confidence intervals, policy selection, and claim-to-artifact reporting.
At a high level, DarwinPatch is a reliability layer around candidate patches.
flowchart LR
A["Task spec and source repo"] --> B["Candidate patch"]
B --> C["DarwinPatch gates"]
C --> D{"All gates pass?"}
D -- "yes" --> E["Promote candidate"]
D -- "no" --> F["Build bounded EvidencePacket"]
F --> G["Route next repair"]
G --> H["Select route-compatible candidate"]
H --> C
C --> I["Candidate archive and traces"]
I --> J["HTML report and benchmark artifacts"]
The flow has four conceptual stages. First, DarwinPatch receives a task specification, an isolated source workspace, and one or more candidate patches. Second, each candidate is executed through hard verification gates so unsafe, malformed, or regressive edits are rejected before promotion. Third, failed candidates are compressed into bounded evidence and mapped to a repair route, which determines how the next candidate is selected. Finally, the system records candidate lineage, failure fingerprints, evidence IDs, benchmark results, and static report artifacts so the repair process is inspectable after the run.
The full DarwinPatch loop begins with an ordered first candidate. If it fails, the failure is classified and converted into bounded evidence. The critic maps that evidence to a route, and the next candidate is selected using declared candidate metadata such as compatible_routes, generator, and intent.
Importantly, route-aware selection does not inspect patch filenames to guess which candidate is correct. Path-only inputs remain supported for simple CLI use, but without route metadata they fall back to original order.
sequenceDiagram
participant C as Controller
participant R as Runner
participant G as Gates
participant E as Evidence
participant K as Critic
participant A as Archive
C->>R: Run candidate patch
R->>G: Scope, apply, AST, secret scan, visible tests, release gate
G-->>R: Gate results
R-->>C: RunResult
alt Candidate promoted
C->>A: Record promoted candidate
C-->>C: Stop with promoted status
else Candidate rejected
C->>E: Build bounded EvidencePacket
E-->>C: Evidence summary and limited details
C->>K: Route next attempt from evidence
K-->>C: behavior_repair / regression_repair / scope_repair / ...
C->>A: Record candidate, evidence, fingerprint, lineage
C-->>C: Select next route-compatible candidate
end
DarwinPatch treats verification as a hard boundary. A candidate cannot be promoted unless every configured gate passes.
flowchart TD
A["Candidate patch"] --> B["Scope guard"]
B --> C["Patch applies"]
C --> D["Python AST parse"]
D --> E["Secret scan"]
E --> F["Visible developer tests"]
F --> G["Release-gate regression tests"]
G --> H["Promote"]
B -- "fail" --> R["Reject and build evidence"]
C -- "fail" --> R
D -- "fail" --> R
E -- "fail" --> R
F -- "fail" --> R
G -- "fail" --> R
The release gate is intentionally different from visible tests. Visible test failures can expose bounded stdout/stderr tails. Release-gate regression failures return limited diagnostics and withhold the full failure output from the repair loop. This simulates a realistic hidden-test boundary: the system can know that it regressed behavior without leaking the exact hidden assertion.
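The gate ordering described above can be sketched in a few lines. This is a minimal illustration, not DarwinPatch's actual implementation: `GateResult`, `run_gates`, the blocked-prefix scope check, and the secret regex are all hypothetical names and simplifications. The sketch assumes the patch has already been applied and that `files` maps paths to patched contents; test gates are passed in as zero-argument callables.

```python
import ast
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

# Illustrative pattern only (an AWS-style key ID or a PEM private-key header).
# The real secret scanner is not described in the source.
SECRET_PATTERN = re.compile(r"AKIA[0-9A-Z]{16}|-----BEGIN [A-Z ]*PRIVATE KEY-----")


@dataclass
class GateResult:
    gate: str
    passed: bool
    detail: str = ""


def scope_guard(files: Dict[str, str], blocked_prefixes: Sequence[str]) -> GateResult:
    # Fail if any touched path falls under a blocked prefix.
    blocked = sorted(p for p in files if any(p.startswith(b) for b in blocked_prefixes))
    return GateResult("scope_guard", not blocked, ", ".join(blocked))


def ast_parse(files: Dict[str, str]) -> GateResult:
    # Fail on the first patched Python file that does not parse.
    for path, source in files.items():
        if path.endswith(".py"):
            try:
                ast.parse(source, filename=path)
            except SyntaxError as exc:
                return GateResult("ast_parse", False, f"{path}: {exc.msg}")
    return GateResult("ast_parse", True)


def secret_scan(files: Dict[str, str]) -> GateResult:
    hits = sorted(p for p, text in files.items() if SECRET_PATTERN.search(text))
    return GateResult("secret_scan", not hits, ", ".join(hits))


def run_gates(
    files: Dict[str, str],
    blocked_prefixes: Sequence[str],
    test_gates: Sequence[Callable[[], GateResult]] = (),
) -> List[GateResult]:
    """Run gates in order and stop at the first failure."""
    gates: List[Callable[[], GateResult]] = [
        lambda: scope_guard(files, blocked_prefixes),
        lambda: ast_parse(files),
        lambda: secret_scan(files),
        *test_gates,  # e.g. visible tests, then release-gate regressions
    ]
    results: List[GateResult] = []
    for gate in gates:
        result = gate()
        results.append(result)
        if not result.passed:
            break
    return results


def promoted(results: List[GateResult]) -> bool:
    # Because run_gates stops at the first failure, all-passed means every gate ran.
    return all(r.passed for r in results)
```

Stopping at the first failed gate keeps the rejection reason unambiguous, which is what lets a single failure map cleanly to one evidence summary.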
Each failed candidate produces an EvidencePacket with:
evidence_id
Example evidence summaries include:
visible developer tests failed
release-gate regressions failed with limited diagnostics
patch touched blocked paths: ...
patched Python did not parse
secret-like material detected
candidate patch did not apply
This is the important reliability distinction: DarwinPatch does not feed arbitrary logs back into the next attempt. It turns verifier output into a small evidence object with controlled information flow.
flowchart LR
A["Raw gate result"] --> B["Failure classifier"]
B --> C["EvidencePacket"]
C --> D["Route decision"]
C --> E["Candidate archive"]
C --> F["HTML report"]
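One way to picture the controlled information flow is a small constructor that bounds what each gate may expose. The following is a sketch under stated assumptions: the field names other than `evidence_id`, the 400-character tail limit, the `release_gate` gate label, and the hash-based ID scheme are all hypothetical, since the source does not specify them.

```python
import hashlib
from dataclasses import dataclass

MAX_TAIL_CHARS = 400  # assumed bound; the actual limit is not stated in the source


@dataclass(frozen=True)
class EvidencePacket:
    evidence_id: str
    gate: str
    summary: str
    details: str  # bounded output tail, or empty when output is withheld


def build_evidence(gate: str, summary: str, raw_output: str) -> EvidencePacket:
    # Release-gate failures withhold raw output entirely;
    # other gates expose only a bounded tail of it.
    details = "" if gate == "release_gate" else raw_output[-MAX_TAIL_CHARS:]
    digest = hashlib.sha256(f"{gate}\n{summary}\n{details}".encode("utf-8")).hexdigest()
    return EvidencePacket(f"ev-{digest[:12]}", gate, summary, details)
```

Deriving the ID from the packet contents makes identical failures produce identical IDs, which is one plausible basis for the failure fingerprints mentioned earlier.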
The benchmark uses structured candidate metadata:
{
"patch": "candidate.patch",
"label": "spec_complete_repair",
"generator": "synthetic_generator",
"intent": "complete_spec_repair",
"compatible_routes": [
"behavior_repair",
"regression_repair",
"scope_repair"
]
}
This separation prevents the controller from inferring candidate quality from patch filenames. Route-aware selection relies only on declared candidate metadata, while generated reports record metadata-route matches for auditability.
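A metadata-only selector might look like the sketch below. The function name `select_next` and the `tried` set are hypothetical, and the fallback (take the first untried candidate when none declares the route) is an assumption modeled on the stated behavior that inputs without route metadata fall back to original order.

```python
from typing import Optional, Sequence, Set


def select_next(candidates: Sequence[dict], route: str, tried: Set[str]) -> Optional[dict]:
    """Pick the next candidate using declared metadata only, never the patch filename."""
    remaining = [c for c in candidates if c["patch"] not in tried]
    if not remaining:
        return None
    for candidate in remaining:
        # A candidate without compatible_routes simply never matches here.
        if route in candidate.get("compatible_routes", ()):
            return candidate
    # No declared match: fall back to original order.
    return remaining[0]
```

Here `c["patch"]` is used only as an identity key for tracking which candidates were already tried; the selection decision itself reads nothing but `compatible_routes`.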
The benchmark includes the following six baselines:
| Baseline | What It Tests |
|---|---|
| single_shot | One candidate, no correction loop. |
| linear_retry | Ordered retry up to the same candidate budget. |
| clean_context_review | Ordered retry without DarwinPatch route-aware selection or evidence archive. |
| archive_no_routing | Archive-compatible retry without route-aware selection. |
| evidence_aware_review | Uses bounded evidence plus candidate metadata, but no archive, lineage, parent scoring, fingerprint stopping, or policy machinery. |
| full_darwinpatch | Full controller with evidence packets, failure fingerprints, archive, route-aware selection, hard gates, reports, and policy experiment support. |
The evidence_aware_review baseline is especially important. It answers the skeptical
question:
Is DarwinPatch only beating dumb ordered retry?
The answer is no. The stronger baseline shows that bounded evidence plus route metadata is the main selection lift. Full DarwinPatch then adds the engineering system around that lift.
The default benchmark is a 40-case deterministic repair-search study across the following four task families:
Markdown parser
Slugifier
Config loader
Range parser
Each task family includes:
This keeps the benchmark small enough to run locally but broad enough to exercise the reliability story.
Note that the default demo is offline and deterministic. This is intentional: it keeps the evaluation reproducible and makes the benchmark suitable for controlled comparison before adding live LLM variability.
flowchart TD
A["40 benchmark cases"] --> B["4 task families"]
B --> C["Markdown parser"]
B --> D["Slugifier"]
B --> E["Config loader"]
B --> F["Range parser"]
A --> G["Failure modes"]
G --> H["Visible test failure"]
G --> I["Withheld regression failure"]
G --> J["Scope violation"]
G --> K["Route decoy"]
G --> L["Repeated unsolved failure"]
The current deterministic benchmark result shape is:
| Baseline | solve@budget | 95% CI | Avg Attempts | Evidence Packets | Route Metadata Matches |
|---|---|---|---|---|---|
| single_shot | 0.2 | 0.105-0.3476 | 1.0 | 0 | 0 |
| linear_retry | 0.6 | 0.446-0.7365 | 1.8 | 0 | 0 |
| clean_context_review | 0.6 | 0.446-0.7365 | 1.8 | 0 | 0 |
| archive_no_routing | 0.6 | 0.446-0.7365 | 1.8 | 0 | 0 |
| evidence_aware_review | 0.9 | 0.7695-0.9604 | 1.8 | 36 | 28 |
| full_darwinpatch | 0.9 | 0.7695-0.9604 | 1.8 | 36 | 28 |
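The source does not name the interval method, but the reported 95% CIs are numerically consistent with a Wilson score interval over 40 cases. The sketch below makes that check reproducible. The function name is hypothetical, and the solved counts (8, 24, 36) are derived here from the reported rates times 40 rather than taken from the benchmark output.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Solved counts inferred from solve@budget * 40 cases (an assumption, see above).
    for name, solved in [("single_shot", 8), ("linear_retry", 24), ("full_darwinpatch", 36)]:
        low, high = wilson_interval(solved, 40)
        print(f"{name}: {low:.4f}-{high:.4f}")
```

Unlike the normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly at rates near 0.9 with only 40 cases, which makes it a reasonable choice for a benchmark of this size.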
The result is intentionally interpreted with care:
single_shot shows that one attempt is not reliable enough.
evidence_aware_review shows that bounded evidence plus candidate metadata is the main repair-selection mechanism.
full_darwinpatch ties that solve rate while adding auditability, fingerprints, lineage, policy selection, and reportability.
That is a more honest result than claiming the full controller magically beats every ablation. DarwinPatch is a reliability system: the selection lift is isolated, and the full product value is in turning that lift into controlled, inspectable engineering behavior.
After running the evaluation, the following artifacts are generated:
DarwinPatch also includes a guarded train/validation/test policy experiment.
flowchart LR
A["Source benchmark suite"] --> B["Train split"]
A --> C["Validation split"]
A --> D["Held-out test split"]
B --> E["Policy tuning"]
C --> F["Policy selection"]
F --> G["Selected policy"]
G --> D
D --> H["Final report only"]
The included policies are:
| Policy | Max Candidates | Max Repeated Failures |
|---|---|---|
| one_attempt_conservative | 1 | 2 |
| two_attempt_budgeted | 2 | 2 |
| strict_repeat_stop | 3 | 1 |
The selected policy is:
two_attempt_budgeted
The held-out test path runs only the selected policy and is not used for tuning or policy selection.
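The two policy parameters in the table could be enforced by a loop like the following. This is a sketch, not the project's controller: the `Policy` dataclass, the status strings, and the `verify` callback are hypothetical, and the reading of "Max Repeated Failures" (stop once the same failure fingerprint has occurred more times than the limit) is an assumption the source does not confirm.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Tuple


@dataclass(frozen=True)
class Policy:
    name: str
    max_candidates: int
    max_repeated_failures: int


POLICIES = [
    Policy("one_attempt_conservative", 1, 2),
    Policy("two_attempt_budgeted", 2, 2),
    Policy("strict_repeat_stop", 3, 1),
]


def run_with_policy(
    policy: Policy,
    candidates: Iterable[str],
    verify: Callable[[str], Optional[str]],
) -> Tuple[str, int]:
    """Try candidates in order under a budget.

    verify(candidate) returns None when every gate passes,
    otherwise a failure-fingerprint string.
    """
    seen: Counter = Counter()
    attempts = 0
    for candidate in candidates:
        if attempts >= policy.max_candidates:
            return "budget_exhausted", attempts
        attempts += 1
        fingerprint = verify(candidate)
        if fingerprint is None:
            return "promoted", attempts
        seen[fingerprint] += 1
        # Assumed semantics: stop when one fingerprint exceeds the allowed repeats.
        if seen[fingerprint] > policy.max_repeated_failures:
            return "repeated_failure_stop", attempts
    return "candidates_exhausted", attempts
```

Under this reading, strict_repeat_stop gives up as soon as a second candidate fails the same way as the first, while two_attempt_budgeted always spends its full budget of two attempts before stopping.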
DarwinPatch demonstrates a compact but meaningful reliability pattern for coding agents:
The same verification-grounded repair pattern applies directly to long-horizon coding-agent loops in which a subagent proposes file edits that pass surface tests but regress hidden behavior, which is the failure mode DarwinPatch's release-gate boundary is designed to contain.
DarwinPatch is still a controlled MVP. The main limitations are:
evidence_aware_review ties full_darwinpatch on solve@budget, so the full controller's added value is auditability and reliability infrastructure rather than a higher solve rate in the controlled benchmark.
These limitations are acceptable for the current goal: build a strong, reproducible technical demo that tells a clear story and can be extended into a larger agent reliability system.
The next steps are straightforward:
DarwinPatch is a small project, but the design target is serious: make coding-agent self-correction measurable, bounded, and inspectable.
The strongest version of the claim is:
DarwinPatch shows that a coding agent repair loop can be wrapped with hard verification, bounded evidence, route-aware selection, and auditable reporting. In the controlled benchmark, bounded evidence plus route metadata lifts solve@budget from ordered retry's 0.6 to 0.9, while the full DarwinPatch controller turns that behavior into a reproducible reliability system.
That is the story DarwinPatch is meant to tell.
@article{darwinpatch,
  title={DarwinPatch: A Budgeted Repair-Search Controller for Reliable Coding Agents},
  author={Taneem Ullah Jan},
  year={2026}
}