Virtuous cycle
How the four operations Penmark performs on a puzzle — generate, solve, grade, enumerate — and the discovery loop that connects them fit together as one feedback system rather than four standalone tools.
The loop
```mermaid
flowchart LR
GEN["Generate<br/><sub>Genre::generate</sub><br/><sub>scaffold → bare-solve →<br/>derive clues → reduce</sub>"]
PUZ[(Puzzle corpus)]
SLOW["Slow solve<br/><sub>SimpleSolver,<br/>OR-Tools CP-SAT</sub><br/><sub>reference truth</sub>"]
FAST["Fast solve<br/><sub>FastSolver</sub><br/><sub>AC-3 + trial-1<br/>+ Propagator</sub>"]
GRADE["Grade<br/><sub>StandardGrader</sub><br/><sub>fire named Techniques<br/>in priority order</sub>"]
ENUM["Enumerate<br/><sub>Genre::enumerate</sub><br/><sub>stream unique puzzles<br/>at a given size</sub>"]
DIFF["Difficulty<br/>label"]
DISCOVER["Discover<br/><sub>mining pass over corpus:<br/>capture contradictions,<br/>cluster proof-shapes</sub>"]
HUMAN["Human curation<br/><sub>inspect candidates,<br/>encode as Technique impl</sub>"]
CATALOG[("Grader<br/>technique<br/>catalog")]
GEN -->|new puzzle| PUZ
ENUM -->|stream| PUZ
PUZ --> SLOW
PUZ --> FAST
SLOW -->|cross-check| FAST
FAST --> GRADE
GRADE --> DIFF
DIFF -->|filter / re-roll| GEN
PUZ --> DISCOVER
FAST -->|trial-k probes,<br/>contradiction proofs| DISCOVER
DISCOVER --> HUMAN
HUMAN -->|new Technique impl| CATALOG
CATALOG --> GRADE
CATALOG -->|finer-grained<br/>difficulty control| GEN
classDef gen fill:#1f3b73,stroke:#3b5fa8,color:#fff
classDef solve fill:#2d5a3d,stroke:#4a8a5a,color:#fff
classDef grade fill:#5a4a2d,stroke:#8a7a4a,color:#fff
classDef discover fill:#5a2d4a,stroke:#8a4a7a,color:#fff
classDef store fill:#3a3a3a,stroke:#6a6a6a,color:#fff
class GEN,ENUM gen
class SLOW,FAST solve
class GRADE,DIFF grade
class DISCOVER,HUMAN discover
class PUZ,CATALOG store
```
The coloring separates the four “boxes a user runs” from the discovery loop and the passive stores:
- blue — sources of puzzles (generate, enumerate).
- green — solvers (slow reference + fast production).
- gold — grading (the technique catalog plus the resulting difficulty label).
- purple — the discovery loop that grows the catalog.
- grey — passive stores (the puzzle corpus, the technique catalog).
How each edge actually flows
Generate → corpus
genre.generate runs the universal scaffold_solve_derive
recipe (or a genre-specific override): pick structural materials,
bare-solve a completion with FastSolver, derive clues via
genre.derive_clue_at (the typed ClueValue-producing hook;
defaults through genre.clue_pattern for u8 genres), then reduce
by symmetry orbit until clue density hits the floor. The output is one uniquely-solvable
Puzzle (with its genre field set to the runtime handle).
Repeated calls under varied seeds populate a corpus file (the
per-genre data/puzzles/<genre>.jsonl[.gz]).
genre.enumerate is the same recipe run exhaustively — instead of
“give me one good puzzle for this config,” it streams every
uniquely-solvable puzzle at a given size. Either path lands in
the same on-disk corpus.
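As a sketch, that recipe in code, with every name below a stand-in for the real Genre/FastSolver hooks rather than the in-tree API:

```rust
// Hypothetical sketch of scaffold → bare-solve → derive clues → reduce.
// All types and helpers here are illustrative stand-ins, not the real API.

struct Scaffold;                           // structural materials for one config
struct Solution(Vec<u8>);                  // a bare-solved completion
struct Puzzle { clues: Vec<(usize, u8)> }  // (cell index, clue value)

fn pick_scaffold(_size: usize, _seed: u64) -> Scaffold { todo!() }
fn bare_solve(_s: &Scaffold) -> Option<Solution> { todo!() }        // FastSolver
fn derive_clue_at(_sol: &Solution, _cell: usize) -> u8 { todo!() }  // genre hook
fn symmetry_orbits(_size: usize) -> Vec<Vec<usize>> { todo!() }
fn has_unique_solution(_s: &Scaffold, _clues: &[(usize, u8)]) -> bool { todo!() }

fn scaffold_solve_derive(size: usize, seed: u64, floor: f64) -> Option<Puzzle> {
    let scaffold = pick_scaffold(size, seed);           // 1. scaffold
    let solution = bare_solve(&scaffold)?;              // 2. bare-solve
    let mut clues: Vec<(usize, u8)> = (0..size * size)  // 3. derive every clue
        .map(|c| (c, derive_clue_at(&solution, c)))
        .collect();
    for orbit in symmetry_orbits(size) {                // 4. reduce by orbit
        let trial: Vec<_> = clues
            .iter()
            .filter(|(c, _)| !orbit.contains(c))
            .cloned()
            .collect();
        let density = trial.len() as f64 / (size * size) as f64;
        // Drop the orbit only while uniqueness survives and the clue
        // density stays at or above the floor.
        if density >= floor && has_unique_solution(&scaffold, &trial) {
            clues = trial;
        }
    }
    Some(Puzzle { clues })
}
```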
Corpus → solve
Two solvers run in parallel against the same corpus, for different reasons:
- Slow solve is the reference. SimpleSolver is brute-force DFS over HashMap<Coord, Vec<Mark>>; the obvious-correct impl every faster solver gets cross-checked against. OrToolsSolver (CP-SAT, behind the ortools feature) is the “real CP solver” point of comparison — it answers the question “how close is our bespoke propagator to a production CP engine on this corpus?”
- Fast solve is production. FastSolver runs DFS plus AC-3 propagation through a Propagator over a FastGame<'p> (bitmask domains, per-constraint aggregates, per-cell counters), with a depth-1 trial-1 sweep at every node. This is what the grader and generator call internally.
The cross-check edge between them is what keeps the fast solver
honest. Every simple_and_fast_agree_* test asserts the two
return the same solution set on a fixed puzzle; the corpus-wide
bench harness asserts the same across thousands of puzzles.
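In spirit, each of those tests is a one-assert cross-check. A sketch, where load_fixture and solve_all are assumed names rather than the real harness API:

```rust
// Sketch of a simple_and_fast_agree_* style test. `load_fixture` and
// `solve_all` are assumed names, not the real harness API.
#[test]
fn simple_and_fast_agree_on_fixed_puzzle() {
    let puzzle = load_fixture("sudoku_6x6_001");
    // Compare full solution sets, not first solutions: a propagation bug in
    // FastSolver that prunes a legal value shows up as a missing solution.
    let mut slow = SimpleSolver::default().solve_all(&puzzle);
    let mut fast = FastSolver::default().solve_all(&puzzle);
    slow.sort();
    fast.sort();
    assert_eq!(slow, fast);
}
```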
Fast solve → grade
StandardGrader fires named human-pattern Techniques in
priority order against the same Propagator engine FastSolver
uses. Techniques are rule-kind-driven (Propagation-phase ones
participate in the inner fixpoint loop; Pattern-phase ones run
when propagation stalls), so genres sharing rule shapes share
deductions. The output is a difficulty label assigned by
counting which technique tiers were needed to close the puzzle.
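The control flow, as a sketch: the Technique trait shape here (phase, tier, apply) is an assumption, and taking the max tier is just one plausible counting scheme:

```rust
// Sketch of the grading loop. The trait shape and the max-tier difficulty
// mapping are illustrative assumptions, not the in-tree implementation.
fn grade(board: &mut Board, catalog: &[Box<dyn Technique>]) -> Difficulty {
    let mut hardest = 0u8;
    loop {
        // Inner fixpoint: Propagation-phase techniques, in priority order,
        // until none of them makes progress.
        let mut progressed = true;
        while progressed {
            progressed = false;
            for t in catalog.iter().filter(|t| t.phase() == Phase::Propagation) {
                if t.apply(board) {
                    hardest = hardest.max(t.tier());
                    progressed = true;
                }
            }
        }
        if board.is_solved() {
            return Difficulty::from_tier(hardest);
        }
        // Propagation stalled: fire the first Pattern-phase technique that
        // applies, then drop back into the fixpoint loop.
        match catalog.iter()
            .filter(|t| t.phase() == Phase::Pattern)
            .find(|t| t.apply(board))
        {
            Some(t) => hardest = hardest.max(t.tier()),
            None => return Difficulty::BeyondCatalog, // no technique closes it
        }
    }
}
```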
Difficulty → generate
The generator’s cfg.time_budget and cfg.clue_density knobs
control raw structural shape, but the grader’s difficulty
label is what makes the generator useful. A “produce a hard
6×6 sudoku” workflow is the generator producing candidates, the
grader rating each, and the harness re-rolling anything that
doesn’t land in the requested tier. Without the grader the
generator is just a uniqueness oracle.
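A sketch of that harness loop, with generate_one and grade as stand-in calls:

```rust
// Hypothetical re-roll harness: keep generating until a candidate grades
// into the requested tier. `generate_one` and `grade` are stand-in calls.
fn generate_with_difficulty(cfg: &GenConfig, want: Difficulty, mut seed: u64) -> Puzzle {
    loop {
        let candidate = generate_one(cfg, seed);
        if grade(&candidate) == want {
            return candidate;            // landed in the requested tier
        }
        seed = seed.wrapping_add(1);     // re-roll under a fresh seed
    }
}
```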
Corpus + fast solve → discover
The discovery loop is the part of this diagram that doesn’t
yet exist as production code but is the point of the whole
exercise. The shape: run trial-k preprocessing across the corpus
at depth k > 1, capture each contradiction’s minimal-variable-set
proof, cluster the proofs by structural shape (modulo symmetry),
and surface the top-N candidate patterns.
The patterns that cluster cleanly are candidate techniques. Each one is a deduction the existing grader catalog doesn’t have a name for, surfaced from real puzzles where it would have mattered.
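A speculative sketch of that pass; nothing below exists yet, and every name is invented to show the shape:

```rust
use std::collections::HashMap;

// Speculative sketch of the (unbuilt) mining pass. Every type and helper
// here is invented for illustration.
struct Proof { vars: Vec<u32> } // minimal variable set behind one contradiction

fn trial_k_contradictions(_p: &Puzzle, _k: u32) -> Vec<Proof> { todo!() }
fn canonical_shape(_proof: &Proof) -> u64 { todo!() } // canonical key mod symmetry

fn mine(corpus: &[Puzzle], k: u32, top_n: usize) -> Vec<(u64, Vec<Proof>)> {
    let mut clusters: HashMap<u64, Vec<Proof>> = HashMap::new();
    for puzzle in corpus {
        // Trial-k probe: assume candidate values to depth k, propagate, and
        // keep the minimal-variable-set proof of each contradiction found.
        for proof in trial_k_contradictions(puzzle, k) {
            clusters.entry(canonical_shape(&proof)).or_default().push(proof);
        }
    }
    // Rank clusters by frequency and surface the top-N as candidate patterns.
    let mut ranked: Vec<_> = clusters.into_iter().collect();
    ranked.sort_by_key(|(_, proofs)| std::cmp::Reverse(proofs.len()));
    ranked.truncate(top_n);
    ranked
}
```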
Discover → human → catalog
A human inspects the candidates. Patterns that read like
something a sudoku setter would call a technique (“locked
candidate,” “sashimi X-wing,” etc.) get encoded as Technique
impls and added to the grader’s catalog. Patterns that turn out
to be 12-variable monsters with no human-recognisable shape get
discarded — they don’t pollute the catalog.
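For concreteness, a locked-candidate encoding might look roughly like this, against an assumed trait shape rather than the real Technique trait:

```rust
// Sketch: "locked candidates" encoded as a Technique. The trait and the
// Board helper methods are assumed for illustration.
struct LockedCandidate;

impl Technique for LockedCandidate {
    fn phase(&self) -> Phase { Phase::Pattern }
    fn tier(&self) -> u8 { 2 } // a mid-tier rung on the difficulty ladder

    // If every remaining placement of a value inside one box sits on a
    // single row, that value can be eliminated from that row outside the box.
    fn apply(&self, board: &mut Board) -> bool {
        let mut fired = false;
        for (value, box_id) in board.value_box_pairs() {
            if let Some(row) = board.sole_row_for_value_in_box(value, box_id) {
                fired |= board.eliminate_in_row_outside_box(value, row, box_id);
            }
        }
        fired
    }
}
```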
The grader stays a catalog of hand-coded named techniques. The discovery loop is a helper for the human writing techniques, not a generator of them. That matters for three reasons:
- The grader API is unchanged; difficulty calibration stays anchored to human-recognisable techniques.
- Solver internals (aux-var state, proof shapes, search bookkeeping) never have to be human-readable.
- Auto-discovered candidates that aren’t interpretable get filtered at human review.
Catalog → generate (the long edge)
This is the edge that closes the cycle. A grader catalog with finer-grained tiers gives the generator finer-grained difficulty control: “produce a puzzle that needs technique X but not technique Y” becomes expressible. That produces a richer corpus that, on the next mining pass, surfaces a different set of candidate patterns — including some that the previous catalog was masking by always firing first.
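Purely speculative, but such a target might read like the following (no field here exists today):

```rust
// Speculative: what "needs technique X but not technique Y" could look like
// as a generation target. No such config exists in tree.
struct DifficultyTarget {
    required: Vec<TechniqueId>,   // grader trace must include each of these
    forbidden: Vec<TechniqueId>,  // grader trace must fire none of these
}

fn needs_xwing_no_trial() -> DifficultyTarget {
    DifficultyTarget {
        required: vec![TechniqueId::SashimiXWing],
        forbidden: vec![TechniqueId::TrialAndError],
    }
}
```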
Why this framing matters
Each operation in isolation is unremarkable: a CP solver, a DFS generator, a rule-firing grader, a difficulty tagger. The interesting part is that the four operations share the same propagation engine and so each improvement compounds:
- A faster FastSolver makes the generator produce more candidates per second, makes trial-k tractable for larger k, and makes the grader reach harder puzzles within the same time budget.
- A bigger grader catalog makes the generator’s difficulty output sharper and makes the discovery loop’s clustering pass see fewer false-positive “new” patterns.
- A bigger corpus makes both the bench harness more meaningful and the discovery loop more likely to surface low-frequency patterns.
The cycle’s bottleneck moves around. When the existing catalog covers most of the easy patterns, the discovery loop becomes the rate-limiting step. When the discovery loop’s canonicalisation surfaces clean clusters, the human-curation step becomes the rate-limiting step. When the catalog is rich enough that the generator has fine-grained control, the next unlock is harder corpora that exercise the long tail of techniques.
Where we are in this picture
The blue / green / gold parts (generate, both solvers, grader) are all in tree and exercised by the bench matrix and the canonical corpus. The purple discovery loop is not yet built — every “Technique” in the grader catalog is hand-written from scratch, with the discovery side filled in by human intuition rather than corpus mining.
The aux-vars / sudoku-triads experiment on
claude/penmark-algorithmic-speedups-9tMvr was a pre-discovery
attempt to enrich the propagator’s primitive set so the
discovery loop, when built, would have a richer vocabulary to
cluster proofs in. The integration didn’t pay off on the
current sudoku corpus (~2× slowdown, identical DFS node count
across difficulty tiers — trial-1 already subsumes the patterns
triads catch). The full writeup is in
penmark/notes/aux-vars-postmortem.md;
the takeaway is that trial-1 is the floor on the current
corpus, so future discovery work should target either
genuinely-hard puzzles (where trial-1 doesn’t close the gap and
the search tree grows) or other genres entirely.