Virtuous cycle
How the four operations Penmark performs on a puzzle — generate, solve, grade, enumerate — and the discovery loop that connects them fit together as one feedback system rather than four standalone tools.
The loop
```mermaid
flowchart LR
GEN["Generate<br/><sub>Genre::generate</sub><br/><sub>scaffold → bare-solve →<br/>derive clues → reduce</sub>"]
PUZ[(Puzzle corpus)]
SLOW["Slow solve<br/><sub>SimpleSolver,<br/>OR-Tools CP-SAT</sub><br/><sub>reference truth</sub>"]
FAST["Fast solve<br/><sub>FastSolver</sub><br/><sub>AC-3 + trial-1<br/>+ Propagator</sub>"]
GRADE["Grade<br/><sub>StandardGrader</sub><br/><sub>fire named Techniques<br/>in priority order</sub>"]
ENUM["Enumerate<br/><sub>Genre::enumerate</sub><br/><sub>stream unique puzzles<br/>at a given size</sub>"]
DIFF["Difficulty<br/>label"]
DISCOVER["Discover<br/><sub>mining pass over corpus:<br/>capture contradictions,<br/>cluster proof-shapes</sub>"]
HUMAN["Human curation<br/><sub>inspect candidates,<br/>encode as Technique impl</sub>"]
CATALOG[("Grader<br/>technique<br/>catalog")]
GEN -->|new puzzle| PUZ
ENUM -->|stream| PUZ
PUZ --> SLOW
PUZ --> FAST
SLOW -->|cross-check| FAST
FAST --> GRADE
GRADE --> DIFF
DIFF -->|filter / re-roll| GEN
PUZ --> DISCOVER
FAST -->|trial-k probes,<br/>contradiction proofs| DISCOVER
DISCOVER --> HUMAN
HUMAN -->|new Technique impl| CATALOG
CATALOG --> GRADE
CATALOG -->|finer-grained<br/>difficulty control| GEN
classDef gen fill:#1f3b73,stroke:#3b5fa8,color:#fff
classDef solve fill:#2d5a3d,stroke:#4a8a5a,color:#fff
classDef grade fill:#5a4a2d,stroke:#8a7a4a,color:#fff
classDef discover fill:#5a2d4a,stroke:#8a4a7a,color:#fff
classDef store fill:#3a3a3a,stroke:#6a6a6a,color:#fff
class GEN,ENUM gen
class SLOW,FAST solve
class GRADE,DIFF grade
class DISCOVER,HUMAN discover
class PUZ,CATALOG store
```
The coloring separates the four “boxes a user runs” from the discovery loop and the passive stores:
- blue — sources of puzzles (generate, enumerate).
- green — solvers (slow reference + fast production).
- gold — grading (the technique catalog plus the resulting difficulty label).
- purple — the discovery loop that grows the catalog.
- grey — passive stores (the puzzle corpus, the technique catalog).
How each edge actually flows
Generate → corpus
genre.generate runs the universal scaffold_solve_derive
recipe (or a genre-specific override): pick structural materials,
bare-solve a completion with FastSolver, derive clues via
genre.derive_clue_at (the typed ClueValue-producing hook;
defaults through genre.clue_pattern for u8 genres), then reduce
by symmetry orbit until clue density hits the floor. The output is one uniquely-solvable
Puzzle (with its genre field set to the runtime handle).
Repeated calls under varied seeds populate a corpus file (the
per-genre data/puzzles/<genre>.jsonl[.gz]).
genre.enumerate is the same recipe run exhaustively — instead of
“give me one good puzzle for this config,” it streams every
uniquely-solvable puzzle at a given size. Either path lands in
the same on-disk corpus.
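As a sketch, that recipe in code, with every name below a stand-in for the real Genre/FastSolver hooks rather than the in-tree API:

```rust
// Hypothetical sketch of scaffold → bare-solve → derive clues → reduce.
// All types and helpers here are illustrative stand-ins, not the real API.

struct Scaffold;                           // structural materials for one config
struct Solution(Vec<u8>);                  // a bare-solved completion
struct Puzzle { clues: Vec<(usize, u8)> }  // (cell index, clue value)

fn pick_scaffold(_size: usize, _seed: u64) -> Scaffold { todo!() }
fn bare_solve(_s: &Scaffold) -> Option<Solution> { todo!() }        // FastSolver
fn derive_clue_at(_sol: &Solution, _cell: usize) -> u8 { todo!() }  // genre hook
fn symmetry_orbits(_size: usize) -> Vec<Vec<usize>> { todo!() }
fn has_unique_solution(_s: &Scaffold, _clues: &[(usize, u8)]) -> bool { todo!() }

fn scaffold_solve_derive(size: usize, seed: u64, floor: f64) -> Option<Puzzle> {
    let scaffold = pick_scaffold(size, seed);           // 1. scaffold
    let solution = bare_solve(&scaffold)?;              // 2. bare-solve
    let mut clues: Vec<(usize, u8)> = (0..size * size)  // 3. derive every clue
        .map(|c| (c, derive_clue_at(&solution, c)))
        .collect();
    for orbit in symmetry_orbits(size) {                // 4. reduce by orbit
        let trial: Vec<_> = clues
            .iter()
            .filter(|(c, _)| !orbit.contains(c))
            .cloned()
            .collect();
        let density = trial.len() as f64 / (size * size) as f64;
        // Drop the orbit only while uniqueness survives and the clue
        // density stays at or above the floor.
        if density >= floor && has_unique_solution(&scaffold, &trial) {
            clues = trial;
        }
    }
    Some(Puzzle { clues })
}
```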
Corpus → solve
Two solvers run in parallel against the same corpus, for different reasons:
- Slow solve is the reference. SimpleSolver is brute-force DFS over HashMap<Coord, Vec<Mark>>; the obvious-correct impl every faster solver gets cross-checked against. OrToolsSolver (CP-SAT, behind the ortools feature) is the “real CP solver” point of comparison — it answers the question “how close is our bespoke propagator to a production CP engine on this corpus?”
- Fast solve is production. FastSolver runs DFS plus AC-3 propagation through a Propagator over a FastGame<'p> (bitmask domains, per-constraint aggregates, per-cell counters), with a depth-1 trial-1 sweep at every node. This is what the grader and generator call internally.
The cross-check edge between them is what keeps the fast solver
honest. Every simple_and_fast_agree_* test asserts the two
return the same solution set on a fixed puzzle; the corpus-wide
bench harness asserts the same across thousands of puzzles.
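In spirit, each of those tests is a one-assert cross-check. A sketch, where load_fixture and solve_all are assumed names rather than the real harness API:

```rust
// Sketch of a simple_and_fast_agree_* style test. `load_fixture` and
// `solve_all` are assumed names, not the real harness API.
#[test]
fn simple_and_fast_agree_on_fixed_puzzle() {
    let puzzle = load_fixture("sudoku_6x6_001");
    // Compare full solution sets, not first solutions: a propagation bug in
    // FastSolver that prunes a legal value shows up as a missing solution.
    let mut slow = SimpleSolver::default().solve_all(&puzzle);
    let mut fast = FastSolver::default().solve_all(&puzzle);
    slow.sort();
    fast.sort();
    assert_eq!(slow, fast);
}
```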
Fast solve → grade
StandardGrader fires named human-pattern Techniques in
priority order against the same Propagator engine FastSolver
uses. Techniques are rule-kind-driven (Propagation-phase ones
participate in the inner fixpoint loop; Pattern-phase ones run
when propagation stalls), so genres sharing rule shapes share
deductions. The output is a difficulty label assigned by
counting which technique tiers were needed to close the puzzle.
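The control flow, as a sketch: the Technique trait shape here (phase, tier, apply) is an assumption, and taking the max tier is just one plausible counting scheme:

```rust
// Sketch of the grading loop. The trait shape and the max-tier difficulty
// mapping are illustrative assumptions, not the in-tree implementation.
fn grade(board: &mut Board, catalog: &[Box<dyn Technique>]) -> Difficulty {
    let mut hardest = 0u8;
    loop {
        // Inner fixpoint: Propagation-phase techniques, in priority order,
        // until none of them makes progress.
        let mut progressed = true;
        while progressed {
            progressed = false;
            for t in catalog.iter().filter(|t| t.phase() == Phase::Propagation) {
                if t.apply(board) {
                    hardest = hardest.max(t.tier());
                    progressed = true;
                }
            }
        }
        if board.is_solved() {
            return Difficulty::from_tier(hardest);
        }
        // Propagation stalled: fire the first Pattern-phase technique that
        // applies, then drop back into the fixpoint loop.
        match catalog.iter()
            .filter(|t| t.phase() == Phase::Pattern)
            .find(|t| t.apply(board))
        {
            Some(t) => hardest = hardest.max(t.tier()),
            None => return Difficulty::BeyondCatalog, // no technique closes it
        }
    }
}
```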
Difficulty → generate
The generator’s cfg.time_budget and cfg.clue_density knobs
control raw structural shape, but the grader’s difficulty
label is what makes the generator useful. A “produce a hard
6×6 sudoku” workflow is the generator producing candidates, the
grader rating each, and the harness re-rolling anything that
doesn’t land in the requested tier. Without the grader the
generator is just a uniqueness oracle.
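A sketch of that harness loop, with generate_one and grade as stand-in calls:

```rust
// Hypothetical re-roll harness: keep generating until a candidate grades
// into the requested tier. `generate_one` and `grade` are stand-in calls.
fn generate_with_difficulty(cfg: &GenConfig, want: Difficulty, mut seed: u64) -> Puzzle {
    loop {
        let candidate = generate_one(cfg, seed);
        if grade(&candidate) == want {
            return candidate;            // landed in the requested tier
        }
        seed = seed.wrapping_add(1);     // re-roll under a fresh seed
    }
}
```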
Corpus + fast solve → discover
The discovery loop is the part of this diagram that doesn’t
yet exist as production code but is the point of the whole
exercise. The shape: run trial-k preprocessing across the corpus
at depth k > 1, capture each contradiction’s minimal-variable-set
proof, cluster the proofs by structural shape (modulo symmetry),
and surface the top-N candidate patterns.
The patterns that cluster cleanly are candidate techniques. Each one is a deduction the existing grader catalog doesn’t have a name for, surfaced from real puzzles where it would have mattered.
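A speculative sketch of that pass; nothing below exists yet, and every name is invented to show the shape:

```rust
use std::collections::HashMap;

// Speculative sketch of the (unbuilt) mining pass. Every type and helper
// here is invented for illustration.
struct Proof { vars: Vec<u32> } // minimal variable set behind one contradiction

fn trial_k_contradictions(_p: &Puzzle, _k: u32) -> Vec<Proof> { todo!() }
fn canonical_shape(_proof: &Proof) -> u64 { todo!() } // canonical key mod symmetry

fn mine(corpus: &[Puzzle], k: u32, top_n: usize) -> Vec<(u64, Vec<Proof>)> {
    let mut clusters: HashMap<u64, Vec<Proof>> = HashMap::new();
    for puzzle in corpus {
        // Trial-k probe: assume candidate values to depth k, propagate, and
        // keep the minimal-variable-set proof of each contradiction found.
        for proof in trial_k_contradictions(puzzle, k) {
            clusters.entry(canonical_shape(&proof)).or_default().push(proof);
        }
    }
    // Rank clusters by frequency and surface the top-N as candidate patterns.
    let mut ranked: Vec<_> = clusters.into_iter().collect();
    ranked.sort_by_key(|(_, proofs)| std::cmp::Reverse(proofs.len()));
    ranked.truncate(top_n);
    ranked
}
```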
Discover → human → catalog
A human inspects the candidates. Patterns that read like
something a sudoku setter would call a technique (“locked
candidate,” “sashimi X-wing,” etc.) get encoded as Technique
impls and added to the grader’s catalog. Patterns that turn out
to be 12-variable monsters with no human-recognisable shape get
discarded — they don’t pollute the catalog.
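For concreteness, a locked-candidate encoding might look roughly like this, against an assumed trait shape rather than the real Technique trait:

```rust
// Sketch: "locked candidates" encoded as a Technique. The trait and the
// Board helper methods are assumed for illustration.
struct LockedCandidate;

impl Technique for LockedCandidate {
    fn phase(&self) -> Phase { Phase::Pattern }
    fn tier(&self) -> u8 { 2 } // a mid-tier rung on the difficulty ladder

    // If every remaining placement of a value inside one box sits on a
    // single row, that value can be eliminated from that row outside the box.
    fn apply(&self, board: &mut Board) -> bool {
        let mut fired = false;
        for (value, box_id) in board.value_box_pairs() {
            if let Some(row) = board.sole_row_for_value_in_box(value, box_id) {
                fired |= board.eliminate_in_row_outside_box(value, row, box_id);
            }
        }
        fired
    }
}
```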
The grader stays a catalog of hand-coded named techniques. The discovery loop is a helper for the human writing techniques, not a generator of them. That matters for three reasons:
- The grader API is unchanged; difficulty calibration stays anchored to human-recognisable techniques.
- Solver internals (aux-var state, proof shapes, search bookkeeping) never have to be human-readable.
- Auto-discovered candidates that aren’t interpretable get filtered at human review.
Catalog → generate (the long edge)
This is the edge that closes the cycle. A grader catalog with finer-grained tiers gives the generator finer-grained difficulty control: “produce a puzzle that needs technique X but not technique Y” becomes expressible. That produces a richer corpus that, on the next mining pass, surfaces a different set of candidate patterns — including some that the previous catalog was masking by always firing first.
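Purely speculative, but such a target might read like the following (no field here exists today):

```rust
// Speculative: what "needs technique X but not technique Y" could look like
// as a generation target. No such config exists in tree.
struct DifficultyTarget {
    required: Vec<TechniqueId>,   // grader trace must include each of these
    forbidden: Vec<TechniqueId>,  // grader trace must fire none of these
}

fn needs_xwing_no_trial() -> DifficultyTarget {
    DifficultyTarget {
        required: vec![TechniqueId::SashimiXWing],
        forbidden: vec![TechniqueId::TrialAndError],
    }
}
```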
Why this framing matters
Each operation in isolation is unremarkable: a CP solver, a DFS generator, a rule-firing grader, a difficulty tagger. The interesting part is that the four operations share the same propagation engine and so each improvement compounds:
- A faster FastSolver makes the generator produce more candidates per second, makes trial-k tractable for larger k, and makes the grader reach harder puzzles within the same time budget.
- A bigger grader catalog makes the generator’s difficulty output sharper and makes the discovery loop’s clustering pass see fewer false-positive “new” patterns.
- A bigger corpus makes both the bench harness more meaningful and the discovery loop more likely to surface low-frequency patterns.
The cycle’s bottleneck moves around. When the existing catalog covers most of the easy patterns, the discovery loop becomes the rate-limiting step. When the discovery loop’s canonicalisation surfaces clean clusters, the human-curation step becomes the rate-limiting step. When the catalog is rich enough that the generator has fine-grained control, the next unlock is harder corpora that exercise the long tail of techniques.
Where we are in this picture
The blue / green / gold parts (generate, both solvers, grader) are all in tree and exercised by the bench matrix and the canonical corpus. The purple discovery loop is not yet built — every “Technique” in the grader catalog is hand-written from scratch, with the discovery side filled in by human intuition rather than corpus mining.
The aux-vars / sudoku-triads experiment on
claude/penmark-algorithmic-speedups-9tMvr was a pre-discovery
attempt to enrich the propagator’s primitive set so the
discovery loop, when built, would have a richer vocabulary to
cluster proofs in. The integration didn’t pay off on the
current sudoku corpus (~2× slowdown, identical DFS node count
across difficulty tiers — trial-1 already subsumes the patterns
triads catch). The full writeup is in
penmark/notes/aux-vars-postmortem.md;
the takeaway is that trial-1 is the floor on the current
corpus, so future discovery work should target either
genuinely-hard puzzles (where trial-1 doesn’t close the gap and
the search tree grows) or other genres entirely.