Fast solver
A report on the architecture of the FastSolver (penmark-core),
the optimizations that landed, the ones we tried and reverted, and
ideas we discussed but haven’t pursued.
The engine lives under core/src/algorithm/ in two families:
- algorithm/fast/ — the production family.
  - game.rs — FastGame, the bitmask-backed CSP state (per-coord candidate bitmasks + apply/undo + slow status()).
  - propagator.rs — Propagator, the speed: every incremental cache (per-constraint aggregates, per-cell counters, dirty queue, region indices, trial order) plus the forced_moves_at / eval_constraint_at machinery.
  - solver.rs — FastSolver, the search loop. Drives a Propagator wrapping a FastGame.
- algorithm/simple/ — the brute-force reference family.
  - game.rs — SimpleGame, a HashMap<Coord, Vec<Mark>>-backed minimal Game impl. No bitmask, no caches.
  - solver.rs — SimpleSolver, plain DFS over SimpleGame. The correctness baseline — every other solver should beat it badly.
- algorithm/geometry.rs — region/reach helpers shared between the two families.
- algorithm/eval.rs — generic constraint evaluator with both a fast eval_constraint (assumes pre-resolved regions) and a slow eval_constraint_full (handles LosCross / Neighbors4 / CellMin by walking the puzzle on the fly).
- algorithm/grader.rs — StandardGrader, the human-techniques grader. Drives a Propagator (same engine as FastSolver) and fires named techniques in priority order. Techniques are generic, keyed off Rule kind + cached aggregate state: exact-count-zero, exact-count-saturated, exact-count-forced, at-most-saturated, at-least-one-witness, cell-min-witness, plus trial-1 as backward escalation. Any genre using these rule kinds gets the techniques for free; genres needing bespoke patterns can append their own Technique impls.
1. What the solver does
At every DFS node:
- Forward propagation to fixpoint. AC-3-style worklist over constraints, plus a parallel dirty-cell drain for CellMin-style counter rules. Continues until both queues are empty.
- Trial-1 sweep with intra-sweep cascade. Walk open coords in degree-descending order (game.trial_order()). At each coord, hypothesize every candidate; any commit that propagates to a contradiction is recorded as a forced elimination. Eliminations are applied immediately and followed by an inline forward-propagation drain, before moving to the next coord.
- Branch. If still InProgress, pick the first trial_order-ranked open coord and branch over its candidates. For binary domains (Akari), “smallest domain” MRV is degenerate (every open cell has 2 candidates) — degree is the heuristic that actually shrinks the search tree.
The undo log inside FastGame records every applied move; on exit
from each DFS frame we pop exactly the count of moves applied at
this level.
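The shape of one node, as a minimal sketch. The Prop trait and its method names (propagate_to_fixpoint, trial1_sweep, applied_count, undo_to) are hypothetical stand-ins, not the real Propagator API; the point is the frame-level undo accounting.

```rust
// Hedged sketch of one DFS node. The trait and method names are hypothetical
// stand-ins for the real Propagator API; applied_count/undo_to model the
// "pop exactly the moves applied at this level" discipline.
#[derive(Clone, Copy, PartialEq)]
enum Status { Solved, Contradiction, InProgress }

trait Prop {
    fn propagate_to_fixpoint(&mut self) -> Status; // AC-3 worklist + dirty-cell drain
    fn trial1_sweep(&mut self) -> Status;          // degree-ordered, cascades inline
    fn branch_coord(&self) -> Option<usize>;       // first trial_order coord with >= 2 candidates
    fn candidates(&self, coord: usize) -> Vec<u32>;
    fn applied_count(&self) -> usize;              // current length of the undo log
    fn apply(&mut self, coord: usize, mark_bit: u32);
    fn undo_to(&mut self, count: usize);           // pop moves until the log is back to `count`
}

fn dfs(p: &mut impl Prop) -> bool {
    let frame = p.applied_count(); // everything applied below belongs to this frame
    let mut status = p.propagate_to_fixpoint();
    if status == Status::InProgress {
        status = p.trial1_sweep();
    }
    match status {
        Status::Solved => return true,
        Status::Contradiction => { p.undo_to(frame); return false; }
        Status::InProgress => {}
    }
    let coord = match p.branch_coord() {
        Some(c) => c,
        None => { p.undo_to(frame); return false; }
    };
    let before_branch = p.applied_count(); // this frame's deductions survive across candidates
    for mark_bit in p.candidates(coord) {
        p.apply(coord, mark_bit);
        if dfs(p) { return true; } // leave the solution's moves applied
        p.undo_to(before_branch);  // the child undid its own frame; pop the branch move
    }
    p.undo_to(frame); // exhausted: pop exactly the moves applied at this level
    false
}
```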
The same three-phase shape lives in the grader, except:
- the grader runs in an outer loop instead of recursing;
- the grader keeps a step trace tagged by RuleKind;
- the grader never branches (trial-1 is its terminal technique).
The shared propagation routines are intentionally close to literal copies — letting them diverge for tiny gains would cost more than it’s worth.
2. The FastGame / Propagator split
FastGame is intentionally minimal: per-coord candidate bitmasks,
the coord_index / mark_to_bit lookups, the undo_stack, and an
empty_count counter that lets status() short-circuit in O(1) when
no domain is empty. Apply and undo flip the bit, push/pop the
undo log, and maintain empty_count. That’s it. status() walks all
constraints with a slow per-region resolver — fine for tests and
the brute-force Simple solver, slow on the hot path.
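A stripped-down sketch of that state, assuming nothing beyond what is described here: u32 masks, an undo log of AddBack/Restore deltas, and an empty_count maintained on every mask write. The real FastGame holds more (coord_index, mark_to_bit, the puzzle itself); the field shapes below are illustrative.

```rust
// Illustrative sketch only: the real FastGame also carries coord_index,
// mark_to_bit, and the puzzle. This models the bit-flip + undo-log +
// empty_count core described above.
enum Undo {
    AddBack { idx: usize, bit: u32 },  // reverse an elimination: re-add the bit
    Restore { idx: usize, mask: u32 }, // reverse a commit: restore the full pre-move mask
}

struct FastGameSketch {
    candidates: Vec<u32>,  // one bit per mark, cap 32 (MAX_MARKS)
    undo_stack: Vec<Undo>,
    empty_count: usize,    // coords whose mask is 0
}

impl FastGameSketch {
    fn set_mask(&mut self, idx: usize, new: u32) {
        let old = self.candidates[idx];
        self.empty_count += (new == 0) as usize;
        self.empty_count -= (old == 0) as usize;
        self.candidates[idx] = new;
    }

    fn eliminate(&mut self, idx: usize, bit: u32) {
        if self.candidates[idx] & bit == 0 { return; }
        self.set_mask(idx, self.candidates[idx] & !bit); // mask & !bit
        self.undo_stack.push(Undo::AddBack { idx, bit });
    }

    fn commit(&mut self, idx: usize, bit: u32) {
        let pre = self.candidates[idx];
        self.set_mask(idx, bit); // collapse the domain to a single mark
        self.undo_stack.push(Undo::Restore { idx, mask: pre });
    }

    fn undo(&mut self) {
        match self.undo_stack.pop() {
            Some(Undo::AddBack { idx, bit }) => self.set_mask(idx, self.candidates[idx] | bit), // mask | bit
            Some(Undo::Restore { idx, mask }) => self.set_mask(idx, mask),
            None => {}
        }
    }

    fn any_domain_empty(&self) -> bool { self.empty_count > 0 } // lets status() short-circuit in O(1)
}
```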
Propagator<'g, 'p> borrows &'g mut FastGame<'p> and owns
every incremental cache. Its constructor builds coord_constraints,
region_indices, trial_order, the per-counter inverse-reach
maps, the lifted CellMin rule table, the per-constraint
aggregates, and the per-cell counters — all from the game’s current
candidate masks. After that, the propagator is the sole driver of
apply / undo: each call flips the game’s bits, then walks the
incremental updates against the (pre, post) mask delta.
The propagator keeps a parallel transition_log so undo can
reverse cache deltas without inspecting the game’s undo stack —
two stacks, kept in sync by the rule “only the propagator drives
moves during a solve.” Anyone holding a &mut FastGame (player
UI, parser tests, future slow grader) can still apply moves; they
just don’t get the cache.
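The two-stack discipline, sketched against the FastGameSketch above. The aggregate arithmetic here is only indicative (the real deltas depend on each rule's primary mark and the full (pre, post) mask comparison); what the sketch shows is that apply pushes a batch of cache deltas onto transition_log and undo pops both stacks in lockstep.

```rust
// Sketch reusing FastGameSketch from the previous snippet. Field names mirror
// the report; the delta arithmetic is indicative, not the real aggregate update.
struct CacheDelta { constraint: usize, committed: i32, possible: i32 }

struct PropagatorSketch<'g> {
    game: &'g mut FastGameSketch,
    coord_constraints: Vec<Vec<usize>>,   // constraints touching each coord
    primary_bit: Vec<u32>,                // per constraint: primary mark as a bit (0 = none)
    committed_with_mark: Vec<i32>,        // per-constraint aggregates
    possible_with_mark: Vec<i32>,
    transition_log: Vec<Vec<CacheDelta>>, // one batch of cache deltas per applied move
}

impl<'g> PropagatorSketch<'g> {
    fn apply_commit(&mut self, idx: usize, bit: u32) {
        let pre = self.game.candidates[idx];
        self.game.commit(idx, bit);
        let post = self.game.candidates[idx];

        // Walk the (pre, post) mask delta for every constraint touching this coord.
        let touched = self.coord_constraints[idx].clone();
        let mut deltas = Vec::new();
        for c in touched {
            let m = self.primary_bit[c];
            if m == 0 { continue; } // no primary mark: no-op, the generic scan handles the rule
            let committed = (post == m) as i32 - (pre == m) as i32;        // committed to the primary mark
            let possible = (post & m != 0) as i32 - (pre & m != 0) as i32; // can still take the primary mark
            self.committed_with_mark[c] += committed;
            self.possible_with_mark[c] += possible;
            deltas.push(CacheDelta { constraint: c, committed, possible });
        }
        self.transition_log.push(deltas);
    }

    fn undo(&mut self) {
        // Two stacks in sync: reverse the cache deltas, then the game's bit-level move.
        if let Some(deltas) = self.transition_log.pop() {
            for d in deltas {
                self.committed_with_mark[d.constraint] -= d.committed;
                self.possible_with_mark[d.constraint] -= d.possible;
            }
        }
        self.game.undo();
    }
}
```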
3. State that makes propagation cheap
The hot loops touch these structures:
- Bitmask candidates. Vec<u32> per-coord, one bit per mark in the puzzle’s universe (cap 32 marks, see MAX_MARKS). Apply / undo become mask & !bit / mask | bit. Iteration is a BitIter on the mask. Replaces the original Vec<Vec<Mark>>.
- Static region indices. Each constraint’s region is resolved once at construction to a Vec<usize> of dense coord indices. Removes the per-eval coord → idx HashMap lookup.
- First-class LOS regions. Puzzle::los_blocks lets the generic region walker handle Los, LosCross, and Neighbors4 natively — the genre doesn’t reimplement ray walking.
- Per-constraint aggregates. Each constraint caches committed_with_mark and possible_with_mark for its primary mark. apply/undo keep these in sync; eval_constraint_at reads them in O(1) instead of rescanning the region. Rules without a primary mark (Decided, Distinct, NoTwoByTwo) carry a zero target mask — the update is a no-op and the generic scan handles them.
- Per-cell counters with a dirty-cell queue. CellMin (the Akari “lit at least once” rule) is lifted out of the AC-3 worklist and onto a dedicated coord-indexed counter with a dirty_cells Vec. A 100×100 Akari has thousands of CellMin rules; routing them through AC-3 would put thousands of constraints in the worklist. Routing through the counter + dirty-cell queue costs a single push per affected coord.
- trial_order. Coords sorted by descending degree (regular constraints + CellMin rules touching them). Drives the trial-1 sweep order and DFS branch ordering — the branch coord is the first one in trial_order with two or more candidates.
- FxHashMap for coord_index and mark_to_bit. Cheap hashing replacing the std SipHash default; not visible in any one profile but compounds across construction.
- Undo log of bit-level deltas. AddBack { idx, bit } and Restore { idx, mask } entries. apply pushes; undo pops.
- Reusable Vec<Move> buffer. forced_moves_at writes into the caller’s buffer instead of returning a fresh allocation per call. Same trick for cell_min_forced_moves_at.
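To make the per-cell counter concrete, a hedged sketch of the dirty-cell path for CellMin, under the assumption that the counter tracks how many coords could still light each floor cell via the inverse-reach maps; names and the forced-move payload are illustrative, and a real implementation would also skip cells that are already lit.

```rust
// Hedged sketch: assumes each floor cell counts its remaining possible
// lighters, kept up to date via an inverse-reach map. Names are illustrative;
// cells already satisfied by a committed bulb are ignored here.
struct CellMinSketch {
    possible_lighters: Vec<u32>,    // per cell: coords that could still light it
    inverse_reach: Vec<Vec<usize>>, // per coord: cells it would light with a bulb
    dirty_cells: Vec<usize>,
    in_dirty: Vec<bool>,            // flat membership flag, same trick as the AC-3 worklist
}

impl CellMinSketch {
    // Called when `coord` can no longer hold a bulb (its bulb bit was eliminated):
    // one decrement plus at most one queue push per affected cell, no AC-3 traffic.
    fn on_lighter_removed(&mut self, coord: usize) {
        let CellMinSketch { possible_lighters, inverse_reach, dirty_cells, in_dirty } = self;
        for &cell in &inverse_reach[coord] {
            possible_lighters[cell] -= 1;
            if !in_dirty[cell] {
                in_dirty[cell] = true;
                dirty_cells.push(cell);
            }
        }
    }

    // Drain loop: 0 remaining lighters is a contradiction, 1 is a forced move
    // (a real impl would look up which coord is the lone remaining witness).
    fn drain(&mut self, forced: &mut Vec<usize>) -> bool {
        while let Some(cell) = self.dirty_cells.pop() {
            self.in_dirty[cell] = false;
            match self.possible_lighters[cell] {
                0 => return false,
                1 => forced.push(cell),
                _ => {}
            }
        }
        true
    }
}
```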
Every one of those came out of a profiling round (mostly samply call-tree inspection, plus a couple of pprof FlameGraph runs) that showed one specific hot frame worth chasing.
4. What worked, in order
Headline numbers below are wall-clock medians from the solve_one
example, plus cargo bench summaries. Where two numbers appear,
the left is “before the change” and the right is “after.”
Compile profile. lto = "fat" and codegen-units = 1 in the workspace release profile. Modest but free win.
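In manifest form (the standard placement is the release profile at the workspace root):

```toml
[profile.release]
lto = "fat"
codegen-units = 1
```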
Profiling instrumentation.
- Optional --pprof flag on the CLI for FlameGraph output.
- Switched to samply for higher-fidelity call-tree timing on macOS.
- Throttled Instant::now() sampling inside the propagation loop to 1-in-1000 calls; the original per-iter timestamping was measurably visible in the profile.
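A sketch of the throttle, assuming the timestamps feed a wall-clock budget check (the report does not say what the timing is used for); the loop body and the worklist type are placeholders.

```rust
use std::time::{Duration, Instant};

// Placeholder loop: only every 1000th iteration pays for Instant::now().
// Whether the timing feeds a budget, a stats counter, or a trace is an
// assumption; the 1-in-1000 throttle pattern is the point.
fn drain_with_budget(worklist: &mut Vec<usize>, budget: Duration) -> bool {
    let start = Instant::now();
    let mut iters: u64 = 0;
    while let Some(_constraint) = worklist.pop() {
        // ... revise the constraint, enqueue affected neighbours ...
        iters += 1;
        if iters % 1000 == 0 && start.elapsed() > budget {
            return false; // budget exhausted; let the caller decide
        }
    }
    true
}
```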
Data-structure swaps.
- HashSet<usize> → Vec<bool> for the AC-3 in_worklist and the dirty-cell in_dirty flag. Hashing was a profile hot spot.
- std::collections::HashMap → FxHashMap (rustc-hash) for small-key tables.
- Region indices materialized once at construction (Tier 1).
- LOS / LosCross / Neighbors4 handled by the generic walker via los_blocks (Tier 2).
- Per-cell counters with a dirty-cell queue (Tier 3) — replaces a per-coord linear scan that the original AtLeastOne path did.
Bitmask candidates. Per-coord u32 instead of Vec<Mark>. This was the biggest single mid-stage win on dense puzzles: the inner candidate loops collapse to a popcount + bit test, and apply/undo become two ops.
status() aggregate-driven evaluation. The “is the puzzle
solved / contradicted / in progress?” check was rescanning every
constraint. Reading the cached aggregates first short-circuits the
common cases. (The variant that also maintained per-constraint
“unsatisfied” / “starved” flags was tried and reverted — see §5.)
MRV trial ordering. Sorting trial-1’s coord walk by descending
degree (trial_order) was the largest single win on the grader.
On 100×100: ~31 s → ~2 s once trial-1 was MRV-ordered. The
intuition is that high-degree cells are the ones whose hypothesis
collapses the search space fastest; they surface forced
eliminations sooner, so trial-1 returns sooner with progress.
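Construction is the obvious sort; a small sketch, with the degree of each coord assumed precomputed as the number of regular constraints plus CellMin rules touching it:

```rust
// Sketch: degree[i] = regular constraints + CellMin rules touching coord i.
// trial_order is the index set sorted descending; the branch coord is the
// first entry that still has two or more candidates.
fn trial_order(degree: &[usize]) -> Vec<usize> {
    let mut order: Vec<usize> = (0..degree.len()).collect();
    order.sort_by(|&a, &b| degree[b].cmp(&degree[a])); // descending degree
    order
}
```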
Intra-sweep propagation in trial-1. When trial-1 finds an elimination, drain forward propagation in place before moving to the next coord, instead of returning to the outer loop. This avoids the round-trip “trial-1 returns → outer re-runs forward → trial-1 restarts at coord 0.” On 100×100: ~2 s → ~280 ms after this single change. The cost of the inline drain is dwarfed by the work it removes.
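A hedged sketch of the sweep with the inline drain; the trait methods mirror the DFS sketch in §1, plus two hypothetical extras (eliminate, and a propagate that reports contradiction as a bool).

```rust
// Hedged sketch of trial-1 with the intra-sweep cascade. Method names are
// stand-ins for the real Propagator API; propagate() returns false on contradiction.
trait Trial1 {
    fn trial_order(&self) -> Vec<usize>;          // degree-descending coords
    fn candidates(&self, coord: usize) -> Vec<u32>;
    fn applied_count(&self) -> usize;
    fn undo_to(&mut self, count: usize);
    fn apply(&mut self, coord: usize, mark_bit: u32);
    fn eliminate(&mut self, coord: usize, mark_bit: u32);
    fn propagate(&mut self) -> bool;              // false = contradiction
}

fn trial1_sweep(p: &mut impl Trial1) -> bool {
    for coord in p.trial_order() {
        for bit in p.candidates(coord) {          // snapshot of the domain; fine for a sketch
            let mark = p.applied_count();
            p.apply(coord, bit);                  // hypothesize this candidate
            let contradiction = !p.propagate();
            p.undo_to(mark);                      // retract the hypothesis either way
            if contradiction {
                p.eliminate(coord, bit);          // forced elimination, applied immediately
                if !p.propagate() {               // inline forward-propagation drain
                    return false;                 // the elimination itself contradicts
                }
                // Keep sweeping from here instead of returning to the outer loop
                // and restarting at coord 0 — that round-trip was the cost removed.
            }
        }
    }
    true
}
```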
MRV branching + trial-1 inside the solver. Lifting the
trial-1 sweep and trial_order-driven branch pick out of the
grader and into the solver was the last big win:
| size | solver before | solver now | grader |
|---|---|---|---|
| 50×50 | ~2 s | 3.7 ms | 4.5 ms |
| 100×100 | (didn’t terminate cleanly) | 43 ms | 64 ms |
| 17×17 corpus (criterion) | — | ~30% faster on naive | — |
The solver now beats the grader on big puzzles, which makes sense: the solver stops at the first solution, while the grader records every step.
5. What we tried and reverted
status() with cached cell_min_unsatisfied / starved
counters. Idea: maintain bookkeeping so status() is O(1) by
counting how many CellMin rules are currently unsatisfied or
starved. Result: helped 100×100 by ~12% but regressed 50×50 by
~10% and the dataset by ~11%. The per-apply maintenance cost
swallowed the per-status() win on workloads where status() isn’t
actually the bottleneck. Reverted. The cached aggregate
fast-path remained; only the extra counters were dropped.
Batch trial-1 eliminations across coords without intervening propagation. Idea: skip the inline propagation drain of §4 and instead apply every elimination found across the full sweep at once. Result: 5–50× slowdown. The reason is subtle: forward propagation between trials was doing a lot of cheap deduction. If you batch, those cheap deductions surface as additional trial-1 hypotheses on the next sweep — paying for them at trial-1 prices instead of forward-propagation prices. Reverted.
Per-floor AtLeastOne constraints. Original encoding for
Akari’s “every floor must be lit” rule. Each floor became its own
AtLeastOne constraint, which means a 100×100 puzzle pushed
thousands of constraint visits through AC-3 per cycle. Replaced by
a single CellMin rule routed through the dirty-cell queue.
Several construct.rs constraint-count tests broke and were updated
to the new counts.
Janko parser without _ for holes. Bug discovered during a
round-trip test: to_janko writes _ for holes, but parse only
accepted - and . as hole characters. Fixed by adding _ to the hole branch.
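The shape of the fix, as an illustration only (the real parser and its token handling live in the Janko module; is_hole_token is a hypothetical name):

```rust
// Hypothetical helper illustrating the fix: `_` joins `-` and `.` as an
// accepted hole token, matching what to_janko writes.
fn is_hole_token(tok: &str) -> bool {
    matches!(tok, "-" | "." | "_")
}
```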
6. Things we discussed but haven’t done
Generic clue-cover and lit-witness propagators for Akari.
Open question: is there a clean way to express these as
genre-supplied propagation rules instead of baking them into the
generic constraint walker? Today the only Akari-specific
propagation is the CellMin rule; clue-cover (a clue with exactly
as many open neighbors as bulbs it still needs is fully determined) and
lit-witness (the only remaining floor that can light a forced cell
must be a bulb) are subsumed by trial-1 today, but a direct
propagator would cut the trial-1 cost.
MAC / GAC / AC-2001. Discussed as standard upgrades to the AC-3 worklist. The user leaned toward “make the naive solver as fast as possible” instead — the trial-1 + MRV + intra-sweep work that followed delivered most of the wins these would have. AC-2001 in particular is the natural next step if propagation throughput becomes the bottleneck again.
Trial-2 / trial-N. Higher-order trial search (commit two hypotheses, look for contradictions). Cost goes up combinatorially, and it is most useful on harder puzzles where trial-1 stalls. Not yet needed — trial-1 + branching solves every dataset puzzle in the budget.
Smarter branch heuristics. Today the branch coord is the first
trial_order entry with ≥ 2 candidates. Possible refinements:
domain-size tiebreak (degenerate for binary, useful for richer
domains), least-constraining-value ordering on the candidate side,
random restarts. None implemented.
Refactor the grader to be pure human techniques. The
optimization machinery now lives in Propagator, separated from
FastGame — that work is done. The remaining piece is teaching
the grader to drive a different engine: instead of running AC-3
to fixpoint, it should fire one human technique at a time,
recording each step under its proper name. That refactor is the
next planned change.
Parallelism. rayon is a feature flag and is wired into a
couple of construction paths but not the solve loop itself. Per-
puzzle parallelism (across the corpus) is the obvious win; per-
solve parallelism (multiple branches concurrently) is a much
bigger lift and not on the table.
WASM and the editor surface. The editor (ordo/edit) is
supposed to consume the same core through WASM. The solver compiles
to WASM today but no measurements have been taken — we don’t know
where the slow path is in the browser yet.