Fast solver
A report on the architecture of the FastSolver (penmark-core),
the optimizations that landed, the ones we tried and reverted, and
ideas we discussed but haven’t pursued.
The engine lives under core/src/algorithm/ in two families:
- algorithm/fast/ — the production family.
  - game.rs — FastGame, the bitmask-backed CSP state (per-coord candidate bitmasks + apply/undo + slow status()).
  - propagator.rs — Propagator, the speed: every incremental cache (per-constraint aggregates, per-cell counters, dirty queue, region indices, trial order) plus the forced_moves_at / eval_constraint_at machinery.
  - solver.rs — FastSolver, the search loop. Drives a Propagator wrapping a FastGame.
- algorithm/simple/ — the brute-force reference family.
  - game.rs — SimpleGame, a HashMap<Coord, Vec<Mark>>-backed minimal Game impl. No bitmask, no caches.
  - solver.rs — SimpleSolver, plain DFS over SimpleGame. The correctness baseline — every other solver should beat it badly.
- algorithm/geometry.rs — region/reach helpers shared between the two families.
- algorithm/eval.rs — generic constraint evaluator with both a fast eval_constraint (assumes pre-resolved regions) and a slow eval_constraint_full (handles LosCross / Neighbors4 / CellMin by walking the puzzle on the fly).
- algorithm/grader.rs — StandardGrader, the human-techniques grader. Drives a Propagator (same engine as FastSolver) and fires named techniques in priority order. Techniques are generic, keyed off Rule kind + cached aggregate state: exact-count-zero, exact-count-saturated, exact-count-forced, at-most-saturated, at-least-one-witness, cell-min-witness, plus trial-1 as backward escalation. Any genre using these rule kinds gets the techniques for free; genres needing bespoke patterns can append their own Technique impls.
1. What the solver does
At every DFS node:
- Forward propagation to fixpoint. AC-3-style worklist over constraints, plus a parallel dirty-cell drain for CellMin-style counter rules. Continues until both queues are empty.
- Trial-1 sweep with intra-sweep cascade. Walk open coords in degree-descending order (game.trial_order()). At each coord, hypothesize every candidate; any commit that propagates to a contradiction is recorded as a forced elimination. Eliminations are applied immediately and followed by an inline forward-propagation drain, before moving to the next coord.
- Branch. If still InProgress, pick the first trial_order-ranked open coord and branch over its candidates. For binary domains (Akari), “smallest domain” MRV is degenerate (every open cell has 2 candidates) — degree is the heuristic that actually shrinks the search tree.
The undo log inside FastGame records every applied move; on exit
from each DFS frame we pop exactly the count of moves applied at
this level.
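The shape of one node, as a minimal sketch. The Prop trait and its method names (propagate_to_fixpoint, trial1_sweep, applied_count, undo_to) are hypothetical stand-ins, not the real Propagator API; the point is the frame-level undo accounting.

```rust
// Hedged sketch of one DFS node. The trait and method names are hypothetical
// stand-ins for the real Propagator API; applied_count/undo_to model the
// "pop exactly the moves applied at this level" discipline.
#[derive(Clone, Copy, PartialEq)]
enum Status { Solved, Contradiction, InProgress }

trait Prop {
    fn propagate_to_fixpoint(&mut self) -> Status; // AC-3 worklist + dirty-cell drain
    fn trial1_sweep(&mut self) -> Status;          // degree-ordered, cascades inline
    fn branch_coord(&self) -> Option<usize>;       // first trial_order coord with >= 2 candidates
    fn candidates(&self, coord: usize) -> Vec<u32>;
    fn applied_count(&self) -> usize;              // current length of the undo log
    fn apply(&mut self, coord: usize, mark_bit: u32);
    fn undo_to(&mut self, count: usize);           // pop moves until the log is back to `count`
}

fn dfs(p: &mut impl Prop) -> bool {
    let frame = p.applied_count(); // everything applied below belongs to this frame
    let mut status = p.propagate_to_fixpoint();
    if status == Status::InProgress {
        status = p.trial1_sweep();
    }
    match status {
        Status::Solved => return true,
        Status::Contradiction => { p.undo_to(frame); return false; }
        Status::InProgress => {}
    }
    let coord = match p.branch_coord() {
        Some(c) => c,
        None => { p.undo_to(frame); return false; }
    };
    let before_branch = p.applied_count(); // this frame's deductions survive across candidates
    for mark_bit in p.candidates(coord) {
        p.apply(coord, mark_bit);
        if dfs(p) { return true; } // leave the solution's moves applied
        p.undo_to(before_branch);  // the child undid its own frame; pop the branch move
    }
    p.undo_to(frame); // exhausted: pop exactly the moves applied at this level
    false
}
```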
The same three-phase shape lives in the grader, except:
- the grader runs in an outer loop instead of recursing;
- the grader keeps a step trace tagged by RuleKind;
- the grader never branches (trial-1 is its terminal technique).
The shared propagation routines are intentionally close to literal copies — letting them diverge for tiny gains would cost more than it’s worth.
2. The FastGame / Propagator split
FastGame is intentionally minimal: per-coord candidate bitmasks,
the coord_index / mark_to_bit lookups, the undo_stack, and an
empty_count counter that lets status() short-circuit in O(1) when
no domain is empty. Apply and undo flip the bit, push/pop the
undo log, and maintain empty_count. That’s it. status() walks all
constraints with a slow per-region resolver — fine for tests and
the brute-force Simple solver, slow on the hot path.
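A stripped-down sketch of that state, assuming nothing beyond what is described here: u32 masks, an undo log of AddBack/Restore deltas, and an empty_count maintained on every mask write. The real FastGame holds more (coord_index, mark_to_bit, the puzzle itself); the field shapes below are illustrative.

```rust
// Illustrative sketch only: the real FastGame also carries coord_index,
// mark_to_bit, and the puzzle. This models the bit-flip + undo-log +
// empty_count core described above.
enum Undo {
    AddBack { idx: usize, bit: u32 },  // reverse an elimination: re-add the bit
    Restore { idx: usize, mask: u32 }, // reverse a commit: restore the full pre-move mask
}

struct FastGameSketch {
    candidates: Vec<u32>,  // one bit per mark, cap 32 (MAX_MARKS)
    undo_stack: Vec<Undo>,
    empty_count: usize,    // coords whose mask is 0
}

impl FastGameSketch {
    fn set_mask(&mut self, idx: usize, new: u32) {
        let old = self.candidates[idx];
        self.empty_count += (new == 0) as usize;
        self.empty_count -= (old == 0) as usize;
        self.candidates[idx] = new;
    }

    fn eliminate(&mut self, idx: usize, bit: u32) {
        if self.candidates[idx] & bit == 0 { return; }
        self.set_mask(idx, self.candidates[idx] & !bit); // mask & !bit
        self.undo_stack.push(Undo::AddBack { idx, bit });
    }

    fn commit(&mut self, idx: usize, bit: u32) {
        let pre = self.candidates[idx];
        self.set_mask(idx, bit); // collapse the domain to a single mark
        self.undo_stack.push(Undo::Restore { idx, mask: pre });
    }

    fn undo(&mut self) {
        match self.undo_stack.pop() {
            Some(Undo::AddBack { idx, bit }) => self.set_mask(idx, self.candidates[idx] | bit), // mask | bit
            Some(Undo::Restore { idx, mask }) => self.set_mask(idx, mask),
            None => {}
        }
    }

    fn any_domain_empty(&self) -> bool { self.empty_count > 0 } // lets status() short-circuit in O(1)
}
```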
Propagator<'g, 'p> borrows &'g mut FastGame<'p> and owns
every incremental cache. Its constructor builds coord_constraints,
region_indices, trial_order, the per-counter inverse-reach
maps, the lifted CellMin rule table, the per-constraint
aggregates, and the per-cell counters — all from the game’s current
candidate masks. After that, the propagator is the sole driver of
apply / undo: each call flips the game’s bits, then walks the
incremental updates against the (pre, post) mask delta.
The propagator keeps a parallel transition_log so undo can
reverse cache deltas without inspecting the game’s undo stack —
two stacks, kept in sync by the rule “only the propagator drives
moves during a solve.” Anyone holding a &mut FastGame (player
UI, parser tests, future slow grader) can still apply moves; they
just don’t get the cache.
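The two-stack discipline, sketched against the FastGameSketch above. The aggregate arithmetic here is only indicative (the real deltas depend on each rule's primary mark and the full (pre, post) mask comparison); what the sketch shows is that apply pushes a batch of cache deltas onto transition_log and undo pops both stacks in lockstep.

```rust
// Sketch reusing FastGameSketch from the previous snippet. Field names mirror
// the report; the delta arithmetic is indicative, not the real aggregate update.
struct CacheDelta { constraint: usize, committed: i32, possible: i32 }

struct PropagatorSketch<'g> {
    game: &'g mut FastGameSketch,
    coord_constraints: Vec<Vec<usize>>,   // constraints touching each coord
    primary_bit: Vec<u32>,                // per constraint: primary mark as a bit (0 = none)
    committed_with_mark: Vec<i32>,        // per-constraint aggregates
    possible_with_mark: Vec<i32>,
    transition_log: Vec<Vec<CacheDelta>>, // one batch of cache deltas per applied move
}

impl<'g> PropagatorSketch<'g> {
    fn apply_commit(&mut self, idx: usize, bit: u32) {
        let pre = self.game.candidates[idx];
        self.game.commit(idx, bit);
        let post = self.game.candidates[idx];

        // Walk the (pre, post) mask delta for every constraint touching this coord.
        let touched = self.coord_constraints[idx].clone();
        let mut deltas = Vec::new();
        for c in touched {
            let m = self.primary_bit[c];
            if m == 0 { continue; } // no primary mark: no-op, the generic scan handles the rule
            let committed = (post == m) as i32 - (pre == m) as i32;        // committed to the primary mark
            let possible = (post & m != 0) as i32 - (pre & m != 0) as i32; // can still take the primary mark
            self.committed_with_mark[c] += committed;
            self.possible_with_mark[c] += possible;
            deltas.push(CacheDelta { constraint: c, committed, possible });
        }
        self.transition_log.push(deltas);
    }

    fn undo(&mut self) {
        // Two stacks in sync: reverse the cache deltas, then the game's bit-level move.
        if let Some(deltas) = self.transition_log.pop() {
            for d in deltas {
                self.committed_with_mark[d.constraint] -= d.committed;
                self.possible_with_mark[d.constraint] -= d.possible;
            }
        }
        self.game.undo();
    }
}
```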
3. State that makes propagation cheap
The hot loops touch these structures:
- Bitmask candidates. Vec<u32> per-coord, one bit per mark in the puzzle’s universe (cap 32 marks, see MAX_MARKS). Apply / undo become mask & !bit / mask | bit. Iteration is a BitIter on the mask. Replaces the original Vec<Vec<Mark>>.
- Static region indices. Each constraint’s region is resolved once at construction to a Vec<usize> of dense coord indices. Removes the per-eval coord → idx HashMap lookup.
- First-class LOS regions. Puzzle::los_blocks lets the generic region walker handle Los, LosCross, and Neighbors4 natively — the genre doesn’t reimplement ray walking.
- Per-constraint aggregates. Each constraint caches committed_with_mark and possible_with_mark for its primary mark. apply/undo keep these in sync; eval_constraint_at reads them in O(1) instead of rescanning the region. Rules without a primary mark (Decided, Distinct, NoTwoByTwo) carry a zero target mask — the update is a no-op and the generic scan handles them.
- Per-cell counters with a dirty-cell queue. CellMin (the Akari “lit at least once” rule) is lifted out of the AC-3 worklist and onto a dedicated coord-indexed counter with a dirty_cells Vec. A 100×100 Akari has thousands of CellMin rules; routing them through AC-3 would put thousands of constraints in the worklist. Routing through the counter + dirty-cell queue costs a single push per affected coord.
- trial_order. Coords sorted by descending degree (regular constraints + CellMin rules touching them). Drives the trial-1 sweep order and DFS branch ordering — the branch coord is the first one in trial_order with two or more candidates.
- FxHashMap for coord_index and mark_to_bit. Cheap hashing replacing the std SipHash default; not visible in any one profile but compounds across construction.
- Undo log of bit-level deltas. AddBack { idx, bit } and Restore { idx, mask } entries. apply pushes; undo pops.
- Reusable Vec<Move> buffer. forced_moves_at writes into the caller’s buffer instead of returning a fresh allocation per call. Same trick for cell_min_forced_moves_at.
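To make the per-cell counter concrete, a hedged sketch of the dirty-cell path for CellMin, under the assumption that the counter tracks how many coords could still light each floor cell via the inverse-reach maps; names and the forced-move payload are illustrative, and a real implementation would also skip cells that are already lit.

```rust
// Hedged sketch: assumes each floor cell counts its remaining possible
// lighters, kept up to date via an inverse-reach map. Names are illustrative;
// cells already satisfied by a committed bulb are ignored here.
struct CellMinSketch {
    possible_lighters: Vec<u32>,    // per cell: coords that could still light it
    inverse_reach: Vec<Vec<usize>>, // per coord: cells it would light with a bulb
    dirty_cells: Vec<usize>,
    in_dirty: Vec<bool>,            // flat membership flag, same trick as the AC-3 worklist
}

impl CellMinSketch {
    // Called when `coord` can no longer hold a bulb (its bulb bit was eliminated):
    // one decrement plus at most one queue push per affected cell, no AC-3 traffic.
    fn on_lighter_removed(&mut self, coord: usize) {
        let CellMinSketch { possible_lighters, inverse_reach, dirty_cells, in_dirty } = self;
        for &cell in &inverse_reach[coord] {
            possible_lighters[cell] -= 1;
            if !in_dirty[cell] {
                in_dirty[cell] = true;
                dirty_cells.push(cell);
            }
        }
    }

    // Drain loop: 0 remaining lighters is a contradiction, 1 is a forced move
    // (a real impl would look up which coord is the lone remaining witness).
    fn drain(&mut self, forced: &mut Vec<usize>) -> bool {
        while let Some(cell) = self.dirty_cells.pop() {
            self.in_dirty[cell] = false;
            match self.possible_lighters[cell] {
                0 => return false,
                1 => forced.push(cell),
                _ => {}
            }
        }
        true
    }
}
```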
Every one of those came out of a profiling round (mostly samply call-tree inspection, plus a couple of pprof FlameGraph runs) that showed one specific hot frame worth chasing.
4. What worked, in order
Headline numbers below are wall-clock medians from the solve_one
example, plus cargo bench summaries. Where two numbers appear,
the left is “before the change” and the right is “after.”
Compile profile. lto = "fat" and codegen-units = 1 in the workspace release profile. Modest but free win.
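In manifest form (the standard placement is the release profile at the workspace root):

```toml
[profile.release]
lto = "fat"
codegen-units = 1
```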
Profiling instrumentation.
- Optional --pprof flag on the CLI for FlameGraph output.
- Switched to samply for higher-fidelity call-tree timing on macOS.
- Throttled Instant::now() sampling inside the propagation loop to 1-in-1000 calls; the original per-iter timestamping was measurably visible in the profile.
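A sketch of the throttle, assuming the timestamps feed a wall-clock budget check (the report does not say what the timing is used for); the loop body and the worklist type are placeholders.

```rust
use std::time::{Duration, Instant};

// Placeholder loop: only every 1000th iteration pays for Instant::now().
// Whether the timing feeds a budget, a stats counter, or a trace is an
// assumption; the 1-in-1000 throttle pattern is the point.
fn drain_with_budget(worklist: &mut Vec<usize>, budget: Duration) -> bool {
    let start = Instant::now();
    let mut iters: u64 = 0;
    while let Some(_constraint) = worklist.pop() {
        // ... revise the constraint, enqueue affected neighbours ...
        iters += 1;
        if iters % 1000 == 0 && start.elapsed() > budget {
            return false; // budget exhausted; let the caller decide
        }
    }
    true
}
```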
Data-structure swaps.
- HashSet<usize> → Vec<bool> for the AC-3 in_worklist and the dirty-cell in_dirty flag. Hashing was a profile hot spot.
- std::collections::HashMap → FxHashMap (rustc-hash) for small-key tables.
- Region indices materialized once at construction (Tier 1).
- LOS / LosCross / Neighbors4 handled by the generic walker via los_blocks (Tier 2).
- Per-cell counters with a dirty-cell queue (Tier 3) — replaces a per-coord linear scan that the original AtLeastOne path did.
Bitmask candidates. Per-coord u32 instead of Vec<Mark>. This was the biggest single mid-stage win on dense puzzles: the inner candidate loops collapse to a popcount + bit test, and apply/undo become two ops.
status() aggregate-driven evaluation. The “is the puzzle
solved / contradicted / in progress?” check was rescanning every
constraint. Reading the cached aggregates first short-circuits the
common cases. (The variant that also maintained per-constraint
“unsatisfied” / “starved” flags was tried and reverted — see §5.)
MRV trial ordering. Sorting trial-1’s coord walk by descending
degree (trial_order) was the largest single win on the grader.
On 100×100: ~31 s → ~2 s once trial-1 was MRV-ordered. The
intuition is that high-degree cells are the ones whose hypothesis
collapses the search space fastest; they surface forced
eliminations sooner, so trial-1 returns sooner with progress.
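Construction is the obvious sort; a small sketch, with the degree of each coord assumed precomputed as the number of regular constraints plus CellMin rules touching it:

```rust
// Sketch: degree[i] = regular constraints + CellMin rules touching coord i.
// trial_order is the index set sorted descending; the branch coord is the
// first entry that still has two or more candidates.
fn trial_order(degree: &[usize]) -> Vec<usize> {
    let mut order: Vec<usize> = (0..degree.len()).collect();
    order.sort_by(|&a, &b| degree[b].cmp(&degree[a])); // descending degree
    order
}
```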
Intra-sweep propagation in trial-1. When trial-1 finds an elimination, drain forward propagation in place before moving to the next coord, instead of returning to the outer loop. This avoids the round-trip “trial-1 returns → outer re-runs forward → trial-1 restarts at coord 0.” On 100×100: ~2 s → ~280 ms after this single change. The cost of the inline drain is dwarfed by the work it removes.
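A hedged sketch of the sweep with the inline drain; the trait methods mirror the DFS sketch in §1, plus two hypothetical extras (eliminate, and a propagate that reports contradiction as a bool).

```rust
// Hedged sketch of trial-1 with the intra-sweep cascade. Method names are
// stand-ins for the real Propagator API; propagate() returns false on contradiction.
trait Trial1 {
    fn trial_order(&self) -> Vec<usize>;          // degree-descending coords
    fn candidates(&self, coord: usize) -> Vec<u32>;
    fn applied_count(&self) -> usize;
    fn undo_to(&mut self, count: usize);
    fn apply(&mut self, coord: usize, mark_bit: u32);
    fn eliminate(&mut self, coord: usize, mark_bit: u32);
    fn propagate(&mut self) -> bool;              // false = contradiction
}

fn trial1_sweep(p: &mut impl Trial1) -> bool {
    for coord in p.trial_order() {
        for bit in p.candidates(coord) {          // snapshot of the domain; fine for a sketch
            let mark = p.applied_count();
            p.apply(coord, bit);                  // hypothesize this candidate
            let contradiction = !p.propagate();
            p.undo_to(mark);                      // retract the hypothesis either way
            if contradiction {
                p.eliminate(coord, bit);          // forced elimination, applied immediately
                if !p.propagate() {               // inline forward-propagation drain
                    return false;                 // the elimination itself contradicts
                }
                // Keep sweeping from here instead of returning to the outer loop
                // and restarting at coord 0 — that round-trip was the cost removed.
            }
        }
    }
    true
}
```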
MRV branching + trial-1 inside the solver. Lifting the
trial-1 sweep and trial_order-driven branch pick out of the
grader and into the solver was the last big win:
| size | solver before | solver now | grader |
|---|---|---|---|
| 50×50 | ~2 s | 3.7 ms | 4.5 ms |
| 100×100 | (didn’t terminate cleanly) | 43 ms | 64 ms |
| 17×17 corpus (criterion) | — | ~30% faster on naive | — |
The solver now beats the grader on big puzzles, which makes sense: the solver stops at the first solution, while the grader records every step.
5. What we tried and reverted
status() with cached cell_min_unsatisfied / starved
counters. Idea: maintain bookkeeping so status() is O(1) by
counting how many CellMin rules are currently unsatisfied or
starved. Result: helped 100×100 by ~12% but regressed 50×50 by
~10% and the dataset by ~11%. The per-apply maintenance cost
swallowed the per-status() win on workloads where status() isn’t
actually the bottleneck. Reverted. The cached aggregate
fast-path remained; only the extra counters were dropped.
Batch trial-1 eliminations across coords without intervening propagation. Idea: skip the inline propagation drain of §4 and instead apply every elimination found across the full sweep at once. Result: 5–50× slowdown. The reason is subtle: forward propagation between trials was doing a lot of cheap deduction. If you batch, those cheap deductions surface as additional trial-1 hypotheses on the next sweep — paying for them at trial-1 prices instead of forward-propagation prices. Reverted.
Per-floor AtLeastOne constraints. Original encoding for
Akari’s “every floor must be lit” rule. Each floor became its own
AtLeastOne constraint, which means a 100×100 puzzle pushed
thousands of constraint visits through AC-3 per cycle. Replaced by
a single CellMin rule routed through the dirty-cell queue.
Several construct.rs constraint-count tests broke and were updated
to the new counts.
Janko parser without _ for holes. Bug discovered during a
round-trip test: to_janko writes _ for holes, but parse only
accepted - and . as hole characters. Fixed by adding _ to the hole branch.
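The shape of the fix, as an illustration only (the real parser and its token handling live in the Janko module; is_hole_token is a hypothetical name):

```rust
// Hypothetical helper illustrating the fix: `_` joins `-` and `.` as an
// accepted hole token, matching what to_janko writes.
fn is_hole_token(tok: &str) -> bool {
    matches!(tok, "-" | "." | "_")
}
```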
6. Things we discussed but haven’t done
Generic clue-cover and lit-witness propagators for Akari.
Open question: is there a clean way to express these as
genre-supplied propagation rules instead of baking them into the
generic constraint walker? Today the only Akari-specific
propagation is the CellMin rule; clue-cover (a clue with exactly
as many open neighbors as bulbs it still needs is fully determined) and
lit-witness (the only remaining floor that can light a forced cell
must be a bulb) are subsumed by trial-1 today, but a direct
propagator would cut the trial-1 cost.
MAC / GAC / AC-2001. Discussed as standard upgrades to the AC-3 worklist. The user leaned toward “make the naive solver as fast as possible” instead — the trial-1 + MRV + intra-sweep work that followed delivered most of the wins these would have. AC-2001 in particular is the natural next step if propagation throughput becomes the bottleneck again.
Trial-2 / trial-N. Higher-order trial search (commit two hypotheses, look for contradictions). Cost goes up combinatorially, and it is most useful on harder puzzles where trial-1 stalls. Not yet needed — trial-1 + branching solves every dataset puzzle in the budget.
Smarter branch heuristics. Today the branch coord is the first
trial_order entry with ≥ 2 candidates. Possible refinements:
domain-size tiebreak (degenerate for binary, useful for richer
domains), least-constraining-value ordering on the candidate side,
random restarts. None implemented.
Refactor the grader to be pure human techniques. The
optimization machinery now lives in Propagator, separated from
FastGame — that work is done. The remaining piece is teaching
the grader to drive a different engine: instead of running AC-3
to fixpoint, it should fire one human technique at a time,
recording each step under its proper name. That refactor is the
next planned change.
Parallelism. rayon is a feature flag and is wired into a
couple of construction paths but not the solve loop itself. Per-
puzzle parallelism (across the corpus) is the obvious win; per-
solve parallelism (multiple branches concurrently) is a much
bigger lift and not on the table.
WASM and the editor surface. The editor (ordo/edit) is
supposed to consume the same core through WASM. The solver compiles
to WASM today but no measurements have been taken — we don’t know
where the slow path is in the browser yet.