Puzzle dataset gathered by puzzlehound
The engine reads puzzles from per-genre canonical files —
penmark/data/puzzles/<genre>.jsonl.gz, one tagged JSON blob per
line in the format described in Tagged serialization and dispatch.
The per-line shape is whatever Puzzle::to_tagged_json() produces;
FastSolver, StandardGrader, the eframe dataset browser, the
benchmarks, and the test fixtures all consume this format and only
this format.
The per-genre file split is for query patterns and reasonable file
sizes, not for type discrimination — every line is self-describing,
so a stray Sudoku blob in the Akari file is rejected at read time
with a TagError::GenreMismatch, never silently decoded into the
wrong type.
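A minimal Python sketch of that read-time check (puzzlehound's side of the pipeline is Python). The tag layout assumed here — the genre name at record["genre"]["name"], mirroring the puzzle.genre.name field used by penmark import — is an assumption for illustration; the real per-line shape is whatever Puzzle::to_tagged_json() emits, and the real error is TagError::GenreMismatch on the Rust side.

```python
import gzip
import json


class GenreMismatch(ValueError):
    """A record's self-describing tag disagrees with its file's genre."""


def read_canonical(path, expected_genre):
    """Yield parsed records from a per-genre .jsonl.gz file, one JSON blob per line.

    Assumed tag layout: record["genre"]["name"] holds the genre tag.
    A stray blob of the wrong genre is rejected here, never silently
    decoded as the wrong type.
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            record = json.loads(line)
            tag = record["genre"]["name"]
            if tag != expected_genre:
                raise GenreMismatch(
                    f"{path}:{lineno}: expected {expected_genre!r}, got {tag!r}"
                )
            yield record
```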
Strict mode at the boundary: every field in a canonical record maps to a typed variant struct, or the record is rejected at write time. No opaque escape hatch — if it can’t be encoded, it isn’t stored. Downstream consumers never branch on “this field might not be there”.
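Strict mode can be sketched as a closed field set per genre: unknown fields and missing required fields both reject the record before it is written. The field names below (grid, clues, source) are invented for illustration — the real allowed set is defined by the typed variant structs on the Rust side.

```python
# Hypothetical field schema for one genre; the real typed variant
# structs live in penmark. Strict mode: no opaque passthrough — a
# record either maps entirely onto known fields or is rejected.
AKARI_FIELDS = {"genre", "grid", "clues", "source"}  # assumed, for illustration
AKARI_REQUIRED = {"genre", "grid"}


def validate_strict(record, allowed=AKARI_FIELDS, required=AKARI_REQUIRED):
    """Reject at write time any record with extra or missing fields."""
    extra = record.keys() - allowed
    missing = required - record.keys()
    if extra or missing:
        raise ValueError(
            f"rejected at write time: extra={sorted(extra)}, "
            f"missing={sorted(missing)}"
        )
    return record
```

Because nothing unvalidated is ever stored, downstream readers can index into every field unconditionally.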
Where the puzzles come from: puzzlehound
Penmark doesn’t fetch puzzles. puzzlehound does — a sibling
project at threeemojis/puzzlehound/ (Python, has its own README)
that discovers, fetches, and collates puzzles from registered
sources (logic-masters.de, Cracking the Cryptic spreadsheets,
gmpuzzles RSS, swaroopg92 Atom, tdoku benchmarks, …) into a form
penmark can consume directly.
penmark import reads puzzlehound’s collated output, parses each
record with Puzzle::from_tagged_json, validates puzzle.genre.name
against the target file’s genre, and appends as a tagged blob. From
there the dataset-reading verbs (solve, grade-canon,
profile-eval) take over. Adding a new upstream source is
puzzlehound’s problem; adding a new genre to canonicalize into is
penmark’s — one new entry in the GENRES slice + one new
data/puzzles/<genre>.jsonl.gz file.
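The import step — parse, validate the genre tag against the target file, append as a tagged blob — can be sketched as below. The function name and the record["genre"]["name"] tag layout are assumptions for illustration; the real parser is Puzzle::from_tagged_json on the Rust side.

```python
import gzip
import json


def import_records(collated_lines, target_path, target_genre):
    """Sketch of the penmark-import append step.

    Parses each collated record, checks its self-describing genre tag
    against the target file's genre, and appends it as one compact JSON
    blob per line. Appending to a .gz file produces a multi-member gzip
    stream, which the standard readers handle transparently.
    """
    appended = 0
    with gzip.open(target_path, "at", encoding="utf-8") as out:
        for line in collated_lines:
            record = json.loads(line)
            tag = record["genre"]["name"]  # assumed tag layout
            if tag != target_genre:
                raise ValueError(
                    f"genre mismatch: file is {target_genre!r}, record says {tag!r}"
                )
            out.write(json.dumps(record, separators=(",", ":")) + "\n")
            appended += 1
    return appended
```

From here the dataset-reading verbs see one uniform file: the importer never stores anything the read path would reject.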