Puzzle dataset gathered by puzzlehound
The engine reads puzzles from per-genre canonical files —
penmark/data/puzzles/<genre>.jsonl.gz, one tagged JSON blob per
line in the format described in Tagged serialization and dispatch.
The per-line shape is whatever Puzzle::to_tagged_json() produces;
FastSolver, StandardGrader, the eframe dataset browser, the
benchmarks, and the test fixtures all consume this format and only
this format.
The per-genre file split is for query patterns and reasonable file
sizes, not for type discrimination — every line is self-describing,
so a stray Sudoku blob in the Akari file is rejected at read time
with a TagError::GenreMismatch, never silently decoded into the
wrong type.
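A minimal Python sketch of that read-time check (puzzlehound's side of the pipeline is Python). The tag layout assumed here — the genre name at record["genre"]["name"], mirroring the puzzle.genre.name field used by penmark import — is an assumption for illustration; the real per-line shape is whatever Puzzle::to_tagged_json() emits, and the real error is TagError::GenreMismatch on the Rust side.

```python
import gzip
import json


class GenreMismatch(ValueError):
    """A record's self-describing tag disagrees with its file's genre."""


def read_canonical(path, expected_genre):
    """Yield parsed records from a per-genre .jsonl.gz file, one JSON blob per line.

    Assumed tag layout: record["genre"]["name"] holds the genre tag.
    A stray blob of the wrong genre is rejected here, never silently
    decoded as the wrong type.
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            record = json.loads(line)
            tag = record["genre"]["name"]
            if tag != expected_genre:
                raise GenreMismatch(
                    f"{path}:{lineno}: expected {expected_genre!r}, got {tag!r}"
                )
            yield record
```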
Strict mode at the boundary: every field in a canonical record maps to a typed variant struct, or the record is rejected at write time. No opaque escape hatch — if it can’t be encoded, it isn’t stored. Downstream consumers never branch on “this field might not be there”.
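Strict mode can be sketched as a closed field set per genre: unknown fields and missing required fields both reject the record before it is written. The field names below (grid, clues, source) are invented for illustration — the real allowed set is defined by the typed variant structs on the Rust side.

```python
# Hypothetical field schema for one genre; the real typed variant
# structs live in penmark. Strict mode: no opaque passthrough — a
# record either maps entirely onto known fields or is rejected.
AKARI_FIELDS = {"genre", "grid", "clues", "source"}  # assumed, for illustration
AKARI_REQUIRED = {"genre", "grid"}


def validate_strict(record, allowed=AKARI_FIELDS, required=AKARI_REQUIRED):
    """Reject at write time any record with extra or missing fields."""
    extra = record.keys() - allowed
    missing = required - record.keys()
    if extra or missing:
        raise ValueError(
            f"rejected at write time: extra={sorted(extra)}, "
            f"missing={sorted(missing)}"
        )
    return record
```

Because nothing unvalidated is ever stored, downstream readers can index into every field unconditionally.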
Where the puzzles come from: puzzlehound
Penmark doesn’t fetch puzzles. puzzlehound does — a sibling
project at threeemojis/puzzlehound/ (Python, has its own README)
that discovers, fetches, and collates puzzles from registered
sources (logic-masters.de, Cracking the Cryptic spreadsheets,
gmpuzzles RSS, swaroopg92 Atom, tdoku benchmarks, …) into a form
penmark can consume directly.
penmark import reads puzzlehound’s collated output, parses each
record with Puzzle::from_tagged_json, validates puzzle.genre.name
against the target file’s genre, and appends as a tagged blob. From
there the dataset-reading verbs (solve, grade-canon,
profile-eval) take over. Adding a new upstream source is
puzzlehound’s problem; adding a new genre to canonicalize into is
penmark’s — one new entry in the GENRES slice + one new
data/puzzles/<genre>.jsonl.gz file.
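The import step — parse, validate the genre tag against the target file, append as a tagged blob — can be sketched as below. The function name and the record["genre"]["name"] tag layout are assumptions for illustration; the real parser is Puzzle::from_tagged_json on the Rust side.

```python
import gzip
import json


def import_records(collated_lines, target_path, target_genre):
    """Sketch of the penmark-import append step.

    Parses each collated record, checks its self-describing genre tag
    against the target file's genre, and appends it as one compact JSON
    blob per line. Appending to a .gz file produces a multi-member gzip
    stream, which the standard readers handle transparently.
    """
    appended = 0
    with gzip.open(target_path, "at", encoding="utf-8") as out:
        for line in collated_lines:
            record = json.loads(line)
            tag = record["genre"]["name"]  # assumed tag layout
            if tag != target_genre:
                raise ValueError(
                    f"genre mismatch: file is {target_genre!r}, record says {tag!r}"
                )
            out.write(json.dumps(record, separators=(",", ":")) + "\n")
            appended += 1
    return appended
```

From here the dataset-reading verbs see one uniform file: the importer never stores anything the read path would reject.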