Reading Baudrillard, Twice
Two Anki add-ons that turn any PDF into a deck of atomic, self-grading flashcards. The pipeline runs entirely inside Anki, calls OpenAI directly via stdlib urllib, and leaves a verbatim source quote on the back of every card. Tested end-to-end on a 172-page philosophy book: 320 cards, ~16 minutes, ~$0.50 in API charges.
The Stack
    ┌──────────────┐
    │   PDF file   │
    └──────┬───────┘
           │  pdfminer.six (vendored, pure-Python)
           ▼
    ┌─────────────────────────────────────────────────┐
    │  extract → map   →  merge   →  reduce           │  book_to_cards
    │  (pages)   (LLM     (cosine    (LLM, one        │  pipeline
    │            chunks)  clusters)  call per topic)  │
    └────────────────────────┬────────────────────────┘
                             │  cards.jsonl
                             ▼
                  ┌─────────────────────┐
                  │  precalibrate       │  parallel QThread workers
                  │  + batch_embed      │  smart_grader.api
                  └──────────┬──────────┘
                             │  calibrated.jsonl
                             ▼
                  ┌─────────────────────┐
                  │  insert into Anki   │  main thread, no API calls
                  │  + apply fields     │
                  └─────────────────────┘
Pipeline Stages
| Stage | What it does | API calls per run (172-page book) |
|---|---|---|
| extract | pdfminer → list[Page] | 0 |
| map | 5-page sliding chunks → list[Aspect] with verbatim quotes | 40 chat |
| merge | cosine clustering of topic strings | ~300 embeddings (cacheable) |
| reduce | one chat call per canonical topic → 0–5 atomic cards | 314 chat |
| precalibrate | per-card keywords + paraphrases + threshold (parallel) | 357 × (2 chat + 1 batched embed) |
| insert | write notes + apply pre-computed fields | 0 |
Components
- smart_grader
  - Typed-answer textarea injected into any card with a `Keywords` field
  - Two-gate grading: strict AND keyword check → cosine similarity vs reference embedding
  - Per-card calibrated threshold stored on the note (`_grader_threshold`, `_grader_ref_embedding`, `_grader_calibration`)
  - JS handlers wired via `mw.reviewer.web.eval` in `reviewer_did_show_question` (modern Anki strips `<script>` tags from card HTML)
  - Exposes `precompute_calibration` / `apply_calibration` so book_to_cards can call it off the main thread
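The two-gate check can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not smart_grader's actual code: the function names `keyword_gate`, `cosine`, and `grade` are hypothetical, the case-insensitive substring match is a guess at what "strict AND" means, and the answer embedding is passed in as a plain list rather than fetched from the API.

```python
import math
from typing import Sequence


def keyword_gate(answer: str, keywords: Sequence[str]) -> bool:
    """Gate 1 (assumed form): strict AND, every keyword must appear in the answer."""
    lowered = answer.lower()
    return all(kw.lower() in lowered for kw in keywords)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def grade(
    answer: str,
    keywords: Sequence[str],
    answer_vec: Sequence[float],
    ref_vec: Sequence[float],
    threshold: float,
) -> bool:
    """Pass only if both gates pass: keywords first, then similarity."""
    if not keyword_gate(answer, keywords):
        return False  # cheap gate fails, the similarity check is skipped
    # Gate 2: compare against the card's stored reference embedding
    # using the per-card calibrated threshold.
    return cosine(answer_vec, ref_vec) >= threshold
```

Running the keyword gate first means an answer that omits a required term is rejected without consulting the embedding at all.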
- book_to_cards: Extract (`pdf_text.py`)
  - Vendored `pdfminer.six 20231228` (last Python-3.9 compatible release; Anki ships 3.9)
  - `cryptography` import patched optional so encrypted-PDF code stays out of the import path
  - Whitespace normalized; pages with <30 non-whitespace chars dropped
  - Output: `extract.json` keyed by `book_hash = sha256(pdf_bytes)`
- book_to_cards: Map (`pipeline/map_chunks.py`)
  - Sliding 5-page chunks with 1-page overlap
  - One LLM call per chunk returns `{text, quote, topic, source_pages}` per noteworthy fact
  - `source_pages` overridden to the chunk's actual pages, because the LLM proved unreliable at echoing absolute page markers (88% mismatch rate on real-book runs)
  - Per-chunk failures isolated to `errors.jsonl`; pipeline continues
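A sliding window of 5 pages with a 1-page overlap advances 4 pages per step. The sketch below shows one way to produce such chunks; the function name `sliding_chunks` and the handling of the short final chunk are assumptions, not taken from `map_chunks.py`. Because each chunk is built from known page numbers, the caller already has the authoritative `source_pages` and does not need the model to echo them.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sliding_chunks(pages: Sequence[T], size: int = 5, overlap: int = 1) -> List[List[T]]:
    """Split pages into windows of `size` that share `overlap` pages with the next window."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    step = size - overlap
    chunks: List[List[T]] = []
    start = 0
    while start < len(pages):
        chunks.append(list(pages[start:start + size]))
        if start + size >= len(pages):
            break  # this window already reached the last page
        start += step
    return chunks


# Ten pages numbered 1..10 give three chunks; neighbours share one page.
print(sliding_chunks(list(range(1, 11))))
# [[1, 2, 3, 4, 5], [5, 6, 7, 8, 9], [9, 10]]
```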
- book_to_cards: Merge (`pipeline/merge_tags.py`)
  - Every distinct topic string embedded via `text-embedding-3-small`
  - Agglomerative single-link clustering via union-find, cosine distance threshold 0.25
  - Centroid-nearest tag becomes the canonical topic name
  - Pure CPU + embedding cache; no chat calls
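Single-link clustering with a distance cutoff is equivalent to taking connected components of the graph whose edges join every pair closer than the threshold, which is why union-find suffices. The following is a minimal pure-Python sketch of that idea; the function names (`single_link_clusters`, `canonical_index`) and the O(n²) pairwise loop are illustrative assumptions rather than the contents of `merge_tags.py`.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; treated as maximal (1.0) for zero-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def single_link_clusters(vectors: Sequence[Vector], threshold: float = 0.25) -> List[List[int]]:
    """Union every pair within `threshold`; clusters are the resulting components."""
    parent = list(range(len(vectors)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine_distance(vectors[i], vectors[j]) <= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[max(ri, rj)] = min(ri, rj)

    groups: Dict[int, List[int]] = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def canonical_index(vectors: Sequence[Vector], members: Sequence[int]) -> int:
    """Return the member whose vector is nearest to the cluster centroid."""
    dim = len(vectors[members[0]])
    centroid = [sum(vectors[m][d] for m in members) / len(members) for d in range(dim)]
    return min(members, key=lambda m: cosine_distance(vectors[m], centroid))
```

One property worth keeping in mind: single-link clusters can chain. Three unit vectors at 0°, 40°, and 80° end up in one cluster at threshold 0.25, even though the two outer vectors are far apart, because each is within range of the middle one.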
- book_to_cards: Reduce (`pipeline/reduce_cards.py`)
  - One LLM call per canonical topic
  - Prompt embeds: topic + aliases + full text of every source page touched + verbatim quotes
  - Model returns 0–5 atomic cards with `source_quote_indices` resolved to actual quote strings server-side
  - Prompt requires self-contained questions (rejects “according to the text” / floater references) and allows empty card lists for thin material
- book_to_cards: Precalibrate (`runner._do_precalibrate`)
  - Runs in a `QThread` worker with `concurrent.futures.ThreadPoolExecutor` (default 8 workers, configurable via `calibrate_concurrency`)
  - Per card: keyword generation → paraphrase generation → batched embedding of `[reference] + 12 good + 6 bad` in a single HTTP call
  - Threshold rule: `max(min(good_sims), max(bad_sims) + epsilon)`
  - Lock-protected `append_jsonl`; `_input_index` tags allow order-independent resume
  - SQLite cache uses `busy_timeout=10000` for parallel writes
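The threshold rule is small enough to work through by hand. In the sketch below the function name `threshold_from_sims` and the value `epsilon=0.01` are assumptions (the source does not state the epsilon used); the inputs are the cosine similarities of the good and bad paraphrases against the reference embedding.

```python
from typing import Sequence


def threshold_from_sims(
    good_sims: Sequence[float],
    bad_sims: Sequence[float],
    epsilon: float = 0.01,  # assumed value, for illustration only
) -> float:
    """max(min(good_sims), max(bad_sims) + epsilon).

    Normally this is the weakest good paraphrase. If a bad paraphrase
    scores at or above that, the threshold is lifted just past it.
    """
    return max(min(good_sims), max(bad_sims) + epsilon)


# Clean separation: the weakest good paraphrase sets the bar.
clean = threshold_from_sims([0.82, 0.90, 0.78], [0.55, 0.60])

# Overlap: a bad paraphrase (0.75) outscores the weakest good one (0.70),
# so the bar moves to just above the bad paraphrase.
overlap = threshold_from_sims([0.70, 0.90], [0.75])
```

Here `clean` comes out at 0.78 and `overlap` at 0.76. In the overlap case the rule prefers rejecting the weakest good paraphrase over accepting a known-bad one.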
- book_to_cards: Insert (`deck_writer.py`, `runner._do_insert`)
  - Runs on Anki’s main thread via `mw.taskman.run_on_main`
  - Each card becomes a `Book Card` note (Front / Back / Topic / Source / Keywords / threshold / ref_embedding / calibration_blob)
  - All API work was pre-computed; insert is field writes + duplicate filtering
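The hand-off between Precalibrate and Insert follows a compute-then-apply pattern: slow, pure work runs in a thread pool, and only cheap field writes happen afterwards on the main thread. The sketch below shows that shape with stand-ins. `precompute` and `apply_all` are hypothetical placeholders for `precompute_calibration` and `apply_calibration`, whose real signatures are not given here, and plain dicts stand in for Anki notes.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Dict, List


def precompute(card: Dict) -> Dict:
    """Stand-in for the pure, thread-safe step (API calls would happen here)."""
    return {"_input_index": card["_input_index"], "threshold": 0.8}


def run_precompute(cards: List[Dict], workers: int = 8) -> List[Dict]:
    """Fan the cards out to a pool; results arrive in completion order."""
    results: List[Dict] = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(precompute, card) for card in cards]
        for future in as_completed(futures):
            results.append(future.result())
    return results


def apply_all(notes: List[Dict], results: List[Dict]) -> None:
    """Stand-in for the main-thread step: plain field writes, no network."""
    for result in results:
        # _input_index ties each result to its card, so arrival order is irrelevant.
        notes[result["_input_index"]]["threshold"] = result["threshold"]


cards = [{"_input_index": i} for i in range(20)]
notes: List[Dict] = [{} for _ in cards]
apply_all(notes, run_precompute(cards))
```

Tagging each result with its input index is what lets the pool finish in any order without misassigning calibration data.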
Setup & Infrastructure
- `build.sh`: `py_compile` checks + zips both add-ons with files at the `.ankiaddon` zip root (Anki rejects nested layouts)
- Per-run `user_files` dir: `extract.json`, `map.jsonl`, `topics.json`, `cards.jsonl`, `calibrated.jsonl`, `errors.jsonl`, `manifest.json`; every stage is independently resumable
- Atomic writes: `tmp + os.replace` for whole files; per-record `flush()` for `.jsonl`
- Embedding cache: SQLite, keyed by `(model, text)` SHA-256; deduplicates across runs
- Cost ceiling: pre-run dialog; runner aborts cleanly when crossed, run resumes on next launch
- Tests — pytest, 36 passing, 1 skipped (the deck_writer integration test wants Anki’s bundled Python)
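The write and resume conventions above can be sketched with the standard library alone. `append_jsonl` and `_input_index` are names from this project; `atomic_write_json` and `done_indices` are hypothetical helpers, and the details (module-level lock, skipping a torn final line) are assumptions about how such a scheme might look rather than the add-on's code.

```python
import json
import os
import tempfile
import threading
from typing import Any, Dict, Set

_jsonl_lock = threading.Lock()


def atomic_write_json(path: str, obj: Any) -> None:
    """Write to a temp file in the same directory, then swap it in with os.replace."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(obj, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, path)  # readers see the old file or the new one, never a partial
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def append_jsonl(path: str, record: Dict[str, Any]) -> None:
    """Append one record per line; the lock keeps parallel workers from interleaving."""
    line = json.dumps(record, ensure_ascii=False)
    with _jsonl_lock:
        with open(path, "a", encoding="utf-8") as fh:
            fh.write(line + "\n")
            fh.flush()


def done_indices(path: str) -> Set[int]:
    """Collect the _input_index values already on disk so a resumed run can skip them."""
    done: Set[int] = set()
    if not os.path.exists(path):
        return done
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            try:
                done.add(json.loads(line)["_input_index"])
            except (json.JSONDecodeError, KeyError):
                continue  # blank or torn line, e.g. from a crash mid-write
    return done
```

On resume, a stage would process only the inputs whose index is absent from `done_indices(...)`, which works regardless of the order in which parallel workers finished.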
Technical Highlights
- Off-main-thread calibration: 38 min serial (Anki frozen) → 5–9 min parallel (Anki responsive). Achieved by splitting `_calibrate_one(note)` into `precompute_calibration` (pure, thread-safe) and `apply_calibration` (Anki main thread, fast)
- Batched embeddings: 19 individual HTTP requests per card collapsed into 1 by sending all reference + paraphrase texts in one `/v1/embeddings` call
- Quote-grounded cards: every card carries the verbatim source quote on the back with a page reference. The model is allowed to be wrong because it shows its work, so a bad answer can be checked against the quote
- Resumable across all five stages: a crash during reduce on topic 200 of 314 resumes at topic 200 after relaunch, with no re-extraction, no re-mapping, and no recomputed embeddings
- Two add-ons, optional coupling: book_to_cards detects smart_grader at runtime via `importlib.util.find_spec("smart_grader")` and silently no-ops the calibration hand-off if it isn’t installed
- `<script>` stripping fix: modern Anki sanitizes script tags from card HTML, so the typed-answer handler injection moved to `mw.reviewer.web.eval()` in `reviewer_did_show_question`, which bypasses the sanitizer
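Batching works because the OpenAI embeddings endpoint accepts a list of strings as `input` and returns one embedding per string, each tagged with its `index`. Below is a minimal stdlib-only sketch of such a call; the helper names (`build_embedding_request`, `parse_embedding_response`, `embed_batch`) and the timeout are assumptions, not the add-on's actual functions.

```python
import json
import urllib.request
from typing import List, Sequence

EMBEDDINGS_URL = "https://api.openai.com/v1/embeddings"


def build_embedding_request(
    texts: Sequence[str],
    api_key: str,
    model: str = "text-embedding-3-small",
) -> urllib.request.Request:
    """One POST carrying every text: reference + good + bad paraphrases."""
    payload = json.dumps({"model": model, "input": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        EMBEDDINGS_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def parse_embedding_response(body: bytes) -> List[List[float]]:
    """Return embeddings in input order, sorting by the `index` field to be safe."""
    items = json.loads(body)["data"]
    return [item["embedding"] for item in sorted(items, key=lambda it: it["index"])]


def embed_batch(texts: Sequence[str], api_key: str, timeout: float = 60.0) -> List[List[float]]:
    """Send the batch and return one vector per input text (performs a network call)."""
    request = build_embedding_request(texts, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_embedding_response(response.read())
```

With 1 reference, 12 good, and 6 bad paraphrases, `texts` has 19 entries, so the first vector returned is the reference and the remaining 18 can be split by position.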
Sample Output (Simulacra and Simulation)
| Metric | Value |
|---|---|
| Pages | 172 |
| Chunks | 40 |
| Topics after clustering | 314 (36 multi-aspect, 278 singletons) |
| Cards generated | 346 |
| Cards in deck after Anki dedup | 320 |
| Wall-clock | ~16 min |
| OpenAI cost | ~$0.50 |
Q: What does Baudrillard mean by ‘the desert of the real’?
A: Baudrillard suggests that it is the real, rather than the map or representation, that has vestiges remaining in contemporary society. The desert is characterized by the absence of a substantial reality — a shift from the Empire’s territory to our current state where the real is diminished and fragmented.
Source: p. 7 — “It is the real, and not the map, whose vestiges persist here and there in the deserts that are no longer those of the Empire, but ours.”
Q: According to Baudrillard, what role does Disneyland play in the context of reality and fiction?
A: Baudrillard describes Disneyland as a “deterrence machine” set up to rejuvenate the fiction of the real in the opposite camp — blurring the lines between reality and imagination.
Tech Stack
- Python 3.9 (Anki’s bundled interpreter)
- PyQt6 via `aqt` for dialogs and the QThread runner
- pdfminer.six 20231228, vendored as pure-Python
- OpenAI `gpt-4o-mini` for chat, `text-embedding-3-small` for embeddings
- stdlib `urllib` + `sqlite3` (no `requests`, no third-party HTTP)
- pytest, 36 passing tests
Repo Link
github.com/Mkrolick/Anki-Plugin — both .ankiaddon files attached to the v0.1.0 release. Install via Tools → Add-ons → Install from file… and paste your OpenAI API key into each add-on’s Config.