Reading Baudrillard, Twice

Two Anki add-ons that turn any PDF into a deck of atomic, self-grading flashcards. The pipeline runs entirely inside Anki, calls OpenAI directly via stdlib urllib, and leaves a verbatim source quote on the back of every card. Tested end-to-end on a 172-page philosophy book: 320 cards, ~16 minutes, ~$0.50 in API charges.

The Stack

┌──────────────┐
│   PDF file   │
└──────┬───────┘
       │ pdfminer.six (vendored, pure-Python)
       ▼
┌─────────────────────────────────────────────────┐
│  extract  →  map  →  merge  →  reduce           │   book_to_cards
│  (pages)    (LLM    (cosine    (LLM, one        │   pipeline
│              chunks) clusters)  call per topic) │
└────────────────────────┬────────────────────────┘
                         │ cards.jsonl
                         ▼
              ┌─────────────────────┐
              │  precalibrate       │   parallel QThread workers
              │  + batch_embed      │   smart_grader.api
              └──────────┬──────────┘
                         │ calibrated.jsonl
                         ▼
              ┌─────────────────────┐
              │  insert into Anki   │   main thread, no API calls
              │  + apply fields     │
              └─────────────────────┘

Pipeline Stages

Stage	What it does	API calls per run (172-page book)
extract	pdfminer → list[Page]	0
map	5-page sliding chunks → list[Aspect] with verbatim quotes	40 chat
merge	cosine clustering of topic strings	~300 embeddings (cacheable)
reduce	one chat call per canonical topic → 0–5 atomic cards	314 chat
precalibrate	per-card keywords + paraphrases + threshold (parallel)	357 × (2 chat + 1 batched embed)
insert	write notes + apply pre-computed fields	0

Components

smart_grader
- Typed-answer textarea injected into any card with a Keywords field
- Two-gate grading: strict AND keyword check → cosine similarity vs reference embedding
- Per-card calibrated threshold stored on the note (_grader_threshold, _grader_ref_embedding, _grader_calibration)
- JS handlers wired via mw.reviewer.web.eval in reviewer_did_show_question (modern Anki strips <script> tags from card HTML)
- Exposes precompute_calibration / apply_calibration so book_to_cards can call it off the main thread
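The two-gate check above can be sketched in a few lines. This is a minimal illustration, not smart_grader's actual code: the function names, the whole-word case-insensitive keyword match, and the (passed, reason) return shape are all assumptions.

```python
import math
import re


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def grade(answer, keywords, answer_emb, ref_emb, threshold):
    """Return (passed, reason).

    Gate 1 (strict AND): every keyword must appear in the typed answer.
    Gate 2: cosine similarity to the reference embedding must reach the
    card's calibrated threshold.
    """
    # Assumption: keywords are single words matched case-insensitively.
    words = set(re.findall(r"\w+", answer.lower()))
    missing = [k for k in keywords if k.lower() not in words]
    if missing:
        return False, "missing keywords: " + ", ".join(missing)
    sim = cosine(answer_emb, ref_emb)
    if sim < threshold:
        return False, f"similarity {sim:.2f} below threshold {threshold:.2f}"
    return True, "ok"
```

Running the cheap keyword gate first means an answer that omits a required term fails before its embedding is compared at all.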
book_to_cards — Extract (pdf_text.py)
- Vendored pdfminer.six 20231228 (last Python-3.9 compatible release; Anki ships 3.9)
- cryptography import patched optional so encrypted-PDF code stays out of the import path
- Whitespace normalized; pages with <30 non-whitespace chars dropped
- Output: extract.json keyed by book_hash = sha256(pdf_bytes)
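A rough sketch of the post-extraction cleanup and the run key, assuming pdfminer has already produced one string per page. The helper names, the hex-digest form of book_hash, and the collapse of whitespace runs to single spaces are assumptions; only the 30-character cutoff and the sha256(pdf_bytes) key come from the description above.

```python
import hashlib
import re

MIN_CHARS = 30  # pages with fewer non-whitespace characters are dropped


def book_hash(pdf_bytes: bytes) -> str:
    """Key a run by the SHA-256 of the raw PDF bytes (hex form is assumed)."""
    return hashlib.sha256(pdf_bytes).hexdigest()


def clean_pages(raw_pages):
    """Collapse whitespace runs and drop near-empty pages.

    Returns a list of (page_number, text) pairs; page numbers are 1-based
    positions in the original PDF, so dropped pages leave gaps.
    """
    kept = []
    for number, text in enumerate(raw_pages, start=1):
        normalized = re.sub(r"\s+", " ", text).strip()
        non_whitespace = len(re.sub(r"\s", "", normalized))
        if non_whitespace >= MIN_CHARS:
            kept.append((number, normalized))
    return kept
```

Keeping the original page number on each surviving page is what lets later stages cite "p. 7" even when blank pages were dropped.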
book_to_cards — Map (pipeline/map_chunks.py)
- Sliding 5-page chunks with 1-page overlap
- One LLM call per chunk returns {text, quote, topic, source_pages} per noteworthy fact
- source_pages overridden to the chunk’s actual pages — the LLM proved unreliable at echoing absolute page markers (88% mismatch rate on real-book runs)
- Per-chunk failures isolated to errors.jsonl; pipeline continues
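The chunking and the source_pages override can be illustrated as follows. The function names and the dict shape of an aspect are hypothetical; the window size of 5 and the overlap of 1 come from the list above.

```python
def sliding_chunks(page_numbers, size=5, overlap=1):
    """Split page numbers into windows of `size` that share `overlap` pages."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap
    chunks = []
    for start in range(0, len(page_numbers), step):
        chunks.append(page_numbers[start:start + size])
        # Stop once a window has reached the last page, so no trailing
        # chunk is emitted that contains only already-covered pages.
        if start + size >= len(page_numbers):
            break
    return chunks


def override_source_pages(aspects, chunk):
    """Ignore whatever pages the model reported and use the chunk's own."""
    return [dict(aspect, source_pages=list(chunk)) for aspect in aspects]
```

The override trades precision for reliability: a fact is attributed to all pages of its chunk rather than to a single page the model may have misreported.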
book_to_cards — Merge (pipeline/merge_tags.py)
- Every distinct topic string embedded via text-embedding-3-small
- Agglomerative single-link clustering via union-find, cosine distance threshold 0.25
- Centroid-nearest tag becomes the canonical topic name
- Pure CPU + embedding cache; no chat calls
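Single-link clustering with union-find can be sketched like this: any two topics closer than the threshold are joined, and clusters are the resulting connected components. The function names and the inclusive comparison (`<=`) are assumptions; the 0.25 cosine-distance threshold and the centroid-nearest naming rule are from the list above.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; treats an all-zero vector as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - (dot / norm if norm else 0.0)


def cluster_topics(embeddings, threshold=0.25):
    """embeddings: dict of topic string -> vector.

    Returns a list of (canonical_topic, member_topics) pairs.
    """
    topics = list(embeddings)
    parent = list(range(len(topics)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Single link: one close pair is enough to merge two clusters.
    for i in range(len(topics)):
        for j in range(i + 1, len(topics)):
            if cosine_distance(embeddings[topics[i]], embeddings[topics[j]]) <= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i, topic in enumerate(topics):
        groups.setdefault(find(i), []).append(topic)

    clusters = []
    for members in groups.values():
        dim = len(embeddings[members[0]])
        centroid = [sum(embeddings[m][d] for m in members) / len(members) for d in range(dim)]
        canonical = min(members, key=lambda m: cosine_distance(embeddings[m], centroid))
        clusters.append((canonical, members))
    return clusters
```

The pairwise loop is quadratic in the number of distinct topics, which is acceptable at a few hundred topics per book.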
book_to_cards — Reduce (pipeline/reduce_cards.py)
- One LLM call per canonical topic
- Prompt embeds: topic + aliases + full text of every source page touched + verbatim quotes
- Model returns 0–5 atomic cards; its source_quote_indices are resolved to the actual quote strings by the pipeline code, not by the model
- Prompt requires self-contained questions (rejects “according to the text” and similar dangling references) and allows an empty card list for thin material
book_to_cards — Precalibrate (runner._do_precalibrate)
- Runs in a QThread worker with concurrent.futures.ThreadPoolExecutor (default 8 workers, configurable via calibrate_concurrency)
- Per card: keyword generation → paraphrase generation → batched embedding of [reference] + 12 good + 6 bad in a single HTTP call
- Threshold rule: max(min(good_sims), max(bad_sims) + epsilon)
- Lock-protected append_jsonl; _input_index tags allow order-independent resume
- SQLite cache uses busy_timeout=10000 for parallel writes
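The threshold rule and the order-independent parallel run can be sketched together. The epsilon value, the function names, and the in-memory list standing in for the .jsonl file are assumptions; the rule itself, the default of 8 workers, and the _input_index tag are from the list above.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def calibrate_threshold(good_sims, bad_sims, epsilon=0.01):
    """max(min(good), max(bad) + epsilon).

    Starts from the weakest good paraphrase, but is raised if needed so it
    always sits strictly above the strongest bad paraphrase.
    """
    return max(min(good_sims), max(bad_sims) + epsilon)


def run_parallel(cards, calibrate_one, done_indices=(), workers=8):
    """Calibrate cards in parallel, skipping indices finished in an earlier run.

    Every record is tagged with _input_index, so the order in which workers
    finish does not matter for resuming or for the final ordering.
    """
    lock = threading.Lock()
    records = []  # stand-in for a lock-protected append to calibrated.jsonl

    def work(index):
        record = dict(calibrate_one(cards[index]), _input_index=index)
        with lock:
            records.append(record)

    done = set(done_indices)
    pending = [i for i in range(len(cards)) if i not in done]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(work, pending))  # list() re-raises worker exceptions
    return sorted(records, key=lambda r: r["_input_index"])
```

When a bad paraphrase scores higher than the weakest good one, the second term wins: the card becomes stricter rather than accepting a known-wrong answer.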
book_to_cards — Insert (deck_writer.py, runner._do_insert)
- Runs on Anki’s main thread via mw.taskman.run_on_main
- Each card becomes a Book Card note (Front / Back / Topic / Source / Keywords / threshold / ref_embedding / calibration_blob)
- All API work was pre-computed; insert is field writes + duplicate filtering

Setup & Infrastructure

build.sh — py_compile checks + zips both add-ons with files at the .ankiaddon zip root (Anki rejects nested layouts)
Per-run user_files dir — extract.json, map.jsonl, topics.json, cards.jsonl, calibrated.jsonl, errors.jsonl, manifest.json — every stage is independently resumable
Atomic writes — tmp + os.replace for whole files; per-record flush() for .jsonl
Embedding cache — SQLite, keyed by (model, text) SHA-256; deduplicates across runs
Cost ceiling — pre-run dialog; runner aborts cleanly when crossed, run resumes on next launch
Tests — pytest, 36 passing, 1 skipped (the deck_writer integration test wants Anki’s bundled Python)
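The two write patterns above (tmp + os.replace for whole files, flushed appends for .jsonl) can be sketched with the standard library alone. The function names are hypothetical, and the reader's tolerance for a truncated final line is an assumption about how a resume might cope with a crash mid-write.

```python
import json
import os
import tempfile
import threading

_append_lock = threading.Lock()


def write_json_atomic(path, data):
    """Write to a temp file in the same directory, then os.replace it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)  # readers see the old file or the new one
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def append_jsonl(path, record):
    """Append one record per line; the lock keeps parallel writers from interleaving."""
    with _append_lock:
        with open(path, "a", encoding="utf-8") as handle:
            handle.write(json.dumps(record) + "\n")
            handle.flush()


def read_jsonl(path):
    """Load completed records, stopping at the first unparsable line."""
    records = []
    if not os.path.exists(path):
        return records
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                break
    return records
```

The temp file is created in the target's own directory because os.replace is only atomic when source and destination are on the same filesystem.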

Technical Highlights

Off-main-thread calibration: 38 min serial (Anki frozen) → 5–9 min parallel (Anki responsive). Achieved by splitting _calibrate_one(note) into precompute_calibration (pure, threadsafe) and apply_calibration (Anki main thread, fast)
Batched embeddings: 19 individual HTTP requests per card collapsed into 1 by sending all reference + paraphrase texts in one /v1/embeddings call
Quote-grounded cards: every card carries the verbatim source quote on the back with a page reference. The model can still get an answer wrong, but because it shows its work, the answer can be checked against the quote
Resumable across all five stages: a crash during reduce on topic 200 of 314 resumes at topic 200 after relaunch — no re-extraction, no re-mapping, no recomputed embeddings
Two add-ons, optional coupling: book_to_cards detects smart_grader at runtime via importlib.util.find_spec("smart_grader") and silently no-ops the calibration hand-off if it isn’t installed
<script> stripping fix: modern Anki sanitises script tags from card HTML, so the typed-answer handler injection moved to mw.reviewer.web.eval() in reviewer_did_show_question — bypasses the sanitizer
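The batched-embeddings change can be illustrated with stdlib urllib. This is a sketch under assumptions, not the add-on's code: the function names and the timeout are hypothetical, while the endpoint, the list-valued `input` field, and the `data[].index` / `data[].embedding` response fields follow OpenAI's public embeddings API.

```python
import json
import urllib.request

EMBEDDINGS_URL = "https://api.openai.com/v1/embeddings"


def build_embedding_request(api_key, texts, model="text-embedding-3-small"):
    """One POST carrying every text, instead of one request per text."""
    body = json.dumps({"model": model, "input": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        EMBEDDINGS_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def parse_embedding_response(payload):
    """Return vectors in input order, sorting by the index each item carries."""
    items = sorted(payload["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def batch_embed(api_key, texts, timeout=60):
    """Send the batch and return one vector per input text."""
    request = build_embedding_request(api_key, texts)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_embedding_response(json.load(response))
```

For one card, `texts` would be the reference answer followed by the 12 good and 6 bad paraphrases, which gives the 19 inputs that previously cost 19 round trips.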

Sample Output (Simulacra and Simulation)


Pages	172
Chunks	40
Topics after clustering	314 (36 multi-aspect, 278 singletons)
Cards generated	346
Cards in deck after Anki dedup	320
Wall-clock	~16 min
OpenAI cost	~$0.50

Q: What does Baudrillard mean by ‘the desert of the real’?

A: Baudrillard suggests that it is the real, rather than the map or representation, that has vestiges remaining in contemporary society. The desert is characterized by the absence of a substantial reality — a shift from the Empire’s territory to our current state where the real is diminished and fragmented.

Source: p. 7 — “It is the real, and not the map, whose vestiges persist here and there in the deserts that are no longer those of the Empire, but ours.”

Q: According to Baudrillard, what role does Disneyland play in the context of reality and fiction?

A: Baudrillard describes Disneyland as a “deterrence machine” set up to rejuvenate the fiction of the real in the opposite camp — blurring the lines between reality and imagination.

Tech Stack

Python 3.9 (Anki’s bundled interpreter)
PyQt6 via aqt for dialogs and the QThread runner
pdfminer.six 20231228, vendored as pure-Python
OpenAI gpt-4o-mini for chat, text-embedding-3-small for embeddings
stdlib urllib + sqlite3 (no requests, no third-party HTTP)
pytest, 36 passing tests
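The embedding cache needs nothing beyond the stdlib sqlite3 listed above. A minimal sketch, assuming a single two-column table, JSON-encoded vectors, and a NUL separator inside the hashed key (all hypothetical); the SHA-256 key over (model, text) and the busy_timeout of 10000 ms come from the descriptions earlier in this document.

```python
import hashlib
import json
import sqlite3


def cache_key(model, text):
    """SHA-256 over (model, text); the NUL separator is an assumption."""
    return hashlib.sha256(f"{model}\x00{text}".encode("utf-8")).hexdigest()


def open_cache(path):
    conn = sqlite3.connect(path)
    # Wait up to 10 s for a competing writer instead of failing immediately.
    conn.execute("PRAGMA busy_timeout=10000")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS embeddings (key TEXT PRIMARY KEY, vector TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def cached_embed(conn, model, texts, embed_fn):
    """Return one vector per text, calling embed_fn only for cache misses."""
    keys = [cache_key(model, t) for t in texts]
    found = {}
    for key in set(keys):
        row = conn.execute("SELECT vector FROM embeddings WHERE key = ?", (key,)).fetchone()
        if row:
            found[key] = json.loads(row[0])

    missing = {}  # key -> text, deduplicated, in first-seen order
    for key, text in zip(keys, texts):
        if key not in found:
            missing.setdefault(key, text)

    if missing:
        vectors = embed_fn(list(missing.values()))
        for key, vector in zip(missing, vectors):
            conn.execute(
                "INSERT OR REPLACE INTO embeddings VALUES (?, ?)", (key, json.dumps(vector))
            )
            found[key] = vector
        conn.commit()
    return [found[k] for k in keys]
```

Because the key includes the model name, switching embedding models never returns a stale vector, and repeated topic strings across runs cost nothing.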

Repo Link

github.com/Mkrolick/Anki-Plugin — both .ankiaddon files attached to the v0.1.0 release. Install via Tools → Add-ons → Install from file… and paste your OpenAI API key into each add-on’s Config.