Reading Baudrillard, Twice

Two Anki add-ons that turn any PDF into a deck of atomic, self-grading flashcards. The pipeline runs entirely inside Anki, calls OpenAI directly via stdlib urllib, and leaves a verbatim source quote on the back of every card. Tested end-to-end on a 172-page philosophy book: 320 cards, ~16 minutes, ~$0.50 in API charges.

The Stack

┌──────────────┐
│   PDF file   │
└──────┬───────┘
       │ pdfminer.six (vendored, pure-Python)
       ▼
┌─────────────────────────────────────────────────┐
│  extract  →  map  →  merge  →  reduce           │   book_to_cards
│  (pages)    (LLM    (cosine    (LLM, one        │   pipeline
│              chunks) clusters)  call per topic) │
└────────────────────────┬────────────────────────┘
                         │ cards.jsonl
                         ▼
              ┌─────────────────────┐
              │  precalibrate       │   parallel QThread workers
              │  + batch_embed      │   smart_grader.api
              └──────────┬──────────┘
                         │ calibrated.jsonl
                         ▼
              ┌─────────────────────┐
              │  insert into Anki   │   main thread, no API calls
              │  + apply fields     │
              └─────────────────────┘

Pipeline Stages

Stage         What it does                                               API calls per run (172-page book)
extract       pdfminer → list[Page]                                      0
map           5-page sliding chunks → list[Aspect] with verbatim quotes  40 chat
merge         cosine clustering of topic strings                         ~300 embeddings (cacheable)
reduce        one chat call per canonical topic → 0–5 atomic cards       314 chat
precalibrate  per-card keywords + paraphrases + threshold (parallel)     357 × (2 chat + 1 batched embed)
insert        write notes + apply pre-computed fields                    0

Components

  1. smart_grader

    • Typed-answer textarea injected into any card with a Keywords field
    • Two-gate grading: a strict keyword check (every keyword must appear) → cosine similarity vs reference embedding
    • Per-card calibrated threshold stored on the note (_grader_threshold, _grader_ref_embedding, _grader_calibration)
    • JS handlers wired via mw.reviewer.web.eval in reviewer_did_show_question (modern Anki strips <script> tags from card HTML)
    • Exposes precompute_calibration / apply_calibration so book_to_cards can call it off the main thread
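The two-gate check described above can be sketched roughly as follows. This is a minimal illustration, not smart_grader's actual code: the function names, the case-insensitive substring match for keywords, and the reason strings are all assumptions, and the answer embedding is taken as an input rather than fetched from the API.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def grade(answer, keywords, answer_embedding, ref_embedding, threshold):
    """Hypothetical two-gate grader: returns (passed, reason).

    Gate 1: every keyword must appear in the typed answer (logical AND).
    Gate 2: cosine similarity to the reference embedding must reach the
            card's calibrated threshold.
    """
    text = answer.lower()
    missing = [k for k in keywords if k.lower() not in text]
    if missing:
        return False, "missing keywords: " + ", ".join(missing)
    similarity = cosine(answer_embedding, ref_embedding)
    if similarity < threshold:
        return False, f"similarity {similarity:.2f} below threshold {threshold:.2f}"
    return True, "ok"
```

Running the cheap keyword gate first means an answer that fails it never needs an embedding at all, which is one plausible reason for ordering the gates this way.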
  2. book_to_cards — Extract (pdf_text.py)

    • Vendored pdfminer.six 20231228 (last Python-3.9 compatible release; Anki ships 3.9)
    • cryptography import patched optional so encrypted-PDF code stays out of the import path
    • Whitespace normalized; pages with <30 non-whitespace chars dropped
    • Output: extract.json keyed by book_hash = sha256(pdf_bytes)
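A small sketch of the post-extraction cleanup, assuming the raw page texts have already come back from pdfminer as a list of strings. The helper names and the dict shape of each page are illustrative, not taken from pdf_text.py.

```python
import hashlib
import re

MIN_NON_WS_CHARS = 30  # pages with fewer non-whitespace characters are dropped


def book_hash(pdf_bytes: bytes) -> str:
    """Hex SHA-256 of the raw PDF bytes, used as the key for the run's files."""
    return hashlib.sha256(pdf_bytes).hexdigest()


def clean_pages(raw_pages):
    """Normalize whitespace and drop near-empty pages.

    Page numbers are 1-based and refer to the original PDF, so dropping a
    blank page does not shift the numbering of later pages.
    """
    pages = []
    for number, text in enumerate(raw_pages, start=1):
        normalized = re.sub(r"\s+", " ", text).strip()
        if len(normalized.replace(" ", "")) < MIN_NON_WS_CHARS:
            continue
        pages.append({"page": number, "text": normalized})
    return pages
```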
  3. book_to_cards — Map (pipeline/map_chunks.py)

    • Sliding 5-page chunks with 1-page overlap
    • One LLM call per chunk returns {text, quote, topic, source_pages} per noteworthy fact
    • source_pages overridden to the chunk’s actual pages — the LLM proved unreliable at echoing absolute page markers (88% mismatch rate on real-book runs)
    • Per-chunk failures isolated to errors.jsonl; pipeline continues
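The chunking and the source_pages override can be illustrated like this. It is a sketch under assumptions: the function names are hypothetical, pages are represented by their numbers only, and the short trailing chunk is kept rather than merged into the previous one.

```python
def sliding_chunks(pages, size=5, overlap=1):
    """Split pages into windows of `size` that share `overlap` pages with the next."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(pages):
        chunks.append(pages[start:start + size])
        if start + size >= len(pages):
            break  # this window already reached the last page
        start += step
    return chunks


def override_source_pages(aspects, chunk_pages):
    """Replace whatever pages the model reported with the chunk's real pages."""
    return [{**aspect, "source_pages": list(chunk_pages)} for aspect in aspects]
```

For ten pages this yields [1..5], [5..9], [9, 10]: each window starts on the last page of the previous one, so a fact that straddles a boundary is seen whole at least once.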
  4. book_to_cards — Merge (pipeline/merge_tags.py)

    • Every distinct topic string embedded via text-embedding-3-small
    • Agglomerative single-link clustering via union-find, cosine distance threshold 0.25
    • Centroid-nearest tag becomes the canonical topic name
    • Pure CPU + embedding cache; no chat calls
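Single-link clustering with union-find is compact enough to sketch in full. The version below is an illustration rather than merge_tags.py itself: it compares every pair (quadratic, fine for a few hundred topics), treats a distance equal to the threshold as a link, and returns a hypothetical list of {canonical, aliases} dicts.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; zero vectors are treated as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def cluster_topics(topics, vectors, threshold=0.25):
    """Single-link clustering: any pair within `threshold` joins two clusters."""
    parent = list(range(len(topics)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(topics)):
        for j in range(i + 1, len(topics)):
            if cosine_distance(vectors[i], vectors[j]) <= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(topics)):
        groups.setdefault(find(i), []).append(i)

    clusters = []
    for members in groups.values():
        dim = len(vectors[members[0]])
        centroid = [sum(vectors[m][d] for m in members) / len(members) for d in range(dim)]
        # The member closest to the centroid names the cluster.
        best = min(members, key=lambda m: cosine_distance(vectors[m], centroid))
        clusters.append({
            "canonical": topics[best],
            "aliases": [topics[m] for m in members if m != best],
        })
    return clusters
```

Because linkage is single-link, two topics can end up in the same cluster through a chain of intermediate neighbours even when their own distance exceeds the threshold.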
  5. book_to_cards — Reduce (pipeline/reduce_cards.py)

    • One LLM call per canonical topic
    • Prompt embeds: topic + aliases + full text of every source page touched + verbatim quotes
    • Model returns 0–5 atomic cards with source_quote_indices, which are resolved to the actual quote strings in pipeline code rather than by the model
    • Prompt requires self-contained questions (rejects “according to the text” / floater references) and allows empty card lists for thin material
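Resolving quote indices in code might look like the sketch below. Everything beyond source_quote_indices is an assumption: the front/back field names, zero-based indexing, and the choice to drop a card whose indices are all invalid are illustrative, not confirmed behaviour of reduce_cards.py.

```python
MAX_CARDS_PER_TOPIC = 5


def resolve_cards(raw_cards, quotes):
    """Attach verbatim quotes to model-proposed cards by index.

    raw_cards: list of dicts with hypothetical keys "front", "back",
               and "source_quote_indices" (assumed zero-based).
    quotes:    the verbatim quote strings that were sent in the prompt.
    """
    resolved = []
    for raw in raw_cards[:MAX_CARDS_PER_TOPIC]:
        indices = [
            i for i in raw.get("source_quote_indices", [])
            if isinstance(i, int) and 0 <= i < len(quotes)
        ]
        if not indices:
            continue  # assumption: a card with no valid quote is not kept
        resolved.append({
            "front": raw["front"],
            "back": raw["back"],
            "source_quotes": [quotes[i] for i in indices],
        })
    return resolved
```

Looking quotes up by index means the text on the back of the card is always a string the pipeline already holds, never a model paraphrase of it.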
  6. book_to_cards — Precalibrate (runner._do_precalibrate)

    • Runs in a QThread worker with concurrent.futures.ThreadPoolExecutor (default 8 workers, configurable via calibrate_concurrency)
    • Per card: keyword generation → paraphrase generation → batched embedding of [reference] + 12 good + 6 bad in a single HTTP call
    • Threshold rule: max(min(good_sims), max(bad_sims) + epsilon)
    • Lock-protected append_jsonl; _input_index tags allow order-independent resume
    • SQLite cache uses busy_timeout=10000 for parallel writes
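The threshold rule is short enough to write out. In this sketch the function names and the epsilon default of 0.01 are assumptions; the embeddings are passed in as plain lists, standing in for the result of the single batched embedding call.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def threshold_from_sims(good_sims, bad_sims, epsilon=0.01):
    """max(min(good_sims), max(bad_sims) + epsilon), as stated in the rule."""
    if not good_sims or not bad_sims:
        raise ValueError("need at least one good and one bad paraphrase")
    return max(min(good_sims), max(bad_sims) + epsilon)


def calibrate(reference, good, bad, epsilon=0.01):
    """Compute a per-card threshold from reference, good, and bad embeddings."""
    good_sims = [cosine(reference, g) for g in good]
    bad_sims = [cosine(reference, b) for b in bad]
    return threshold_from_sims(good_sims, bad_sims, epsilon)
```

When the two groups separate cleanly, the threshold is the weakest good paraphrase, so every good paraphrase passes. When the closest bad paraphrase scores above the weakest good one, the second term wins and the threshold sits just above the bad one, which rejects some good paraphrases in exchange for rejecting every bad one.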
  7. book_to_cards — Insert (deck_writer.py, runner._do_insert)

    • Runs on Anki’s main thread via mw.taskman.run_on_main
    • Each card becomes a Book Card note (Front / Back / Topic / Source / Keywords / threshold / ref_embedding / calibration_blob)
    • All API work was pre-computed; insert is field writes + duplicate filtering

Setup & Infrastructure

  • build.sh: py_compile checks + zips both add-ons with files at the .ankiaddon zip root (Anki rejects nested layouts)
  • Per-run user_files dir: extract.json, map.jsonl, topics.json, cards.jsonl, calibrated.jsonl, errors.jsonl, manifest.json; every stage is independently resumable
  • Atomic writes: tmp + os.replace for whole files; per-record flush() for .jsonl
  • Embedding cache: SQLite, keyed by (model, text) SHA-256; deduplicates across runs
  • Cost ceiling: pre-run dialog; runner aborts cleanly when crossed, run resumes on next launch
  • Tests: pytest, 36 passing, 1 skipped (the deck_writer integration test wants Anki’s bundled Python)
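The write discipline behind resumability can be sketched as below. append_jsonl and _input_index are names from the add-on; write_json_atomic, done_indices, the module-level lock, and the fsync call are assumptions added for illustration.

```python
import json
import os
import tempfile
import threading

_jsonl_lock = threading.Lock()


def write_json_atomic(path, obj):
    """Write a whole file via a temp file in the same directory, then os.replace."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(obj, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # readers see the old file or the new one
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def append_jsonl(path, record):
    """Append one record per line, flushing after each so a crash loses at most one."""
    line = json.dumps(record, ensure_ascii=False)
    with _jsonl_lock:
        with open(path, "a", encoding="utf-8") as f:
            f.write(line + "\n")
            f.flush()


def done_indices(path):
    """Collect the _input_index of every finished record, skipping torn lines."""
    if not os.path.exists(path):
        return set()
    done = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            try:
                done.add(json.loads(line)["_input_index"])
            except (ValueError, KeyError, TypeError):
                continue  # partial line left behind by a crash
    return done
```

On relaunch, a stage would skip any input whose index is already in done_indices(path), which is what makes resume independent of the order in which parallel workers finished.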

Technical Highlights

  • Off-main-thread calibration: 38 min serial (Anki frozen) → 5–9 min parallel (Anki responsive). Achieved by splitting _calibrate_one(note) into precompute_calibration (pure, threadsafe) and apply_calibration (Anki main thread, fast)
  • Batched embeddings: 19 individual HTTP requests per card collapsed into 1 by sending all reference + paraphrase texts in one /v1/embeddings call
  • Quote-grounded cards: every card carries the verbatim source quote on the back with a page reference. The model can still be wrong, but because it shows its work, any answer can be checked against the quote
  • Resumable across all five stages: a crash during reduce on topic 200 of 314 resumes at topic 200 after relaunch — no re-extraction, no re-mapping, no recomputed embeddings
  • Two add-ons, optional coupling: book_to_cards detects smart_grader at runtime via importlib.util.find_spec("smart_grader") and silently no-ops the calibration hand-off if it isn’t installed
  • <script> stripping fix: modern Anki sanitizes script tags out of card HTML, so the typed-answer handler injection moved to mw.reviewer.web.eval() in reviewer_did_show_question, which bypasses the sanitizer
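The optional coupling between the two add-ons can be sketched with importlib. find_spec and import_module are standard library calls; the helper names, the card dict shape, and the assumption that precompute_calibration takes a single card argument are hypothetical.

```python
import importlib
import importlib.util


def load_optional(module_name):
    """Return the imported module, or None if it is not installed."""
    if importlib.util.find_spec(module_name) is None:
        return None
    return importlib.import_module(module_name)


def hand_off(cards, module_name="smart_grader"):
    """Calibrate cards if the grader add-on is present, otherwise do nothing.

    Returns (cards, calibrated). The one-argument call to
    precompute_calibration is an assumed signature for illustration.
    """
    grader = load_optional(module_name)
    if grader is None:
        return cards, False  # silent no-op: the deck is still usable
    calibrated = [
        {**card, "calibration": grader.precompute_calibration(card)}
        for card in cards
    ]
    return calibrated, True
```

Checking with find_spec before importing keeps a missing add-on from raising ImportError at load time, so book_to_cards can be installed and used on its own.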

Sample Output (Simulacra and Simulation)

Pages                           172
Chunks                          40
Topics after clustering         314 (36 multi-aspect, 278 singletons)
Cards generated                 346
Cards in deck after Anki dedup  320
Wall-clock                      ~16 min
OpenAI cost                     ~$0.50

Q: What does Baudrillard mean by ‘the desert of the real’?

A: Baudrillard suggests that it is the real, rather than the map or representation, that has vestiges remaining in contemporary society. The desert is characterized by the absence of a substantial reality — a shift from the Empire’s territory to our current state where the real is diminished and fragmented.

Source: p. 7 — “It is the real, and not the map, whose vestiges persist here and there in the deserts that are no longer those of the Empire, but ours.”

Q: According to Baudrillard, what role does Disneyland play in the context of reality and fiction?

A: Baudrillard describes Disneyland as a “deterrence machine” set up to rejuvenate the fiction of the real in the opposite camp — blurring the lines between reality and imagination.

Tech Stack

  • Python 3.9 (Anki’s bundled interpreter)
  • PyQt6 via aqt for dialogs and the QThread runner
  • pdfminer.six 20231228, vendored as pure-Python
  • OpenAI gpt-4o-mini for chat, text-embedding-3-small for embeddings
  • stdlib urllib + sqlite3 (no requests, no third-party HTTP)
  • pytest, 36 passing tests

github.com/Mkrolick/Anki-Plugin — both .ankiaddon files attached to the v0.1.0 release. Install via Tools → Add-ons → Install from file… and paste your OpenAI API key into each add-on’s Config.