Keeping Birr AI current: the nightly document ingestion pipeline

A regulatory assistant goes stale quickly if its document base is updated by hand. Birr AI needed to keep up with National Bank of Ethiopia documents without me babysitting uploads, re-running scripts, or fixing half-finished ingestion runs.

The production path is intentionally plain: a nightly GitHub Actions workflow runs one Django management command, sync_nbe_docs. That command harvests NBE seed pages, downloads and extracts pending PDFs one at a time, embeds extracted Markdown into pgvector, and summarizes embedded documents that do not have summaries yet.

There is no Celery in the MVP. No Redis. No separate ETL service. The pipeline is a synchronous command with durable handoff state stored in PostgreSQL.

That sounds less sophisticated than a worker architecture, but for this project it was the right kind of boring.

The state machine is the model

The key design is that ingestion state lives on the Document row:

extracted_markdown = NULL, embedded = false, summary = NULL
  -> discovered, not extracted

extracted_markdown = text, embedded = false, summary = NULL
  -> extracted, waiting for embedding

extracted_markdown = text, embedded = true, summary = NULL
  -> embedded, waiting for summary

extracted_markdown = text, embedded = true, summary = text
  -> fully processed
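
The mapping above can be written as a small function. This is only a sketch: DocumentState is a hypothetical stand-in for the three fields on the Document row, and the stage names are shorthand for the labels in the listing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DocumentState:
    """Hypothetical stand-in for the three state fields on a Document row."""
    extracted_markdown: Optional[str] = None
    embedded: bool = False
    summary: Optional[str] = None


def stage(doc: DocumentState) -> str:
    """Map a field combination to the stage it represents."""
    if doc.extracted_markdown is None:
        return "discovered"   # not extracted yet
    if not doc.embedded:
        return "extracted"    # waiting for embedding
    if doc.summary is None:
        return "embedded"     # waiting for summary
    return "processed"        # fully processed
```

Because the stage is derived from the data rather than stored in a separate status column, there is no second source of truth that can drift out of sync.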

Each stage selects only rows in its input state. Harvest skips documents that already have Markdown. Embedding selects documents with Markdown and embedded=False. Summarization selects embedded documents with no summary.

Django’s ORM becomes the work queue. source_url is unique, so rediscovering the same official file is safe. get_or_create registers documents idempotently. Targeted update() calls move a row forward without rewriting unrelated fields.
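
To make the queue idea concrete without needing a Django project, here is a sketch of the same pattern on plain sqlite3. The table layout and function names are assumptions for illustration. INSERT OR IGNORE on the unique column plays the role of get_or_create, and the single-column UPDATE statements play the role of targeted update() calls.

```python
import sqlite3

# In-memory database with a hypothetical, simplified Document table.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE document (
        id INTEGER PRIMARY KEY,
        source_url TEXT NOT NULL UNIQUE,
        extracted_markdown TEXT,
        embedded INTEGER NOT NULL DEFAULT 0,
        summary TEXT
    )
    """
)


def register(url: str) -> None:
    """Register a discovered file; a repeat call for the same URL is a no-op."""
    conn.execute("INSERT OR IGNORE INTO document (source_url) VALUES (?)", (url,))


def pending_embedding() -> list:
    """Input state of the embed stage: Markdown present, embedded still false."""
    rows = conn.execute(
        "SELECT id FROM document "
        "WHERE extracted_markdown IS NOT NULL AND embedded = 0 ORDER BY id"
    )
    return [row[0] for row in rows]


def mark_extracted(doc_id: int, markdown: str) -> None:
    """Targeted update: write only the column the extract stage owns."""
    conn.execute(
        "UPDATE document SET extracted_markdown = ? WHERE id = ?", (markdown, doc_id)
    )


def mark_embedded(doc_id: int) -> None:
    """Targeted update: flip only the embedded flag."""
    conn.execute("UPDATE document SET embedded = 1 WHERE id = ?", (doc_id,))
```

Calling register twice with the same URL leaves one row, and a document appears in pending_embedding() only between mark_extracted and mark_embedded. That is the whole queue.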

This taught me that a separate job system is not always the first step. If the work is sequential, scheduled, and resumable, a database-backed state machine can be enough. The important part is not whether you have workers. The important part is whether a failed run can restart without corrupting state.

Memory and quota shaped the design

The harvester walks eleven configured NBE seed pages, including WordPress Search & Filter pagination via ?sf_paged=N. That pagination detail mattered because the initial HTML only showed a small slice of some categories. Without walking result pages, the corpus looked complete while silently missing many directives.
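
A sketch of the pagination walk, under stated assumptions: fetch_links is a hypothetical callable that returns the PDF links found on one result page, and the stop rule (stop when a page adds nothing new, or at max_pages) is my illustration rather than the exact rule in the project.

```python
from typing import Callable, Iterable, List, Set
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def paged_url(seed_url: str, page: int) -> str:
    """Return seed_url with sf_paged set to `page`, keeping other query params."""
    parts = urlsplit(seed_url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "sf_paged"
    ]
    query.append(("sf_paged", str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


def walk_seed(
    seed_url: str,
    fetch_links: Callable[[str], Iterable[str]],
    max_pages: int = 50,
) -> List[str]:
    """Collect links page by page until a page contributes nothing new."""
    seen: Set[str] = set()
    ordered: List[str] = []
    for page in range(1, max_pages + 1):
        added = False
        for link in fetch_links(paged_url(seed_url, page)):
            if link not in seen:
                seen.add(link)
                ordered.append(link)
                added = True
        if not added:
            break  # empty or fully repeated page: treat as the end of results
    return ordered
```

Injecting fetch_links keeps the walk testable without network access, and the max_pages cap guards against a site that keeps serving the last page forever.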

PDF extraction is deliberately one-file-at-a-time. The code downloads to a fixed temp path, extracts Markdown with pymupdf4llm, deletes the temp file in a finally block, and then moves on. That design came from running on constrained free infrastructure. Holding many PDFs in memory or building a parallel downloader would have been technically fun and operationally worse.
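
The download, extract, delete cycle can be sketched as below. The temp file name and the download and to_markdown callables are hypothetical and injected so the sketch runs on its own. In the real pipeline the extractor would be pymupdf4llm, which I believe exposes to_markdown(path) for this purpose.

```python
import os
import tempfile
from typing import Callable

# Hypothetical fixed location; the pipeline reuses one temp path for every PDF.
TEMP_PDF = os.path.join(tempfile.gettempdir(), "birr_current.pdf")


def extract_one(
    url: str,
    download: Callable[[str, str], None],
    to_markdown: Callable[[str], str],
    temp_path: str = TEMP_PDF,
) -> str:
    """Download one PDF, extract Markdown, and always remove the temp file."""
    try:
        download(url, temp_path)       # writes the PDF bytes to temp_path
        return to_markdown(temp_path)  # e.g. pymupdf4llm.to_markdown in production
    finally:
        # Runs on success and on failure, so at most one PDF is ever on disk.
        if os.path.exists(temp_path):
            os.remove(temp_path)
```

The finally block is what keeps disk usage flat: even if the extractor raises on a malformed PDF, the next document starts from a clean temp path.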

Embedding has another constraint: provider quota. If the Gemini embedding provider signals daily quota exhaustion, the embed stage stops cleanly. Already extracted Markdown remains in the database, and the next run resumes from the embed stage instead of re-downloading and re-extracting the same PDFs.
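
A sketch of the clean stop. QuotaExhausted is a hypothetical exception standing in for whatever the provider client raises on daily quota exhaustion, and documents are plain dicts here instead of model instances.

```python
from typing import Callable, Dict, List, Sequence


class QuotaExhausted(Exception):
    """Hypothetical signal that the embedding provider's daily quota is used up."""


def embed_pending(
    docs: Sequence[Dict],
    embed: Callable[[str], List[float]],
) -> List[int]:
    """Embed documents in order; stop at the first quota error and keep state."""
    done: List[int] = []
    for doc in docs:
        try:
            vector = embed(doc["extracted_markdown"])
        except QuotaExhausted:
            break  # leave this and later docs untouched for the next run
        doc["embedding"] = vector
        doc["embedded"] = True  # only flipped after the vector is stored
        done.append(doc["id"])
    return done
```

Documents that were not reached keep embedded set to false with their Markdown intact, so the next run selects exactly those rows and nothing is downloaded twice.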

That is the tradeoff I like most in this pipeline. Persisting extracted Markdown costs database storage, but it buys resumability, easier debugging, and the ability to summarize or re-embed a document without downloading its PDF again. For this product, that was a good trade.

Why a management command

Putting the pipeline behind python manage.py sync_nbe_docs gave me one production entry point that also works locally. The same command can run in Docker, on a laptop, or in GitHub Actions. Flags like --skip-harvest and --skip-embed make partial runs possible without relying on operator-only HTTP endpoints as the main way to trigger work.
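
Django hands add_arguments an argparse-based parser, so the flag handling can be sketched with argparse alone. The stage names and the mapping from flags to stages below are assumptions for illustration; only the two flag names come from the actual command.

```python
import argparse
from typing import List, Optional, Sequence

# Hypothetical stage names; "harvest" is assumed to cover discovery and extraction.
STAGES = ("harvest", "embed", "summarize")


def build_parser() -> argparse.ArgumentParser:
    """Declare the two skip flags the way add_arguments would on a Django command."""
    parser = argparse.ArgumentParser(prog="sync_nbe_docs")
    parser.add_argument("--skip-harvest", action="store_true",
                        help="do not crawl seed pages or extract PDFs")
    parser.add_argument("--skip-embed", action="store_true",
                        help="do not call the embedding provider")
    return parser


def stages_to_run(argv: Optional[Sequence[str]] = None) -> List[str]:
    """Return the stages that remain after applying the skip flags, in order."""
    args = build_parser().parse_args(argv)
    skipped = set()
    if args.skip_harvest:
        skipped.add("harvest")
    if args.skip_embed:
        skipped.add("embed")
    return [name for name in STAGES if name not in skipped]
```

With this shape, stages_to_run(["--skip-harvest"]) leaves only the embed and summarize stages, which is the kind of partial run that is useful after a quota-limited night.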

Django management commands are underrated for this kind of product work. They sit inside the app, use the same settings, models, logging, and provider clients, and remain easy to test. They also make deployment simpler: GitHub Actions does not need to know the internals of the ingestion process. It just runs the command.

The general lesson from Birr AI’s ingestion pipeline is that reliability often comes from small constraints. One command. One temp PDF. One durable row state. One unique source URL. One stage at a time. Those constraints made the system easier to reason about than a more “scalable” architecture would have been at this stage.