EvalBoard: treating prompts like tests

EvalBoard came from a pattern I kept seeing in AI projects: a prompt works once, everyone gets excited, and then it fails on the next realistic input. I wanted a small tool that made prompts feel more like tests. Give it examples, run a model, score the outputs, and keep the results so changes can be compared.

The domain model is intentionally simple. A Dataset has many DatasetItem rows, each with an input and ideal_output. A Run points at a dataset and stores the prompt template, provider, model, temperature, status, aggregate score, failed item count, and latency. Each RunItemResult stores one model output, its score, and the latency for that row.

The important modeling decision is that Run.prompt_template is copied text, not just a foreign key to a saved prompt. That is reproducibility by snapshot. If I edit or delete a saved prompt later, the old run still records exactly what was tested.

That was one of the project lessons: evaluation records should be historical facts, not live references to mutable configuration. E-commerce orders snapshot prices for the same reason. EvalBoard snapshots prompts.
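Here is a minimal sketch of the snapshot idea, using plain dataclasses rather than the real Django models so it runs on its own. SavedPrompt, create_run, and the field defaults are illustrative names, not the project's actual code; the only thing taken from the app is that the run stores prompt_template as copied text.

```python
from dataclasses import dataclass


@dataclass
class SavedPrompt:
    """A mutable, user-editable prompt (hypothetical stand-in for a saved prompt row)."""
    name: str
    template: str


@dataclass(frozen=True)
class Run:
    """A run record that carries its own copy of the prompt text."""
    dataset_id: int
    prompt_template: str  # copied text, not a reference to SavedPrompt
    model: str
    temperature: float


def create_run(dataset_id: int, prompt: SavedPrompt, model: str,
               temperature: float = 0.0) -> Run:
    # Take the template string at creation time. Later edits to `prompt`
    # assign a new string to the saved prompt and cannot reach this run.
    return Run(dataset_id=dataset_id, prompt_template=prompt.template,
               model=model, temperature=temperature)


saved = SavedPrompt(name="capital", template="What is the capital of {{input}}?")
run = create_run(dataset_id=1, prompt=saved, model="demo-model")

# Editing the saved prompt afterwards leaves the historical run untouched.
saved.template = "Name the capital city of {{input}} in one word."
assert run.prompt_template == "What is the capital of {{input}}?"
```

With a foreign key instead of a copied string, the last assertion would fail, because the run would read whatever the saved prompt says today.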

The execution path

Creating a run is a POST to the Django API. The view validates the provider and dataset, creates a Run with pending status, starts execution in a background thread, and returns 202 Accepted. The frontend polls the run detail endpoint until the status becomes completed or failed.
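The accept-then-poll flow can be sketched without Django at all. In this framework-free version, RUNS is an in-memory stand-in for the Run table, and create_run, execute_run, poll_until_done, and KNOWN_PROVIDERS are hypothetical names; the real view and service will look different, but the shape is the same: validate, create a pending record, start a thread, answer 202.

```python
import threading
import time

RUNS: dict[int, dict] = {}          # in-memory stand-in for the Run table
KNOWN_PROVIDERS = {"demo"}          # assumed provider registry


def execute_run(run_id: int) -> None:
    """Background work: move the run from pending to a terminal status."""
    RUNS[run_id]["status"] = "running"
    try:
        time.sleep(0.05)            # placeholder for the model calls
        RUNS[run_id]["status"] = "completed"
    except Exception:
        RUNS[run_id]["status"] = "failed"


def create_run(payload: dict) -> tuple[int, dict]:
    """Validate, record a pending run, start the thread, and return at once."""
    if payload.get("provider") not in KNOWN_PROVIDERS:
        return 400, {"error": "unknown provider"}
    run_id = len(RUNS) + 1
    RUNS[run_id] = {"id": run_id, "status": "pending"}
    threading.Thread(target=execute_run, args=(run_id,), daemon=True).start()
    return 202, {"id": run_id, "status": "pending"}


def poll_until_done(run_id: int, interval: float = 0.01, timeout: float = 2.0) -> str:
    """What the frontend does: re-read the run until it reaches a terminal status."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = RUNS[run_id]["status"]
        if status in ("completed", "failed"):
            return status
        time.sleep(interval)
    raise TimeoutError(f"run {run_id} did not finish within {timeout}s")


status_code, body = create_run({"provider": "demo"})
final_status = poll_until_done(body["id"])
```

The response returns before any model call happens, which is the point of the 202: the client learns the run exists and where to look, not what the result is.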

The execution service is the heart of the app:

1. Mark the run as running.
2. Load the dataset items.
3. Render the prompt by replacing {{input}}.
4. Call the selected LLM provider.
5. Score the output against the ideal answer.
6. Persist a RunItemResult for each row.
7. Store aggregate score, failure count, total items, and latency on the Run.
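The steps above can be sketched as a single loop. This is not the project's service code: persistence is replaced by an in-memory list, call_model and score are injected callables, and two details are my assumptions for the sketch, namely that a provider error counts as a failed item with a score of 0 and that the aggregate score is the mean of the row scores.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class ItemResult:
    output: str
    score: float
    latency_ms: float
    error: str = ""


@dataclass
class RunSummary:
    status: str
    aggregate_score: float
    failed_items: int
    total_items: int
    latency_ms: float
    results: list


def execute(items: list, prompt_template: str,
            call_model: Callable[[str], str],
            score: Callable[[str, str], float]) -> RunSummary:
    started = time.perf_counter()
    results, failed = [], 0
    for item in items:
        # Render the prompt for this row.
        prompt = prompt_template.replace("{{input}}", item["input"])
        row_started = time.perf_counter()
        try:
            output = call_model(prompt)
            row_score, error = score(output, item["ideal_output"]), ""
        except Exception as exc:
            # Assumption: a provider error is recorded as a failed row, not a crash.
            output, row_score, error = "", 0.0, str(exc)
            failed += 1
        row_ms = (time.perf_counter() - row_started) * 1000
        results.append(ItemResult(output, row_score, row_ms, error))
    total = len(results)
    # Assumption: the aggregate is the mean of the row scores.
    aggregate = sum(r.score for r in results) / total if total else 0.0
    total_ms = (time.perf_counter() - started) * 1000
    return RunSummary("completed", aggregate, failed, total, total_ms, results)


def fake_model(prompt: str) -> str:
    """Deterministic stand-in for an LLM provider."""
    if "Atlantis" in prompt:
        raise RuntimeError("provider error")
    return "Paris" if "France" in prompt else "Kyoto"


items = [
    {"input": "France", "ideal_output": "Paris"},
    {"input": "Japan", "ideal_output": "Tokyo"},
    {"input": "Atlantis", "ideal_output": "n/a"},
]
summary = execute(items, "Capital of {{input}}?", fake_model,
                  lambda out, ideal: 1.0 if out == ideal else 0.0)
```

Passing the provider and scorer in as callables keeps the loop testable with a fake model, which is how the example above runs without any network access.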

The scoring function is deliberately modest. It normalizes punctuation, case, and whitespace, then returns a full score for an exact match or a partial score based on word overlap. That will not solve all evaluation problems, but it made the first version useful without hiding complexity behind a fake “AI judge” abstraction.
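A scorer along these lines fits in a few lines of standard-library Python. The normalization steps follow the description above; the partial-match formula is my assumption for the sketch (Jaccard overlap of the word sets), and the real function may weight words differently.

```python
import string

# Translation table that deletes ASCII punctuation. Non-ASCII punctuation
# such as curly quotes is left in place in this sketch.
_PUNCT = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse runs of whitespace."""
    return " ".join(text.lower().translate(_PUNCT).split())


def score(output: str, ideal: str) -> float:
    """Return 1.0 for an exact match after normalization, else word overlap in [0, 1)."""
    out, ref = normalize(output), normalize(ideal)
    if out == ref:
        return 1.0
    out_words, ref_words = set(out.split()), set(ref.split())
    union = out_words | ref_words
    if not union:
        return 0.0
    # Assumed partial formula: shared words divided by all distinct words.
    return len(out_words & ref_words) / len(union)


assert score("Paris.", "  paris ") == 1.0              # exact after normalization
assert score("Berlin", "Paris") == 0.0                 # no shared words
assert score("The capital is Paris", "Paris") == 0.25  # 1 shared word of 4 distinct
```

Every score can be explained by pointing at the words that matched, which is the property that makes this kind of scorer easy to trust on short answers.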

The tradeoff is clear: exact and partial matching are explainable but limited. They work for short factual outputs and structured answers. They are weaker for open-ended writing quality, multi-step reasoning, or safety evaluation. The next step would be adding scorer types, not replacing the simple scorer. A good eval tool should let the task choose the scorer.
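One way to add scorer types without touching the simple scorer is a small registry keyed by name, so each run or dataset can name the scorer it wants. Nothing like this exists in the current version; register, SCORERS, score_with, and the two example scorers are hypothetical.

```python
from typing import Callable, Dict

Scorer = Callable[[str, str], float]
SCORERS: Dict[str, Scorer] = {}


def register(name: str) -> Callable[[Scorer], Scorer]:
    """Decorator that adds a scorer function to the registry under `name`."""
    def decorator(fn: Scorer) -> Scorer:
        SCORERS[name] = fn
        return fn
    return decorator


@register("exact")
def exact(output: str, ideal: str) -> float:
    return 1.0 if output.strip().lower() == ideal.strip().lower() else 0.0


@register("contains")
def contains(output: str, ideal: str) -> float:
    # Useful when the model wraps a short answer in a full sentence.
    return 1.0 if ideal.strip().lower() in output.lower() else 0.0


def score_with(scorer_name: str, output: str, ideal: str) -> float:
    """Look up the scorer chosen for this task and apply it."""
    try:
        scorer = SCORERS[scorer_name]
    except KeyError:
        raise ValueError(f"unknown scorer: {scorer_name!r}") from None
    return scorer(output, ideal)


assert score_with("exact", "It is Paris.", "Paris") == 0.0
assert score_with("contains", "It is Paris.", "Paris") == 1.0
```

The same output gets two different scores depending on the scorer, which is the argument for letting the task make that choice explicitly and recording it on the run.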

How Django helped

DRF serializers made the dataset flow clean. DatasetSerializer accepts nested items and creates a dataset plus all rows in one request. On update, it replaces the row set. That maps well to the product: a dataset is edited as a complete test table, not as scattered row mutations.

The run API uses two serializers for the same model. The list serializer omits the nested results array so the run history stays light. The detail serializer includes every row result because the run detail page needs comparison data. That is a small DRF pattern, but it matters. APIs should shape payloads around screens and workflows, not expose the heaviest version of a model everywhere.
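The list/detail split is easy to show without DRF. The sketch below uses plain functions over a dict to stand in for the two serializers; LIST_FIELDS and the function names are illustrative, and in DRF the choice between the two classes is typically made by overriding get_serializer_class on the view.

```python
# Fields shown in the run history table (assumed subset of the Run fields).
LIST_FIELDS = ("id", "model", "status", "aggregate_score")


def run_list_payload(run: dict) -> dict:
    """Light shape for the history screen: summary fields only, no row results."""
    return {key: run[key] for key in LIST_FIELDS}


def run_detail_payload(run: dict) -> dict:
    """Full shape for the detail screen: summary fields plus every row result."""
    return {**run_list_payload(run), "results": list(run["results"])}


run = {
    "id": 7,
    "model": "demo-model",
    "status": "completed",
    "aggregate_score": 0.5,
    "results": [
        {"output": "Paris", "score": 1.0},
        {"output": "Kyoto", "score": 0.0},
    ],
}

list_view = run_list_payload(run)
detail_view = run_detail_payload(run)
assert "results" not in list_view
assert len(detail_view["results"]) == 2
```

A history page with a hundred runs then transfers a hundred small summaries, and the row-level data is only loaded for the one run someone opens.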

The project also taught me when not to overbuild. The original architecture docs imagined Celery, Redis, and PostgreSQL. The running version uses a background thread and SQLite for the portfolio/demo scope. For long jobs, retries, multi-user concurrency, or production reliability, Celery would be the right move. For a small single-process evaluation harness, the simpler choice kept the code understandable and cheap to run.

That is the general lesson I took from EvalBoard: architecture should match the pressure the product is actually under. The pressure here was not million-row throughput. It was making LLM behavior visible, repeatable, and comparable with a codebase small enough to understand end to end.